A Regex amuse-bouche

Before continuing with the Template series, I thought I’d talk briefly about an interesting (well, at least to me) solution to a little problem. System and user libraries (the kind that end in .so or .a, not Perl libraries) have a section at the top that maps a function name (‘load_user’ or whatever) to an offset into the library, say, 0x193a.

This arrangement worked fine for many years for C, Algol, FORTRAN and most other languages out there. But then along came languages that upset the apple cart, like C++ and Smalltalk, where a programmer could write two ‘load_user’ functions, call ‘load_user(1234)’ or ‘load_user(“Smith, John”)’ and expect the linker to load the right version of ‘load_user.’

The problem here is that the library, the linker and all of the other programs in the tool chain expect there to only be one function called ‘load_user’ in any given library.

Those of us who do Perl 5 and Perl 6 programming don’t have to worry about this, but if you ever want to link to a C++ library, you should probably know at least a bit about “name mangling.”

For a while, utilities like ‘CFront’ for the Macintosh (which the author actually filed bug reports on) were used to “rename” functions like ‘load_user(int)’ and ‘load_user(char*)’ to ‘i_load_user’ and ‘cs_load_user’ before they were added to the library, with other tools doing the reverse.

Has Your Mother Sold Her Mangle?

Eventually things settled down, and this process of changing names to fit into the library was “baked in” to the tool chains. Not consistently, of course, couldn’t have that. But conventions arose and even today Wikipedia lists at least 12 different ways to “mangle” ‘void h(void)’ into the existing library formats.

We’ll just look at the first one, ‘_Z1hv’. The ‘_Z’ can be safely ignored; its purpose is mainly to tell the linker something “special” is going on. ‘1h’ is the function name, and ‘v’ is its first (and only) parameter. Suppose, then, that you were tasked with writing a tool that undid this name mangling.

Your first cut at extracting something useful might look something like

'_Z9load_useri' ~~ m{ ^ '_Z' \d+ (\w+) (.) $ };

Given ‘_Z9load_useri’ (the mangled version of ‘void load_user(int)’), the regex engine goes through a bunch of simple steps.

  • Read and ignore ‘_Z’
  • Read and ignore ‘9’
  • Capture ‘load_user’ into $0
  • Capture ‘i’ into $1
  • There is no fifth thing.
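If you’d like to see those steps pay off, here’s a minimal sketch that runs the same pattern and prints the two captures. The variable name $m is just something I picked for the example.

```raku
# Run the first-cut pattern and keep the resulting Match object.
my $m = '_Z9load_useri' ~~ m{ ^ '_Z' \d+ (\w+) (.) $ };

say ~$m[0];   # load_user  (what $0 holds)
say ~$m[1];   # i          (what $1 holds)
```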

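But the person that wrote this library is playing silly buggers with someone (obviously us in this case) and there’s also a ‘_Z9load_userss’ which, for the purposes of this article, comes out of the other end of the mangle looking like ‘void load_user(char*, char*)’, loading a user with first and last names. (A real Itanium-style mangler spells ‘char*’ as ‘Pc’ and uses ‘s’ for ‘short’; one letter per parameter just keeps the example small.)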

Now we’re in a bit of a quandary. Run the same expression and see what happens:

'_Z9load_userss' ~~ m{ ^ '_Z' \d+ (\w+) (.) $ };

Sure enough, $1 is ‘s’, just as we wanted, but what about $0? It’s now ‘load_users’, which… y’know, looks too legit to quit. But we must. Do we add an optional capture for the first parameter, like ‘m{ … (.)? (.) $ }’?

No. The greedy ‘\w+’ would still swallow the first ‘s’, and if we made the name frugal instead, the optional capture would grab the ‘r’ of ‘load_user’ in ‘_Z9load_useri’. There must be something else in the name that we’re overlooking, some clue… Aha! ‘load_user’ has 9 characters, and look just before it, we’ve got the number 9! Surely that tells us the number of characters in the function name! (And thankfully it actually does.)
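Before trusting that hunch, it’s easy to check it with plain string operations. This is only a sanity-check sketch, not the regex solution; the variable names are mine, and it assumes the name really does start right after the digits.

```raku
my $mangled = '_Z9load_userss';

# Grab the digits after '_Z'; .to is the offset just past them.
my $m      = $mangled ~~ m{ ^ '_Z' (\d+) };
my $length = +$m[0];

# Take exactly $length characters as the name, the rest as parameters.
my $name   = $mangled.substr($m.to, $length);
my $params = $mangled.substr($m.to + $length);

say $name;    # load_user
say $params;  # ss
```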

Regexes 201

Now, how can we use this to our advantage? First things first, let’s get rid of some dead weight. We don’t care (for the moment) about parameters, so let’s just match the name and number of characters. And because we’re getting all serious up in here, let’s create a quick test.

use Test;
'_Z9load_user' ~~ m{ ^ '_Z' (\d+) (\w+) };
is $0, '9', 'length prefix';
is $1, 'load_user', 'function name';
done-testing;

Run the test script, see if it passes, I’m sure you know the drill. Go ahead and copy that, I’ll wait. Okay, the tests pass, so it’s time to play. I’m usually working in a library that’s in git, so by this point I’m on the “edit, run tests, git reset, edit…” treadmill.

So… How do we make use of this number? Well, let’s pull up the Regexes page over at docs.perl6.org and look around. Back in Perl 5 there used to be this feature ‘m{ a{5} }x’ that matched just 5 copies of whatever was in front of it. That might be a good place to start looking.

That’s now morphed into ‘m{ a ** 5 }’. Great, so let’s replace 5 with $0 and go for it.

'_Z9load_user' ~~ m{ ^ '_Z' (\d+) (\w ** $0) };

“Quantifier quantifies nothing…” That’s weird. $0 is right there, staring me in the face. Maybe I just got the syntax wrong somehow?

'_Z9load_user' ~~ m{ ^ '_Z' (\d+) (\w ** 9) };

Nope, that works. What’s going on here? $0 is defined… Wait, it’s a variable inside a regex, that used to require the ‘e’ modifier, didn’t it? Or something like that… <read the manpage, scratch head… nothing there> Hm. Are we at a dead end?

Kick it up a notch

No, we just need to remember about how string interpolation works. In Perl 6, “Hello, {$name}!” is a perfectly fine way to interpolate variables into your expression, and it works because no matter where it is, {} signals a code block. Let’s try that, surround $0 with braces.

'_Z9load_user' ~~ m{ ^ '_Z' (\d+) (\w ** {$0}) };

Weird. This time the test failed with ‘’ instead of ‘load_user’. Maybe $0 really isn’t defined? Now that it’s just regular Perl code, let’s check.

'_Z9load_user' ~~ m{ ^ '_Z' (\d+) (\w ** {warn "Got '$0'"; $0}) };

“Use of Nil in string context.” So it’s really empty. Now, we have to really do some reading. Looking at the section on general quantifiers says “only basic literal syntax for the right-hand side of the quantifier [what we want to play with] is supported,” so it looks like we’re at a dead end.
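For the record, the “basic literal syntax” that section means is a plain count or a range. A quick sketch of both (the strings are just made-up test data):

```raku
# Exact count: five copies of 'a', no more, no less.
say so 'aaaaa' ~~ / ^ a ** 5 $ /;      # True
say so 'aaaa'  ~~ / ^ a ** 5 $ /;      # False

# Range: anywhere from two to four copies.
say so 'aaa'   ~~ / ^ a ** 2..4 $ /;   # True
```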

But things like ‘{$0}’ do work, so we can use variables. That means that my problem isn’t that the variable is being ignored, it’s just not being populated when I need it. Let’s look at the section on Capture numbers to see when they get populated.

Aha, you need to “publish” the capture using ‘{}’ right after it. Let’s see if that works…

'_Z9load_user' ~~ m{ ^ '_Z' (\d+) {} (\w ** {warn "Got '$0'"; $0}) };

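Nope, something else is going on. As far as I can tell, inside the second pair of parentheses $/ is that group’s own (still empty) match, so $0 in there never refers to the outer ‘(\d+)’. And the next block down tells us the final solution: ‘:my’. This lets us create a variable inside the scope of the regular expression and use it as well, so let’s do just that.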

'_Z9load_user' ~~ m{ ^ '_Z'
                     :my $length;          # Put $length in the proper scope
                     (\d+) {$length = +$0} # Capture the length
                     (\w ** {$length})     # And extract that many chars.
                   };

I’ve also reformatted things just a wee bit so we’ve got some room to work with. Now the test actually passes, and the regex reads only as many characters of the function name as it needs to.
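To close the loop on the ‘load_users’ problem from earlier, here’s a sketch that wraps the same pattern in a small helper and tacks a capture for the parameter letters onto the end. The helper name split-mangled and the trailing ‘(\w*)’ are my own additions for illustration, and it only understands the simple one-letter-per-parameter names used in this article.

```raku
# Hypothetical helper: returns (name, parameter codes), or Nil on no match.
sub split-mangled(Str $mangled) {
    my $m = $mangled ~~ m{ ^ '_Z'
        :my $length;            # Lives in the regex's own scope
        (\d+) {$length = +$0}   # Capture the length
        (\w ** {$length})       # Exactly that many chars of name
        (\w*) $                 # Whatever is left: parameter codes
    };
    return Nil unless $m;
    return (~$m[1], ~$m[2]);
}

say split-mangled('_Z9load_useri');    # (load_user i)
say split-mangled('_Z9load_userss');   # (load_user ss)
```

This time the second ‘s’ stays with the parameters instead of sneaking into the function name.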

And just one more thing…

It’s not just function names that follow this pattern, it’s also namespaces, and any special types that the function might use as parameters, so let’s package this up into something more useful.

my regex pascalish-string {
  :my $length;
  (\d+) {$length = +$0}
  (\w ** {$length})
};
'_Z9load_user' ~~ m{ ^ '_Z' <pascalish-string> };
is $/<pascalish-string>[0], 9;
is $/<pascalish-string>[1], 'load_user';

Pascal implementations were done back when RAM was at more of a premium, and stored a string like ‘load_user’ as ‘\x{09}load_user’ so the compiler knew how many bytes were available immediately rather than having to guess. It was limiting, but this was on computers like the early Macs (we’re talking pre-OS X, for that matter pre-System 7, for those of you that remember that far back.)

So we can use this <pascalish-string> regular expression anywhere we want to match one of our counted terms. Because we’re using ‘my’ inside a regular expression nested inside another regular expression inside a burrito wrapped in an enigma, there are no scoping troubles.
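As a rough sketch of that reuse, here’s the same regex applied twice in a row to a namespaced name. In Itanium-style mangling a nested name is wrapped in ‘N’ … ‘E’, so ‘foo::bar()’ comes out as ‘_ZN3foo3barEv’; the foo and bar names and the @parts variable are made up for the example.

```raku
my regex pascalish-string {
    :my $length;
    (\d+) {$length = +$0}
    (\w ** {$length})
}

# One or more counted names between the 'N' and the closing 'E'.
my $m = '_ZN3foo3barEv' ~~ m{ ^ '_ZN' <pascalish-string>+ 'E' };

# Each call of the regex produced its own match; [1] is the name part.
my @parts = $m<pascalish-string>.map({ $_[1].Str });

say @parts.join('::');   # foo::bar
```

Each call gets its own $length, which is what keeps the ‘3’ in front of ‘bar’ from being eaten as part of ‘foo’.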

There are probably other ways of doing this, and I would love to see them. If you do come up with a better way to solve this, let me know in the comments and I’ll work your solution into an upcoming article.

As usual, gentle reader, thank you for your time and attention, and if you have any comments, questions, clarifications or criticisms (constructive, please) let me know.