Spacing Out

After getting some comments about the grammar approach I’ve been using, I’ve started to rethink things, and I think I’ve isolated at least one problem people have been running into. I’m working on a grammar for a language called ‘picat’. You can look up a quick explanation at picat.org.

It’s a constraint-based programming language that maps insanely well onto Perl 6. A fragment of the grammar I’m working on follows, done in a top-down fashion. The actual grammar rule <comment> isn’t the important thing, because this problem can occur with anything.

If you must know, it’s a C-style /* .. */ comment. Of course I ran the test to make sure this little block of code properly matched beforehand. This way I could go along making one small change at a time, simply because it’s fairly late at night and I’ve got a flight to catch tomorrow.

<comment>
<comment>
<comment>

'go' '=>'
   'doors(10).'

Breaking up is hard to do

The natural thing to do here is, of course, say to yourself “Hrm, I’ve got 3 <comment> comment blocks in a row. We all know there are only 3 important numbers in computer science, 0, 1, and Infinity. So 3 is wrong and should be replaced with <comment>+.”

<comment>+

'go' '=>'
   'doors(10).'

I then rerun the test, because I’m sticking to my nighttime rule of “one change, one retest”, and to my horror it breaks. I’ve only changed one thing, but … why is it breaking? Surely <A>+ should at least match <A> <A> <A> … that’s how equivalence works in finite automata.

That’s also one point where Perl 6 and traditional DFAs (Deterministic Finite Automata) part ways. After a few years of doing Perl 6 programming, I see Perl 6 as almost overly helpful. Tools like flex and bison made me think of grammars as something that belonged outside the language.

Where it all breaks down

Unfortunately modules like Grammar::Debugger, through no fault of their own, can’t quite help here. While it’s a great module to tell you what particular rule or token failed, the problem here is between the terms.

Inside a rule, <A> {whitespace-optional} <A> is subtly different from <A>+ because <A> <A> lets the parser read whitespace between the two terms; <A>+ assumes the terms come one after the other, whitespace be darned.
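Here is a minimal sketch of that difference, assuming the fragment lives in a rule, where whitespace in the pattern is significant. The grammar names Spaced and Plus, the simplified comment token and the sample input are made up for illustration and are not taken from the picat grammar:

```raku
# Hypothetical grammars; only the TOP rule differs.
grammar Spaced {
    token comment { '/*' .*? '*/' }
    # Each space after <comment> lets the rule skip whitespace there.
    rule  TOP     { <comment> <comment> <comment> 'go' }
}

grammar Plus {
    token comment { '/*' .*? '*/' }
    # No whitespace is allowed between repetitions of <comment>.
    rule  TOP     { <comment>+ 'go' }
}

my $input = "/* one */\n/* two */\n/* three */\ngo";

say so Spaced.parse($input);   # True
say so Plus.parse($input);     # False
```

The second grammar stops after the first comment because the newline that follows it is not the start of another comment.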

So, the simplest solution I have to offer is to let the comment eat the whitespace after it as well, so you can insert your <comment> token anywhere you like and it’ll still eat the whitespace no matter how you add it.

Another solution proposed on Reddit would be to use <A>˽+, with a space between the closing ‘>’ and the modifier. Said user went beyond the call of duty and composed a “Seven stages of whitespace” post to make the point.
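The Reddit suggestion can be sketched the same way. The grammar name SpacedPlus, the simplified comment token and the input are hypothetical; the only point is the space before the +, which inside a rule should allow whitespace between repetitions:

```raku
# Hypothetical grammar showing a space between the atom and its quantifier.
grammar SpacedPlus {
    token comment { '/*' .*? '*/' }
    # The space before + asks the rule to skip whitespace after each <comment>.
    rule  TOP     { <comment> + 'go' }
}

my $input = "/* one */\n/* two */\n/* three */\ngo";

say so SpacedPlus.parse($input);
```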

The <comment> token I have, like I said, is for C/C++ style comments. These don’t nest, so they aren’t really “balanced”. In /* This is a comment */ but this is not */ the comment ends at the first */, and in /* This is a comment /* so is this */ this looks like it should be but really isn’t. */ everything after the first */ is outside the comment.

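token comment
  {
  # .*? is frugal, so the body stops at the first '*/' (and /**/ matches too);
  # \s* then eats any whitespace that follows the comment.
  '/*' .*? '*/' \s*
  }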

And all is well with the grammar. You can put this rule anywhere you like and it’ll behave whether you write <comment> <comment> or <comment>+. This little article was inspired by a Twitter user who had read my first tutorial series. They got into the actual work of creating a grammar and problems started to happen.
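As a quick check of that claim, here is a sketch with the trailing \s* in the token. The grammar names and the input are invented for the example, and the token here uses .*? so that an empty /**/ would match too:

```raku
# Hypothetical grammars; the comment token now eats trailing whitespace itself.
grammar ThreeComments {
    token comment { '/*' .*? '*/' \s* }
    rule  TOP     { <comment> <comment> <comment> 'go' }
}

grammar ManyComments {
    token comment { '/*' .*? '*/' \s* }
    rule  TOP     { <comment>+ 'go' }
}

my $input = "/* one */\n/* two */\n/* three */\ngo";

say so ThreeComments.parse($input);              # True
say so ManyComments.parse($input);               # True
say ManyComments.parse($input)<comment>.elems;   # 3
```

Because each comment consumes the whitespace after it, it no longer matters whether the rule leaves room for whitespace between the terms.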

Wrapping up

My original tutorial series was just that, a tutorial. I felt that getting too deep into the process would interrupt the flow, so I didn’t talk about the work that went into it. Now that the series is pretty much done, I think it’ll be beneficial to talk about the actual problems of debugging one of these beasts.

And these things can most definitely be beasts. Using my ANTLR4 to Perl 6 converter you can generate some incredibly huge grammars. But just generating them doesn’t necessarily mean they’ll compile, although a few do right out of the box, which I’m genuinely amazed at.

The full test suite actually chooses a few grammars, converts them to Perl 6, compiles them and tests against sample input. I’m not sure how faithful they are to the real grammar, but they work.

Perl 6 does amazing things with precompiling and JITing. Grammars and regular expressions are among the hardest-working parts of Perl 6, so they get compiled down to functions. This means I can’t step into them even inside NQP, the dark side of Perl 6.

I’ve got ideas, so I’m going to keep working on grammar stuff. That means when I run into problems, well, it’s time to write another article. So look forward to a new series, likely with a prosaic name like “Perl 6 Grammars Debun^wDebugged” or something similar. Thank you again, dear reader. Comments, clarifications and questions are of course welcome.