Colin Charles Agenda

An introduction to ANTLR (sparse notes)

I attended Clinton Roy’s excellent session titled An introduction to ANTLR: A parser toolkit for problems large and small. Now that the slides and the video (1, 2) are online, I don’t know if my notes are of any use (I made them while the tutorial was in progress), but they’re sitting on my desktop and really should just get published. The files referenced below are available from the Antlr Tutorial Preparation wiki page.

Why use Antlr?
For parsing configuration files, syntax highlighting, Domain Specific Languages (DSLs), interpreters, and translation/transformation.

It generates easy-to-follow code and uses the LL(*) parsing algorithm. Bison is more powerful than Antlr. Antlr is a combined lexer and parser generator.

fun (int a, char b); <-- parsing this LL-style, you have no idea whether you’re dealing with a function definition or a declaration until you hit the “;”. Of course, there are look-ahead LL parsers too. But even an LL(3) parser, which can see 3 tokens ahead, can’t see far enough ahead to reach the “;”. This is why LL(*) exists: it picks the smallest look-ahead your grammar needs.
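To make the look-ahead point concrete, here is a minimal Python sketch. It is not Antlr’s algorithm, just an illustration I am assuming for the sake of the argument: `classify` is a hypothetical helper that scans a list of token strings until it finds the “;” or “{” that settles the question, and reports how many tokens it had to examine.

```python
def classify(tokens):
    """Return (kind, tokens_examined) for a function-like token list.

    Illustration only: a real LL(*) parser builds its look-ahead
    decision from the grammar; this helper hard-codes one decision.
    """
    depth = 0
    for i, tok in enumerate(tokens):
        if tok == "(":
            depth += 1
        elif tok == ")":
            depth -= 1
        elif depth == 0 and tok in (";", "{"):
            # Only now do we know what we were looking at.
            kind = "declaration" if tok == ";" else "definition"
            return kind, i + 1
    return "unknown", len(tokens)


short = "fun ( int a , char b ) ;".split()
longer = "fun ( int a , char b , long c ) ;".split()
body = "fun ( int a , char b ) { }".split()

print(classify(short))   # ('declaration', 9)
print(classify(longer))  # ('declaration', 12)
print(classify(body))    # ('definition', 9)
```

The number of tokens examined grows with the parameter list, so no fixed k (3 or otherwise) is enough for every input.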

Antlr will help you get rid of regular expressions.

Island grammars – one language, inside another (like HTML, inside PHP, or Doxygen inside C) are supported by Antlr.

The Antlr wiki is good, but it is hard to find things in it. The mailing list is great. The book by Terence Parr is good, but outdated, so go ahead and get the online PDF version. A new cookbook/recipe list is coming out soon.

Using AntlrWorks. java -jar antlrworks.jar

conffile.g parses a = 1, b = foo.

IDENT   :       ('_'|'a'..'z'|'A'..'Z')('_'|'a'..'z'|'A'..'Z'|'0'..'9')*;
NUMBER  :       '0'..'9'+;
WS      :       ('\r' | '\n' | ' ') {$channel=HIDDEN;};

The above are lexer rules. WS = whitespace. It reads from bottom up: whitespace, then a number (one or more digits, 0-9), then an identifier. IDENT will match foo, _foo, and foo1, but not 1foo (identifiers don’t start with numbers).

{$channel=HIDDEN;} <-- IDENTs and NUMBERs go through the default channel and reach the parser. Whitespace tokens go on the hidden channel: they are still in the token stream, but the parser ignores them (i.e. they are hidden).
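As a rough picture of what those three rules and the hidden channel do, here is a small hand-written Python sketch. It is not the code Antlr generates; the names `tokenize` and `parser_view` and the channel constants are my own assumptions for illustration.

```python
DEFAULT, HIDDEN = 0, 99  # channel ids chosen for this sketch only


def is_ident_start(c):
    # '_' | 'a'..'z' | 'A'..'Z'
    return c == "_" or "a" <= c <= "z" or "A" <= c <= "Z"


def is_digit(c):
    # '0'..'9'
    return "0" <= c <= "9"


def tokenize(text):
    """Split text into (type, text, channel) tuples, mirroring IDENT, NUMBER, WS."""
    tokens = []
    i = 0
    while i < len(text):
        start = i
        c = text[i]
        if is_ident_start(c):
            i += 1
            while i < len(text) and (is_ident_start(text[i]) or is_digit(text[i])):
                i += 1
            tokens.append(("IDENT", text[start:i], DEFAULT))
        elif is_digit(c):
            while i < len(text) and is_digit(text[i]):
                i += 1
            tokens.append(("NUMBER", text[start:i], DEFAULT))
        elif c in "\r\n ":
            i += 1
            tokens.append(("WS", c, HIDDEN))  # kept, but on the hidden channel
        else:
            raise ValueError("no lexer rule matches %r at offset %d" % (c, i))
    return tokens


def parser_view(tokens):
    """What the parser sees: only tokens on the default channel."""
    return [(kind, text) for kind, text, channel in tokens if channel == DEFAULT]


print(parser_view(tokenize("foo 42\nbar1")))
# [('IDENT', 'foo'), ('NUMBER', '42'), ('IDENT', 'bar1')]
print(parser_view(tokenize("1foo")))
# [('NUMBER', '1'), ('IDENT', 'foo')]
```

Note that 1foo is never a single IDENT: the sketch splits it into a NUMBER followed by an IDENT, which matches the “identifiers don’t start with numbers” remark above.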

[-(/tmp/antlrworks)> l
total 248
drwxrwxr-x  2 byte byte   4096 2008-01-31 11:20 ./
drwxrwxrwt 44 root root 192512 2008-01-31 11:20 ../
-rw-rw-r--  1 byte byte    340 2008-01-31 11:20 conffile__.g
-rw-rw-r--  1 byte byte   8492 2008-01-31 11:20
-rw-rw-r--  1 byte byte   4180 2008-01-31 11:20
-rw-rw-r--  1 byte byte     28 2008-01-31 11:20 conffile.tokens

conffile__.g – lexer file
conffile.tokens – tokens

CMinus.g takes input, which is a C program. Go to the interpreter, and you can then see the entire parse tree. Very impressive!
