Formal Languages and the Theory of Computation

Fall 2016

parsing from a CS perspective

(Most of this is from the textbook "Programming Language Pragmatics")

grammar 1

expr = id | number | - expr | ( expr ) | expr op expr
op = "+" | "-" | "*" | "/"
number = digit+
id = letter+
digit = "0" | "1" | ... | "8" | "9"
letter = "a" | "b" | ... | "y" | "z"

grammar 2

expr = term | expr addop term
term = factor | term mulop factor
factor = id | number | -factor | ( expr )
addop = "+" | "-"
mulop = "*" | "/"
number = digit+
id = letter+
digit = "0" | "1" | ... | "8" | "9"
letter = "a" | "b" | ... | "y" | "z"

exercise 1

Show that grammar 1 is ambiguous while grammar 2 is not by parsing
a * x + b
to find (i) two different parse trees for grammar 1, but (ii) only one parse tree for grammar 2.
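One way to check an answer to part (i) is to enumerate the trees by brute force. The sketch below is only an illustration, not anything from the textbook: it assumes the input has already been split into tokens, it uses a simplified version of grammar 1 (expr = operand | expr op expr, leaving out unary minus and parentheses), and the name parse_trees is made up for this example.

```python
from functools import lru_cache

OPS = {"+", "-", "*", "/"}

def parse_trees(tokens):
    """Return every parse tree for tokens under: expr = operand | expr op expr."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def trees(lo, hi):
        # All trees that derive tokens[lo:hi] from expr.
        if hi - lo == 1:
            tok = tokens[lo]
            return (tok,) if tok not in OPS else ()
        found = []
        # Try every operator position as the top-level "expr op expr" split.
        for mid in range(lo + 1, hi - 1):
            if tokens[mid] in OPS:
                for left in trees(lo, mid):
                    for right in trees(mid + 1, hi):
                        found.append((left, tokens[mid], right))
        return tuple(found)

    return list(trees(0, len(tokens)))

for tree in parse_trees(["a", "*", "x", "+", "b"]):
    print(tree)
```

Each tree is printed as a nested tuple (left, op, right). Seeing more than one tree for the same token list is exactly what "ambiguous" means for a grammar.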

compilers

Programming languages are almost always context free. In fact, often they belong to one of several more restrictive subsets of context free languages, to make their parsing faster.
Aside: a notable exception is perl, which is *not* context free. Here's an example of some perl code that shows this, taken from http://www.perlmonks.org/?node_id=663393 .
    whatever / 25 ; # / ; die "this dies!";
What you need to know is that perl function invocation can be done with parens, i.e. whatever(), but the parens aren't needed. How perl interprets what comes next depends on whether whatever was defined to take arguments, since it may or may not expect a "noun". Regular expressions in perl look like this : /regex_goes_here/ and a semi-colon is used between statements. So if whatever is a function that takes an argument, that code will be taken to be
    whatever(/25;#/); die("this dies!");
(Yes, die() is a built-in perl function, typically used to report an error.) However, if whatever does not take any arguments, then what comes next won't be taken to be a noun, and the next / will mean division. Then the # sign isn't inside a regex, and will mean a comment :
    (whatever / 25) ; # the rest is a comment ...
which is completely different.
Back to our story.
A compiler for a computer language is typically composed of two pieces:
* a "lexer" which recognizes "tokens" that match regular expressions (e.g. numbers, variables, language keywords, special symbols) ... in practice this is often done with a finite state machine, created automatically by computer software (e.g. flex)
* a "parser" which applies a grammar to build a parse tree by finding rules of the grammar which correspond to the code to parse ... in practice this is often done with software generated by other software (e.g. bison) called a "compiler compiler".
The parse tree is then used to execute the program and/or generate compiled code, often after some optimizing.
There are a number of algorithmic choices of how to search for the correspondence, depending on the direction in which the input is read (left-to-right or right-to-left) and which derivation the parser constructs (left-most or right-most). So a parser is a "language recognizer" where here "language" means "programming language".
Two of the classic parse-tree-generation algorithms are
* LL ("Left-to-right through input, Left-most grammar rule derivation"). This is a "top down" parse that generates the tree as in our math text, from the Start rule.
* LR ("Left-to-right through input, Right-most grammar rule derivation", built in reverse). This one on the other hand is a "bottom up" parse that matches the right hand side of rules, looking for terminal tokens to consume.
Not all grammars can be parsed by these techniques - each approach has some set of languages that it can handle. The details are tricky. It turns out that :
LR parsers can handle "deterministic context-free languages", that is, those where the push-down automaton is deterministic. The grammars that these can handle are popular for programming languages because they can be parsed in linear time, which means they are fast and practical.
The LR grammars come in different versions depending on how many tokens need to be used to decide the (right hand side) rule to apply, and so for example an LR(1) grammar would only need one look-ahead token to do the right thing without backtracking. See https://en.wikipedia.org/wiki/LR_parser .
LL parsers are classified by how many tokens they need to look ahead to work, so for example an LL(1) parser can do the right thing without backtracking with one "look ahead" token. See for example https://en.wikipedia.org/wiki/LL_parser . LL(1) grammars are also popular for practical coding languages. Each LL(k) is a different subset of the set of context free languages.
Recursive descent parsers, an intuitive parse-tree-generation technique, can handle LL(k) grammars without backtracking (a "predictive" parser), and some further grammars if backtracking is allowed. See for example https://en.wikipedia.org/wiki/Recursive_descent_parser
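To make the lexer/parser split concrete, here is a minimal Python sketch of a regex-based lexer feeding a hand-written recursive descent parser for grammar 2. It is an illustration under stated assumptions, not the textbook's code or the output of flex/bison: the names lex, Parser and parse are made up here, and the left-recursive rules (expr = expr addop term, term = term mulop factor) are rewritten as loops, because a top-down parser that called expr() first thing inside expr() would recurse forever.

```python
import re

# Token patterns mirror grammar 2's terminals: number = digit+, id = letter+.
TOKEN_RE = re.compile(r"\s*(?:(?P<number>[0-9]+)|(?P<id>[a-z]+)|(?P<op>[-+*/()]))")

def lex(text):
    """Split text into (kind, value) tokens, skipping whitespace."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character at position {pos}")
        tokens.append((m.lastgroup, m.group(m.lastgroup)))
        pos = m.end()
    return tokens

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        # One token of lookahead; None marks the end of the input.
        return self.tokens[self.pos][1] if self.pos < len(self.tokens) else None

    def advance(self):
        tok = self.tokens[self.pos]
        self.pos += 1
        return tok

    def expr(self):
        # expr = term | expr addop term, written as: term (addop term)*
        node = self.term()
        while self.peek() in ("+", "-"):
            op = self.advance()[1]
            node = (node, op, self.term())
        return node

    def term(self):
        # term = factor | term mulop factor, written as: factor (mulop factor)*
        node = self.factor()
        while self.peek() in ("*", "/"):
            op = self.advance()[1]
            node = (node, op, self.factor())
        return node

    def factor(self):
        # factor = id | number | -factor | ( expr )
        if self.peek() is None:
            raise SyntaxError("unexpected end of input")
        kind, value = self.advance()
        if kind in ("id", "number"):
            return value
        if value == "-":
            return ("-", self.factor())
        if value == "(":
            node = self.expr()
            if self.peek() != ")":
                raise SyntaxError("expected ')'")
            self.advance()
            return node
        raise SyntaxError(f"unexpected token {value!r}")

def parse(text):
    parser = Parser(lex(text))
    tree = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return tree

print(lex("a * x + b"))
print(parse("a * x + b"))
```

Each grammar rule becomes one method, and each method decides what to do by peeking at a single token, which is the LL(1) idea. Building the tree inside the loops (rather than by right recursion) keeps the operators left-associative, as they are in grammar 2.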

sources

Really understanding and playing around with this stuff is a whole 'nother CS course, typically called something like "compilers" or "language design". Logan's doing some of that this term; Dylan did some last term.
Here are a few places to read about it.
* http://stackoverflow.com/questions/5975741/what-is-the-difference-between-ll-and-lr-parsing
* http://web.stanford.edu/class/archive/cs/cs143/cs143.1128/ course notes from a Compilers course at Stanford
* http://matt.might.net/teaching/compilers/spring-2015/ Matt Might's course notes at U Utah (What Logan is looking at; builds a python parser in Racket)
* http://cpansearch.perl.org/src/JTBRAUN/Parse-RecDescent-1.967013/tutorial/tutorial.html "The man(1) of descent", by Damian Conway, "Being a scholarly Treatife on the Myfterious Origins and diverfe Ufes of that Module known as Parfe::RecDescent" (The title is a spoof on Darwin's "The Descent of Man", with man(1) referring to unix "man" pages, and Parse::RecDescent being Conway's perl library for recursive descent parsing. One of his examples is parsing Abbott vs Costello's versions of "who's on first")
* http://blog.reverberate.org/2013/07/ll-and-lr-parsing-demystified.html "LL and LR Parsing Demystified", another attempt at explaining the difference in tree traversal, using Polish notation (e.g. the reverse Polish notation of HP calculators) as an example

exercise 2 (*)

Try to parse the "a*x+b" example using a lexer along with LR, LL, and recursive descent approaches for the above grammars.
What would you have the lexer do?
If any of these approaches won't work, explain why not and suggest changes to the grammar (and possibly the language) that would make that algorithm a better fit for that grammar.

jims parsing exercise answers