parsing from a CS perspective
(Most of this is from the textbook "Programming Language Pragmatics")
grammar 1
expr = id | number | - expr | ( expr ) | expr op expr
op = "+" | "-" | "*" | "/"
number = digit+
id = letter+
digit = "0" | "1" | ... | "8" | "9"
letter = "a" | "b" | ... | "y" | "z"
grammar 2
expr = term | expr addop term
term = factor | term mulop factor
factor = id | number | -factor | ( expr )
addop = "+" | "-"
mulop = "*" | "/"
number = digit+
id = letter+
digit = "0" | "1" | ... | "8" | "9"
letter = "a" | "b" | ... | "y" | "z"
exercise 1
Show that grammar 1 is ambiguous while grammar 2 is not by parsing
a * x + b
to find (i) two different parse trees for grammar 1,
but (ii) only one parse tree for grammar 2.
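One way to check your answer for grammar 1: here is a small sketch in python that counts parse trees by trying every operator as the root of "expr op expr". It is only a rough check under some assumptions: the input is already a list of tokens, and unary minus and parentheses are left out. The name count_parse_trees is made up for this example.

```python
from functools import lru_cache

OPS = {"+", "-", "*", "/"}

def count_parse_trees(tokens):
    """Count grammar-1 parse trees for operands joined by binary operators.

    Simplification (an assumption of this sketch): unary minus and
    parentheses are not handled.
    """
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(lo, hi):
        # Number of ways tokens[lo:hi] can be derived from expr.
        if hi - lo == 1:
            # expr = id | number
            return 0 if tokens[lo] in OPS else 1
        total = 0
        for k in range(lo + 1, hi - 1):
            if tokens[k] in OPS:
                # expr = expr op expr, with the operator at position k as root
                total += count(lo, k) * count(k + 1, hi)
        return total

    return count(0, len(tokens))

print(count_parse_trees(["a", "*", "x", "+", "b"]))   # prints 2
```

The two trees correspond to (a * x) + b and a * (x + b). Grammar 2 rules out the second one because a term can never contain an addop outside of parentheses.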
compilers
The syntax of most programming languages is context free.
In fact, often they belong to one of several more restrictive
subsets of the context free languages, to make their parsing faster.
Aside: a notable exception is perl, which is *not* context free.
Here's an example of some perl code that shows this,
taken from http://www.perlmonks.org/?node_id=663393 .
whatever / 25 ; # / ; die "this dies!";
What you need to know is that a perl function can be
invoked with parens, i.e. whatever(), but
the parens aren't needed. How perl interprets
what comes next depends on whether
whatever was defined to take any arguments,
since it may or may not expect a "noun" to follow.
Regular expressions in perl look like this :
/regex_goes_here/
and a semi-colon is used between statements.
So if whatever is a function that takes
an argument, that code will be taken to be
whatever(/25;#/); die("this dies!");
(Yes, die() is a built-in perl function,
typically used to report an error.)
However, if whatever does not take any arguments,
then what comes next won't be taken to be a noun,
and the next / will mean division. Then the #
sign isn't inside a regex, and will mean a comment :
(whatever / 25) ; # the rest is a comment
... which is completely different.
Back to our story.
The front end of a compiler for a computer language is typically composed of two pieces:
* a "lexer" which recognizes "tokens" which match regular expressions
(e.g. numbers, variables, language keywords, special symbols)
... in practice this is often done with a finite state machine,
created automatically by computer software (e.g. flex)
* a "parser" which applies a grammar to build a parse tree
by finding rules of the grammar which correspond to
the code to parse ... in practice this is often done
with software generated by other software (e.g. bison),
which is called a "compiler compiler".
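As a rough illustration of the lexer piece, here is a minimal hand-written sketch in python for the tokens of grammar 2, using the standard re module rather than a generated state machine like flex would produce. The token names (number, id, addop, mulop, lparen, rparen) and the function name tokenize are choices made for this example, not anything standard.

```python
import re

# One named group per kind of token; whitespace is matched but thrown away.
TOKEN_RE = re.compile(r"""
    (?P<number>[0-9]+)
  | (?P<id>[a-z]+)
  | (?P<addop>[+-])
  | (?P<mulop>[*/])
  | (?P<lparen>\()
  | (?P<rparen>\))
  | (?P<space>\s+)
""", re.VERBOSE)

def tokenize(text):
    """Turn a string into a list of (kind, text) pairs."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at position {pos}")
        if match.lastgroup != "space":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens

print(tokenize("a * x + b"))
# [('id', 'a'), ('mulop', '*'), ('id', 'x'), ('addop', '+'), ('id', 'b')]
```

Notice that the rules for number, id, digit, and letter never reach the parser this way: the lexer handles them, and the parser only sees whole tokens.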
The parse tree is then used to execute the program
and/or generate compiled code, often after some optimizing.
There are a number of algorithmic choices of how to search
for the correspondence, depending on how the code is
parsed (left-to-right or right-to-left) and how the
grammar rules are applied (left-hand-side or right-hand-side).
So a parser is a "language recognizer" where here
"language" means "programming language".
Two of the classic parse-tree-generation algorithms are
LL ("Left-to-right through input, Left-most grammar rule derivation")
This is a "top down" parse that generates the tree as
in our math text, from the Start rule.
LR ("Left-to-right through input, Right-most grammar rule derivation")
This one on the other hand is a "bottom up" parse that matches
the right hand side of rules, looking for terminal tokens
to consume.
Not all grammars can be parsed by these techniques -
each approach has some set of languages that it can handle.
The details are tricky. It turns out that :
LR parsers can handle "deterministic context-free languages",
that is, those recognized by a deterministic pushdown automaton.
The grammars that these can handle are popular for programming
languages because they can be parsed in linear time,
which means they are fast and practical. The LR languages
come in different versions depending on how many tokens
need to be used to decide the (right hand side) rule to apply,
and so for example an LR(1) grammar would only need one
look-ahead token to do the right thing without backtracking.
See https://en.wikipedia.org/wiki/LR_parser .
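To give a feel for the "bottom up" idea, here is a hand-coded shift-reduce sketch in python for grammar 2. It is not a real table-driven LR parser of the sort bison generates: the shift and reduce decisions are written out by hand as an illustration, unary minus is left out, and the input is assumed to be a list of (kind, text) token pairs with made-up kind names (id, number, addop, mulop, lparen, rparen). It does show a stack, one token of look-ahead, and right hand sides being replaced by left hand sides.

```python
def shift_reduce(tokens):
    """Bottom-up parse of grammar 2 (without unary minus); returns a tree."""
    stack = []   # entries are (grammar symbol, subtree) pairs
    pos = 0

    def top(n):
        # Grammar symbols of the top n stack entries, or None if too few.
        return [sym for sym, _ in stack[-n:]] if len(stack) >= n else None

    def reduce(n, symbol, tree):
        # Replace the top n entries (a right hand side) by its left hand side.
        del stack[-n:]
        stack.append((symbol, tree))

    def binary():
        (_, left), (_, op), (_, right) = stack[-3:]
        return (op, left, right)

    while True:
        look = tokens[pos][0] if pos < len(tokens) else "end"
        if top(1) in (["id"], ["number"]):
            reduce(1, "factor", stack[-1][1])        # factor = id | number
        elif top(3) == ["lparen", "expr", "rparen"]:
            reduce(3, "factor", stack[-2][1])        # factor = ( expr )
        elif top(3) == ["term", "mulop", "factor"]:
            reduce(3, "term", binary())              # term = term mulop factor
        elif top(1) == ["factor"]:
            reduce(1, "term", stack[-1][1])          # term = factor
        elif top(1) == ["term"] and look != "mulop":
            # The look-ahead token decides: a mulop means keep building the term.
            if top(3) == ["expr", "addop", "term"]:
                reduce(3, "expr", binary())          # expr = expr addop term
            else:
                reduce(1, "expr", stack[-1][1])      # expr = term
        elif look == "end":
            break
        else:
            stack.append(tokens[pos])                # shift the next token
            pos += 1

    if len(stack) == 1 and stack[0][0] == "expr":
        return stack[0][1]
    raise SyntaxError("input is not an expr")

tokens = [("id", "a"), ("mulop", "*"), ("id", "x"), ("addop", "+"), ("id", "b")]
print(shift_reduce(tokens))   # ('+', ('*', 'a', 'x'), 'b')
```

The only place the look-ahead matters here is when a term is on top of the stack: if a mulop is coming, the parser shifts instead of reducing, which is how "*" ends up binding tighter than "+".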
LL parsers are classified by how many tokens they need to look
ahead to work, so for example an LL(1) parser can do the
right thing without backtracking with one "look ahead" token.
See for example https://en.wikipedia.org/wiki/LL_parser .
LL(1) grammars are also popular for practical coding languages.
Each LL(k) is a different subset of the set of context
free languages.
Recursive descent parsers, an intuitive parse-tree-generation
technique with one function per grammar rule, can handle LL(k)
grammars without backtracking, and some other grammars
if backtracking is allowed. See for example
https://en.wikipedia.org/wiki/Recursive_descent_parser
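Here is a minimal recursive descent sketch in python for grammar 2, with one function per rule. Two things are assumptions of this sketch rather than part of the grammar as written: the input is already a list of (kind, text) token pairs (kind names id, number, addop, mulop, lparen, rparen are made up here), and the left-recursive rules "expr = expr addop term" and "term = term mulop factor" are turned into loops, since a function that calls itself before consuming any token would never return.

```python
def parse_expr(tokens):
    """Top-down parse of grammar 2; returns a nested-tuple parse tree."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else ("end", "")

    def advance():
        nonlocal pos
        tok = peek()
        pos += 1
        return tok

    def expr():
        # expr = term | expr addop term, rewritten as: term (addop term)*
        tree = term()
        while peek()[0] == "addop":
            op = advance()[1]
            tree = (op, tree, term())
        return tree

    def term():
        # term = factor | term mulop factor, rewritten as: factor (mulop factor)*
        tree = factor()
        while peek()[0] == "mulop":
            op = advance()[1]
            tree = (op, tree, factor())
        return tree

    def factor():
        # factor = id | number | -factor | ( expr )
        kind, text = advance()
        if kind in ("id", "number"):
            return text
        if (kind, text) == ("addop", "-"):
            return ("neg", factor())
        if kind == "lparen":
            tree = expr()
            if advance()[0] != "rparen":
                raise SyntaxError("expected ')'")
            return tree
        raise SyntaxError(f"unexpected token {text!r}")

    tree = expr()
    if peek()[0] != "end":
        raise SyntaxError(f"unexpected token {peek()[1]!r}")
    return tree

tokens = [("id", "a"), ("mulop", "*"), ("id", "x"), ("addop", "+"), ("id", "b")]
print(parse_expr(tokens))   # ('+', ('*', 'a', 'x'), 'b')
```

Each function only ever looks at the single next token to decide what to do, so no backtracking is needed. The loops build the tree leaning to the left, so a - b - c still comes out as (a - b) - c, just as the left-recursive rules intend.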
sources
Really understanding and playing around with this stuff
is a whole 'nother CS course, typically called something like
"compilers" or "language design". Logan's doing some of that
this term; Dylan did some last term.
Here are a few places to read about it.
* http://stackoverflow.com/questions/5975741/what-is-the-difference-between-ll-and-lr-parsing
* http://web.stanford.edu/class/archive/cs/cs143/cs143.1128/
course notes from a Compilers course at Stanford
* http://matt.might.net/teaching/compilers/spring-2015/
Matt Might's course notes at U Utah
(What Logan is looking at; builds a python parser in Racket)
* http://cpansearch.perl.org/src/JTBRAUN/Parse-RecDescent-1.967013/tutorial/tutorial.html
The man(1) of descent, by Damian Conway
Being a scholarly Treatife on the Myfterious Origins
and diverfe Ufes of that Module known as Parfe::RecDescent
(The title is a spoof on Darwin's "The Descent of Man",
with man(1) referring to unix "man" pages, and
Parse::RecDescent being Conway's perl library
for recursive descent parsing. One of his examples
is parsing Abbott and Costello's
"who's on first")
* http://blog.reverberate.org/2013/07/ll-and-lr-parsing-demystified.html
LL and LR Parsing Demystified
Another attempt at explaining the difference in tree traversal,
using a Polish and reverse Polish notation (e.g. HP calculators) example
exercise 2 (*)
Try to parse the "a*x+b" example using a lexer
along with LR, LL, and recursive descent approaches
for the above grammars.
What would you have the lexer do?
If any of these approaches won't work, explain
why not and suggest changes to the grammar
(and possibly the language) that would make
that algorithm a better fit for that grammar.