parsing from a CS perspective

(Most of this is from the textbook "Programming Language Pragmatics".)

grammar 1

 expr = id | number | - expr | ( expr ) | expr op expr
 
 op = "+" | "-" | "*" | "/"
 
 number = digit+
 id = letter+
 digit = "0" | "1" | ... | "8" | "9"
 letter = "a" | "b" | ... | "y" | "z"

grammar 2

 expr = term | expr addop term
 term = factor | term mulop factor
 factor = id | number | -factor | ( expr )
 
 addop = "+" | "-"
 mulop = "*" | "/"
 
 number = digit+
 id = letter+
 digit = "0" | "1" | ... | "8" | "9"
 letter = "a" | "b" | ... | "y" | "z"

exercise 1

Show that grammar 1 is ambiguous while grammar 2 is not by parsing

  a * x + b

to find (i) two different parse trees for grammar 1, but (ii) only one parse tree for grammar 2.
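As a way to check this kind of claim by machine, here is a minimal sketch that lists every parse tree grammar 1 allows for a flat chain of operands and operators. The function name `parse_trees` and the nested-tuple tree format are made up for illustration, and only the rules `expr = id | number | expr op expr` are covered (no parentheses or unary minus). It uses a different expression from the exercise on purpose.

```python
OPS = {"+", "-", "*", "/"}

def parse_trees(tokens):
    """Return every grammar-1 parse tree for a flat token list.

    Trees are nested tuples (left, op, right); a leaf is the token itself.
    Anything that is not an operator is treated as an id or number.
    """
    if len(tokens) == 1:
        return [tokens[0]] if tokens[0] not in OPS else []
    trees = []
    # Try each operator position as the top-level "expr op expr" split.
    for i in range(1, len(tokens) - 1):
        if tokens[i] in OPS:
            for left in parse_trees(tokens[:i]):
                for right in parse_trees(tokens[i + 1:]):
                    trees.append((left, tokens[i], right))
    return trees

for tree in parse_trees(["a", "-", "b", "-", "c"]):
    print(tree)
```

This prints two trees, one grouping as a - (b - c) and one as (a - b) - c, which is exactly what "ambiguous" means: the same string has more than one parse tree.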

compilers

The syntax of a programming language is almost always context free. In fact, languages are often designed to fit one of several more restrictive subsets of the context free languages, which makes parsing them faster.

   Aside: a notable exception is perl, which is *not* context free.
 
   Here's an example of some perl code that shows this,
   taken from http://www.perlmonks.org/?node_id=663393 .
 
     whatever  / 25 ; # / ; die "this dies!";
 
   What you need to know is that a perl function can be
   invoked with parens, i.e. whatever(), but the parens
   aren't required. How perl interprets what comes next
   depends on whether whatever was declared to take
   any arguments, since it may or may not expect
   a "noun" (an argument) to follow.
  
   Regular expressions in perl look like this :
 
      /regex_goes_here/
 
   and a semi-colon is used between statements.
   So if whatever is a function that takes
   an argument, that code will be taken to be
 
      whatever(/25;#/); die("this dies!");
 
   (Yes, die() is a built-in perl function,
   typically used to report an error.)
 
   However, if whatever does not take any arguments,
   then what comes next won't be taken to be a noun,
   and the next / will mean division. Then the #
   sign isn't inside a regex, and will mean a comment :
 
      (whatever / 25) ;   # the rest is a comment
 
   ... which is completely different.

Back to our story.

The front end of a compiler for a computer language is typically composed of two pieces:

 
 * a "lexer" which recognizes "tokens" that match regular expressions
   (e.g. numbers, variables, language keywords, special symbols)
   ... in practice this is often done with a finite state machine,
   created automatically by computer software (e.g. flex)
 
 * a "parser" which applies a grammar to build a parse tree
   by finding rules of the grammar which correspond to
   the code to parse ... in practice this is often done
   with software generated by other software (e.g. bison),
   a so-called "compiler compiler".
 
   The parse tree is then used to execute the program
   and/or generate compiled code, often after some optimizing.
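To make the lexer step concrete, here is a minimal sketch using Python's `re` module for the tokens of the grammars above. The token names (`number`, `id`, `op`, `lparen`, `rparen`) and the function `tokenize` are assumptions chosen for illustration; a generated lexer such as one from flex would build a finite state machine instead of trying a combined regular expression at each position.

```python
import re

# One named group per token type; "skip" matches whitespace to discard.
TOKEN_SPEC = [
    ("number", r"[0-9]+"),
    ("id", r"[a-z]+"),
    ("op", r"[-+*/]"),
    ("lparen", r"\("),
    ("rparen", r"\)"),
    ("skip", r"\s+"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(text):
    """Turn a string into a list of (token type, matched text) pairs."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at position {pos}")
        if match.lastgroup != "skip":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens

print(tokenize("a * x + b"))
# prints [('id', 'a'), ('op', '*'), ('id', 'x'), ('op', '+'), ('id', 'b')]
```

The parser then works on this token list rather than on individual characters, so rules like `number = digit+` and `id = letter+` never need to appear in the parser itself.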
 
   There are a number of algorithmic choices for how to search
   for the correspondence, depending on how the code is
   read (left-to-right or right-to-left) and which derivation
   of the grammar rules is built (left-most or right-most).
 
   So a parser is a "language recognizer" where here
   "language" means "programming language".
 
   Two of the classic parse-tree-generation algorithms are
 
     LL ("Left-to-right through input, Left-most grammar rule derivation")
     This is a "top down" parse that generates the tree as
     in our math text, from the Start rule.
 
     LR ("Left-to-right through input, Right-most grammar rule derivation")
     This one on the other hand is a "bottom up" parse that matches
     the right hand side of rules against the tokens consumed
     so far, building the right-most derivation in reverse.
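To give a feel for the "bottom up" idea, here is a rough sketch of a shift-reduce loop for a cut-down grammar 2 (no unary minus). This is not a table-driven LR parser: the shift/reduce decisions are hand-coded `if` tests that imitate what a generated table would encode, using one look-ahead token. The names `classify` and `shift_reduce` and the tuple tree format are made up for illustration.

```python
ADDOPS, MULOPS = {"+", "-"}, {"*", "/"}

def classify(token):
    """Map a raw token to the grammar symbol it stands for."""
    if token in ADDOPS:
        return "addop"
    if token in MULOPS:
        return "mulop"
    if token in ("(", ")"):
        return token
    return "operand"  # id or number

def shift_reduce(tokens):
    stack = []          # entries are (grammar symbol, parse tree)
    rest = list(tokens)
    while True:
        syms = [sym for sym, _ in stack]
        look = classify(rest[0]) if rest else "end"
        if syms[-1:] == ["operand"]:                    # factor = id | number
            stack[-1] = ("factor", stack[-1][1])
        elif syms[-3:] == ["(", "expr", ")"]:           # factor = ( expr )
            stack[-3:] = [("factor", stack[-2][1])]
        elif syms[-3:] == ["term", "mulop", "factor"]:  # term = term mulop factor
            tree = (stack[-3][1], stack[-2][1], stack[-1][1])
            stack[-3:] = [("term", tree)]
        elif syms[-1:] == ["factor"]:                   # term = factor
            stack[-1] = ("term", stack[-1][1])
        elif syms[-1:] == ["term"] and look == "mulop": # look-ahead says: keep going
            stack.append((look, rest.pop(0)))
        elif syms[-3:] == ["expr", "addop", "term"]:    # expr = expr addop term
            tree = (stack[-3][1], stack[-2][1], stack[-1][1])
            stack[-3:] = [("expr", tree)]
        elif syms[-1:] == ["term"]:                     # expr = term
            stack[-1] = ("expr", stack[-1][1])
        elif syms == ["expr"] and look == "end":        # accept
            return stack[0][1]
        elif rest:                                      # shift the next token
            stack.append((look, rest.pop(0)))
        else:
            raise SyntaxError("cannot reduce to a single expr")

print(shift_reduce(["a", "*", "x", "+", "b"]))
# prints (('a', '*', 'x'), '+', 'b')
```

Notice that the tree is assembled from the leaves upward: tokens are pushed onto a stack, and whenever the top of the stack matches the right hand side of a rule it is replaced by the left hand side.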
 
   Not all grammars can be parsed by these techniques -
   each approach has some set of languages that it can handle.
   The details are tricky. It turns out that :
 
     LR parsers can handle the "deterministic context-free languages",
     that is, those recognized by a deterministic push-down automaton.
     The grammars that these can handle are popular for programming
     languages because they can be parsed in linear time,
     which means they are fast and practical. The LR grammars
     come in different versions depending on how many tokens
     need to be used to decide the (right hand side) rule to apply,
     and so for example an LR(1) grammar would only need one
     look-ahead token to do the right thing without backtracking.
     See https://en.wikipedia.org/wiki/LR_parser .
 
     LL parsers are classified by how many tokens they need to look
     ahead to work, so for example an LL(1) parser can do the
     right thing without backtracking with one "look ahead" token.
     See for example https://en.wikipedia.org/wiki/LL_parser .
     LL(1) grammars are also popular for practical coding languages.
     Each LL(k) is a different subset of the set of context
     free languages.
 
     Recursive descent parsers, an intuitive parse-tree-generation
     technique, can handle LL grammars with a specific form
     of backtracking. See for example
     https://en.wikipedia.org/wiki/Recursive_descent_parser
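As an illustration of recursive descent, here is a minimal sketch for grammar 2 with one method per grammar rule. One assumption to flag loudly: the rules `expr = expr addop term` and `term = term mulop factor` are left-recursive, and a naive recursive descent method for them would call itself forever, so this sketch rewrites them as loops, `expr = term (addop term)*` and `term = factor (mulop factor)*`. The class `Parser`, the helper names, and the tuple tree format are all made up for illustration.

```python
import re

def tokenize(text):
    # Hypothetical minimal lexer: numbers, lowercase ids, single-character symbols.
    # Note that re.findall silently skips characters that match nothing.
    return re.findall(r"[0-9]+|[a-z]+|[-+*/()]", text)

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def advance(self):
        token = self.peek()
        self.pos += 1
        return token

    def expr(self):
        # expr = term (addop term)*  ; the loop replaces the left recursion
        tree = self.term()
        while self.peek() in ("+", "-"):
            op = self.advance()
            tree = (tree, op, self.term())
        return tree

    def term(self):
        # term = factor (mulop factor)*
        tree = self.factor()
        while self.peek() in ("*", "/"):
            op = self.advance()
            tree = (tree, op, self.factor())
        return tree

    def factor(self):
        # factor = id | number | -factor | ( expr )
        token = self.advance()
        if token is None:
            raise SyntaxError("unexpected end of input")
        if token == "-":
            return ("neg", self.factor())
        if token == "(":
            tree = self.expr()
            if self.advance() != ")":
                raise SyntaxError("expected ')'")
            return tree
        if token.isalnum():
            return token
        raise SyntaxError(f"unexpected token {token!r}")

def parse(text):
    parser = Parser(tokenize(text))
    tree = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return tree

print(parse("a * x + b"))
# prints (('a', '*', 'x'), '+', 'b')
```

Each method decides what to do by peeking at a single token, so no backtracking is needed for this grammar. Building the tree inside the loops keeps the operators left-associative, which is what the original left-recursive rules specify.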

sources

Really understanding and playing around with this stuff is a whole 'nother CS course, typically called something like "compilers" or "language design". Logan's doing some of that this term; Dylan did some last term.

Here are a few places to read about it.

 * http://stackoverflow.com/questions/5975741/what-is-the-difference-between-ll-and-lr-parsing
 
 * http://web.stanford.edu/class/archive/cs/cs143/cs143.1128/
   course notes from a Compilers course at Stanford
 
 * http://matt.might.net/teaching/compilers/spring-2015/
   Matt Might's course notes at U Utah
   (What Logan is looking at; builds a Python parser in Racket)
 
 * http://cpansearch.perl.org/src/JTBRAUN/Parse-RecDescent-1.967013/tutorial/tutorial.html
   The man(1) of descent, by Damian Conway
   Being a scholarly Treatife on the Myfterious Origins
   and diverfe Ufes of that Module known as Parfe::RecDescent
   (The title is a spoof on Darwin's "The Descent of Man",
   with man(1) referring to unix "man" pages, and
   Parse::RecDescent being Conway's perl library
   for recursive descent parsing. One of his examples
   is parsing Abbott's vs Costello's versions of
   "Who's on First".)
 
 * http://blog.reverberate.org/2013/07/ll-and-lr-parsing-demystified.html
   LL and LR Parsing Demystified
   Another attempt at explaining the difference in tree traversal,
   using a Polish notation example (the reverse form is what
   HP calculators use)

exercise 2 (*)

Try to parse the "a*x+b" example using a lexer along with LR, LL, and recursive descent approaches for the above grammars.

What would you have the lexer do?

If any of these approaches won't work, explain why not and suggest changes to the grammar (and possibly the language) that would make that algorithm a better fit for that grammar.

Jim's parsing exercise answers

http://cs.marlboro.edu/courses/fall2016/formal_languages/notes/programming_language_parsers
last modified Thursday September 29 2016 1:14 am EDT

Formal Languages and the Theory of Computation
