parsing code

Check out our fall 2016 notes .

buzzwords (many of these have a wikipedia page)

BNF notation and its variations
parse tree
abstract syntax tree
recursive descent parsing
LL, LL(1), LL(k)
LR, SLR, LALR, and their (1) and (k) variations
deterministic context free language
lexer & parser generator (e.g. flex & bison)
top-down vs bottom-up parsing
backtracking
lookahead
shift & reduce

Here's the basic picture.

source ---> lexer -----> parser ----------> compiler --> executable
                  tokens        parse tree

This is meant to represent the transformation from some source code to a target executable (which itself is another language, perhaps assembler.)

The language grammar describes both the lexer and parser. In practice its convenient to pull out the lowest level stuff (i.e. turning 1.234e-10 into a float) into the lexer, which typically uses regular expressions, before doing the trickier stuff.

The actual compilation step is at its simplest a "template" operation, where the templates are the code structures in the target language. If for exampling you're turning "1+2" into "plus(1,2)", then the lexer gives you (INT, PLUS, INT), the parser gives you a tree, and the template is "plus({},{})" for that node of the tree, into which the compiler sticks values for the children.

TLDR

Parsing is the act of taking (a) some source code "3*1+1=2" and (b) a grammar syntax grammar and from those producing a parse tree.

LL (Left_to_right through source, Left_to_right through the grammar rules) top-down approach that (in principle) generates all legal strings in the language while looking for the the one of interest. In practice one looks ahead a bit in order to eliminate most of the search tree. This type of parser is the easiest to understand and write by hand, but in practice not the most useful for real compilers.

An LR parser (Left_to_right through input, Right_to_left match in the grammar) can without guessing or backtracking parse any deterministic context free language in linear time1. It does the job by building a pushdown automata, pushing input tokens onto a stack (the "shift" operation), and using states to find the correct grammar rule, which when run backwards simplifies (the "reduce" operation) the stack.

Nearly all programming languages are context free, and most real are parsed with some variation of LR.

LALR (lookahead LR) is a specific sort of a slightly simpler simpler approach which is typically powerful enough.

(The abbreviations are somewhat confusing - LR(1) means 1 lookahead symbols, so LR is itself lookahead ... LALR refers to a specific version in which the stack of symbols already seen but not yet understood is represented by a summary, to safe memory and use fewer states in the state machine.)

example

Here's a python recursive_descent implementation of a grammar_2_fall2018_notes to lisp (lex,parse,compiler) program!

resources

CS143 compilers at Stanford in 2012
- handouts, lecture notes, assignments
- for purposes, handouts 8 - 10 & lectures 3 - 6 or so
Matt Might's compiler course - students build a python3 compiler in racket.
The man(1) of descent
python 3.7 full grammar in BNF
C syntax in BNF

(Are we having fun yet?)

https://cs.marlboro.college /cours /fall2019 /formal_languages /notes /parsing
last modified Fri March 14 2025 1:19 am

Formal
Languages
and the
Theory of
Computation

course

site

parsing code

buzzwords (many of these have a wikipedia page)

TLDR

example

resources

FormalLanguagesand theTheory ofComputation

course

site

parsing code

buzzwords (many of these have a wikipedia page)

TLDR

example

resources

Formal
Languages
and the
Theory of
Computation