Top-Down Parsing

From Wiki**3

Compiladores
Introdução ao Desenvolvimento de Compiladores
Aspectos Teóricos de Análise Lexical
A Ferramenta Flex
Introdução à Sintaxe
Análise Sintáctica Descendente
Gramáticas Atributivas
A Ferramenta YACC
Análise Sintáctica Ascendente
Análise Semântica
Geração de Código
Tópicos de Optimização

LL(1) Parsers

LL(1) is a top-down, also called predictive, parsing algorithm that works by reading input from left to right (hence the fist L) and building the leftmost-derivations (the second L) using 1-symbol lookahead.

Parsing Algorithm

The parsing algorithm works as follows: at the beginning the stack contains the initial symbol and the input contains the full input string. Then, for each non-terminal symbol at the top of the stack, select the rule whose FIRST set contains a terminal symbol that matches the first input symbol. If one cannot be found, the parser rejects the input and processing stops. If a match can be found, however, then the non-terminal symbol at the top of the stack is replaced with the body of the rule selected in the previous step (the symbols in the rule's body are pushed from right to left, allowing the leftmost symbol to be processed first). If at any stage the top of the stack contains a terminal symbol, the input must match it (otherwise, input is rejected) and both are removed (from the stack and from the input). The parsing process continues until both the stack and the input are empty.

Problems

Since the parser relies on the FIRST set to decide which rule to select at each stage, the LOOKAHEAD sets of each production for a non-terminal must be disjoint (otherwise, there would be ambiguity). This type of ambiguity may arise either because various productions start with the same string or because of left corners (elements in their FIRST sets may coincide with elements in other productions).

The following rule shows a case of direct ambiguity in 'a'. Non-terminal C is a left corner of A and must also be dealt with:

A -> a M N d | a X Y z | C D e
C -> a | x M

Left-recursion (a special case in which the left corner is the head of the rule, even if indirectly) in rules is also a problem, since the parser replaces items on the stack with their bodies: if a rule's body starts with the symbol that has just left the stack, then the parser is in no better situation to parse the input (in fact, infinite recursion ensues). In addition to "plain" left-recursion, mutual left-recursion (in which two or more rules depend on each other via left corners) is also a problem.

The following rule shows direct left-recursion (in A) and indirect recursion (between the second production of A and the first production of B):

A -> a M N d | B X Y z | A D e
B -> A M d | i N A

Grammar Rewriting

Keep in mind at all times: never eliminate the rules corresponding to the initial symbol

Before building the parse table for an LL(1) parser, the grammar must be checked for each of the above problems. If any of them are present, then the grammar must be rewritten. Only after all the problems have been handled can the parse table be computed.

Computation of the parse table will be covered below, as the final step in preparing a parser for a given grammar.

Consider the following sample grammar, in which A is the initial symbol and { x, y, z } is the set of terminal symbols (ε is the empty string):

A -> A x | B x | B C
B -> A y | ε
C -> z A | z y

Inspecting the grammar, the following sets can be found:

LOOKAHEAD(A -> A x) = FIRST(A x) = { x, z }
LOOKAHEAD(A -> B x) = FIRST(B x) = { x, z }
LOOKAHEAD(A -> B C) = FIRST(B C) = { x, z }

Since, for the grammar to be LL(1) it cannot have any non-empty intersection between the LOOKAHEAD sets of each non-terminal, it can be concluded that the above grammar as it is cannot be processed by an LL(1) parser (although we just looked at A, something similar happens with C). In addition, A is left-recursive (first production) and mutually recursive with B (second and third productions of A and first of B).

Step 1: Eliminating Mutual Recursion

Mutual recursion may be eliminated by substituting one of the rules in the other: in this case, since A is the initial symbol (and cannot be removed from the grammar) and B is only used in A, it can be expanded in A and removed from the grammar (it becomes unreachable).

A -> A x | A y x | x | A y C | C
B -> A y | ε
C -> z A | z y

In this case, note also that C has become a left corner of A and must be handled.

Step 2: Eliminating the Left Corner

We eliminate the left corner by expanding it. In this case, after expansion, C becomes unreachable and can be eliminated:

A -> A x | A y x | x | A y z A | A y z y | z A | z y
C -> z A | z y

At this point, the grammar does not have any mutual recursion or non-terminal left corners (other than the start symbol). The next step is to eliminate left-recursion.

Step 3: Eliminating Left Recursion

The strategy for eliminating left-recursion is to transform a left-recursive grammar into a righ-recursive one.

The rewriting of the grammar from left- to right-recursive is based on the observation that a left-recursive rule may be written as a sequence of n recursive (Aα1..Aαn) and m non-recursive (β1..βm) productions:

A -> A α1 | ... | A αn | β1 | ... | βm

This grammar accepts any sequence started by a βi (1[math]\le[/math]i[math]\le[/math]m) followed by a (possibly empty) series of αj (1[math]\le[/math]j[math]\le[/math]n) and can be rewritten as follows:

A  -> β1 A' | ... | βm A'
A' -> α1 A' | ... | αn A' | ε

In the new grammar, the first set of rules (for non-terminal A) accepts any sequence started by a βi (1[math]\le[/math]i[math]\le[/math]m) and followed by an A'-build phrase, i.e., any sequence of αj (1[math]\le[/math]j[math]\le[/math]n) (ε accounts for the possibility of the sequence being empty).

By applying the rewriting rule to the grammar under study, we get the following strings:

α1 = x              β1 = x
α2 = y x            β2 = z A
α3 = y z A          β3 = z y
α4 = y z y

Using the rewriting rule, we get the following grammar:

A  -> x A' | z A A' | z y A'
A' -> x A' | y x A' | y z A A' | y z y A' | ε

Step 4: Factorization

At this point, there is no left recursion nor non-terminal left corners, but the grammar is still ambiguous (several productions have the same LOOKAHEAD sets:

A  -> x A' | z A A' | z y A'
A' -> x A' | y x A' | y z A A' | y z y A' | ε

Factoring the grammar, we get the following (instead of creating a new non-terminal for factoring y in A', we used the already existing definition of A):

A  -> x A' | z A"
A' -> x A' | y A | ε
A" -> A A' | y A'

There's still a left corner in A'', and its expansion is necessary, since there is no automatic simple way of guaranteeing that the LOOKAHEADS for A'' are disjoint:

A  -> x A' | z A"
A' -> x A' | y A | ε
A" -> x A' A' | y A' | z A" A'

Building the Parse Table

The parse table T[n,t] is computed as follows (all entries left empty correspond to syntax errors):

  1. For each production A -> α in the grammar, repeat 2 and 3:
  2. For each terminal a in FIRST(α) add A -> α to T[A,a]
  3. If ε ∈ FIRST(α) then add A -> α to T[A,b], ∀b ∈ FOLLOW(A). If ε ∈ FIRST(α) and $ ∈ FOLLOW(A), then add A -> α to T[A,$]

Step 5: Computing the FIRST and FOLLOW Sets

Now that the grammar can, in principle, be parsed by a LL(1) parser, the FIRST and FOLLOW sets must be computed for building the parse table (the FIRST sets for the non-terminal symbols are needed to compute the corresponding FOLLOWS but are not directly used in the parse table: it uses the FIRST sets of each production instead -- see below).

FIRST(A)  = FIRST(x A') ∪ FIRST(z A") = {x, z}
FIRST(A') = FIRST(x A') ∪ FIRST(y A) ∪ {ε} = {x, y, ε}
FRIST(A") = FIRST(x A' A') ∪ FIRST(y A') ∪ FIRST(z A" A') = {x, y, z}

The FOLLOW sets must obey the following constraints:

FOLLOW(A)  ⊇ {$}                      FOLLOW(A') ⊇ FOLLOW(A)             FOLLOW(A") ⊇ FOLLOW(A)
FOLLOW(A)  ⊇ FOLLOW(A')
FOLLOW(A') ⊇ FIRST(A') \ {ε}          FOLLOW(A') ⊇ FOLLOW(A")            FOLLOW(A") ⊇ FIRST(A') \ {ε}

Thus, the FOLLOW sets must all be identical.

FOLLOW(A) = FOLLOW(A') = FOLLOW(A") = {$} ∪ FIRST(A') \ {ε} = {x, y, $}

Step 6: The Parse Table

For building the table we will consider the FIRST and FOLLOW sets above. Notice that, despite the rewriting work done above, the grammar is still not LL(1) (two cells in the parse table encode ambiguous behaviour).

x y z $
A A -> x A' A -> z A"
A' A' -> x A'
A' -> ε
A' -> y A
A' -> ε
A' -> ε
A" A" -> x A' A' A" -> y A' A" -> z A" A'

Parsing the Input String

The parsing algorithm works as described above: at the beginning the stack contains the initial symbol and the input contains the full input string. Then, for each non-terminal symbol at the top of the stack, select the rule whose FIRST set contains a terminal symbol that matches the first input symbol. If one cannot be found, the parser rejects the input and processing stops. If a match can be found, however, then the non-terminal symbol at the top of the stack is replaced with the body of the rule selected in the previous step (the symbols in the rule's body are pushed from right to left, allowing the leftmost symbol to be processed first). If at any stage the top of the stack contains a terminal symbol, the input must match it (otherwise, input is rejected) and both are removed (from the stack and from the input). The parsing process continues until both the stack and the input are empty.

Step 7: Parsing

The following table parses the sequence x y z y x. Note that when the parse table cells contain more than one entry, a choice must be made. In the following example, we explore all possibilities, but (although there is no special basis for it) the choice with the longest body would possibly contribute to process more input.

STEP STACK INPUT ACTION
1 A $ x y z y x $ A -> x A'
2 x A' $ x y z y x $ ( x )
*3 A' $ y z y x $ A' -> ε
*4 $ y z y x $ REJECT
3 A' $ y z y x $ A' -> y A
4 y A $ y z y x $ ( y )
5 A $ z y x $ A -> z A"
6 z A" $ z y x $ ( z )
7 A" $ y x $ A" -> y A'
8 y A' $ y x $ ( y )
*9 A' $ x $ A' -> ε
*10 $ x $ REJECT
9 A' $ x $ A' -> x A'
10 x A' $ x $ ( x )
11 A' $ $ A' -> ε
12 $ $ ACCEPT

Examples

Exercises