Homework #3

CS 374 Compilers

Due: Tuesday 2/14 at the beginning of class

Test cases: 3 text files with valid MiniJava programs, and 5 text files with MiniJava programs with syntax errors. E-mail these text files as attachments to the class mailing list (GBC_CS_374_A_SPRING_2012@cnav.gettysburg.edu) by Friday 2/10.

Note: This work is to be done in assigned groups. Each group will submit one assignment. Although you may divide the work, both team members should be able to present/describe their partner's work upon request.

0. Preparation: There are no files to download for this exercise. You'll simply be extending your work from the previous exercise. Read Appendix Section A.2, the MiniJava grammar. You may find it useful to have the online JavaCC documentation available during development.

1. Eliminating Left Recursion: You'll note that the MiniJava grammar production for Exp is left-recursive. Rewrite Exp such that it is not left-recursive. This can be done in two ways.

General Process: This is described on pp. 51-52 of the text, and involves significant restructuring of the grammar and the introduction of a single new production (e.g. ExpPrime). The downside to this technique is that it does nothing to address issues of operator precedence and associativity. All proper structuring of the abstract syntax tree is deferred to the next stage. Although not required, you should probably consider the following option.
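Before moving on to the precedence-based option, the general process can be sketched for the simplest case, immediate left recursion in a single nonterminal: A -> A alpha | beta becomes A -> beta APrime and APrime -> alpha APrime | epsilon. The Java below is only an illustration under that assumption; the class name LeftRecursionSketch, the method eliminate, and the "Prime" naming are hypothetical, productions are represented as lists of symbol strings, and an empty list stands for epsilon.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class LeftRecursionSketch {
    /**
     * Removes immediate left recursion from one nonterminal.
     * Each alternative is a list of symbols; an empty list stands for epsilon.
     * Returns a map from nonterminal name to its rewritten alternatives.
     */
    static Map<String, List<List<String>>> eliminate(String nt, List<List<String>> alts) {
        String prime = nt + "Prime";                 // hypothetical name for the new nonterminal
        List<List<String>> base = new ArrayList<>(); // A      -> beta APrime
        List<List<String>> tail = new ArrayList<>(); // APrime -> alpha APrime | epsilon
        for (List<String> alt : alts) {
            if (!alt.isEmpty() && alt.get(0).equals(nt)) {
                // A -> A alpha   becomes   APrime -> alpha APrime
                List<String> alpha = new ArrayList<>(alt.subList(1, alt.size()));
                alpha.add(prime);
                tail.add(alpha);
            } else {
                // A -> beta      becomes   A -> beta APrime
                List<String> beta = new ArrayList<>(alt);
                beta.add(prime);
                base.add(beta);
            }
        }
        Map<String, List<List<String>>> result = new LinkedHashMap<>();
        if (tail.isEmpty()) {          // nothing left-recursive: keep the production as it was
            result.put(nt, alts);
            return result;
        }
        tail.add(new ArrayList<>());   // the epsilon alternative ends the repetition
        result.put(nt, base);
        result.put(prime, tail);
        return result;
    }

    public static void main(String[] args) {
        // E -> E + T | T
        List<List<String>> alts = Arrays.asList(
                Arrays.asList("E", "+", "T"),
                Arrays.asList("T"));
        // prints {E=[[T, EPrime]], EPrime=[[+, T, EPrime], []]}
        System.out.println(eliminate("E", alts));
    }
}
```

Note that the result has the same flat shape for every operator, which is why this technique by itself says nothing about precedence.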

Precedence Hierarchy Process: Since our left recursion results from our expressions, we can also eliminate left recursion by establishing a hierarchy of expressions according to operator precedence. An example of this can be seen in the transformation of text grammar 3.5 to one similar to text grammar 3.8. In grammar 3.5, the left recursion results from the binary operators. The operators "*" and "/" take precedence over (i.e. bind more tightly than) the operators "+" and "-".

We can thus transform the first grammar below into the second:

E -> id 
   | num 
   | E ( * | / | + | - ) E 
   | "(" E ")"

E -> T ( ( + | - ) T )*
T -> F ( ( * | / ) F )*
F -> id 
   | num 
   | "(" E ")"

Here, E stands for Expression, T stands for Term, and F stands for Factor. While this does not address the issue of operator associativity, it does address the issue of operator precedence and eliminates left recursion. For our MiniJava grammar, the original Exp production is given as:

Exp     -> Exp op Exp
        -> Exp "[" Exp "]"
        -> Exp . length
        -> Exp . id "(" ExpList ")"
        -> INTEGER_LITERAL
        -> true
        -> false
        -> id
        -> this
        -> new int "[" Exp "]"
        -> new id "(" ")"
        -> ! Exp
        -> "(" Exp ")"

The first four productions are responsible for the left-recursion. Operator precedence (high to low) according to Java's specification is:

Exp[Exp]        Exp.length      Exp.id(ExpList) 
!Exp            new id()        new int[Exp]
Exp * Exp
Exp + Exp       Exp - Exp
Exp < Exp
Exp && Exp

Operators on the same line have the same precedence. Following the previous example and the example JavaCC grammar for Java 1.2, we can thus rewrite our MiniJava grammar according to this more complex precedence hierarchy:

Exp             -> And
And             -> LessThan ( && LessThan )*
LessThan        -> AdditiveExp [ < AdditiveExp ]
AdditiveExp     -> Times ( ( + | - ) Times )*
Times           -> PrefixExp ( "*" PrefixExp )*
PrefixExp       -> Not | PostfixExp
Not             -> ( ! )+ PostfixExp
PostfixExp      -> PrimaryExp ( "[" Exp "]" 
                                | . id "(" ExpList ")" 
                                | . length )*
PrimaryExp      -> INTEGER_LITERAL
                -> true
                -> false
                -> id
                -> this
                -> "(" Exp ")"
                -> new int "[" Exp "]"
                -> new id "(" ")"

Again, while this does address operator precedence, this does not address the issue of operator associativity. This can be handled in the next phase when we construct the abstract syntax tree.
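To see how such a hierarchy encodes precedence, and how each ( ... )* loop can later be folded left to right to recover left associativity, here is a minimal hand-written recursive-descent sketch of the small E/T/F grammar in Java. It is an illustration under simplifying assumptions, not what JavaCC generates: it evaluates instead of building a tree, accepts only integer literals, parentheses, and single-character operators (no id), and the names ExprSketch and eval are made up for this example.

```java
final class ExprSketch {
    private final String src;
    private int pos = 0;

    private ExprSketch(String src) {
        this.src = src.replace(" ", ""); // ignore blanks so no separate lexer is needed
    }

    /** Parses and evaluates the whole input, rejecting trailing characters. */
    static int eval(String text) {
        ExprSketch p = new ExprSketch(text);
        int value = p.expr();
        if (p.pos != p.src.length()) {
            throw new IllegalArgumentException("unexpected input at " + p.pos);
        }
        return value;
    }

    private char peek() {
        return pos < src.length() ? src.charAt(pos) : '\0';
    }

    // E -> T ( ( + | - ) T )*
    private int expr() {
        int left = term();
        while (peek() == '+' || peek() == '-') {
            char op = src.charAt(pos++);
            int right = term();
            left = (op == '+') ? left + right : left - right; // fold left: (a - b) - c
        }
        return left;
    }

    // T -> F ( ( * | / ) F )*
    private int term() {
        int left = factor();
        while (peek() == '*' || peek() == '/') {
            char op = src.charAt(pos++);
            int right = factor();
            left = (op == '*') ? left * right : left / right; // fold left: (a / b) / c
        }
        return left;
    }

    // F -> num | "(" E ")"
    private int factor() {
        if (peek() == '(') {
            pos++;
            int value = expr();
            if (peek() != ')') {
                throw new IllegalArgumentException("expected ) at " + pos);
            }
            pos++;
            return value;
        }
        int start = pos;
        while (Character.isDigit(peek())) {
            pos++;
        }
        if (start == pos) {
            throw new IllegalArgumentException("expected number at " + pos);
        }
        return Integer.parseInt(src.substring(start, pos));
    }
}
```

For example, ExprSketch.eval("2 + 3 * 4") returns 14 because term() consumes 3 * 4 before expr() sees the +, and ExprSketch.eval("8 - 3 - 2") returns 3 because the loop accumulates into left. The same accumulate-into-left pattern applies when the loop body builds a tree node instead of computing a value.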

Whichever method you choose, be sure to document your grammar transformations well in your README file.

2. JavaCC Grammar: Translate your transformed grammar to JavaCC. At the end of your lexical analysis file, begin adding your productions as follows:

void Program() :
{}
{
        MainClass() ( ClassDecl() )*
}

void MainClass() :
{}
{
        <CLASS> <IDENTIFIER> <LBRACE> <PUBLIC> <STATIC> <VOID> <MAIN> <LPAREN> <STRING> <LSQPAREN> <RSQPAREN> <IDENTIFIER> <RPAREN> 
                <LBRACE> Statement() <RBRACE> <RBRACE>
}

...

Hint: Although there is no epsilon, you may make a part of a production optional using square brackets.

You may look to the example JavaCC Java grammars for guidance, but you may not use anything you don't understand. For example, the inclusion of additional Java structures beyond the MiniJava specification will be considered plagiarism and violation of the honor code. Again, do not use what you do not understand. It's also better that you do not copy and paste. The experience of constructing the grammar yourself is not to be underestimated. Use the example grammar only when you get stuck.

When you've completed this translation, you should be able to generate your parser with the command javacc, although you will probably have choice conflict warnings.

3. Resolving Choice Conflicts: Resolve each conflict sequentially, using the LOOKAHEAD facility of javacc. At this point, you should study the JavaCC LOOKAHEAD MiniTutorial in depth. For each choice conflict,

Understand the conflict. What is the ambiguity that makes the recursive-descent parser unable to figure out which choice to make at a choice point? To do this, you might imagine yourself as the parser, and come up with a concrete example of the ambiguity.
Determine what information is needed to resolve the conflict. How far ahead would you need to look in your example to see which way parsing should proceed? Is it a set number of tokens, or does it involve identifying grammar constructions? From the concrete example, generalize what information is needed looking ahead before the correct choice can be determined.
Use the appropriate LOOKAHEAD form to provide this information. For our purposes, do not worry about the computational expense of significant backtracking. It's more important for us to keep our grammar as close to the original as possible.
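As a toy illustration of the first two steps, written in plain Java rather than JavaCC: suppose a parser must choose between two alternatives that both begin with an identifier, an assignment (id = num ;) and a call (id ( ) ;). The first token cannot decide, but the second can. The token kinds, the class LookaheadSketch, and the methods peek and choose are all hypothetical and are not part of JavaCC or of the MiniJava grammar.

```java
import java.util.List;

final class LookaheadSketch {
    enum Kind { ID, ASSIGN, LPAREN, RPAREN, NUM, SEMI, EOF }

    /** Returns the kind of the k-th token ahead (k = 1 is the next token), or EOF past the end. */
    static Kind peek(List<Kind> tokens, int pos, int k) {
        int i = pos + k - 1;
        return i < tokens.size() ? tokens.get(i) : Kind.EOF;
    }

    /**
     * Chooses between two alternatives that both start with ID:
     *   Assignment -> ID ASSIGN NUM SEMI
     *   Call       -> ID LPAREN RPAREN SEMI
     * Only the choice is made here; nothing is consumed.
     */
    static String choose(List<Kind> tokens, int pos) {
        if (peek(tokens, pos, 1) != Kind.ID) {
            return "error";
        }
        switch (peek(tokens, pos, 2)) {   // the second token disambiguates
            case ASSIGN: return "Assignment";
            case LPAREN: return "Call";
            default:     return "error";
        }
    }
}
```

In JavaCC, the fixed-token form of this idea is LOOKAHEAD(2) placed at the choice point. When no fixed number of tokens is enough, the syntactic form, which names an expansion to try matching ahead, is the one to reach for.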

4. Testing: Download MiniJava example programs and test them to see that they successfully parse. Note: At this point, we're not actually building the abstract syntax tree. We're looking for a successful match to (parse with) our grammar. Create 3 additional correct MiniJava programs, and 5 additional incorrect MiniJava programs that should yield syntax errors. Verify that these are processed correctly and give appropriate output.

Supplemental Chapter Comments

(These comments are to supplement your reading. Question(s) asked are for you to think about on your own and need not be turned in with the homework.)

3.0: Give special attention to this preamble before section 3.1. It provides a good explanation of the limitations of the expressiveness of regular expressions. Understanding the difference between regular expressions and context-free grammars will give you a clearer sense of the division between lexical and syntactic analysis. You should carry around in your mind at least one simple context-free grammar that cannot be expressed using regular expressions. What is your favorite example?

3.1: As with regular expressions, context-free grammars are covered in CS 301, but are essential for you to understand for our purposes. Give special attention to this section, as you will be writing a context-free grammar to parse a subset of Java (MiniJava). Much of your project work will be based on traversals and transformation of tree data structures which correspond to parse trees.

3.2: This material parallels the finite automata material of the previous chapter, in that it gives a deeper understanding of the output of various parser generators. This is especially important when it comes time to disambiguate an ambiguous grammar. Often, parsing tools that encounter an ambiguity will report the ambiguity using the terminology of the underlying parsing algorithm. For example, a reported "shift/reduce conflict" from yacc would be utterly incomprehensible without understanding the underlying mechanism being created (see Section 3.3). Since JavaCC is an LL(k) parser generator, it is important to understand this section well and then skim the JavaCC LOOKAHEAD MiniTutorial. This tutorial will help you resolve one lookahead complication in the LL(k) parsing of MiniJava.

3.3: It is unclear whether or not we will have time to cover this section in class. If you are possibly interested in going to graduate school, I strongly recommend gaining a solid understanding of all of this material. Also, yacc and bison (the most common C parser generators) and SableCC for Java all generate the LALR(1) parsers described in this section. Thus, you should seek to understand shift-reduce and reduce-reduce conflicts if you wish to be able to resolve them in the future. It is worthwhile material.

3.4: Read about JavaCC and again look at JavaCC usage examples we have on our system (/usr/share/javacc/examples). Ignore the SableCC material.

3.5: This reading is optional, but interesting. Programmers often complain about error-reporting in compilers, but look at the problem from the compiler writer's perspective!