Homework #2

CS 374 Compilers

Due: Tuesday 2/7

Test cases: 1 text file with valid tokens / comments, and 5 text files with one invalid token / comment each
E-mail these text files as attachments to the class mailing list (GBC_CS_374_A_SPRING_2012@cnav.gettysburg.edu) by Friday 2/3.

Note: This work is to be done in assigned groups. Each group will submit one assignment. Although you may divide the work, both team members should be able to present/describe their partner's work upon request.

0. HW2 Preparation: Download the given starter code (Main.java and MiniJavaLexer.jj) and read Chapter 2 thoroughly. Browse the online JavaCC documentation to get a sense of what is available to assist you in this phase and the next.

1. Lexical specification: Compare MiniJavaLexer.jj to the sample JavaCC specification of Program 2.9. You'll need to define the TOKEN section of MiniJavaLexer.jj according to the lexical specification of Appendix A, with one exception: block comments may not be nested. Ensure the lexical analysis ignores whitespace (" ", "\t", "\r", "\n", "\f") by adding a section called SKIP, which has the same syntax as the TOKEN section. Similarly, define comments in a separate section called SPECIAL_TOKEN. Note that not all keywords, operators, etc. are specified in the TOKEN section of MiniJavaLexer.jj. If left as is, these will be classified (incorrectly) as IDENTIFIERs. (You can test your regular expressions at this stage.) To have such keywords, operators, etc. correctly classified as distinct tokens, define them one by one: replace the string of each (e.g. "class" in the MiniJavaToken production) with an appropriate token name (e.g. "<CLASS>"), and place a suitable token definition (e.g. < CLASS: "class" >) in the TOKEN section before the IDENTIFIER token. (The placement before IDENTIFIER is important: when two token definitions match the same longest string, JavaCC chooses the one that appears first in the file.)
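The effect of definition order can be seen in a small plain-Java sketch. This is not JavaCC and not part of the starter code: the class MatchOrderDemo, its methods, and the simplified identifier pattern are assumptions made only for illustration. It imitates the rule "longest match wins; on a tie, the earliest definition wins" using java.util.regex.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; it only imitates the matching rule, it is not JavaCC.
class MatchOrderDemo {
    // Returns the name of the rule that wins at the start of the input:
    // the longest match wins, and on a tie the rule listed first wins.
    static String firstToken(Map<String, String> rules, String input) {
        String bestName = null;
        int bestLength = 0;
        for (Map.Entry<String, String> rule : rules.entrySet()) {
            Matcher m = Pattern.compile(rule.getValue()).matcher(input);
            // lookingAt() anchors the match at the start of the input.
            // The strict ">" keeps the earlier rule when lengths are equal.
            if (m.lookingAt() && m.end() > bestLength) {
                bestName = rule.getKey();
                bestLength = m.end();
            }
        }
        return bestName;
    }

    // Builds the two-rule table with CLASS either before or after IDENTIFIER.
    // The identifier pattern is a simplified assumption, not Appendix A verbatim.
    static Map<String, String> rules(boolean keywordFirst) {
        Map<String, String> rules = new LinkedHashMap<>();
        if (keywordFirst) rules.put("CLASS", "class");
        rules.put("IDENTIFIER", "[A-Za-z][A-Za-z0-9_]*");
        if (!keywordFirst) rules.put("CLASS", "class");
        return rules;
    }

    public static void main(String[] args) {
        System.out.println(firstToken(rules(true), "class"));    // CLASS
        System.out.println(firstToken(rules(false), "class"));   // IDENTIFIER
        System.out.println(firstToken(rules(true), "classes"));  // IDENTIFIER
    }
}
```

The last call shows that ordering only breaks ties: "classes" is still an identifier because the longer match takes priority.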

To compile and run the code:

javacc MiniJavaLexer.jj
javac *.java
java Main

The resulting code should read a sequence of tokens from standard input without any message but "Lexical analysis successfull [sic]". (Feel free to correct this misspelling.) However, as soon as input is given that does not follow the lexical specification, you should observe the program terminate with an Exception and an error message concerning the illegal input.

To get more detailed output about your tokens, you may substitute

( MiniJavaToken() {
    System.out.println("Kind: " + token.kind
        + " line " + token.beginLine + ", column " + token.beginColumn
        + " - line " + token.endLine + ", column " + token.endColumn
        + " : " + token.image);
  } )*

for

( MiniJavaToken()  )*

in MiniJavaLexer.jj. Basically, this prints most of the fields of the Token class for each token read. This detailed information should be omitted from output in future homework exercises.

Supplemental Chapter Comments

(These comments are to supplement your reading. Question(s) asked are for you to think about on your own and need not be turned in with the homework.)

2.1: One handy beginner heuristic to determine what is/isn't a token is to ask oneself the question "Where could I add whitespace and not change the program's meaning?" For example, the Java code "while(i<max){" could be equivalently written "while ( i < max ) {". However, the Java code "System.out.println("squirrel");" is not equivalent to "System.out.println ( " squirrel " ) ;". (Why not?) By "heuristic", I mean that this is not an infallible guide, but it will generally lead you to the correct understanding of tokens. In many programming languages, whitespace can arbitrarily separate tokens. Read the Token class documentation to get a sense of how JavaCC represents tokens.
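The whitespace heuristic can be tried out with a rough plain-Java sketch. The class WhitespaceHeuristicDemo and its deliberately simplified token pattern are assumptions for illustration only; this is not the MiniJava lexical specification. It splits a line into token strings and compares the results for the two pairs of examples above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the token shapes are a rough simplification.
class WhitespaceHeuristicDemo {
    // A token is a double-quoted string, a run of letters/digits/underscores,
    // or any single non-whitespace character. Whitespace between tokens is skipped.
    private static final Pattern TOKEN =
        Pattern.compile("\"[^\"]*\"|[A-Za-z0-9_]+|\\S");

    static List<String> tokens(String source) {
        List<String> result = new ArrayList<>();
        Matcher m = TOKEN.matcher(source);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // Same token sequence with or without the extra spaces: prints true
        System.out.println(
            tokens("while(i<max){").equals(tokens("while ( i < max ) {")));
        // Different token sequences: prints false
        System.out.println(
            tokens("System.out.println(\"squirrel\");")
                .equals(tokens("System.out.println ( \" squirrel \" ) ;")));
    }
}
```

Printing the two token lists from the second comparison side by side is one way to check your answer to the question above.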

2.2: Regular expressions are covered in detail in CS 301, but you will need a good working knowledge of them here. After reading this section, read the JavaCC grammar for specifying regular expressions. The concatenation "." is omitted in JavaCC regular expression specifications. Kleene closure (i.e. "Kleene star", "( ... )*") and "Kleene plus" ("( ... )+"), meaning "one or more of ...", are not superscripted. There is no epsilon. See Program 2.9 for an example JavaCC regular expression specification. Feel free to peruse the example grammars on our system in /usr/share/javacc/examples to see the common use of JavaCC regular expression specifications. You may be surprised to find that there are parsers for the Java language at this location. Only browse these Java parsing resources if you get stuck. Do not copy and paste from these files; merely browse them to overcome conceptual hurdles. The goal here is to gain the experience of specifying regular expressions yourself so that you can apply these skills beyond this project and course.
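As a quick reminder of the difference between star and plus, here is a minimal sketch using java.util.regex. Note that this is Java's regular expression syntax, not JavaCC's (JavaCC quotes its string literals, as in ( "ab" )*), and the class ClosureDemo is a hypothetical name used only for this illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class contrasting "zero or more" with "one or more".
class ClosureDemo {
    // Kleene star: zero or more repetitions of "ab".
    // Concatenation of 'a' and 'b' is written by juxtaposition, with no operator.
    static boolean star(String s) {
        return Pattern.matches("(ab)*", s);
    }

    // Kleene plus: one or more repetitions of "ab".
    static boolean plus(String s) {
        return Pattern.matches("(ab)+", s);
    }

    public static void main(String[] args) {
        System.out.println(star(""));      // true: zero repetitions are allowed
        System.out.println(plus(""));      // false: at least one repetition is required
        System.out.println(star("abab"));  // true
        System.out.println(plus("abab"));  // true
        System.out.println(star("aba"));   // false: the whole string must match
    }
}
```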

2.3-2.4: The specifications of the previous section are converted into code which is based on finite automata. Understanding these sections will (1) help you understand what the output of a lexical analyzer generator does, and (2) reveal the equivalence between languages expressed by regular expressions and accepted by finite automata. If you think there is a remote chance that you'll be considering graduate school, you should know this material forwards and backwards before taking the GRE subject test. This material demonstrates a beautiful and practical application of theoretical computer science. Since this is a topic of CS 301, we will not be covering this material in class.
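To make the regular expression / finite automaton equivalence concrete, the sketch below hand-codes a three-state deterministic automaton and compares it with a regular expression for the same language. The class IdentifierDfaDemo and the simplified identifier pattern (a letter followed by letters, digits, or underscores) are assumptions for illustration; the code JavaCC generates is organized differently.

```java
import java.util.regex.Pattern;

// Hypothetical demo class: a hand-written DFA next to an equivalent regex.
class IdentifierDfaDemo {
    // States: 0 = start, 1 = accepting (inside an identifier), 2 = dead.
    static boolean acceptsByDfa(String s) {
        int state = 0;
        for (char c : s.toCharArray()) {
            boolean letter = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
            boolean digitOrUnderscore = (c >= '0' && c <= '9') || c == '_';
            switch (state) {
                case 0:  // the first character must be a letter
                    state = letter ? 1 : 2;
                    break;
                case 1:  // later characters may be letters, digits, or '_'
                    state = (letter || digitOrUnderscore) ? 1 : 2;
                    break;
                default: // the dead state has no way out
                    state = 2;
            }
        }
        return state == 1;
    }

    // The same language written as a regular expression.
    static boolean acceptsByRegex(String s) {
        return Pattern.matches("[A-Za-z][A-Za-z0-9_]*", s);
    }

    public static void main(String[] args) {
        String[] samples = {"x1", "1x", "", "a_b", "a-b"};
        for (String s : samples) {
            System.out.println("\"" + s + "\" dfa=" + acceptsByDfa(s)
                + " regex=" + acceptsByRegex(s));
        }
    }
}
```

For every sample the two methods should agree, which is the equivalence described in these sections seen on one small language.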

2.5: Skip the SableCC material. We'll focus on JavaCC. Browse the online JavaCC documentation and the JavaCC examples we have on our system (/Courses/cs374/bin/JavaCC/javacc2.0/examples/). The SimpleExamples directory is a good starting point.