Welcome to the Congo - CongoCC that is

CongoCC is an open-source resource site that provides Code-generation Components, currently:

  • Up-to-date Java parsers (Java 8 - 18 supported)
  • Python 3.10 parsers
  • C# parser

Plans for other code-generation tools include:

  • Updating FreeMarker, a leading templating engine

The Congo Rainforest is the World's Second Largest Rainforest

The common link between the Congo and the resources available from CongoCC is the presence of trees - real, physical trees in the rainforest and abstract syntax trees (ASTs) in CongoCC parsers. As this website develops and evolves, more blogs will be added that describe how the grammar files that you write are processed by the CongoCC jar file to generate compilable code that can be used to parse Java, Python, or C# text files.

To learn more about the history and evolution of the CongoCC project, click on the About menu option. The CongoCC repository can be cloned from https://github.com/congo-cc/

Individual Blogs Are Listed Below

Learning CongoCC by Examples - Part 1

28 April 2022 by Nelson Chamberlain

Dissecting the Arithmetic Grammars

As developers, we are trained from the beginning to create simple straightforward programs. Programs with clear logic and structure, with well-defined beginnings and exits. Simple good - convoluted bad.

JavaCC, like most other parser and compiler generators, emphasizes a different kind of logic because it needs to handle statements and expressions of arbitrary complexity. In most Java applications, we can manage and structure most situations using loops (for, while, etc.) and iterating through lists. We can arrange or manage the content being processed so that it fits looping and list constructs, even if the logic becomes extremely complicated at times.

The arithmetic grammars that come with JavaCC21 (examples/arithmetic/Arithmetic1.javacc and Arithmetic2.javacc) can generate a parser that processes arithmetic statements of arbitrary complexity, with nesting and grouping of items, while observing the generally accepted order-of-operations (precedence) rules. The Arithmetic2.javacc grammar includes the Arithmetic1 grammar and adds the ability to calculate the result of the arithmetic statement that you enter, so this commentary will use that one.

Not only do the Arithmetic grammars accomplish the complicated task of correctly processing arbitrary arithmetic statements in MDAS order, they do so in very few lines of grammar by fully exploiting the recursive capabilities of JavaCC. The heart of the arithmetic capability is shown below:

 1    AdditiveExpression :
 2       MultiplicativeExpression
 3       (
 4         (<PLUS>|<MINUS>)
 5         MultiplicativeExpression
 6       )*
 7    ;
 8
 9   MultiplicativeExpression :
10      (<NUMBER> | ParentheticalExpression)
11      (
12         (<TIMES>|<DIVIDE>)
13         (<NUMBER> | ParentheticalExpression)
14      )*
15   ;
16
17   ParentheticalExpression :
18      <OPEN_PAREN>
19      AdditiveExpression
20      <CLOSE_PAREN>
21   ;
22
23   Root : AdditiveExpression <EOF> ;

Now most of us have had to write a simple four-function calculator program at one time or another and it almost certainly looked nothing like the above productions. It probably had well-defined functions with meaningful names like addValue() and an accumulator of some sort and it all probably seemed pretty straightforward. So parsing a line of arithmetic doesn’t seem like it should be that much more difficult.

However, the reason for this programming simplicity is that the arbitrary complexity of mixed arithmetic operations was defined away before even starting by using the four-function calculator as the model. If one were given some random arithmetic statement, such as:

 3 + 8 * 5 - 7

it is our responsibility, as the human in the loop, to perform the operations on the calculator in the correct (or at least standard) sequence of Multiply - Divide - Add - Subtract (MDAS). Or in this case:

 8
 *
 5
 =    (result of 40 is displayed)
 +
 3
 =    (result of 43 is displayed)
 -
 7
 =     (result of 36 is displayed)

If one were to perform the calculations in simple left-to-right order, the result would be 48. If performed in right-to-left order, the result would be -13.

Parentheses can be used to clarify the proper order of operations, either standard MDAS sequence or otherwise. For example, to be extremely clear, the original arithmetic statement could be written:

3 + (8 * 5) - 7     (result is 36)

The parentheses can specify other sequences, such as:

(3 + 8) * 5 - 7     (result is 48)

or

(3 + 8) * (5 - 7)   (result is -22)

or even

(3 + (8 * (5 - 7))) (result is -13)

where the innermost pair of parentheses is solved first.

Imagine trying to get all those possible ways of grouping and ordering the operations in your code. The simple straightforward code suddenly doesn’t look so easy. And what if the arithmetic statement has nonsensical grouping, such as:

((((3 + 8) * 5 - 7)))

will your straightforward code handle that possibility?

The snippet of grammar above demonstrates the power of recursion to handle any combination of grouping with parentheses in a comparatively small number of lines. So let’s step through these grammar lines and see how they work.

If you haven’t done so already, from a command line/terminal window, change directory to the examples/arithmetic folder of the JavaCC21 repository that you downloaded/cloned and issue these commands:

java -jar <path to jarfile>/javacc-full.jar Arithmetic2.javacc
javac ex2/*.java
java ex2.Calc

The cursor will appear on a blank line below the last command, waiting for you to enter an arithmetic statement. Let’s begin by entering a simple addition statement:

3 + 5

which produces the following output when ex2.Calc is run:

 Dumping the AST...
 Root
   AdditiveExpression
     3
     +
     5
   EOF
 The result is: 8.0

The generated parser begins by calling the parser’s Root production (line 23) from the main() function. As we can see above, all Root does is trigger the AdditiveExpression production (line 1). AdditiveExpression in turn triggers the MultiplicativeExpression (line 9) which begins by trying to find a <NUMBER>.

In this example, it finds the number 3 (line 10) and puts it on the stack. Next it tries to find either the TIMES or DIVIDE token (line 12) and fails, so the MultiplicativeExpression production returns and AdditiveExpression resumes at line 4 and finds a PLUS token that goes on the stack.

Processing then returns to the MultiplicativeExpression (line 9) which again looks for either a NUMBER token or a ParentheticalExpression (line 10). Since it finds a NUMBER, a "5" that also gets put onto the stack, it then looks for either the TIMES or the DIVIDE token. Failing to find those, processing returns to the AdditiveExpression.

AdditiveExpression fails to find any additional input text to process so processing returns to Root production (line 23) where it finds the End of File token and then terminates. Not shown above is the main() function that evaluates the items in the stack, converting them into numerical values and then performing the assigned calculations, which produces the expected result of 8.0.

Now let’s do it again, but with parentheses. Our expression this time is:

(3 + 5)

which produces the following output:

Dumping the AST...
Root
  ParentheticalExpression
    (
    AdditiveExpression
      3
      +
      5
    )
  EOF
The result is: 8.0

Processing is similar to the last example, where processing goes from line 23 to line 2 to line 10 but this time continues onto line 18 where it does indeed find an OPEN_PAREN token. Processing then continues on line 19 which redirects processing to the AdditiveExpression production (line 1) and the whole process of looking for additive or multiplicative expressions starts all over again.

Note that in the output, the AdditiveExpression is indented so it is listed under ParentheticalExpression. The technical reason for this indenting is that the AdditiveExpression is in a new NodeScope belonging to the ParentheticalExpression. We’ll cover NodeScopes later but for now we’ll just say that each NodeScope is its own "context" with its own set of values which can contain other NodeScopes, etc. It is loosely similar to defining variables with the same name inside and outside different programming scopes (typically defined by curly braces { . . }).

Take a minute or two to experiment with parentheses and simple addition statements. What results do you get with (3) + (5) and (3 + (5))? They should all produce a result of 8.0.

What happens if you put parentheses around each of the items, such as (3) () (5)? Right, it terminates with a ParseException: the empty parentheses don't contain a NUMBER or a nested expression, and there is no operator joining the parenthesized groups.

Now let’s mix our arithmetic operations. Let’s try:

3 + 5 * 7

which should produce the following output:

Dumping the AST...
Root
  AdditiveExpression
    3
    +
    MultiplicativeExpression
      5
      *
      7
  EOF
The result is: 38.0

This result matches the expected result following the standard order of operations, where multiplication or division operations are performed before additive or subtractive operations. Let’s go through this sequence line by line.

As before, Root (line 23) calls AdditiveExpression (line 1), which in line 2 calls MultiplicativeExpression (line 9), which finds a NUMBER (line 10) but then returns to AdditiveExpression, where it finds a PLUS token (line 4). It then returns to MultiplicativeExpression (line 9), where it finds a NUMBER (line 10), a TIMES token (line 12), and another NUMBER token, successfully completing the MultiplicativeExpression production, which completes the AdditiveExpression production, which then terminates parsing with the EOF. The main function then dumps the AST and evaluates the numbers it found, performing the operations in the desired order, with the multiplicative expressions (the most-indented values) performed first.

If you explicitly specify the order by putting parentheses around the multiplication part (5 * 7) the same result is displayed. The only difference in the lines of output is that the MultiplicativeExpression is now indented under the ParentheticalExpression but the multiplication is still performed first.

Now let’s put the parentheses around the additive operation, as so:

(3 + 5) * 7

The output that is displayed looks like the following:

Dumping the AST...
Root
  MultiplicativeExpression
    ParentheticalExpression
      (
      AdditiveExpression
        3
        +
        5
      )
    *
    7
  EOF
The result is: 56.0

Let’s run through the sequence of processing line by line. As before, Root (line 23) calls AdditiveExpression (line 1) which in line 2 calls MultiplicativeExpression (line 9) which does NOT find a NUMBER (line 10) so it looks for a ParentheticalExpression (line 17). It does find an OPEN_PAREN token (line 18) followed by an AdditiveExpression (line 19) followed by a CLOSE_PAREN token, successfully completing that production.

Processing then continues in MultiplicativeExpression at line 12 where a TIMES token is found, followed by a NUMBER token (line 13) which then completes that production. The AST is then dumped and its contents evaluated and the result calculated as shown above.

Together, these relatively few lines of grammar code contain the almost magical capability to correctly parse arithmetic expressions of arbitrary complexity and nesting levels, and to reject incorrectly formed statements (mismatched parentheses, etc.). And the arithmetic operations are performed in the correct order, according to the standard MDAS rules.

This is possible because the AdditiveExpression and MultiplicativeExpression productions cover all four arithmetic operations between them. Based on the arithmetic expression you provide as input, these two productions, aided by the ParentheticalExpression production, call themselves recursively or call each other as required.
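To make the recursion concrete, here is a hand-written Java sketch (not CongoCC-generated code; all names are made up for illustration) in which each of the three productions becomes one method. The mutual recursion between the methods encodes the MDAS precedence rules exactly the way the grammar does:

```java
// A hand-written mirror of the grammar: each production becomes one method,
// and ParentheticalExpression recursing back into AdditiveExpression is what
// handles arbitrary nesting.
public class TinyCalc {
    private final String src;
    private int pos;

    TinyCalc(String src) { this.src = src.replaceAll("\\s+", ""); }

    // Root : AdditiveExpression <EOF> ;
    static double eval(String input) {
        TinyCalc p = new TinyCalc(input);
        double result = p.additive();
        if (p.pos != p.src.length())
            throw new IllegalStateException("trailing input at " + p.pos);
        return result;
    }

    // AdditiveExpression : MultiplicativeExpression ((<PLUS>|<MINUS>) MultiplicativeExpression)* ;
    private double additive() {
        double value = multiplicative();
        while (peek('+') || peek('-')) {
            char op = src.charAt(pos++);
            double rhs = multiplicative();
            value = (op == '+') ? value + rhs : value - rhs;
        }
        return value;
    }

    // MultiplicativeExpression : (<NUMBER>|Paren) ((<TIMES>|<DIVIDE>) (<NUMBER>|Paren))* ;
    private double multiplicative() {
        double value = primary();
        while (peek('*') || peek('/')) {
            char op = src.charAt(pos++);
            double rhs = primary();
            value = (op == '*') ? value * rhs : value / rhs;
        }
        return value;
    }

    // <NUMBER>, or ParentheticalExpression : <OPEN_PAREN> AdditiveExpression <CLOSE_PAREN> ;
    private double primary() {
        if (peek('(')) {
            pos++;                     // consume OPEN_PAREN
            double value = additive(); // recurse: the whole grammar restarts inside the parens
            if (!peek(')')) throw new IllegalStateException("expected ')' at " + pos);
            pos++;                     // consume CLOSE_PAREN
            return value;
        }
        int start = pos;
        while (pos < src.length() && Character.isDigit(src.charAt(pos))) pos++;
        if (start == pos) throw new IllegalStateException("expected a number at " + pos);
        return Double.parseDouble(src.substring(start, pos));
    }

    private boolean peek(char c) { return pos < src.length() && src.charAt(pos) == c; }

    public static void main(String[] args) {
        System.out.println(eval("3 + 8 * 5 - 7"));   // 36.0
        System.out.println(eval("(3 + 8) * 5 - 7")); // 48.0
    }
}
```

Running it on the earlier examples gives 36.0 for 3 + 8 * 5 - 7 and 48.0 for (3 + 8) * 5 - 7, matching the hand-calculated results above.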

My suspicion is that the simplicity of this short grammar disguises how difficult it was to write originally. I suspect this because I see this arithmetic example in numerous places, expressed pretty much identically. It is a marvelous example of sophisticated simplicity, but a pattern that isn’t used very often because its sophistication requires a lot of thought and planning and an in-depth understanding of how all the different components function. And let’s admit it, most of us just need to get the job done and don’t really have the time (or willingness) to invest in really elegant solutions.

Next, we’ll look at NodeScopes and State Diagrams.

Learning CongoCC by Examples - Part 2

13 March 2022 by Nelson Chamberlain

Dissecting the NodeScope Class Inside the Parser

When you run CongoCC, it generates a parser that is based on the tokens and production rules that you defined in your grammar. At the bottom of the parser’s java file is the nested class NodeScope which is used throughout the parser. In this commentary, we’re going to explore how this class is used and why it is used.

Nested classes such as NodeScope are typically used for reasons that include:

  • It logically groups classes that are only used in one place, such as helper classes

  • It increases encapsulation by hiding implementation details of the nested class and preventing other classes from depending on nested class details

  • Nesting small classes within top-level classes can place the code closer to where it is used.

Nested classes can be put into two categories:

  • Static nested classes which cannot directly access instance variables of the enclosing class

  • Inner classes which can directly access the enclosing class’s methods and fields. There are two special kinds of inner classes:

    • Local classes are defined in a block, frequently within the body of a method

    • Anonymous classes allow you to declare and instantiate a class at the same time, many times being inserted as a parameter in the constructor of a different object.
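
As a quick illustration of these categories, here is a small self-contained sketch (all names are hypothetical and unrelated to CongoCC's generated code):

```java
// Illustrates the four kinds of nested classes described above.
public class Outer {
    private int field = 42;

    // Static nested class: cannot touch Outer's instance state directly.
    static class StaticNested {
        int read(Outer o) { return o.field; } // needs an explicit Outer reference
    }

    // Inner class: reads the enclosing instance's fields directly.
    class Inner {
        int read() { return field; } // implicitly Outer.this.field
    }

    int demo() {
        // Local class: declared inside a method body.
        class Local { int twice() { return field * 2; } }
        // Anonymous class: declared and instantiated in one expression.
        Runnable r = new Runnable() { public void run() { field++; } };
        r.run();
        return new Local().twice();
    }

    public static void main(String[] args) {
        Outer o = new Outer();
        System.out.println(o.demo());                // 86: field became 43, then doubled
        System.out.println(new StaticNested().read(o));
        System.out.println(o.new Inner().read());
    }
}
```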

Class NodeScope begins as follows:

723     @SuppressWarnings("serial")
724     class NodeScope extends ArrayList<Node>  {
725         NodeScope parentScope;
726         NodeScope() {
727             this.parentScope= Calc.this.currentNodeScope;
728             Calc.this.currentNodeScope= this;
729         }

Line 723 begins with @SuppressWarnings("serial") because the official Java documentation on inner classes "strongly discourages" serialization of inner classes. This annotation simply prevents an unnecessary warning from being displayed.

Line 724 declares the NodeScope class as extending java.util.ArrayList and that it will hold only items of type Node. Node is an interface type generated by CongoCC. We’ll look more closely at Node at some future time, but for now let’s just say that it is a Swiss Army Knife kind of object, which includes:

  • implements the java.lang.Comparable interface allowing ordering of objects

  • provides methods to get Node and Token location information (getBeginLine(), getEndLine(), getBeginColumn(), etc)

  • provides default methods for managing and manipulating child and descendant nodes in document/tree-like structures

  • and lots, lots more

Line 725 declares a NodeScope instance named parentScope. When the NodeScope constructor is called (lines 726 - 729), the parentScope is set to what had been the currentNodeScope and the newly constructed NodeScope object is assigned to be the currentNodeScope. In other words, when we construct a new NodeScope, we’re chaining them together and the previously existing NodeScope becomes the "parent" of the newly constructed one.

And what happens when there was no previously existing NodeScope? The parentScope is a null value, which is exactly what happens the first time a NodeScope is constructed. This characteristic turns out to be very useful when looking for the root node, or the very first NodeScope in the chain, as lines 735 - 742 show below:

735         Node rootNode() {
736             NodeScope ns= this;
737             while (ns.parentScope!=null) {
738                 ns= ns.parentScope;
739             }
740             return ns.isEmpty()?null:
741             ns.get(0);
742         }

This method begins with the current NodeScope and backs up through the chain of NodeScopes until it finds one whose parentScope is null. If that NodeScope is empty, then null is returned; otherwise the zeroth Node in the NodeScope’s ArrayList is returned.
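The chaining behavior is easier to see in isolation. Here is a stripped-down, standalone sketch of the same idea (the Scope and currentScope names are made up; the real NodeScope also cooperates with the parser's token handling, which is omitted here):

```java
import java.util.ArrayList;

// Standalone sketch of NodeScope chaining: each new scope remembers the
// previously current scope as its parent, and rootNode() walks the chain
// back to the scope whose parent is null.
public class ScopeDemo {
    static Scope currentScope; // plays the role of Calc.this.currentNodeScope

    static class Scope extends ArrayList<String> {
        final Scope parentScope;
        Scope() {
            this.parentScope = currentScope; // chain to whatever was current
            currentScope = this;             // the new scope becomes current
        }
        String rootNode() {
            Scope s = this;
            while (s.parentScope != null) s = s.parentScope; // back up the chain
            return s.isEmpty() ? null : s.get(0);
        }
    }

    public static void main(String[] args) {
        Scope root = new Scope();   // parentScope == null: this is the root
        root.add("Root");
        Scope nested = new Scope(); // opened "inside" root, like a nested production
        nested.add("AdditiveExpression");
        System.out.println(nested.rootNode()); // prints "Root"
    }
}
```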

The chaining of NodeScopes turns out to be an extremely useful feature because it allows the Parser to manage recursive calls. Each time a new NodeScope is opened (constructed), it is "nested" inside the previous NodeScope (parent). Each nested NodeScope is a new "context", fresh and independent of any previously existing NodeScopes. When we were running the Arithmetic2 example, we saw evidence of the nesting when the AST was dumped, which indented each time nesting (openNodeScope) occurred.

Here’s a sample of the dumped output:

(3 + (4 - 1)) * 2
Dumping the AST...
Root
  MultiplicativeExpression
    ParentheticalExpression
      (
      AdditiveExpression
        3
        +
        ParentheticalExpression
          (
          AdditiveExpression
            4
            -
            1
          )
      )
    *
    2
  EOF
The result is: 12.0

So let’s break a long-standing rule that generated code should never be modified by hand and put some print statements into the parser (Calc) to let us watch the sequence of events. In this particular case it’s definitely OK because when we’re done with these modifications, we can just rerun CongoCC and regenerate the parser.

Assuming you haven’t changed the Arithmetic2.javacc, open ex2/Calc.java and let’s start by annotating every time openNodeScope is called. Around line 200, add the println statement shown below.

202  if (buildTree) {
203      AdditiveExpression1= new AdditiveExpression();
204  --->System.out.println("AddExpr: call openNodeScope");
205      openNodeScope(AdditiveExpression1);
206  }

And then around line 272, add the following println statement:

273  if (buildTree) {
274      MultiplicativeExpression2= new MultiplicativeExpression();
275  --->System.out.println("MultExpr call openNodeScope");
276      openNodeScope(MultiplicativeExpression2);
277  }

And around line 361, add the following println statement:

366  if (buildTree) {
367      ParentheticalExpression3= new ParentheticalExpression();
368  --->System.out.println("ParenExpr call openNodeScope");
369      openNodeScope(ParentheticalExpression3);
370  }

And finally around line 410 add the following println statement:

415  if (buildTree) {
416      Root4= new Root();
417  --->System.out.println("Root call openNodeScope");
418      openNodeScope(Root4);
419  }

Now let’s announce every time a Token is consumed. Around line 552, add the following println statement:

552  System.out.println("consumeToken: " + nextToken.getImage());

Next, let’s add some lines to the openNodeScope method itself, around line 654:

654  String s = "unassigned 1";
655  if (lastConsumedToken == null) {
656      s = "lastConsumedToken == null" ;
657  } else {
658      s = "lastConsumedToken == " + lastConsumedToken.getImage();
659  }
660  System.out.println("openNodeScope: " + s) ;

Now let’s add println statements to both of the closeNodeScope methods. Around line 669, add the following statement:

671  System.out.println("closeNodeScope(Node, int) called for Token " +
672  lastConsumedToken.getImage() + " at " + lastConsumedToken.getEndOffset());

The second closeNodeScope method has a boolean as its second parameter and seems to be the only close method used in this example. This second closeNodeScope method needs println statements in two places, around the lines shown below:

697  System.out.println("closeNodeScope(Node, boolean) called for Token " +
698  lastConsumedToken.getImage() + " Ending at " + lastConsumedToken.getEndOffset());
. . . .
722  else  {
723        System.out.println("closeNodeScope(Node, boolean) else Token =  " +
724            lastConsumedToken.getImage() + " ending at " + lastConsumedToken.getEndOffset());
725    currentNodeScope.close();
726  }

And finally, let’s add a line that will make the beginning status a little clearer, as follows:

111  lastConsumedToken.setImage("DUMMY_START_TOKEN");

Save and exit the updated Calc.java file and then run the following two commands in the terminal window:

javac ex2/*.java
java ex2.Calc

If everything was entered correctly, it should compile without complaints, and when you run ex2.Calc, the cursor will advance to the next line and wait for you to input a mathematical expression to parse and evaluate. Let’s try something very simple, such as:

123
Root call openNodeScope
openNodeScope: lastConsumedToken == DUMMY_START_TOKEN
AddExpr: call openNodeScope
openNodeScope: lastConsumedToken == DUMMY_START_TOKEN
MultExpr call openNodeScope
openNodeScope: lastConsumedToken == DUMMY_START_TOKEN
consumeToken: 123
closeNodeScope(Node, boolean) else Token =  123 ending at 3
closeNodeScope(Node, boolean) else Token =  123 ending at 3
consumeToken:
closeNodeScope(Node, boolean) called for Token  Ending at 3
Dumping the AST...
Root
  123
  EOF
The result is: 123.0

We entered a number all by itself, with no operators or parentheses. When we pressed Enter, the Root production rule was called, which called the openNodeScope method, which now printed the information that the lastConsumedToken was DUMMY_START_TOKEN. The value of DUMMY_START_TOKEN was set in the last line that we added, where we called setImage so lastConsumedToken would have something to print (borrowing the contents from the constructor that uses lexer.DUMMY_START_TOKEN).

The first thing the Root production rule does is call the AdditiveExpression production rule, which also calls openNodeScope. Because we haven’t consumed any input so far, the message again refers to DUMMY_START_TOKEN. Inside the AdditiveExpression production rule it goes on to call the MultiplicativeExpression, which itself calls the openNodeScope method. And since we still haven’t consumed any Tokens, it again reports that lastConsumedToken == DUMMY_START_TOKEN.

Next we consume a <NUMBER> Token, this one with a value of 123. There is nothing else for it to do so it closes the current NodeScope, which causes the previous NodeScope to close.

At this point, there is no additional input to consume, which CongoCC helpfully interprets as EOF (end of file), allowing the Root production rule to complete and call closeNodeScope with an empty Token image. The main function then resumes and dumps the AST.

Time to start messing around with this annotated code. Give it more complicated math expressions to parse and evaluate (calc the value). Give it lots of parentheses to work with, such as:

(((((456+234-123+789)))))
Root call openNodeScope
openNodeScope: lastConsumedToken == DUMMY_START_TOKEN
AddExpr: call openNodeScope
openNodeScope: lastConsumedToken == DUMMY_START_TOKEN
MultExpr call openNodeScope
openNodeScope: lastConsumedToken == DUMMY_START_TOKEN
ParenExpr call openNodeScope
openNodeScope: lastConsumedToken == DUMMY_START_TOKEN
consumeToken: (
AddExpr: call openNodeScope
openNodeScope: lastConsumedToken == (
MultExpr call openNodeScope
openNodeScope: lastConsumedToken == (
ParenExpr call openNodeScope
openNodeScope: lastConsumedToken == (
consumeToken: (
AddExpr: call openNodeScope
openNodeScope: lastConsumedToken == (
MultExpr call openNodeScope
openNodeScope: lastConsumedToken == (
ParenExpr call openNodeScope
openNodeScope: lastConsumedToken == (
consumeToken: (
. . .
            (
            AdditiveExpression
              456
              +
              234
              -
              123
              +
              789
            )
          )
        )
      )
    )
  EOF
The result is: 1356.0

We can see how every time it hits an OPEN_PAREN it calls openNodeScope, opening up a fresh working environment until it finally hits some <NUMBER> and add/subtract operators which are all included in the innermost context (NodeScope). It remains in the same NodeScope until it reaches a CLOSE_PAREN and closes the currentNodeScope. If it had reached a times or divide operator instead of the CLOSE_PAREN, a new openNodeScope would be called to handle these multiplicative operators.

Fun stuff. When you’re done with these annotations, you can delete everything in the ex2 folder because you can regenerate the regular code anytime you want.

Next, we’ll look at States and State Diagrams.

CongoCC Basics

01 February 2022 by Jonathan Revusky

Parsing vs. Scanning

Consider the following JavaCC construct:

 ("foo" "bar" "baz")?

While the above three tokens, "foo", "bar", and "baz", might appear to be of equal weight and deserving of equal treatment, one of them is somehow different from the others.

Well, the difference will be more apparent if we write the above construct in pseudo-code:

 scan ahead one token and if it is a "foo" :
    consume "foo"
    consume "bar"
    consume "baz"

When expressed in pseudo-code, we can now see which of the tokens is different. In the above pseudo-code, the "foo" token is mentioned twice and the other two are only mentioned once. One can see this without knowing anything about JavaCC or parsing or programming at all!

Being first in the token list is what makes the "foo" token different. Unlike the other two, it is actually used twice: the first time to scan ahead and make a decision, and then, assuming that the "foo" was found, it is consumed along with the other two tokens.

Another key difference: if we look at the three "consume" lines above, the first line, the one that consumes a "foo", is different from the other two. Here is why:

It cannot fail!

If, in the previous line, we scanned ahead and saw that the next token is a "foo", we cannot really fail to now consume a "foo". We know that this is the next token coming off the stream, because we just checked! On the other hand, the two following lines could fail. Maybe the token after the "foo" is not a "bar", and even if it is, maybe the one after that is not a "baz".

So the above also illustrates the most basic fact about how JavaCC works:

Unless it is somehow overridden, at any choice point a JavaCC parser decides by peeking ahead exactly one token.
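In plain Java, the pseudo-code above might be sketched like this (a hand-written illustration with made-up names, not generated parser code):

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Hand-written sketch of ("foo" "bar" "baz")? : "foo" sits at the choice
// point, so it is checked once (scanning) and then consumed (parsing).
public class OptionalSequence {
    private final Deque<String> tokens;

    OptionalSequence(String... toks) {
        this.tokens = new ArrayDeque<>(List.of(toks));
    }

    void parseOptionalFooBarBaz() {
        // Scan ahead exactly one token; nothing is consumed here.
        if ("foo".equals(tokens.peek())) {
            consume("foo"); // cannot fail: we just checked
            consume("bar"); // can fail
            consume("baz"); // can fail
        }
    }

    private void consume(String expected) {
        String next = tokens.poll();
        if (!expected.equals(next)) {
            throw new RuntimeException("Expected " + expected + " but got " + next);
        }
    }

    public static void main(String[] args) {
        new OptionalSequence("foo", "bar", "baz").parseOptionalFooBarBaz(); // matches
        new OptionalSequence("qux").parseOptionalFooBarBaz();               // skipped, no error
    }
}
```

Note that only the check of "foo" is a decision; once it succeeds, consuming "foo" cannot fail, while consuming "bar" or "baz" still can.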

In fact, a more precise way of expressing the above point about "foo" being different from the other two tokens would be to say:

What is different about "foo" is that it is at a choice point and the other two tokens are not.

You see, developing a complete conceptual model of JavaCC involves understanding this very basic distinction. A JavaCC-generated parser, at any given point, is in one of two states.

  • A "parsing" state in which it is consuming input

  • A "scanning" state in which the parser is scanning input without consuming it.

Note that the above distinction does not just apply to consuming input as opposed to simply scanning it. It is only in the parsing state that we build up the AST and execute any other user-defined actions; these don’t happen when we are merely scanning ahead.

However, I think that a lot of JavaCC users run into problems because they are continually tripping up over this sort of basic conceptual issue. This may not be so much because the core concept is so difficult, but more because regular parsing and scanahead actually interact in some rather complex, messy ways.

For example, consider the following grammar rule:

 FooOrBar :
   Foo
   |
   Bar
   |
   {throw new ParseException("Expecting a foo or bar here.");}
 ;

This seems like a pretty natural production. One noteworthy thing is that the above grammar production cannot work in a syntactic lookahead, such as the following:

 (LOOKAHEAD(FooOrBar()) FooOrBar Baz Bat)?

Here is why: the lookahead always succeeds!

You see, the third choice, after Foo and Bar is a Java code block and, in a scanning step, any Java code action is automatically assumed to succeed. So during parsing, if the next token wasn’t a Foo or a Bar, the exception would be thrown, leaving the user to wonder why it worked earlier.

It could just as well be written: LOOKAHEAD({true})!

FYI, this has been fixed in JavaCC 21 by introducing a new FAIL statement which works as expected in either a scanning or parsing state. So the above is preferably written:

 FooOrBar :
    Foo
    |
    Bar
    |
    FAIL "Expecting a Foo or Bar here."
 ;

The above construct does work in a scanning state because if what follows matches neither Foo nor Bar then it hits the FAIL statement which causes the lookahead routine to return false. So, the FooOrBar rule works as expected in either a parsing or a scanning state.
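One way to picture why FAIL behaves correctly in both states is this hand-written sketch (made-up names, not generated code): the same rule is compiled into a parsing routine that throws and a scanning routine that simply returns false:

```java
// Sketch of the dual compilation of FooOrBar: the parsing routine has side
// effects and hard errors, the scanning routine has neither.
public class FooOrBarDemo {
    // Parsing state: accepts the token or reports a hard error.
    static String parseFooOrBar(String next) {
        if ("Foo".equals(next) || "Bar".equals(next)) return next;
        throw new RuntimeException("Expecting a Foo or Bar here."); // the FAIL statement
    }

    // Scanning state: no side effects, no exceptions; FAIL becomes "return false".
    static boolean scanFooOrBar(String next) {
        return "Foo".equals(next) || "Bar".equals(next);
    }

    public static void main(String[] args) {
        System.out.println(scanFooOrBar("Baz"));  // false: lookahead fails cleanly
        System.out.println(parseFooOrBar("Foo")); // Foo
    }
}
```

Had the rule used an embedded Java block instead of FAIL, the scanning routine would have had no choice but to return true unconditionally, which is exactly the LOOKAHEAD({true}) problem described above.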

What the above example actually illustrates is a more general concept:

Generally speaking, the key to writing robust grammars in JavaCC is to write grammar rules that work as expected in a parsing or a scanning state.

Well, at least to the greatest extent possible. I guess the above is also a specific case of the Principle of Least Surprise.


Older posts are available in the archive.