Expression of Intent

This weekend I worked on the Factual language parser. I started with a brute-force recursive descent parser, but ended up with something more elegant. The final product expresses the intent better than the original code did. It is easier to maintain, and easier to extend. You can follow along with the changesets on Codeplex. Here’s how I did it.

Step 1: Make it work
The goal was to create a parser for a known language. It was not to create a framework for any language that you could throw at it. My constraints were well defined:

  • Parse the Factual language
  • Produce an Abstract Syntax Tree
  • Work within an open source license

These constraints defined my solution. Since I am working on an open source project, I did not want to take a dependency upon a proprietary parser generator. And since I knew the language that I was parsing, I was able to choose parser patterns that fit that language.

It would be foolish to take on this project without knowing the theory of compiler design. I have seen too many ad-hoc parsers created by people who did not do their research. The mistakes that they made could have easily been avoided. “Make it work” does not mean throwing together the simplest thing that would work. It means think it through, but go straight at the problem. Persistence is no substitute for understanding.

Knowing the theory, I knew that my language as defined was LL(4). It is unambiguous with a lookahead of only 4 tokens. I also knew that with a slight modification, I could transform it into an LL(1) grammar. This kind of grammar is easily parsable using recursive descent.

I started with this unit test:

[TestMethod]
public void WhenNamespaceHasNoDot_NamepaceIsRecognized()
{
    Parser parser = new Parser(new StringReader("namespace GameModel;"));
    Namespace result = parser.Parse();
    Pred.Assert(result.Identifier, Is.EqualTo("GameModel"));
    Pred.Assert(result.LineNumber, Is.EqualTo(1));
}

I did not write this code, as some proponents of TDD would recommend:

public Namespace Parse()
{
    return new Namespace("GameModel", 1);
}

Instead, I wrote this:

public Namespace Parse()
{
    Consume();
    Namespace namespaceNode = MatchNamespace();
    return namespaceNode;
}

private Namespace MatchNamespace()
{
    Token namespaceToken = Expect(Symbol.Namespace, "Add a 'namespace' declaration.");

    if (Lookahead.Symbol != Symbol.Identifier)
        throw new FactualException("Provide a dotted identifier for the namespace.", Lookahead.LineNumber);
    string namespaceIdentifier = MatchDottedIdentifier();

    Expect(Symbol.Semicolon, "Terminate the namespace declaration with a semicolon.");

    return new Namespace(namespaceIdentifier, namespaceToken.LineNumber);
}

That first method would have made the test pass, but it represents none of the understanding of the problem space. I find it better to capture that understanding in code as soon as possible, rather than writing a foolish method that I will throw away later.

After getting it working, I had a suite of unit tests. Each subsequent step was performed while keeping that same test suite passing.

Step 2: Clean up the working code

It bothered me that MatchNamespace knew that MatchDottedIdentifier expected an Identifier. So I did a quick refactoring to move that knowledge closer to the right place.

private Namespace MatchNamespace()
{
    ...

    if (!StartOfDottedIdentifier())
        throw new FactualException("Provide a dotted identifier for the namespace.", Lookahead.LineNumber);
    string namespaceIdentifier = MatchDottedIdentifier();

    ...
}

private bool StartOfDottedIdentifier()
{
    return Lookahead.Symbol == Symbol.Identifier;
}

private string MatchDottedIdentifier()
{
    StringBuilder result = new StringBuilder();

    Token idenifier = Expect(Symbol.Identifier, "Begin with an identifier.");
    result.Append(idenifier.Value);

    ...
}

Step 3: Find the patterns

After cleaning the code in a similar way in several places, I noticed that there were Start and Match pairs for almost every rule. I compared this with my knowledge of parsers, and found that it is a valid generalization.

In an LL(1) grammar, you can calculate the First and Follow sets of every production. The First set is the set of tokens that could appear at the beginning of the production. The Follow set is the set that could appear after the production has been reduced. I recognized that Start represented the First set. The Follow set is useful for generating the First set (First includes Follow if the production could be empty), but was not strictly required for my grammar.

Having discovered that pattern, I extracted a common interface (actually an abstract base class).

public abstract class Rule<T>
{
    public abstract bool Start(Symbol symbol);
    public abstract T Match(TokenStream tokenStream);
}

I implemented this interface with one derived class per production.

Step 4: Find the common intent

Several of the rules had a common structure. They each called a sequence of other rules in turn. Each time, they would check the Start, and then call Match. This commonality was not accidental. Each of these rules intended to represent a sequence. They only differed in which sequence, and what was done with it. So I factored out that intent.

public class RuleSequence3<T1, T2, T3, T> : Rule<T>
{
    private Rule<T1> _rule1;
    private Rule<T2> _rule2;
    private string _error2;
    private Rule<T3> _rule3;
    private string _error3;
    private Func<T1, T2, T3, T> _reduce;

    public RuleSequence3(Rule<T1> rule1, Rule<T2> rule2, string error2, Rule<T3> rule3, string error3, Func<T1, T2, T3, T> reduce)
    {
        _rule1 = rule1;
        _rule2 = rule2;
        _error2 = error2;
        _rule3 = rule3;
        _error3 = error3;
        _reduce = reduce;
    }

    public override bool Start(Symbol symbol)
    {
        return _rule1.Start(symbol);
    }

    public override T Match(TokenStream tokenStream)
    {
        T1 value1 = _rule1.Match(tokenStream);

        if (!_rule2.Start(tokenStream.Lookahead.Symbol))
            throw new FactualException(_error2, tokenStream.Lookahead.LineNumber);
        T2 value2 = _rule2.Match(tokenStream);

        if (!_rule3.Start(tokenStream.Lookahead.Symbol))
            throw new FactualException(_error3, tokenStream.Lookahead.LineNumber);
        T3 value3 = _rule3.Match(tokenStream);

        return _reduce(value1, value2, value3);
    }
}

Step 5: Express the intent

Now that rules can be built from the outside rather than coded from within, I had the opportunity to create a fluent interface. The code that uses this interface expresses the grammar of the language, rather than the steps required to parse it. The resulting code reads like a specification rather than an implementation.

// dotted_identifier -> identifier ("." identifier)*
// namespace_declaration -> "namespace" dotted_identifier ";"
var dottedIdentifier = Separated(
    Terminal(Symbol.Identifier),
    Symbol.Dot,
    identifier => new StringBuilder().Append(identifier.Value),
    (stringBuilder, identifier) => stringBuilder.Append(".").Append(identifier.Value));
var namespaceRule = Sequence(
    Terminal(Symbol.Namespace),
    dottedIdentifier, "Provide a dotted identifier for the namespace.",
    Terminal(Symbol.Semicolon), "Terminate the namespace declaration with a semicolon.",
    (namespaceToken, identifier, ignored) => new Namespace(identifier.ToString(), namespaceToken.LineNumber));

I find that this code is much easier to maintain and extend. The engineering of getting the parser right has been done once. Additional productions can take advantage of that engineering, so that simple mistakes are avoided. The goal was no reuse, but in the end I had created a reusable parser generator.

I might have been able to go straight there, but only at the risk of over engineering the solution. By refactoring my way from a working system toward an expression of intent, I was able to bring in just the theory that this particular problem requires. I know the limitations of my parser, and I can live within them.

Leave a Reply

You must be logged in to post a comment.