Monadic Parser Combinators with C#

A friend of mine recently pointed me to an interesting paper which details the idea of monadic parser combinators. Here several ideas from functional programming are fused to arrive at a new perspective on building document parsers. With parser combinators one takes a hierarchical, bottom-up approach in which small atomic parsers are combined in a well defined way to arrive at more and more complex units.

Over the years, functional paradigms have become more and more pronounced in C#. With C# 2 we saw the introduction of anonymous delegates, a precursor to lambda function which followed in C# 3. I have always enjoyed functional programming ideas and thought it would be a nice C# exercise to implement monadic parsers. So the plan is to explain the idea of monadic parser combinators and their implementation in C# in this post and then hopefully apply these ideas in a follow up post to develop a simple Lisp interpreter.

If you are interested in a more detailed explanation take a look at Hutton and Meijers paper and references therein.

Parser combinators

A parser combinator is a higher-order function which serves as a tool to string atomic parsers together and form more complex parsing units. On an abstract level a parser may be described as a function which consumes a string of input and constructs an abstrat syntax tree (AST).

(1)   \begin{equation*} \text{Parser}: \text{String} \rightarrow \text{Tree} \end{equation*}

Although the above definition makes sense, when considered as a building a block, a parser for values of type TV and inputs of type TI should be rather thought of as a function which maps an input to a list of tuples containing a partial tree and the unconsumed input,

(2)   \begin{equation*} \text{Parser TV}: \text{TI} \rightarrow [(\text{TV},\text{TI})] \end{equation*}

It is a good idea to return a list of tuples rather than a tuple for two reasons: first of all, parsing may fail and a convenient way to signal this to the caller is to return an empty result list rather than throwing a runtime exception. On the other a list opens up the possibility of non-deterministic parsing, where a piece of input may be interpreted in more than one way, thus spawning different interpretation trees processed in parallel.

Transcripted to C# the above definition might look like this,

public class Result<TV,TI> {
   public readonly TV Value;
   public readonly TI Input;
   public Result(TV value, TI input) {
      Value = value; Input = input;
   }
}
/* parser definition */
public delegate Result<TV,TI>[] Parser<TV,TI>(TI input);

/* parser map definition */
public delegate Parser<TVout,TI> ParserMap<TVin,TVout,TI>(TVin value);

Here we have abstracted on both the parsed and the input type by introducing generics. The last line defines a ParserMap; a function which takes a value of type TVin and returns a Parser for values of type TVout.

Three primitive parsers provide the basic building blocks of combinator parsing. The result parser map produces parsers which succeed without consuming any of the input and return the supplied value v.

(3)   \begin{equation*} \text{result}:: \text{TV} \rightarrow \text{Parser TV}\\ \text{result v = inp} \rightarrow [(\text{v, inp})] \end{equation*}

Dually, on the other hand the parser fail always fails, regardless of the input string,

(4)   \begin{equation*} \text{zero}:: \text{Parser TV}\\ \text{zero = inp} \rightarrow [] \end{equation*}

Using the delegates defined above the following C# code captures these ideas,

/* some primitive parsers and parser maps */
public class Primitives<TV,TI> {

   /* result parser map creates a parser which produces
    * a Result array of length 1 with the passed in value
    * "value" inserted into the value field of the Result 
    * and the supplied input completely uncomsumed in the
    * input field */
   public static Parser<TVout,TI> result<TVout>(TVout v) {
      return inp => {
         return new Result<TVout,TI>[] {
            new Result<TVout,TI>(value: v, input: inp)
         };
      };
   }

   /* fail parser always fails with an empty
    * result list */
   public static Result<TV,TI>[] fail(TI input) {
      return new Result<TV,TI>[] {};
   }
...
}

The final primitive is the shift parser, which succeeds by consuming one character of the input string and returning it as the result,

(5)   \begin{equation*} \text{result}:: \text{TV} \rightarrow \text{Parser TV}\\ \text{result v = inp} \rightarrow [(\text{v, inp})] \end{equation*}

public abstract class CharPrimitives<TI> : Primitives<char,TI> {
   /* shift peels of one character from the
    * input and returns it in the result */
   public abstract Result<char,TI>[] shift (TI input);
...
}

The introduction of monadic sequencing combinator allows to chain parser together. It combines the sequencing of parser with the processing of their result values

(6)   \begin{equation*} \text{bind}::\text{Parser TVin} \rightarrow (\text{TVin} \rightarrow \text{Parser TVout}) \rightarrow \text{Parser TVout}\\ \text{p 'bind' f} = \text{input} \rightarrow \text{concat}[\text{f v input'}|(\text{v,input'}) \leftarrow \text{p input}] \end{equation*}

The above gives instructions on how to combine a parser p and a parser map f: first of all the parser p is applied to the input string yielding a list of result tuples. f is now applied to each value in turn yielding a parser which is then fed the unconsumed string). The result lists are concatenated and returned. In C# this looks like this

public static class ParserCombinatorExtensions {
   /* all-powerful bind */
   public static Parser<TVout,TI> bind<TVin,TVout,TI>(this Parser<TVin,TI> p,
      ParserMap<TVin,TVout,TI> f) {
      return input => {
            List<Result<TVout,TI>> result = new List<Result<TVout,TI>> ();
            Array.ForEach(p(input), r => { 
               Array.ForEach(f(r.Value)(r.Input), R => result.Add(R))
            });
            return result.ToArray();
         };
      }
      ...
}

Note here that we have defined extension methods for the delegate Parser. This allows for a convenient infix notation of the form q = p.bind(f).

A simple application of the bind operation will make the rationale a lot easier to grasp: We can now use bind to build a number class of useful single character parsers. We define the map sat which accepts a predicate (a boolean-value function which selects a subset of the characters) and returns a parser which consumes a single character of it satisfies the predicate and fails otherwise.

/* sat uses the shift parser to consume some of the input and 
 * checks if it s */
public Parser<TV,TI> sat(Parser<TV,TI> shift, Predicate<TV> f) {
   return shift.bind<TV,TV,TI> (c => {
      if(f(c))
         return result<TV>(c);
      else
         return fail;
   });
}

Using sat it is easy to define parsers for digits, whitespace, lower case characters and uppercase characters.

Parser<char,TI> digit, space, lowerCase, upperCase;
digit = sat (shift, c => "0123456789".Contains (c));
space = sat (shift, c => " \t\n".Contains (c) );
lowerCase = sat (shift, c => "abcdefghijklmnopqrstuvwxyz".Contains (c));
upperCase = sat (shift, c => "ABCDEFGHIJKLMNOPQRSTUVWXYZ".Contains (c));

Application of the digit parser to the string “1984” produces the result array [(‘1’,”984″)], while the lowerCase parser fails with [].

It is useful to define two more parser maps: an or operation which accepts an array of parser and returns a parser which tries the supplied parsers in turn and return the result of the first which succeeds.

public static class ParserCombinatorExtensions {
   ...
   /* try a series of parsers and pick the first
    * one to succeed */
   public static Parser<TV,TI> or<TV,TI>( this Parser<TV,TI> p,
      params Parser<TV,TI>[] parsers ) {
      return input => {
         Result<TV,TI>[] r;

         /* apply yourself */
         r = p(input);
         if(r.Count() > 0)
            return r;

         /* apply the parsers in the argument */
         foreach (Parser<TV,TI> P in parsers) {
            r = P (input);
            /* the first one to succeed returns
             * the result */
            if (r.Count () > 0)
               return r;
         };
         /* if nobody succeeds we fail */
         return new Result<TV,TI>[] { };
      };
   }
   ...
}

We can now conveniently construct a letter parser like so

Parser<char,TI> letter = letter = lowerCase.or (upperCase );

Repetition parsers

to be continued…