Lexing II: readtable–macro parsing

Ode to Lisp’s reader

Lisp is an exceptionally well-designed language. One of the many ways this shines through is in how its parsing works. Lisp's parser is simple and extremely flexible, to the point that arbitrarily complex changes can be introduced into the language in a natural way. For instance, cl-ppcre is a regular-expression library that adds Perl-like regular-expression syntax to Lisp; cl-yacc does the same for LALR grammars (no need for a separate yacc command-line program: the special production-rule syntax is simply added to the language!); and, to give a more drastic example of syntactic flexibility, sweet expressions changes the syntax of Lisp to include infix operations and indentation-based expressions (i.e., the off-side rule instead of nested parentheses). In the vast majority of programming languages, any change to the language's syntax would require a separate program to translate the new syntax into the old one, but in Lisp we can do away with such an intermediary, and implement such changes directly by writing a Lisp program that changes the behaviour of the Lisp parser. We can do this because the parsing mechanism is exposed to the programmer and easy to extend.

Readtable–macro parsing

Lisp parses its input via a readtable–macro mechanism. A special table called the readtable assigns a type to each character, such as whitespace, constituent, macro-character, and so on. The default reader algorithm then reads one character at a time and decides what to do based on the character's type. For instance, constituent characters accumulate into tokens — which could mean identifiers or numbers — while macro characters, such as (, trigger parsing sub-procedures, called reader macros, that handle the characters that follow (presumably calling the default reader algorithm recursively at times, and returning control to it when the matching closing character ) is found). Lisp allows us to plug into this mechanism by changing the readtable and by defining our own reader macros. The end result is that we may use Lisp to change the syntax of Lisp to arbitrarily fit our purpose.
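To make the mechanism concrete, here is a toy model of such a reader (in Python, and purely illustrative: the names and the readtable representation are my own, and Lisp's actual reader is considerably richer). Whitespace is skipped, constituent characters accumulate into tokens, and the macro character ( dispatches to a reader macro that calls the reader back recursively until it finds the matching ):

import io

def read_form(stream, readtable):
    """Skip whitespace, then either fire a reader macro or accumulate
    constituent characters into a token (here, just a string)."""
    while True:
        ch = stream.read(1)
        if ch == "":
            return None                             # EOF
        kind = readtable.get(ch, "constituent")
        if kind == "whitespace":
            continue
        if callable(kind):                          # macro character
            return kind(ch, stream, readtable)
        if kind == "closing":
            raise SyntaxError("unmatched closing delimiter")
        token = ch                                  # constituent: accumulate
        while True:
            ch = stream.read(1)
            if ch == "" or readtable.get(ch, "constituent") != "constituent":
                if ch:
                    stream.seek(stream.tell() - 1)  # "unread" the character
                return token

def read_list(opening, stream, readtable):
    """Reader macro for '(': read forms recursively until the matching ')'."""
    items = []
    while True:
        ch = stream.read(1)
        if ch == "":
            raise SyntaxError("unclosed list")
        if ch == ")":
            return items
        if readtable.get(ch, "constituent") == "whitespace":
            continue
        stream.seek(stream.tell() - 1)
        items.append(read_form(stream, readtable))

readtable = {" ": "whitespace", "\n": "whitespace",
             ")": "closing", "(": read_list}

print(read_form(io.StringIO("(add 1 (mul 2 3))"), readtable))
# prints ['add', '1', ['mul', '2', '3']]

Extending the syntax is then just a matter of adding entries to the readtable, which is exactly the hook that reader macros exploit.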

Lyc’s lexical analyser

Lyc should have the best of Lisp, Python, and C. A readtable–macro mechanism is not very good at handling infix notation, implicit operator precedence, and other syntactic niceties of Python and C — the next post in this blog will suggest an approach to deal with those. But I believe that, together with the parsing-via-chain-of-coroutines method outlined in the previous post, it is an ideal mechanism for lexical analysis: powerful, flexible, easily extensible, and sufficiently fast.

A readtable–macro algorithm requires a readtable. This is a table classifying sequences of characters according to their syntax type, which in lyc is one of whitespace, newline, token, macro, closing, constituent, or invalid. By default, a single character is a constituent sequence.

Having such a readtable, we can, at each step, read the longest sequence of characters that appears in the readtable, and obtain its syntax type. We assume that some automatic process keeps track of the line number, column number, and visual-column number¹ of the currently-read sequence.
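As a concrete sketch (the table contents and helper names here are hypothetical, chosen to match the example later in this post), the readtable can be a map from character sequences to syntax types with longest-match lookup, and the automatic position tracking can be a small function that also maintains the visual column; tab stops every 4 columns reproduce the example of footnote 1:

READTABLE = {
    " ":  "whitespace",
    "\t": "whitespace",
    "\n": "newline",
    "->": "token",       # a hypothetical multi-character token sequence
    "#":  "macro",       # the comment macro of the example below
    "“":  "macro",       # the string macro of the example below
    "”":  "closing",
}
# Any sequence absent from the table is, by default, a one-character constituent.

def read_sequence(text, pos):
    """Return (sequence, syntax type) for the longest match starting at pos.
    A real implementation would use a trie rather than scanning every length."""
    best = text[pos]
    for length in range(2, len(text) - pos + 1):
        candidate = text[pos:pos + length]
        if candidate in READTABLE:
            best = candidate
    return best, READTABLE.get(best, "constituent")

def advance(line, col, vcol, seq, tab_width=4):
    """Update (line number, column, visual column) after reading seq.
    With tab_width=4, the 'e' of 'ab\\tde' ends up at visual column 6,
    as in footnote 1."""
    for ch in seq:
        if ch == "\n":
            line, col, vcol = line + 1, 1, 1
        elif ch == "\t":
            col, vcol = col + 1, vcol + tab_width - (vcol - 1) % tab_width
        else:
            col, vcol = col + 1, vcol + 1
    return line, col, vcol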

The algorithm that follows is based on Lisp’s default reader algorithm.

Stage 1

  1. One sequence at a time is read from the input stream, until either:
    • A closing sequence has been read (EOF qualifies, but also closing delimiters), and then the procedure exits, or
    • A non-whitespace, non-newline sequence is reached; we signal an error if this sequence does not begin in the first relative-column², and if it does:
      • We unread this sequence,
      • We emit a BEGIN token and store it for later use, and
      • We continue onto step 2.
  2. One sequence x is repeatedly read from the input stream and dispatched according to its syntax type to one of the following items.
    1. If x is a closing sequence, then we unread x, we emit an END token — associated with the remembered BEGIN token — and the procedure exits.
    2. If x is a whitespace sequence, it is discarded.
    3. If x is a newline sequence, we go to Stage 2.
    4. If x is a token sequence, then we emit the corresponding CONSTITUENT token.
    5. If x is a macro sequence, then let c be the visual-column number of the next character in the stream; we then delegate the scanning of the stream to the corresponding macro-tokenizer. This happens by making a new call to a coroutine, and feeding it all the subsequent lines whose first non-whitespace character does not appear before column c, and, in those lines, only the characters that appear from column c onwards. We capture and emit all tokens emitted by the coroutine.
    6. If x is a constituent sequence, then new sequences are read, until a non-constituent sequence appears; this last sequence is unread, all the previous constituent sequences are concatenated, and we emit this concatenation as a CONSTITUENT token.
    7. If x is an invalid sequence, then an error is signaled.
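Stage 1 translates quite directly into the chain-of-coroutines style of the previous post. Below is a condensed, non-authoritative Python sketch of it as a generator; Token, the reader protocol (read/unread/peek, plus a clipped_to view that hides the columns to the left of a given column), and the macro_tokenizers table are assumed interfaces, not lyc's actual code, and stage2 is sketched after the next list:

from dataclasses import dataclass

@dataclass
class Token:
    kind: str                    # BEGIN, END, INDENT, CONSTITUENT, ...
    text: str = ""
    opener: "Token" = None       # the BEGIN that this END/INDENT belongs to

macro_tokenizers = {}            # e.g. mapping "“" and "#" to their coroutines

def stage1(reader):
    # Step 1: skip whitespace and newlines; exit on a closing sequence
    # (EOF qualifies); otherwise check the column and emit BEGIN.
    while True:
        seq = reader.read()
        if seq is None or seq.syntax_type == "closing":
            return
        if seq.syntax_type not in ("whitespace", "newline"):
            if seq.column != 1:
                raise SyntaxError("does not begin at the first relative column")
            reader.unread(seq)
            begin = Token("BEGIN")
            yield begin
            break

    # Step 2: dispatch each sequence x on its syntax type.
    while True:
        x = reader.read()
        if x is None or x.syntax_type == "closing":        # 2.1
            if x is not None:
                reader.unread(x)
            yield Token("END", opener=begin)
            return
        elif x.syntax_type == "whitespace":                # 2.2: discard
            pass
        elif x.syntax_type == "newline":                   # 2.3: hand over to Stage 2
            yield from stage2(reader, begin)
            return
        elif x.syntax_type == "token":                     # 2.4
            yield Token("CONSTITUENT", x.text)
        elif x.syntax_type == "macro":                     # 2.5: delegate to the
            c = reader.peek().visual_column                # macro-tokenizer coroutine,
            tokenizer = macro_tokenizers[x.text]           # feeding it only the
            yield from tokenizer(reader.clipped_to(c))     # characters from column c on
        elif x.syntax_type == "constituent":               # 2.6: accumulate a token
            text = x.text
            while (reader.peek() is not None
                   and reader.peek().syntax_type == "constituent"):
                text += reader.read().text
            yield Token("CONSTITUENT", text)
        else:                                              # 2.7
            raise SyntaxError(f"invalid sequence {x.text!r}")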

Stage 2

  1. A number n is maintained; initially n = ∞.
  2. One sequence x is repeatedly read from the input stream, until either:
    • An EOF has been reached, and then the procedure exits, or
    • A non-whitespace, non-newline sequence x is reached, and then:
      • If the relative-column-number of x is greater than n, we signal an unexpected-indentation error.
      • If the relative-column-number of x is less than n, we emit an INDENT token, which we associate with the previously remembered BEGIN token (which keeps a list of associated INDENT tokens), and set n to the relative-column-number of x.
      • Otherwise, the relative-column-number of x is exactly n.
  3. We then recursively delegate the scanning of the stream to the Standard Circe Tokenizer, feeding it this first line of text and the subsequent lines, until a line of text is scanned whose first non-whitespace character appears before column n, while hiding from the called tokenizer the whitespace characters that prefix the fed lines.
  4. We go back to step 2.
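And a matching sketch of Stage 2, under the same assumed interfaces; the recursive call to stage1 stands in for the Standard Circe Tokenizer. Note that n starts at infinity, so the first non-blank line always produces the INDENT; the sketch also emits the block's END upon EOF, which is where the final END of the example below would come from (the prose above leaves this implicit):

import math

def stage2(reader, begin):
    n = math.inf                                  # step 1
    while True:
        seq = reader.read()                       # step 2
        if seq is None:                           # EOF: close the block and exit
            yield Token("END", opener=begin)
            return
        if seq.syntax_type in ("whitespace", "newline"):
            continue
        if seq.column > n:                        # deeper than the current level
            raise SyntaxError("unexpected indentation")
        if seq.column < n:                        # a new (shallower) level
            n = seq.column
            yield Token("INDENT", opener=begin)   # associated with BEGIN
        reader.unread(seq)
        # Step 3: recursively tokenize this line and the following lines, until
        # one starts before column n, hiding the whitespace prefix of each fed
        # line from the recursive call.
        yield from stage1(reader.clipped_to(n))
        # Step 4: fall through and loop back to step 2.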

Example

Suppose we had defined appropriate macros for the macro characters “ and #. Then, should we tokenize the following “code”:

test “lexing”
    of readtable-macro
    tokenizer
    # yeah!
      also-with
    “off-side rule
     for strings
     and comments”

We would get the following token sequence:

BEGIN
   CONSTITUENT(test)
   BEGIN_MACRO(“)
     STRING(lexing)
   END_MACRO(”)
   INDENT
   BEGIN
      CONSTITUENT(of)
      CONSTITUENT(readtable-macro)
   END
   BEGIN
      CONSTITUENT(tokenizer)
   END
   BEGIN
      BEGIN_MACRO(#)
         COMMENT( yeah!\n also-with\n)
      END_MACRO
      BEGIN_MACRO(“)
         STRING(off-side rule\nfor strings\nand comments)
      END_MACRO(”)
   END
END

This is already working!

This is the method by which the code in this blog is highlighted³.

test “lexing”
    of readtable-macro
    tokenizer
    # yeah!
      also-with
    “off-side rule
     for strings
     and comments”

Footnotes

  1. We will someday blog about characters; suffice it to say that the visual column of e is 5 in the string abcde, and 6 in the string ab→de. By the way, if the right-arrow symbol does not appear in your browser (if it appears garbled), try using a newer browser; the appropriate font should be loaded together with this website.

  2. The relative-column number is the column number of the read character, measured relative to the characters that were actually fed to the tokenizer. The column number of the character relative to the file it originated from may be different: for instance, when four columns of whitespace are hidden from a fed line, the character at file column 5 has relative column 1.

  3. The coloured identifiers (such as the orange test in the example) are currently hardcoded. But one day the colours will correspond to the type of object that the identifier represents.