Lexing II: readtable–macro parsing
30 Aug 2015

Ode to Lisp’s reader
Lisp is an exceptionally well-designed language. One of the many ways this shines through is in how parsing works. Lisp’s parser is simple and extremely flexible, to the point that arbitrarily complex changes can be introduced into the language in a natural way. For instance, cl-ppcre is a regular-expression library which adds Perl-like regular-expression syntax to Lisp; cl-yacc does the same for LALR grammars (no need for a separate yacc command-line program: the special production-rule syntax is simply added to the language!); or, to give a more drastic example of syntactic flexibility, sweet expressions change the syntax of Lisp to include infix operations and indentation-based expressions (i.e., to use the off-side rule instead of nested parentheses). In the vast majority of programming languages, making any change to the language’s syntax would require a separate program to recompile the new syntax into the old one, but in Lisp we can do away with such an intermediary and implement such changes directly, by writing a Lisp program that changes the behavior of the Lisp parser. We can do this because the parsing mechanism is exposed to the programmer and easy to extend.
Readtable–macro parsing
Lisp parses its input via a readtable–macro mechanism. A special table called the readtable assigns a type to each character, such as whitespace, constituent, macro-character, and so on. The default reader algorithm then reads one character at a time and decides what to do based on the character’s type. For instance, constituent characters accumulate into tokens (which could mean identifiers or numbers), and macro characters (such as `(`) trigger parsing sub-procedures, called reader macros, that handle the characters that follow (presumably calling the default reader algorithm recursively at times, and returning control to it when the corresponding closing character `)` is found). Lisp allows us to plug into this mechanism by changing the readtable and by defining our own reader macros. The end result is that we may use Lisp to change the syntax of Lisp to fit our purpose arbitrarily.
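To make the mechanism concrete, here is a minimal sketch in Python (all names invented for illustration, not any real reader's API): a table classifies each character, constituents accumulate into tokens, and the macro character `(` dispatches to a reader function that recursively calls the default reader until it finds the matching `)`.

```python
# Minimal sketch of Lisp-style readtable dispatch (hypothetical names).
# Any character absent from the table counts as a constituent.
READTABLE = {' ': 'whitespace', '\n': 'whitespace',
             '(': 'macro', ')': 'closing'}

def read_list(stream):
    """Reader macro for '(': read items until the matching ')'."""
    items = []
    while stream[0] != ')':
        items.append(read(stream))
    stream.pop(0)                     # consume the closing ')'
    return items

MACRO_READERS = {'(': read_list}

def read(stream):
    """Default reader algorithm: dispatch on each character's type."""
    token = ''
    while stream:
        kind = READTABLE.get(stream[0], 'constituent')
        if kind == 'constituent':
            token += stream.pop(0)    # accumulate into the current token
        elif token:
            return token              # any other type terminates a token
        elif kind == 'whitespace':
            stream.pop(0)             # skip leading whitespace
        elif kind == 'macro':
            ch = stream.pop(0)
            return MACRO_READERS[ch](stream)
        else:                         # a 'closing' with no pending token
            raise SyntaxError('unexpected ' + stream[0])
    return token
```

With this table, `read(list("(add 1 (mul 2 3))"))` produces the nested list `['add', '1', ['mul', '2', '3']]`; extending the syntax is just a matter of adding entries to `READTABLE` and `MACRO_READERS`.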
Lyc’s lexical analyser
Lyc should have the best of Lisp, Python, and C. A readtable–macro mechanism is not very good at handling infix notation, implicit operator precedence, and other syntactic niceties of Python and C; the next post in this blog will suggest an approach to deal with those. But I believe that, together with the parsing-via-chain-of-coroutines method outlined in the previous post, it is an ideal mechanism for lexical analysis: powerful, flexible, easily extensible, and sufficiently fast.
A readtable–macro algorithm requires a readtable. This is a table classifying sequences of characters according to their syntax type, which in lyc is one of whitespace, newline, token, macro, closing, constituent, or invalid. By default, a single character not listed in the table is a constituent sequence.
Having such a readtable, we can, at each step, read the longest sequence of characters that appears in the readtable, and obtain its syntax type. We assume that some automatic process keeps track of the line number, column number, and visual-column number¹ of the currently-read sequence.
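That longest-match step can be sketched as follows (the table contents here are invented for illustration, not lyc's actual readtable): at each position we take the longest prefix of the input that appears in the readtable, falling back to classifying a single unlisted character as a constituent sequence.

```python
# Longest-match lookup against a readtable of character sequences.
# The entries below are illustrative only.
READTABLE = {
    ' ':  'whitespace',
    '\n': 'newline',
    '“':  'macro',
    '”':  'closing',
    '-':  'constituent',
    '->': 'token',        # a multi-character sequence
}

def next_sequence(text, pos):
    """Return (sequence, syntax_type, next_pos) for the longest match."""
    best = None
    for seq, kind in READTABLE.items():
        if text.startswith(seq, pos) and (best is None or len(seq) > len(best[0])):
            best = (seq, kind)
    if best is None:                        # unlisted: single constituent
        return text[pos], 'constituent', pos + 1
    return best[0], best[1], pos + len(best[0])
```

For example, `next_sequence("->x", 0)` yields the two-character token sequence `'->'` rather than stopping at the single constituent `'-'`.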
The algorithm that follows is based on Lisp’s default reader algorithm.
Stage 1
1. One sequence is `read` from the input stream, until either:
   - A closing sequence has been read (EOF qualifies, but also closing delimiters), and then the procedure exits, or
   - A non-whitespace, non-newline sequence is reached; we signal an error if this sequence does not begin in the first relative-column², and if it does:
     - We `unread` this sequence,
     - We `emit` a BEGIN token and store it for later use, and
     - We continue onto step 2.
2. One sequence x is repeatedly `read` from the input stream and dispatched according to its syntax type to one of the following items.
   - If x is a closing sequence, then we `unread` x, we `emit` an END token (associated with the remembered BEGIN token), and the procedure exits.
   - If x is a whitespace sequence, it is discarded.
   - If x is a newline sequence, we go to Stage 2.
   - If x is a token sequence, then we `emit` the corresponding CONSTITUENT token.
   - If x is a macro sequence, then let c be the visual-column number of the next character in the stream; we then delegate the scanning of the stream to the corresponding macro-tokenizer. This happens by making a new call to a coroutine, and feeding to it all the subsequent lines whose first non-whitespace character does not appear before column c, and, in those lines, only the characters that appear from column c onwards. We `capture` and `emit` all tokens emitted by the coroutine.
   - If x is a constituent sequence, then new sequences are `read` until a non-constituent sequence appears; this last sequence is `unread`, all the previous constituent sequences are concatenated, and we `emit` this concatenation as a CONSTITUENT token.
   - If x is an invalid sequence, then an error is signaled.
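The constituent-accumulation and macro-delegation steps above can be sketched with Python generators standing in for the coroutines (the token shapes and the `“…”` string macro are simplified stand-ins, not lyc's real ones); `yield from` is precisely the capture-and-emit of the macro item.

```python
from collections import deque

def string_macro(stream):
    """Stand-in macro-tokenizer for '“': scan up to the matching '”'."""
    yield ('BEGIN_MACRO', '“')
    text = ''
    while stream and stream[0] != '”':
        text += stream.popleft()
    if stream:
        stream.popleft()                  # consume the closing '”'
    yield ('STRING', text)
    yield ('END_MACRO', '”')

MACROS = {'“': string_macro}

def tokenize(stream):
    """One-line version of Stage 1, step 2 (no newlines or BEGIN/END)."""
    while stream:
        ch = stream.popleft()
        if ch == ' ':
            continue                      # whitespace is discarded
        if ch in MACROS:
            # delegate to the macro-tokenizer; capture and emit its tokens
            yield from MACROS[ch](stream)
        else:
            token = ch                    # constituent: accumulate until
            while stream and stream[0] not in ' ”' and stream[0] not in MACROS:
                token += stream.popleft() # a non-constituent appears
            yield ('CONSTITUENT', token)
```

Tokenizing `test “lexing”` with this sketch yields `CONSTITUENT(test)`, `BEGIN_MACRO(“)`, `STRING(lexing)`, `END_MACRO(”)`, matching the shape of the example below.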
Stage 2
1. A number n is maintained; initially n = ∞.
2. One sequence x is repeatedly read from the input stream, until either:
   - An EOF has been reached, and then the procedure exits, or
   - A non-whitespace, non-newline sequence x is reached, and then:
     - If the relative-column number of x is greater than n, we signal an unexpected-indentation error.
     - If the relative-column number of x is less than n, we emit an INDENT token, which we associate with the previous remembered BEGIN token (which has a list of associated INDENT tokens), and set n to the relative-column number of x.
     - Otherwise the relative-column number of x is exactly n.
     - We then recursively delegate the scanning of the stream to the Standard Circe Tokenizer, feeding it this first line of text and subsequent lines, until a line of text is scanned whose first non-whitespace character appears before column n, while hiding from the called tokenizer the whitespace characters that prefix the fed lines.
3. We go to step 2.
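A sketch of the Stage 2 bookkeeping, under my reading of the algorithm (the initial value of n and the event shape are my assumptions): each line's relative column is compared against n, with deeper lines rejected, shallower lines opening a new INDENT, and equal lines continuing the current block. The recursive delegation to the tokenizer, which would normally consume the deeper lines, is elided here.

```python
import math

def indent_events(lines):
    """Yield an ('INDENT', column) event per change of indentation level."""
    n = math.inf                            # assumed initial value
    for line in lines:
        stripped = line.lstrip(' ')
        if not stripped:
            continue                        # blank lines carry no events
        column = len(line) - len(stripped)  # relative column of first char
        if column > n:                      # would have been consumed by
            raise SyntaxError('unexpected indentation')  # delegation
        if column < n:
            n = column
            yield ('INDENT', n)
        # column == n: same block, no event; delegation elided
```

Fed lines at columns 4, 4, 2, 0, this emits INDENT events at columns 4, 2, and 0, associating each successive indentation level with the enclosing BEGIN.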
Example
Suppose we had defined appropriate macros for the macro characters `“` and `#`. Then, should we tokenize the following “code”:
test “lexing”
  of readtable-macro
  tokenizer
  # yeah!
    also-with
  “off-side rule
   for strings
   and comments”
We would get the following token sequence:
BEGIN
CONSTITUENT(test)
BEGIN_MACRO(“)
STRING(lexing)
END_MACRO(”)
INDENT
BEGIN
CONSTITUENT(of)
CONSTITUENT(readtable-macro)
END
BEGIN
CONSTITUENT(tokenizer)
END
BEGIN
BEGIN_MACRO(#)
COMMENT( yeah!\n also-with\n)
END_MACRO
BEGIN_MACRO(“)
STRING(off-side rule\nfor strings\nand comments)
END_MACRO(”)
END
END
This is already working!
This is the method by which the code in this blog is highlighted³.
test “lexing”
  of readtable-macro
  tokenizer
  # yeah!
    also-with
  “off-side rule
   for strings
   and comments”
Footnotes
1. We will someday blog about characters; suffice it to say that the visual column of `e` is 5 in the string `abcde`, and 6 in the string `ab→de`. By the way, if the right-arrow symbol does not appear in your browser (if it appears garbled), try using a later version; the appropriate font should be loaded together with this website. ↩
2. The relative-column number is the column number of the read character, where the number is measured relative to the characters that were `read`. The column number of the character, relative to the file where it originated, may be different. ↩
3. The coloured identifiers (such as the orange `test` in the example) are currently hardcoded. But one day the colours will correspond to the type of object that the identifier represents. ↩