Lexing I: the off-side rule
02 Aug 2015

The idea of grouping blocks of instructions by using indentation is attributed to Peter J. Landin. It first appeared in his 1966 paper ‘The next 700 programming languages’, where it was named the ‘off-side rule’. Guido van Rossum’s Python — one of my favourite programming languages — would later popularize the use of the off-side rule.
Indentation is usually handled during lexical analysis. Properly parsing indentation requires the lexical analyser to maintain a stack of indentation levels as part of its state. This makes an indentation-aware lexer more expensive in memory than your typical lexical analyser, which for most languages is just a finite-state machine.
I personally find great advantage in having such a visually-immediate way of denoting program structure. I have at some point wondered why the off-side rule hasn’t been extended to other language constructs, such as strings and comments¹.
Lexing indentation
Lexical analysis is usually tasked with distinguishing keywords, identifiers, literals and punctuation, and discarding whitespace. Such a task can be very easily specified by defining the regular expressions which recognize the different kinds of tokens. These regular expressions are compiled (into finite-state automata), which are then fed the input characters one by one. Straightforward.
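As an illustration (mine, not from the post), this whole task can be sketched in a few lines of Python; the token set here (KEYWORD, IDENT, NUMBER, PUNCT) is invented for the example:

```python
import re

# One named group per token kind; re.VERBOSE lets us lay them out readably.
# Whitespace is recognized as its own kind and then discarded.
TOKEN_RE = re.compile(r"""
    (?P<KEYWORD>\b(?:if|else|while)\b)
  | (?P<IDENT>[A-Za-z_]\w*)
  | (?P<NUMBER>\d+)
  | (?P<PUNCT>[():=+\-*/<>])
  | (?P<WS>\s+)
""", re.VERBOSE)

def tokenize(source):
    tokens = []
    for m in TOKEN_RE.finditer(source):
        kind = m.lastgroup          # name of the group that matched
        if kind != "WS":            # discard whitespace
            tokens.append((kind, m.group()))
    return tokens
```

Note that `finditer` silently skips characters no pattern matches; a real lexer would report those as errors.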
In order to process indentation, a simple change to this scheme is needed: the characters at the beginning of each new line are preprocessed, and the number of leading whitespace characters is used to determine how the indentation level changed. More precisely, a stack S is maintained, recording the indentation levels (i.e. the number of whitespace characters) of all the blocks under which the current block is nested. Here is some pseudo-code:
    S := Int[]
    S.push(0)
    while not EOF():
        N := count-number-of-whitespaces()
        # Emit INDENT/DEDENT tokens as appropriate
        if N > top(S):
            S.push(N)
            emit INDENT
        elif N < top(S):
            while N < top(S):
                pop(S)
                emit DEDENT
            if N ≠ top(S):
                raise «Indentation Error»
        assert N = top(S)
        # Feed the remaining characters to the compiled regexps,
        # which may in turn emit various tokens; we capture
        # these tokens and emit them
        while (char := read()) ≠ '\n':
            yield char
        yield '\n'
The count-number-of-whitespaces function should return the number of whitespace characters at the beginning of the current line. We may think of several ways of implementing this function; for instance, the leading whitespace may be discarded, or fed into the regexps. Python discards all leading whitespace, but ignores indentation between delimiters. By the way, do you notice something off-side-ruleish about the multiline comments? All the code shown is parsed with the same parsing techniques that we will describe in this blog.
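Here is a rough, runnable Python rendering of the pseudo-code above (my sketch, with some simplifying assumptions: it works line by line, skips blank lines, and forwards the rest of each line as raw characters rather than feeding them to regexps):

```python
def lex_indentation(source):
    stack = [0]                      # S := [0], the stack of indentation levels
    out = []                         # emitted tokens and characters
    for line in source.splitlines():
        if not line.strip():
            continue                 # skip blank lines (an assumption)
        n = len(line) - len(line.lstrip(' '))
        if n > stack[-1]:
            stack.append(n)          # push the new, deeper level
            out.append('INDENT')
        else:
            while n < stack[-1]:     # pop until we find the matching level
                stack.pop()
                out.append('DEDENT')
            if n != stack[-1]:
                raise IndentationError('dedent to unknown level')
        out.extend(line[n:])         # "yield" the remaining characters
        out.append('\n')
    while stack[-1] > 0:             # close blocks still open at EOF
        stack.pop()
        out.append('DEDENT')
    return out
```

For example, `lex_indentation("a\n  b\nc\n")` produces an `INDENT` before `b` and a `DEDENT` before `c`, interleaved with the forwarded characters.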
Generalizing the off-side rule
We can think of the algorithm above as a kind of multi-party protocol: Alice is able to produce a sequence of characters, and she wants to obtain a sequence of tokens; Bob knows how to produce indentation tokens, but nothing else; and Charlie knows how to produce the remaining tokens, but is completely blind to indentation. Bob asks Alice for one character at a time, by invoking read, and produces indentation tokens by invoking emit. We can imagine that Charlie also asks Bob for one character at a time by invoking read, and that Bob sends Charlie a character by invoking yield.
There is nothing preventing us from generalizing this setting to an arbitrary number of players. Since we are using a stack, we may also gain from looking at a recursive variant of the algorithm, i.e., trying to use the call-stack instead. Putting these two ideas together suggests that we create a chain of coroutines, one for each indentation level; the routine at level n will invoke a level n+1 routine when the indentation level increases; level n+1 reads characters from level n, and yields characters to level n+2; it also captures tokens from level n+2 and emits tokens to level n.
Then, whenever the indentation level has increased — or when a new opening delimiter is seen, say — a new call is made to the coroutine that handles such an event. The new coroutine, when reading characters from the previous level, will fail to see any characters that appear before its own indentation level, so, in some sense, it doesn’t have to worry about them. If the previous-level coroutine finds that the indentation has decreased below the minimum required for the new coroutine, it will yield EOF — an end-of-file status — to it.
In case you are wondering, the precise and only difference between chain-of-coroutines and the usual recursive-descent parsing is that the caller controls what is seen by the callee, rather than completely delegating the processing of the characters.
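To make the two-party protocol concrete, here is a small sketch (mine, not from the post) with Bob and Charlie as Python generators. For simplicity Bob reads lines rather than single characters; he yields indentation tokens and `('CHAR', c)` items, and Charlie, blind to indentation, groups alphanumerics into IDENTIFIER tokens and passes Bob’s tokens straight through:

```python
def bob(text):
    """Tracks indentation; forwards everything else as ('CHAR', c)."""
    stack = [0]                              # stack of indentation levels
    for line in text.splitlines():
        if not line.strip():
            continue                         # skip blank lines
        n = len(line) - len(line.lstrip(' '))
        if n > stack[-1]:
            stack.append(n)
            yield ('INDENT',)
        while n < stack[-1]:
            stack.pop()
            yield ('DEDENT',)
        if n != stack[-1]:
            raise IndentationError('dedent to unknown level')
        for c in line[n:] + '\n':
            yield ('CHAR', c)
    while len(stack) > 1:                    # close blocks still open at EOF
        stack.pop()
        yield ('DEDENT',)

def charlie(stream):
    """Blind to indentation: lexes CHARs into identifiers, passes tokens through."""
    word = ''
    for item in stream:
        if item[0] == 'CHAR' and item[1].isalnum():
            word += item[1]                  # extend the current identifier
        else:
            if word:
                yield ('IDENTIFIER', word)   # flush the pending identifier
                word = ''
            if item[0] != 'CHAR':
                yield item                   # pass indentation tokens through
    if word:
        yield ('IDENTIFIER', word)
```

Running `list(charlie(bob("if x\n    y\nz")))` interleaves the identifier tokens with the INDENT/DEDENT pair, in source order.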
An example
We now show some pseudo-code that uses this idea to parse indented blocks of identifiers (sequences of alphanumeric characters). The algorithm will serve as the foundation for a very powerful method of lexical analysis, which we will describe in our next post. It is a bit harder to grasp what is going on, and the algorithm is potentially slower². But I personally find the end result to be conceptually quite elegant, and I think this will shine through when we later extend it.
    # This coroutine parses a 'block-seq', which is a sequence
    # of blocks of identifiers.
    # Each block in the sequence is of the form:
    #     sequence of identifiers
    #         (indented, possibly empty) block-seq
    coroutine parse-block-seq():
        loop:
            # Each cycle through this loop parses a single block
            skip-empty-lines()
            if first-non-empty-line-is-indented():
                raise «Unexpected Indentation»
            elif EOF(): return
            emit BEGIN
            # Stage I: Parse the first sequence of identifiers
            loop:
                if EOF(): emit END; return
                char := read()
                if char.is-whitespace():
                    pass  # ignore non-beginning-of-line whitespace
                elif char.is-newline():
                    break  # move on to Stage II
                else:
                    assert char.is-alphanumeric()
                    char-seq := char + read-and-concatenate-alphanums()
                    emit IDENTIFIER(char-seq)
            # Stage II: Parse the indented block-seq
            loop:
                N := count-number-of-whitespaces()
                if N = 0:
                    break  # this block is done
                else:
                    emit INDENT
                    # invoke a new instance of this coroutine;
                    # the 'char-generator' block defines the generator
                    # which feeds characters to this new instance,
                    # and the 'capture' block defines what to do when
                    # the new instance emits a token
                    parse-block-seq():
                        char-generator:
                            for-each-line:
                                W := count-leading-whitespace-up-to(N)
                                # If this level received an EOF or the
                                # indentation level decreased below N:
                                if W = EOF or W < N:
                                    yield EOF
                                else:
                                    yield all chars after the Nth
                        capture token:
                            emit token
                    # A dedent must return all the way to this level's column
                    if W ≠ EOF and W > 0: raise «Indentation Error»
                    emit DEDENT
            emit END  # close the current block, then parse the next one
Should we tokenize the following “code”:
    test lexing
      of block-seq
      tokenizer
        yeah
We would get the following token sequence:
    BEGIN
    IDENTIFIER(test)
    IDENTIFIER(lexing)
    INDENT
    BEGIN
    IDENTIFIER(of)
    IDENTIFIER(block-seq)
    END
    BEGIN
    IDENTIFIER(tokenizer)
    INDENT
    BEGIN
    IDENTIFIER(yeah)
    END
    DEDENT
    END
    DEDENT
    END
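For the curious, here is a compact runnable rendering of parse-block-seq in Python (my sketch; it uses plain recursion on the call-stack rather than the coroutine chain, and it assumes no blank lines; note that, as in the example above, identifiers are simply whitespace-separated words, so `block-seq` counts as one):

```python
def parse_block_seq(lines, i=0, col=0):
    """Return (tokens, next_line_index) for the block-seq at column `col`.

    The caller controls what the callee sees: a recursive call is handed
    the column of the nested block-seq, and stops as soon as the
    indentation falls below it.
    """
    tokens = []
    while i < len(lines):
        w = len(lines[i]) - len(lines[i].lstrip(' '))
        if w < col:
            break                    # indentation fell below col: "EOF" for this level
        if w > col:
            raise IndentationError('unexpected indentation')
        tokens.append('BEGIN')       # start of a block
        for word in lines[i].split():
            tokens.append(f'IDENTIFIER({word})')
        i += 1
        # An indented next line starts this block's nested block-seq
        if i < len(lines):
            w2 = len(lines[i]) - len(lines[i].lstrip(' '))
            if w2 > col:
                tokens.append('INDENT')
                sub, i = parse_block_seq(lines, i, w2)
                tokens += sub
                tokens.append('DEDENT')
        tokens.append('END')         # end of the block
    return tokens, i
```

Running it on the four lines of the example “code” reproduces the token sequence above, DEDENTs included.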
Footnotes
1. Recently, I discovered that CoffeeScript’s block strings allow for this. ↩
2. Because the routine at each level must read every character before it can be read by the routine at the next level. However, since we will only be using the call-stack to parse indentation levels, an ad-hoc modification can ensure that such multiple reading is unnecessary. ↩