Lexing I: the off-side rule
02 Aug 2015

The idea of grouping blocks of instructions by using indentation is attributed to Peter J. Landin. It first appeared in his 1966 paper ‘The next 700 programming languages’, where it was named the ‘off-side rule’. Guido van Rossum’s Python — one of my favourite programming languages — would later popularize the use of the off-side rule.
Indentation is usually handled during lexical analysis. Properly parsing indentation requires the lexical analyser to maintain a stack of indentation levels as part of its state. This makes an indentation-aware lexer more expensive in memory than your typical lexical analyser, which for most languages is just a finite-state machine.
I personally find great advantage in having such a visually-immediate way of denoting program structure. I have at some point wondered why the off-side rule hasn’t been extended to other language constructs, such as strings and comments¹.
Lexing indentation
Lexical analysis is usually tasked with distinguishing keywords, identifiers, literals and punctuation, and discarding whitespace. Such a task can be very easily specified by defining the regular expressions which recognize the different kinds of tokens. These regular expressions are compiled (into finite-state automata), which are then fed the input characters one by one. Straightforward.
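As an illustration (mine, not from the post), this whole task can be sketched in a few lines of Python; the token set here (KEYWORD, IDENT, NUMBER, PUNCT) is invented for the example:

```python
import re

# One named group per token kind; re.VERBOSE lets us lay them out readably.
# Whitespace is recognized as its own kind and then discarded.
TOKEN_RE = re.compile(r"""
    (?P<KEYWORD>\b(?:if|else|while)\b)
  | (?P<IDENT>[A-Za-z_]\w*)
  | (?P<NUMBER>\d+)
  | (?P<PUNCT>[():=+\-*/<>])
  | (?P<WS>\s+)
""", re.VERBOSE)

def tokenize(source):
    tokens = []
    for m in TOKEN_RE.finditer(source):
        kind = m.lastgroup          # name of the group that matched
        if kind != "WS":            # discard whitespace
            tokens.append((kind, m.group()))
    return tokens
```

Note that `finditer` silently skips characters no pattern matches; a real lexer would report those as errors.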
In order to process indentation, a simple change to this scheme is needed: the characters at the beginning of each new line are preprocessed, and the number of leading whitespace characters is used to determine how the indentation level changed. More precisely, a stack S is maintained, recording the indentation levels (i.e. the number of whitespace characters) of all the blocks under which the current block is nested. Here is some pseudo-code:
    S := Int[]
    S.push(0)
    while not EOF():
        N := count-number-of-whitespaces()
        # Emit INDENT/DEDENT tokens as appropriate
        if N > top(S):
            S.push(N)
            emit INDENT
        elif N < top(S):
            while N < top(S):
                pop(S)
                emit DEDENT
            if N ≠ top(S):
                raise «Indentation Error»
        assert N = top(S)
        # Feed the remaining characters to the compiled regexps,
        # which may in turn emit various tokens; we capture
        # these tokens and emit them
        while (char := read()) ≠ '\n':
            yield char
        yield '\n'
The count-number-of-whitespaces function should return the number of whitespace characters at the beginning of the current line. We may think of several ways of implementing this function; for instance, the leading whitespace may be discarded, or fed into the regexps. Python discards all leading whitespace, but ignores indentation between delimiters. By the way, do you notice something off-side-ruleish about the multiline comments? All the code shown is parsed with the same parsing techniques that we will describe in this blog.
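Here is a rough, runnable Python rendering of the pseudo-code above (my sketch, with some simplifying assumptions: it works line by line, skips blank lines, and forwards the rest of each line as raw characters rather than feeding them to regexps):

```python
def lex_indentation(source):
    stack = [0]                      # S := [0], the stack of indentation levels
    out = []                         # emitted tokens and characters
    for line in source.splitlines():
        if not line.strip():
            continue                 # skip blank lines (an assumption)
        n = len(line) - len(line.lstrip(' '))
        if n > stack[-1]:
            stack.append(n)          # push the new, deeper level
            out.append('INDENT')
        else:
            while n < stack[-1]:     # pop until we find the matching level
                stack.pop()
                out.append('DEDENT')
            if n != stack[-1]:
                raise IndentationError('dedent to unknown level')
        out.extend(line[n:])         # "yield" the remaining characters
        out.append('\n')
    while stack[-1] > 0:             # close blocks still open at EOF
        stack.pop()
        out.append('DEDENT')
    return out
```

For example, `lex_indentation("a\n  b\nc\n")` produces an `INDENT` before `b` and a `DEDENT` before `c`, interleaved with the forwarded characters.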
Generalizing the off-side rule
We can think of the algorithm above as a kind of multi-party protocol: Alice is able to produce a sequence of characters, and she wants to obtain a sequence of tokens; Bob knows how to produce indentation tokens, but nothing else; and Charlie knows how to produce the remaining tokens, but is completely blind to indentation. Bob asks Alice for one character at a time, by invoking read, and produces indentation tokens by invoking emit. We can imagine that Charlie also asks Bob for one character at a time by invoking read, and that Bob sends Charlie a character by invoking yield.
There is nothing preventing us from generalizing this setting to an arbitrary number of players. Since we are using a stack, we may also gain from looking at a recursive variant of the algorithm, i.e., trying to use the call-stack instead. Putting these two ideas together suggests that we create a chain of coroutines, one for each indentation level; the routine at level n will invoke a level n+1 routine when the indentation level increases; level n+1 reads characters from level n, and yields characters to level n+2; it also captures tokens from level n+2 and emits tokens to level n.
Then, whenever the indentation level has increased — or when a new opening delimiter is seen, say — a new call is made to the coroutine that handles such an event. The new coroutine, when reading characters from the previous level, will fail to see any characters that appear before its own indentation level, so, in some sense, it doesn’t have to worry about them. If the previous-level coroutine finds that the indentation has decreased below the minimum required for the new coroutine, it will yield EOF — an end-of-file status — to it.
In case you are wondering, the precise and only difference between chain-of-coroutines and the usual recursive-descent parsing is that the caller controls what is seen by the callee, rather than completely delegating the processing of the characters.
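To make the two-party protocol concrete, here is a small sketch (mine, not from the post) with Bob and Charlie as Python generators. For simplicity Bob reads lines rather than single characters; he yields indentation tokens and `('CHAR', c)` items, and Charlie, blind to indentation, groups alphanumerics into IDENTIFIER tokens and passes Bob’s tokens straight through:

```python
def bob(text):
    """Tracks indentation; forwards everything else as ('CHAR', c)."""
    stack = [0]                              # stack of indentation levels
    for line in text.splitlines():
        if not line.strip():
            continue                         # skip blank lines
        n = len(line) - len(line.lstrip(' '))
        if n > stack[-1]:
            stack.append(n)
            yield ('INDENT',)
        while n < stack[-1]:
            stack.pop()
            yield ('DEDENT',)
        if n != stack[-1]:
            raise IndentationError('dedent to unknown level')
        for c in line[n:] + '\n':
            yield ('CHAR', c)
    while len(stack) > 1:                    # close blocks still open at EOF
        stack.pop()
        yield ('DEDENT',)

def charlie(stream):
    """Blind to indentation: lexes CHARs into identifiers, passes tokens through."""
    word = ''
    for item in stream:
        if item[0] == 'CHAR' and item[1].isalnum():
            word += item[1]                  # extend the current identifier
        else:
            if word:
                yield ('IDENTIFIER', word)   # flush the pending identifier
                word = ''
            if item[0] != 'CHAR':
                yield item                   # pass indentation tokens through
    if word:
        yield ('IDENTIFIER', word)
```

Running `list(charlie(bob("if x\n    y\nz")))` interleaves the identifier tokens with the INDENT/DEDENT pair, in source order.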
An example
We now show some pseudo-code that uses this idea to parse indented blocks of identifiers (sequences of alphanumeric characters). The algorithm will serve as the foundation for a very powerful method of lexical analysis, which we will describe in our next post. It is a bit harder to grasp what is going on, and the algorithm is potentially slower². But I personally find the end result to be conceptually quite elegant, and I think this will shine through when we later extend it.
    # This coroutine parses a 'block-seq', which is a sequence
    # of blocks of identifiers.
    # Each block in the sequence is of the form:
    #     sequence of identifiers
    #         (indented, possibly empty) block-seq
    coroutine parse-block-seq():
        loop:
            # Each cycle through this loop parses a single block
            skip-empty-lines()
            if first-non-empty-line-is-indented():
                raise «Unexpected Indentation»
            elif EOF(): return
            emit BEGIN
            # Stage I: Parse the first sequence of identifiers
            loop:
                if EOF(): emit END; return
                char := read()
                if char.is-whitespace():
                    pass  # ignore non-beginning-of-line whitespace
                elif char.is-newline():
                    break  # move on to Stage II
                else:
                    assert char.is-alphanumeric()
                    char-seq := char + read-and-concatenate-alphanums()
                    emit IDENTIFIER(char-seq)
            # Stage II: Parse the indented block-seq
            loop:
                N := count-number-of-whitespaces()
                if N = 0:
                    break  # this block is done
                else:
                    emit INDENT
                    # invoke a new instance of this coroutine;
                    # the 'char-generator' block defines the generator
                    # which feeds characters to this new instance,
                    # and the 'capture' block defines what to do when
                    # the new instance emits a token
                    parse-block-seq():
                        char-generator:
                            for-each-line:
                                W := count-leading-whitespace-up-to(N)
                                # If this level received an EOF or the
                                # indentation level decreased below N:
                                if W = EOF or W < N:
                                    yield EOF
                                else:
                                    yield all chars after the Nth
                        capture token:
                            emit token
                    # A dedent must return all the way to this level's column
                    if W ≠ EOF and W > 0: raise «Indentation Error»
                    emit DEDENT
            emit END  # close the current block, then parse the next one
Should we tokenize the following “code”:
    test lexing
      of block-seq
      tokenizer
        yeah
We would get the following token sequence:
    BEGIN
    IDENTIFIER(test)
    IDENTIFIER(lexing)
    INDENT
    BEGIN
    IDENTIFIER(of)
    IDENTIFIER(block-seq)
    END
    BEGIN
    IDENTIFIER(tokenizer)
    INDENT
    BEGIN
    IDENTIFIER(yeah)
    END
    DEDENT
    END
    DEDENT
    END
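For the curious, here is a compact runnable rendering of parse-block-seq in Python (my sketch; it uses plain recursion on the call-stack rather than the coroutine chain, and it assumes no blank lines; note that, as in the example above, identifiers are simply whitespace-separated words, so `block-seq` counts as one):

```python
def parse_block_seq(lines, i=0, col=0):
    """Return (tokens, next_line_index) for the block-seq at column `col`.

    The caller controls what the callee sees: a recursive call is handed
    the column of the nested block-seq, and stops as soon as the
    indentation falls below it.
    """
    tokens = []
    while i < len(lines):
        w = len(lines[i]) - len(lines[i].lstrip(' '))
        if w < col:
            break                    # indentation fell below col: "EOF" for this level
        if w > col:
            raise IndentationError('unexpected indentation')
        tokens.append('BEGIN')       # start of a block
        for word in lines[i].split():
            tokens.append(f'IDENTIFIER({word})')
        i += 1
        # An indented next line starts this block's nested block-seq
        if i < len(lines):
            w2 = len(lines[i]) - len(lines[i].lstrip(' '))
            if w2 > col:
                tokens.append('INDENT')
                sub, i = parse_block_seq(lines, i, w2)
                tokens += sub
                tokens.append('DEDENT')
        tokens.append('END')         # end of the block
    return tokens, i
```

Running it on the four lines of the example “code” reproduces the token sequence above, DEDENTs included.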
Footnotes
1. Recently, I discovered that CoffeeScript’s block strings allow for this. ↩
2. Because the routine at each level must read every character before it can be read by the routine at the next level. However, since we will only be using the call-stack to parse indentation levels, an ad-hoc modification can ensure that such multiple reading is unnecessary. ↩