Readtable-Macro Transducer-Chain Parsing¶
This repository contains an implementation of a framework for using the readtable-macro transducer-chain (RMTS) parsing method.
This is described in the following three blog posts:
- Lexing I: the off-side rule
- Lexing II: readtable–macro parsing
- Parsing via chains of tree-transducers (yet to be written)
Note
You may jump to the full Table of contents below. Following is an overview of the source code.
Structure of the source code¶
Common¶
Several utilities used throughout the project.
- Context - A python class to simulate variables with dynamic scope (Lisp’s special variables).
- Errors - A collection of exceptions.
- Record - A python dict accessible via getattr/setattr (so
one can write
record.item
instead ofrecord["item"]
). - StringStuff - Some functions to manipulate strings, and information about characters with non-single-space width.
- SysArgsParser - A utility to parse command line arguments.
- Util - General utility functions.
- Unused by this project, needs to be refactored/moved:
- Hierarchy
- Instance
- Memoize
Streams¶
Self-contained implementation of an object from which characters can be read, while keeping track of offset, and line and column numbers. This probably should be replaced with Python’s own TextIOBase class. But it was much simpler for me to just have a class with exactly what I needed, instead of worrying about implementing whatever Python needs. Will refactor this at some point.
- CharacterStream - Is the base class.
- FileStream - Implementation to read from a file.
- StringStream - Implementation to read from a String.
- StreamPosition, StreamRange - For describing a position in a stream, or a range of positions.
- IndentedCharacterStream - This one allows for
push
andpop
operations. This is how we parse indentation in a recursive descent parser.
Syntax¶
Syntax types.
- Code - The base class for all syntax types.
- Identifier - An identifier.
- Literal - A Literal.
- Node and Element - A Node is a doubly-linked list, where each Element in the list holds
(a reference to) an instance of Code. Currently we define 6 types of nodes:
- Form - A Form denotes a Node where the first element is singled out, and called the head of the form.
It is usually understood that the first element is applied to the remaining elements, called arguments of the form.
One may denote a Form by
head( arg1, arg2, ...)
or⦅head arg1 arg2 ...⦆
. - Tuple - A Tuple is a Node where no element is singled out. It usually represents a sequence of elements, taken
together.
One may denote a Tuple by
(elm1, elm2, ...)
. - PreForm and PreTuple - A PreForm or a PreTuple denotes a Form or a Tuple, respectively, if it holds
two or more elements appear, or, if there is a single element, denotes that very same single element.
One may denote a PreForm by
ₐ⟅elm1 ...⟆ₐ
, and a PreTuple by⟅elm1 ...⟆
. - Punctuator - A Node whose single element should be parsed for punctuation.
- Form - A Form denotes a Node where the first element is singled out, and called the head of the form.
It is usually understood that the first element is applied to the remaining elements, called arguments of the form.
One may denote a Form by
- NodeIterator - An iterator for iterating over the elements of a Node.
- Token - An Element representing a token. The outcome of tokenization should be a single Node instance whose elements are all Tokens.
- LispPrinter - A collection of functions for printing Code using S-Expressions (Lisp prefix notation).
Tokenization¶
Readtable-macro parsing.
- Readtable - Readtables map sequence of characters to dictionaries of properties.
- TokenizationContext
- Tokenizer - A tokenizer generates tokens (Token) from a CharacterStream.
- IndentationReadtableTokenizer - A tokenizer that uses a readtable to read indented blocks of code.
- DelimiterTokenizer - Tokenizer for parsing blocks of code surrounded by delimiters.
- StringTokenizer - Tokenizer for strings with interpolated code.
- CommentTokenizer - Tokenizer for comments with interpolated code.
- RawCommentTokenizer - Comments without interpolated code.
- DelimitedIdentifierTokenizer - Tokenizer for identifiers surrounded by delimiters.
Transducers¶
Tree transducers.
Arrangement - A transformation of a single node.
ArrangementRule - A rule for when and how to transform a child of a node.
ReadDirection - Left-to-right or right-to-left.
- TreeTransducer - A transformation of a tree (represented by nested nodes).
- The arrangement rules currently implemented are:
- ApplyToRest - To process a form like
return
, which, if appearing in isolation, actually meansreturn()
. - Block - Indented blocks.
- Comments - Comments.
- Constituents - Constituents (identifiers, numerical literals, etc)
- DefaultPunctuation - Parses ,-and-;-separated lists into nested tuples.
- Delimiters
- IfElse
- LeftRightBinaryOperator
- LeftRightBinaryOperatorReversedArgs
- LeftRightBinaryOperatorTwoSymbols
- LeftRightBinaryTokenCapturingOperator
- LeftRightNaryOperator
- LeftRightUnaryPostfixNospaceOperator
- LeftRightUnaryPostfixNospaceTokenCapturingOperator
- LeftRightUnaryPrefixNospaceOperator
- LeftRightUnaryPrefixNospaceTokenCapturingOperator
- RightLeftBinaryOperator
- RightLeftUnaryPrefixNospaceOperator
- RightLeftUnaryPrefixOperator
- Strings
- ApplyToRest - To process a form like
Table of contents¶
- Common
- Streams
- Syntax
- Tokenization
- Transducers
- Arrangement
- ArrangementRule
- ReadDirection
- TreeTransducer
- TopDownTreeTransducer
- BottomUpTreeTransducer
- ConvertPreForms
- ApplyToRest
- Block
- Comments
- Constituents
- DefaultPunctuation
- Delimiters
- IfElse
- LeftRightBinaryOperator
- LeftRightBinaryOperatorReversedArgs
- LeftRightBinaryOperatorTwoSymbols
- LeftRightBinaryTokenCapturingOperator
- LeftRightNaryOperator
- LeftRightUnaryPostfixNospaceOperator
- LeftRightUnaryPostfixNospaceTokenCapturingOperator
- LeftRightUnaryPrefixNospaceOperator
- LeftRightUnaryPrefixNospaceTokenCapturingOperator
- RightLeftBinaryOperator
- RightLeftUnaryPrefixNospaceOperator
- RightLeftUnaryPrefixOperator
- Strings
- Parsers