In this concluding part of the series, I will build an abstract syntax tree, or AST. An AST holds the information present in an arithmetic expression and can be evaluated to yield the result. It is built from grammar rules, just like the translator in part 2, and evaluated using the design pattern known as the visitor pattern. Note that the ast module in Python can be used to build the tree, as I will show below, but it is more instructive to build it from first principles. The ast module is a general facility for constructing ASTs for code, not just for arithmetic expressions.
One may argue that building an AST is overkill for simple arithmetic expressions - the methods given in parts 1 and 2 of this series suffice. But the purpose here is to get a peek at how programming languages parse expressions (and code in general) behind the scenes.
An AST is a tree-like structure that holds the essential information in an expression. The interior nodes are the operators of the expression while the leaves are the operands.
In contrast, a parse tree is constructed from the grammar, with the nonterminals as the inner nodes and the terminals, or tokens of the expression, as the leaves.
The diagram below shows an AST and a parse tree for the expression 5 * (2 + 4).
The diagram shows that the AST comprises only the operators and operands of the expression, without the parentheses, while the parse tree is more elaborate - consisting of the nonterminals of the grammar (expr, term and factor) and the expression itself (reading the leaves from left to right).
The ast module
We can use the Python standard library module ast to build an AST. I will use the same expression as above to illustrate.
>>> import ast
>>> tree = ast.parse('5 * (2 + 4)', mode='eval')
>>> ast.dump(tree)
'Expression(body=BinOp(left=Num(n=5), op=Mult(), right=BinOp(left=Num(n=2), op=Add(), right=Num(n=4))))'
We see that the AST contains, among others, nodes of two classes named BinOp and Num. These stand for the interior and leaf nodes of the AST respectively. I will use classes with the same names in the design of the parser below.
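For longer expressions the single-line dump becomes hard to read. On Python 3.9 and later, ast.dump accepts an indent argument that pretty-prints the tree; note that on recent versions numeric literals also appear as Constant nodes rather than Num, so the output will look roughly like this:
>>> print(ast.dump(tree, indent=4))
Expression(
    body=BinOp(
        left=Constant(value=5),
        op=Mult(),
        right=BinOp(
            left=Constant(value=2),
            op=Add(),
            right=Constant(value=4))))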
The AST node (or tree) above can be evaluated to yield a value. It first needs to be compiled into a code object by the compile builtin function and then passed to the eval function to find the value.
>>> co = compile(tree, filename='<string>', mode='eval')
>>> eval(co)
30
The lexical analyzer or lexer for the translator is identical to what was built in part 2. I am repeating it below for convenience.
import re
class Num:
def __init__(self, val):
self.value = float(val) if '.' in val else int(val)
def __repr__(self):
return f'Num: {self.value}'
def lexan():
'''Lexical analyzer - creates a stream of tokens'''
inp = input('Enter the expression ')
i = 0
while i < len(inp):
c = inp[i]
if c.isdigit() or c == '.':
val = re.match(r'[.0-9]+', inp[i:])[0]
yield Num(val)
i += len(val)
else:
if inp[i:i + 2] == '**':
yield '**'
i += 1
elif c != ' ':
yield c
i += 1
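As a quick check, running the lexer on its own shows the stream of tokens it produces - Num objects for numbers and plain strings for the operators and parentheses (a sketch of a REPL session with the code above):
>>> list(lexan())
Enter the expression 5 * (2 + 4)
[Num: 5, '*', '(', Num: 2, '+', Num: 4, ')']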
The grammar for the parser is the same as in part 2; it is repeated below, written in the form that the parsing functions implement:
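expr   → term (('+' | '-') term)*
term   → factor (('*' | '/') factor)*
factor → base ('**' factor)?
base   → '(' expr ')' | '-' base | number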
However, instead of performing the operations while parsing, we build nodes of the tree.
There are three kinds of nodes - Num, BinOp and UMinus. Num represents a number - a sequence of digits with an optional decimal point. BinOp stands for a binary operation. It has three parts - left (the left operand), op (the operator) and right (the right operand). UMinus stands for a unary minus operation. It has two parts - op (the - sign) and value (the operand being negated). The Python classes for BinOp and UMinus are shown below; Num was shown above:
class BinOp:
def __init__(self, op, left, right):
self.left = left
self.op = op
self.right = right
def __repr__(self):
return f'<BinOp: left={self.left} op={self.op} right={self.right}>'
class UMinus:
def __init__(self, value):
self.op = '-'
self.value = value
def __repr__(self):
return f'<UMinus: value={self.op}{self.value}>'
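To see how these classes fit together, here is the tree for 5 * (2 + 4) built by hand in the REPL (the parser below constructs exactly this kind of structure):
>>> tree = BinOp(op='*', left=Num('5'), right=BinOp(op='+', left=Num('2'), right=Num('4')))
>>> tree
<BinOp: left=Num: 5 op=* right=<BinOp: left=Num: 2 op=+ right=Num: 4>>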
The rest of the code for the parser is similar to what was built in part 2. The difference is in the places where operations were performed - we now build nodes.
def expr():
'''Function to start the parsing. It builds the `+` and `-` nodes as
appropriate.'''
result = term()
while True:
if lookahead == '+':
match('+')
result = BinOp(left=result, op='+', right=term())
elif lookahead == '-':
match('-')
result = BinOp(left=result, op='-', right=term())
else:
break
return result
def term():
'''This function builds the `*` and `/` nodes.'''
result = factor()
while True:
if lookahead == '*':
match('*')
result = BinOp(left=result, op='*', right=factor())
elif lookahead == '/':
match('/')
result = BinOp(left=result, op='/', right=factor())
else:
break
return result
def factor():
'''This recursive function builds the node for the right associative
power operation.'''
result = base()
if lookahead == '**':
match('**')
result = BinOp(left=result, op='**', right=factor())
return result
def base():
    '''This function builds a node for an expression in parentheses, a unary
    minus applied to a base, or simply a number.'''
if lookahead == '(':
match('(')
result = expr()
match(')')
elif isinstance(lookahead, Num):
result = lookahead
match(lookahead)
elif lookahead == '-':
match('-')
result = UMinus(value=base())
else:
match('number')
return result
def match(t):
'''Checks if token is valid and gets the next one.'''
global lookahead
if t == lookahead:
try:
lookahead = next(token_gen)
except StopIteration:
lookahead = ''
else:
raise RuntimeError(f'Malformed input. Token {lookahead}')
To test the parser, we can code a driver function as follows:
def main():
'''This function sets a starting value for the `lookahead` and calls the
parser.'''
global token_gen, lookahead
token_gen = lexan()
lookahead = next(token_gen)
tree = expr()
match('')
print(tree)
Running main with the string 5 * (2 + 4) gives the following output:
>>> main()
Enter the expression 5 * (2 + 4)
<BinOp: left=Num: 5 op=* right=<BinOp: left=Num: 2 op=+ right=Num: 4>>
Note the similarity with what was obtained above by a dump of the AST using Python's ast module.
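The right associativity of the power operator also shows up in the shape of the tree. For example, parsing 2 ** 3 ** 2 should nest the second ** inside the right operand of the first (output sketched from the repr methods above):
>>> main()
Enter the expression 2 ** 3 ** 2
<BinOp: left=Num: 2 op=** right=<BinOp: left=Num: 3 op=** right=Num: 2>>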
Now we discuss the evaluation of the AST using the visitor pattern. This is a design pattern commonly used in parsing and compiling, and there are two ideas inherent in it. First, it separates the manipulation of a complex data structure like an AST from the data structure itself: the classes Num, BinOp and UMinus do not carry any logic to manipulate the data they hold. Second, the visitor class, given below, uses a clever technique to dispatch actions to different methods based on the node type, using recursion in the process. The visitor pattern can be used to evaluate an AST or to take some other action, such as generating instructions.
class NodeVisitor:
def visit(self, node):
method_name = 'visit_' + type(node).__name__
method = getattr(self, method_name, None)
if method:
return method(node)
else:
self.generic_visit(node)
def generic_visit(self, node):
raise RuntimeError(f'No method found for node {node}')
class Visitor(NodeVisitor):
def visit_BinOp(self, node):
if node.op == '+':
return self.visit(node.left) + self.visit(node.right)
elif node.op == '-':
return self.visit(node.left) - self.visit(node.right)
elif node.op == '*':
return self.visit(node.left) * self.visit(node.right)
elif node.op == '/':
return self.visit(node.left) / self.visit(node.right)
elif node.op == '**':
return self.visit(node.left) ** self.visit(node.right)
def visit_UMinus(self, node):
return -self.visit(node.value)
def visit_Num(self, node):
return node.value
The NodeVisitor class dispatches a node of the AST to methods based on the type of the node. It uses the getattr builtin function to dynamically get the method from its name. If no method is found, then the generic_visit method is called to raise an exception. The Visitor class inherits from the NodeVisitor class and provides visit methods for the different node types. Its methods call themselves recursively to traverse the AST.
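The same dispatch machinery can drive actions other than evaluation. As a small illustration of the instruction-generation idea mentioned above, here is a hypothetical PostfixVisitor (not part of the calculator) that walks the AST and emits the expression in postfix notation:
class PostfixVisitor(NodeVisitor):
    '''Walks the AST and returns the expression in postfix (reverse Polish) notation.'''
    def visit_BinOp(self, node):
        # Emit both operands first, then the operator.
        return f'{self.visit(node.left)} {self.visit(node.right)} {node.op}'
    def visit_UMinus(self, node):
        # Use a distinct word so unary minus is not confused with subtraction.
        return f'{self.visit(node.value)} neg'
    def visit_Num(self, node):
        return str(node.value)
Visiting the tree built earlier for 5 * (2 + 4) with this class gives '5 2 4 + *'.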
To test this code, we can modify the driver function as shown below:
def main():
'''This function sets a starting value for the `lookahead` and calls the
parser.'''
global token_gen, lookahead
token_gen = lexan()
lookahead = next(token_gen)
tree = expr()
match('')
v = Visitor()
result = v.visit(tree)
print(result)
Following are some tests carried out in the REPL:
>>> main()
Enter the expression 5 * (2 + 4)
30
>>> main()
Enter the expression -(1 + 2) / 2
-1.5
>>> main()
Enter the expression 10 + ( 3 * 2 ) ** 2 ** 3 - 25 / 5
1679621.0
The last example shows the same result as obtained in parts 1 and 2 of the series.
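Malformed input is rejected by the match function with a RuntimeError. For instance, a stray operator should produce something like the following (a sketch of the expected behaviour):
>>> main()
Enter the expression 5 + * 2
Traceback (most recent call last):
  ...
RuntimeError: Malformed input. Token *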
The complete listing of the translator program, including the lexer, parser, AST evaluator and driver, is given below:
import re
class Num:
def __init__(self, val):
self.value = float(val) if '.' in val else int(val)
def __repr__(self):
return f'Num: {self.value}'
class BinOp:
def __init__(self, op, left, right):
self.left = left
self.op = op
self.right = right
def __repr__(self):
return f'<BinOp: left={self.left} op={self.op} right={self.right}>'
class UMinus:
def __init__(self, value):
self.op = '-'
self.value = value
def __repr__(self):
return f'<UMinus: value={self.op}{self.value}>'
class NodeVisitor:
def visit(self, node):
method_name = 'visit_' + type(node).__name__
method = getattr(self, method_name, None)
if method:
return method(node)
else:
self.generic_visit(node)
def generic_visit(self, node):
raise RuntimeError(f'No method found for node {node}')
class Visitor(NodeVisitor):
def visit_BinOp(self, node):
if node.op == '+':
return self.visit(node.left) + self.visit(node.right)
elif node.op == '-':
return self.visit(node.left) - self.visit(node.right)
elif node.op == '*':
return self.visit(node.left) * self.visit(node.right)
elif node.op == '/':
return self.visit(node.left) / self.visit(node.right)
elif node.op == '**':
return self.visit(node.left) ** self.visit(node.right)
def visit_UMinus(self, node):
return -self.visit(node.value)
def visit_Num(self, node):
return node.value
def lexan():
'''Lexical analyzer - creates a stream of tokens'''
inp = input('Enter the expression ')
i = 0
while i < len(inp):
c = inp[i]
if c.isdigit() or c == '.':
val = re.match(r'[.0-9]+', inp[i:])[0]
yield Num(val)
i += len(val)
else:
if inp[i:i + 2] == '**':
yield '**'
i += 1
elif c != ' ':
yield c
i += 1
def expr():
'''Function to start the parsing. It builds the `+` and `-` nodes as
appropriate.'''
result = term()
while True:
if lookahead == '+':
match('+')
result = BinOp(left=result, op='+', right=term())
elif lookahead == '-':
match('-')
result = BinOp(left=result, op='-', right=term())
else:
break
return result
def term():
'''This function builds the `*` and `/` nodes.'''
result = factor()
while True:
if lookahead == '*':
match('*')
result = BinOp(left=result, op='*', right=factor())
elif lookahead == '/':
match('/')
result = BinOp(left=result, op='/', right=factor())
else:
break
return result
def factor():
'''This recursive function builds the node for the right associative
power operation.'''
result = base()
if lookahead == '**':
match('**')
result = BinOp(left=result, op='**', right=factor())
return result
def base():
    '''This function builds a node for an expression in parentheses, a unary
    minus applied to a base, or simply a number.'''
if lookahead == '(':
match('(')
result = expr()
match(')')
elif isinstance(lookahead, Num):
result = lookahead
match(lookahead)
elif lookahead == '-':
match('-')
result = UMinus(value=base())
else:
match('number')
return result
def match(t):
'''Checks if token is valid and gets the next one.'''
global lookahead
if t == lookahead:
try:
lookahead = next(token_gen)
except StopIteration:
lookahead = ''
else:
raise RuntimeError(f'Malformed input. Token {lookahead}')
def main():
'''This function sets a starting value for the `lookahead` and calls the
parser.'''
global token_gen, lookahead
token_gen = lexan()
lookahead = next(token_gen)
tree = expr()
match('')
v = Visitor()
result = v.visit(tree)
print(result)
if __name__ == '__main__':
main()
This concludes this post and the series. In this post, you learned about ASTs and the visitor pattern for evaluating them. ASTs are used by compilers as a representation of code from which target language instructions are generated. Here I applied the concept to a simple calculator. Any of the techniques presented in this 3-part series can be used to build the calculator.
Compilers: Principles, Techniques, and Tools by Aho, Sethi and Ullman
Let's Build a Simple Interpreter, Part 7 - blog post by Ruslan Spivak
Python Cookbook by David Beazley and Brian Jones