Lark implements the Earley parsing algorithm and can therefore parse all context-free grammars. Lark also supports a scannerless mode, in which terminals (tokens) are resolved at parse time.
Terminals are represented by uppercase names.
In a grammar, comments can be introduced with a double slash (//).
Specifying the grammar
The grammar can be passed as a string to Lark's constructor:
grammar = """
…
"""
parser = Lark(grammar)
In a grammar, terminals are specified in uppercase letters and rules in lowercase letters.
By default, Lark creates a parser that uses the Earley algorithm to parse the input.
Earley is powerful but also slow. If the grammar is LALR(1)-compatible, the algorithm can be changed to LALR(1):
parser = Lark(grammar, parser='lalr')
Parse digits and characters
from lark import Lark
#
# Initialize parser with «EBNF» of the language
#
parser = Lark("""
start: item+
item: DIGIT -> dig
| CHAR -> chr
CHAR : ("a" .. "z")
DIGIT: ("0" .. "9")
""")
#
# Parse program. Note that the language does not allow
# for white space:
#
parsed = parser.parse('abc123xy9')
for i in parsed.children:
    rule = i.data
    print('Rule {} was found, 1st child is {}'.format(rule, i.children[0]))
Because white space plays such a fundamental role, Lark allows its definition to be imported with %import common.WS and then ignored with %ignore WS:
from lark import Lark
parser = Lark("""
start: item+
item: DIGIT -> dig
| CHAR -> chr
CHAR : ("a" .. "z")
DIGIT: ("0" .. "9")
%import common.WS
%ignore WS
""")
parsed = parser.parse("""
ab c1
23x y 9
""")
for i in parsed.children:
    rule = i.data
    print('Rule {} was found, 1st child is {}'.format(rule, i.children[0]))
The following example imports C- and C++-style comments as well as white space and ignores them.
It then iterates over the words (defined by the regular expression \w+) that are found in the text to be parsed.
import lark
# Raw string, so that the backslash in /\w+/ reaches Lark unchanged
parser = lark.Lark(r"""
start: word*
//
// Note the question mark before word:
// it inlines the rule when it has a single child, so
// the value of each word is available directly as
// word.value when iterating over the parsed text.
// Without the question mark, we would be iterating over
// Tree objects rather than Token objects.
//
?word: /\w+/
%import common.WS
%import common.CPP_COMMENT
%import common.C_COMMENT
%ignore WS
%ignore CPP_COMMENT
%ignore C_COMMENT
""")
parsed = parser.parse("""
one two // this is a comment and should not be extracted
three four /* another comment */ five
six seven
""")
for word in parsed.children:
    print(word.value)
Other terminals that can be imported from common include:
%import common.ESCAPED_STRING
%import common.SIGNED_NUMBER
%import common.CNAME // Identifiers, might start with underscore: foo, bar42, _baz etc.
%import common.NEWLINE
%import common.WS_INLINE
// Comments:
%import common.SH_COMMENT // Shell comment (starting with #)
%import common.CPP_COMMENT // Comment that starts with // and goes through the end of line
%import common.C_COMMENT // Comment like /* this one */
%import common.SQL_COMMENT // select -- this is the comment
In an %import statement, -> renames the imported terminal:
%import common.ESCAPED_STRING -> STRING
%import common.SIGNED_NUMBER -> NUMBER