Python library: Lark

Specifying the grammar

The grammar is passed as a string to Lark's constructor:

grammar = """
  …
"""

parser = Lark(grammar)

In a grammar, TERMINALS are specified in uppercase letters, rules in lowercase letters.

Start symbol

The default start symbol in a grammar is start:

parser = Lark("""
start: …
""")

It's possible to define another start symbol:

parser = Lark("""
non_default_start: …
""",
start = 'non_default_start')

Earley vs. LALR(1)

By default, Lark creates a parser that uses the Earley algorithm to parse the input.

Earley is powerful (it can handle any context-free grammar), but also slow. If the grammar is LALR(1)-compatible, the algorithm can be changed to LALR(1):

parser = Lark(grammar, parser='lalr')

Parse digits and characters

from lark import Lark

#
#   Initialize parser with «EBNF» of the language
#
parser = Lark("""
start: item+

item: DIGIT  -> dig
    | CHAR   -> chr

CHAR : ("a" .. "z")
DIGIT: ("0" .. "9")
""")

#
#   Parse program. Note that the language does not allow
#   for white space:
#
parsed = parser.parse('abc123xy9')

for i in parsed.children:
    rule     = i.data
    print('Rule {} was found, 1st child is {}'.format(rule, i.children[0]))

Github repository about-Python, path: /libraries/Lark/digit-char.py

Ignoring white space

Many programming languages ignore white space. So, we define the terminal WS and the rule whitespace that consists of zero or more WS:

from lark import Lark

parser = Lark("""
start: whitespace item+

item: DIGIT whitespace -> dig
    | CHAR  whitespace -> chr

whitespace: WS*

CHAR : ("a" .. "z")
DIGIT: ("0" .. "9")

WS: (" " | "\\n" | "\\t")
""")

parsed = parser.parse("""
  ab c1
  23x y 9
""")

for i in parsed.children:
    rule     = i.data
    if rule in ['dig', 'chr']:
        print('Rule {} was found, 1st child is {}'.format(rule, i.children[0]))

Github repository about-Python, path: /libraries/Lark/digit-char-ignore-WS.py

Ignoring white space (%ignore)

Because most programming languages (Python being a notable exception) ignore white space, Lark can skip it with the special directive %ignore:

from lark import Lark

parser = Lark("""
start: item+

item: DIGIT -> dig
    | CHAR  -> chr

CHAR : ("a" .. "z")
DIGIT: ("0" .. "9")

WS: (" " | "\\n" | "\\t")
%ignore WS
""")

parsed = parser.parse("""
  ab c1
  23x y 9
""")

for i in parsed.children:
    rule     = i.data
    print('Rule {} was found, 1st child is {}'.format(rule, i.children[0]))

Github repository about-Python, path: /libraries/Lark/digit-char-WS.py

Importing WS (%import common.WS)

Because white space plays such a fundamental role, Lark provides a predefined WS terminal that can be imported with %import common.WS and then ignored with %ignore WS:

from lark import Lark

parser = Lark("""
start: item+

item: DIGIT -> dig
    | CHAR  -> chr

CHAR : ("a" .. "z")
DIGIT: ("0" .. "9")

%import common.WS
%ignore WS
""")

parsed = parser.parse("""
  ab c1
  23x y 9
""")

for i in parsed.children:
    rule     = i.data
    print('Rule {} was found, 1st child is {}'.format(rule, i.children[0]))

Github repository about-Python, path: /libraries/Lark/digit-char-import-common.WS.py

Ignore comments and whitespace

The following example imports the C-style and C++-style comments as well as white space and ignores them.

It then iterates over the words (defined by the regular expression \w+) that are found in the text to be parsed.

import lark

# Raw string so that Python leaves the backslash in \w untouched
parser = lark.Lark(r"""
start: word*

//
// Note the question mark before word:
//   It inlines the rule when it has exactly one child, so
//   iterating over the parsed text yields Token objects
//   whose text is available as word.value.
//   Without the question mark, we'd be iterating over
//   Tree objects rather than Token objects.
//
?word: /\w+/

%import common.WS
%import common.CPP_COMMENT
%import common.C_COMMENT

%ignore WS
%ignore CPP_COMMENT
%ignore C_COMMENT
""")

parsed = parser.parse("""
one two // this is a comment and should not be extracted
three four /* another comment */ five
  six seven
""")

for word in parsed.children:
    print(word.value)

Github repository about-Python, path: /libraries/Lark/ignore/extract-words-remove-comments.py

When this example is run, it prints:

one
two
three
four
five
six
seven

Note that Lark also provides the importable terminal WORD (%import common.WORD), which matches one or more lowercase or uppercase letters of the alphabet without any diacritical marks.

TODO

%import

Predefined importable terminals:

%import common.ESCAPED_STRING
%import common.SIGNED_NUMBER
%import common.CNAME            // Identifiers, might start with underscore: foo, bar42, _baz etc.
%import common.NEWLINE
%import common.WS_INLINE

// Comments:
%import common.SH_COMMENT       // Shell comment (starting with #)
%import common.CPP_COMMENT      // Comment that starts with // and goes through the end of line
%import common.C_COMMENT        // Comment like /* this one */
%import common.SQL_COMMENT      // select -- this is the comment

In an %import statement, -> renames the imported terminal:

%import common.ESCAPED_STRING -> STRING
%import common.SIGNED_NUMBER  -> NUMBER