Recovered draft. This post was ported from the older Jekyll site and lightly refactored for this site. The original source trailed off into outline notes, so this version keeps the finished sections and omits the dangling stubs.
Introduction
For this post, I will introduce Grammatical Framework (GF) through an extended example targeted at a general audience familiar with logic, functional programming, or mathematics.
GF is a powerful tool for specifying grammars. A grammar specification in GF is actually an abstract syntax. With an abstract syntax specified, one can then define various linearization rules which compositionally evaluate to strings. An abstract syntax tree may then be linearized to various strings admitted by different concrete syntaxes. Conversely, given a string admitted by the language being defined, GF’s parser can generate all the abstract syntax trees which linearize to that string.
Focus of this tutorial
We introduce this subject assuming no familiarity with GF, but a general mathematical and programming maturity. While GF is certainly geared toward domain-specific machine translation, learning about it is also a useful exercise in abstractly understanding syntax and the role it plays across many different domains.
Specific things covered here include:
- Historical developments leading to GF.
- The difference between abstract and concrete syntax.
- The basic GF judgments:
- Abstract:
cat,fun - Concrete:
lincat,lin - Auxiliary:
oper,param
- Abstract:
- The Resource Grammar Library (RGL).
- A way of interfacing GF with Haskell and transforming ASTs externally.
- The Portable Grammar Format (PGF).
- GF’s module system and directory structure.
- A brief comparison with tools like BNFC.
These topics are understood through a simple example language, Arith, which
serves as a case study for understanding the theoretical and practical
tradeoffs involved.
A grammar for basic arithmetic
Judgments
A central contribution of Per Martin-Lof in the development of type theory was the recognition of the centrality of judgments in logic. Many mathematicians are primarily concerned with the truth of a mathematical proposition or theorem, but there are many other judgments one can make, for example:
- is a proposition
- is possible
- is probable
These judgments are understood not in the object language in which we state our propositions, possibilities, or probabilities, but as assertions in the metalanguage which require evidence for us to know and believe them.
For the type theorist, interested in designing and building programming languages over various logics, these judgments become a prime focus. The role of the type checker in a programming language is to present evidence for, or decide the validity of, such judgments. The four main judgments of type theory are:
- is a type
- and are equal types
- is a term of type
- and are equal terms of type
We can present these with Frege’s turnstile, :
These judgments become more interesting when interpreted in context. Given a series of judgments , denoted , where each may depend on the previous ones, we can make a judgment under those hypotheses:
For instance:
One reason hypothetical judgments are so interesting is that they allow us to translate from the metalanguage to the object language using lambda expressions. This appears in the following introduction rule:
So a typical judgment from analysis might be written:
Abstract Judgments
The core syntax of GF is quite simple. The abstract syntax specification, in a
file like Arith.gf, is written as:
abstract Arith = {
...
}
Note. GF files use the
.gfextension. Abstract syntax, concrete syntax, and module structure all live inside these files.
The two primary abstract judgments are:
cat, denoting a syntactic categoryfun, denoting an n-ary function over categories
For a tiny arithmetic language, we might write:
cat Exp ;
fun
Add : Exp -> Exp -> Exp ;
Mul : Exp -> Exp -> Exp ;
EInt : Int -> Exp ;
This gives us two binary operators, Add and Mul, and a coercion EInt
from the built-in Int category into expressions. The expression ( 3 + 4 ) * 5 can then be represented by the abstract syntax tree:
Mul
(Add (EInt 3) (EInt 4))
(EInt 5)
Or, expanded in a more tree-like form:
Mul
Add
EInt 3
EInt 4
EInt 5
Builtin categories. GF natively supports a small but important collection of categories, including
Int,Float, andString.
In Haskell terms, this resembles an algebraic data type:
data Exp = Add Exp Exp | Mul Exp Exp | EInt Int
The analogy is useful, but should not be pushed too far. In linguistics, GF
categories like CN, NP, or Det do not behave like stable mathematical
objects in quite the same way.
Arith.gf
Putting the pieces together, our minimal abstract syntax file becomes:
abstract Arith = {
flags startcat = Exp ;
cat
Exp ;
fun
Add : Exp -> Exp -> Exp ; -- "+"
Mul : Exp -> Exp -> Exp ; -- "*"
EInt : Int -> Exp ; -- "33"
}
The line flags startcat = Exp ; is metadata for the compiler. It tells GF
which category to use at the root when generating random trees. If you forget
to supply a start category and then run gr in the GF shell, you can get the
somewhat cryptic error:
Category S is not in scope
CallStack (from HasCallStack):
error, called at src/compiler/GF/Command/Commands.hs:881:38 in gf-3.10.4-...
Concrete Judgments
We now append our abstract syntax file Arith.gf with a first concrete syntax.
Suppose we want an English linearization for an expression like
Mul (Add (EInt 3) (EInt 4)) (EInt 5), rendered as:
the product of the sum of 3 and 4 and 5
The concrete syntax header looks like:
concrete ArithEng1 of Arith = {
...
}
The dual concrete syntax judgments are:
catis matched bylincatfunis matched bylin
Here is a first English linearization:
lincat
Exp = Str ;
lin
Add e1 e2 = "the sum of" ++ e1 ++ "and" ++ e2 ;
Mul e1 e2 = "the product of" ++ e1 ++ "and" ++ e2 ;
EInt i = i.s ;
The lincat judgment says that expressions evaluate to strings. More elaborate
linearization types are possible, but Str is enough for this small example.
Given a fun judgment of the form
the corresponding concrete syntax provides a lin rule, compositionally
assembling strings. In this case, the linearization of our example expression
can be sketched as:
linearize (Mul (Add (EInt 3) (EInt 4)) (EInt 5))
->* "the product of" ++ linearize (Add (EInt 3) (EInt 4)) ++ "and" ++ linearize (EInt 5)
->* "the product of" ++ ("the sum of" ++ "3" ++ "and" ++ "4") ++ "and" ++ "5"
->* "the product of the sum of 3 and 4 and 5"
GF’s constraints can feel restrictive, but those same constraints are part of why parsing remains tractable.
The GF Shell
Once GF is installed, one can open both the abstract and concrete syntax by launching the shell on the concrete file:
$ gf ArithEng1.gf
One of the simplest commands is gr, which generates random trees:
Arith> gr
Add (EInt 999) (Mul (Add (EInt 999) (EInt 999)) (EInt 999))
0 msec
Arith> help gr
gr, generate_random
generate random trees in the current abstract syntax
This is useful for smoke testing. The command l linearizes trees, and the two
can be piped together:
Arith> gr -tr | l
Add (Mul (Mul (EInt 999) (EInt 999)) (Add (EInt 999) (EInt 999))) (Add (Add (EInt 999) (EInt 999)) (Add (EInt 999) (EInt 999)))
the sum of the product of the product of 999 and 999 and the sum of 999 and 999 and the sum of the sum of 999 and 999 and the sum of 999 and 999
And we can run a quick sanity check around parsing and linearization:
Arith> gr -tr | l -tr | p -tr | l
Add (EInt 999) (Mul (Add (EInt 999) (EInt 999)) (Add (EInt 999) (EInt 999)))
the sum of 999 and the product of the sum of 999 and 999 and the sum of 999 and 999
Add (EInt 999) (Mul (Add (EInt 999) (EInt 999)) (Add (EInt 999) (EInt 999)))
the sum of 999 and the product of the sum of 999 and 999 and the sum of 999 and 999
For an unambiguous grammar, this behaves exactly as one would hope.
Exercises
Exercise 1.1. Extend Arith with variables x and y so that
product of x and y parses uniquely.
Exercise 1.2. Extend the previous exercise with unary predicates so that
3 is prime and x is odd parse, and then add a binary predicate so that
3 equals 3 parses.
Exercise 2. Write a concrete syntax in your favorite language,
ArithFaveLang.gf.
Exercise 3. Write a second English concrete syntax, ArithEng2.gf, that
mimics how children learn arithmetic, using “plus” and “times”. Observe the
ambiguous parses in the GF shell, then replace those words with + and *
and remedy the ambiguity with parentheses.
Solutions
Exercise 1.1
Add a new category Var, two lexical variable names VarX and VarY, and a
coercion from variables to expressions:
-- Add the following between `{}` in `Arith.gf`
cat Var ;
fun
VExp : Var -> Exp ;
VarX : Var ;
VarY : Var ;
-- Add the following between `{}` in `ArithEng1.gf`
lincat Var = Str ;
lin
VExp v = v ;
VarX = "x" ;
VarY = "y" ;
Exercise 1.2
Now add a proposition category and a few predicates:
-- Add the following between `{}` in `Arith.gf`
flags startcat = Prop ;
cat Prop ;
fun
Odd : Exp -> Prop ;
Prime : Exp -> Prop ;
Equal : Exp -> Exp -> Prop ;
-- Add the following between `{}` in `ArithEng1.gf`
lincat Prop = Str ;
lin
Odd e = e ++ "is odd" ;
Prime e = e ++ "is prime" ;
Equal e1 e2 = e1 ++ "equals" ++ e2 ;
Exercise 3
If we switch to the more colloquial operators:
lin
Add e1 e2 = e1 ++ "plus" ++ e2 ;
Mul e1 e2 = e1 ++ "times" ++ e2 ;
then the parser exposes the familiar ambiguity:
Arith> gr -tr | l -tr | p -tr | l
Add (Mul (VExp VarX) (Mul (EInt 999) (VExp VarX))) (EInt 999)
x times 999 times x plus 999
Add (Mul (VExp VarX) (Mul (EInt 999) (VExp VarX))) (EInt 999)
Add (Mul (Mul (VExp VarX) (EInt 999)) (VExp VarX)) (EInt 999)
Mul (VExp VarX) (Add (Mul (EInt 999) (VExp VarX)) (EInt 999))
Mul (VExp VarX) (Mul (EInt 999) (Add (VExp VarX) (EInt 999)))
Mul (Mul (VExp VarX) (EInt 999)) (Add (VExp VarX) (EInt 999))
x times 999 times x plus 999
x times 999 times x plus 999
x times 999 times x plus 999
x times 999 times x plus 999
x times 999 times x plus 999
This is precisely the sort of thing that forces one to confront fixity, precedence, and what a “natural” linearization really means.