A grammar is context-free (CFG) if the rules of the grammar do not depend on the context in which the tokens appear. Context-free grammars consist of:
- A set of terminal symbols .
- i.e., symbols that appear in the input stream (tokens).
- By convention, these are in lower-case letters, operators, digits, brackets.
- A set of non-terminal symbols .
- i.e., a set of “intermediate” symbols that help us express the grammar. We substitute these out for tokens that eventually become terminals.
- By convention, these start with upper-case letters.
- A start symbol . It’s denoted .
- i.e., a special non-terminal symbol that refers to the whole language.
- A set of productions .
- These express what the language is. We can re-write non-terminals in terms of non-terminals and terminals.
- We also define $$$ as the end-of-input token.
We also define a derivation of a sentence as a possible conversion from a sentence to a production rule (also called a sentential form) used to describe it.
The set of CFGs and the grammars they accept are provably a decidable language. In addition, any context-free language is a member of the complexity class .
Productions
Productions can be described as followed. The start production is often the first one.
There could be a recursive definition in our grammar. The below is a sentential form of .
The arrow means one can re-write or express the LHS of the arrow as the RHS of the arrow. Only non-terminals appear on the LHS (and only one for CFGs). Note that we can use the vertical pipe to express alternatives.
or alternatively:
These lines and this convention allows us to specify a grammar only with its productions.
Derivation
Given a more complicated grammar, we have two key questions: which production do we use? Which non-terminal in a production do we re-write?
The leftmost (rightmost) derivation is where we consistently re-write the leftmost (rightmost) non-terminal in a sentential form. We try to be consistent with being leftmost or rightmost. Ideally our grammar must be defined in a way that it’s possible to use a single consistent approach. We also define some important sets. Given a sentential form of grammar symbols (terminals and non-terminals), the first set is the set of terminals that can begin any sentential from derived from .
Given a non-terminal symbol , the follow set is the set of terminals that can appear immediately after in some derivation. We can compute the follow set by repeating these steps until nothing can be added to any set:
- Add \$$ to \mathcal{Follow}(S)S$$ is the end-of-input symbol.
- For a production , add all elements of , except for , add to .
- If productions occur where follows in a production rule, and if , then we add the elements of into .