[ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
In some grammars, Bison’s deterministic LR(1) parsing algorithm cannot decide whether to apply a certain grammar rule at a given point. That is, it may not be able to decide (on the basis of the input read so far) which of two possible reductions (applications of a grammar rule) applies, or whether to apply a reduction or read more of the input and apply a reduction later in the input. These are known respectively as reduce/reduce conflicts (see section Reduce/Reduce Conflicts), and shift/reduce conflicts (see section Shift/Reduce Conflicts).
To use a grammar that is not easily modified to be LR(1), a
more general parsing algorithm is sometimes necessary. If you include
%glr-parser
among the Bison declarations in your file
(see section Outline of a Bison Grammar), the result is a Generalized LR
(GLR) parser. These parsers handle Bison grammars that
contain no unresolved conflicts (i.e., after applying precedence
declarations) identically to deterministic parsers. However, when
faced with unresolved shift/reduce and reduce/reduce conflicts,
GLR parsers use the simple expedient of doing both,
effectively cloning the parser to follow both possibilities. Each of
the resulting parsers can again split, so that at any given time, there
can be any number of possible parses being explored. The parsers
proceed in lockstep; that is, all of them consume (shift) a given input
symbol before any of them proceed to the next. Each of the cloned
parsers eventually meets one of two possible fates: either it runs into
a parsing error, in which case it simply vanishes, or it merges with
another parser, because the two of them have reduced the input to an
identical set of symbols.
During the time that there are multiple parsers, semantic actions are recorded, but not performed. When a parser disappears, its recorded semantic actions disappear as well, and are never performed. When a reduction makes two parsers identical, causing them to merge, Bison records both sets of semantic actions. Whenever the last two parsers merge, reverting to the single-parser case, Bison resolves all the outstanding actions either by precedences given to the grammar rules involved, or by performing both actions, and then calling a designated user-defined function on the resulting values to produce an arbitrary merged result.
1.5.1 Using GLR on Unambiguous Grammars | Using GLR parsers on unambiguous grammars. | |
1.5.2 Using GLR to Resolve Ambiguities | Using GLR parsers to resolve ambiguities. | |
1.5.3 GLR Semantic Actions | Considerations for semantic values and deferred actions. | |
1.5.4 Controlling a Parse with Arbitrary Predicates | Controlling a parse with arbitrary computations. | |
1.5.5 Considerations when Compiling GLR Parsers | GLR parsers require a modern C compiler. |
[ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
In the simplest cases, you can use the GLR algorithm to parse grammars that are unambiguous but fail to be LR(1). Such grammars typically require more than one symbol of lookahead.
Consider a problem that arises in the declaration of enumerated and subrange types in the programming language Pascal. Here are some examples:
type subrange = lo .. hi; type enum = (a, b, c); |
The original language standard allows only numeric literals and constant identifiers for the subrange bounds (‘lo’ and ‘hi’), but Extended Pascal (ISO/IEC 10206) and many other Pascal implementations allow arbitrary expressions there. This gives rise to the following situation, containing a superfluous pair of parentheses:
type subrange = (a) .. b; |
Compare this to the following declaration of an enumerated type with only one value:
type enum = (a); |
(These declarations are contrived, but they are syntactically valid, and more-complicated cases can come up in practical programs.)
These two declarations look identical until the ‘..’ token. With normal LR(1) one-token lookahead it is not possible to decide between the two forms when the identifier ‘a’ is parsed. It is, however, desirable for a parser to decide this, since in the latter case ‘a’ must become a new identifier to represent the enumeration value, while in the former case ‘a’ must be evaluated with its current meaning, which may be a constant or even a function call.
You could parse ‘(a)’ as an “unspecified identifier in parentheses”, to be resolved later, but this typically requires substantial contortions in both semantic actions and large parts of the grammar, where the parentheses are nested in the recursive rules for expressions.
You might think of using the lexer to distinguish between the two forms by returning different tokens for currently defined and undefined identifiers. But if these declarations occur in a local scope, and ‘a’ is defined in an outer scope, then both forms are possible—either locally redefining ‘a’, or using the value of ‘a’ from the outer scope. So this approach cannot work.
A simple solution to this problem is to declare the parser to use the GLR algorithm. When the GLR parser reaches the critical state, it merely splits into two branches and pursues both syntax rules simultaneously. Sooner or later, one of them runs into a parsing error. If there is a ‘..’ token before the next ‘;’, the rule for enumerated types fails since it cannot accept ‘..’ anywhere; otherwise, the subrange type rule fails since it requires a ‘..’ token. So one of the branches fails silently, and the other one continues normally, performing all the intermediate actions that were postponed during the split.
If the input is syntactically incorrect, both branches fail and the parser reports a syntax error as usual.
The effect of all this is that the parser seems to “guess” the correct branch to take, or in other words, it seems to use more lookahead than the underlying LR(1) algorithm actually allows for. In this example, LR(2) would suffice, but also some cases that are not LR(k) for any k can be handled this way.
In general, a GLR parser can take quadratic or cubic worst-case time, and the current Bison parser even takes exponential time and space for some grammars. In practice, this rarely happens, and for many grammars it is possible to prove that it cannot happen. The present example contains only one conflict between two rules, and the type-declaration context containing the conflict cannot be nested. So the number of branches that can exist at any time is limited by the constant 2, and the parsing time is still linear.
Here is a Bison grammar corresponding to the example above. It parses a vastly simplified form of Pascal type declarations.
%token TYPE DOTDOT ID %left '+' '-' %left '*' '/' %% type_decl: TYPE ID '=' type ';' ; type: '(' id_list ')' | expr DOTDOT expr ; id_list: ID | id_list ',' ID ; expr: '(' expr ')' | expr '+' expr | expr '-' expr | expr '*' expr | expr '/' expr | ID ; |
When used as a normal LR(1) grammar, Bison correctly complains about one reduce/reduce conflict. In the conflicting situation the parser chooses one of the alternatives, arbitrarily the one declared first. Therefore the following correct input is not recognized:
type t = (a) .. b; |
The parser can be turned into a GLR parser, while also telling Bison to be silent about the one known reduce/reduce conflict, by adding these two declarations to the Bison grammar file (before the first ‘%%’):
%glr-parser %expect-rr 1 |
No change in the grammar itself is required. Now the parser recognizes all valid declarations, according to the limited syntax above, transparently. In fact, the user does not even notice when the parser splits.
So here we have a case where we can use the benefits of GLR, almost without disadvantages. Even in simple cases like this, however, there are at least two potential problems to beware. First, always analyze the conflicts reported by Bison to make sure that GLR splitting is only done where it is intended. A GLR parser splitting inadvertently may cause problems less obvious than an LR parser statically choosing the wrong alternative in a conflict. Second, consider interactions with the lexer (see section Semantic Info in Token Types) with great care. Since a split parser consumes tokens without performing any actions during the split, the lexer cannot obtain information via parser actions. Some cases of lexer interactions can be eliminated by using GLR to shift the complications from the lexer to the parser. You must check the remaining cases for correctness.
In our example, it would be safe for the lexer to return tokens based on their current meanings in some symbol table, because no new symbols are defined in the middle of a type declaration. Though it is possible for a parser to define the enumeration constants as they are parsed, before the type declaration is completed, it actually makes no difference since they cannot be used within the same enumerated type declaration.
[ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
Let’s consider an example, vastly simplified from a C++ grammar.
%{ #include <stdio.h> #define YYSTYPE char const * int yylex (void); void yyerror (char const *); %} %token TYPENAME ID %right '=' %left '+' %glr-parser %% prog: %empty | prog stmt { printf ("\n"); } ; stmt: expr ';' %dprec 1 | decl %dprec 2 ; expr: ID { printf ("%s ", $$); } | TYPENAME '(' expr ')' { printf ("%s <cast> ", $1); } | expr '+' expr { printf ("+ "); } | expr '=' expr { printf ("= "); } ; decl: TYPENAME declarator ';' { printf ("%s <declare> ", $1); } | TYPENAME declarator '=' expr ';' { printf ("%s <init-declare> ", $1); } ; declarator: ID { printf ("\"%s\" ", $1); } | '(' declarator ')' ; |
This models a problematic part of the C++ grammar—the ambiguity between certain declarations and statements. For example,
T (x) = y+z; |
parses as either an expr
or a stmt
(assuming that ‘T’ is recognized as a TYPENAME
and
‘x’ as an ID
).
Bison detects this as a reduce/reduce conflict between the rules
expr : ID
and declarator : ID
, which it cannot resolve at the
time it encounters x
in the example above. Since this is a
GLR parser, it therefore splits the problem into two parses, one for
each choice of resolving the reduce/reduce conflict.
Unlike the example from the previous section (see section Using GLR on Unambiguous Grammars),
however, neither of these parses “dies,” because the grammar as it stands is
ambiguous. One of the parsers eventually reduces stmt : expr ';'
and
the other reduces stmt : decl
, after which both parsers are in an
identical state: they’ve seen ‘prog stmt’ and have the same unprocessed
input remaining. We say that these parses have merged.
At this point, the GLR parser requires a specification in the
grammar of how to choose between the competing parses.
In the example above, the two %dprec
declarations specify that Bison is to give precedence
to the parse that interprets the example as a
decl
, which implies that x
is a declarator.
The parser therefore prints
"x" y z + T <init-declare> |
The %dprec
declarations only come into play when more than one
parse survives. Consider a different input string for this parser:
T (x) + y; |
This is another example of using GLR to parse an unambiguous
construct, as shown in the previous section (see section Using GLR on Unambiguous Grammars).
Here, there is no ambiguity (this cannot be parsed as a declaration).
However, at the time the Bison parser encounters x
, it does not
have enough information to resolve the reduce/reduce conflict (again,
between x
as an expr
or a declarator
). In this
case, no precedence declaration is used. Again, the parser splits
into two, one assuming that x
is an expr
, and the other
assuming x
is a declarator
. The second of these parsers
then vanishes when it sees +
, and the parser prints
x T <cast> y + |
Suppose that instead of resolving the ambiguity, you wanted to see all
the possibilities. For this purpose, you must merge the semantic
actions of the two possible parsers, rather than choosing one over the
other. To do so, you could change the declaration of stmt
as
follows:
stmt: expr ';' %merge <stmtMerge> | decl %merge <stmtMerge> ; |
and define the stmtMerge
function as:
static YYSTYPE stmtMerge (YYSTYPE x0, YYSTYPE x1) { printf ("<OR> "); return ""; } |
with an accompanying forward declaration in the C declarations at the beginning of the file:
%{ #define YYSTYPE char const * static YYSTYPE stmtMerge (YYSTYPE x0, YYSTYPE x1); %} |
With these declarations, the resulting parser parses the first example
as both an expr
and a decl
, and prints
"x" y z + T <init-declare> x T <cast> y z + = <OR> |
Bison requires that all of the productions that participate in any particular merge have identical ‘%merge’ clauses. Otherwise, the ambiguity would be unresolvable, and the parser will report an error during any parse that results in the offending merge.
[ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
The nature of GLR parsing and the structure of the generated parsers give rise to certain restrictions on semantic values and actions.
[ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
By definition, a deferred semantic action is not performed at the same time as the associated reduction. This raises caveats for several Bison features you might use in a semantic action in a GLR parser.
In any semantic action, you can examine yychar
to determine the type of
the lookahead token present at the time of the associated reduction.
After checking that yychar
is not set to YYEMPTY
or YYEOF
,
you can then examine yylval
and yylloc
to determine the
lookahead token’s semantic value and location, if any.
In a nondeferred semantic action, you can also modify any of these variables to
influence syntax analysis.
See section Lookahead Tokens.
In a deferred semantic action, it’s too late to influence syntax analysis.
In this case, yychar
, yylval
, and yylloc
are set to
shallow copies of the values they had at the time of the associated reduction.
For this reason alone, modifying them is dangerous.
Moreover, the result of modifying them is undefined and subject to change with
future versions of Bison.
For example, if a semantic action might be deferred, you should never write it
to invoke yyclearin
(see section Special Features for Use in Actions) or to attempt to free
memory referenced by yylval
.
[ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
Another Bison feature requiring special consideration is YYERROR
(see section Special Features for Use in Actions), which you can invoke in a semantic action to
initiate error recovery.
During deterministic GLR operation, the effect of YYERROR
is
the same as its effect in a deterministic parser.
The effect in a deferred action is similar, but the precise point of the
error is undefined; instead, the parser reverts to deterministic operation,
selecting an unspecified stack on which to continue with a syntax error.
In a semantic predicate (see Controlling a Parse with Arbitrary Predicates) during nondeterministic
parsing, YYERROR
silently prunes
the parse that invoked the test.
[ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
GLR parsers require that you use POD (Plain Old Data) types for semantic values and location types when using the generated parsers as C++ code.
[ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
In addition to the %dprec
and %merge
directives,
GLR parsers
allow you to reject parses on the basis of arbitrary computations executed
in user code, without having Bison treat this rejection as an error
if there are alternative parses. (This feature is experimental and may
evolve. We welcome user feedback.) For example,
widget: %?{ new_syntax } "widget" id new_args { $$ = f($3, $4); } | %?{ !new_syntax } "widget" id old_args { $$ = f($3, $4); } ; |
is one way to allow the same parser to handle two different syntaxes for
widgets. The clause preceded by %?
is treated like an ordinary
action, except that its text is treated as an expression and is always
evaluated immediately (even when in nondeterministic mode). If the
expression yields 0 (false), the clause is treated as a syntax error,
which, in a nondeterministic parser, causes the stack in which it is reduced
to die. In a deterministic parser, it acts like YYERROR.
As the example shows, predicates otherwise look like semantic actions, and therefore you must be take them into account when determining the numbers to use for denoting the semantic values of right-hand side symbols. Predicate actions, however, have no defined value, and may not be given labels.
There is a subtle difference between semantic predicates and ordinary actions in nondeterministic mode, since the latter are deferred. For example, we could try to rewrite the previous example as
widget: { if (!new_syntax) YYERROR; } "widget" id new_args { $$ = f($3, $4); } | { if (new_syntax) YYERROR; } "widget" id old_args { $$ = f($3, $4); } ; |
(reversing the sense of the predicate tests to cause an error when they are
false). However, this
does not have the same effect if new_args
and old_args
have overlapping syntax.
Since the mid-rule actions testing new_syntax
are deferred,
a GLR parser first encounters the unresolved ambiguous reduction
for cases where new_args
and old_args
recognize the same string
before performing the tests of new_syntax
. It therefore
reports an error.
Finally, be careful in writing predicates: deferred actions have not been evaluated, so that using them in a predicate will have undefined effects.
[ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
The GLR parsers require a compiler for ISO C89 or
later. In addition, they use the inline
keyword, which is not
C89, but is C99 and is a common extension in pre-C99 compilers. It is
up to the user of these parsers to handle
portability issues. For instance, if using Autoconf and the Autoconf
macro AC_C_INLINE
, a mere
%{ #include <config.h> %} |
will suffice. Otherwise, we suggest
%{ #if (__STDC_VERSION__ < 199901 && ! defined __GNUC__ \ && ! defined inline) # define inline #endif %} |
[ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] |
This document was generated by Rick Perry on December 29, 2013 using texi2html 1.82.