Bison 3.0.2: 1.5 Writing GLR Parsers

1.5 Writing GLR Parsers

In some grammars, Bison’s deterministic LR(1) parsing algorithm cannot decide whether to apply a certain grammar rule at a given point. That is, it may not be able to decide (on the basis of the input read so far) which of two possible reductions (applications of a grammar rule) applies, or whether to apply a reduction or read more of the input and apply a reduction later in the input. These are known respectively as reduce/reduce conflicts (see section Reduce/Reduce Conflicts), and shift/reduce conflicts (see section Shift/Reduce Conflicts).

To use a grammar that is not easily modified to be LR(1), a more general parsing algorithm is sometimes necessary. If you include %glr-parser among the Bison declarations in your file (see section Outline of a Bison Grammar), the result is a Generalized LR (GLR) parser. These parsers handle Bison grammars that contain no unresolved conflicts (i.e., after applying precedence declarations) identically to deterministic parsers. However, when faced with unresolved shift/reduce and reduce/reduce conflicts, GLR parsers use the simple expedient of doing both, effectively cloning the parser to follow both possibilities. Each of the resulting parsers can again split, so that at any given time, there can be any number of possible parses being explored. The parsers proceed in lockstep; that is, all of them consume (shift) a given input symbol before any of them proceed to the next. Each of the cloned parsers eventually meets one of two possible fates: either it runs into a parsing error, in which case it simply vanishes, or it merges with another parser, because the two of them have reduced the input to an identical set of symbols.

During the time that there are multiple parsers, semantic actions are recorded, but not performed. When a parser disappears, its recorded semantic actions disappear as well, and are never performed. When a reduction makes two parsers identical, causing them to merge, Bison records both sets of semantic actions. Whenever the last two parsers merge, reverting to the single-parser case, Bison resolves all the outstanding actions either by precedences given to the grammar rules involved, or by performing both actions, and then calling a designated user-defined function on the resulting values to produce an arbitrary merged result.

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

1.5.1 Using GLR on Unambiguous Grammars

In the simplest cases, you can use the GLR algorithm to parse grammars that are unambiguous but fail to be LR(1). Such grammars typically require more than one symbol of lookahead.

Consider a problem that arises in the declaration of enumerated and subrange types in the programming language Pascal. Here are some examples:

type subrange = lo .. hi;
type enum = (a, b, c);

The original language standard allows only numeric literals and constant identifiers for the subrange bounds (‘lo’ and ‘hi’), but Extended Pascal (ISO/IEC 10206) and many other Pascal implementations allow arbitrary expressions there. This gives rise to the following situation, containing a superfluous pair of parentheses:

type subrange = (a) .. b;

Compare this to the following declaration of an enumerated type with only one value:

type enum = (a);

(These declarations are contrived, but they are syntactically valid, and more-complicated cases can come up in practical programs.)

These two declarations look identical until the ‘..’ token. With normal LR(1) one-token lookahead it is not possible to decide between the two forms when the identifier ‘a’ is parsed. It is, however, desirable for a parser to decide this, since in the latter case ‘a’ must become a new identifier to represent the enumeration value, while in the former case ‘a’ must be evaluated with its current meaning, which may be a constant or even a function call.

You could parse ‘(a)’ as an “unspecified identifier in parentheses”, to be resolved later, but this typically requires substantial contortions in both semantic actions and large parts of the grammar, where the parentheses are nested in the recursive rules for expressions.

You might think of using the lexer to distinguish between the two forms by returning different tokens for currently defined and undefined identifiers. But if these declarations occur in a local scope, and ‘a’ is defined in an outer scope, then both forms are possible—either locally redefining ‘a’, or using the value of ‘a’ from the outer scope. So this approach cannot work.

A simple solution to this problem is to declare the parser to use the GLR algorithm. When the GLR parser reaches the critical state, it merely splits into two branches and pursues both syntax rules simultaneously. Sooner or later, one of them runs into a parsing error. If there is a ‘..’ token before the next ‘;’, the rule for enumerated types fails since it cannot accept ‘..’ anywhere; otherwise, the subrange type rule fails since it requires a ‘..’ token. So one of the branches fails silently, and the other one continues normally, performing all the intermediate actions that were postponed during the split.

If the input is syntactically incorrect, both branches fail and the parser reports a syntax error as usual.

The effect of all this is that the parser seems to “guess” the correct branch to take, or in other words, it seems to use more lookahead than the underlying LR(1) algorithm actually allows for. In this example, LR(2) would suffice, but also some cases that are not LR(k) for any k can be handled this way.

In general, a GLR parser can take quadratic or cubic worst-case time, and the current Bison parser even takes exponential time and space for some grammars. In practice, this rarely happens, and for many grammars it is possible to prove that it cannot happen. The present example contains only one conflict between two rules, and the type-declaration context containing the conflict cannot be nested. So the number of branches that can exist at any time is limited by the constant 2, and the parsing time is still linear.

Here is a Bison grammar corresponding to the example above. It parses a vastly simplified form of Pascal type declarations.

%token TYPE DOTDOT ID

%left '+' '-'
%left '*' '/'

%%
type_decl: TYPE ID '=' type ';' ;

type:
  '(' id_list ')'
| expr DOTDOT expr
;

id_list:
  ID
| id_list ',' ID
;

expr:
  '(' expr ')'
| expr '+' expr
| expr '-' expr
| expr '*' expr
| expr '/' expr
| ID
;

When used as a normal LR(1) grammar, Bison correctly complains about one reduce/reduce conflict. In the conflicting situation the parser chooses one of the alternatives, arbitrarily the one declared first. Therefore the following correct input is not recognized:

type t = (a) .. b;

The parser can be turned into a GLR parser, while also telling Bison to be silent about the one known reduce/reduce conflict, by adding these two declarations to the Bison grammar file (before the first ‘%%’):

%glr-parser
%expect-rr 1

No change in the grammar itself is required. Now the parser recognizes all valid declarations, according to the limited syntax above, transparently. In fact, the user does not even notice when the parser splits.

So here we have a case where we can use the benefits of GLR, almost without disadvantages. Even in simple cases like this, however, there are at least two potential problems to beware. First, always analyze the conflicts reported by Bison to make sure that GLR splitting is only done where it is intended. A GLR parser splitting inadvertently may cause problems less obvious than an LR parser statically choosing the wrong alternative in a conflict. Second, consider interactions with the lexer (see section Semantic Info in Token Types) with great care. Since a split parser consumes tokens without performing any actions during the split, the lexer cannot obtain information via parser actions. Some cases of lexer interactions can be eliminated by using GLR to shift the complications from the lexer to the parser. You must check the remaining cases for correctness.

In our example, it would be safe for the lexer to return tokens based on their current meanings in some symbol table, because no new symbols are defined in the middle of a type declaration. Though it is possible for a parser to define the enumeration constants as they are parsed, before the type declaration is completed, it actually makes no difference since they cannot be used within the same enumerated type declaration.

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

1.5.2 Using GLR to Resolve Ambiguities

Let’s consider an example, vastly simplified from a C++ grammar.

%{
  #include <stdio.h>
  #define YYSTYPE char const *
  int yylex (void);
  void yyerror (char const *);
%}

%token TYPENAME ID

%right '='
%left '+'

%glr-parser

%%

prog:
  %empty
| prog stmt   { printf ("\n"); }
;

stmt:
  expr ';'  %dprec 1
| decl      %dprec 2
;

expr:
  ID               { printf ("%s ", $$); }
| TYPENAME '(' expr ')'
                   { printf ("%s <cast> ", $1); }
| expr '+' expr    { printf ("+ "); }
| expr '=' expr    { printf ("= "); }
;

decl:
  TYPENAME declarator ';'
                   { printf ("%s <declare> ", $1); }
| TYPENAME declarator '=' expr ';'
                   { printf ("%s <init-declare> ", $1); }
;

declarator:
  ID               { printf ("\"%s\" ", $1); }
| '(' declarator ')'
;

This models a problematic part of the C++ grammar—the ambiguity between certain declarations and statements. For example,

T (x) = y+z;

parses as either an expr or a stmt (assuming that ‘T’ is recognized as a TYPENAME and ‘x’ as an ID). Bison detects this as a reduce/reduce conflict between the rules expr : ID and declarator : ID, which it cannot resolve at the time it encounters x in the example above. Since this is a GLR parser, it therefore splits the problem into two parses, one for each choice of resolving the reduce/reduce conflict. Unlike the example from the previous section (see section Using GLR on Unambiguous Grammars), however, neither of these parses “dies,” because the grammar as it stands is ambiguous. One of the parsers eventually reduces stmt : expr ';' and the other reduces stmt : decl, after which both parsers are in an identical state: they’ve seen ‘prog stmt’ and have the same unprocessed input remaining. We say that these parses have merged.

At this point, the GLR parser requires a specification in the grammar of how to choose between the competing parses. In the example above, the two %dprec declarations specify that Bison is to give precedence to the parse that interprets the example as a decl, which implies that x is a declarator. The parser therefore prints

"x" y z + T <init-declare>

The %dprec declarations only come into play when more than one parse survives. Consider a different input string for this parser:

T (x) + y;

This is another example of using GLR to parse an unambiguous construct, as shown in the previous section (see section Using GLR on Unambiguous Grammars). Here, there is no ambiguity (this cannot be parsed as a declaration). However, at the time the Bison parser encounters x, it does not have enough information to resolve the reduce/reduce conflict (again, between x as an expr or a declarator). In this case, no precedence declaration is used. Again, the parser splits into two, one assuming that x is an expr, and the other assuming x is a declarator. The second of these parsers then vanishes when it sees +, and the parser prints

x T <cast> y +

Suppose that instead of resolving the ambiguity, you wanted to see all the possibilities. For this purpose, you must merge the semantic actions of the two possible parsers, rather than choosing one over the other. To do so, you could change the declaration of stmt as follows:

stmt:
  expr ';'  %merge <stmtMerge>
| decl      %merge <stmtMerge>
;

and define the stmtMerge function as:

static YYSTYPE
stmtMerge (YYSTYPE x0, YYSTYPE x1)
{
  printf ("<OR> ");
  return "";
}

with an accompanying forward declaration in the C declarations at the beginning of the file:

%{
  #define YYSTYPE char const *
  static YYSTYPE stmtMerge (YYSTYPE x0, YYSTYPE x1);
%}

With these declarations, the resulting parser parses the first example as both an expr and a decl, and prints

"x" y z + T <init-declare> x T <cast> y z + = <OR>

Bison requires that all of the productions that participate in any particular merge have identical ‘%merge’ clauses. Otherwise, the ambiguity would be unresolvable, and the parser will report an error during any parse that results in the offending merge.

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

1.5.3 GLR Semantic Actions

The nature of GLR parsing and the structure of the generated parsers give rise to certain restrictions on semantic values and actions.

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

1.5.3.1 Deferred semantic actions

By definition, a deferred semantic action is not performed at the same time as the associated reduction. This raises caveats for several Bison features you might use in a semantic action in a GLR parser.

In any semantic action, you can examine yychar to determine the type of the lookahead token present at the time of the associated reduction. After checking that yychar is not set to YYEMPTY or YYEOF, you can then examine yylval and yylloc to determine the lookahead token’s semantic value and location, if any. In a nondeferred semantic action, you can also modify any of these variables to influence syntax analysis. See section Lookahead Tokens.

In a deferred semantic action, it’s too late to influence syntax analysis. In this case, yychar, yylval, and yylloc are set to shallow copies of the values they had at the time of the associated reduction. For this reason alone, modifying them is dangerous. Moreover, the result of modifying them is undefined and subject to change with future versions of Bison. For example, if a semantic action might be deferred, you should never write it to invoke yyclearin (see section Special Features for Use in Actions) or to attempt to free memory referenced by yylval.

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

1.5.3.2 YYERROR

Another Bison feature requiring special consideration is YYERROR (see section Special Features for Use in Actions), which you can invoke in a semantic action to initiate error recovery. During deterministic GLR operation, the effect of YYERROR is the same as its effect in a deterministic parser. The effect in a deferred action is similar, but the precise point of the error is undefined; instead, the parser reverts to deterministic operation, selecting an unspecified stack on which to continue with a syntax error. In a semantic predicate (see Controlling a Parse with Arbitrary Predicates) during nondeterministic parsing, YYERROR silently prunes the parse that invoked the test.

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

1.5.3.3 Restrictions on semantic values and locations

GLR parsers require that you use POD (Plain Old Data) types for semantic values and location types when using the generated parsers as C++ code.

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

1.5.4 Controlling a Parse with Arbitrary Predicates

In addition to the %dprec and %merge directives, GLR parsers allow you to reject parses on the basis of arbitrary computations executed in user code, without having Bison treat this rejection as an error if there are alternative parses. (This feature is experimental and may evolve. We welcome user feedback.) For example,

widget:
  %?{  new_syntax } "widget" id new_args  { $$ = f($3, $4); }
| %?{ !new_syntax } "widget" id old_args  { $$ = f($3, $4); }
;

is one way to allow the same parser to handle two different syntaxes for widgets. The clause preceded by %? is treated like an ordinary action, except that its text is treated as an expression and is always evaluated immediately (even when in nondeterministic mode). If the expression yields 0 (false), the clause is treated as a syntax error, which, in a nondeterministic parser, causes the stack in which it is reduced to die. In a deterministic parser, it acts like YYERROR.

As the example shows, predicates otherwise look like semantic actions, and therefore you must be take them into account when determining the numbers to use for denoting the semantic values of right-hand side symbols. Predicate actions, however, have no defined value, and may not be given labels.

There is a subtle difference between semantic predicates and ordinary actions in nondeterministic mode, since the latter are deferred. For example, we could try to rewrite the previous example as

widget:
  { if (!new_syntax) YYERROR; }
    "widget" id new_args  { $$ = f($3, $4); }
|  { if (new_syntax) YYERROR; }
    "widget" id old_args   { $$ = f($3, $4); }
;

(reversing the sense of the predicate tests to cause an error when they are false). However, this does not have the same effect if new_args and old_args have overlapping syntax. Since the mid-rule actions testing new_syntax are deferred, a GLR parser first encounters the unresolved ambiguous reduction for cases where new_args and old_args recognize the same string before performing the tests of new_syntax. It therefore reports an error.

Finally, be careful in writing predicates: deferred actions have not been evaluated, so that using them in a predicate will have undefined effects.

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

1.5.5 Considerations when Compiling GLR Parsers

The GLR parsers require a compiler for ISO C89 or later. In addition, they use the inline keyword, which is not C89, but is C99 and is a common extension in pre-C99 compilers. It is up to the user of these parsers to handle portability issues. For instance, if using Autoconf and the Autoconf macro AC_C_INLINE, a mere

%{
  #include <config.h>
%}

will suffice. Otherwise, we suggest

%{
  #if (__STDC_VERSION__ < 199901 && ! defined __GNUC__ \
       && ! defined inline)
  # define inline
  #endif
%}

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

This document was generated by Rick Perry on December 29, 2013 using texi2html 1.82.

1.5.1 Using GLR on Unambiguous Grammars		Using GLR parsers on unambiguous grammars.
1.5.2 Using GLR to Resolve Ambiguities		Using GLR parsers to resolve ambiguities.
1.5.3 GLR Semantic Actions		Considerations for semantic values and deferred actions.
1.5.4 Controlling a Parse with Arbitrary Predicates		Controlling a parse with arbitrary computations.
1.5.5 Considerations when Compiling GLR Parsers		GLR parsers require a modern C compiler.