Bison 3.0.2: 5.8 Tuning LR

5.8 Tuning LR

The default behavior of Bison’s LR-based parsers is chosen mostly for historical reasons, but that behavior is often not robust. For example, in the previous section, we discussed the mysterious conflicts that can be produced by LALR(1), Bison’s default parser table construction algorithm. Another example is Bison’s %define parse.error verbose directive, which instructs the generated parser to produce verbose syntax error messages, which can sometimes contain incorrect information.

In this section, we explore several modern features of Bison that allow you to tune fundamental aspects of the generated LR-based parsers. Some of these features easily eliminate shortcomings like those mentioned above. Others can be helpful purely for understanding your parser.

Most of the features discussed in this section are still experimental. More user feedback will help to stabilize them.

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

5.8.1 LR Table Construction

For historical reasons, Bison constructs LALR(1) parser tables by default. However, LALR does not possess the full language-recognition power of LR. As a result, the behavior of parsers employing LALR parser tables is often mysterious. We presented a simple example of this effect in Mysterious Conflicts.

As we also demonstrated in that example, the traditional approach to eliminating such mysterious behavior is to restructure the grammar. Unfortunately, doing so correctly is often difficult. Moreover, merely discovering that LALR causes mysterious behavior in your parser can be difficult as well.

Fortunately, Bison provides an easy way to eliminate the possibility of such mysterious behavior altogether. You simply need to activate a more powerful parser table construction algorithm by using the %define lr.type directive.

Directive: %define lr.type type

Specify the type of parser tables within the LR(1) family. The accepted values for type are:

lalr (default)
ielr
canonical-lr

(This feature is experimental. More user feedback will help to stabilize it.)

For example, to activate IELR, you might add the following directive to you grammar file:

%define lr.type ielr

For the example in Mysterious Conflicts, the mysterious conflict is then eliminated, so there is no need to invest time in comprehending the conflict or restructuring the grammar to fix it. If, during future development, the grammar evolves such that all mysterious behavior would have disappeared using just LALR, you need not fear that continuing to use IELR will result in unnecessarily large parser tables. That is, IELR generates LALR tables when LALR (using a deterministic parsing algorithm) is sufficient to support the full language-recognition power of LR. Thus, by enabling IELR at the start of grammar development, you can safely and completely eliminate the need to consider LALR’s shortcomings.

While IELR is almost always preferable, there are circumstances where LALR or the canonical LR parser tables described by Knuth (see section Knuth 1965) can be useful. Here we summarize the relative advantages of each parser table construction algorithm within Bison:

LALR
There are at least two scenarios where LALR can be worthwhile:
- GLR without static conflict resolution.
  When employing GLR parsers (see section Writing GLR Parsers), if you do not resolve any conflicts statically (for example, with %left or %precedence), then the parser explores all potential parses of any given input. In this case, the choice of parser table construction algorithm is guaranteed not to alter the language accepted by the parser. LALR parser tables are the smallest parser tables Bison can currently construct, so they may then be preferable. Nevertheless, once you begin to resolve conflicts statically, GLR behaves more like a deterministic parser in the syntactic contexts where those conflicts appear, and so either IELR or canonical LR can then be helpful to avoid LALR’s mysterious behavior.
- Malformed grammars.
  Occasionally during development, an especially malformed grammar with a major recurring flaw may severely impede the IELR or canonical LR parser table construction algorithm. LALR can be a quick way to construct parser tables in order to investigate such problems while ignoring the more subtle differences from IELR and canonical LR.
IELR
IELR (Inadequacy Elimination LR) is a minimal LR algorithm. That is, given any grammar (LR or non-LR), parsers using IELR or canonical LR parser tables always accept exactly the same set of sentences. However, like LALR, IELR merges parser states during parser table construction so that the number of parser states is often an order of magnitude less than for canonical LR. More importantly, because canonical LR’s extra parser states may contain duplicate conflicts in the case of non-LR grammars, the number of conflicts for IELR is often an order of magnitude less as well. This effect can significantly reduce the complexity of developing a grammar.
Canonical LR
While inefficient, canonical LR parser tables can be an interesting means to explore a grammar because they possess a property that IELR and LALR tables do not. That is, if %nonassoc is not used and default reductions are left disabled (see section Default Reductions), then, for every left context of every canonical LR state, the set of tokens accepted by that state is guaranteed to be the exact set of tokens that is syntactically acceptable in that left context. It might then seem that an advantage of canonical LR parsers in production is that, under the above constraints, they are guaranteed to detect a syntax error as soon as possible without performing any unnecessary reductions. However, IELR parsers that use LAC are also able to achieve this behavior without sacrificing %nonassoc or default reductions. For details and a few caveats of LAC, see section LAC.

For a more detailed exposition of the mysterious behavior in LALR parsers and the benefits of IELR, see section Denny 2008 March, and Denny 2010 November.

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

5.8.2 Default Reductions

After parser table construction, Bison identifies the reduction with the largest lookahead set in each parser state. To reduce the size of the parser state, traditional Bison behavior is to remove that lookahead set and to assign that reduction to be the default parser action. Such a reduction is known as a default reduction.

Default reductions affect more than the size of the parser tables. They also affect the behavior of the parser:

Delayed yylex invocations.
A consistent state is a state that has only one possible parser action. If that action is a reduction and is encoded as a default reduction, then that consistent state is called a defaulted state. Upon reaching a defaulted state, a Bison-generated parser does not bother to invoke yylex to fetch the next token before performing the reduction. In other words, whether default reductions are enabled in consistent states determines how soon a Bison-generated parser invokes yylex for a token: immediately when it reaches that token in the input or when it eventually needs that token as a lookahead to determine the next parser action. Traditionally, default reductions are enabled, and so the parser exhibits the latter behavior.

The presence of defaulted states is an important consideration when designing yylex and the grammar file. That is, if the behavior of yylex can influence or be influenced by the semantic actions associated with the reductions in defaulted states, then the delay of the next yylex invocation until after those reductions is significant. For example, the semantic actions might pop a scope stack that yylex uses to determine what token to return. Thus, the delay might be necessary to ensure that yylex does not look up the next token in a scope that should already be considered closed.
Delayed syntax error detection.
When the parser fetches a new token by invoking yylex, it checks whether there is an action for that token in the current parser state. The parser detects a syntax error if and only if either (1) there is no action for that token or (2) the action for that token is the error action (due to the use of %nonassoc). However, if there is a default reduction in that state (which might or might not be a defaulted state), then it is impossible for condition 1 to exist. That is, all tokens have an action. Thus, the parser sometimes fails to detect the syntax error until it reaches a later state.

While default reductions never cause the parser to accept syntactically incorrect sentences, the delay of syntax error detection can have unexpected effects on the behavior of the parser. However, the delay can be caused anyway by parser state merging and the use of %nonassoc, and it can be fixed by another Bison feature, LAC. We discuss the effects of delayed syntax error detection and LAC more in the next section (see section LAC).

For canonical LR, the only default reduction that Bison enables by default is the accept action, which appears only in the accepting state, which has no other action and is thus a defaulted state. However, the default accept action does not delay any yylex invocation or syntax error detection because the accept action ends the parse.

For LALR and IELR, Bison enables default reductions in nearly all states by default. There are only two exceptions. First, states that have a shift action on the error token do not have default reductions because delayed syntax error detection could then prevent the error token from ever being shifted in that state. However, parser state merging can cause the same effect anyway, and LAC fixes it in both cases, so future versions of Bison might drop this exception when LAC is activated. Second, GLR parsers do not record the default reduction as the action on a lookahead token for which there is a conflict. The correct action in this case is to split the parse instead.

To adjust which states have default reductions enabled, use the %define lr.default-reduction directive.

Directive: %define lr.default-reduction where

Specify the kind of states that are permitted to contain default reductions. The accepted values of where are:

most (default for LALR and IELR)
consistent
accepting (default for canonical LR)

(The ability to specify where default reductions are permitted is experimental. More user feedback will help to stabilize it.)

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

5.8.3 LAC

Canonical LR, IELR, and LALR can suffer from a couple of problems upon encountering a syntax error. First, the parser might perform additional parser stack reductions before discovering the syntax error. Such reductions can perform user semantic actions that are unexpected because they are based on an invalid token, and they cause error recovery to begin in a different syntactic context than the one in which the invalid token was encountered. Second, when verbose error messages are enabled (see section The Error Reporting Function yyerror), the expected token list in the syntax error message can both contain invalid tokens and omit valid tokens.

The culprits for the above problems are %nonassoc, default reductions in inconsistent states (see section Default Reductions), and parser state merging. Because IELR and LALR merge parser states, they suffer the most. Canonical LR can suffer only if %nonassoc is used or if default reductions are enabled for inconsistent states.

LAC (Lookahead Correction) is a new mechanism within the parsing algorithm that solves these problems for canonical LR, IELR, and LALR without sacrificing %nonassoc, default reductions, or state merging. You can enable LAC with the %define parse.lac directive.

Directive: %define parse.lac value

Enable LAC to improve syntax error handling.

none (default)
full

(This feature is experimental. More user feedback will help to stabilize it. Moreover, it is currently only available for deterministic parsers in C.)

Conceptually, the LAC mechanism is straight-forward. Whenever the parser fetches a new token from the scanner so that it can determine the next parser action, it immediately suspends normal parsing and performs an exploratory parse using a temporary copy of the normal parser state stack. During this exploratory parse, the parser does not perform user semantic actions. If the exploratory parse reaches a shift action, normal parsing then resumes on the normal parser stacks. If the exploratory parse reaches an error instead, the parser reports a syntax error. If verbose syntax error messages are enabled, the parser must then discover the list of expected tokens, so it performs a separate exploratory parse for each token in the grammar.

There is one subtlety about the use of LAC. That is, when in a consistent parser state with a default reduction, the parser will not attempt to fetch a token from the scanner because no lookahead is needed to determine the next parser action. Thus, whether default reductions are enabled in consistent states (see section Default Reductions) affects how soon the parser detects a syntax error: immediately when it reaches an erroneous token or when it eventually needs that token as a lookahead to determine the next parser action. The latter behavior is probably more intuitive, so Bison currently provides no way to achieve the former behavior while default reductions are enabled in consistent states.

Thus, when LAC is in use, for some fixed decision of whether to enable default reductions in consistent states, canonical LR and IELR behave almost exactly the same for both syntactically acceptable and syntactically unacceptable input. While LALR still does not support the full language-recognition power of canonical LR and IELR, LAC at least enables LALR’s syntax error handling to correctly reflect LALR’s language-recognition power.

There are a few caveats to consider when using LAC:

Infinite parsing loops.
IELR plus LAC does have one shortcoming relative to canonical LR. Some parsers generated by Bison can loop infinitely. LAC does not fix infinite parsing loops that occur between encountering a syntax error and detecting it, but enabling canonical LR or disabling default reductions sometimes does.
Verbose error message limitations.
Because of internationalization considerations, Bison-generated parsers limit the size of the expected token list they are willing to report in a verbose syntax error message. If the number of expected tokens exceeds that limit, the list is simply dropped from the message. Enabling LAC can increase the size of the list and thus cause the parser to drop it. Of course, dropping the list is better than reporting an incorrect list.
Performance.
Because LAC requires many parse actions to be performed twice, it can have a performance penalty. However, not all parse actions must be performed twice. Specifically, during a series of default reductions in consistent states and shift actions, the parser never has to initiate an exploratory parse. Moreover, the most time-consuming tasks in a parse are often the file I/O, the lexical analysis performed by the scanner, and the user’s semantic actions, but none of these are performed during the exploratory parse. Finally, the base of the temporary stack used during an exploratory parse is a pointer into the normal parser state stack so that the stack is never physically copied. In our experience, the performance penalty of LAC has proved insignificant for practical grammars.

While the LAC algorithm shares techniques that have been recognized in the parser community for years, for the publication that introduces LAC, see section Denny 2010 May.

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

5.8.4 Unreachable States

If there exists no sequence of transitions from the parser’s start state to some state s, then Bison considers s to be an unreachable state. A state can become unreachable during conflict resolution if Bison disables a shift action leading to it from a predecessor state.

By default, Bison removes unreachable states from the parser after conflict resolution because they are useless in the generated parser. However, keeping unreachable states is sometimes useful when trying to understand the relationship between the parser and the grammar.

Directive: %define lr.keep-unreachable-state value: Request that Bison allow unreachable states to remain in the parser tables. value must be a Boolean. The default is false.

There are a few caveats to consider:

Missing or extraneous warnings.
Unreachable states may contain conflicts and may use rules not used in any other state. Thus, keeping unreachable states may induce warnings that are irrelevant to your parser’s behavior, and it may eliminate warnings that are relevant. Of course, the change in warnings may actually be relevant to a parser table analysis that wants to keep unreachable states, so this behavior will likely remain in future Bison releases.
Other useless states.
While Bison is able to remove unreachable states, it is not guaranteed to remove other kinds of useless states. Specifically, when Bison disables reduce actions during conflict resolution, some goto actions may become useless, and thus some additional states may become useless. If Bison were to compute which goto actions were useless and then disable those actions, it could identify such states as unreachable and then remove those states. However, Bison does not compute which goto actions are useless.

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

This document was generated by Rick Perry on December 29, 2013 using texi2html 1.82.

5.8.1 LR Table Construction		Choose a different construction algorithm.
5.8.2 Default Reductions		Disable default reductions.
5.8.3 LAC		Correct lookahead sets in the parser states.
5.8.4 Unreachable States		Keep unreachable parser states for debugging.