[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

3. Bison Grammar Files

Bison takes as input a context-free grammar specification and produces a C-language function that recognizes correct instances of the grammar.

The Bison grammar file conventionally has a name ending in ‘.y’. See section Invoking Bison.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

3.1 Outline of a Bison Grammar

A Bison grammar file has four main sections, shown here with the appropriate delimiters:

 
%{
  Prologue
%}

Bison declarations

%%
Grammar rules
%%

Epilogue

Comments enclosed in ‘/* … */’ may appear in any of the sections. As a GNU extension, ‘//’ introduces a comment that continues until end of line.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

3.1.1 The prologue

The Prologue section contains macro definitions and declarations of functions and variables that are used in the actions in the grammar rules. These are copied to the beginning of the parser implementation file so that they precede the definition of yyparse. You can use ‘#include’ to get the declarations from a header file. If you don’t need any C declarations, you may omit the ‘%{’ and ‘%}’ delimiters that bracket this section.

The Prologue section is terminated by the first occurrence of ‘%}’ that is outside a comment, a string literal, or a character constant.

You may have more than one Prologue section, intermixed with the Bison declarations. This allows you to have C and Bison declarations that refer to each other. For example, the %union declaration may use types defined in a header file, and you may wish to prototype functions that take arguments of type YYSTYPE. This can be done with two Prologue blocks, one before and one after the %union declaration.

 
%{
  #define _GNU_SOURCE
  #include <stdio.h>
  #include "ptypes.h"
%}
%union {
  long int n;
  tree t;  /* tree is defined in ‘ptypes.h’. */
}
%{
  static void print_token_value (FILE *, int, YYSTYPE);
  #define YYPRINT(F, N, L) print_token_value (F, N, L)
%}

When in doubt, it is usually safer to put prologue code before all Bison declarations, rather than after. For example, any definitions of feature test macros like _GNU_SOURCE or _POSIX_C_SOURCE should appear before all Bison declarations, as feature test macros can affect the behavior of Bison-generated #include directives.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

3.1.2 Prologue Alternatives

The functionality of Prologue sections can often be subtle and inflexible. As an alternative, Bison provides a %code directive with an explicit qualifier field, which identifies the purpose of the code and thus the location(s) where Bison should generate it. For C/C++, the qualifier can be omitted for the default location, or it can be one of requires, provides, top. See section %code Summary.

Look again at the example of the previous section:

 
%{
  #define _GNU_SOURCE
  #include <stdio.h>
  #include "ptypes.h"
%}
%union {
  long int n;
  tree t;  /* tree is defined in ‘ptypes.h’. */
}
%{
  static void print_token_value (FILE *, int, YYSTYPE);
  #define YYPRINT(F, N, L) print_token_value (F, N, L)
%}

Notice that there are two Prologue sections here, but there’s a subtle distinction between their functionality. For example, if you decide to override Bison’s default definition for YYLTYPE, in which Prologue section should you write your new definition? You should write it in the first since Bison will insert that code into the parser implementation file before the default YYLTYPE definition. In which Prologue section should you prototype an internal function, trace_token, that accepts YYLTYPE and yytokentype as arguments? You should prototype it in the second since Bison will insert that code after the YYLTYPE and yytokentype definitions.

This distinction in functionality between the two Prologue sections is established by the appearance of the %union between them. This behavior raises a few questions. First, why should the position of a %union affect definitions related to YYLTYPE and yytokentype? Second, what if there is no %union? In that case, the second kind of Prologue section is not available. This behavior is not intuitive.

To avoid this subtle %union dependency, rewrite the example using a %code top and an unqualified %code. Let’s go ahead and add the new YYLTYPE definition and the trace_token prototype at the same time:

 
%code top {
  #define _GNU_SOURCE
  #include <stdio.h>

  /* WARNING: The following code really belongs
   * in a '%code requires'; see below.  */

  #include "ptypes.h"
  #define YYLTYPE YYLTYPE
  typedef struct YYLTYPE
  {
    int first_line;
    int first_column;
    int last_line;
    int last_column;
    char *filename;
  } YYLTYPE;
}

%union {
  long int n;
  tree t;  /* tree is defined in ‘ptypes.h’. */
}
%code {
  static void print_token_value (FILE *, int, YYSTYPE);
  #define YYPRINT(F, N, L) print_token_value (F, N, L)
  static void trace_token (enum yytokentype token, YYLTYPE loc);
}

In this way, %code top and the unqualified %code achieve the same functionality as the two kinds of Prologue sections, but it’s always explicit which kind you intend. Moreover, both kinds are always available even in the absence of %union.

The %code top block above logically contains two parts. The first two lines before the warning need to appear near the top of the parser implementation file. The first line after the warning is required by YYSTYPE and thus also needs to appear in the parser implementation file. However, if you’ve instructed Bison to generate a parser header file (see section %defines), you probably want that line to appear before the YYSTYPE definition in that header file as well. The YYLTYPE definition should also appear in the parser header file to override the default YYLTYPE definition there.

In other words, in the %code top block above, all but the first two lines are dependency code required by the YYSTYPE and YYLTYPE definitions. Thus, they belong in one or more %code requires:

 
%code top {
  #define _GNU_SOURCE
  #include <stdio.h>
}
%code requires {
  #include "ptypes.h"
}
%union {
  long int n;
  tree t;  /* tree is defined in ‘ptypes.h’. */
}
%code requires {
  #define YYLTYPE YYLTYPE
  typedef struct YYLTYPE
  {
    int first_line;
    int first_column;
    int last_line;
    int last_column;
    char *filename;
  } YYLTYPE;
}
%code {
  static void print_token_value (FILE *, int, YYSTYPE);
  #define YYPRINT(F, N, L) print_token_value (F, N, L)
  static void trace_token (enum yytokentype token, YYLTYPE loc);
}

Now Bison will insert #include "ptypes.h" and the new YYLTYPE definition before the Bison-generated YYSTYPE and YYLTYPE definitions in both the parser implementation file and the parser header file. (By the same reasoning, %code requires would also be the appropriate place to write your own definition for YYSTYPE.)

When you are writing dependency code for YYSTYPE and YYLTYPE, you should prefer %code requires over %code top regardless of whether you instruct Bison to generate a parser header file. When you are writing code that you need Bison to insert only into the parser implementation file and that has no special need to appear at the top of that file, you should prefer the unqualified %code over %code top. These practices will make the purpose of each block of your code explicit to Bison and to other developers reading your grammar file. Following these practices, we expect the unqualified %code and %code requires to be the most important of the four Prologue alternatives.

At some point while developing your parser, you might decide to provide trace_token to modules that are external to your parser. Thus, you might wish for Bison to insert the prototype into both the parser header file and the parser implementation file. Since this function is not a dependency required by YYSTYPE or YYLTYPE, it doesn’t make sense to move its prototype to a %code requires. More importantly, since it depends upon YYLTYPE and yytokentype, %code requires is not sufficient. Instead, move its prototype from the unqualified %code to a %code provides:

 
%code top {
  #define _GNU_SOURCE
  #include <stdio.h>
}
%code requires {
  #include "ptypes.h"
}
%union {
  long int n;
  tree t;  /* tree is defined in ‘ptypes.h’. */
}
%code requires {
  #define YYLTYPE YYLTYPE
  typedef struct YYLTYPE
  {
    int first_line;
    int first_column;
    int last_line;
    int last_column;
    char *filename;
  } YYLTYPE;
}
%code provides {
  void trace_token (enum yytokentype token, YYLTYPE loc);
}
%code {
  static void print_token_value (FILE *, int, YYSTYPE);
  #define YYPRINT(F, N, L) print_token_value (F, N, L)
}

Bison will insert the trace_token prototype into both the parser header file and the parser implementation file after the definitions for yytokentype, YYLTYPE, and YYSTYPE.

The above examples are careful to write directives in an order that reflects the layout of the generated parser implementation and header files: %code top, %code requires, %code provides, and then %code. While your grammar files may generally be easier to read if you also follow this order, Bison does not require it. Instead, Bison lets you choose an organization that makes sense to you.

You may declare any of these directives multiple times in the grammar file. In that case, Bison concatenates the contained code in declaration order. This is the only way in which the position of one of these directives within the grammar file affects its functionality.

The result of the previous two properties is greater flexibility in how you may organize your grammar file. For example, you may organize semantic-type-related directives by semantic type:

 
%code requires { #include "type1.h" }
%union { type1 field1; }
%destructor { type1_free ($$); } <field1>
%printer { type1_print (yyoutput, $$); } <field1>
%code requires { #include "type2.h" }
%union { type2 field2; }
%destructor { type2_free ($$); } <field2>
%printer { type2_print (yyoutput, $$); } <field2>

You could even place each of the above directive groups in the rules section of the grammar file next to the set of rules that uses the associated semantic type. (In the rules section, you must terminate each of those directives with a semicolon.) And you don’t have to worry that some directive (like a %union) in the definitions section is going to adversely affect their functionality in some counter-intuitive manner just because it comes first. Such an organization is not possible using Prologue sections.

This section has been concerned with explaining the advantages of the four Prologue alternatives over the original Yacc Prologue. However, in most cases when using these directives, you shouldn’t need to think about all the low-level ordering issues discussed here. Instead, you should simply use these directives to label each block of your code according to its purpose and let Bison handle the ordering. %code is the most generic label. Move code to %code requires, %code provides, or %code top as needed.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

3.1.3 The Bison Declarations Section

The Bison declarations section contains declarations that define terminal and nonterminal symbols, specify precedence, and so on. In some simple grammars you may not need any declarations. See section Bison Declarations.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

3.1.4 The Grammar Rules Section

The grammar rules section contains one or more Bison grammar rules, and nothing else. See section Syntax of Grammar Rules.

There must always be at least one grammar rule, and the first ‘%%’ (which precedes the grammar rules) may never be omitted even if it is the first thing in the file.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

3.1.5 The epilogue

The Epilogue is copied verbatim to the end of the parser implementation file, just as the Prologue is copied to the beginning. This is the most convenient place to put anything that you want to have in the parser implementation file but which need not come before the definition of yyparse. For example, the definitions of yylex and yyerror often go here. Because C requires functions to be declared before being used, you often need to declare functions like yylex and yyerror in the Prologue, even if you define them in the Epilogue. See section Parser C-Language Interface.

If the last section is empty, you may omit the ‘%%’ that separates it from the grammar rules.

The Bison parser itself contains many macros and identifiers whose names start with ‘yy’ or ‘YY’, so it is a good idea to avoid using any such names (except those documented in this manual) in the epilogue of the grammar file.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

3.2 Symbols, Terminal and Nonterminal

Symbols in Bison grammars represent the grammatical classifications of the language.

A terminal symbol (also known as a token type) represents a class of syntactically equivalent tokens. You use the symbol in grammar rules to mean that a token in that class is allowed. The symbol is represented in the Bison parser by a numeric code, and the yylex function returns a token type code to indicate what kind of token has been read. You don’t need to know what the code value is; you can use the symbol to stand for it.

A nonterminal symbol stands for a class of syntactically equivalent groupings. The symbol name is used in writing grammar rules. By convention, it should be all lower case.

Symbol names can contain letters, underscores, periods, and non-initial digits and dashes. Dashes in symbol names are a GNU extension, incompatible with POSIX Yacc. Periods and dashes make symbol names less convenient to use with named references, which require brackets around such names (see section Named References). Terminal symbols that contain periods or dashes make little sense: since they are not valid symbols (in most programming languages) they are not exported as token names.

There are three ways of writing terminal symbols in the grammar:

How you choose to write a terminal symbol has no effect on its grammatical meaning. That depends only on where it appears in rules and on when the parser function returns that symbol.

The value returned by yylex is always one of the terminal symbols, except that a zero or negative value signifies end-of-input. Whichever way you write the token type in the grammar rules, you write it the same way in the definition of yylex. The numeric code for a character token type is simply the positive numeric code of the character, so yylex can use the identical value to generate the requisite code, though you may need to convert it to unsigned char to avoid sign-extension on hosts where char is signed. Each named token type becomes a C macro in the parser implementation file, so yylex can use the name to stand for the code. (This is why periods don’t make sense in terminal symbols.) See section Calling Convention for yylex.

If yylex is defined in a separate file, you need to arrange for the token-type macro definitions to be available there. Use the ‘-d’ option when you run Bison, so that it will write these macro definitions into a separate header file ‘name.tab.h’ which you can include in the other source files that need it. See section Invoking Bison.

If you want to write a grammar that is portable to any Standard C host, you must use only nonnull character tokens taken from the basic execution character set of Standard C. This set consists of the ten digits, the 52 lower- and upper-case English letters, and the characters in the following C-language string:

 
"\a\b\t\n\v\f\r !\"#%&'()*+,-./:;<=>?[\\]^_{|}~"

The yylex function and Bison must use a consistent character set and encoding for character tokens. For example, if you run Bison in an ASCII environment, but then compile and run the resulting program in an environment that uses an incompatible character set like EBCDIC, the resulting program may not work because the tables generated by Bison will assume ASCII numeric values for character tokens. It is standard practice for software distributions to contain C source files that were generated by Bison in an ASCII environment, so installers on platforms that are incompatible with ASCII must rebuild those files before compiling them.

The symbol error is a terminal symbol reserved for error recovery (see section Error Recovery); you shouldn’t use it for any other purpose. In particular, yylex should never return this value. The default value of the error token is 256, unless you explicitly assigned 256 to one of your tokens with a %token declaration.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

3.3 Grammar Rules

A Bison grammar is a list of rules.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

3.3.1 Syntax of Grammar Rules

A Bison grammar rule has the following general form:

 
result: components…;

where result is the nonterminal symbol that this rule describes, and components are various terminal and nonterminal symbols that are put together by this rule (see section Symbols, Terminal and Nonterminal).

For example,

 
exp: exp '+' exp;

says that two groupings of type exp, with a ‘+’ token in between, can be combined into a larger grouping of type exp.

White space in rules is significant only to separate symbols. You can add extra white space as you wish.

Scattered among the components can be actions that determine the semantics of the rule. An action looks like this:

 
{C statements}

This is an example of braced code, that is, C code surrounded by braces, much like a compound statement in C. Braced code can contain any sequence of C tokens, so long as its braces are balanced. Bison does not check the braced code for correctness directly; it merely copies the code to the parser implementation file, where the C compiler can check it.

Within braced code, the balanced-brace count is not affected by braces within comments, string literals, or character constants, but it is affected by the C digraphs ‘<%’ and ‘%>’ that represent braces. At the top level braced code must be terminated by ‘}’ and not by a digraph. Bison does not look for trigraphs, so if braced code uses trigraphs you should ensure that they do not affect the nesting of braces or the boundaries of comments, string literals, or character constants.

Usually there is only one action and it follows the components. See section Actions.

Multiple rules for the same result can be written separately or can be joined with the vertical-bar character ‘|’ as follows:

 
result:
  rule1-components…
| rule2-components…
…
;

They are still considered distinct rules even when joined in this way.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

3.3.2 Empty Rules

A rule is said to be empty if its right-hand side (components) is empty. It means that result can match the empty string. For example, here is how to define an optional semicolon:

 
semicolon.opt: | ";";

It is easy not to see an empty rule, especially when | is used. The %empty directive allows to make explicit that a rule is empty on purpose:

 
semicolon.opt:
  %empty
| ";"
;

Flagging a non-empty rule with %empty is an error. If run with ‘-Wempty-rule’, bison will report empty rules without %empty. Using %empty enables this warning, unless ‘-Wno-empty-rule’ was specified.

The %empty directive is a Bison extension, it does not work with Yacc. To remain compatible with POSIX Yacc, it is customary to write a comment ‘/* empty */’ in each rule with no components:

 
semicolon.opt:
  /* empty */
| ";"
;

[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

3.3.3 Recursive Rules

A rule is called recursive when its result nonterminal appears also on its right hand side. Nearly all Bison grammars need to use recursion, because that is the only way to define a sequence of any number of a particular thing. Consider this recursive definition of a comma-separated sequence of one or more expressions:

 
expseq1:
  exp
| expseq1 ',' exp
;

Since the recursive use of expseq1 is the leftmost symbol in the right hand side, we call this left recursion. By contrast, here the same construct is defined using right recursion:

 
expseq1:
  exp
| exp ',' expseq1
;

Any kind of sequence can be defined using either left recursion or right recursion, but you should always use left recursion, because it can parse a sequence of any number of elements with bounded stack space. Right recursion uses up space on the Bison stack in proportion to the number of elements in the sequence, because all the elements must be shifted onto the stack before the rule can be applied even once. See section The Bison Parser Algorithm, for further explanation of this.

Indirect or mutual recursion occurs when the result of the rule does not appear directly on its right hand side, but does appear in rules for other nonterminals which do appear on its right hand side.

For example:

 
expr:
  primary
| primary '+' primary
;
primary:
  constant
| '(' expr ')'
;

defines two mutually-recursive nonterminals, since each refers to the other.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

3.4 Defining Language Semantics

The grammar rules for a language determine only the syntax. The semantics are determined by the semantic values associated with various tokens and groupings, and by the actions taken when various groupings are recognized.

For example, the calculator calculates properly because the value associated with each expression is the proper number; it adds properly because the action for the grouping ‘x + y’ is to add the numbers associated with x and y.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

3.4.1 Data Types of Semantic Values

In a simple program it may be sufficient to use the same data type for the semantic values of all language constructs. This was true in the RPN and infix calculator examples (see section Reverse Polish Notation Calculator).

Bison normally uses the type int for semantic values if your program uses the same data type for all language constructs. To specify some other type, define the %define variable api.value.type like this:

 
%define api.value.type {double}

or

 
%define api.value.type {struct semantic_type}

The value of api.value.type should be a type name that does not contain parentheses or square brackets.

Alternatively, instead of relying of Bison’s %define support, you may rely on the C/C++ preprocessor and define YYSTYPE as a macro, like this:

 
#define YYSTYPE double

This macro definition must go in the prologue of the grammar file (see section Outline of a Bison Grammar). If compatibility with POSIX Yacc matters to you, use this. Note however that Bison cannot know YYSTYPE’s value, not even whether it is defined, so there are services it cannot provide. Besides this works only for languages that have a preprocessor.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

3.4.2 More Than One Value Type

In most programs, you will need different data types for different kinds of tokens and groupings. For example, a numeric constant may need type int or long int, while a string constant needs type char *, and an identifier might need a pointer to an entry in the symbol table.

To use more than one data type for semantic values in one parser, Bison requires you to do two things:


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

3.4.3 Generating the Semantic Value Type

The special value union of the %define variable api.value.type instructs Bison that the tags used with the %token and %type directives are genuine types, not names of members of YYSTYPE.

For example:

 
%define api.value.type union
%token <int> INT "integer"
%token <int> 'n'
%type <int> expr
%token <char const *> ID "identifier"

generates an appropriate value of YYSTYPE to support each symbol type. The name of the member of YYSTYPE for tokens than have a declared identifier id (such as INT and ID above, but not 'n') is id. The other symbols have unspecified names on which you should not depend; instead, relying on C casts to access the semantic value with the appropriate type:

 
/* For an "integer".  */
yylval.INT = 42;
return INT;

/* For an 'n', also declared as int.  */
*((int*)&yylval) = 42;
return 'n';

/* For an "identifier".  */
yylval.ID = "42";
return ID;

If the %define variable api.token.prefix is defined (see section api.token.prefix), then it is also used to prefix the union member names. For instance, with ‘%define api.token.prefix {TOK_}’:

 
/* For an "integer".  */
yylval.TOK_INT = 42;
return TOK_INT;

This Bison extension cannot work if %yacc (or ‘-y’/‘--yacc’) is enabled, as POSIX mandates that Yacc generate tokens as macros (e.g., ‘#define INT 258’, or ‘#define TOK_INT 258’).

This feature is new, and user feedback would be most welcome.

A similar feature is provided for C++ that in addition overcomes C++ limitations (that forbid non-trivial objects to be part of a union): ‘%define api.value.type variant’, see C++ Variants.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

3.4.4 The Union Declaration

The %union declaration specifies the entire collection of possible data types for semantic values. The keyword %union is followed by braced code containing the same thing that goes inside a union in C.

For example:

 
%union {
  double val;
  symrec *tptr;
}

This says that the two alternative types are double and symrec *. They are given names val and tptr; these names are used in the %token and %type declarations to pick one of the types for a terminal or nonterminal symbol (see section Nonterminal Symbols).

As an extension to POSIX, a tag is allowed after the %union. For example:

 
%union value {
  double val;
  symrec *tptr;
}

specifies the union tag value, so the corresponding C type is union value. If you do not specify a tag, it defaults to YYSTYPE.

As another extension to POSIX, you may specify multiple %union declarations; their contents are concatenated. However, only the first %union declaration can specify a tag.

Note that, unlike making a union declaration in C, you need not write a semicolon after the closing brace.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

3.4.5 Providing a Structured Semantic Value Type

Instead of %union, you can define and use your own union type YYSTYPE if your grammar contains at least one ‘<type>’ tag. For example, you can put the following into a header file ‘parser.h’:

 
union YYSTYPE {
  double val;
  symrec *tptr;
};

and then your grammar can use the following instead of %union:

 
%{
#include "parser.h"
%}
%define api.value.type {union YYSTYPE}
%type <val> expr
%token <tptr> ID

Actually, you may also provide a struct rather that a union, which may be handy if you want to track information for every symbol (such as preceding comments).

The type you provide may even be structured and include pointers, in which case the type tags you provide may be composite, with ‘.’ and ‘->’ operators.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

3.4.6 Actions

An action accompanies a syntactic rule and contains C code to be executed each time an instance of that rule is recognized. The task of most actions is to compute a semantic value for the grouping built by the rule from the semantic values associated with tokens or smaller groupings.

An action consists of braced code containing C statements, and can be placed at any position in the rule; it is executed at that position. Most rules have just one action at the end of the rule, following all the components. Actions in the middle of a rule are tricky and used only for special purposes (see section Actions in Mid-Rule).

The C code in an action can refer to the semantic values of the components matched by the rule with the construct $n, which stands for the value of the nth component. The semantic value for the grouping being constructed is $$. In addition, the semantic values of symbols can be accessed with the named references construct $name or $[name]. Bison translates both of these constructs into expressions of the appropriate type when it copies the actions into the parser implementation file. $$ (or $name, when it stands for the current grouping) is translated to a modifiable lvalue, so it can be assigned to.

Here is a typical example:

 
exp:
…
| exp '+' exp     { $$ = $1 + $3; }

Or, in terms of named references:

 
exp[result]:
…
| exp[left] '+' exp[right]  { $result = $left + $right; }

This rule constructs an exp from two smaller exp groupings connected by a plus-sign token. In the action, $1 and $3 ($left and $right) refer to the semantic values of the two component exp groupings, which are the first and third symbols on the right hand side of the rule. The sum is stored into $$ ($result) so that it becomes the semantic value of the addition-expression just recognized by the rule. If there were a useful semantic value associated with the ‘+’ token, it could be referred to as $2.

See section Named References, for more information about using the named references construct.

Note that the vertical-bar character ‘|’ is really a rule separator, and actions are attached to a single rule. This is a difference with tools like Flex, for which ‘|’ stands for either “or”, or “the same action as that of the next rule”. In the following example, the action is triggered only when ‘b’ is found:

 
a-or-b: 'a'|'b'   { a_or_b_found = 1; };

If you don’t specify an action for a rule, Bison supplies a default: $$ = $1. Thus, the value of the first symbol in the rule becomes the value of the whole rule. Of course, the default action is valid only if the two data types match. There is no meaningful default action for an empty rule; every empty rule must have an explicit action unless the rule’s value does not matter.

$n with n zero or negative is allowed for reference to tokens and groupings on the stack before those that match the current rule. This is a very risky practice, and to use it reliably you must be certain of the context in which the rule is applied. Here is a case in which you can use this reliably:

 
foo:
  expr bar '+' expr  { … }
| expr bar '-' expr  { … }
;
bar:
  %empty    { previous_expr = $0; }
;

As long as bar is used only in the fashion shown here, $0 always refers to the expr which precedes bar in the definition of foo.

It is also possible to access the semantic value of the lookahead token, if any, from a semantic action. This semantic value is stored in yylval. See section Special Features for Use in Actions.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

3.4.7 Data Types of Values in Actions

If you have chosen a single data type for semantic values, the $$ and $n constructs always have that data type.

If you have used %union to specify a variety of data types, then you must declare a choice among these types for each terminal or nonterminal symbol that can have a semantic value. Then each time you use $$ or $n, its data type is determined by which symbol it refers to in the rule. In this example,

 
exp:
  …
| exp '+' exp    { $$ = $1 + $3; }

$1 and $3 refer to instances of exp, so they all have the data type declared for the nonterminal symbol exp. If $2 were used, it would have the data type declared for the terminal symbol '+', whatever that might be.

Alternatively, you can specify the data type when you refer to the value, by inserting ‘<type>’ after the ‘$’ at the beginning of the reference. For example, if you have defined types as shown here:

 
%union {
  int itype;
  double dtype;
}

then you can write $<itype>1 to refer to the first subunit of the rule as an integer, or $<dtype>1 to refer to it as a double.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

3.4.8 Actions in Mid-Rule

Occasionally it is useful to put an action in the middle of a rule. These actions are written just like usual end-of-rule actions, but they are executed before the parser even recognizes the following components.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

3.4.8.1 Using Mid-Rule Actions

A mid-rule action may refer to the components preceding it using $n, but it may not refer to subsequent components because it is run before they are parsed.

The mid-rule action itself counts as one of the components of the rule. This makes a difference when there is another action later in the same rule (and usually there is another at the end): you have to count the actions along with the symbols when working out which number n to use in $n.

The mid-rule action can also have a semantic value. The action can set its value with an assignment to $$, and actions later in the rule can refer to the value using $n. Since there is no symbol to name the action, there is no way to declare a data type for the value in advance, so you must use the ‘$<…>n’ construct to specify a data type each time you refer to this value.

There is no way to set the value of the entire rule with a mid-rule action, because assignments to $$ do not have that effect. The only way to set the value for the entire rule is with an ordinary action at the end of the rule.

Here is an example from a hypothetical compiler, handling a let statement that looks like ‘let (variable) statement’ and serves to create a variable named variable temporarily for the duration of statement. To parse this construct, we must put variable into the symbol table while statement is parsed, then remove it afterward. Here is how it is done:

 
stmt:
  "let" '(' var ')'
    {
      $<context>$ = push_context ();
      declare_variable ($3);
    }
  stmt
    {
      $$ = $6;
      pop_context ($<context>5);
    }

As soon as ‘let (variable)’ has been recognized, the first action is run. It saves a copy of the current semantic context (the list of accessible variables) as its semantic value, using alternative context in the data-type union. Then it calls declare_variable to add the new variable to that list. Once the first action is finished, the embedded statement stmt can be parsed.

Note that the mid-rule action is component number 5, so the ‘stmt’ is component number 6. Named references can be used to improve the readability and maintainability (see section Named References):

 
stmt:
  "let" '(' var ')'
    {
      $<context>let = push_context ();
      declare_variable ($3);
    }[let]
  stmt
    {
      $$ = $6;
      pop_context ($<context>let);
    }

After the embedded statement is parsed, its semantic value becomes the value of the entire let-statement. Then the semantic value from the earlier action is used to restore the prior list of variables. This removes the temporary let-variable from the list so that it won’t appear to exist while the rest of the program is parsed.

In the above example, if the parser initiates error recovery (see section Error Recovery) while parsing the tokens in the embedded statement stmt, it might discard the previous semantic context $<context>5 without restoring it. Thus, $<context>5 needs a destructor (see section Freeing Discarded Symbols). However, Bison currently provides no means to declare a destructor specific to a particular mid-rule action’s semantic value.

One solution is to bury the mid-rule action inside a nonterminal symbol and to declare a destructor for that symbol:

 
%type <context> let
%destructor { pop_context ($$); } let
%%

stmt:
  let stmt
    {
      $$ = $2;
      pop_context ($let);
    };
let:
  "let" '(' var ')'
    {
      $let = push_context ();
      declare_variable ($3);
    };

Note that the action is now at the end of its rule. Any mid-rule action can be converted to an end-of-rule action in this way, and this is what Bison actually does to implement mid-rule actions.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

3.4.8.2 Mid-Rule Action Translation

As hinted earlier, mid-rule actions are actually transformed into regular rules and actions. The various reports generated by Bison (textual, graphical, etc., see Understanding Your Parser) reveal this translation, best explained by means of an example. The following rule:

 
exp: { a(); } "b" { c(); } { d(); } "e" { f(); };

is translated into:

 
$@1: %empty { a(); };
$@2: %empty { c(); };
$@3: %empty { d(); };
exp: $@1 "b" $@2 $@3 "e" { f(); };

with new nonterminal symbols $@n, where n is a number.

A mid-rule action is expected to generate a value if it uses $$, or the (final) action uses $n where n denote the mid-rule action. In that case its nonterminal is rather named @n:

 
exp: { a(); } "b" { $$ = c(); } { d(); } "e" { f = $1; };

is translated into

 
@1: %empty { a(); };
@2: %empty { $$ = c(); };
$@3: %empty { d(); };
exp: @1 "b" @2 $@3 "e" { f = $1; }

There are probably two errors in the above example: the first mid-rule action does not generate a value (it does not use $$ although the final action uses it), and the value of the second one is not used (the final action does not use $3). Bison reports these errors when the midrule-value warnings are enabled (see section Invoking Bison):

 
$ bison -fcaret -Wmidrule-value mid.y
mid.y:2.6-13: warning: unset value: $$
 exp: { a(); } "b" { $$ = c(); } { d(); } "e" { f = $1; };
      ^^^^^^^^
mid.y:2.19-31: warning: unused value: $3
 exp: { a(); } "b" { $$ = c(); } { d(); } "e" { f = $1; };
                   ^^^^^^^^^^^^^

[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

3.4.8.3 Conflicts due to Mid-Rule Actions

Taking action before a rule is completely recognized often leads to conflicts since the parser must commit to a parse in order to execute the action. For example, the following two rules, without mid-rule actions, can coexist in a working parser because the parser can shift the open-brace token and look at what follows before deciding whether there is a declaration or not:

 
compound:
  '{' declarations statements '}'
| '{' statements '}'
;

But when we add a mid-rule action as follows, the rules become nonfunctional:

 
compound:
  { prepare_for_local_variables (); }
     '{' declarations statements '}'
|    '{' statements '}'
;

Now the parser is forced to decide whether to run the mid-rule action when it has read no farther than the open-brace. In other words, it must commit to using one rule or the other, without sufficient information to do it correctly. (The open-brace token is what is called the lookahead token at this time, since the parser is still deciding what to do about it. See section Lookahead Tokens.)

You might think that you could correct the problem by putting identical actions into the two rules, like this:

 
compound:
  { prepare_for_local_variables (); }
    '{' declarations statements '}'
| { prepare_for_local_variables (); }
    '{' statements '}'
;

But this does not help, because Bison does not realize that the two actions are identical. (Bison never tries to understand the C code in an action.)

If the grammar is such that a declaration can be distinguished from a statement by the first token (which is true in C), then one solution which does work is to put the action after the open-brace, like this:

 
compound:
  '{' { prepare_for_local_variables (); }
    declarations statements '}'
| '{' statements '}'
;

Now the first token of the following declaration or statement, which would in any case tell Bison which rule to use, can still do so.

Another solution is to bury the action inside a nonterminal symbol which serves as a subroutine:

 
subroutine:
  %empty  { prepare_for_local_variables (); }
;
compound:
  subroutine '{' declarations statements '}'
| subroutine '{' statements '}'
;

Now Bison can execute the action in the rule for subroutine without deciding which rule for compound it will eventually use.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

3.5 Tracking Locations

Though grammar rules and semantic actions are enough to write a fully functional parser, it can be useful to process some additional information, especially symbol locations.

The way locations are handled is defined by providing a data type, and actions to take when rules are matched.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

3.5.1 Data Type of Locations

Defining a data type for locations is much simpler than for semantic values, since all tokens and groupings always use the same type.

You can specify the type of locations by defining a macro called YYLTYPE, just as you can specify the semantic value type by defining a YYSTYPE macro (see section Data Types of Semantic Values). When YYLTYPE is not defined, Bison uses a default structure type with four members:

 
typedef struct YYLTYPE
{
  int first_line;
  int first_column;
  int last_line;
  int last_column;
} YYLTYPE;

When YYLTYPE is not defined, at the beginning of the parsing, Bison initializes all these fields to 1 for yylloc. To initialize yylloc with a custom location type (or to chose a different initialization), use the %initial-action directive. See section Performing Actions before Parsing.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

3.5.2 Actions and Locations

Actions are not only useful for defining language semantics, but also for describing the behavior of the output parser with locations.

The most obvious way for building locations of syntactic groupings is very similar to the way semantic values are computed. In a given rule, several constructs can be used to access the locations of the elements being matched. The location of the nth component of the right hand side is @n, while the location of the left hand side grouping is @$.

In addition, the named references construct @name and @[name] may also be used to address the symbol locations. See section Named References, for more information about using the named references construct.

Here is a basic example using the default data type for locations:

 
exp:
  …
| exp '/' exp
    {
      @$.first_column = @1.first_column;
      @$.first_line = @1.first_line;
      @$.last_column = @3.last_column;
      @$.last_line = @3.last_line;
      if ($3)
        $$ = $1 / $3;
      else
        {
          $$ = 1;
          fprintf (stderr, "%d.%d-%d.%d: division by zero",
                   @3.first_line, @3.first_column,
                   @3.last_line, @3.last_column);
        }
    }

As for semantic values, there is a default action for locations that is run each time a rule is matched. It sets the beginning of @$ to the beginning of the first symbol, and the end of @$ to the end of the last symbol.

With this default action, the location tracking can be fully automatic. The example above simply rewrites this way:

 
exp:
  …
| exp '/' exp
    {
      if ($3)
        $$ = $1 / $3;
      else
        {
          $$ = 1;
          fprintf (stderr, "%d.%d-%d.%d: division by zero",
                   @3.first_line, @3.first_column,
                   @3.last_line, @3.last_column);
        }
    }

It is also possible to access the location of the lookahead token, if any, from a semantic action. This location is stored in yylloc. See section Special Features for Use in Actions.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

3.5.3 Default Action for Locations

Actually, actions are not the best place to compute locations. Since locations are much more general than semantic values, there is room in the output parser to redefine the default action to take for each rule. The YYLLOC_DEFAULT macro is invoked each time a rule is matched, before the associated action is run. It is also invoked while processing a syntax error, to compute the error’s location. Before reporting an unresolvable syntactic ambiguity, a GLR parser invokes YYLLOC_DEFAULT recursively to compute the location of that ambiguity.

Most of the time, this macro is general enough to suppress location dedicated code from semantic actions.

The YYLLOC_DEFAULT macro takes three parameters. The first one is the location of the grouping (the result of the computation). When a rule is matched, the second parameter identifies locations of all right hand side elements of the rule being matched, and the third parameter is the size of the rule’s right hand side. When a GLR parser reports an ambiguity, which of multiple candidate right hand sides it passes to YYLLOC_DEFAULT is undefined. When processing a syntax error, the second parameter identifies locations of the symbols that were discarded during error processing, and the third parameter is the number of discarded symbols.

By default, YYLLOC_DEFAULT is defined this way:

 
# define YYLLOC_DEFAULT(Cur, Rhs, N)                      \
do                                                        \
  if (N)                                                  \
    {                                                     \
      (Cur).first_line   = YYRHSLOC(Rhs, 1).first_line;   \
      (Cur).first_column = YYRHSLOC(Rhs, 1).first_column; \
      (Cur).last_line    = YYRHSLOC(Rhs, N).last_line;    \
      (Cur).last_column  = YYRHSLOC(Rhs, N).last_column;  \
    }                                                     \
  else                                                    \
    {                                                     \
      (Cur).first_line   = (Cur).last_line   =            \
        YYRHSLOC(Rhs, 0).last_line;                       \
      (Cur).first_column = (Cur).last_column =            \
        YYRHSLOC(Rhs, 0).last_column;                     \
    }                                                     \
while (0)

where YYRHSLOC (rhs, k) is the location of the kth symbol in rhs when k is positive, and the location of the symbol just before the reduction when k and n are both zero.

When defining YYLLOC_DEFAULT, you should consider that:


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

3.6 Named References

As described in the preceding sections, the traditional way to refer to any semantic value or location is a positional reference, which takes the form $n, $$, @n, and @$. However, such a reference is not very descriptive. Moreover, if you later decide to insert or remove symbols in the right-hand side of a grammar rule, the need to renumber such references can be tedious and error-prone.

To avoid these issues, you can also refer to a semantic value or location using a named reference. First of all, original symbol names may be used as named references. For example:

 
invocation: op '(' args ')'
  { $invocation = new_invocation ($op, $args, @invocation); }

Positional and named references can be mixed arbitrarily. For example:

 
invocation: op '(' args ')'
  { $$ = new_invocation ($op, $args, @$); }

However, sometimes regular symbol names are not sufficient due to ambiguities:

 
exp: exp '/' exp
  { $exp = $exp / $exp; } // $exp is ambiguous.

exp: exp '/' exp
  { $$ = $1 / $exp; } // One usage is ambiguous.

exp: exp '/' exp
  { $$ = $1 / $3; } // No error.

When ambiguity occurs, explicitly declared names may be used for values and locations. Explicit names are declared as a bracketed name after a symbol appearance in rule definitions. For example:

 
exp[result]: exp[left] '/' exp[right]
  { $result = $left / $right; }

In order to access a semantic value generated by a mid-rule action, an explicit name may also be declared by putting a bracketed name after the closing brace of the mid-rule action code:

 
exp[res]: exp[x] '+' {$left = $x;}[left] exp[right]
  { $res = $left + $right; }

In references, in order to specify names containing dots and dashes, an explicit bracketed syntax $[name] and @[name] must be used:

 
if-stmt: "if" '(' expr ')' "then" then.stmt ';'
  { $[if-stmt] = new_if_stmt ($expr, $[then.stmt]); }

It often happens that named references are followed by a dot, dash or other C punctuation marks and operators. By default, Bison will read ‘$name.suffix’ as a reference to symbol value $name followed by ‘.suffix’, i.e., an access to the suffix field of the semantic value. In order to force Bison to recognize ‘name.suffix’ in its entirety as the name of a semantic value, the bracketed syntax ‘$[name.suffix]’ must be used.

The named references feature is experimental. More user feedback will help to stabilize it.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

3.7 Bison Declarations

The Bison declarations section of a Bison grammar defines the symbols used in formulating the grammar and the data types of semantic values. See section Symbols, Terminal and Nonterminal.

All token type names (but not single-character literal tokens such as '+' and '*') must be declared. Nonterminal symbols must be declared if you need to specify which data type to use for the semantic value (see section More Than One Value Type).

The first rule in the grammar file also specifies the start symbol, by default. If you want some other symbol to be the start symbol, you must declare it explicitly (see section Languages and Context-Free Grammars).


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

3.7.1 Require a Version of Bison

You may require the minimum version of Bison to process the grammar. If the requirement is not met, bison exits with an error (exit status 63).

 
%require "version"

[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

3.7.2 Token Type Names

The basic way to declare a token type name (terminal symbol) is as follows:

 
%token name

Bison will convert this into a #define directive in the parser, so that the function yylex (if it is in this file) can use the name name to stand for this token type’s code.

Alternatively, you can use %left, %right, %precedence, or %nonassoc instead of %token, if you wish to specify associativity and precedence. See section Operator Precedence.

You can explicitly specify the numeric code for a token type by appending a nonnegative decimal or hexadecimal integer value in the field immediately following the token name:

 
%token NUM 300
%token XNUM 0x12d // a GNU extension

It is generally best, however, to let Bison choose the numeric codes for all token types. Bison will automatically select codes that don’t conflict with each other or with normal characters.

In the event that the stack type is a union, you must augment the %token or other token declaration to include the data type alternative delimited by angle-brackets (see section More Than One Value Type).

For example:

 
%union {              /* define stack type */
  double val;
  symrec *tptr;
}
%token <val> NUM      /* define token NUM and its type */

You can associate a literal string token with a token type name by writing the literal string at the end of a %token declaration which declares the name. For example:

 
%token arrow "=>"

For example, a grammar for the C language might specify these names with equivalent literal string tokens:

 
%token  <operator>  OR      "||"
%token  <operator>  LE 134  "<="
%left  OR  "<="

Once you equate the literal string and the token name, you can use them interchangeably in further declarations or the grammar rules. The yylex function can use the token name or the literal string to obtain the token type code number (see section Calling Convention for yylex). Syntax error messages passed to yyerror from the parser will reference the literal string instead of the token name.

The token numbered as 0 corresponds to end of file; the following line allows for nicer error messages referring to “end of file” instead of “$end”:

 
%token END 0 "end of file"

[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

3.7.3 Operator Precedence

Use the %left, %right, %nonassoc, or %precedence declaration to declare a token and specify its precedence and associativity, all at once. These are called precedence declarations. See section Operator Precedence, for general information on operator precedence.

The syntax of a precedence declaration is nearly the same as that of %token: either

 
%left symbols

or

 
%left <type> symbols

And indeed any of these declarations serves the purposes of %token. But in addition, they specify the associativity and relative precedence for all the symbols:

For backward compatibility, there is a confusing difference between the argument lists of %token and precedence declarations. Only a %token can associate a literal string with a token type name. A precedence declaration always interprets a literal string as a reference to a separate token. For example:

 
%left  OR "<="         // Does not declare an alias.
%left  OR 134 "<=" 135 // Declares 134 for OR and 135 for "<=".

[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

3.7.4 Nonterminal Symbols

When you use %union to specify multiple value types, you must declare the value type of each nonterminal symbol for which values are used. This is done with a %type declaration, like this:

 
%type <type> nonterminal

Here nonterminal is the name of a nonterminal symbol, and type is the name given in the %union to the alternative that you want (see section The Union Declaration). You can give any number of nonterminal symbols in the same %type declaration, if they have the same value type. Use spaces to separate the symbol names.

You can also declare the value type of a terminal symbol. To do this, use the same <type> construction in a declaration for the terminal symbol. All kinds of token declarations allow <type>.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

3.7.5 Performing Actions before Parsing

Sometimes your parser needs to perform some initializations before parsing. The %initial-action directive allows for such arbitrary code.

Directive: %initial-action { code }

Declare that the braced code must be invoked before parsing each time yyparse is called. The code may use $$ (or $<tag>$) and @$ — initial value and location of the lookahead — and the %parse-param.

For instance, if your locations use a file name, you may use

 
%parse-param { char const *file_name };
%initial-action
{
  @$.initialize (file_name);
};

[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

3.7.6 Freeing Discarded Symbols

During error recovery (see section Error Recovery), symbols already pushed on the stack and tokens coming from the rest of the file are discarded until the parser falls on its feet. If the parser runs out of memory, or if it returns via YYABORT or YYACCEPT, all the symbols on the stack must be discarded. Even if the parser succeeds, it must discard the start symbol.

When discarded symbols convey heap based information, this memory is lost. While this behavior can be tolerable for batch parsers, such as in traditional compilers, it is unacceptable for programs like shells or protocol implementations that may parse and execute indefinitely.

The %destructor directive defines code that is called when a symbol is automatically discarded.

Directive: %destructor { code } symbols

Invoke the braced code whenever the parser discards one of the symbols. Within code, $$ (or $<tag>$) designates the semantic value associated with the discarded symbol, and @$ designates its location. The additional parser parameters are also available (see section The Parser Function yyparse).

When a symbol is listed among symbols, its %destructor is called a per-symbol %destructor. You may also define a per-type %destructor by listing a semantic type tag among symbols. In that case, the parser will invoke this code whenever it discards any grammar symbol that has that semantic type tag unless that symbol has its own per-symbol %destructor.

Finally, you can define two different kinds of default %destructors. (These default forms are experimental. More user feedback will help to determine whether they should become permanent features.) You can place each of <*> and <> in the symbols list of exactly one %destructor declaration in your grammar file. The parser will invoke the code associated with one of these whenever it discards any user-defined grammar symbol that has no per-symbol and no per-type %destructor. The parser uses the code for <*> in the case of such a grammar symbol for which you have formally declared a semantic type tag (%type counts as such a declaration, but $<tag>$ does not). The parser uses the code for <> in the case of such a grammar symbol that has no declared semantic type tag.

For example:

 
%union { char *string; }
%token <string> STRING1 STRING2
%type  <string> string1 string2
%union { char character; }
%token <character> CHR
%type  <character> chr
%token TAGLESS

%destructor { } <character>
%destructor { free ($$); } <*>
%destructor { free ($$); printf ("%d", @$.first_line); } STRING1 string1
%destructor { printf ("Discarding tagless symbol.\n"); } <>

guarantees that, when the parser discards any user-defined symbol that has a semantic type tag other than <character>, it passes its semantic value to free by default. However, when the parser discards a STRING1 or a string1, it also prints its line number to stdout. It performs only the second %destructor in this case, so it invokes free only once. Finally, the parser merely prints a message whenever it discards any symbol, such as TAGLESS, that has no semantic type tag.

A Bison-generated parser invokes the default %destructors only for user-defined as opposed to Bison-defined symbols. For example, the parser will not invoke either kind of default %destructor for the special Bison-defined symbols $accept, $undefined, or $end (see section Bison Symbols), none of which you can reference in your grammar. It also will not invoke either for the error token (see section error), which is always defined by Bison regardless of whether you reference it in your grammar. However, it may invoke one of them for the end token (token 0) if you redefine it from $end to, for example, END:

 
%token END 0

Finally, Bison will never invoke a %destructor for an unreferenced mid-rule semantic value (see section Actions in Mid-Rule). That is, Bison does not consider a mid-rule to have a semantic value if you do not reference $$ in the mid-rule’s action or $n (where n is the right-hand side symbol position of the mid-rule) in any later action in that rule. However, if you do reference either, the Bison-generated parser will invoke the <> %destructor whenever it discards the mid-rule symbol.


Discarded symbols are the following:

The parser can return immediately because of an explicit call to YYABORT or YYACCEPT, or failed error recovery, or memory exhaustion.

Right-hand side symbols of a rule that explicitly triggers a syntax error via YYERROR are not discarded automatically. As a rule of thumb, destructors are invoked only when user actions cannot manage the memory.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

3.7.7 Printing Semantic Values

When run-time traces are enabled (see section Tracing Your Parser), the parser reports its actions, such as reductions. When a symbol involved in an action is reported, only its kind is displayed, as the parser cannot know how semantic values should be formatted.

The %printer directive defines code that is called when a symbol is reported. Its syntax is the same as %destructor (see section Freeing Discarded Symbols).

Directive: %printer { code } symbols

Invoke the braced code whenever the parser displays one of the symbols. Within code, yyoutput denotes the output stream (a FILE* in C, and an std::ostream& in C++), $$ (or $<tag>$) designates the semantic value associated with the symbol, and @$ its location. The additional parser parameters are also available (see section The Parser Function yyparse).

The symbols are defined as for %destructor (see section Freeing Discarded Symbols.): they can be per-type (e.g., ‘<ival>’), per-symbol (e.g., ‘exp’, ‘NUM’, ‘"float"’), typed per-default (i.e., ‘<*>’, or untyped per-default (i.e., ‘<>’).

For example:

 
%union { char *string; }
%token <string> STRING1 STRING2
%type  <string> string1 string2
%union { char character; }
%token <character> CHR
%type  <character> chr
%token TAGLESS

%printer { fprintf (yyoutput, "'%c'", $$); } <character>
%printer { fprintf (yyoutput, "&%p", $$); } <*>
%printer { fprintf (yyoutput, "\"%s\"", $$); } STRING1 string1
%printer { fprintf (yyoutput, "<>"); } <>

guarantees that, when the parser print any symbol that has a semantic type tag other than <character>, it display the address of the semantic value by default. However, when the parser displays a STRING1 or a string1, it formats it as a string in double quotes. It performs only the second %printer in this case, so it prints only once. Finally, the parser print ‘<>’ for any symbol, such as TAGLESS, that has no semantic type tag. See also


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

3.7.8 Suppressing Conflict Warnings

Bison normally warns if there are any conflicts in the grammar (see section Shift/Reduce Conflicts), but most real grammars have harmless shift/reduce conflicts which are resolved in a predictable way and would be difficult to eliminate. It is desirable to suppress the warning about these conflicts unless the number of conflicts changes. You can do this with the %expect declaration.

The declaration looks like this:

 
%expect n

Here n is a decimal integer. The declaration says there should be n shift/reduce conflicts and no reduce/reduce conflicts. Bison reports an error if the number of shift/reduce conflicts differs from n, or if there are any reduce/reduce conflicts.

For deterministic parsers, reduce/reduce conflicts are more serious, and should be eliminated entirely. Bison will always report reduce/reduce conflicts for these parsers. With GLR parsers, however, both kinds of conflicts are routine; otherwise, there would be no need to use GLR parsing. Therefore, it is also possible to specify an expected number of reduce/reduce conflicts in GLR parsers, using the declaration:

 
%expect-rr n

In general, using %expect involves these steps:

Now Bison will report an error if you introduce an unexpected conflict, but will keep silent otherwise.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

3.7.9 The Start-Symbol

Bison assumes by default that the start symbol for the grammar is the first nonterminal specified in the grammar specification section. The programmer may override this restriction with the %start declaration as follows:

 
%start symbol

[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

3.7.10 A Pure (Reentrant) Parser

A reentrant program is one which does not alter in the course of execution; in other words, it consists entirely of pure (read-only) code. Reentrancy is important whenever asynchronous execution is possible; for example, a nonreentrant program may not be safe to call from a signal handler. In systems with multiple threads of control, a nonreentrant program must be called only within interlocks.

Normally, Bison generates a parser which is not reentrant. This is suitable for most uses, and it permits compatibility with Yacc. (The standard Yacc interfaces are inherently nonreentrant, because they use statically allocated variables for communication with yylex, including yylval and yylloc.)

Alternatively, you can generate a pure, reentrant parser. The Bison declaration ‘%define api.pure’ says that you want the parser to be reentrant. It looks like this:

 
%define api.pure full

The result is that the communication variables yylval and yylloc become local variables in yyparse, and a different calling convention is used for the lexical analyzer function yylex. See section Calling Conventions for Pure Parsers, for the details of this. The variable yynerrs becomes local in yyparse in pull mode but it becomes a member of yypstate in push mode. (see section The Error Reporting Function yyerror). The convention for calling yyparse itself is unchanged.

Whether the parser is pure has nothing to do with the grammar rules. You can generate either a pure parser or a nonreentrant parser from any valid grammar.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

3.7.11 A Push Parser

(The current push parsing interface is experimental and may evolve. More user feedback will help to stabilize it.)

A pull parser is called once and it takes control until all its input is completely parsed. A push parser, on the other hand, is called each time a new token is made available.

A push parser is typically useful when the parser is part of a main event loop in the client’s application. This is typically a requirement of a GUI, when the main event loop needs to be triggered within a certain time period.

Normally, Bison generates a pull parser. The following Bison declaration says that you want the parser to be a push parser (see section api.push-pull):

 
%define api.push-pull push

In almost all cases, you want to ensure that your push parser is also a pure parser (see section A Pure (Reentrant) Parser). The only time you should create an impure push parser is to have backwards compatibility with the impure Yacc pull mode interface. Unless you know what you are doing, your declarations should look like this:

 
%define api.pure full
%define api.push-pull push

There is a major notable functional difference between the pure push parser and the impure push parser. It is acceptable for a pure push parser to have many parser instances, of the same type of parser, in memory at the same time. An impure push parser should only use one parser at a time.

When a push parser is selected, Bison will generate some new symbols in the generated parser. yypstate is a structure that the generated parser uses to store the parser’s state. yypstate_new is the function that will create a new parser instance. yypstate_delete will free the resources associated with the corresponding parser instance. Finally, yypush_parse is the function that should be called whenever a token is available to provide the parser. A trivial example of using a pure push parser would look like this:

 
int status;
yypstate *ps = yypstate_new ();
do {
  status = yypush_parse (ps, yylex (), NULL);
} while (status == YYPUSH_MORE);
yypstate_delete (ps);

If the user decided to use an impure push parser, a few things about the generated parser will change. The yychar variable becomes a global variable instead of a variable in the yypush_parse function. For this reason, the signature of the yypush_parse function is changed to remove the token as a parameter. A nonreentrant push parser example would thus look like this:

 
extern int yychar;
int status;
yypstate *ps = yypstate_new ();
do {
  yychar = yylex ();
  status = yypush_parse (ps);
} while (status == YYPUSH_MORE);
yypstate_delete (ps);

That’s it. Notice the next token is put into the global variable yychar for use by the next invocation of the yypush_parse function.

Bison also supports both the push parser interface along with the pull parser interface in the same generated parser. In order to get this functionality, you should replace the ‘%define api.push-pull push’ declaration with the ‘%define api.push-pull both’ declaration. Doing this will create all of the symbols mentioned earlier along with the two extra symbols, yyparse and yypull_parse. yyparse can be used exactly as it normally would be used. However, the user should note that it is implemented in the generated parser by calling yypull_parse. This makes the yyparse function that is generated with the ‘%define api.push-pull both’ declaration slower than the normal yyparse function. If the user calls the yypull_parse function it will parse the rest of the input stream. It is possible to yypush_parse tokens to select a subgrammar and then yypull_parse the rest of the input stream. If you would like to switch back and forth between between parsing styles, you would have to write your own yypull_parse function that knows when to quit looking for input. An example of using the yypull_parse function would look like this:

 
yypstate *ps = yypstate_new ();
yypull_parse (ps); /* Will call the lexer */
yypstate_delete (ps);

Adding the ‘%define api.pure’ declaration does exactly the same thing to the generated parser with ‘%define api.push-pull both’ as it did for ‘%define api.push-pull push’.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

3.7.12 Bison Declaration Summary

Here is a summary of the declarations used to define a grammar:

Directive: %union

Declare the collection of data types that semantic values may have (see section The Union Declaration).

Directive: %token

Declare a terminal symbol (token type name) with no precedence or associativity specified (see section Token Type Names).

Directive: %right

Declare a terminal symbol (token type name) that is right-associative (see section Operator Precedence).

Directive: %left

Declare a terminal symbol (token type name) that is left-associative (see section Operator Precedence).

Directive: %nonassoc

Declare a terminal symbol (token type name) that is nonassociative (see section Operator Precedence). Using it in a way that would be associative is a syntax error.

Directive: %type

Declare the type of semantic values for a nonterminal symbol (see section Nonterminal Symbols).

Directive: %start

Specify the grammar’s start symbol (see section The Start-Symbol).

Directive: %expect

Declare the expected number of shift-reduce conflicts (see section Suppressing Conflict Warnings).


In order to change the behavior of bison, use the following directives:

Directive: %code {code}
Directive: %code qualifier {code}

Insert code verbatim into the output parser source at the default location or at the location specified by qualifier. See section %code Summary.

Directive: %debug

Instrument the parser for traces. Obsoleted by ‘%define parse.trace’. See section Tracing Your Parser.

Directive: %define variable
Directive: %define variable value
Directive: %define variable {value}
Directive: %define variable "value"

Define a variable to adjust Bison’s behavior. See section %define Summary.

Directive: %defines

Write a parser header file containing macro definitions for the token type names defined in the grammar as well as a few other declarations. If the parser implementation file is named ‘name.c’ then the parser header file is named ‘name.h’.

For C parsers, the parser header file declares YYSTYPE unless YYSTYPE is already defined as a macro or you have used a <type> tag without using %union. Therefore, if you are using a %union (see section More Than One Value Type) with components that require other definitions, or if you have defined a YYSTYPE macro or type definition (see section Data Types of Semantic Values), you need to arrange for these definitions to be propagated to all modules, e.g., by putting them in a prerequisite header that is included both by your parser and by any other module that needs YYSTYPE.

Unless your parser is pure, the parser header file declares yylval as an external variable. See section A Pure (Reentrant) Parser.

If you have also used locations, the parser header file declares YYLTYPE and yylloc using a protocol similar to that of the YYSTYPE macro and yylval. See section Tracking Locations.

This parser header file is normally essential if you wish to put the definition of yylex in a separate source file, because yylex typically needs to be able to refer to the above-mentioned declarations and to the token type codes. See section Semantic Values of Tokens.

If you have declared %code requires or %code provides, the output header also contains their code. See section %code Summary.

The generated header is protected against multiple inclusions with a C preprocessor guard: ‘YY_PREFIX_FILE_INCLUDED’, where PREFIX and FILE are the prefix (see section Multiple Parsers in the Same Program) and generated file name turned uppercase, with each series of non alphanumerical characters converted to a single underscore.

For instance with ‘%define api.prefix {calc}’ and ‘%defines "lib/parse.h"’, the header will be guarded as follows.

 
#ifndef YY_CALC_LIB_PARSE_H_INCLUDED
# define YY_CALC_LIB_PARSE_H_INCLUDED
...
#endif /* ! YY_CALC_LIB_PARSE_H_INCLUDED */
Directive: %defines defines-file

Same as above, but save in the file ‘defines-file’.

Directive: %destructor

Specify how the parser should reclaim the memory associated to discarded symbols. See section Freeing Discarded Symbols.

Directive: %file-prefix "prefix"

Specify a prefix to use for all Bison output file names. The names are chosen as if the grammar file were named ‘prefix.y’.

Directive: %language "language"

Specify the programming language for the generated parser. Currently supported languages include C, C++, and Java. language is case-insensitive.

Directive: %locations

Generate the code processing the locations (see section Special Features for Use in Actions). This mode is enabled as soon as the grammar uses the special ‘@n’ tokens, but if your grammar does not use it, using ‘%locations’ allows for more accurate syntax error messages.

Directive: %name-prefix "prefix"

Rename the external symbols used in the parser so that they start with prefix instead of ‘yy’. The precise list of symbols renamed in C parsers is yyparse, yylex, yyerror, yynerrs, yylval, yychar, yydebug, and (if locations are used) yylloc. If you use a push parser, yypush_parse, yypull_parse, yypstate, yypstate_new and yypstate_delete will also be renamed. For example, if you use ‘%name-prefix "c_"’, the names become c_parse, c_lex, and so on. For C++ parsers, see the ‘%define api.namespace’ documentation in this section. See section Multiple Parsers in the Same Program.

Directive: %no-lines

Don’t generate any #line preprocessor commands in the parser implementation file. Ordinarily Bison writes these commands in the parser implementation file so that the C compiler and debuggers will associate errors and object code with your source file (the grammar file). This directive causes them to associate errors with the parser implementation file, treating it as an independent source file in its own right.

Directive: %output "file"

Generate the parser implementation in ‘file’.

Directive: %pure-parser

Deprecated version of ‘%define api.pure’ (see section api.pure), for which Bison is more careful to warn about unreasonable usage.

Directive: %require "version"

Require version version or higher of Bison. See section Require a Version of Bison.

Directive: %skeleton "file"

Specify the skeleton to use.

If file does not contain a /, file is the name of a skeleton file in the Bison installation directory. If it does, file is an absolute file name or a file name relative to the directory of the grammar file. This is similar to how most shells resolve commands.

Directive: %token-table

Generate an array of token names in the parser implementation file. The name of the array is yytname; yytname[i] is the name of the token whose internal Bison token code number is i. The first three elements of yytname correspond to the predefined tokens "$end", "error", and "$undefined"; after these come the symbols defined in the grammar file.

The name in the table includes all the characters needed to represent the token in Bison. For single-character literals and literal strings, this includes the surrounding quoting characters and any escape sequences. For example, the Bison single-character literal '+' corresponds to a three-character name, represented in C as "'+'"; and the Bison two-character literal string "\\/" corresponds to a five-character name, represented in C as "\"\\\\/\"".

When you specify %token-table, Bison also generates macro definitions for macros YYNTOKENS, YYNNTS, and YYNRULES, and YYNSTATES:

YYNTOKENS

The highest token number, plus one.

YYNNTS

The number of nonterminal symbols.

YYNRULES

The number of grammar rules,

YYNSTATES

The number of parser states (see section Parser States).

Directive: %verbose

Write an extra output file containing verbose descriptions of the parser states and what is done for each type of lookahead token in that state. See section Understanding Your Parser, for more information.

Directive: %yacc

Pretend the option ‘--yacc’ was given, i.e., imitate Yacc, including its naming conventions. See section Bison Options, for more.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

3.7.13 %define Summary

There are many features of Bison’s behavior that can be controlled by assigning the feature a single value. For historical reasons, some such features are assigned values by dedicated directives, such as %start, which assigns the start symbol. However, newer such features are associated with variables, which are assigned by the %define directive:

Directive: %define variable
Directive: %define variable value
Directive: %define variable {value}
Directive: %define variable "value"

Define variable to value.

The type of the values depend on the syntax. Braces denote value in the target language (e.g., a namespace, a type, etc.). Keyword values (no delimiters) denote finite choice (e.g., a variation of a feature). String values denote remaining cases (e.g., a file name).

It is an error if a variable is defined by %define multiple times, but see -D name[=value].

The rest of this section summarizes variables and values that %define accepts.

Some variables take Boolean values. In this case, Bison will complain if the variable definition does not meet one of the following four conditions:

  1. value is true
  2. value is omitted (or "" is specified). This is equivalent to true.
  3. value is false.
  4. variable is never defined. In this case, Bison selects a default value.

What variables are accepted, as well as their meanings and default values, depend on the selected target language and/or the parser skeleton (see section %language, see section %skeleton). Unaccepted variables produce an error. Some of the accepted variables are described below.

Directive: %define api.namespace {namespace}
Directive: %define api.location.type {type}
Directive: %define api.prefix {prefix}
Directive: %define api.pure purity
Directive: %define api.push-pull kind
Directive: %define api.token.constructor
Directive: %define api.token.prefix {prefix}
Directive: %define api.value.type support
Directive: %define api.value.type {type}
Directive: %define location_type

Obsoleted by api.location.type since Bison 2.7.

Directive: %define lr.default-reduction when
Directive: %define lr.keep-unreachable-state
Directive: %define lr.type type
Directive: %define namespace {namespace}

Obsoleted by api.namespace

Directive: %define parse.assert
Directive: %define parse.error verbosity
Directive: %define parse.lac when
Directive: %define parse.trace

[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

3.7.14 %code Summary

The %code directive inserts code verbatim into the output parser source at any of a predefined set of locations. It thus serves as a flexible and user-friendly alternative to the traditional Yacc prologue, %{code%}. This section summarizes the functionality of %code for the various target languages supported by Bison. For a detailed discussion of how to use %code in place of %{code%} for C/C++ and why it is advantageous to do so, see section Prologue Alternatives.

Directive: %code {code}

This is the unqualified form of the %code directive. It inserts code verbatim at a language-dependent default location in the parser implementation.

For C/C++, the default location is the parser implementation file after the usual contents of the parser header file. Thus, the unqualified form replaces %{code%} for most purposes.

For Java, the default location is inside the parser class.

Directive: %code qualifier {code}

This is the qualified form of the %code directive. qualifier identifies the purpose of code and thus the location(s) where Bison should insert it. That is, if you need to specify location-sensitive code that does not belong at the default location selected by the unqualified %code form, use this form instead.

For any particular qualifier or for the unqualified form, if there are multiple occurrences of the %code directive, Bison concatenates the specified code in the order in which it appears in the grammar file.

Not all qualifiers are accepted for all target languages. Unaccepted qualifiers produce an error. Some of the accepted qualifiers are:

requires
provides
top
imports

Though we say the insertion locations are language-dependent, they are technically skeleton-dependent. Writers of non-standard skeletons however should choose their locations consistently with the behavior of the standard Bison skeletons.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

3.8 Multiple Parsers in the Same Program

Most programs that use Bison parse only one language and therefore contain only one Bison parser. But what if you want to parse more than one language with the same program? Then you need to avoid name conflicts between different definitions of functions and variables such as yyparse, yylval. To use different parsers from the same compilation unit, you also need to avoid conflicts on types and macros (e.g., YYSTYPE) exported in the generated header.

The easy way to do this is to define the %define variable api.prefix. With different api.prefixs it is guaranteed that headers do not conflict when included together, and that compiled objects can be linked together too. Specifying ‘%define api.prefix {prefix}’ (or passing the option ‘-Dapi.prefix={prefix}’, see Invoking Bison) renames the interface functions and variables of the Bison parser to start with prefix instead of ‘yy’, and all the macros to start by PREFIX (i.e., prefix upper-cased) instead of ‘YY’.

The renamed symbols include yyparse, yylex, yyerror, yynerrs, yylval, yylloc, yychar and yydebug. If you use a push parser, yypush_parse, yypull_parse, yypstate, yypstate_new and yypstate_delete will also be renamed. The renamed macros include YYSTYPE, YYLTYPE, and YYDEBUG, which is treated specifically — more about this below.

For example, if you use ‘%define api.prefix {c}’, the names become cparse, clex, …, CSTYPE, CLTYPE, and so on.

The %define variable api.prefix works in two different ways. In the implementation file, it works by adding macro definitions to the beginning of the parser implementation file, defining yyparse as prefixparse, and so on:

 
#define YYSTYPE CTYPE
#define yyparse cparse
#define yylval  clval
...
YYSTYPE yylval;
int yyparse (void);

This effectively substitutes one name for the other in the entire parser implementation file, thus the “original” names (yylex, YYSTYPE, …) are also usable in the parser implementation file.

However, in the parser header file, the symbols are defined renamed, for instance:

 
extern CSTYPE clval;
int cparse (void);

The macro YYDEBUG is commonly used to enable the tracing support in parsers. To comply with this tradition, when api.prefix is used, YYDEBUG (not renamed) is used as a default value:

 
/* Debug traces.  */
#ifndef CDEBUG
# if defined YYDEBUG
#  if YYDEBUG
#   define CDEBUG 1
#  else
#   define CDEBUG 0
#  endif
# else
#  define CDEBUG 0
# endif
#endif
#if CDEBUG
extern int cdebug;
#endif


Prior to Bison 2.6, a feature similar to api.prefix was provided by the obsolete directive %name-prefix (see section Bison Symbols) and the option --name-prefix (see section Bison Options).


[ << ] [ >> ]           [Top] [Contents] [Index] [ ? ]

This document was generated by Rick Perry on December 29, 2013 using texi2html 1.82.