[ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
10.1.1 C++ Bison Interface | Asking for C++ parser generation | |
10.1.2 C++ Semantic Values | %union vs. C++ | |
10.1.3 C++ Location Values | The position and location classes | |
10.1.4 C++ Parser Interface | Instantiating and running the parser | |
10.1.5 C++ Scanner Interface | Exchanges between yylex and parse | |
10.1.6 A Complete C++ Example | Demonstrating their use |
[ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
The C++ deterministic parser is selected using the skeleton directive, ‘%skeleton "lalr1.cc"’, or the synonymous command-line option ‘--skeleton=lalr1.cc’. See section Bison Declaration Summary.
When run, bison
will create several entities in the ‘yy’
namespace.
Use the ‘%define api.namespace’ directive to change the namespace name,
see api.namespace. The various classes are generated
in the following files:
The definition of the classes position
and location
, used for
location tracking when enabled. These files are not generated if the
%define
variable api.location.type
is defined. See section C++ Location Values.
An auxiliary class stack
used by the parser.
(Assuming the extension of the grammar file was ‘.yy’.) The declaration and implementation of the C++ parser class. The basename and extension of these two files follow the same rules as with regular C parsers (see section Invoking Bison).
The header is mandatory; you must either pass
‘-d’/‘--defines’ to bison
, or use the
‘%defines’ directive.
All these files are documented using Doxygen; run doxygen
for a complete and accurate documentation.
[ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
Bison supports two different means to handle semantic values in C++. One is alike the C interface, and relies on unions (see section C++ Unions). As C++ practitioners know, unions are inconvenient in C++, therefore another approach is provided, based on variants (see section C++ Variants).
10.1.2.1 C++ Unions | Semantic values cannot be objects | |
10.1.2.2 C++ Variants | Using objects as semantic values |
[ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
The %union
directive works as for C, see The Union Declaration. In particular it produces a genuine
union
, which have a few specific features in C++.
YYSTYPE
is defined but its use is discouraged: rather
you should refer to the parser’s encapsulated type
yy::parser::semantic_type
.
Because objects have to be stored via pointers, memory is not
reclaimed automatically: using the %destructor
directive is the
only means to avoid leaks. See section Freeing Discarded Symbols.
[ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
Bison provides a variant based implementation of semantic values for C++. This alleviates all the limitations reported in the previous section, and in particular, object types can be used without pointers.
To enable variant-based semantic values, set %define
variable
variant
(see section variant). Once this defined,
%union
is ignored, and instead of using the name of the fields of the
%union
to “type” the symbols, use genuine types.
For instance, instead of
%union { int ival; std::string* sval; } %token <ival> NUMBER; %token <sval> STRING; |
write
%token <int> NUMBER; %token <std::string> STRING; |
STRING
is no longer a pointer, which should fairly simplify the user
actions in the grammar and in the scanner (in particular the memory
management).
Since C++ features destructors, and since it is customary to specialize
operator<<
to support uniform printing of values, variants also
typically simplify Bison printers and destructors.
Variants are stricter than unions. When based on unions, you may play any
dirty game with yylval
, say storing an int
, reading a
char*
, and then storing a double
in it. This is no longer
possible with variants: they must be initialized, then assigned to, and
eventually, destroyed.
Initialize, but leave empty. Returns the address where the actual value may be stored. Requires that the variant was not initialized yet.
Initialize, and copy-construct from t.
Warning: We do not use Boost.Variant, for two reasons. First, it
appeared unacceptable to require Boost on the user’s machine (i.e., the
machine on which the generated parser will be compiled, not the machine on
which bison
was run). Second, for each possible semantic value,
Boost.Variant not only stores the value, but also a tag specifying its
type. But the parser already “knows” the type of the semantic value, so
that would be duplicating the information.
Therefore we developed light-weight variants whose type tag is external (so
they are really like unions
for C++ actually). But our code is much
less mature that Boost.Variant. So there is a number of limitations in
(the current implementation of) variants:
double
is the most demanding
type on all platforms, alignments are enforced for double
whatever
types are actually used. This may waste space in some cases.
As far as we know, these limitations can be alleviated. All it takes is some time and/or some talented C++ hacker willing to contribute to Bison.
[ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
When the directive %locations
is used, the C++ parser supports
location tracking, see Tracking Locations.
By default, two auxiliary classes define a position
, a single point
in a file, and a location
, a range composed of a pair of
position
s (possibly spanning several files). But if the
%define
variable api.location.type
is defined, then these
classes will not be generated, and the user defined type will be used.
In this section uint
is an abbreviation for unsigned int
: in
genuine code only the latter is used.
10.1.3.1 C++ position | One point in the source file | |
10.1.3.2 C++ location | Two points in the source file | |
10.1.3.3 User Defined Location Type | Required interface for locations |
[ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
position
Create a position
denoting a given point. Note that file
is
not reclaimed when the position
is destroyed: memory managed must be
handled elsewhere.
Reset the position to the given values.
The name of the file. It will always be handled as a pointer, the parser will never duplicate nor deallocate it. As an experimental feature you may change it to ‘type*’ using ‘%define filename_type "type"’.
The line, starting at 1.
If height is not null, advance by height lines, resetting the column number. The resulting line number cannot be less than 1.
The column, starting at 1.
Advance by width columns, without changing the line number. The resulting column number cannot be less than 1.
Various forms of syntactic sugar for columns
.
Whether *this
and that
denote equal/different positions.
Report p on o like this: ‘file:line.column’, or ‘line.column’ if file is null.
[ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
location
Create a Location
from the endpoints of the range.
Create a Location
denoting an empty range located at a given point.
Reset the location to an empty range at the given values.
The first, inclusive, position of the range, and the first beyond.
Forwarded to the end
position.
Various forms of syntactic sugar.
Move begin
onto end
.
Whether *this
and that
denote equal/different ranges of
positions.
Report p on o, taking care of special cases such as: no
filename
defined, or equal filename/line or column.
[ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
Instead of using the built-in types you may use the %define
variable
api.location.type
to specify your own type:
%define api.location.type {LocationType} |
The requirements over your LocationType are:
@$
in a reduction, the
parser basically runs
@$.begin = @$1.begin; @$.end = @$N.end; // The location of last right-hand side symbol. |
so there must be copyable begin
and end
members;
In programs with several C++ parsers, you may also use the %define
variable api.location.type
to share a common set of built-in
definitions for position
and location
. For instance, one
parser ‘master/parser.yy’ might use:
%defines %locations %define api.namespace {master::} |
to generate the ‘master/position.hh’ and ‘master/location.hh’ files, reused by other parsers as follows:
%define api.location.type {master::location} %code requires { #include <master/location.hh> } |
[ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
The output files ‘output.hh’ and ‘output.cc’
declare and define the parser class in the namespace yy
. The
class name defaults to parser
, but may be changed using
‘%define parser_class_name {name}’. The interface of
this class is detailed below. It can be extended using the
%parse-param
feature: its semantics is slightly changed since
it describes an additional member of the parser class, and an
additional argument for its constructor.
The types for semantic values and locations (if enabled).
A structure that contains (only) the yytokentype
enumeration, which
defines the tokens. To refer to the token FOO
,
use yy::parser::token::FOO
. The scanner can use
‘typedef yy::parser::token token;’ to “import” the token enumeration
(see section Calc++ Scanner).
This class derives from std::runtime_error
. Throw instances of it
from the scanner or from the user actions to raise parse errors. This is
equivalent with first
invoking error
to report the location and message of the syntax
error, and then to invoke YYERROR
to enter the error-recovery mode.
But contrary to YYERROR
which can only be invoked from user actions
(i.e., written in the action itself), the exception can be thrown from
function invoked from the user action.
Build a new parser object. There are no arguments by default, unless ‘%parse-param {type1 arg1}’ was used.
Instantiate a syntax-error exception.
Run the syntactic analysis, and return 0 on success, 1 otherwise.
The whole function is wrapped in a try
/catch
block, so that
when an exception is thrown, the %destructor
s are called to release
the lookahead symbol, and the symbols pushed on the stack.
Get or set the stream used for tracing the parsing. It defaults to
std::cerr
.
Get or set the tracing level. Currently its value is either 0, no trace, or nonzero, full tracing.
The definition for this member function must be supplied by the user: the parser uses it to report a parser error occurring at l, described by m. If location tracking is not enabled, the second signature is used.
[ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
The parser invokes the scanner by calling yylex
. Contrary to C
parsers, C++ parsers are always pure: there is no point in using the
‘%define api.pure’ directive. The actual interface with yylex
depends whether you use unions, or variants.
10.1.5.1 Split Symbols | Passing symbols as two/three components | |
10.1.5.2 Complete Symbols | Making symbols a whole |
[ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
The interface is as follows.
Return the next token. Its type is the return value, its semantic value and location (if enabled) being yylval and yylloc. Invocations of ‘%lex-param {type1 arg1}’ yield additional arguments.
Note that when using variants, the interface for yylex
is the same,
but yylval
is handled differently.
Regular union-based code in Lex scanner typically look like:
[0-9]+ { yylval.ival = text_to_int (yytext); return yy::parser::INTEGER; } [a-z]+ { yylval.sval = new std::string (yytext); return yy::parser::IDENTIFIER; } |
Using variants, yylval
is already constructed, but it is not
initialized. So the code would look like:
[0-9]+ { yylval.build<int>() = text_to_int (yytext); return yy::parser::INTEGER; } [a-z]+ { yylval.build<std::string> = yytext; return yy::parser::IDENTIFIER; } |
or
[0-9]+ { yylval.build(text_to_int (yytext)); return yy::parser::INTEGER; } [a-z]+ { yylval.build(yytext); return yy::parser::IDENTIFIER; } |
[ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
If you specified both %define api.value.type variant
and
%define api.token.constructor
,
the parser
class also defines the class parser::symbol_type
which defines a complete symbol, aggregating its type (i.e., the
traditional value returned by yylex
), its semantic value (i.e., the
value passed in yylval
, and possibly its location (yylloc
).
Build a complete terminal symbol which token type is type, and which semantic value is value. If location tracking is enabled, also pass the location.
This interface is low-level and should not be used for two reasons. First, it is inconvenient, as you still have to build the semantic value, which is a variant, and second, because consistency is not enforced: as with unions, it is still possible to give an integer as semantic value for a string.
So for each token type, Bison generates named constructors as follows.
Build a complete terminal symbol for the token type token (not
including the api.token.prefix
) whose possible semantic value is
value of adequate value_type. If location tracking is enabled,
also pass the location.
For instance, given the following declarations:
%define api.token.prefix {TOK_} %token <std::string> IDENTIFIER; %token <int> INTEGER; %token COLON; |
Bison generates the following functions:
symbol_type make_IDENTIFIER(const std::string& v, const location_type& l); symbol_type make_INTEGER(const int& v, const location_type& loc); symbol_type make_COLON(const location_type& loc); |
which should be used in a Lex-scanner as follows.
[0-9]+ return yy::parser::make_INTEGER(text_to_int (yytext), loc); [a-z]+ return yy::parser::make_IDENTIFIER(yytext, loc); ":" return yy::parser::make_COLON(loc); |
Tokens that do not have an identifier are not accessible: you cannot simply
use characters such as ':'
, they must be declared with %token
.
[ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
This section demonstrates the use of a C++ parser with a simple but complete example. This example should be available on your system, ready to compile, in the directory .../bison/examples/calc++. It focuses on the use of Bison, therefore the design of the various C++ classes is very naive: no accessors, no encapsulation of members etc. We will use a Lex scanner, and more precisely, a Flex scanner, to demonstrate the various interactions. A hand-written scanner is actually easier to interface with.
10.1.6.1 Calc++ — C++ Calculator | The specifications | |
10.1.6.2 Calc++ Parsing Driver | An active parsing context | |
10.1.6.3 Calc++ Parser | A parser class | |
10.1.6.4 Calc++ Scanner | A pure C++ Flex scanner | |
10.1.6.5 Calc++ Top Level | Conducting the band |
[ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
Of course the grammar is dedicated to arithmetics, a single
expression, possibly preceded by variable assignments. An
environment containing possibly predefined variables such as
one
and two
, is exchanged with the parser. An example
of valid input follows.
three := 3 seven := one + two * three seven * seven |
[ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
To support a pure interface with the parser (and the scanner) the technique of the “parsing context” is convenient: a structure containing all the data to exchange. Since, in addition to simply launch the parsing, there are several auxiliary tasks to execute (open the file for parsing, instantiate the parser etc.), we recommend transforming the simple parsing context structure into a fully blown parsing driver class.
The declaration of this driver class, ‘calc++-driver.hh’, is as follows. The first part includes the CPP guard and imports the required standard library components, and the declaration of the parser class.
#ifndef CALCXX_DRIVER_HH # define CALCXX_DRIVER_HH # include <string> # include <map> # include "calc++-parser.hh" |
Then comes the declaration of the scanning function. Flex expects
the signature of yylex
to be defined in the macro
YY_DECL
, and the C++ parser expects it to be declared. We can
factor both as follows.
// Tell Flex the lexer's prototype ... # define YY_DECL \ yy::calcxx_parser::symbol_type yylex (calcxx_driver& driver) // ... and declare it for the parser's sake. YY_DECL; |
The calcxx_driver
class is then declared with its most obvious
members.
// Conducting the whole scanning and parsing of Calc++. class calcxx_driver { public: calcxx_driver (); virtual ~calcxx_driver (); std::map<std::string, int> variables; int result; |
To encapsulate the coordination with the Flex scanner, it is useful to have member functions to open and close the scanning phase.
// Handling the scanner. void scan_begin (); void scan_end (); bool trace_scanning; |
Similarly for the parser itself.
// Run the parser on file F. // Return 0 on success. int parse (const std::string& f); // The name of the file being parsed. // Used later to pass the file name to the location tracker. std::string file; // Whether parser traces should be generated. bool trace_parsing; |
To demonstrate pure handling of parse errors, instead of simply dumping them on the standard error output, we will pass them to the compiler driver using the following two member functions. Finally, we close the class declaration and CPP guard.
// Error handling. void error (const yy::location& l, const std::string& m); void error (const std::string& m); }; #endif // ! CALCXX_DRIVER_HH |
The implementation of the driver is straightforward. The parse
member function deserves some attention. The error
functions
are simple stubs, they should actually register the located error
messages and set error state.
#include "calc++-driver.hh" #include "calc++-parser.hh" calcxx_driver::calcxx_driver () : trace_scanning (false), trace_parsing (false) { variables["one"] = 1; variables["two"] = 2; } calcxx_driver::~calcxx_driver () { } int calcxx_driver::parse (const std::string &f) { file = f; scan_begin (); yy::calcxx_parser parser (*this); parser.set_debug_level (trace_parsing); int res = parser.parse (); scan_end (); return res; } void calcxx_driver::error (const yy::location& l, const std::string& m) { std::cerr << l << ": " << m << std::endl; } void calcxx_driver::error (const std::string& m) { std::cerr << m << std::endl; } |
[ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
The grammar file ‘calc++-parser.yy’ starts by asking for the C++ deterministic parser skeleton, the creation of the parser header file, and specifies the name of the parser class. Because the C++ skeleton changed several times, it is safer to require the version you designed the grammar for.
%skeleton "lalr1.cc" /* -*- C++ -*- */ %require "3.0.2" %defines %define parser_class_name {calcxx_parser} |
This example will use genuine C++ objects as semantic values, therefore, we
require the variant-based interface. To make sure we properly use it, we
enable assertions. To fully benefit from type-safety and more natural
definition of “symbol”, we enable api.token.constructor
.
%define api.token.constructor %define api.value.type variant %define parse.assert |
Then come the declarations/inclusions needed by the semantic values. Because the parser uses the parsing driver and reciprocally, both would like to include the header of the other, which is, of course, insane. This mutual dependency will be broken using forward declarations. Because the driver’s header needs detailed knowledge about the parser class (in particular its inner types), it is the parser’s header which will use a forward declaration of the driver. See section %code Summary.
%code requires { # include <string> class calcxx_driver; } |
The driver is passed by reference to the parser and to the scanner. This provides a simple but effective pure interface, not relying on global variables.
// The parsing context. %param { calcxx_driver& driver } |
Then we request location tracking, and initialize the first location’s file name. Afterward new locations are computed relatively to the previous locations: the file name will be propagated.
%locations %initial-action { // Initialize the initial location. @$.begin.filename = @$.end.filename = &driver.file; }; |
Use the following two directives to enable parser tracing and verbose error messages. However, verbose error messages can contain incorrect information (see section LAC).
%define parse.trace %define parse.error verbose |
The code between ‘%code {’ and ‘}’ is output in the ‘*.cc’ file; it needs detailed knowledge about the driver.
%code { # include "calc++-driver.hh" } |
The token numbered as 0 corresponds to end of file; the following line
allows for nicer error messages referring to “end of file” instead of
“$end”. Similarly user friendly names are provided for each symbol. To
avoid name clashes in the generated files (see section Calc++ Scanner), prefix
tokens with TOK_
(see section api.token.prefix).
%define api.token.prefix {TOK_} %token END 0 "end of file" ASSIGN ":=" MINUS "-" PLUS "+" STAR "*" SLASH "/" LPAREN "(" RPAREN ")" ; |
Since we use variant-based semantic values, %union
is not used, and
both %type
and %token
expect genuine types, as opposed to type
tags.
%token <std::string> IDENTIFIER "identifier" %token <int> NUMBER "number" %type <int> exp |
No %destructor
is needed to enable memory deallocation during error
recovery; the memory, for strings for instance, will be reclaimed by the
regular destructors. All the values are printed using their
operator<<
(see section Printing Semantic Values).
%printer { yyoutput << $$; } <*>; |
The grammar itself is straightforward (see section Location Tracking Calculator: ltcalc
).
%% %start unit; unit: assignments exp { driver.result = $2; }; assignments: %empty {} | assignments assignment {}; assignment: "identifier" ":=" exp { driver.variables[$1] = $3; }; %left "+" "-"; %left "*" "/"; exp: exp "+" exp { $$ = $1 + $3; } | exp "-" exp { $$ = $1 - $3; } | exp "*" exp { $$ = $1 * $3; } | exp "/" exp { $$ = $1 / $3; } | "(" exp ")" { std::swap ($$, $2); } | "identifier" { $$ = driver.variables[$1]; } | "number" { std::swap ($$, $1); }; %% |
Finally the error
member function registers the errors to the
driver.
void yy::calcxx_parser::error (const location_type& l, const std::string& m) { driver.error (l, m); } |
[ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
The Flex scanner first includes the driver declaration, then the parser’s to get the set of defined tokens.
%{ /* -*- C++ -*- */ # include <cerrno> # include <climits> # include <cstdlib> # include <string> # include "calc++-driver.hh" # include "calc++-parser.hh" // Work around an incompatibility in flex (at least versions // 2.5.31 through 2.5.33): it generates code that does // not conform to C89. See Debian bug 333231 // <http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=333231>. # undef yywrap # define yywrap() 1 // The location of the current token. static yy::location loc; %} |
Because there is no #include
-like feature we don’t need
yywrap
, we don’t need unput
either, and we parse an
actual file, this is not an interactive session with the user.
Finally, we enable scanner tracing.
%option noyywrap nounput batch debug noinput |
Abbreviations allow for more readable rules.
id [a-zA-Z][a-zA-Z_0-9]* int [0-9]+ blank [ \t] |
The following paragraph suffices to track locations accurately. Each
time yylex
is invoked, the begin position is moved onto the end
position. Then when a pattern is matched, its width is added to the end
column. When matching ends of lines, the end
cursor is adjusted, and each time blanks are matched, the begin cursor
is moved onto the end cursor to effectively ignore the blanks
preceding tokens. Comments would be treated equally.
%{ // Code run each time a pattern is matched. # define YY_USER_ACTION loc.columns (yyleng); %} %% %{ // Code run each time yylex is called. loc.step (); %} {blank}+ loc.step (); [\n]+ loc.lines (yyleng); loc.step (); |
The rules are simple. The driver is used to report errors.
"-" return yy::calcxx_parser::make_MINUS(loc); "+" return yy::calcxx_parser::make_PLUS(loc); "*" return yy::calcxx_parser::make_STAR(loc); "/" return yy::calcxx_parser::make_SLASH(loc); "(" return yy::calcxx_parser::make_LPAREN(loc); ")" return yy::calcxx_parser::make_RPAREN(loc); ":=" return yy::calcxx_parser::make_ASSIGN(loc); {int} { errno = 0; long n = strtol (yytext, NULL, 10); if (! (INT_MIN <= n && n <= INT_MAX && errno != ERANGE)) driver.error (loc, "integer is out of range"); return yy::calcxx_parser::make_NUMBER(n, loc); } {id} return yy::calcxx_parser::make_IDENTIFIER(yytext, loc); . driver.error (loc, "invalid character"); <<EOF>> return yy::calcxx_parser::make_END(loc); %% |
Finally, because the scanner-related driver’s member-functions depend on the scanner’s data, it is simpler to implement them in this file.
void calcxx_driver::scan_begin () { yy_flex_debug = trace_scanning; if (file.empty () || file == "-") yyin = stdin; else if (!(yyin = fopen (file.c_str (), "r"))) { error ("cannot open " + file + ": " + strerror(errno)); exit (EXIT_FAILURE); } } void calcxx_driver::scan_end () { fclose (yyin); } |
[ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
The top level file, ‘calc++.cc’, poses no problem.
#include <iostream> #include "calc++-driver.hh" int main (int argc, char *argv[]) { int res = 0; calcxx_driver driver; for (int i = 1; i < argc; ++i) if (argv[i] == std::string ("-p")) driver.trace_parsing = true; else if (argv[i] == std::string ("-s")) driver.trace_scanning = true; else if (!driver.parse (argv[i])) std::cout << driver.result << std::endl; else res = 1; return res; } |
[ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] |
This document was generated by Rick Perry on December 29, 2013 using texi2html 1.82.