A. Appendices

A.1 Makefiles and Flex

In this appendix, we provide tips for writing Makefiles to build your scanners.

In a traditional build environment, we say that the ‘.c’ files are the sources, and the ‘.o’ files are the intermediate files. When using flex, however, the ‘.l’ files are the sources, and the generated ‘.c’ files (along with the ‘.o’ files) are the intermediate files. This requires you to carefully plan your Makefile.

Modern make programs understand that ‘foo.l’ is intended to generate ‘lex.yy.c’ or ‘foo.c’, and will behave accordingly(4)(5). The following Makefile does not explicitly instruct make how to build ‘scan.c’ from ‘scan.l’. Instead, it relies on the implicit rules of the make program to build the intermediate file, ‘scan.c’:

    # Basic Makefile -- relies on implicit rules
    # Creates "myprogram" from "scan.l" and "myprogram.c"
    #
    LEX=flex
    myprogram: scan.o myprogram.o
    scan.o: scan.l

For simple cases, the above may be sufficient. For other cases, you may have to explicitly instruct make how to build your scanner. The following is an example of a Makefile containing explicit rules:

    # Basic Makefile -- provides explicit rules
    # Creates "myprogram" from "scan.l" and "myprogram.c"
    #
    LEX=flex
    myprogram: scan.o myprogram.o
            $(CC) -o $@  $(LDFLAGS) $^

    myprogram.o: myprogram.c
            $(CC) $(CPPFLAGS) $(CFLAGS) -o $@ -c $^

    scan.o: scan.c
            $(CC) $(CPPFLAGS) $(CFLAGS) -o $@ -c $^

    scan.c: scan.l
            $(LEX) $(LFLAGS) -o $@ $^

    clean:
            $(RM) *.o scan.c

Notice in the above example that ‘scan.c’ is in the clean target. This is because we consider the file ‘scan.c’ to be an intermediate file.

Finally, we provide a realistic example of a flex scanner used with a bison parser(6). There is a tricky problem we have to deal with. Since a flex scanner will typically include a header file (e.g., ‘y.tab.h’) generated by the parser, we need to be sure that the header file is generated BEFORE the scanner is compiled. We handle this case in the following example:

    # Makefile example -- scanner and parser.
    # Creates "myprogram" from "scan.l", "parse.y", and "myprogram.c"
    #
    LEX     = flex
    YACC    = bison -y
    YFLAGS  = -d
    objects = scan.o parse.o myprogram.o

    myprogram: $(objects)
    scan.o: scan.l parse.c
    parse.o: parse.y
    myprogram.o: myprogram.c

In the above example, notice the line

    scan.o: scan.l parse.c

which lists the file ‘parse.c’ (the generated parser) as a dependency of ‘scan.o’. We want to ensure that the parser, and with it the header file that bison writes in the same run, is created before the scanner is compiled, and the above line seems to do the trick. Feel free to experiment with your specific implementation of make.

For more details on writing Makefiles, see (make)Top section ‘Top’ in The GNU Make Manual.

A.2 C Scanners with Bison Parsers

This section describes the flex features useful when integrating flex with GNU bison(7). Skip this section if you are not using bison with your scanner. Here we discuss only the flex half of the flex and bison pair. We do not discuss bison in any detail. For more information about generating bison parsers, see (bison)Top section ‘Top’ in the GNU Bison Manual.

A bison-compatible scanner is generated by declaring ‘%option bison-bridge’ or by supplying ‘--bison-bridge’ when invoking flex from the command line. This tells flex that the macro yylval may be used. The data type for yylval, YYSTYPE, is typically defined in a header file, included in section 1 of the flex input file. For a list of the functions and macros available, see bison-functions.

The declaration of yylex becomes,

      int yylex ( YYSTYPE * lvalp, yyscan_t scanner );

If %option bison-locations is specified, then the declaration becomes,

      int yylex ( YYSTYPE * lvalp, YYLTYPE * llocp, yyscan_t scanner );

Note that the macros yylval and yylloc evaluate to pointers. Support for yylloc is optional in bison, so it is optional in flex as well. The following is an example of a flex scanner that is compatible with bison.

    /* Scanner for "C" assignment statements... sort of. */
    %{
    #include <stdlib.h>  /* atoi */
    #include <string.h>  /* strdup */
    #include "y.tab.h"  /* Generated by bison. */
    %}

    %option bison-bridge bison-locations
    %%

    [[:digit:]]+  { yylval->num = atoi(yytext);   return NUMBER;}
    [[:alnum:]]+  { yylval->str = strdup(yytext); return STRING;}
    "="|";"       { return yytext[0];}
    .  {}
    %%

As you can see, there really is no magic here. We just use yylval as we would any other variable. The data type of yylval is generated by bison, and included in the file ‘y.tab.h’. Here is the corresponding bison parser:

    /* Parser to convert "C" assignments to lisp. */
    %{
    /* Pass the argument to yyparse through to yylex. */
    #define YYPARSE_PARAM scanner
    #define YYLEX_PARAM   scanner
    %}
    %locations
    %pure_parser
    %union {
        int num;
        char* str;
    }
    %token <str> STRING
    %token <num> NUMBER
    %%
    assignment:
        STRING '=' NUMBER ';' {
            printf( "(setf %s %d)", $1, $3 );
       }
    ;

A.3 M4 Dependency

The macro processor m4(8) must be installed wherever flex is installed. flex invokes ‘m4’, found by searching the directories in the PATH environment variable. Any code you place in section 1 or in the actions will be sent through m4. Please follow these rules to protect your code from unwanted m4 processing.

Do not use symbols that begin with ‘m4_’, such as ‘m4_define’ or ‘m4_include’, since those are reserved for m4 macro names. If for some reason you need m4_ as a prefix, use a preprocessor #define to get your symbol past m4 unmangled.

Do not use the strings ‘[[’ or ‘]]’ anywhere in your code. The former is not valid in C, except within comments and strings, but the latter is valid in code such as x[y[z]]. The solution is simple. To get the literal string "]]", use "]""]". To get the array notation x[y[z]], use x[y[z] ]. Flex will attempt to detect these sequences in user code and escape them. However, it is best to avoid this complexity where possible by removing such sequences from your code.
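
As a minimal sketch of the two workarounds just described, the following C fragment could appear in section 1 or in an action. The names closing_brackets and nested_lookup are hypothetical and exist only for illustration.

```c
/* Adjacent string literals are concatenated by the C compiler, so this
   array holds two closing brackets even though the source text never
   shows them side by side. */
static const char closing_brackets[] = "]" "]";

/* Hypothetical helper: nested indexing written with a space between the
   closing brackets, so the m4 quote sequence never appears. */
static int nested_lookup(const int *x, const int *y, int z)
{
    return x[y[z] ];
}
```

The compiled result is identical to the unspaced forms; only the source text seen by m4 differs.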

m4 is only required at the time you run flex. The generated scanner is ordinary C or C++, and does not require m4.

A.4 Common Patterns

This appendix provides examples of common regular expressions you might use in your scanner.

A.4.1 Numbers

C99 decimal constant

([[:digit:]]{-}[0])[[:digit:]]*

C99 hexadecimal constant

0[xX][[:xdigit:]]+

C99 octal constant

0[01234567]*
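
The three integer patterns above can be cross-checked against a small hand-written C routine. This is only a sketch: the names int_kind and classify_int are hypothetical, integer suffixes are not handled, and the routine tests a whole string, whereas a flex scanner matches the longest prefix of its input.

```c
#include <ctype.h>
#include <stddef.h>

enum int_kind { INT_NONE, INT_DECIMAL, INT_HEX, INT_OCTAL };

// Classify a complete string as one of the three integer forms:
//   decimal: a nonzero digit followed by any digits
//   hex:     0x or 0X followed by at least one hex digit
//   octal:   0 followed by any octal digits
static enum int_kind classify_int(const char *s)
{
    size_t i = 1;

    if (!isdigit((unsigned char)s[0]))
        return INT_NONE;                 // also rejects the empty string
    if (s[0] != '0') {
        while (isdigit((unsigned char)s[i]))
            i++;
        return s[i] == '\0' ? INT_DECIMAL : INT_NONE;
    }
    if (s[1] == 'x' || s[1] == 'X') {
        i = 2;
        while (isxdigit((unsigned char)s[i]))
            i++;
        return (i > 2 && s[i] == '\0') ? INT_HEX : INT_NONE;
    }
    while (s[i] >= '0' && s[i] <= '7')
        i++;
    return s[i] == '\0' ? INT_OCTAL : INT_NONE;
}
```

Note that a lone "0" is classified as octal, exactly as the octal pattern accepts it.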

C99 floating point constant

dseq      ([[:digit:]]+)
dseq_opt  ([[:digit:]]*)
frac      (({dseq_opt}"."{dseq})|{dseq}".")
exp       ([eE][+-]?{dseq})
exp_opt   ({exp}?)
fsuff     [flFL]
fsuff_opt ({fsuff}?)
hpref     (0[xX])
hdseq     ([[:xdigit:]]+)
hdseq_opt ([[:xdigit:]]*)
hfrac     (({hdseq_opt}"."{hdseq})|({hdseq}"."))
bexp      ([pP][+-]?{dseq})
dfc       (({frac}{exp_opt}{fsuff_opt})|({dseq}{exp}{fsuff_opt}))
hfc       (({hpref}{hfrac}{bexp}{fsuff_opt})|({hpref}{hdseq}{bexp}{fsuff_opt}))

c99_floating_point_constant  ({dfc}|{hfc})

See C99 section 6.4.4.2 for the gory details.

A.4.2 Identifiers

C99 Identifier

ucn        ((\\u([[:xdigit:]]{4}))|(\\U([[:xdigit:]]{8})))
nondigit    [_[:alpha:]]
c99_id     ([_[:alpha:]]|{ucn})([_[:alnum:]]|{ucn})*

Technically, the above pattern does not encompass all possible C99 identifiers, since C99 allows for "implementation-defined" characters. In practice, C compilers follow the above pattern, with the addition of the ‘$’ character.

UTF-8 Encoded Unicode Code Point

[\x09\x0A\x0D\x20-\x7E]|[\xC2-\xDF][\x80-\xBF]|\xE0[\xA0-\xBF][\x80-\xBF]|[\xE1-\xEC\xEE\xEF]([\x80-\xBF]{2})|\xED[\x80-\x9F][\x80-\xBF]|\xF0[\x90-\xBF]([\x80-\xBF]{2})|[\xF1-\xF3]([\x80-\xBF]{3})|\xF4[\x80-\x8F]([\x80-\xBF]{2})
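
The alternatives in this pattern are easier to follow when written out as byte-range checks. The following C function is a sketch of the same acceptance rule, not code taken from flex; the names in_range and utf8_seq_len are hypothetical. Like the pattern, it accepts only tab, line feed, carriage return, and printable characters from the ASCII range.

```c
#include <stddef.h>

static int in_range(unsigned char c, unsigned char lo, unsigned char hi)
{
    return c >= lo && c <= hi;
}

// Return the byte length (1 to 4) of the sequence at s that the pattern
// accepts, or 0 if the first n bytes do not begin with such a sequence.
static size_t utf8_seq_len(const unsigned char *s, size_t n)
{
    unsigned char lo = 0x80, hi = 0xBF;   // allowed range of the second byte
    size_t len, i;

    if (n == 0)
        return 0;
    if (s[0] == 0x09 || s[0] == 0x0A || s[0] == 0x0D ||
        in_range(s[0], 0x20, 0x7E))
        return 1;

    if (in_range(s[0], 0xC2, 0xDF)) {
        len = 2;
    } else if (in_range(s[0], 0xE0, 0xEF)) {
        len = 3;
        if (s[0] == 0xE0) lo = 0xA0;      // excludes overlong encodings
        if (s[0] == 0xED) hi = 0x9F;      // excludes UTF-16 surrogates
    } else if (in_range(s[0], 0xF0, 0xF4)) {
        len = 4;
        if (s[0] == 0xF0) lo = 0x90;      // excludes overlong encodings
        if (s[0] == 0xF4) hi = 0x8F;      // excludes values above U+10FFFF
    } else {
        return 0;
    }

    if (n < len || !in_range(s[1], lo, hi))
        return 0;
    for (i = 2; i < len; i++)
        if (!in_range(s[i], 0x80, 0xBF))
            return 0;
    return len;
}
```

Only the second byte needs a lead-dependent range; every later continuation byte uses the plain 0x80 to 0xBF range, which is why the pattern repeats ‘[\x80-\xBF]’ so often.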

A.4.3 Quoted Constructs

C99 String Literal

L?\"([^\"\\\n]|(\\['\"?\\abfnrtv])|(\\([01234567]{1,3}))|(\\x[[:xdigit:]]+)|(\\u([[:xdigit:]]{4}))|(\\U([[:xdigit:]]{8})))*\"

C99 Comment

("/*"([^*]|"*"+[^*/])*"*"+"/")|("/"(\\\n)*"/"[^\n]*)

Note that in C99, a ‘//’-style comment may be split across lines, and, contrary to popular belief, does not include the trailing ‘\n’ character.

A better way to scan ‘/* */’ comments is by line, rather than matching possibly huge comments all at once. This allows you to scan comments of unlimited length, as long as line breaks appear at sane intervals. It is also more efficient when used with automatic line number processing. See option-yylineno. The example below assumes that COMMENT has been declared as an exclusive start condition (‘%x COMMENT’).

<INITIAL>{
    "/*"      BEGIN(COMMENT);
}
<COMMENT>{
    "*/"      BEGIN(INITIAL);
    [^*\n]+   ;
    "*"       ;
    \n        ;
}
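
For readers who want to see the same idea outside of flex, here is a rough hand-written C equivalent of the comment state. It is a sketch under stated assumptions: the function name skip_comment is hypothetical, the input is a NUL-terminated string, and the caller has already consumed the opening delimiter.

```c
#include <stddef.h>

// Scan the body of a block comment. s points just past the opening
// delimiter. Returns a pointer just past the closing delimiter, or NULL
// if the comment is unterminated. Each newline seen is added to *lines.
static const char *skip_comment(const char *s, int *lines)
{
    for (; *s != '\0'; s++) {
        if (s[0] == '*' && s[1] == '/')
            return s + 2;        // closing delimiter found
        if (*s == '\n')
            ++*lines;            // keep the line count accurate
    }
    return NULL;
}
```

A star that is not followed by a slash is simply skipped, so a comment that ends in several stars before the slash is still terminated correctly.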

A.4.4 Addresses

IPv4 Address

dec-octet     [0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5]
IPv4address   {dec-octet}\.{dec-octet}\.{dec-octet}\.{dec-octet}
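
To make the dec-octet rule concrete, the following C sketch validates a complete string against the same grammar: four octets in the range 0 to 255, with no leading zeros, separated by dots. The names parse_dec_octet and is_ipv4_address are hypothetical and are not part of flex.

```c
#include <ctype.h>

// Parse one dec-octet at *p and advance *p past it.
// Returns 1 on success, 0 if no valid octet starts at *p.
static int parse_dec_octet(const char **p)
{
    const char *s = *p;
    int value = 0, digits = 0;

    while (digits < 3 && isdigit((unsigned char)*s)) {
        value = value * 10 + (*s - '0');
        s++;
        digits++;
    }
    if (digits == 0 || value > 255)
        return 0;
    if (digits > 1 && **p == '0')
        return 0;                // a leading zero is not a dec-octet
    *p = s;
    return 1;
}

// Returns 1 if s consists of exactly four dec-octets joined by dots.
static int is_ipv4_address(const char *s)
{
    int i;

    for (i = 0; i < 4; i++) {
        if (i > 0 && *s++ != '.')
            return 0;
        if (!parse_dec_octet(&s))
            return 0;
    }
    return *s == '\0';
}
```

Unlike the flex pattern, which matches an address wherever it occurs in the input, this function requires the whole string to be an address.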

IPv6 Address

h16           [0-9A-Fa-f]{1,4}
ls32          {h16}:{h16}|{IPv4address}
IPv6address   ({h16}:){6}{ls32}|
              ::({h16}:){5}{ls32}|
              ({h16})?::({h16}:){4}{ls32}|
              (({h16}:){0,1}{h16})?::({h16}:){3}{ls32}|
              (({h16}:){0,2}{h16})?::({h16}:){2}{ls32}|
              (({h16}:){0,3}{h16})?::{h16}:{ls32}|
              (({h16}:){0,4}{h16})?::{ls32}|
              (({h16}:){0,5}{h16})?::{h16}|
              (({h16}:){0,6}{h16})?::

See RFC 2373 for details. Note that you have to fold the definition of IPv6address into one line and that it also matches the “unspecified address” “::”.

URI

(([^:/?#]+):)?("//"([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?

This pattern is nearly useless, since it allows just about any character to appear in a URI, including spaces and control characters. See RFC 2396 for details.

This document was generated by Rick Perry on January 7, 2013 using texi2html 1.82.