re2c



SYNOPSIS
       re2c [-bdDefFghisuvVw1] [-o output] [-c [-t header]] file


DESCRIPTION
       re2c  is a preprocessor that generates C-based recognizers from regular
       expressions.  The input to re2c consists of  C/C++  source  interleaved
       with comments of the form /*!re2c ... */ which contain scanner specifi-
       cations.  In the output these comments are  replaced  with  code  that,
       when  executed,  will  find  the next input token and then execute some
       user-supplied token-specific code.

       For example, given the following code

          char *scan(char *p)
          {
          /*!re2c
                  re2c:define:YYCTYPE  = "unsigned char";
                  re2c:define:YYCURSOR = p;
                  re2c:yyfill:enable   = 0;
                  re2c:yych:conversion = 1;
                  re2c:indent:top      = 1;
                  [0-9]+          {return p;}
                  [^]             {return (char*)0;}
          */
          }

       re2c -is will generate

          /* Generated by re2c on Sat Apr 16 11:40:58 1994 */
          char *scan(char *p)
          {
              {
                  unsigned char yych;

                  yych = (unsigned char)*p;
                  if(yych <= '/') goto yy4;
                  if(yych >= ':') goto yy4;
                  ++p;
                  yych = (unsigned char)*p;
                  goto yy7;
          yy3:
                  {return p;}
          yy4:
                  ++p;
                  yych = (unsigned char)*p;
                  {return (char*)0;}
          yy6:
                  ++p;
                  yych = (unsigned char)*p;
          yy7:
                  if(yych <= '/') goto yy3;

       You can also use /*!ignore:re2c */ blocks, which allow you to document
       the scanner code and are not part of the output.


OPTIONS
       re2c provides the following options:

       -? -h  Display a short help message.

       -b     Implies -s.  Use bit vectors as well in the attempt to coax bet-
              ter code out of the compiler.  Most  useful  for  specifications
              with  more  than  a few keywords (e.g. for most programming lan-
              guages).

       -c     Enable (f)lex-like condition support.

       -d     Creates a parser that dumps information about the current  posi-
              tion  and  in which state the parser is while parsing the input.
              This is useful to debug parser issues and  states.  If  you  use
              this  switch  you  need to define a macro YYDEBUG that is called
              like a function with two  parameters:  void  YYDEBUG(int  state,
              char  current). The first parameter receives the state or -1 and
              the second parameter receives the input at the current cursor.

       -D     Emit Graphviz dot data. It can then be processed with e.g.  "dot
              -Tpng  input.dot  >  output.png". Please note that scanners with
              many states may crash dot.

       -e     Cross-compile from an ASCII platform to an EBCDIC one.

       -f     Generate a scanner with support for storable state.  For details
              see below at SCANNER WITH STORABLE STATES.

       -F     Partial  support  for flex syntax. When this flag is active then
              named definitions must be surrounded by curly braces and can  be
              defined without an equal sign and the terminating semicolon.
              Instead names are treated as direct double quoted strings.

       -g     Generate a scanner that utilizes GCC's computed goto feature.
              That is, re2c generates jump tables whenever a decision is of a
              certain complexity (e.g. a lot of if conditions are otherwise
              necessary). This is only usable with GCC and produces output
              that cannot be compiled with any other compiler. Note that this
              implies -b and that the complexity threshold can be configured
              using the inplace configuration "cgoto:threshold".

       -i     Do not output #line information. This is useful when you want to
              use a CMS tool with the re2c output, which you might want if you
              do not require your users to have re2c themselves when building
              from your source.

       -o output
              Specify the output file.

       -r     Allows  reuse  of  scanner  definitions with '/*!use:re2c' after
              every '/*!use:re2c' block that follows. These blocks can contain
              inplace    configurations,    especially    're2c:flags:w'   and

       -u     Generate  a  parser  that  supports Unicode chars (UTF-32). This
              means the generated code can deal with any valid Unicode charac-
              ter  up  to 0x10FFFF. When UTF-8 or UTF-16 needs to be supported
              you need to convert the incoming stream  to  UTF-32  upon  input
              yourself.

       -v     Show version information.

       -V     Show the version as a number XXYYZZ.

       -w     Create a parser that supports wide chars (UCS-2). This implies
              -s and cannot be used together with the -e switch.

       -1     Force single pass generation. This cannot be combined with -f
              and disables YYMAXFILL generation prior to the last re2c block.

       --no-generation-date
              Suppress  date  output  in  the generated output so that it only
              shows the re2c version.

       --case-insensitive
              All strings are  case  insensitive,  so  all  "-expressions  are
              treated in the same way '-expressions are.

       --case-inverted
              Invert  the  meaning  of single and double quoted strings.  With
              this switch single quotes are case sensitive and  double  quotes
              are case insensitive.
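
       For the -d switch described above, the YYDEBUG macro could be defined
       along the following lines. This is only a sketch: the helper name
       format_debug and the trace format are hypothetical, and only the
       two-parameter shape YYDEBUG(state, current) is taken from this page.

```c
#include <stdio.h>

/* Hypothetical helper: renders one trace record into buf.
 * Returns the value reported by snprintf. */
static int format_debug(char *buf, size_t size, int state, char current)
{
    return snprintf(buf, size, "state=%d char=0x%02X",
                    state, (unsigned)(unsigned char)current);
}

/* Macro that code generated with -d is expected to "call" with the
 * state (or -1) and the input character at the current cursor. */
#define YYDEBUG(state, current)                                \
    do {                                                       \
        char yydebug_line[64];                                 \
        format_debug(yydebug_line, sizeof yydebug_line,        \
                     (state), (current));                      \
        fprintf(stderr, "%s\n", yydebug_line);                 \
    } while (0)
```

       Writing the trace to stderr keeps it separate from whatever the
       scanner itself prints; any other sink would do as well.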


INTERFACE CODE
       Unlike  other scanner generators, re2c does not generate complete scan-
       ners: the user must supply some interface  code.   In  particular,  the
       user  must define the following macros or use the corresponding inplace
       configurations:

       YYCONDTYPE
              In -c mode you can use -t to generate a file that  contains  the
              enumeration  used  as conditions. Each of the values refers to a
              condition of a rule set.

       YYCTXMARKER
              l-expression of type *YYCTYPE.  The generated code saves  trail-
              ing  context  backtracking information in YYCTXMARKER.  The user
              only needs to define this macro if a scanner specification  uses
              trailing context in one or more of its regular-expressions.

       YYCTYPE
              Type  used  to  hold  an input symbol.  Usually char or unsigned
              char.

       YYCURSOR
              l-expression of type *YYCTYPE that points to the  current  input

       YYFILL(n)
              The  generated  code  "calls"  YYFILL(n)  when  the buffer needs
              (re)filling:  at least n additional characters  should  be  pro-
              vided.  YYFILL(n)  should adjust YYCURSOR, YYLIMIT, YYMARKER and
              YYCTXMARKER as needed.  Note that for typical  programming  lan-
              guages  n  will  be  the length of the longest keyword plus one.
              The user can place a comment of the form /*!max:re2c */ once  to
              insert  a  YYMAXFILL(n)  definition  that  is set to the maximum
              length value. If -1 switch is used then YYMAXFILL can  be  trig-
              gered only once after the last /*!re2c */ block.

       YYGETCONDITION()
              This  define  is used to get the condition prior to entering the
              scanner code when using -c switch. The value must be initialized
              with a value from the enumeration YYCONDTYPE type.

       YYGETSTATE()
              The  user  only  needs  to  define this macro if the -f flag was
              specified.  In that case,  the  generated  code  "calls"  YYGET-
              STATE()  at the very beginning of the scanner in order to obtain
              the saved state. YYGETSTATE() must return a signed integer.  The
              value  must be either -1, indicating that the scanner is entered
              for the first time,  or  a  value  previously  saved  by  YYSET-
              STATE(s).   In  the  second case, the scanner will resume opera-
              tions right after where the last YYFILL(n) was called.

       YYLIMIT
              Expression of type *YYCTYPE that marks the  end  of  the  buffer
              (YYLIMIT[-1]  is  the last character in the buffer).  The gener-
              ated code repeatedly compares YYCURSOR to YYLIMIT  to  determine
              when the buffer needs (re)filling.

       YYMARKER
              l-expression  of  type *YYCTYPE.  The generated code saves back-
              tracking information in YYMARKER. Some easy scanners  might  not
              use this.

       YYMAXFILL
              This  will  be automatically defined by /*!max:re2c */ blocks as
              explained above.

       YYSETCONDITION(c)
              This define is used to set the condition  in  transition  rules.
              This  is  only being used when -c is active and transition rules
              are being used.

       YYSETSTATE(s)
              The user only needs to define this macro  if  the  -f  flag  was
              specified.   In that case, the generated code "calls" YYSETSTATE
              just before calling YYFILL(n).  The parameter to YYSETSTATE is a
              signed integer that uniquely identifies the specific instance of
              YYFILL(n) that is about to be called.  Should the user  wish  to
              save  the  state of the scanner and have YYFILL(n) return to the


SCANNER WITH STORABLE STATES
       The default operation of re2c is a "pull" model, where the scanner asks
       for extra input whenever it needs it. However, this mode of operation
       assumes that the scanner is the "owner" of the parsing loop, and that
       may not always be convenient.
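
       In the pull model the refill work is done by YYFILL(n). The sketch
       below shows one way such a refill routine might look. It is not code
       produced by re2c: the Input structure, the function names and the
       in-memory data source are assumptions made for illustration. The
       fields cursor, marker and limit play the roles of YYCURSOR, YYMARKER
       and YYLIMIT.

```c
#include <stddef.h>
#include <string.h>

enum { BUF_SIZE = 16 };

typedef struct {
    unsigned char buf[BUF_SIZE];
    unsigned char *tok;     /* start of the token being scanned */
    unsigned char *cursor;  /* role of YYCURSOR */
    unsigned char *marker;  /* role of YYMARKER */
    unsigned char *limit;   /* role of YYLIMIT */
    const char *src;        /* hypothetical data source (a string) */
    size_t src_left;        /* characters not yet copied from src */
} Input;

static void input_init(Input *in, const char *src)
{
    in->tok = in->cursor = in->marker = in->limit = in->buf;
    in->src = src;
    in->src_left = strlen(src);
}

/* Returns 1 if at least n characters are available at cursor
 * after refilling, 0 otherwise (end of data or buffer too small). */
static int input_fill(Input *in, size_t n)
{
    size_t shift = (size_t)(in->tok - in->buf);
    size_t used, space, take;

    /* Drop everything before the current token, then adjust pointers. */
    memmove(in->buf, in->tok, (size_t)(in->limit - in->tok));
    in->tok -= shift;
    in->cursor -= shift;
    in->marker -= shift;
    in->limit -= shift;

    /* Copy as much new data as fits behind the old limit. */
    used = (size_t)(in->limit - in->buf);
    space = BUF_SIZE - used;
    take = in->src_left < space ? in->src_left : space;
    memcpy(in->limit, in->src, take);
    in->src += take;
    in->src_left -= take;
    in->limit += take;

    return (size_t)(in->limit - in->cursor) >= n;
}
```

       A scanner could then define YYFILL(n) as a macro that calls
       input_fill and leaves the scanning function when it returns 0.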

       Typically, if there is a preprocessor ahead of the scanner in the
       stream, or for that matter any other procedural source of data, the
       scanner cannot "ask" for more data unless both scanner and source live
       in separate threads.

       The -f flag is useful for just this situation: it lets users design
       scanners that work in a "push" model, i.e. where data is fed to the
       scanner chunk by chunk. When the scanner runs out of data to consume,
       it just stores its state and returns to the caller. When more input
       data is fed to the scanner, it resumes operations exactly where it left
       off.

       When  using  the -f option re2c does not accept stdin because it has to
       do the full generation process twice which means it  has  to  read  the
       input  twice.  That  means  re2c  would fail in case it cannot open the
       input twice or reading the input for the first time influences the sec-
       ond read attempt.

       Changes needed compared to the "pull" model:

       1. User has to supply macros YYSETSTATE(state) and YYGETSTATE().

       2. The -f option inhibits declaration of yych and yyaccept. So the user
       has to declare these. Also the user has to save and restore  these.  In
       the  example examples/push.re these are declared as fields of the (C++)
       class of which the scanner is a method, so  they  do  not  need  to  be
       saved/restored  explicitly.  For  C they could e.g. be made macros that
       select fields from a structure passed in as  parameter.  Alternatively,
       they could be declared as local variables, saved with YYFILL(n) when it
       decides to return and restored at entry to the function. Also, it could
       be  more  efficient  to  save  the  state from YYFILL(n) because YYSET-
       STATE(state) is called unconditionally. YYFILL(n) however does not  get
       state as parameter, so we would have to store state in a local variable
       by YYSETSTATE(state).

       3. Modify YYFILL(n) to return (from the function calling  it)  if  more
       input is needed.

       4. Modify caller to recognise "more input is needed" and respond appro-
       priately.

       5. The generated code will contain a switch block that is used to
       restore the last state by jumping behind the corresponding YYFILL(n)
       call. This code is automatically generated in the epilog of the first
       "/*!re2c */" block.  It is possible to trigger generation of the
       YYGETSTATE() block earlier by placing a "/*!getstate:re2c */" comment.
       This is especially useful when the scanner code should be wrapped
       inside a loop.
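
       The five changes above can be illustrated with a small hand-written
       stand-in for a push-model scanner. It is not output of re2c -f: the
       PushScanner structure, the function names and the return codes are
       hypothetical. It only mimics the control flow this section
       describes: a switch on YYGETSTATE(), a YYSETSTATE() before each
       YYFILL(), and a YYFILL() that returns to the caller.

```c
#include <stddef.h>

typedef struct {
    const unsigned char *cursor;  /* role of YYCURSOR */
    const unsigned char *limit;   /* role of YYLIMIT */
    int state;                    /* -1 means "entered for the first time" */
    unsigned char yych;           /* with -f the user declares and keeps yych */
    long value;                   /* number accumulated so far */
} PushScanner;

enum { NEED_MORE_INPUT = -1, TOKEN_NUMBER = 1 };  /* hypothetical codes */

#define YYGETSTATE()   (s->state)
#define YYSETSTATE(n)  (s->state = (n))
#define YYFILL(n)      return NEED_MORE_INPUT  /* give control back */

static void push_init(PushScanner *s)
{
    s->cursor = s->limit = NULL;
    s->state = -1;
    s->yych = 0;
    s->value = 0;
}

/* The caller pushes the next chunk of data into the scanner. */
static void push_feed(PushScanner *s, const char *chunk, size_t len)
{
    s->cursor = (const unsigned char *)chunk;
    s->limit = s->cursor + len;
}

/* Reads decimal digits up to the first non-digit character. */
static int push_scan(PushScanner *s)
{
    switch (YYGETSTATE()) {       /* stand-in for the YYGETSTATE block */
    case 0:  goto fill0;          /* resume behind YYFILL instance 0 */
    default: break;               /* -1: start a fresh token */
    }
    s->value = 0;
    for (;;) {
        if (s->cursor == s->limit) {
            YYSETSTATE(0);
            YYFILL(1);
fill0:      /* execution resumes here once more data has been fed */
            if (s->cursor == s->limit)
                return NEED_MORE_INPUT;   /* an empty chunk was fed */
        }
        s->yych = *s->cursor++;
        if (s->yych < '0' || s->yych > '9') {
            YYSETSTATE(-1);       /* the next call starts a new token */
            return TOKEN_NUMBER;
        }
        s->value = s->value * 10 + (s->yych - '0');
    }
}
```

       The caller alternates push_feed and push_scan: a result of
       NEED_MORE_INPUT means "feed another chunk and call again", which
       corresponds to change 4 above.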

       There are two special rule types. First, the rules of the condition '*'
       are merged into all conditions. Second, the empty condition list
       allows you to provide a code block that does not have a scanner part,
       meaning it does not allow any regular expression. The condition value
       referring to this special block is always the one with the enumeration
       value 0. This way the code of this special rule can be used to
       initialize a scanner. These rules are in no way necessary, but
       sometimes it is helpful to have a dedicated uninitialized condition
       state.

       Non-empty rules allow you to specify the new condition, which makes
       them transition rules. Besides generating calls for the define
       YYSETCONDITION no other special code is generated.

       There is another kind of special rule that allows you to prepend code
       to any code block of all rules of a certain set of conditions or to all
       code blocks of all rules. This can be helpful when some operation is
       common among rules. For instance, this can be used to store the length
       of the scanned string. These special setup rules start with an
       exclamation mark followed by either a list of conditions
       <! condition, ... > or a star <!*>. When re2c generates the code for a
       rule whose state does not have a setup rule and a star'd setup rule is
       present, then that code will be used as setup code.
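
       The way conditions and transition rules interact with YYGETCONDITION
       and YYSETCONDITION can be pictured with the hand-written sketch
       below. It is not generated code: the enumerator names, the
       CondScanner structure and the function are assumptions chosen for
       this example. Only the idea of dispatching on the current condition
       and switching it in transition rules is taken from the text above.

```c
#include <stddef.h>

/* Hypothetical condition enumeration; in -c mode the -t switch can be
 * used to generate a header containing the real one. */
enum YYCONDTYPE { yycNORMAL, yycCOMMENT };

typedef struct {
    enum YYCONDTYPE cond;   /* current condition */
    const char *cursor;     /* current input position */
    size_t comment_chars;   /* characters seen inside comments */
} CondScanner;

#define YYGETCONDITION()   (s->cond)
#define YYSETCONDITION(c)  (s->cond = (c))

/* Counts the characters between '#' and the end of the line. */
static size_t count_comment_chars(const char *text)
{
    CondScanner sc = { yycNORMAL, text, 0 };
    CondScanner *s = &sc;

    while (*s->cursor != '\0') {
        char ch = *s->cursor++;
        switch (YYGETCONDITION()) {        /* one rule set per condition */
        case yycNORMAL:
            if (ch == '#')
                YYSETCONDITION(yycCOMMENT);  /* transition rule */
            break;
        case yycCOMMENT:
            if (ch == '\n')
                YYSETCONDITION(yycNORMAL);   /* transition rule */
            else
                s->comment_chars++;
            break;
        }
    }
    return s->comment_chars;
}
```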


SCANNER SPECIFICATIONS
       Each  scanner  specification  consists of a set of rules, named defini-
       tions and configurations.

       Rules consist of a regular-expression along with a block of C/C++ code
       that is to be executed when the associated regular-expression is
       matched. You can either start the code with an opening curly brace or
       the sequence ':='. When the code starts with a curly brace then re2c
       counts the brace depth and stops looking for code automatically.
       Otherwise curly braces are not allowed and re2c stops looking for code
       at the first line that does not begin with whitespace.

              regular-expression { C/C++ code }

              regular-expression := C/C++ code

       If -c is active then each regular-expression is preceded by a list of
       comma separated condition names. Besides normal naming rules there are
       two special cases. A rule may contain the single condition name '*' or
       no condition name at all. In the latter case the rule cannot have a
       regular-expression. Non-empty rules may furthermore specify the new
       condition. In that case re2c will generate the necessary code to
       change the condition automatically. Just as above, code can be started
       with a curly brace or the sequence ':='. Furthermore, rules can use
       ':=>' as a shortcut to automatically generate code that not only sets
       the new condition state but also continues execution with the new
       state. A shortcut rule should not be used in a loop where there is code
       between the start of the loop and the re2c block unless re2c:cond:goto
       is changed to 'continue;'. If code is necessary before all rules
       (though not simple jumps) you can do so by using <! pseudo-rules.

              <*> regular-expression := C/C++ code

              <*> regular-expression => condition { C/C++ code }

              <*> regular-expression => condition := C/C++ code

              <*> regular-expression :=> condition

              <> { C/C++ code }

              <> := C/C++ code

              <> => condition { C/C++ code }

              <> => condition := C/C++ code

              <> :=> condition

              <!condition-list> { C/C++ code }

              <!condition-list> := C/C++ code

              <!*> { C/C++ code }

              <!*> := C/C++ code

       Named definitions are of the form:

              name = regular-expression;

       If -F is active, then named definitions are also of the form:

              name regular-expression

       Configurations  look  like  named  definitions  whose  names start with
       "re2c:":

              re2c:name = value;
              re2c:name = "value";


SUMMARY OF RE2C REGULAR-EXPRESSIONS
       "foo"  the literal string foo.  ANSI-C escape sequences can be used.

       'foo'  the literal string foo (characters [a-zA-Z] treated  case-insen-
              sitive).  ANSI-C escape sequences can be used.

       [xyz]  a  "character  class";  in  this  case,  the  regular-expression
              matches either an 'x', a 'y', or a 'z'.

       [abj-oZ]
              a "character class" with a range in it; matches an 'a',  a  'b',
              any letter from 'j' through 'o', or a 'Z'.

       name   the expansion of the "named definition" (see above)

       (r)    an r; parentheses are used to override precedence (see below)

       rs     an r followed by an s ("concatenation")

       r|s    either an r or an s

       r/s    an r but only if it is followed by an s. The s is  not  part  of
              the  matched  text.  This  type  of regular-expression is called
              "trailing context". A trailing context can only be the end of  a
              rule and not part of a named definition.

       r{n}   matches r exactly n times.

       r{n,}  matches r at least n times.

       r{n,m} matches r at least n but not more than m times.

       .      matches any character except newline (\n).

       def    matches  named definition as specified by def only if -F is off.
              If the switch -F  is  active  then  this  behaves  like  it  was
              enclosed in double quotes and matches the string def.

       Character classes and string literals may contain octal or hexadecimal
       character definitions and the following set of escape sequences (\n,
       \t, \v, \b, \r, \f, \a, \\).  An octal character is defined by a
       backslash followed by its three octal digits and a hexadecimal
       character is defined by a backslash, a lower cased 'x' and its two
       hexadecimal digits or a backslash, an upper cased X and its four
       hexadecimal digits.
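
       The octal and lower cased 'x' notations mirror the escape sequences
       of C itself, so their meaning can be checked with a few lines of
       plain C. The function name below is hypothetical. Note that the
       upper cased X form with four digits has no counterpart in C, and
       that C does not limit \x to two digits, which is why every escape
       in the second string is followed directly by another backslash or
       by the end of the string.

```c
#include <string.h>

/* Returns 1 when the octal and hexadecimal spellings denote the same
 * sequence of characters, 0 otherwise. */
static int escapes_agree(void)
{
    const char octal[] = "\101\102\012";  /* three octal digits each */
    const char hex[]   = "\x41\x42\x0a";  /* two hexadecimal digits each */

    return sizeof octal == sizeof hex
        && memcmp(octal, hex, sizeof octal) == 0;
}
```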

       re2c furthermore supports the C/C++ Unicode notation. That is a
       backslash followed by either a lowercased u and its four hexadecimal
       digits or an uppercased U and its eight hexadecimal digits. However,
       only in -u mode can the generated code deal with any valid Unicode
       character up to 0x10FFFF.

       Since characters greater than \X00FF are not allowed in non-Unicode
       mode, the only portable "any" rules are (.|"\n") and [^].

       The regular-expressions listed above are grouped  according  to  prece-
       dence,  from  highest  precedence  at  the top to lowest at the bottom.
       Those grouped together have equal precedence.


INPLACE CONFIGURATION
       It is possible to configure code generation  inside  re2c  blocks.  The
       following lists the available configurations:

       re2c:condprefix = yyc_ ;
              Allows  to specify the prefix used for condition labels. That is
              this text is prepended to any condition label in  the  generated

       re2c:cond:divider@cond = @@ ;
              Specifies  the placeholder that will be replaced with the condi-
              tion name in re2c:cond:divider.

       re2c:cond:goto = "goto @@;" ;
              Allows to customize the condition goto statements used with
              ':=>' style rules.  You can use '@@' to put the name of the
              condition or customize the placeholder using
              re2c:cond:goto@cond. You can also change this to 'continue;',
              which would allow you to continue with the next loop cycle
              including any code between loop start and re2c block.

       re2c:cond:goto@cond = @@ ;
              Specifies the placeholder that will be replaced with the
              condition label in re2c:cond:goto.

       re2c:indent:top = 0 ;
              Specifies the minimum amount of indentation to use. Requires a
              numeric value greater than or equal to zero.

       re2c:indent:string = "\t" ;
              Specifies the string to use for indentation. Requires a string
              that should contain only whitespace unless you need this for
              external tools. The easiest way to specify spaces is to enclose
              them in single or double quotes. If you do not want any
              indentation at all you can simply set this to "".

       re2c:yych:conversion = 0 ;
              When this setting is non zero, then re2c automatically generates
              conversion code whenever yych gets read. In this case  the  type
              must be defined using re2c:define:YYCTYPE.

       re2c:yych:emit = 1 ;
              Generation of yych can be suppressed by setting this to 0.

       re2c:yybm:hex = 0 ;
              If  set  to zero then a decimal table is being used else a hexa-
              decimal table will be generated.

       re2c:yyfill:enable = 1 ;
              Set this to zero to suppress generation of YYFILL(n). When using
              this be sure to verify that the generated scanner does not read
              past the end of the input. Allowing this behavior might
              introduce severe security issues to your programs.

       re2c:yyfill:check = 1 ;
              This can be set to 0 to suppress output of the precondition
              using YYCURSOR and YYLIMIT, which becomes useful when YYLIMIT +
              max(YYFILL) is always accessible.

       re2c:yyfill:parameter = 1 ;
              Allows suppressing parameter passing to YYFILL calls. If set to
              zero then no parameter is passed to YYFILL. However,
              re2c:define:YYFILL@len allows specifying a replacement string
              for the actual length value.

       re2c:labelprefix = yy ;
              Allows to change the prefix of numbered labels. The default is
              yy and can be set to any string that is a valid label.

       re2c:state:abort = 0 ;
              When not zero and switch -f is active then the YYGETSTATE  block
              will  contain  a  default case that aborts and a -1 case is used
              for initialization.

       re2c:state:nextlabel = 0 ;
              Used when -f is active to control whether the  YYGETSTATE  block
              is followed by a yyNext: label line. Instead of using yyNext you
              can usually also use configuration startlabel to  force  a  spe-
              cific  start  label or default to yy0 as start label. Instead of
              using a dedicated label it  is  often  better  to  separate  the
              YYGETSTATE  code  from  the  actual  scanner  code  by placing a
              "/*!getstate:re2c */" comment.

       re2c:cgoto:threshold = 9 ;
              When -g is active this value specifies the complexity  threshold
              that triggers generation of jump tables rather than using nested
              if's and decision bitfields.  The threshold is compared  against
              a  calculated  estimation of if-s needed where every used bitmap
              divides the threshold by 2.

       re2c:yych:conversion = 0 ;
              When the input uses signed characters and -s or -b switches  are
              in  effect  re2c allows to automatically convert to the unsigned
              character type that is then necessary for  its  internal  single
              character. When this setting is zero or an empty string the con-
              version is disabled. Using a non zero number the  conversion  is
              taken from YYCTYPE. If that is given by an inplace configuration
              that value is being used. Otherwise it  will  be  (YYCTYPE)  and
              changes to that configuration are  no longer possible. When this
              setting is a string the braces must be specified.  Now  assuming
              your  input  is a char* buffer and you are using above mentioned
              switches you can set YYCTYPE to unsigned char and  this  setting
              to either 1 or "(unsigned char)".

       re2c:define:YYCONDTYPE = YYCONDTYPE ;
              Enumeration used for condition support with -c mode.

       re2c:define:YYCTXMARKER = YYCTXMARKER ;
              Allows  to overwrite the define YYCTXMARKER and thus avoiding it
              by setting the value to the actual code needed.

       re2c:define:YYCTYPE = YYCTYPE ;
              Allows to overwrite the define YYCTYPE and thus avoiding  it  by
              setting the value to the actual code needed.

       re2c:define:YYCURSOR = YYCURSOR ;
              Allows to overwrite the define YYCURSOR and thus avoiding it by
              setting the value to the actual code needed.

       re2c:define:YYFILL@len = @@ ;
              When using re2c:define:YYFILL and re2c:yyfill:parameter is 0
              then any occurrence of this text inside YYFILL will be replaced
              with the actual length value.

       re2c:define:YYGETCONDITION = YYGETCONDITION ;
              Allows to overwrite the define YYGETCONDITION.

       re2c:define:YYGETCONDITION:naked = 0 ;
              When set to 1 neither braces, parameter nor semicolon gets emit-
              ted.

       re2c:define:YYGETSTATE = YYGETSTATE ;
              Allows to overwrite the define YYGETSTATE and thus  avoiding  it
              by setting the value to the actual code needed.

       re2c:define:YYGETSTATE:naked = 0 ;
              When set to 1 neither braces, parameter nor semicolon gets emit-
              ted.

       re2c:define:YYLIMIT = YYLIMIT ;
              Allows to overwrite the define YYLIMIT and thus avoiding  it  by
              setting the value to the actual code needed.

       re2c:define:YYMARKER = YYMARKER ;
              Allows  to overwrite the define YYMARKER and thus avoiding it by
              setting the value to the actual code needed.

       re2c:define:YYSETCONDITION = YYSETCONDITION ;
              Allows to overwrite the define YYSETCONDITION.

       re2c:define:YYSETCONDITION@cond = @@ ;
              When using re2c:define:YYSETCONDITION then any occurrence of
              this text inside YYSETCONDITION will be replaced with the actual
              new condition value.

       re2c:define:YYSETSTATE = YYSETSTATE ;
              Allows to overwrite the define YYSETSTATE and thus  avoiding  it
              by setting the value to the actual code needed.

       re2c:define:YYSETSTATE:naked = 0 ;
              When set to 1 neither braces, parameter nor semicolon gets emit-
              ted.

       re2c:define:YYSETSTATE@state = @@ ;
              When using re2c:define:YYSETSTATE then any occurrence of this
              text inside YYSETSTATE will be replaced with the actual new
              state value.

       re2c:label:yyFillLabel = yyFillLabel ;
              Allows to overwrite the name of the label yyFillLabel.

       re2c:variable:yyctable = yyctable ;
              When both -c and -g are active then re2c uses this  variable  to
              generate a static jump table for YYGETCONDITION.

       re2c:variable:yystable = yystable ;
              When  both  -f and -g are active then re2c uses this variable to
              generate a static jump table for YYGETSTATE.

       re2c:variable:yytarget = yytarget ;
              Allows to overwrite the name of the variable yytarget.


UNDERSTANDING RE2C
       The subdirectory lessons of the re2c distribution contains a  few  step
       by  step  lessons  to  get  you  started with re2c. All examples in the
       lessons subdirectory can be compiled and actually work.


FEATURES
       re2c does not provide a default action: the generated code assumes that
       the  input will consist of a sequence of tokens.  Typically this can be
       dealt with by adding a rule such as the one for  unexpected  characters
       in the example above.

       The user must arrange for a sentinel token to appear at the end of
       input (and provide a rule for matching it): re2c does not provide an
       <<EOF>> expression.  If the source is from a null-byte terminated
       string, a rule matching a null character will suffice.  If the source
       is from a file then you could pad the input with a newline (or some
       other character that cannot appear within another token); upon
       recognizing such a character check to see if it is the sentinel and act
       accordingly. You can also use YYFILL(n) to end the scanner in case not
       enough characters are available, which is nothing other than a
       detection of end of data/file.
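
       For a null-byte terminated string the sentinel approach could look
       like the sketch below. It is written by hand rather than generated
       by re2c, and the function name is hypothetical. It behaves like a
       scanner with three rules: one for [0-9]+, one for the null
       character acting as sentinel, and one for any other character.

```c
#include <stddef.h>

#define YYCTYPE unsigned char

/* Counts the runs of decimal digits in a null-terminated string. */
static size_t count_numbers(const char *text)
{
    const YYCTYPE *YYCURSOR = (const YYCTYPE *)text;
    size_t count = 0;

    for (;;) {
        YYCTYPE yych = *YYCURSOR;

        if (yych == '\0')
            return count;             /* sentinel rule: end of input */

        if (yych >= '0' && yych <= '9') {
            /* rule [0-9]+ : consume the whole run of digits */
            do {
                yych = *++YYCURSOR;
            } while (yych >= '0' && yych <= '9');
            ++count;
        } else {
            ++YYCURSOR;               /* rule for unexpected characters */
        }
    }
}
```

       Because every path stops at the terminating null character, no
       YYLIMIT comparison and no YYFILL(n) are needed here.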


BUGS
       Difference only works for character sets.

       The re2c internal algorithms need documentation.


SEE ALSO
       flex(1), lex(1).

       More information on re2c can be found here:
       http://re2c.org/


AUTHORS
       Peter Bumbulis <peter@csg.uwaterloo.ca>
       Brian Young <bayoung@acm.org>
       Dan Nuffer <nuffer@users.sourceforge.net>
       Marcus Boerger <helly@users.sourceforge.net>