René Nyffenegger's collection of things on the web
René Nyffenegger on Oracle - Most wanted - Feedback -
 

Parse::RecDescent (Perl)

Parse::RecDescent can be used to generate recursive descent parsers. A parser is used to make meaning of an inputs (mostly given in textual form) that follow certain patterns. These patterns are described in a grammar. The grammer consists of a set of rules (called productions). When the parser reads the input, it tries to find a rule that matches and whenever it finds such a rule, it triggers an action. It then goes on to read from the input and to find matching patterns, until the input has finished.
An example for a rule looks like so:
  name_of_rule    :   name_of_other_rule 'SOME_LITERAL_TEXT'  /regular-expression/
                      {action_to_trigger_if_rule_matches()}  |
                      alternative_rule
                      {alternative_action_to_trigger()}
Note the bar (|) that seperates alternatives.
The entire grammar is passed through the new method of Parse::RecDescent.
Here's a simple example to demonstrate this:
use strict;
use warnings;

use Parse::RecDescent;

my $grammar = q {
  
  start    :  character character character(s)
              {print "Found: ", $item[1], $item[2], join "", @{$item[3]}, "\n"; }

  character:  /\w/
};

my $parser = Parse::RecDescent->new($grammar);

defined $parser->start("ABCD") or die "didn't match";
The special expression (s) means: one more more occurances of the preceeding item.

First production, not longest

use strict;
use warnings;

use Parse::RecDescent;

my $grammar = q {

  start    :  seq_1 seq_2

  seq_1    :   'A' 'B' 'C' 'D'         { print "seq_1: " . join (" ", @item[1..$#item]) . "\n" }
           |   'A' 'B' 'C' 'D' 'E' 'F' { print "seq_1: " . join (" ", @item[1..$#item]) . "\n" }

  seq_2    : character(s)

  character: /\w/                      { print "character: $item[1]\n" }

};

my $parser=Parse::RecDescent->new($grammar);

$parser->start("A B C D E F G");
seq_1: A B C D
character: E
character: F
character: G
use strict;
use warnings;

use Parse::RecDescent;

my $grammar = q {

  start:  seq_1 seq_2

  seq_1    :  'A' 'B' 'C' 'D' 'E' 'F'
               { print "seq_1: " . join (" ", @item[1..$#item]) . "\n" }
            | 'A' 'B' 'C' 'D'
               { print "seq_1: " . join (" ", @item[1..$#item]) . "\n" }


  seq_2    : character(s)

  character: /\w/
               { print "character: $item[1]\n" }

};

my $parser=Parse::RecDescent->new($grammar);

$parser->start("A B C D E F");
seq_1: A B C D E F

@item and %item

@item

$item[0] is the name of the rule, as is $item{__RULE__}.
$item[n] (n > 0) is the value of the nth subitem in the rule.
use strict;
use warnings;

use Parse::RecDescent;

my $grammar = q {

  start:  seq

  seq    :   char_1 char_2 char_3 char_4
               { print join (" -- ", @item); }

  char_1 : character

  char_2 : character
  
  char_3 : character

  char_4 : character

  character: /\w/
};

my $parser=Parse::RecDescent->new($grammar);

$parser->start("A B C D");
seq -- A -- B -- C -- D
If an item in a production is an (s) item, $item[$prod_no] is a reference to an array:
use strict;
use warnings;

use Parse::RecDescent;

my $grammar = q {

  start:  seq

  seq    :   character(s)
             { print join (" -- ", @{$item[1]}) }
              
  character: /\w/
};

my $parser=Parse::RecDescent->new($grammar);

$parser->start("A B C D");
A -- B -- C -- D

%item

$item{__RULE__} is the name of the rule, as is $item[0]
use strict;
use warnings;

use Parse::RecDescent;

my $grammar = q {

  start:  seq

  seq    :   char_1 char_2 char_3 char_4
               { print map {$_ . "=" . $item{$_} . "; "} keys %item}

  char_1 : character

  char_2 : character
  
  char_3 : character

  char_4 : character

  character: /\w/
};

my $parser=Parse::RecDescent->new($grammar);

$parser->start("A B C D");
char_4=D; char_2=B; char_3=C; char_1=A; __RULE__=seq;

dynamically matched rules

It is even possible to use @item to dynamically generate rules:
command:  keyword  body  "end $item[1]"

Cardinality of items

Usually, an item must occur exactly once in order to match the rule. However, it is possible to alter this behaviour.

item(s)

item occurs 1 or more times.

item(?)

.tem occurs 0 or 1 times.

item(s?)

.tem occurs 0, 1 or more times

item(n..m)

.tem occurs at least n times but no more than m times.

item(..m)

item occurs at most m times (0 times possible)

item(n..)

item occurs at least n times.

$text

The remaining (unparsed) text. Changes to $text do not propagate out of unsuccessful productions, but do survive successful productions. Hence it is possible to dynamically alter the text being parsed - for example, to provide a #include-like facility:
        hash_include: '#include' filename
                                { $text = ::loadfile($item[2]) . $text }
        filename: '<' /[a-z0-9._-]+/i '>'  { $return = $item[2] }
                | '"' /[a-z0-9._-]+/i '"'  { $return = $item[2] }

$thisline and $prevline

Note, $thisline starts with 1!
I have no idea what $prevline is or indicates.
use strict;
use warnings;

use Parse::RecDescent;

my $grammar = q {

  start:  line(s)

  line    :  "print_line" {print "print_line found on: $thisline\n"}
            |/.*/
};

my $parser=Parse::RecDescent->new($grammar);

$parser->start('
hello world
i want a print_line right
now
and
print_line
here again
print_line
');
print_line found on: 6
print_line found on: 8

Autoactions

An autoaction triggers at the end of each production.
$::RD_AUTOACTION must be defined when Parse::RecDescent::new() is called.
use strict;
use warnings;

use Parse::RecDescent;
use Data::Dumper;

$::RD_AUTOACTION = q { [@item] };

my $grammar = q {

  start         : seq

  seq           : char_1 char_2 number(s) thing(s)

  char_1        : character

  char_2        : character
  
  character     : /[A-Z]/

  number        : /\d+/

  thing         : '(' key_value_pair(s) ')'

  key_value_pair: identifier "=" identifier

  identifier    : m([A-Za-z_]\w*) 
};

my $parser=Parse::RecDescent->new($grammar);

my $result = $parser->start("A B 42 29 5 ( foo = bar perl = cool hello = world )");

print Dumper($result);
Prints a parse tree
$VAR1 = [
          'start',
          [
            'seq',
            [
              'char_1',
              [
                'character',
                'A'
              ]
            ],
            [
              'char_2',
              [
                'character',
                'B'
              ]
            ],
            [
              [
                'number',
                '42'
              ],
              [
                'number',
                '29'
              ],
              [
                'number',
                '5'
              ]
            ],
            [
              [
                'thing',
                '(',
                [
                  [
                    'key_value_pair',
                    [
                      'identifier',
                      'foo'
                    ],
                    '=',
                    [
                      'identifier',
                      'bar'
                    ]
                  ],
                  [
                    'key_value_pair',
                    [
                      'identifier',
                      'perl'
                    ],
                    '=',
                    [
                      'identifier',
                      'cool'
                    ]
                  ],
                  [
                    'key_value_pair',
                    [
                      'identifier',
                      'hello'
                    ],
                    '=',
                    [
                      'identifier',
                      'world'
                    ]
                  ]
                ],
                ')'
              ]
            ]
          ]
        ];

Zero width assertions

Zero-width positive look ahead assertion

inner_word: word ...word
Consumes word once if followed by another word.
use strict;
use warnings;

use Parse::RecDescent;

my $grammar = q {

  start:  seq(s)

  seq      :   'A' 'B' ...'C'
               { print "A B eaten, C follows\n"}
             | 'A' 'B' character
               { print "A B $item[3] eaten\n"}
             | character character
               { print "two character eaten: $item[1] and $item[2]\n" }

  character: /\w/

};

my $parser=Parse::RecDescent->new($grammar);

$parser->start("A B Q P A B E A B C A B X");
A B Q eaten
two character eaten: P and A
two character eaten: B and E
A B eaten, C follows
two character eaten: C and A
two character eaten: B and X

Zero-width negative look ahead assertion

Too lazy at the moment....

Directives

<commit> and <uncommit>

<reject>

Causes the current production to fail.

<score>

<autoscore>

<skip>

<resync>

<error>

<rulevar>

<matchrule>

<leftop>

The <leftop:...> directive specifies a left-associative binary operator>.

<rightop>

The <lrightop:...> directive specifies a right-associative binary operator>.

<defer>

<nocheck>

<perl_quotelike>

<perl_codeblock>

<perl_variable>

<token>

A calculator

Without leftop and rightop

use strict;
use warnings;

use Parse::RecDescent;

my $grammar = q {
                                         {my %vars}

start:          statements               {print $item[1]}

statements:     statement ';' statements
              | statement

statement:      variable '=' statement   {$vars {$item [1]} = $item [3]}
              | expression               {$item [1]}

expression:     term '+' expression      {$item [1] + $item [3]}
              | term '-' expression      {$item [1] - $item [3]}
              | term

term:           factor '*' term          {$item [1] * $item [3]}
              | factor '/' term          {$item [1] / $item [3]}
              | factor

factor:         number
              | variable                 {$vars {$item [1]} ||= 0 }
              | '+' factor               {$item [2]}
              | '-' factor               {$item [2] * -1}
              | '(' statement ')'        {$item [2]}

number:         /\d+/                    {$item [1]}

variable:       /[a-z]+/i


};

my $parser=Parse::RecDescent->new($grammar);

my $result = $parser->start("three=3;six=2*three;eight=three+5;2+eight*six+50");

With leftop and rightop

use strict;
use warnings;

use Parse::RecDescent;

my $grammar = q {
                                         {my %vars}

start:          statements               {print $item[1]}

statements:     statement ';' statements
              | statement

statement:      <rightop: variable '=' expression>
                         {my $value = pop @{$item [1]};
                          while (@{$item [1]}) {
                              $vars {shift @{$item [1]}} = $value;
                          }
                          $value
                         }

expression:     <leftop: term ('+' | '-') term>
                         {my $s = shift @{$item [1]};
                          while (@{$item [1]}) {
                              my ($op, $t) = splice @{$item [1]}, 0, 2;
                              if ($op eq '+') {$s += $t}
                              else            {$s -= $t}
                          }
                          $s
                         }

term:           <leftop: factor m{([*/])} factor>
                         {my $t = shift @{$item [1]};
                          while (@{$item [1]}) {
                              my ($op, $f) = splice @{$item [1]}, 0, 2;
                              if ($op eq '/') {$t /= $f}
                              else            {$t *= $f}
                          }
                          $t
                         }


factor:         number
              | variable                 {$vars {$item [1]} ||= 0 }
              | '+' factor               {$item [2]}
              | '-' factor               {$item [2] * -1}
              | '(' statement ')'        {$item [2]}

number:         /\d+/                    {$item [1]}

variable:       /[a-z]+/i


};

my $parser=Parse::RecDescent->new($grammar);

my $result = $parser->start("three=3;six=2*three;eight=three+5;2+eight*six+50");

Links

Examples using Parse::RecDescent

Parsing SQL create table statements and producing an ERD out of it.