René Nyffenegger's collection of things on the web

HTML::TokeParser [Perl]

use strict;
use warnings;
use LWP::Simple;
use HTML::TokeParser;

my $doc = get("http://www.adp-gmbh.ch/sitemap.html");
die "Could not fetch the sitemap\n" unless defined $doc;  # get returns undef on failure
my $parser = HTML::TokeParser->new(\$doc);

my $indent   = 0;
my $print_it = 0;

r();

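sub r {
  # get_token returns undef after the last token, which ends the loop.
  while (my $x = $parser->get_token) {

    if ($x->[0] eq 'S') {            # start tag; $x->[1] is the tag name
      if ($x->[1] eq 'ul') {
        $indent++;                   # a nested list is one level deeper
      }
      elsif ($x->[1] eq 'li') {
        print "\n" if $print_it;     # end the line of the previous item
        print  "  " x $indent;
        $print_it = 1;
      }
    }
    elsif ($x->[0] eq 'T') {         # text; $x->[1] is the text itself
      print $x->[1] if $print_it;
    }
    elsif ($x->[0] eq 'E') {         # end tag; $x->[1] is the tag name
      if ($x->[1] eq 'ul') {
        $indent--;
        $print_it = 0;               # ignore text until the next <li>
      }
    }
  }
}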

get_token

get_token returns the next token of the parsed document. After the last token, it returns undef. The returned token is actually a reference to an array whose first element describes the type of the token.
The following types can be returned:
  • S: Start tag
  • E: End tag
  • T: Text
  • C: Comment
  • D: Declaration
  • PI: Processing instruction
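
The token types listed above can be observed with a small sketch like the following. It parses an in-memory string instead of fetching a page, so it needs no network access. The helper name token_types and the sample HTML are made up for this illustration and are not part of HTML::TokeParser.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: returns the type code (first array element)
# of every token that get_token produces for the given HTML string.
sub token_types {
  my ($html) = @_;
  my $parser = HTML::TokeParser->new(\$html);
  my @types;
  while (my $token = $parser->get_token) {
    push @types, $token->[0];
  }
  return @types;
}

# Sample input (an assumption for this sketch): a declaration, a start
# tag, some text, a comment and an end tag, with no whitespace between.
my $sample = '<!DOCTYPE html><p class="x">Hi<!-- note --></p>';
print join(' ', token_types($sample)), "\n";
```

For a short input like this one, each construct should come back as a single token, so the printed line lists one type code per construct in document order. Longer documents may deliver one run of text as several consecutive T tokens.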