René Nyffenegger's collection of things on the web

Extracting URLs from text with Perl

This script is an attempt to extract URLs from text with Perl. It recognizes a URL if it starts with either http:// or www. (including the dot). If the URL is directly (that is, without white space) followed by a dot (.), comma (,), exclamation mark (!) or question mark (?), these characters are not included in the URL.
extract_url_1.pl
use strict;
use warnings;

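# Sample text with four URLs, some directly followed by punctuation.
# (Named $text rather than $a, because $a is reserved for sort.)
my $text = 
"http://some.url1.com/abc/def.html, bla bla 
bla:www.url2.qq/ghi/jkl.php?a=2&b=33, moo moo
www.miu.url3.qq bar bar bar http://www.url4.xxx.ui?r=19 aldjf";

print "\n";

# A URL starts with http:// or www. and must end in a character that is
# neither white space nor one of . , ! ? so trailing punctuation is dropped.
# \s (instead of a literal space) also keeps a newline out of the match.
while ($text =~ m@((?:http://|www\.)\S+[^.,!?\s])@g) {
  print " ", $1, "\n";
}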
And here's the output:

 http://some.url1.com/abc/def.html
 www.url2.qq/ghi/jkl.php?a=2&b=33
 www.miu.url3.qq
 http://www.url4.xxx.ui?r=19
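
When the URLs are needed for further processing rather than for printing, the same pattern can be wrapped in a small sub that returns them as a list. This is only a sketch: the sub name extract_urls and the sample sentence are hypothetical. The final character class uses \s instead of a literal space so that a URL at the end of a line does not swallow the newline.

```perl
use strict;
use warnings;

# Hypothetical helper: returns every URL-like token found in $text.
sub extract_urls {
    my ($text) = @_;
    my @urls;
    # A URL starts with http:// or www. and ends at the last character of the
    # non-white-space run that is not one of . , ! ?
    while ($text =~ m@((?:http://|www\.)\S+[^.,!?\s])@g) {
        push @urls, $1;
    }
    return @urls;
}

my @found = extract_urls("See www.example.com/a.html. Or http://example.org/x?y=1!\n");
print "$_\n" for @found;
# prints:
#   www.example.com/a.html
#   http://example.org/x?y=1
```

Because \S+ is greedy and the last character must not be punctuation, the match backtracks past any run of trailing dots, commas, exclamation marks or question marks.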

Prepending missing http://

Sometimes it is necessary to prepend a missing http:// to a URL. This is done with a negative lookbehind assertion, (?<!http://), which lets the substitution match www. only where it is not already preceded by http://.
extract_url_2.pl
use strict;
use warnings;

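# Same sample text as in extract_url_1.pl.
# (Named $text rather than $a, because $a is reserved for sort.)
my $text = 
"http://some.url1.com/abc/def.html, bla bla 
bla:www.url2.qq/ghi/jkl.php?a=2&b=33, moo moo
www.miu.url3.qq bar bar bar http://www.url4.xxx.ui?r=19 aldjf";

# Prepend http:// to every www. URL that is not already preceded by it.
$text =~ s@(?<!http://)(www\.\S+[^.,!?\s])@http://$1@g;

print "\n";

# After the substitution every URL starts with http://.
while ($text =~ m@(http://\S+[^.,!?\s])@g) {
  print " ", $1, "\n";
}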
And here's the output:

 http://some.url1.com/abc/def.html
 http://www.url2.qq/ghi/jkl.php?a=2&b=33
 http://www.miu.url3.qq
 http://www.url4.xxx.ui?r=19
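
To see what the lookbehind contributes, the substitution can be run with and without it on the same input. This is a minimal sketch: the sub names add_scheme and add_scheme_naive and the sample string are hypothetical and only serve the comparison.

```perl
use strict;
use warnings;

# Hypothetical helper: prepend http:// only where it is missing.
sub add_scheme {
    my ($text) = @_;
    # (?<!http://) succeeds only if the text right before "www." is not "http://".
    $text =~ s@(?<!http://)(www\.\S+[^.,!?\s])@http://$1@g;
    return $text;
}

# Same substitution without the lookbehind, for comparison only.
sub add_scheme_naive {
    my ($text) = @_;
    $text =~ s@(www\.\S+[^.,!?\s])@http://$1@g;
    return $text;
}

my $sample = "www.url2.qq and http://www.url4.xxx.ui";
print add_scheme($sample), "\n";
# prints: http://www.url2.qq and http://www.url4.xxx.ui
print add_scheme_naive($sample), "\n";
# prints: http://www.url2.qq and http://http://www.url4.xxx.ui
```

Note that the lookbehind checks for the literal http:// only, so a URL written as https://www... would still get an http:// inserted in front of the www. part.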