René Nyffenegger's collection of things on the web
Extracting URLs from text with Perl
This script is an attempt to extract URLs from text with Perl. A string is recognized as a URL if it starts with either http:// or www.
and runs up to the next white space. If the URL is directly (that is, without white space) followed by a dot (.), comma (,), exclamation
mark (!) or question mark (?), these characters are not included in the URL.
extract_url_1.pl
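use strict;
use warnings;

# Sample text; $text is used instead of $a, which Perl reserves for sort.
my $text =
"http://some.url1.com/abc/def.html, bla bla
bla:www.url2.qq/ghi/jkl.php?a=2&b=33, moo moo
www.miu.url3.qq bar bar bar http://www.url4.xxx.ui?r=19 aldjf";

print "\n";

# A URL starts with http:// or www. and must not end in . , ! ? or white space
# (\s instead of a literal space keeps a trailing newline out of the match).
while ($text =~ m@((?:http://|www\.)\S+[^.,!?\s])@g) {
  print " ", $1, "\n";
}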
And here's the output:
 http://some.url1.com/abc/def.html
 www.url2.qq/ghi/jkl.php?a=2&b=33
 www.miu.url3.qq
 http://www.url4.xxx.ui?r=19

Prepending missing http://
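Sometimes it is necessary to prepend a missing http:// to a URL. This is done with a substitution that is guarded by a
negative lookbehind assertion, (?<!http://), so that only those www. addresses that are not already preceded by
http:// get the prefix.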
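The effect of the lookbehind is easiest to see by running the same substitution with and without it. The following is only a minimal sketch: the sample string, the example.com and example.org hosts and the variable names are made up for illustration. Without the assertion, a www. that already follows http:// is prefixed a second time.

```perl
use strict;
use warnings;

# Hypothetical sample text: one URL already has http://, the other does not.
my $text = "see http://www.example.com/a and www.example.org/b.";

# Without the lookbehind, every www. gets a prefix, even right after http://.
(my $naive = $text) =~ s@(www\.\S+[^.,!?\s])@http://$1@g;

# With the negative lookbehind, a www. preceded by http:// is left alone.
(my $fixed = $text) =~ s@(?<!http://)(www\.\S+[^.,!?\s])@http://$1@g;

print "naive: $naive\n";
print "fixed: $fixed\n";
```

This should print:

 naive: see http://http://www.example.com/a and http://www.example.org/b.
 fixed: see http://www.example.com/a and http://www.example.org/b.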
extract_url_2.pl
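use strict;
use warnings;

# Sample text; $text is used instead of $a, which Perl reserves for sort.
my $text =
"http://some.url1.com/abc/def.html, bla bla
bla:www.url2.qq/ghi/jkl.php?a=2&b=33, moo moo
www.miu.url3.qq bar bar bar http://www.url4.xxx.ui?r=19 aldjf";

# Prepend http:// to every www. address that is not already preceded by it.
$text =~ s@(?<!http://)(www\.\S+[^.,!?\s])@http://$1@g;

print "\n";

# Every URL now starts with http://, so only that prefix has to be matched.
while ($text =~ m@(http://\S+[^.,!?\s])@g) {
  print " ", $1, "\n";
}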
And here's the output:
 http://some.url1.com/abc/def.html
 http://www.url2.qq/ghi/jkl.php?a=2&b=33
 http://www.miu.url3.qq
 http://www.url4.xxx.ui?r=19
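
If the two steps are needed in more than one place, they could be wrapped in a subroutine that returns the URLs as a list. This is only a sketch under a few assumptions: extract_urls is a hypothetical name that does not appear in the scripts above, the sample string is made up, and the character class uses \s so that a newline directly after a URL is not captured.

```perl
use strict;
use warnings;

# extract_urls is a hypothetical helper name chosen for this sketch.
sub extract_urls {
    my ($text) = @_;    # work on a copy; the caller's string is not modified

    # Step 1: prepend http:// to www. addresses that lack it.
    $text =~ s@(?<!http://)(www\.\S+[^.,!?\s])@http://$1@g;

    # Step 2: in list context, m//g returns the captures of all matches.
    my @urls = $text =~ m@(http://\S+[^.,!?\s])@g;
    return @urls;
}

# Made-up input: trailing comma, trailing question mark, newline at the end.
my @found = extract_urls("Visit www.example.com/docs, or http://example.org/faq?\n");
print "$_\n" for @found;
```

For this input the subroutine should return http://www.example.com/docs and http://example.org/faq, in that order.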