René Nyffenegger's collection of things on the web
Extracting URLs from text with Perl
This is an attempt at a script that extracts URLs from text with Perl. The script treats any token that begins with http:// or www. as a URL.
If the URL is directly (that is, without white space) followed by a dot (.), comma (,), exclamation mark (!) or question mark (?), these
characters are not included in the URL.
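The matching rule described above can be sketched as a small helper with the pattern written out in commented form. This is only an illustration: the name extract_urls and the sample sentence are made up here, and the character class uses \s instead of a literal space, so tabs and newlines also end a URL.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original script):
# returns every URL-like token found in the given text.
sub extract_urls {
    my ($text) = @_;
    # In list context, m//g returns the contents of the capture group
    # for every match.
    my @urls = $text =~ m{
        (                        # capture the whole URL
          (?: http:// | www\. )  # it must start with http:// or www.
          \S+                    # greedily take every non-white-space character
          [^.,!?\s]              # the last character may not be . , ! ? or white space
        )
    }gx;
    return @urls;
}

print "$_\n" for extract_urls('See www.example.org! Or http://example.com/a.html.');
# prints:
#   www.example.org
#   http://example.com/a.html
```

Because \S+ is greedy, it first swallows the trailing punctuation as well. The final character class then fails, and the regex engine backtracks one character at a time until the URL ends in an allowed character.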
extract_url_1.pl
use strict;
use warnings;

# Sample text to scan
my $text = "http://some.url1.com/abc/def.html, bla bla bla:www.url2.qq/ghi/jkl.php?a=2&b=33, moo moo www.miu.url3.qq bar bar bar http://www.url4.xxx.ui?r=19 aldjf";

print "\n";

# \S+ is greedy; the final [^.,!? ] makes it give back trailing punctuation
while ($text =~ m@(((http://)|(www\.))\S+[^.,!? ])@g) {
  print " ", $1, "\n";
}
And here's the output:
 http://some.url1.com/abc/def.html
 www.url2.qq/ghi/jkl.php?a=2&b=33
 www.miu.url3.qq
 http://www.url4.xxx.ui?r=19

Prepending missing http://
Sometimes it is necessary to prepend a missing http:// to a URL. This can be done with a negative lookbehind assertion,
(?<!http://), which lets the substitution match www. only where it is not already preceded by http://.
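To see why the lookbehind matters, the following sketch compares a substitution without it to one with it. The sub names prepend_naive and prepend_guarded and the sample strings are invented for this illustration, and the /r modifier (which returns the modified copy) assumes Perl 5.14 or later.

```perl
use strict;
use warnings;

# Hypothetical helper: prepends http:// to every www. token,
# even one that already follows http://
sub prepend_naive {
    my ($s) = @_;
    return $s =~ s{(www\.\S+)}{http://$1}gr;
}

# Hypothetical helper: (?<!http://) rejects a www. that is
# directly preceded by http://
sub prepend_guarded {
    my ($s) = @_;
    return $s =~ s{(?<!http://)(www\.\S+)}{http://$1}gr;
}

for my $s ('http://www.example.org', 'see www.example.org') {
    print "naive:   ", prepend_naive($s),   "\n";
    print "guarded: ", prepend_guarded($s), "\n";
}
# prints:
#   naive:   http://http://www.example.org
#   guarded: http://www.example.org
#   naive:   see http://www.example.org
#   guarded: see http://www.example.org
```

The lookbehind is zero-width: it checks the seven characters before www. without consuming them, so the rest of the text is left untouched.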
extract_url_2.pl
use strict;
use warnings;

# Sample text to scan
my $text = "http://some.url1.com/abc/def.html, bla bla bla:www.url2.qq/ghi/jkl.php?a=2&b=33, moo moo www.miu.url3.qq bar bar bar http://www.url4.xxx.ui?r=19 aldjf";

# Prepend http:// to every www. URL that is not already preceded by http://
$text =~ s@(?<!http://)(www\.\S+[^.,!? ])@http://$1@g;

print "\n";

# After the substitution every URL starts with http://
while ($text =~ m@(http://\S+[^.,!? ])@g) {
  print " ", $1, "\n";
}

And here's the output:

 http://some.url1.com/abc/def.html
 http://www.url2.qq/ghi/jkl.php?a=2&b=33
 http://www.miu.url3.qq
 http://www.url4.xxx.ui?r=19
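As a possible alternative to rewriting the text first, extraction and prepending can be combined in one helper that leaves the input string unchanged. This is a sketch only: the name extract_normalized_urls is hypothetical, and it reuses the pattern from the first script with non-capturing groups.

```perl
use strict;
use warnings;

# Hypothetical helper: extracts the URLs and adds a missing http://
# to each one, without modifying the text that was passed in.
sub extract_normalized_urls {
    my ($text) = @_;
    # Same pattern as in the first script; only the outer group captures.
    my @urls = $text =~ m{((?:http://|www\.)\S+[^.,!? ])}g;
    for my $url (@urls) {
        # Prepend the scheme only where it is missing.
        $url = "http://$url" unless $url =~ m{^http://};
    }
    return @urls;
}

my $text = 'http://some.url1.com/abc/def.html, bla bla bla:www.url2.qq/ghi/jkl.php?a=2&b=33, '
         . 'moo moo www.miu.url3.qq bar bar bar http://www.url4.xxx.ui?r=19 aldjf';

print " $_\n" for extract_normalized_urls($text);
```

For the sample text this should give the same four URLs as the second script. The trade-off is that no lookbehind is needed, because the check for an existing http:// happens on each extracted URL rather than inside the text.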