René Nyffenegger's collection of things on the web
Recursive find and grep with perl

I find it unbelievable how hard it is to grep recursively through files on a Windows system. Many
times I have wished I had something like grep and find on Windows. To counter this shortcoming, I have
written a little Perl script that I use when I need to find certain text below a directory:
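    use strict;
    use warnings;

    use Cwd;
    use File::Find;

    my $search_pattern = $ARGV[0];   # regular expression to look for in the files
    my $file_pattern   = $ARGV[1];   # regular expression the file path must match

    find(\&d, cwd);

    # Called by File::Find for every entry below the current directory.
    sub d {
      my $file = $File::Find::name;

      $file =~ s,/,\\,g;             # use Windows path separators

      return unless -f $file;
      return unless $file =~ /$file_pattern/;

      # 'or do { ... }' makes sure the message is printed before returning;
      # 'print "..." && return' would return without printing anything.
      open my $fh, '<', $file or do { print "couldn't open $file\n"; return; };

      while (my $line = <$fh>) {
        if (my ($found) = $line =~ m/($search_pattern)/o) {
          print "found $found in $file\n";
          last;                      # one hit per file is enough
        }
      }

      close $fh;
    }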
It is called as follows:

    perl find.pl (?i)sorting html

The first argument is the regular expression to search for. I use (?i) to indicate that I want to search
case-insensitively. The second argument is a regular expression as well: only files whose path matches it
(here: html) are searched.
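The effect of the inline (?i) modifier can be checked in isolation. The following is a minimal sketch and not part of find.pl; the helper name matches and the sample text are made up for illustration. It interpolates the pattern from a plain string, which is how find.pl receives it from the command line.

```perl
use strict;
use warnings;

# Hypothetical helper (not in find.pl): returns 1 if $text matches the
# pattern given as a plain string, 0 otherwise.
sub matches {
    my ($pattern, $text) = @_;
    return $text =~ /$pattern/ ? 1 : 0;
}

# Without (?i) the match is case sensitive: "sorting" is not in "Sorting ..."
print matches('sorting',     'Sorting algorithms'), "\n";   # prints 0

# With (?i) in front, the rest of the pattern ignores case
print matches('(?i)sorting', 'Sorting algorithms'), "\n";   # prints 1
```

Because the modifier travels inside the pattern itself, the script needs no extra command line switch for case-insensitive searches.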
Here's a C++ version of a recursive directory descender.
There is also a very flexible grep for Windows (an exe that lets you search recursively
for patterns within files).
By the way, File::Finder is a wrapper around File::Find.
Improved find by Christopher Hilding
Christopher Hilding sent me a mail:
Hey,

Through your help I have finally got a solution to find in files (grep) on Windows!

However, there was a bug where the file_pattern also matched directories. I.e. if you did "find.pl searchpattern htm" to find "searchpattern" in all files containing the word "htm", and the *directory path* to a file contained "htm" too, it would match as if the *file name* had contained the word "htm", even if the file did not have it. The reason is that you match $file (the full path) against the given filepattern, I changed it to $_ which is just the actual file *name*.

I've also commented the file and created a case against trying to execute without valid parameters, since yours used to crash ("invalid regex" errors). It now outputs a help menu.

I also changed it so that the file*name* match is case insensitive by default so you don't have to keep specifying (?i)htm to match HTM, htm, Htm, etc. Just in case there are variations. I compiled the filename-matching regex as it won't be changing and may be used to scan the name of thousands of files depending on what directory you are scanning from.

I also added the option to specify just the search pattern and omitting a filename to make it search all files.

When it comes to the output of the script, I added a file match counter at the end. And an extensive search mode where it shows all the matching lines of the files along with their line numbers. I changed the output format of both search types to a more advanced output with file numbering and clearer list (both for the simple list and extended list).

Umm, that's all I can remember but I probably did even more.

Enjoy! And thanks for starting me off with a great perl sample!

Best Regards,
Christopher H.
    use strict;
    use warnings;

    use Cwd;
    use File::Find;
    use File::Basename;

    my ($in_rgx,$in_files,$simple,$matches,$cwd);

    sub trim($) {
      my $string = shift;
      $string =~ s/[\r\n]+//g;
      $string =~ s/\s+$//;
      return $string;
    }

    # 1: Get input arguments
    if ($#ARGV == 0) {                        # *** ONE ARGUMENT *** (search pattern)
      ($in_rgx,$in_files,$simple) = ($ARGV[0],".",1);
    }
    elsif ($#ARGV == 1) {                     # *** TWO ARGUMENTS *** (search pattern + filename or flag)
      if (($ARGV[1] eq '-e') || ($ARGV[1] eq '-E')) {  # extended
        ($in_rgx,$in_files,$simple) = ($ARGV[0],".",0);
      }
      else {                                  # simple
        ($in_rgx,$in_files,$simple) = ($ARGV[0],$ARGV[1],1);
      }
    }
    elsif ($#ARGV == 2) {                     # *** THREE ARGUMENTS *** (search pattern + filename + flag)
      ($in_rgx,$in_files,$simple) = ($ARGV[0],$ARGV[1],0);
    }
    else {                                    # *** HELP *** (either no arguments or more than three)
      print "Usage: ".basename($0)." regexpattern [filepattern] [-E]\n\n" .
            "Hints:\n" .
            "*) If you need spaces in your pattern, put quotation marks around it.\n" .
            "*) To do a case insensitive match, use (?i) preceding the pattern.\n" .
            "*) Both patterns are regular expressions, allowing powerful searches.\n" .
            "*) The file pattern is always case insensitive.\n";
      exit;
    }

    # 2: Output search header
    if ($in_files eq '.') {
      print basename($0).": Searching all files for \"${in_rgx}\"... (".(($simple) ? "simple" : "extended").")\n";
    }
    else {
      print basename($0).": Searching files matching \"${in_files}\" for \"${in_rgx}\"... (".(($simple) ? "simple" : "extended").")\n";
    }
    if ($simple) { print "\n"; }

    # 3: Traverse directory tree using subroutine 'findfiles'
    ($matches,$cwd) = (0,cwd);
    $cwd =~ s,/,\\,g;
    find(\&findfiles, $cwd);

    sub findfiles {                           # 4: Used to iterate through each result
      my $file = $File::Find::name;           # complete path to the file
      $file =~ s,/,\\,g;                      # substitute all / with \
      return unless -f $file;                 # process files (-f), not directories
      return unless $_ =~ m/$in_files/io;     # check if file matches input regex
                                              # /io = case-insensitive, compiled
                                              # $_ = just the file name, no path

      # 5: Open file and search for matching contents
      #    ('or do { ... }' so that the message is printed before returning;
      #     'print "..." && return' would return without printing)
      open F, '<', $file or do { print "\n* Couldn't open ${file}\n\n"; return; };

      if ($simple) {                          # *** SIMPLE OUTPUT ***
        while (<F>) {
          if (m/($in_rgx)/o) {                # /o = compile regex
            # file matched!
            $matches++;
            print "---" .                     # begin printing file header
                  sprintf("%04d", $matches) . # file number, zero-padded to 4 digits
                  "--- ".$file."\n";          # file name, keep original name
                                              # end of file header
            last;                             # go on to the next file
          }
        }
      }                                       # *** END OF SIMPLE OUTPUT ***
      else {                                  # *** EXTENDED OUTPUT ***
        my $found  = 0;                       # used to keep track of first match
        my $binary = (-B $file) ? 1 : 0;      # don't show contents if file is bin
        $file =~ s/^\Q$cwd//g;                # remove current working directory
                                              # \Q = quotemeta, escapes string
        while (<F>) {
          if (m/($in_rgx)/o) {                # /o = compile regex
            # file matched!
            if (!$found) {                    # first matching line for the file
              $found = 1;
              $matches++;
              print "\n---" .                 # begin printing file header
                    sprintf("%04d", $matches) . # file number, zero-padded to 4 digits
                    "--- ".uc($file)."\n";    # file name, converted to uppercase
                                              # end of file header
              if ($binary) {                  # file is binary, do not show content
                print "Binary file.\n";
                last;
              }
            }
            print "[$.]".trim($_)."\n";       # print line number and contents
            #last;                            # uncomment to only show first line
          }
        }
      }                                       # *** END OF EXTENDED OUTPUT ***

      # 6: Close the file and move on to the next result
      close F;
    }

    #7: Show search statistics
    print "\nMatches: ${matches}\n";

    # Search Engine Source: http://www.adp-gmbh.ch/perl/find.html
    # Rewritten by Christopher Hilding, Dec 02 2006
    # Formatting adjusted to my liking by Rene Nyffenegger, Dec 22 2006
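The path-versus-name bug that Christopher describes in his mail can be reproduced without touching the file system. The following is a minimal sketch under stated assumptions: the path is invented, and basename() from File::Basename stands in for the $_ that File::Find passes to the callback (by default just the file name, without its directory).

```perl
use strict;
use warnings;
use File::Basename;

# Hypothetical path: the directory contains "htm", the file name does not.
my $path = 'C:/projects/htmldocs/readme.txt';

# Compile the file pattern once with qr//; the i flag makes it case insensitive.
my $file_rgx = qr/htm/i;

# Matching the full path reports a hit because of the directory name ...
my $path_hit = $path =~ $file_rgx ? 1 : 0;

# ... while matching only the file name does not.
my $name_hit = basename($path) =~ $file_rgx ? 1 : 0;

print "full path matches: $path_hit\n";   # prints "full path matches: 1"
print "file name matches: $name_hit\n";   # prints "file name matches: 0"
```

A pattern object built with qr// is one way to get the compile-once behaviour that the script requests with the /o flag.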