René Nyffenegger's collection of things on the web
René Nyffenegger on Oracle - Most wanted - Feedback -
 

Recursive find and grep with perl

I find it unbelievable how hard it is to grep recursively on a windows system in files. There are many times I wished I had had something like grep and find on Windows. To counter this shortcoming, I have written a little perl script that I use if I need to find certain text in a directory:
use strict;
use warnings;
use Cwd;

use File::Find;

my $search_pattern=$ARGV[0];
my $file_pattern  =$ARGV[1];

find(\&d, cwd);

sub d {

  my $file = $File::Find::name;

  $file =~ s,/,\\,g;

  return unless -f $file;
  return unless $file =~ /$file_pattern/;

  open F, $file or print "couldn't open $file\n" && return;

  while (<F>) {
    if (my ($found) = m/($search_pattern)/o) {
      print "found $found in $file\n";
      last;
    }
  }

  close F;
}

It is called as follows
perl find.pl (?i)sorting html
The first argument is the regular expression. I use (?i) to indicate that I want to search case insensitive. The second argument specifies that only files should be considered that contain html in their name.
Here's a C++ version of a recursive directory descender.
There is also a very flexible grep for windows (an exe that allows to recursively search for patterns within files).
By the way, File::Finder is a wrapper around File::Find.

Improved find by Christopher Hilding

Christopher Hilding sent me a mail:
Hey,

Through your help I have finally got a solution to find in files (grep) on
Windows! However, there was a bug where the file_pattern also matched
directories. I.e. if you did "find.pl searchpattern htm" to find
"searchpattern" in all files containing the word "htm", and the *directory
path* to a file contained "htm" too, it would match as if the *file name* had
contained the word "htm", even if the file did not have it. The reason is that
you match $file (the full path) against the given filepattern, I changed it to
$_ which is just the actual file *name*.

I've also commented the file and created a case against trying to execute
without valid parameters, since yours used to crash ("invalid regex" errors).
It now outputs a help menu. I also changed it so that the file*name* match is
case insensitive by default so you don't have to keep specifying (?i)htm to
match HTM, htm, Htm, etc. Just in case there are variations.

I compiled the filename-matching regex as it won't be changing and may be used
to scan the name of thousands of files depending on what directory you are
scanning from. I also added the option to specify just the search pattern and
omitting a filename to make it search all files.

When it comes to the output of the script, I added a file match counter at the
end. And an extensive search mode where it shows all the matching lines of the
files along with their line numbers. I changed the output format of both search
types to a more advanced output with file numbering and clearer list (both for
the simple list and extended list). Umm, that's all I can remember but I
probably did even more. Enjoy! And thanks for starting me off with a great perl
sample!

Best Regards, Christopher H.
    
use strict;
use warnings;
use Cwd;

use File::Find;
use File::Basename;

my ($in_rgx,$in_files,$simple,$matches,$cwd);
sub trim($) {
  my $string = shift;
  $string =~ s/[\r\n]+//g;
  $string =~ s/\s+$//;
  return $string;
}

                                      # 1: Get input arguments
if ($#ARGV == 0) {                    # *** ONE ARGUMENT *** (search pattern)
  ($in_rgx,$in_files,$simple) = ($ARGV[0],".",1);
}
elsif ($#ARGV == 1) {                 # *** TWO ARGUMENTS *** (search pattern + filename or flag)
  if (($ARGV[1] eq '-e') || ($ARGV[1] eq '-E')) { # extended
    ($in_rgx,$in_files,$simple) = ($ARGV[0],".",0);
  }
  else { # simple
    ($in_rgx,$in_files,$simple) = ($ARGV[0],$ARGV[1],1);
  }
}
elsif ($#ARGV == 2) {                 # *** THREE ARGUMENTS *** (search pattern + filename + flag)
  ($in_rgx,$in_files,$simple) = ($ARGV[0],$ARGV[1],0);
}
else {                                # *** HELP *** (either no arguments or more than three)
  print "Usage:  ".basename($0)." regexpattern [filepattern] [-E]\n\n" .
        "Hints:\n" .
        "*) If you need spaces in your pattern, put quotation marks around it.\n" .
        "*) To do a case insensitive match, use (?i) preceding the pattern.\n" .
        "*) Both patterns are regular expressions, allowing powerful searches.\n" .
        "*) The file pattern is always case insensitive.\n";
  exit;
}


if ($in_files eq '.') {               # 2: Output search header
  print basename($0).": Searching all files for \"${in_rgx}\"... (".(($simple) ? "simple" : "extended").")\n";
}
else {
  print basename($0).": Searching files matching \"${in_files}\" for \"${in_rgx}\"... (".(($simple) ? "simple" : "extended").")\n";
}


if ($simple) { print "\n"; }          # 3: Traverse directory tree using subroutine 'findfiles'

($matches,$cwd) = (0,cwd);
$cwd =~ s,/,\\,g;
find(\&findfiles, $cwd);

                                      
sub findfiles {                       # 4: Used to iterate through each result
  my $file = $File::Find::name;       # complete path to the file

  $file =~ s,/,\\,g;                  # substitute all / with \

  return unless -f $file;             # process files (-f), not directories
  return unless $_ =~ m/$in_files/io; # check if file matches input regex
                                      # /io = case-insensitive, compiled
                                      # $_ = just the file name, no path

                                      # 5: Open file and search for matching contents
  open F, $file or print "\n* Couldn't open ${file}\n\n" && return;

  if ($simple) {                      # *** SIMPLE OUTPUT ***
    while (<F>) {
      if (m/($in_rgx)/o) {            # /o = compile regex
                 # file matched!
          $matches++;
          print "---" .               # begin printing file header
          sprintf("%04d", $matches) . # file number, padded with 4 zeros
          "--- ".$file."\n";          # file name, keep original name
                                      # end of file header
        last;                         # go on to the next file
      }
    }
  }                                   # *** END OF SIMPLE OUTPUT ***
  else {                              # *** EXTENDED OUTPUT ***
    my $found = 0;                    # used to keep track of first match
    my $binary = (-B $file) ? 1 : 0;  # don't show contents if file is bin
    $file =~ s/^\Q$cwd//g;            # remove current working directory
                                      # \Q = quotemeta, escapes string

    while (<F>) {
      if (m/($in_rgx)/o) {            # /o = compile regex
                                      # file matched!
        if (!$found) {                # first matching line for the file
          $found = 1;
          $matches++;
          print "\n---" .             # begin printing file header
          sprintf("%04d", $matches) . # file number, padded with 4 zeros
          "--- ".uc($file)."\n";      # file name, converted to uppercase
                                      # end of file header
          if ($binary) {              # file is binary, do not show content
            print "Binary file.\n";
            last;
          }
        }
        print "[$.]".trim($_)."\n";   # print line number and contents
        #last;                        # uncomment to only show first line
      }
    }
  }                                   # *** END OF EXTENDED OUTPUT ***

  # 6: Close the file and move on to the next result
  close F;
}

#7: Show search statistics
print "\nMatches: ${matches}\n";

# Search Engine Source: http://www.adp-gmbh.ch/perl/find.html
# Rewritten by Christopher Hilding, Dec 02 2006
# Formatting adjusted to my liking by Rene Nyffenegger, Dec 22 2006