[This is a mid-week bonus item]
Suppose you want to find some dates inside a big string. The problem with dates is that there are so many ways to write them, and even if you can come up with a pattern to get the structure right, can you handle the different locales and languages that use different words to refer to the same day or month?
In Item 42, “Don’t reinvent the regex,” you saw the Regexp::Common module. It provides regular expressions for patterns that many people get wrong because they miss some subtle part of the problem.
Regexp::Common::time’s date handling is quite amazing, though. It’s a plugin, so you need to install it separately. Instead of specifying a regular expression, you can use the -pat option to specify the structure of the date, using a string much like that for strftime, although with some regular expression bits added. From the semi-pattern, it constructs a much more complicated pattern that does the right thing. Since the module gives you a regex object, you can print it to see the pattern:
In this example, you extract the
use Regexp::Common qw(time);

my @lines = `ls -l`;

# The two date shapes to match:
#   May  3  2010
#   Jan 17 18:21
my $date_re = $RE{time}{strftime}{
    -pat => '%b\s+%_d\s+(?:%Y|%_H:%M)'
};

print "Pattern is------\n$date_re\n-------\n";
This pattern reflects the national representation for the en_US locale:
Pattern is------
(?=[SAFOJNMD])(?>Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s+(?:0[1-9]|[12]\d|3[01]|(?<!\d)[1-9])\s+(?:\d{4}|(?:(?=\d)(?:[01]\d|2[0123]|(?<!\d)\d)):(?:[0-5]\d))
-------
You can change your locale, in this case, to tr_TR for Turkish, to get a different pattern that has the same structure, although I don’t know if the Turks write their dates like this:
Pattern is------
(?=[AOTNKEHMŞ])(?>Oca|Şub|Mar|Nis|May|Haz|Tem|Ağu|Eyl|Eki|Kas|Ara)\s+(?:0[1-9]|[12]\d|3[01]|(?<!\d)[1-9])\s+(?:\d{4}|(?:(?=\d)(?:[01]\d|2[0123]|(?<!\d)\d)):(?:[0-5]\d))
-------
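To see where those locale-specific month names come from, here is a minimal sketch that uses only the core POSIX module rather than Regexp::Common::time itself. The helper name month_abbreviations is made up for this illustration, and the locale names en_US.UTF-8 and tr_TR.UTF-8 are assumptions: they must be installed on your system, and they may be spelled differently there.

```perl
use strict;
use warnings;
use POSIX qw(setlocale strftime LC_TIME);

# Hypothetical helper: return the twelve abbreviated month names (%b)
# for a locale, or an empty list if that locale is not available here.
sub month_abbreviations {
    my ($locale) = @_;

    my $previous = setlocale(LC_TIME);              # remember the current setting
    defined setlocale(LC_TIME, $locale) or return;  # locale not installed

    # strftime arguments: format, sec, min, hour, mday, mon (0-11), year - 1900
    my @names = map { strftime('%b', 0, 0, 0, 1, $_, 100) } 0 .. 11;

    # Depending on the Perl version, strftime may return decoded characters
    # or raw bytes; normalize to UTF-8 bytes so that print behaves the same.
    for (@names) { utf8::encode($_) if utf8::is_utf8($_) }

    setlocale(LC_TIME, $previous) if defined $previous;  # restore
    return @names;
}

for my $locale (qw(C en_US.UTF-8 tr_TR.UTF-8)) {
    my @names = month_abbreviations($locale);
    print "$locale: ", (@names ? "@names" : '(locale not installed)'), "\n";
}
```

If the Turkish locale is available, the names it prints should line up with the alternation in the pattern above.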
You can now use this pattern to match dates in text. Here’s a program that takes in a line and puts ^ characters under the parts it thinks are dates:
use Regexp::Common qw(time);

my @lines = `ls -l`;

# The two date shapes to match:
#   May  3  2010
#   Jan 17 18:21
my $date_re = $RE{time}{strftime}{
    -pat => '%b\s+%_d\s+(?:%Y|%_H:%M)'
};

while( defined( my $line = <> ) ) {
    next unless $line =~ /$date_re/;

    # @- and @+ hold the start and end offsets of the last match
    my $start = $-[0];
    my $stop  = $+[0];

    my $underline = ( ' ' x $start ) . ( '^' x ($stop - $start) );

    print $line;
    print $underline, "\n\n";
}
You can test this by piping some output into this program. Here’s an extract of output from the Unix ls command. Notice that the first date has a time instead of a year, but you still find it:
$ ls -l /usr/local/perls/perl-5.10.1/lib/site_perl/5.10.1 | perl date_finder.pl
drwxr-xr-x  4 brian  wheel    136 Dec  9 01:58 Acme
                                  ^^^^^^^^^^^^

-r--r--r--  1 brian  wheel  32517 Jul  6  2007 AppConfig.pm
                                  ^^^^^^^^^^^^

-r--r--r--  1 brian  wheel  54725 Jul 19  2007 Expect.pm
                                  ^^^^^^^^^^^^

-r--r--r--  1 brian  wheel  43735 Jul 19  2007 Expect.pod
                                  ^^^^^^^^^^^^

drwxr-xr-x  3 brian  wheel    102 May 16  2010 ExtUtils
                                  ^^^^^^^^^^^^

drwxr-xr-x  3 brian  wheel    102 Jun 17  2010 local
                                  ^^^^^^^^^^^^

-r--r--r--  1 brian  wheel   9137 Jun 15  2009 lwpcook.pod
                                  ^^^^^^^^^^^^

-r--r--r--  1 brian  wheel  25447 Jun 15  2009 lwptut.pod
                                  ^^^^^^^^^^^^

drwxr-xr-x  4 brian  wheel    136 May 28  2010 namespace
                                  ^^^^^^^^^^^^

-r--r--r--  1 brian  wheel   1931 Sep 22  2009 oose.pm
                                  ^^^^^^^^^^^^
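The carets line up because Perl records where the last successful match started and ended in the special arrays @- and @+. Here is a small sketch of just that part. The helper name underline_match is invented for this example, and the pattern is a simplified stand-in, not the one Regexp::Common::time generates.

```perl
use strict;
use warnings;

# Hypothetical helper: build a line of carets under the first match of
# $re in $line. Returns undef (or an empty list) when nothing matches.
sub underline_match {
    my ($line, $re) = @_;
    return unless $line =~ $re;

    # $-[0] is the offset where the whole match starts,
    # $+[0] is the offset just past where it ends
    my ($start, $stop) = ($-[0], $+[0]);

    return ( ' ' x $start ) . ( '^' x ($stop - $start) );
}

my $line = 'total 136 on Dec  9 01:58';

# Simplified stand-in pattern: month-like word, day, then HH:MM
my $re = qr/[A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}/;

print "$line\n", underline_match($line, $re), "\n";
```

Elements beyond index 0 in the same arrays hold the offsets of the capture groups, so you could underline individual parts of a date the same way.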
Notice that pulling out these dates would be hard to do with split if you run into filenames that have spaces, because the number of fields changes from line to line. You can’t depend on fixed column widths either, because the file sizes can move things around. It turns out to be pretty annoying.
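Here’s a rough sketch of one way split goes wrong: counting fields from the end of the line. Both listing lines and both helper names (date_by_split, date_by_regex) are made up for this example, and the pattern is a hand-written, English-only stand-in for the one the module generates.

```perl
use strict;
use warnings;

# Simplified stand-in for the generated date pattern (English months only)
my $date_re = qr/(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)
                 \s+ \d{1,2} \s+ (?: \d{4} | \d{1,2}:\d{2} )/x;

# Naive approach: assume the filename is exactly one field, so the
# date is the three fields just before the last one
sub date_by_split {
    my ($line) = @_;
    my @fields = split ' ', $line;
    return join ' ', @fields[ -4 .. -2 ];
}

# Pattern approach: find the date wherever it sits in the line
sub date_by_regex {
    my ($line) = @_;
    return $line =~ /($date_re)/ ? $1 : undef;
}

# Hypothetical listing lines; the second filename contains spaces
my @lines = (
    '-r--r--r--  1 brian  wheel   1931 Sep 22  2009 oose.pm',
    '-r--r--r--  1 brian  wheel   1931 Sep 22  2009 my old notes.txt',
);

for my $line (@lines) {
    print "split: ", date_by_split($line), "\n";
    print "regex: ", date_by_regex($line), "\n\n";
}
```

For the second line, the split version returns part of the filename instead of the date, while the pattern still finds the date.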
Woah. Regexp::Common has plugins? Not sleeping tonight is going to be FUN.