Date parsing

2010-10-22 perl parsing date manipulation date test Test::More

Date parsing is always problematic, especially when the dates come directly from users, for example in an Excel spreadsheet. For most of my work I prefer the canonical YYYY-MM-DD or YYMMDD formats, which sort well and are easy to parse. But data comes from various sources. Recently I got data with all dates in a US format: Wed, Sep 17 2003.

Here is a very simple Perl class that implements the parsing. Initially I looked at DateTime::Format::Natural and some others, but I didn't find one that worked for me out of the box. I chose an OO form because I need to parse many such dates, and the initialization is potentially costly compared to the matching code.

package DateParser;

use strict;
use warnings;

sub new {
    my $class = shift;
    my @months = qw( Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec );
    my $num;
    return bless {
        months_tran => { map { lc($_) => ++$num } @months },
        months_re   => join '|', @months,
    }, ref($class) || $class;
}

sub parse {
    my ($self,$string) = @_;
    my $months_re = $self->{months_re};
    if($string =~ /^ \s*
        ((Mon|Tue|Wed|Thu|Fri|Sat|Sun) (?: \s*,\s* | \s+) )?  # day name
        ($months_re) \s+                                      # month name
        (\d+)     (?: \s*,\s* | \s+)                          # day
        (\d+)     \s*                                         # year
        $/xi
    ) {
        my $month = $self->{months_tran}{lc($3)};
        return () unless $month;
        return wantarray ? ($5,$month,$4) : sprintf("%04d-%02d-%02d",$5,$month,$4);
    }
    return ();
}

1;

The DateParser class has a constructor and one method. The constructor, new, blesses an anonymous hash into the class. The hash is constant and looks like this:

{
  months_re => "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec",
  months_tran => {
    'jan' => 1,   'feb' => 2,   'mar' => 3,   'apr' => 4,   
    'may' => 5,   'jun' => 6,   'jul' => 7,   'aug' => 8,   
    'sep' => 9,   'oct' => 10,  'nov' => 11,  'dec' => 12,
  }
}

months_re is interpolated into the parsing regular expression (RE), and months_tran is a hash that converts a month name into its ordinal number.
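The two structures can also be built and tried outside the class. The following is a minimal sketch, not part of DateParser; the variables $input and $month and the anchored one-word pattern are only for illustration.

```perl
use strict;
use warnings;

my @months = qw( Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec );

# Lowercased month name => 1-based month number.
my $num = 0;
my %months_tran = map { lc($_) => ++$num } @months;

# Alternation that can be interpolated into a larger pattern.
my $months_re = join '|', @months;

# Match case-insensitively, then normalise with lc() before the lookup,
# because the hash keys are all lowercase.
my $input = 'SEP';
my $month;
if ($input =~ /\A($months_re)\z/i) {
    $month = $months_tran{ lc $1 };
}

print "$input is month $month\n";   # SEP is month 9
```

The lc() call on both sides (when building the hash and when looking up) is what lets the case-insensitive match and the hash lookup agree.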

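The class's single method, parse, does the parsing. Note that the RE uses the extended syntax (//x), which allows comments and whitespace inside the pattern for clarity, together with //i for case-insensitive matching. In list context the method returns three items (year, month and day); in scalar context it returns the date as a string formatted YYYY-MM-DD. If the input does not match, it returns an empty list, which is undef in scalar context.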
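The context-dependent return value relies on Perl's built-in wantarray. Here is a small sketch of the same convention in isolation; ymd is a hypothetical helper written for this example, not a method of DateParser.

```perl
use strict;
use warnings;

# Hypothetical helper that mimics the return convention of parse():
# a (year, month, day) list in list context, a formatted string otherwise.
sub ymd {
    my ($year, $month, $day) = @_;
    return wantarray
        ? ($year, $month, $day)
        : sprintf("%04d-%02d-%02d", $year, $month, $day);
}

my ($y, $m, $d) = ymd(2003, 9, 17);   # list context
my $iso         = ymd(2003, 9, 17);   # scalar context

print "$y/$m/$d\n";   # 2003/9/17
print "$iso\n";       # 2003-09-17
```

One thing to keep in mind with this design: the argument list of a function call is list context, so passing the result straight to another function gives the three-item list unless the call is wrapped in scalar.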

Test

I used the code below to debug the class. It uses the simple procedural Test::More module with its ok and is functions.

use strict;
use warnings;
use Test::More;
use DateParser;

my $dp = DateParser->new;
ok $dp, "DateParser new";

my %tests = (
    ' Tue, Sep 16 2003 ' => '2003-09-16',
    'Wed, Sep 17 2003'   => '2003-09-17',
    'Thu, Jan 22 2004'   => '2004-01-22',
    'Wed, Mar 31 2004'   => '2004-03-31',
    'Jun 7 2004'         => '2004-06-07',
    '  Jun 7 , 2004 '    => '2004-06-07',
);
for my $input (sort keys %tests) {
    is  scalar $dp->parse($input),  $tests{$input},
        "\"$input\" => $tests{$input}";
}

done_testing;

Sample run:

ok 1 - DateParser new
ok 2 - "  Jun 7 , 2004 " => 2004-06-07
ok 3 - " Tue, Sep 16 2003 " => 2003-09-16
ok 4 - "Jun 7 2004" => 2004-06-07
ok 5 - "Thu, Jan 22 2004" => 2004-01-22
ok 6 - "Wed, Mar 31 2004" => 2004-03-31
ok 7 - "Wed, Sep 17 2003" => 2003-09-17
1..7
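
The tests above cover only scalar context and input that parses. Below is a sketch of a few extra checks for list context, letter case, and rejected input. The test inputs are my own examples, and the class is repeated inline only so that the file runs on its own; with DateParser.pm on @INC, a plain use DateParser; would replace that block.

```perl
use strict;
use warnings;
use Test::More;

# Inline copy of the class from above, so this file is self-contained.
{
    package DateParser;

    sub new {
        my $class = shift;
        my @months = qw( Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec );
        my $num = 0;
        return bless {
            months_tran => { map { lc($_) => ++$num } @months },
            months_re   => join('|', @months),
        }, ref($class) || $class;
    }

    sub parse {
        my ($self, $string) = @_;
        my $months_re = $self->{months_re};
        if ($string =~ /^ \s*
            ((Mon|Tue|Wed|Thu|Fri|Sat|Sun) (?: \s*,\s* | \s+) )?  # day name
            ($months_re) \s+                                      # month name
            (\d+)     (?: \s*,\s* | \s+)                          # day
            (\d+)     \s*                                         # year
            $/xi
        ) {
            my $month = $self->{months_tran}{lc($3)};
            return () unless $month;
            return wantarray ? ($5, $month, $4)
                             : sprintf("%04d-%02d-%02d", $5, $month, $4);
        }
        return ();
    }
}

my $dp = DateParser->new;

# List context yields (year, month, day).
is_deeply( [ $dp->parse('Wed, Sep 17 2003') ], [ 2003, 9, 17 ], 'list context' );

# The /i flag makes day and month names case-insensitive.
is( scalar $dp->parse('wed, sep 17 2003'), '2003-09-17', 'lowercase input' );

# Input that does not match gives undef in scalar context ...
is( scalar $dp->parse('2003-09-17'), undef, 'ISO input is rejected' );

# ... and an empty list in list context.
is_deeply( [ $dp->parse('17 Sep 2003') ], [], 'day-first input is rejected' );

done_testing;
```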