Simple web scraping with perl
2011-03-29 perl LWP::Simple HTML::TreeBuilder URI EncodingThere is something on web that I would like to use. The following article goes on how to locate and get web-pages, parse and extract data from it.
The code below is real-life problem I needed to solve. There was nice web kaloricketabulky.cz (calorie tables in english) that I used to find out what calorie consumption I had on various activities like swimming.
Building blocks:
- how to get a page from specific URL
- how to parse HTML string and find specific contents
- how to compose new URL from original and relative link
Request data from web
We can use LWP::Simple library to get a resource from specific url:
use LWP::Simple qw(get);
my $data = get('http://www.example.com');
Parse contents of a web page
The HTML::TreeBuilder provides nice interface for parsing of the HTML page and bunch of nice methods to extract items based on its attributes.
my $html = HTML::TreeBuilder->new_from_content($html_string);
for my $link ($html->look_down(id => 'katalog')->find('a')) {
# do something with links in the "katalog" element
print $link->attr('href');
}
URL composition
Is easy with URI module:
my $url = URI->new_abs($link, $base_link); # link can be either absolute or relative
Put together
use LWP::Simple qw(get);
use HTML::TreeBuilder;
use URI;
my $start_page
= q{http://www.kaloricketabulky.cz/tabulka-aktivit.php?pismeno=A};
my $html = HTML::TreeBuilder->new_from_content(get($start_page));
my @list_page_urls
= map { URI->new_abs($_->attr('href'),$start_page)->as_string }
$html->look_down(id => 'katalog')->find('a');
$html->delete;
my @items = ();
for my $page_url (@list_page_urls) {
my $html = HTML::TreeBuilder->new_from_content(get($page_url));
my $item_table_html = $html->look_down(class => 'vypis');
my @rows = $item_table_html->find('tr');
my $header = shift @rows;
for my $row (@rows) {
my @cols = map { $_->as_text } $row->find('td');
push @items, [ @cols ];
}
$html->delete;
}
open(my $out,">:encoding(cp1250):utf8", "activities.txt");
for my $item (@items) {
print {$out} join("\t",@$item), "\n";
}