Web Scraping with Perl: A Comprehensive Guide Using ProxyTee

Perl, known for its versatility and extensive module library, stands out as a powerful language for web scraping. In this post, we’ll explore how to use Perl for web scraping, demonstrating various techniques and highlighting how ProxyTee can enhance your scraping projects. We’ll cover modules such as LWP::UserAgent, HTML::TreeBuilder, Web::Scraper, Mojo::UserAgent, Mojo::DOM, and XML::LibXML, then look at the common challenges of web scraping and show how ProxyTee’s features help overcome them.
Web Scraping with Perl
Before we dive into code, ensure you have a recent version of Perl installed; the code examples were tested with Perl 5.38.2. Basic familiarity with installing Perl modules using cpanm is also helpful. Throughout this post, we’ll extract quotes from https://quotes.toscrape.com/, an example website built for practicing scraping techniques. Inspecting the page shows that each quote sits inside a div with the class quote, containing a span with the class text for the quote text and a small element for the author’s name.
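For reference, each quote is marked up roughly like this (simplified; the live page carries extra attributes and links):

<div class="quote">
    <span class="text">“A witty quote…”</span>
    <span>by <small class="author">Author Name</small></span>
    <div class="tags">Tags: <a class="tag" href="…">…</a></div>
</div>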
Using LWP::UserAgent and HTML::TreeBuilder
The LWP::UserAgent module makes HTTP requests, while HTML::TreeBuilder parses HTML content; together they provide the fundamental components of a web scraper.
Install the modules with:
cpanm Bundle::LWP
cpanm HTML::Tree
Here’s an example snippet that scrapes the quotes:
use strict;
use warnings;
use LWP::UserAgent;
use HTML::TreeBuilder;

my $ua = LWP::UserAgent->new;
$ua->agent("Quotes Scraper");

my $url = "https://quotes.toscrape.com/";
my $response = $ua->get($url);

if ($response->is_success) {
    # Build a parse tree from the fetched HTML
    my $root = HTML::TreeBuilder->new();
    $root->parse($response->decoded_content);
    $root->eof;

    # Find every quote container
    my @quotes = $root->look_down(
        _tag  => 'div',
        class => 'quote'
    );

    foreach my $quote (@quotes) {
        my $text = $quote->look_down(
            _tag  => 'span',
            class => 'text'
        )->as_text;
        my $author = $quote->look_down(
            _tag  => 'small',
            class => 'author'
        )->as_text;
        print "$text: $author\n";
    }

    $root->delete; # free the parse tree
} else {
    print "Request failed: " . $response->status_line . "\n";
}
This code makes an HTTP request with LWP::UserAgent and parses the returned HTML with HTML::TreeBuilder; once the document is parsed, the text and author of each quote are extracted.
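If you want to route these requests through a proxy such as ProxyTee’s residential endpoints, LWP::UserAgent can be pointed at one directly. Here is a minimal sketch; the host, port, and credentials below are hypothetical placeholders, so substitute the values from your own ProxyTee dashboard:

use LWP::UserAgent;

my $ua = LWP::UserAgent->new;

# Hypothetical placeholder endpoint; replace with your
# ProxyTee host, port, and credentials.
$ua->proxy(['http', 'https'], 'http://username:password@proxy.example.com:8080');

my $response = $ua->get('https://quotes.toscrape.com/');
print $response->is_success ? "Fetched via proxy\n" : $response->status_line . "\n";

Every subsequent request made with this user agent is then tunneled through the configured proxy.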
Using Web::Scraper
Web::Scraper provides a Domain Specific Language (DSL) for extracting data from HTML and XML documents. Scrapers are declared in this DSL, which keeps the extraction logic concise and convenient.
Install the module with:
cpanm Web::Scraper
Here’s a scraping example with the DSL:
use strict;
use warnings;
use URI;
use Web::Scraper;
use Encode;

# Declare the extraction logic with Web::Scraper's DSL
my $quotes = scraper {
    # Collect every div.quote element into the "quotes" array
    process 'div.quote', "quotes[]" => scraper {
        process_first "span.text", text   => 'TEXT';
        process_first "small",     author => 'TEXT';
    };
};

my $res = $quotes->scrape( URI->new("https://quotes.toscrape.com/") );

for my $quote (@{$res->{quotes}}) {
    print Encode::encode("utf8", "$quote->{text}: $quote->{author}\n");
}
In this snippet, the scraper block defines the extraction logic, pulling the quotes from div elements with the class quote. The process_first method then finds the first matching tag inside each quote and extracts its content as text.
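The DSL composes nicely. For example, to also capture each quote’s tags, you could add a repeated process rule inside the nested scraper; this sketch assumes the a.tag links that quotes.toscrape.com uses for tags:

my $quotes = scraper {
    process 'div.quote', "quotes[]" => scraper {
        process_first "span.text", text   => 'TEXT';
        process_first "small",     author => 'TEXT';
        # Collect all of the quote's tag links into an array
        process "a.tag", "tags[]" => 'TEXT';
    };
};

Each entry in $res->{quotes} would then carry a tags array reference alongside text and author.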
Using Mojo::UserAgent and Mojo::DOM
Mojo::UserAgent and Mojo::DOM, part of the Mojolicious framework, offer a more modern approach to fetching and parsing web content, playing roles similar to LWP::UserAgent and HTML::TreeBuilder. These modules are a great choice when web content needs to be scraped in real time.
Install the modules with:
cpanm Mojo::UserAgent
cpanm Mojo::DOM
The following example shows how to use Mojo::UserAgent and Mojo::DOM:
use strict;
use warnings;
use Mojo::UserAgent;
use Mojo::DOM;

my $ua = Mojo::UserAgent->new;
my $res = $ua->get('https://quotes.toscrape.com/')->result;

if ($res->is_success) {
    # Parse the response body into a DOM tree
    my $dom = Mojo::DOM->new($res->body);

    # Find every quote container with a CSS selector
    my @quotes = $dom->find('div.quote')->each;

    foreach my $quote (@quotes) {
        my $text   = $quote->find('span.text')->map('text')->join;
        my $author = $quote->find('small.author')->map('text')->join;
        print "$text: $author\n";
    }
} else {
    print "Request failed: " . $res->message . "\n";
}
As shown, Mojo::UserAgent sends the HTTP request and Mojo::DOM parses the HTML. The find method lets you query the target content with CSS selectors and extract it as needed.
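Mojo::UserAgent also supports proxies out of the box through its proxy attribute. A minimal sketch, again with a hypothetical placeholder endpoint standing in for your ProxyTee credentials:

use Mojo::UserAgent;

my $ua = Mojo::UserAgent->new;

# Hypothetical placeholder; substitute your ProxyTee host, port, and credentials
$ua->proxy->http('http://username:password@proxy.example.com:8080')
          ->https('http://username:password@proxy.example.com:8080');

my $res = $ua->get('https://quotes.toscrape.com/')->result;
print $res->is_success ? "Fetched via proxy\n" : $res->message . "\n";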
Using XML::LibXML
XML::LibXML is a robust module that can parse and query XML/HTML content using XPath expressions.
Install the module with:
cpanm XML::LibXML
The following example shows how to parse and query the document:
use strict;
use warnings;
use LWP::UserAgent;
use XML::LibXML;
use open qw( :std :encoding(UTF-8) );

my $ua = LWP::UserAgent->new;
$ua->agent("Quotes Scraper");

my $url = "https://quotes.toscrape.com/";
my $response = $ua->get($url);

if ($response->is_success) {
    # Parse the HTML leniently, recovering from imperfect markup
    my $dom = XML::LibXML->load_html(
        string          => $response->decoded_content,
        recover         => 1,
        suppress_errors => 1
    );

    # Select every quote container with XPath
    my $xpath = '//div[@class="quote"]';
    foreach my $quote ($dom->findnodes($xpath)) {
        my ($text)   = $quote->findnodes('.//span[@class="text"]')->to_literal_list;
        my ($author) = $quote->findnodes('.//small[@class="author"]')->to_literal_list;
        print "$text: $author\n";
    }
} else {
    print "Request failed: " . $response->status_line . "\n";
}
XML::LibXML parses the HTML document with its load_html method; the recover option lets the parser handle imperfect markup gracefully, and XPath expressions then query the target information. Of the approaches shown here, this is the most powerful way to query a document.
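XPath also supports refinements beyond exact class matches. Reusing the $dom from the example above, here is a short sketch that counts the quotes on the page and matches tag links with contains(), which is handy when elements carry several classes:

# Count all quotes on the page
my $count = $dom->findvalue('count(//div[@class="quote"])');
print "Found $count quotes\n";

# Tags of the first quote, matched with contains() instead of equality
foreach my $tag ($dom->findnodes('(//div[@class="quote"])[1]//a[contains(@class, "tag")]')) {
    print "Tag: ", $tag->textContent, "\n";
}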
Challenges of Web Scraping in Perl
Web scraping comes with various challenges, but with ProxyTee’s robust tools and services, many of them are easily handled:
- Pagination: Many websites use pagination to manage large volumes of data, so it’s crucial to identify and navigate through each page (see the sketch after this list). ProxyTee’s Unlimited Residential Proxies let users perform large-scale web scraping tasks with unlimited bandwidth, comfortably handling big pagination jobs.
- Rotating Proxies: To protect anonymity and avoid IP bans, it’s essential to use rotating residential proxies. ProxyTee offers auto-rotating IPs that let users effortlessly rotate IP addresses, with customizable rotation intervals. This flexibility is crucial for evading IP bans and keeps data collection running without interruptions.
- Honeypot Traps: ProxyTee’s vast, high-quality IP pool mitigates this risk, making it less likely for bots to stumble upon these traps compared to typical proxies.
- Solving CAPTCHAs: CAPTCHAs can be a hurdle, but ProxyTee’s infrastructure, with its vast network of residential IP addresses, can greatly reduce how often they appear.
- Scraping Dynamic Websites: Modern web pages are often dynamic, using JavaScript to load content, so the scraper needs the ability to handle dynamic pages, typically with technologies like headless browsers. ProxyTee offers robust and reliable residential proxies that pair well with such tools when scraping dynamic pages.
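To make the pagination point concrete, here is a minimal sketch that follows the site’s Next link page by page with Mojo::UserAgent. The commented-out proxy line is a hypothetical placeholder; with a ProxyTee auto-rotating gateway configured there, each request can surface from a different residential IP:

use strict;
use warnings;
use Mojo::UserAgent;
use Mojo::URL;

my $ua = Mojo::UserAgent->new;

# Hypothetical placeholder; a ProxyTee auto-rotating gateway would go here
# $ua->proxy->http('http://username:password@proxy.example.com:8080');

my $url = Mojo::URL->new('https://quotes.toscrape.com/');
while ($url) {
    my $res = $ua->get($url)->result;
    last unless $res->is_success;

    my $dom = $res->dom;

    # Print the quotes on the current page
    $dom->find('div.quote')->each(sub {
        my $quote = shift;
        print $quote->at('span.text')->text, ": ",
              $quote->at('small.author')->text, "\n";
    });

    # Follow the "Next" link if there is one, otherwise stop
    my $next = $dom->at('li.next > a');
    $url = $next ? Mojo::URL->new($next->{href})->to_abs($url) : undef;
}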