Web Scraping with Cheerio: A Comprehensive Guide Using ProxyTee
Web scraping has become an essential technique for businesses, researchers, and developers looking to extract valuable data from the internet. Whether it's gathering pricing information, monitoring competitors, or automating research, web scraping can provide an efficient way to collect data at scale.
One of the most popular libraries for web scraping in JavaScript is Cheerio, a fast and lightweight tool designed to parse and manipulate HTML and XML documents. This guide will explore how to effectively use Cheerio for web scraping while integrating ProxyTee, a leading provider of rotating residential proxies, to enhance reliability, anonymity, and efficiency.
ProxyTee Overview
ProxyTee is a provider of rotating residential proxies for web scraping, streaming, and other tasks that require anonymity and IP rotation. It offers unlimited bandwidth, a large global pool of IP addresses, support for multiple protocols, automatic rotation, and an API for integration, with a focus on ease of use and competitive pricing.
Key features that make ProxyTee a standout choice:
- Unlimited Bandwidth: ProxyTee provides proxies with unlimited bandwidth, eliminating concerns about data overages during high-traffic tasks like web scraping or streaming. This feature ensures users can perform data-intensive operations without worrying about additional costs.
- Global IP Coverage: With over 20 million IP addresses from more than 100 countries, ProxyTee ensures that users have access to a wide range of geographic locations. This makes it an ideal solution for businesses and individuals targeting specific regions or performing location-based tasks.
- Multiple Protocol Support: ProxyTee supports both HTTP and SOCKS5 protocols, ensuring compatibility with a variety of applications and tools. This flexibility allows users to handle tasks like web scraping, bypassing geo-blocks, and more with ease.
- User-Friendly Interface: ProxyTee prioritizes simplicity with a clean, intuitive graphical user interface (GUI). This easy-to-use platform allows users to get started quickly, with minimal setup time and no technical expertise required.
- Auto Rotation: The auto-rotation feature changes IP addresses automatically at intervals of 3 to 60 minutes. This is particularly valuable for web scraping, where frequent IP changes make detection and bans by target websites less likely. The rotation interval can be customized to suit different needs.
- API Integration: ProxyTee offers a simple API that enables seamless integration with various applications and workflows. This API supports all the service features, making it an ideal choice for developers and businesses looking to automate their proxy-related tasks.
Web Scraping with Cheerio
Cheerio is a fast and flexible library for parsing and manipulating HTML and XML documents. It implements a subset of jQuery features, making it easy for anyone familiar with jQuery to get started. Under the hood, Cheerio uses libraries like parse5 and htmlparser2 to parse documents efficiently.
1️⃣ Setting Up the Project
To begin, ensure you have Node.js installed. If not, download and install it from the official Node.js website. Afterward, set up your project with these steps:
- Create a new directory:
mkdir cheerio-demo && cd cheerio-demo
- Initialize an npm project:
npm init -y
- Install Cheerio and Axios:
npm install cheerio axios
Next, create a file called index.js and include the following code to import the necessary modules:
const axios = require("axios");
const cheerio = require("cheerio");
2️⃣ Writing the Scraper
For demonstration purposes, we will scrape book information from Books to Scrape, a sandbox website commonly used for testing web scrapers.
// Fetch the catalogue page and print one object per book.
axios.get("https://books.toscrape.com/")
  .then(response => {
    // Load the HTML into Cheerio so it can be queried with jQuery-style selectors.
    const $ = cheerio.load(response.data);
    $("article.product_pod").each((i, element) => {
      const title = $(element).find("h3 a").text();
      const price = $(element).find(".price_color").text();
      const availability = $(element).find(".instock.availability").text().trim();
      console.log({ title, price, availability });
    });
  })
  .catch(error => console.error(error));
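The price arrives as text with a currency symbol attached, which is fine for display but not for sorting or arithmetic. A small helper can convert it to a number. This is only a sketch: parsePrice is a hypothetical name (not part of Cheerio or Axios), and it assumes a dot as the decimal separator and commas as thousands separators.

```javascript
// Hypothetical helper: extract the first number from a price string.
// Returns null when the text contains no number at all.
function parsePrice(text) {
  // Drop thousands separators, then match digits with an optional fraction.
  const match = String(text).replace(/,/g, "").match(/(\d+(?:\.\d+)?)/);
  return match ? Number(match[1]) : null;
}

console.log(parsePrice("£51.77"));
console.log(parsePrice("In stock"));
```

Because the helper ignores everything before the first digit, it also tolerates a mis-decoded currency symbol in front of the amount.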
3️⃣ Extracting Additional Data
To extract the book’s star rating, inspect the HTML structure and look for the p.star-rating element. Its second class name (One through Five) encodes the rating:
// Runs inside the .then() callback, after cheerio.load().
$("article.product_pod").each((i, element) => {
  const title = $(element).find("h3 a").text();
  const price = $(element).find(".price_color").text();
  const availability = $(element).find(".instock.availability").text().trim();
  // The class attribute looks like "star-rating Three"; fall back to "" if it is missing.
  const ratingClass = ($(element).find("p.star-rating").attr("class") || "").split(" ")[1];
  const ratingsMap = { One: 1, Two: 2, Three: 3, Four: 4, Five: 5 };
  const rating = ratingsMap[ratingClass] || "Unknown";
  console.log({ title, price, availability, rating });
});
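Taking the second token of the class attribute works as long as the class order never changes and the element is always present. A slightly more defensive variant looks for any known rating word among the classes. The sketch below is an illustration, and ratingFromClass is a hypothetical helper name.

```javascript
const RATINGS = { One: 1, Two: 2, Three: 3, Four: 4, Five: 5 };

// Hypothetical helper: map a class attribute such as "star-rating Three"
// to a number, regardless of class order. Returns null when no rating is found.
function ratingFromClass(classAttr) {
  if (!classAttr) return null; // attr("class") is undefined when the element is missing
  for (const token of classAttr.trim().split(/\s+/)) {
    if (Object.prototype.hasOwnProperty.call(RATINGS, token)) {
      return RATINGS[token];
    }
  }
  return null;
}

console.log(ratingFromClass("star-rating Three"));
console.log(ratingFromClass(undefined));
```

Inside the loop it would be called as ratingFromClass($(element).find("p.star-rating").attr("class")).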
4️⃣ Saving Scraped Data
After extracting data, it’s beneficial to save it for further analysis. We can save the results in a CSV file.
Install the CSV package:
npm install csv-writer
Modify the script to write data to a CSV file:
// Add to index.js, below the axios and cheerio imports.
const { createObjectCsvWriter } = require("csv-writer");

// Each header entry maps a record key (id) to a column title.
const csvWriter = createObjectCsvWriter({
  path: "books.csv",
  header: [
    { id: "title", title: "Title" },
    { id: "price", title: "Price" },
    { id: "availability", title: "Availability" },
    { id: "rating", title: "Rating" }
  ]
});

const ratingsMap = { One: 1, Two: 2, Three: 3, Four: 4, Five: 5 };

axios.get("https://books.toscrape.com/")
  .then(response => {
    const $ = cheerio.load(response.data);
    const books = [];
    $("article.product_pod").each((i, element) => {
      const title = $(element).find("h3 a").text();
      const price = $(element).find(".price_color").text();
      const availability = $(element).find(".instock.availability").text().trim();
      const ratingClass = ($(element).find("p.star-rating").attr("class") || "").split(" ")[1];
      const rating = ratingsMap[ratingClass] || "Unknown";
      books.push({ title, price, availability, rating });
    });
    // writeRecords returns a promise that resolves once the file is written.
    return csvWriter.writeRecords(books);
  })
  .then(() => console.log("Data saved to books.csv"))
  .catch(error => console.error(error));
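Book titles often contain commas or quotation marks, which is the main reason to use a CSV library rather than joining strings with commas. The quoting rule such a library has to apply can be sketched in a few lines. This is an illustrative, dependency-free sketch with hypothetical function names, not a description of how csv-writer is implemented.

```javascript
// Quote a field only when it contains a comma, quote, or line break;
// embedded quotes are doubled, as is conventional for CSV.
function csvField(value) {
  const s = String(value ?? "");
  return /[",\r\n]/.test(s) ? `"${s.replace(/"/g, '""')}"` : s;
}

// Build a CSV string from an array of records and an ordered list of keys.
function toCsv(records, columns) {
  const lines = [columns.map(csvField).join(",")];
  for (const record of records) {
    lines.push(columns.map(key => csvField(record[key])).join(","));
  }
  return lines.join("\n") + "\n";
}

console.log(toCsv([{ title: "A, B", rating: 3 }], ["title", "rating"]));
```

For anything beyond a quick experiment, the library remains the safer choice because it also handles file writing and appending.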
The same fields can also be extracted by traversing the DOM relative to the h3 element instead of relying on class selectors. This variant prints the title, price, availability, and rating:
$("article.product_pod").each((i, element) => {
  const titleH3 = $(element).find("h3");
  const title = titleH3.find("a").text();
  // The element right after the h3 holds the price and availability.
  const priceDiv = titleH3.next();
  const price = priceDiv.children().eq(0).text().trim();
  const availability = priceDiv.children().eq(1).text().trim();
  const ratingP = $(element).find("p.star-rating");
  const starRating = ratingP.attr("class");
  // The second class name (One through Five) encodes the rating.
  const rating = { One: 1, Two: 2, Three: 3, Four: 4, Five: 5 }[starRating.split(" ")[1]];
  console.log(title, price, availability, rating);
});
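To send these requests through a proxy such as ProxyTee, Axios accepts a proxy option in its request config. The sketch below turns a proxy URL into that option. Everything specific in it is an assumption made for illustration: the host proxy.example.com, the port, the credentials, the PROXY_URL environment variable, and the helper name buildAxiosProxy. The real endpoint and credentials come from your proxy provider's dashboard. The sketch covers HTTP proxies only, and depending on the Axios version, HTTPS targets may need a tunneling agent instead.

```javascript
// Hypothetical helper: convert a proxy URL into the object shape
// that Axios expects for its `proxy` request option.
function buildAxiosProxy(proxyUrl) {
  const url = new URL(proxyUrl);
  const proxy = {
    protocol: url.protocol.replace(":", ""),
    host: url.hostname,
    // URL.port is empty when the scheme's default port is used.
    port: url.port ? Number(url.port) : (url.protocol === "https:" ? 443 : 80)
  };
  if (url.username) {
    // Credentials in a URL are percent-encoded, so decode them first.
    proxy.auth = {
      username: decodeURIComponent(url.username),
      password: decodeURIComponent(url.password)
    };
  }
  return proxy;
}

// Placeholder endpoint and credentials, not a real ProxyTee address.
const exampleProxy = buildAxiosProxy("http://user:p%40ss@proxy.example.com:8080");
console.log(exampleProxy);

// Usage sketch (PROXY_URL is an assumed environment variable):
// axios.get("https://books.toscrape.com/", { proxy: buildAxiosProxy(process.env.PROXY_URL) });
```

Keeping the proxy URL in an environment variable keeps credentials out of the source file and makes it easy to switch endpoints without touching the scraper.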
Conclusion
Cheerio is a powerful tool for web scraping static HTML pages, but it works best when combined with a robust proxy service like ProxyTee. Using ProxyTee’s rotating residential proxies ensures higher success rates by preventing detection, avoiding IP bans, and enabling geo-targeted scraping.
Whether you're gathering data for competitive analysis, market research, or price monitoring, integrating ProxyTee into your scraping workflow enhances efficiency, security, and anonymity.
If you're looking for an affordable and reliable proxy solution, explore ProxyTee’s pricing and discover how it can optimize your web scraping process. With features like unlimited bandwidth, a global IP pool, and auto-rotating proxies, ProxyTee provides an unparalleled solution for seamless web scraping.