Web Scraping with Puppeteer: A Comprehensive Guide for 2025
Web scraping and automation with JavaScript have significantly advanced. While various methods exist for accessing and parsing web pages, this guide will focus on leveraging Google Puppeteer with ProxyTee for efficient scraping.
Why Automate Web Scraping?
Traditionally, web pages are accessed and parsed using methods like sending a direct GET request to receive HTML content and then using libraries such as Cheerio for parsing. Although this approach is fast, it struggles with dynamic sites—those that rely heavily on JavaScript for rendering. Headless browsers offer a better solution.
What is a Headless Browser?
A headless browser is a browser without a graphical user interface (GUI). It provides full browser functionality while operating in the background, making it faster and less resource-intensive. Popular browsers like Chrome and Firefox support headless modes, but Puppeteer—a Node.js library—works seamlessly with Chromium, an open-source browser that serves as the foundation for browsers like Microsoft Edge, Opera, and Brave.
Why Puppeteer?
Puppeteer stands out as a fast, lightweight, and versatile library for web scraping. It’s designed for JavaScript developers and allows full control over Chromium. For Python developers, Pyppeteer—an unofficial port of Puppeteer—offers similar functionality, integrating with Python’s asyncio for smooth operations.
Below are example code snippets that take a screenshot with both Puppeteer and Pyppeteer:
Puppeteer (JavaScript):
const puppeteer = require('puppeteer');
async function main() {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://proxytee.com/');
await page.screenshot({ path: 'proxytee_js.png' });
await browser.close();
}
main();
Pyppeteer (Python):
import asyncio
import pyppeteer
async def main():
browser = await pyppeteer.launch()
page = await browser.newPage()
await page.goto('https://proxytee.com/')
await page.screenshot({ 'path': 'proxytee_python.png' })
await browser.close()
asyncio.run(main())
This article will focus on Puppeteer, starting with installation instructions.
Installation and Setup
Prerequisites
- Node.js (comes with npm)
- A code editor (e.g., VS Code)
Steps to Install Puppeteer
- Install Node.js: Download Node.js from its official website.
- Initialize a Project:
npm init -y
This command generates a `package.json` file to manage project dependencies.
- Install Puppeteer:
npm install puppeteer
Puppeteer downloads a compatible Chromium version by default, ensuring seamless integration.
Getting Started with Puppeteer
Puppeteer's API is asynchronous, so all of the examples in this guide use `async`/`await` syntax. If you also want to route requests through ProxyTee proxies, check out ProxyTee's simple API.
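As a sketch of what proxy integration can look like: Chromium accepts a `--proxy-server` launch flag, and Puppeteer's `page.authenticate()` supplies proxy credentials. The endpoint and credentials below are placeholders, not real ProxyTee values; substitute the details from your own proxy dashboard.

```javascript
const puppeteer = require('puppeteer');

(async () => {
  // Placeholder proxy endpoint: replace with your provider's host and port
  const browser = await puppeteer.launch({
    args: ['--proxy-server=http://proxy.example.com:8080'],
  });
  const page = await browser.newPage();
  // Placeholder credentials for the proxy's HTTP auth challenge
  await page.authenticate({ username: 'user', password: 'pass' });
  await page.goto('https://proxytee.com/');
  await browser.close();
})();
```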
Simple Example of Using Puppeteer
Create an `example1.js` file and add the code below:
const puppeteer = require('puppeteer');
(async () => {
// Add code here
})();
The `require` call loads the Puppeteer library, and the immediately invoked async function provides a scope where `await` can be used.
Next, launch the browser. By default, it starts in headless mode:
const browser = await puppeteer.launch();
If a visible browser window is needed, pass the `headless` option:
const browser = await puppeteer.launch({ headless: false }); // default is true
Now create a page, which represents a browser tab:
const page = await browser.newPage();
A website can be loaded with the function `goto()`:
await page.goto('https://proxytee.com/');
Once the page is loaded, take a screenshot:
await page.screenshot({ path: 'proxytee_1080.png' });
By default, screenshots are taken at 800x600. To change that, use the `setViewport()` method:
await page.setViewport({ width: 1920, height: 1080 });
Finally, close the browser once the work is done.
await browser.close();
Here is the complete script for taking a screenshot:
const puppeteer = require('puppeteer');
(async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.setViewport({ width: 1920, height: 1080 });
await page.goto('https://proxytee.com/');
await page.screenshot({ path: 'proxytee_1080.png' });
await browser.close();
})();
Run this script with:
node example1.js
This generates a new file named `proxytee_1080.png` in the same folder.
Bonus Tip: To generate a PDF file, use `pdf()`:
await page.pdf({ path: 'proxytee.pdf', format: 'A4' });
Scraping an Element from a Page
Puppeteer loads the complete DOM, allowing you to extract any element. The `evaluate()` method runs JavaScript inside the page's context and lets you extract any data. Use `document.querySelector()` to target specific elements.
Let’s extract the title of the Wikipedia page about web scraping. Use `Inspect` in the browser's developer tools to find that the heading element's id is `firstHeading`. Then, in the `Console` tab of the developer tools, enter this line:
document.querySelector('#firstHeading')
You can get the element's text content with:
document.querySelector('#firstHeading').textContent
To run this via the `evaluate()` method, wrap it as follows:
await page.evaluate(() => {
return document.querySelector("#firstHeading").textContent;
});
Here is the full code:
const puppeteer = require("puppeteer");
(async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto("https://en.wikipedia.org/wiki/Web_scraping");
const title = await page.evaluate(() => {
return document.querySelector("#firstHeading").textContent.trim();
});
console.log(title);
await browser.close();
})();
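As an aside, Puppeteer also provides `page.$eval()`, a shorthand that combines the selector lookup and the page-context callback in one call. A sketch of the same title extraction using it:

```javascript
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://en.wikipedia.org/wiki/Web_scraping');
  // $eval runs the callback in the page context with the matched element
  const title = await page.$eval('#firstHeading', el => el.textContent.trim());
  console.log(title);
  await browser.close();
})();
```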
Scraping Multiple Elements
Extracting multiple elements follows these steps:
- Use `querySelectorAll` to select all matching elements:
const headings_elements = document.querySelectorAll("h2 .mw-headline");
- Convert the `NodeList` into an array:
const headings_array = Array.from(headings_elements);
- Map each element to its text content:
return headings_array.map(heading => heading.textContent);
Below is the full script for extracting multiple items from a website:
const puppeteer = require("puppeteer");
(async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto("https://en.wikipedia.org/wiki/Web_scraping");
const headings = await page.evaluate(() => {
const headings_elements = document.querySelectorAll("h2 .mw-headline");
const headings_array = Array.from(headings_elements);
return headings_array.map(heading => heading.textContent);
});
console.log(headings);
await browser.close();
})();
Bonus Tip: Depending on your preference, you can also pass the mapping function directly to `Array.from()`:
const headings = await page.evaluate(() => {
return Array.from(document.querySelectorAll("h2 .mw-headline"), heading => heading.innerText.trim());
});
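Similarly, Puppeteer's `page.$$eval()` selects all matching elements and hands the resulting array to a callback, which can replace the `querySelectorAll`/`Array.from` boilerplate. A sketch:

```javascript
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://en.wikipedia.org/wiki/Web_scraping');
  // $$eval passes an array of all matched elements to the callback
  const headings = await page.$$eval('h2 .mw-headline',
    els => els.map(el => el.textContent.trim()));
  console.log(headings);
  await browser.close();
})();
```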
Scraping a Hotel Listing Page
This section demonstrates how to scrape a listing page into JSON output. You can apply the same approach to other types of listing pages. We’ll use an Airbnb page with 20 hotels.
Note: Website structures change often, so you need to recheck selectors every time.
The container elements of the hotel listing cards can be selected like this:
const root = Array.from(document.querySelectorAll('div[data-testid="card-container"]'));
This returns 20 elements, which we will pass to the `map()` function. Within `map()`, we’ll extract the name and the image URL of each hotel.
const hotels = root.map(hotel => ({
// code here
}));
You can get the hotel name with:
hotel.querySelector('div[data-testid="listing-card-title"]').textContent
The core idea here is chaining query selectors. For the first hotel, you can locate the element with:
document.querySelectorAll('div[data-testid="card-container"]')[0].querySelector('div[data-testid="listing-card-title"]').textContent
The image URL of each hotel can be located with:
hotel.querySelector("img").getAttribute("src")
Each `Hotel` object will have the following shape:
Hotel = {
Name: 'x',
Photo: 'y'
}
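The shape of that mapping can be sketched outside the browser too. Here, plain objects stand in for the DOM elements; the sample names and URLs are made up for illustration:

```javascript
// Hypothetical records standing in for values read from each card element
const cards = [
  { title: 'Seaside Cabin', imgSrc: 'https://example.com/cabin.jpg' },
  { title: 'City Loft', imgSrc: 'https://example.com/loft.jpg' },
];

// Build the same Name/Photo shape the scraper constructs inside evaluate()
const hotels = cards.map(card => ({
  Name: card.title,
  Photo: card.imgSrc,
}));

console.log(hotels);
```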
Below is the complete script. Save it as `bnb.js`.
const puppeteer = require("puppeteer");
(async () => {
const url = "https://www.airbnb.com/s/homes?refinement_paths%5B%5D=%2Fhomes&search_type=section_navigation&property_type_id%5B%5D=8";
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto(url);
const data = await page.evaluate(() => {
const root = Array.from(document.querySelectorAll('div[data-testid="card-container"]'));
const hotels = root.map(hotel => ({
Name: hotel.querySelector('div[data-testid="listing-card-title"]').textContent,
Photo: hotel.querySelector("img").getAttribute("src")
}));
return hotels;
});
console.log(data);
await browser.close();
})();
Run this using:
node bnb.js
An array of JSON objects is printed to the console.
Summary
This guide explored various examples of web scraping with Puppeteer, starting with extracting a single element and progressing to fetching hotel listings. For deeper details, see the official Puppeteer documentation.
To streamline your web scraping tasks, check out ProxyTee, which provides Unlimited Residential Proxies with rotating IPs and vast IP address pools. ProxyTee offers an affordable, reliable, and easy-to-use solution for anyone needing rotating residential proxies. Its features, such as unlimited bandwidth, a global IP pool, multiple protocol support, auto-rotation, and simple API integration, make it a great option for both businesses and individuals involved in tasks like web scraping and data gathering. Check the pricing page for competitive plans. With a focus on user-friendly design and a simple, clean GUI, ProxyTee delivers great value to anyone looking for an effective proxy service.