Web Scraping in Java with Jsoup: A Step-by-Step Guide with ProxyTee
Web scraping is an essential technique for gathering data from the internet. Developers across industries use web scraping for data analysis, price monitoring, market research, and more. In this guide, we’ll explore how to perform web scraping using Jsoup, a powerful Java library, and how ProxyTee can enhance your scraping projects with reliable and efficient proxy services. This combination ensures your data collection process is fast, secure, and anonymous.
What is Jsoup?
Jsoup is a Java library designed for working with real-world HTML. It allows you to parse HTML from local files or URLs and provides a simple API for extracting and manipulating data using the Document Object Model (DOM). With its jQuery-like syntax, developers can easily traverse and manipulate HTML elements.
Key Features of Jsoup:
- Parses and cleans real-world HTML
- Supports CSS and jQuery-like selectors
- Fetches URLs and parses responses into a DOM tree
- Provides a clean API for navigating, manipulating, and extracting data
Whether you are a seasoned developer or new to web scraping, Jsoup’s intuitive design makes it easy to extract and process data.
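For instance, a minimal sketch of parsing HTML from a string and querying it looks like this (the markup here is invented for illustration):
// parsing a tiny, made-up HTML snippet from a string
Document doc = Jsoup.parse("<html><head><title>Demo</title></head><body><p class=\"greeting\">Hello, Jsoup!</p></body></html>");
// navigating the DOM tree with jQuery-like selectors
System.out.println(doc.title());                    // Demo
System.out.println(doc.select(".greeting").text()); // Hello, Jsoup!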
Why use ProxyTee for Web Scraping?
While Jsoup provides robust parsing capabilities, using a reliable proxy service like ProxyTee is crucial for avoiding IP blocks and CAPTCHAs. ProxyTee offers several features designed to support web scraping tasks effectively:
- Unlimited Bandwidth: ProxyTee ensures that you can scrape as much data as you need without concerns about overages.
- Global IP Coverage: With over 20 million IP addresses from over 100 countries, you can access geo-restricted content easily.
- Auto-Rotation: ProxyTee’s auto-rotation feature helps prevent IP bans by automatically changing your IP address at customizable intervals from 3 to 60 minutes.
- Multiple Protocol Support: Compatibility with both HTTP and SOCKS5 protocols provides the flexibility to integrate with various web scraping tools and scripts.
- API Integration: An API that allows for automation of tasks and integration with other tools.
- Affordable and Reliable: Compared to other services, ProxyTee offers high-value solutions that balance performance and cost, especially its Unlimited Residential Proxies product, which is up to 50% cheaper than the competition.
When combined, Jsoup for data extraction and ProxyTee for network handling can take your scraping efforts to the next level.
Prerequisites
Before getting into coding, ensure you have the following installed:
- Java >= 8: Any Java version from 8 onward will work. This guide is based on Java 17, the current Long-Term Support (LTS) version.
- Maven or Gradle: Use any Java build automation tool you’re familiar with for dependency management.
- Java IDE: An Integrated Development Environment (IDE) that supports Java with Maven or Gradle, like IntelliJ IDEA, is beneficial.
Install each component, if necessary, and confirm everything is set up correctly to avoid common issues later.
Verifying Your Setup
To ensure everything is installed correctly, run the following commands in your terminal:
java -version
mvn -v (for Maven)
gradle -v (for Gradle)
How To Build a Web Scraper Using Jsoup
We’ll create a scraper that extracts quotes from the Quotes to Scrape website (https://quotes.toscrape.com), a test site designed for learning web scraping.
Step 1: Set up a Java Project
Launch IntelliJ IDEA, create a new Java project with your chosen language version and build tool, and give it a name. Once created, the IDE will automatically generate a skeleton Java project.
Step 2: Install Jsoup
Add Jsoup to your project’s dependencies. If you use Maven, insert the dependency in your pom.xml file; for Gradle, use the build.gradle file. Then reload the project (for example, with the Maven reload button in your IDE) so the new dependency is downloaded and ready to import.
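For reference, the Maven dependency looks like this; the version shown is only an example, so check Maven Central for the latest release:
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.17.2</version>
</dependency>
The Gradle equivalent, in the dependencies block of build.gradle, is:
implementation 'org.jsoup:jsoup:1.17.2'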
Step 3: Connect to your target web page
Use the following Java code to establish a connection to your target website:
Document doc = Jsoup.connect("https://quotes.toscrape.com/")
.userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36")
.get();
Note: Set a valid User-Agent header, as it may help you avoid basic bot detection mechanisms.
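To route these requests through ProxyTee, you can point Jsoup at your proxy endpoint with its proxy() method. The host and port below are placeholders, not actual ProxyTee values; use the endpoint from your ProxyTee dashboard:
// hypothetical endpoint: replace the host and port with the
// values from your ProxyTee dashboard
Document doc = Jsoup.connect("https://quotes.toscrape.com/")
        .proxy("proxy.example.com", 8080)
        .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36")
        .get();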
Step 4: Inspect the HTML Page
Use your browser's developer tools (usually by right-clicking an element and selecting 'Inspect' or 'Inspect Element') to understand the HTML structure. In this case, each quote sits in a <div class="quote"> element that contains the quote text inside a <span> tag, the author inside a <small> tag, and the list of tags inside a nested <div>.
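A simplified sketch of one quote's markup follows (attributes trimmed for readability; inspect the live page for the full version):
<div class="quote">
    <span class="text">"The world as we have created it is..."</span>
    <span>by <small class="author">Albert Einstein</small></span>
    <div class="tags">
        Tags:
        <a class="tag" href="...">change</a>
        <!-- more tag links... -->
    </div>
</div>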
Step 5: Select HTML Elements with Jsoup
The Jsoup Document class offers several selection methods, such as getElementsByTag(), getElementsByClass(), getElementById(), and select(). The select() method is particularly powerful because it accepts CSS selectors. Using CSS selectors that match the elements containing the desired content, such as .text, .author, and .tags .tag, lets you grab all of them at once, keeping your code clear, readable, and less prone to unexpected problems.
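As a quick sketch (reusing the doc object from Step 3), the following calls all target the quote containers described above:
// CSS selector: all elements with class "quote"
Elements quotesByCss = doc.select(".quote");
// class-name lookup: same result for this page
Elements quotesByClass = doc.getElementsByClass("quote");
// first match only; returns null if nothing matches
Element firstQuote = doc.selectFirst(".quote");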
Step 6: Extract Data from a Web Page with Jsoup
First, create a Quote.java file to store the extracted data. A minimal sketch of this data class might look as follows; its fields and accessors simply mirror the getter and setter calls used in the rest of this guide:
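public class Quote {
    // the text of the quote
    private String text;
    // the author of the quote
    private String author;
    // the quote's tags, joined into a single string
    private String tags;

    public String getText() { return text; }
    public void setText(String text) { this.text = text; }
    public String getAuthor() { return author; }
    public void setAuthor(String author) { this.author = author; }
    public String getTags() { return tags; }
    public void setTags(String tags) { this.tags = tags; }
}
Then, iterate over every quote element retrieved from the page, applying the proper CSS selectors as shown below: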
// initializing the list of Quote data objects
// that will contain the scraped data
List<Quote> quotes = new ArrayList<>();
// selecting all quote HTML elements
Elements quoteElements = doc.select(".quote");
// iterating over the quoteElements list of HTML quotes
for (Element quoteElement : quoteElements) {
    // initializing a quote data object
    Quote quote = new Quote();
    // extracting the text of the quote and removing the
    // special quote characters
    String text = quoteElement.select(".text").first().text()
            .replace("“", "")
            .replace("”", "");
    String author = quoteElement.select(".author").first().text();
    // initializing the list of tags
    List<String> tags = new ArrayList<>();
    // iterating over the list of tags
    for (Element tag : quoteElement.select(".tag")) {
        // adding the tag string to the list of tags
        tags.add(tag.text());
    }
    // storing the scraped data in the Quote object
    quote.setText(text);
    quote.setAuthor(author);
    quote.setTags(String.join(", ", tags)); // merging the tags into a "A, B, ..., Z" string
    // adding the Quote object to the list of the scraped quotes
    quotes.add(quote);
}
Step 7: How to Crawl the Entire Website with Jsoup
Crawl the target website by following its pagination. Look for the .next element, which contains the link to the next page. Be sure to scrape each page before following that link, so the first page is not skipped, as shown below:
// the URL of the target website's home page
String baseUrl = "https://quotes.toscrape.com";
// initializing the list of Quote data objects
// that will contain the scraped data
List<Quote> quotes = new ArrayList<>();
// retrieving the home page...
// scraping the current page, then following the
// "Next →" link until there is no next page
while (true) {
    // scraping logic on the current page...
    // looking for the "Next →" HTML element
    Elements nextElements = doc.select(".next");
    // stopping when there is no next page to scrape
    if (nextElements.isEmpty()) {
        break;
    }
    // extracting the relative URL of the next page
    String relativeUrl = nextElements.first().getElementsByTag("a").first().attr("href");
    // building the complete URL of the next page
    String completeUrl = baseUrl + relativeUrl;
    // connecting to the next page
    doc = Jsoup
            .connect(completeUrl)
            .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36")
            .get();
}
Step 8: Export Scraped Data to CSV
Use the code below to save the scraped data to a CSV file.
// initializing the output CSV file
File csvFile = new File("output.csv");
// using try-with-resources to automatically close the writer
// when the writing process ends
try (PrintWriter printWriter = new PrintWriter(csvFile)) {
    // iterating over all quotes
    for (Quote quote : quotes) {
        // converting the quote data into a
        // list of strings
        List<String> row = new ArrayList<>();
        // escaping double quotes inside each field and wrapping
        // the field in quotes to keep the CSV valid
        row.add("\"" + quote.getText().replace("\"", "\"\"") + "\"");
        row.add("\"" + quote.getAuthor().replace("\"", "\"\"") + "\"");
        row.add("\"" + quote.getTags().replace("\"", "\"\"") + "\"");
        // printing a CSV line
        printWriter.println(String.join(",", row));
    }
}
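For reference, assuming the scrape succeeds, the first row of output.csv should look something like this (the first quote on the target site):
"The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.","Albert Einstein","change, deep-thoughts, thinking, world"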
Putting it all Together
The full Jsoup web scraper code would look as follows:
package com.proxytee;

import org.jsoup.*;
import org.jsoup.nodes.*;
import org.jsoup.select.Elements;

import java.io.File;
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class Main {
    public static void main(String[] args) throws IOException {
        // the URL of the target website's home page
        String baseUrl = "https://quotes.toscrape.com";
        // initializing the list of Quote data objects
        // that will contain the scraped data
        List<Quote> quotes = new ArrayList<>();
        // downloading the target website with an HTTP GET request
        Document doc = Jsoup
                .connect(baseUrl)
                .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36")
                .get();
        // scraping the current page, then following the
        // "Next →" link until there is no next page
        while (true) {
            // selecting all quote HTML elements
            Elements quoteElements = doc.select(".quote");
            // iterating over the quoteElements list of HTML quotes
            for (Element quoteElement : quoteElements) {
                // initializing a quote data object
                Quote quote = new Quote();
                // extracting the text of the quote and removing the
                // special quote characters
                String text = quoteElement.select(".text").first().text()
                        .replace("“", "")
                        .replace("”", "");
                String author = quoteElement.select(".author").first().text();
                // initializing the list of tags
                List<String> tags = new ArrayList<>();
                // iterating over the list of tags
                for (Element tag : quoteElement.select(".tag")) {
                    // adding the tag string to the list of tags
                    tags.add(tag.text());
                }
                // storing the scraped data in the Quote object
                quote.setText(text);
                quote.setAuthor(author);
                quote.setTags(String.join(", ", tags)); // merging the tags into a "A, B, ..., Z" string
                // adding the Quote object to the list of the scraped quotes
                quotes.add(quote);
            }
            // looking for the "Next →" HTML element
            Elements nextElements = doc.select(".next");
            // stopping when there is no next page to scrape
            if (nextElements.isEmpty()) {
                break;
            }
            // extracting the relative URL of the next page
            String relativeUrl = nextElements.first().getElementsByTag("a").first().attr("href");
            // building the complete URL of the next page
            String completeUrl = baseUrl + relativeUrl;
            // connecting to the next page
            doc = Jsoup
                    .connect(completeUrl)
                    .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36")
                    .get();
        }
        // initializing the output CSV file
        File csvFile = new File("output.csv");
        // using try-with-resources to automatically close the writer
        // when the writing process ends
        try (PrintWriter printWriter = new PrintWriter(csvFile, StandardCharsets.UTF_8)) {
            // writing the UTF-8 BOM so spreadsheet tools
            // detect the encoding correctly
            printWriter.write('\ufeff');
            // iterating over all quotes
            for (Quote quote : quotes) {
                // converting the quote data into a
                // list of strings
                List<String> row = new ArrayList<>();
                // escaping double quotes inside each field and wrapping
                // the field in quotes to keep the CSV valid
                row.add("\"" + quote.getText().replace("\"", "\"\"") + "\"");
                row.add("\"" + quote.getAuthor().replace("\"", "\"\"") + "\"");
                row.add("\"" + quote.getTags().replace("\"", "\"\"") + "\"");
                // printing a CSV line
                printWriter.println(String.join(",", row));
            }
        }
    }
}
Running this script in your IDE creates an output.csv file containing the scraped quotes.
Conclusion
In this guide, we explored how to build a web scraper using Jsoup and enhance it with ProxyTee’s proxy services. Jsoup makes HTML parsing and data extraction straightforward, while ProxyTee ensures your scraping activities remain reliable, anonymous, and secure.
Ready to take your web scraping to the next level? Explore ProxyTee’s solutions today and gain access to unlimited bandwidth, global IP coverage, and more.
For further assistance or custom solutions, feel free to contact us!