Learn Web Scraping with Jsoup

Web Scraping with Jsoup is an essential skill for Java developers who need to extract, parse, and manipulate data from websites efficiently. Whether you’re building a data aggregation tool, automating content collection, or analyzing web trends, Jsoup provides a robust and intuitive API to handle HTML documents seamlessly. In this guide, you’ll learn how to leverage Jsoup’s powerful features, integrate proxies for scalable scraping, and handle real-world challenges like pagination and dynamic content. By the end, you’ll have practical code examples and a deep understanding of how to implement Web Scraping with Jsoup in your projects.
What is Jsoup?
Jsoup is a Java library designed for working with real-world HTML. It provides a very convenient API for fetching URLs and extracting and manipulating data using the Document Object Model (DOM), CSS selectors, and jQuery-like syntax. Developers use it for tasks like parsing HTML content, cleaning malformed markup, and traversing data structures with ease.
Its versatility and reliability make it a favorite for backend data tasks and server-side scraping. Unlike browser-based tools, Jsoup performs server-side parsing, making it lightweight, fast, and ideal for integration into larger Java applications.
Key Features of Jsoup:
- Parses and cleans real-world HTML
- Supports CSS and jQuery-like selectors
- Fetches URLs and parses responses into a DOM tree
- Provides a clean API for navigating, manipulating, and extracting data
Whether you are a seasoned developer or new to web scraping, Jsoup’s intuitive design makes it easy to extract and process data.
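As a quick illustration of that API, the sketch below parses an HTML snippet from a plain string and extracts text with a CSS selector, with no network access involved (the snippet and class name here are made up for the example):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupHello {
    public static void main(String[] args) {
        // parse raw HTML into a Document (no network access needed)
        String html = "<html><body><h1 class='title'>Hello, Jsoup!</h1></body></html>";
        Document doc = Jsoup.parse(html);

        // extract the heading text with a CSS selector
        String title = doc.select("h1.title").first().text();
        System.out.println(title); // prints "Hello, Jsoup!"
    }
}
```

The same `select()` call works identically whether the `Document` came from a string, a file, or a fetched URL.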
Why Use Jsoup for Web Scraping?
- Simplicity: A single line of code can fetch and parse an entire webpage.
- Robust Parsing: Jsoup handles bad HTML gracefully.
- Powerful Selectors: Its CSS-based selectors make data extraction intuitive.
- Integration Friendly: It fits easily into any Java application or microservice architecture.
However, as websites become smarter and more protective against scraping, raw HTTP requests are often blocked. This is where using a reliable proxy solution like ProxyTee becomes critical.
Why Use a Proxy for Web Scraping with Jsoup?
High-performance proxies can support intensive scraping tasks without interruptions, and one of the most valuable options is unlimited residential proxy access. These proxies are ideal for high-anonymity scraping, making Jsoup requests appear as genuine traffic from real users.
With this approach, developers gain access to:
- Residential IP addresses from real devices
- Unlimited bandwidth for unrestricted data collection
- High legitimacy and trust scores in the eyes of target websites
This leads to fewer blocks, faster data gathering, and more reliable results for large-scale scraping operations.
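Jsoup itself does not manage proxy pools, but its `Connection` API can route a single request through a proxy via `proxy(host, port)`. The sketch below uses a hypothetical endpoint (`proxy.example.com:8080`); substitute your provider's real host, port, and authentication scheme:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.io.IOException;

public class ProxyExample {
    public static void main(String[] args) throws IOException {
        // route the request through a proxy; host and port are
        // placeholders -- use your provider's real endpoint
        Document doc = Jsoup.connect("https://quotes.toscrape.com/")
                .proxy("proxy.example.com", 8080) // hypothetical proxy endpoint
                .userAgent("Mozilla/5.0")
                .get();

        System.out.println(doc.title());
    }
}
```

For rotating residential proxies, providers typically expose a single gateway endpoint that rotates IPs behind the scenes, so this one-line change is often all the integration needed.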
Prerequisites
Before getting into coding, ensure you have the following installed:
- Java >= 8: Any version of Java equal to or higher than version 8 is sufficient. This guide is based on Java 17, the current Long-Term Support (LTS) version.
- Maven or Gradle: Use any Java build automation tool you’re familiar with for dependency management.
- Java IDE: An Integrated Development Environment (IDE) that supports Java with Maven or Gradle, like IntelliJ IDEA, is beneficial.
Follow the links to install each component, if necessary, ensuring everything is set up correctly to avoid common issues.
Verifying Your Setup
To ensure everything is installed correctly, run the following commands in your terminal:
java -version
mvn -v (for Maven)
gradle -v (for Gradle)
How To Build a Web Scraper Using Jsoup
We’ll create a scraper that extracts quotes from the Quotes to Scrape website (https://quotes.toscrape.com), a test site designed for learning web scraping.
1️⃣ Step 1: Set up a Java Project
Launch IntelliJ IDEA, create a new Java project with the correct language and build tool, then name your project accordingly. Once set, your IDE will automatically set up a skeleton Java project.
2️⃣ Step 2: Install Jsoup
Add Jsoup to your project’s dependencies. If you use Maven, add the dependency to your pom.xml file; for Gradle, add it to build.gradle. Then reload the project (for example, via the Maven reload button in your IDE) so the new dependency is downloaded and ready to import.
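For reference, the coordinates are `org.jsoup:jsoup`. The Maven snippet below uses 1.17.2 as an example version; check Maven Central for the latest release:

```xml
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <!-- example version; check Maven Central for the latest -->
    <version>1.17.2</version>
</dependency>
```

The Gradle equivalent is a single line in the `dependencies` block: `implementation 'org.jsoup:jsoup:1.17.2'`.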
3️⃣ Step 3: Connect to your target web page
Use the following Java code to establish a connection to your target website:
```java
Document doc = Jsoup.connect("https://quotes.toscrape.com/")
        .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36")
        .get();
```
Note: Set a valid User-Agent header, as it may help you avoid basic bot detection mechanisms.
4️⃣ Step 4: Inspect the HTML Page
Use your browser’s developer tools (usually by right-clicking on an element and selecting ‘Inspect’ or ‘Inspect element’) to understand the HTML structure. In this case, each quote lives in a `<div class="quote">` element that contains the quote text inside a `<span>` element, the author inside a `<small>` element, and the list of tags inside a nested `<div>`.
5️⃣ Step 5: Select HTML Elements with Jsoup
The Jsoup `Document` class offers several selection methods, such as `getElementsByTag()`, `getElementsByClass()`, and `getElementById()`. The `select()` method is particularly powerful because it accepts CSS selectors. By applying CSS selectors such as `.text`, `.author`, and `.tags .tag` to the elements containing the desired content, you can select all of them easily, keeping your code clear, readable, and less prone to unexpected problems.
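You can try these selectors without hitting the network by running them against a small inline copy of the markup. The HTML string below mirrors the structure of quotes.toscrape.com:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SelectorDemo {
    public static void main(String[] args) {
        // a minimal snippet mirroring the quote markup of the target site
        String html = "<div class='quote'>"
                + "<span class='text'>Be yourself.</span>"
                + "<small class='author'>Oscar Wilde</small>"
                + "<div class='tags'><a class='tag'>life</a><a class='tag'>humor</a></div>"
                + "</div>";
        Document doc = Jsoup.parse(html);

        Element quote = doc.select(".quote").first();
        System.out.println(quote.select(".text").first().text());   // the quote text
        System.out.println(quote.select(".author").first().text()); // the author
        System.out.println(quote.select(".tags .tag").size());      // the number of tags
    }
}
```

Prototyping selectors against a static snippet like this is a cheap way to validate extraction logic before running it against the live site.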
6️⃣ Step 6: Extract Data from a Web Page with Jsoup
First, create a Quote.java file to store the extracted data. Then iterate over every quote element retrieved from the page, applying the appropriate CSS selectors as shown below:
```java
// initializing the list of Quote data objects
// that will contain the scraped data
List<Quote> quotes = new ArrayList<>();

// selecting all quote HTML elements
Elements quoteElements = doc.select(".quote");

// iterating over the list of quote HTML elements
for (Element quoteElement : quoteElements) {
    // initializing a quote data object
    Quote quote = new Quote();

    // extracting the text of the quote and removing the
    // curly quote characters
    String text = quoteElement.select(".text").first().text()
            .replace("“", "")
            .replace("”", "");
    String author = quoteElement.select(".author").first().text();

    // initializing the list of tags
    List<String> tags = new ArrayList<>();
    // iterating over the list of tags
    for (Element tag : quoteElement.select(".tag")) {
        // adding the tag string to the list of tags
        tags.add(tag.text());
    }

    // storing the scraped data in the Quote object
    quote.setText(text);
    quote.setAuthor(author);
    // merging the tags into an "A, B, ..., Z" string
    quote.setTags(String.join(", ", tags));

    // adding the Quote object to the list of the scraped quotes
    quotes.add(quote);
}
```
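The tutorial assumes a Quote data class with the accessors used above. A minimal version might look like the sketch below; only the getter and setter names are fixed by the scraping code, the rest is one plausible layout:

```java
// Quote.java - a simple data object holding one scraped quote
public class Quote {
    private String text;
    private String author;
    private String tags;

    public String getText() { return text; }
    public void setText(String text) { this.text = text; }

    public String getAuthor() { return author; }
    public void setAuthor(String author) { this.author = author; }

    public String getTags() { return tags; }
    public void setTags(String tags) { this.tags = tags; }
}
```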
7️⃣ Step 7: How to Crawl the Entire Website with Jsoup
Crawl the target website by following its pagination links. Look for the `.next` element, which contains the URL of the next page, as shown below:
```java
// the URL of the target website's home page
String baseUrl = "https://quotes.toscrape.com";

// initializing the list of Quote data objects
// that will contain the scraped data
List<Quote> quotes = new ArrayList<>();

// retrieving the home page...

// looking for the "Next →" HTML element
Elements nextElements = doc.select(".next");

// while there is a next page to scrape
while (!nextElements.isEmpty()) {
    // getting the "Next →" HTML element
    Element nextElement = nextElements.first();
    // extracting the relative URL of the next page
    String relativeUrl = nextElement.getElementsByTag("a").first().attr("href");
    // building the complete URL of the next page
    String completeUrl = baseUrl + relativeUrl;

    // connecting to the next page
    doc = Jsoup
            .connect(completeUrl)
            .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36")
            .get();

    // scraping logic...

    // looking for the "Next →" HTML element in the new page
    nextElements = doc.select(".next");
}
```
8️⃣ Step 8: Export Scraped Data to CSV
Use the code below to save the scraped data to a CSV file.
```java
// initializing the output CSV file
File csvFile = new File("output.csv");

// using try-with-resources so the writer is
// released when the writing process ends
try (PrintWriter printWriter = new PrintWriter(csvFile)) {
    // iterating over all quotes
    for (Quote quote : quotes) {
        // converting the quote data into a list of strings
        List<String> row = new ArrayList<>();
        // wrapping each field in double quotes
        // to keep the CSV format consistent
        row.add("\"" + quote.getText() + "\"");
        row.add("\"" + quote.getAuthor() + "\"");
        row.add("\"" + quote.getTags() + "\"");
        // printing a CSV line
        printWriter.println(String.join(",", row));
    }
}
```
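One caveat with this approach: a field that itself contains a double quote would break the CSV row. Standard CSV escaping (RFC 4180) doubles embedded quotes; a small helper along these lines (my own addition, not part of the tutorial code) handles that:

```java
public class CsvEscape {
    // wrap a field in double quotes, doubling any embedded
    // quote characters as required by RFC 4180
    static String escape(String field) {
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    public static void main(String[] args) {
        System.out.println(escape("He said \"hi\""));
        // prints: "He said ""hi"""
    }
}
```

You could then build each row with `row.add(escape(quote.getText()))` instead of concatenating quotes by hand.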
Putting it all Together
The full Jsoup web scraper code would look as follows:
```java
import org.jsoup.*;
import org.jsoup.nodes.*;
import org.jsoup.select.Elements;

import java.io.File;
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class Main {
    public static void main(String[] args) throws IOException {
        // the URL of the target website's home page
        String baseUrl = "https://quotes.toscrape.com";

        // initializing the list of Quote data objects
        // that will contain the scraped data
        List<Quote> quotes = new ArrayList<>();

        // downloading the target website with an HTTP GET request
        Document doc = Jsoup
                .connect(baseUrl)
                .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36")
                .get();

        while (true) {
            // selecting all quote HTML elements on the current page
            // (the current page is scraped before following pagination,
            // so the first page is not skipped)
            Elements quoteElements = doc.select(".quote");

            // iterating over the list of quote HTML elements
            for (Element quoteElement : quoteElements) {
                // initializing a quote data object
                Quote quote = new Quote();

                // extracting the text of the quote and the author
                String text = quoteElement.select(".text").first().text();
                String author = quoteElement.select(".author").first().text();

                // initializing the list of tags
                List<String> tags = new ArrayList<>();
                // iterating over the list of tags
                for (Element tag : quoteElement.select(".tag")) {
                    // adding the tag string to the list of tags
                    tags.add(tag.text());
                }

                // storing the scraped data in the Quote object
                quote.setText(text);
                quote.setAuthor(author);
                // merging the tags into an "A, B, ..., Z" string
                quote.setTags(String.join(", ", tags));

                // adding the Quote object to the list of the scraped quotes
                quotes.add(quote);
            }

            // looking for the "Next →" HTML element
            Elements nextElements = doc.select(".next");
            // stopping when there is no next page to scrape
            if (nextElements.isEmpty()) {
                break;
            }

            // extracting the relative URL of the next page
            String relativeUrl = nextElements.first().getElementsByTag("a").first().attr("href");

            // connecting to the next page
            doc = Jsoup
                    .connect(baseUrl + relativeUrl)
                    .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36")
                    .get();
        }

        // initializing the output CSV file
        File csvFile = new File("output.csv");

        // using try-with-resources so the writer is
        // released when the writing process ends
        try (PrintWriter printWriter = new PrintWriter(csvFile, StandardCharsets.UTF_8)) {
            // writing a BOM so spreadsheet tools detect UTF-8
            printWriter.write('\ufeff');
            // iterating over all quotes
            for (Quote quote : quotes) {
                // converting the quote data into a list of strings
                List<String> row = new ArrayList<>();
                // wrapping each field in double quotes
                // to keep the CSV format consistent
                row.add("\"" + quote.getText() + "\"");
                row.add("\"" + quote.getAuthor() + "\"");
                row.add("\"" + quote.getTags() + "\"");
                // printing a CSV line
                printWriter.println(String.join(",", row));
            }
        }
    }
}
```
Running this program in your IDE creates an output.csv file containing the scraped quotes.
Next steps for scalable pipelines
You now have a reusable scraper that fetches, parses, paginates, and exports data. Extend it by adding a queue-based scheduler and a small service that serves the latest records. Continue practicing on Quotes to Scrape, then adapt the same skeleton to other public training sites. Keep selectors in one place, keep export formats stable, and rehearse failure scenarios. As you integrate more targets, you will come to rely on Jsoup’s consistent feature set. Web scraping with Jsoup keeps your stack small, which makes it an effective tool for Java developers on any team. When volume grows, revisit how proxies can help with request distribution and access control across regions.