    Tutorial

    Learn Web Scraping with Jsoup: Hands-On Tutorial

    February 6, 2025 Mike

    In today’s data-driven ecosystem, having reliable tools and infrastructure for extracting information from websites is essential. Whether you are building a search engine, monitoring prices, tracking content changes, or conducting market analysis, web scraping offers unmatched utility. One highly effective tool for Java developers is Jsoup. This blog explores the technical nuances and practical usage of web scraping with Jsoup while also discussing how ProxyTee empowers scraping efforts through robust proxy support.


    What is Jsoup?

    Jsoup is a Java library designed for working with real-world HTML. It provides a very convenient API for fetching URLs and extracting and manipulating data using the Document Object Model (DOM), CSS selectors, and jQuery-like syntax. Developers use it for tasks like parsing HTML content, cleaning malformed markup, and traversing data structures with ease.

    Its versatility and reliability make it a favorite for backend data tasks and server-side scraping. Unlike browser-based tools, Jsoup performs server-side parsing, making it lightweight, fast, and ideal for integration into larger Java applications.

    Key Features of Jsoup:

    • Parses and cleans real-world HTML
    • Supports CSS and jQuery-like selectors
    • Fetches URLs and parses responses into a DOM tree
    • Provides a clean API for navigating, manipulating, and extracting data

    Whether you are a seasoned developer or new to web scraping, Jsoup’s intuitive design makes it easy to extract and process data.
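    As a quick illustration of the "parses and cleans real-world HTML" point above, here is a minimal sketch that feeds Jsoup deliberately malformed markup (unclosed <p> tags) and lets it normalize the document:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class CleanHtmlDemo {
    public static void main(String[] args) {
        // deliberately malformed markup: neither <p> tag is closed
        Document doc = Jsoup.parse("<p>First<p>Second");

        // Jsoup closes the tags and builds a well-formed DOM tree
        System.out.println(doc.select("p").size()); // prints 2
        System.out.println(doc.body().html());
    }
}
```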


    Why Use Jsoup for Web Scraping?

    • Simplicity: A single line of code can fetch and parse an entire webpage.
    • Robust Parsing: Jsoup handles bad HTML gracefully.
    • Powerful Selectors: Its CSS-based selectors make data extraction intuitive.
    • Integration Friendly: It fits easily into any Java application or microservice architecture.

    However, as websites become smarter and more protective against scraping, raw HTTP requests are often blocked. This is where using a reliable proxy solution like ProxyTee becomes critical.


    Why use ProxyTee for Web Scraping with Jsoup?

    ProxyTee offers high-performance proxies that support intensive scraping tasks without interruptions. One of the most valuable offerings is unlimited residential proxy access. These proxies are ideal for high-anonymity scraping, making your Jsoup requests look like genuine traffic from real users.

    Using ProxyTee, developers gain access to:

    • Residential IP addresses from real devices
    • Unlimited bandwidth for unrestricted data collection
    • High legitimacy and trust scores in the eyes of target websites

    This translates to fewer blocks, faster data gathering, and more reliable results.
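    Jsoup can route its requests through an HTTP proxy directly on the Connection object via the proxy() method. The sketch below shows the idea; the gateway host and port are placeholders, not real ProxyTee endpoints, so substitute the values from your own dashboard:

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;

public class ProxyConnectDemo {
    public static void main(String[] args) {
        // placeholder gateway address; replace with your provider's host and port
        Connection conn = Jsoup.connect("https://quotes.toscrape.com/")
                .proxy("proxy.example.com", 8080)
                .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64)");

        // nothing is sent until get()/post() is called, so the configured
        // proxy can be inspected before making the request
        System.out.println(conn.request().proxy());

        // Document doc = conn.get(); // would send the request through the proxy
    }
}
```

    If your plan requires username/password authentication, the standard approach on the JVM is java.net.Authenticator; check your provider's documentation for the exact setup.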


    Prerequisites

    Before getting into coding, ensure you have the following installed:

    • Java >= 8: Any version of Java equal to or higher than version 8 will work for the core steps. This guide is based on Java 17, a Long-Term Support (LTS) release; note that the final code listing uses a PrintWriter constructor that accepts a Charset, which requires Java 10 or later.
    • Maven or Gradle: Use any Java build automation tool you’re familiar with for dependency management.
    • Java IDE: An Integrated Development Environment (IDE) that supports Java with Maven or Gradle, like IntelliJ IDEA, is beneficial.

    Install each component, if necessary, and verify your setup before proceeding to avoid common issues.

    Verifying Your Setup

    To ensure everything is installed correctly, run the following commands in your terminal:

    • java -version
    • mvn -v (for Maven)
    • gradle -v (for Gradle)

    How To Build a Web Scraper Using Jsoup

    We’ll create a scraper that extracts quotes from the Quotes to Scrape website (https://quotes.toscrape.com), a test site designed for learning web scraping.

    1️⃣ Step 1: Set up a Java Project

    Launch IntelliJ IDEA, create a new Java project with the correct language and build tool, then name your project accordingly. Once set, your IDE will automatically set up a skeleton Java project.

    2️⃣ Step 2: Install Jsoup

    Add Jsoup to your project’s dependencies. If you are using Maven, insert the dependency in your pom.xml file; for Gradle, use the build.gradle file. Then reload the project (for example, via the Maven reload button in IntelliJ IDEA) so the dependency is downloaded and ready to import.
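    For reference, Jsoup's Maven coordinates are org.jsoup:jsoup. A typical pom.xml entry looks like the following (1.17.2 is shown as an example; check Maven Central for the latest version):

```xml
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.17.2</version>
</dependency>
```

    For Gradle, the equivalent line in build.gradle is implementation 'org.jsoup:jsoup:1.17.2'.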

    3️⃣ Step 3: Connect to your target web page

    Use the following Java code to establish a connection to your target website:

    Document doc = Jsoup.connect("https://quotes.toscrape.com/")
        .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36")
        .get();
    

    Note: Setting a realistic User-Agent header may help you avoid basic bot detection mechanisms.

    4️⃣ Step 4: Inspect the HTML Page

    Use your browser’s developer tools (usually by right-clicking on an element and selecting ‘Inspect’ or ‘Inspect element’) to understand the HTML structure. In this case, each quote is in a <div class="quote"> tag that includes the quote itself inside a span tag, the author inside a <small> tag, and the list of the tags inside a nested div.

    5️⃣ Step 5: Select HTML Elements with Jsoup

    The Jsoup Document class offers several selection methods, such as getElementsByTag(), getElementsByClass(), getElementById(), and select(), the last of which is particularly powerful because it accepts CSS selectors. By using CSS selectors that target the desired content, like .text, .author, and .tags .tag, you can grab all the relevant elements at once, keeping your code clear, readable, and less prone to unexpected breakage.
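    As a self-contained illustration, the following sketch runs those selectors against a hard-coded HTML fragment that mirrors the structure of the target page (no network request involved):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class SelectorDemo {
    public static void main(String[] args) {
        // a hard-coded fragment mirroring the quote structure of the target site
        String html = "<div class=\"quote\">"
                + "<span class=\"text\">An example quote.</span>"
                + "<small class=\"author\">Jane Doe</small>"
                + "<div class=\"tags\"><a class=\"tag\">example</a></div>"
                + "</div>";

        Document doc = Jsoup.parse(html);

        System.out.println(doc.select(".text").first().text());      // An example quote.
        System.out.println(doc.select(".author").first().text());    // Jane Doe
        System.out.println(doc.select(".tags .tag").first().text()); // example
    }
}
```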

    6️⃣ Step 6: Extract Data from a Web Page with Jsoup

    First, create a Quote.java file to store the extracted data. Then iterate over each quote element retrieved from the page, applying the proper CSS selectors as shown below:

    // initializing the list of Quote data objects
    // that will contain the scraped data
    List<Quote> quotes = new ArrayList<>();
    
    // retrieving the list of product HTML elements
    // selecting all quote HTML elements
    Elements quoteElements = doc.select(".quote");
    
    // iterating over the quoteElements list of HTML quotes
    for (Element quoteElement : quoteElements) {
        // initializing a quote data object
        Quote quote = new Quote();
    
        // extracting the text of the quote and removing the 
        // special characters
        String text = quoteElement.select(".text").first().text()
            .replace("“", "")
            .replace("”", "");
        String author = quoteElement.select(".author").first().text();
    
        // initializing the list of tags
        List<String> tags = new ArrayList<>();
    
        // iterating over the list of tags
        for (Element tag : quoteElement.select(".tag")) {
            // adding the tag string to the list of tags
            tags.add(tag.text());
        }
    
        // storing the scraped data in the Quote object
        quote.setText(text);
        quote.setAuthor(author);
        quote.setTags(String.join(", ", tags)); // merging the tags into a "A, B, ..., Z" string
    
        // adding the Quote object to the list of the scraped quotes
        quotes.add(quote);
    }
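    The Quote class itself is referenced but never shown in these steps; a minimal version consistent with the getters and setters used above could look like this:

```java
// Quote.java: a simple data holder for one scraped quote
public class Quote {
    private String text;
    private String author;
    private String tags;

    public String getText() { return text; }
    public void setText(String text) { this.text = text; }

    public String getAuthor() { return author; }
    public void setAuthor(String author) { this.author = author; }

    public String getTags() { return tags; }
    public void setTags(String tags) { this.tags = tags; }
}
```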
    

    7️⃣ Step 7: How to Crawl the Entire Website with Jsoup

    Crawl the target website by following its pagination. Look for the .next element, which contains the relative URL of the next page, as shown below:

    // the URL of the target website's home page
    String baseUrl = "https://quotes.toscrape.com";
    
    // initializing the list of Quote data objects
    // that will contain the scraped data
    List<Quote> quotes = new ArrayList<>();
    
    // retrieving the home page...
    
    // crawling the website page by page
    while (true) {
        // scraping logic for the current page...

        // looking for the "Next →" HTML element
        Elements nextElements = doc.select(".next");

        // if there is no next page, stop crawling
        if (nextElements.isEmpty()) {
            break;
        }

        // extracting the relative URL of the next page
        String relativeUrl = nextElements.first().getElementsByTag("a").first().attr("href");

        // building the complete URL of the next page
        String completeUrl = baseUrl + relativeUrl;

        // connecting to the next page
        doc = Jsoup
                .connect(completeUrl)
                .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36")
                .get();
    }
    

    8️⃣ Step 8: Export Scraped Data to CSV

    Use the code below to save the scraped data to a CSV file.

    // initializing the output CSV file
    File csvFile = new File("output.csv");
    
    // using the try-with-resources to handle the
    // release of the unused resources when the writing process ends
    try (PrintWriter printWriter = new PrintWriter(csvFile)) {
        // iterating over all quotes
        for (Quote quote : quotes) {
            // converting the quote data into a
            // list of strings
            List<String> row = new ArrayList<>();
            
        // wrapping each field in double quotes
        // to keep the CSV file consistent
        row.add("\"" + quote.getText() + "\"");
        row.add("\"" + quote.getAuthor() + "\"");
        row.add("\"" + quote.getTags() + "\"");
    
            // printing a CSV line
            printWriter.println(String.join(",", row));
        }
    }
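    One caveat with the simple quoting above: if a field itself contains a double quote, the resulting line is no longer valid CSV. The usual fix (per RFC 4180) is to double any embedded quotes before wrapping the field; a small hypothetical helper:

```java
public class CsvUtil {
    // wraps a value in double quotes, doubling any embedded quotes (RFC 4180)
    static String csvField(String value) {
        return "\"" + value.replace("\"", "\"\"") + "\"";
    }

    public static void main(String[] args) {
        System.out.println(csvField("plain"));      // "plain"
        System.out.println(csvField("say \"hi\"")); // "say ""hi"""
    }
}
```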
    

    Putting it all Together

    The full Jsoup web scraper code would look as follows:

    package com.proxytee;
    
    import org.jsoup.*;
    import org.jsoup.nodes.*;
    import org.jsoup.select.Elements;
    
    import java.io.File;
    import java.io.IOException;
    import java.io.PrintWriter;
    import java.nio.charset.StandardCharsets;
    import java.util.ArrayList;
    import java.util.List;
    
    public class Main {
        public static void main(String[] args) throws IOException {
            // the URL of the target website's home page
            String baseUrl = "https://quotes.toscrape.com";
    
            // initializing the list of Quote data objects
            // that will contain the scraped data
            List<Quote> quotes = new ArrayList<>();
    
            // downloading the target website with an HTTP GET request
            Document doc = Jsoup
                    .connect(baseUrl)
                    .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36")
                    .get();
    
            // crawling the website page by page
            while (true) {
                // selecting all quote HTML elements on the current page
                Elements quoteElements = doc.select(".quote");

                // iterating over the quoteElements list of HTML quotes
                for (Element quoteElement : quoteElements) {
                    // initializing a quote data object
                    Quote quote = new Quote();

                    // extracting the text of the quote and removing the
                    // special quote characters
                    String text = quoteElement.select(".text").first().text()
                            .replace("“", "")
                            .replace("”", "");
                    String author = quoteElement.select(".author").first().text();

                    // initializing the list of tags
                    List<String> tags = new ArrayList<>();

                    // iterating over the list of tags
                    for (Element tag : quoteElement.select(".tag")) {
                        // adding the tag string to the list of tags
                        tags.add(tag.text());
                    }

                    // storing the scraped data in the Quote object
                    quote.setText(text);
                    quote.setAuthor(author);
                    quote.setTags(String.join(", ", tags)); // merging the tags into a "A, B, ..., Z" string

                    // adding the Quote object to the list of the scraped quotes
                    quotes.add(quote);
                }

                // looking for the "Next →" HTML element
                Elements nextElements = doc.select(".next");

                // if there is no next page, stop crawling
                if (nextElements.isEmpty()) {
                    break;
                }

                // extracting the relative URL of the next page
                String relativeUrl = nextElements.first().getElementsByTag("a").first().attr("href");

                // building the complete URL of the next page
                String completeUrl = baseUrl + relativeUrl;

                // connecting to the next page
                doc = Jsoup
                        .connect(completeUrl)
                        .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36")
                        .get();
            }
    
            // initializing the output CSV file
            File csvFile = new File("output.csv");
            // using the try-with-resources to handle the
            // release of the unused resources when the writing process ends
            try (PrintWriter printWriter = new PrintWriter(csvFile, StandardCharsets.UTF_8)) {
                // to handle BOM
                printWriter.write('\ufeff');
    
                // iterating over all quotes
                for (Quote quote : quotes) {
                    // converting the quote data into a
                    // list of strings
                    List<String> row = new ArrayList<>();
    
                // wrapping each field in double quotes
                // to keep the CSV file consistent
                row.add("\"" + quote.getText() + "\"");
                row.add("\"" + quote.getAuthor() + "\"");
                row.add("\"" + quote.getTags() + "\"");
    
                    // printing a CSV line
                    printWriter.println(String.join(",", row));
                }
            }
        }
    }
    

    Running this program in your IDE creates an output.csv file containing the scraped quotes.


    In this guide, we explored how to build a web scraper using Jsoup and enhance it with ProxyTee’s proxy services. Jsoup makes HTML parsing and data extraction straightforward, while ProxyTee ensures your scraping activities remain reliable, anonymous, and secure.

    Ready to take your web scraping to the next level? Explore ProxyTee’s solutions today and gain access to unlimited bandwidth, global IP coverage, and more.

    • Java
    • Jsoup
    • Web Scraping


    Table of Contents

    • What is Jsoup?
    • Why use ProxyTee for Web Scraping with Jsoup?
    • Prerequisites
    • How To Build a Web Scraper Using Jsoup
    • Putting it all Together
