    Tutorial

    Web Scraping with Java: A ProxyTee Guide

    January 17, 2025 Mike

    Web scraping is a vital tool for businesses and individuals looking to gather data from the web. While various programming languages offer robust options for web scraping, Java stands out with its powerful libraries and capabilities. This guide will explore how to create a web scraper using Java, emphasizing how ProxyTee’s features can enhance the process.


    Why Choose Java for Web Scraping?

    When selecting a programming language for web scraping, you might consider Python, JavaScript (Node.js), PHP, C#, or Java. Each has its strengths and limitations. This guide focuses on Java, highlighting its reliability and compatibility with ProxyTee’s residential proxies for efficient and seamless data extraction.

    Two popular libraries for web scraping with Java are JSoup and HtmlUnit:

    1. JSoup
      • Designed for handling malformed HTML.
      • The name “JSoup” originates from “tag soup,” referring to non-standard HTML documents.
    2. HtmlUnit
      • A headless browser that emulates real browser behavior.
      • Useful for interacting with web elements like forms and buttons.
      • JavaScript and CSS rendering can be disabled for faster processing.

    In this article, we’ll demonstrate the usage of both libraries and how to integrate ProxyTee for optimal performance.


    Prerequisites

    To follow along, ensure you have:

    • Basic knowledge of Java, HTML, and CSS selectors.
    • Familiarity with XPath for querying HTML documents.
    • Maven for dependency management.

    Required Tools

    • Java LTS 8+
    • Maven
    • Java IDE: Use IntelliJ IDEA or any IDE supporting Maven dependencies.

    To verify installations, run these commands:

    java -version
    mvn -v

    Web Scraping with JSoup

The process involves four steps:

    1️⃣ Step 1: Create a Project

    Create a new Maven project in your IDE and add the dependencies to the pom.xml file.

    2️⃣ Step 2: Get the JSoup Library

    Add the following dependency to the pom.xml:

    <dependencies>
        <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
            <version>1.15.3</version>
        </dependency>
    </dependencies>

    3️⃣ Step 3: Get and Parse HTML

    Import the required classes:

    import org.jsoup.Connection;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    Use JSoup’s connect function:

    Document doc = Jsoup.connect(url).get();

Handle exceptions, and remember to set HTTP headers to avoid blocks. For best results, route requests through ProxyTee’s residential proxies: auto-rotating IP addresses and global coverage make your scraper much harder to detect and block.

try {
    Document doc = Jsoup.connect("https://en.wikipedia.org/wiki/Jsoup")
        .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36")
        .header("Accept-Language", "*")
        .get();
} catch (IOException e) {
    throw new RuntimeException(e);
}

    4️⃣ Step 4: Query HTML

Use JSoup methods like getElementById, getElementsByClass, or select. The select method is the most general. For details on other available methods, see the JSoup API docs.

For example, to find the page’s heading element:

    Element firstHeading = doc.selectFirst(".firstHeading");
    System.out.println(firstHeading.text());
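As a quick offline illustration of these query methods, you can parse an in-memory HTML string with Jsoup.parse, with no network request involved; the markup below is a made-up fragment for demonstration only:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class SelectorDemo {
    public static void main(String[] args) {
        // A made-up HTML fragment, parsed without any network request
        String html = "<div id=\"main\"><h1 class=\"firstHeading\">Jsoup</h1>"
                + "<a href=\"https://jsoup.org\">site</a></div>";
        Document doc = Jsoup.parse(html);
        System.out.println(doc.selectFirst(".firstHeading").text());             // Jsoup
        System.out.println(doc.getElementById("main").select("a").attr("href")); // https://jsoup.org
    }
}
```

The same selectFirst, getElementById, and select calls work identically on a document fetched with Jsoup.connect.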

    Here’s a combined code example:

    package org.proxytee;
    import org.jsoup.Connection;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;
    import java.io.IOException;
    
    public class Main {
        public static void main(String[] args) {
            try {
                Document doc = Jsoup.connect("https://en.wikipedia.org/wiki/Jsoup")
                    .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36")
                    .header("Accept-Language", "*")
                    .get();
                Element firstHeading = doc.selectFirst(".firstHeading");
                System.out.println(firstHeading.text());
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }
    }

    Web Scraping with HtmlUnit

HtmlUnit is a headless browser, which makes it well suited for interacting with web pages: reading text, filling out forms, and clicking buttons. Here we’ll focus on its methods for extracting data.

    1️⃣ Step 1: Create a Project

Create a new Maven project in your IDE.

    2️⃣ Step 2: Get the HtmlUnit Library

    Add the following dependency in the pom.xml file:

    <dependency>
        <groupId>net.sourceforge.htmlunit</groupId>
        <artifactId>htmlunit</artifactId>
        <version>2.51.0</version>
    </dependency>

    3️⃣ Step 3: Get and Parse HTML

    Import these:

    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.html.DomNode;
    import com.gargoylesoftware.htmlunit.html.DomNodeList;
    import com.gargoylesoftware.htmlunit.html.HtmlElement;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;

    Create a WebClient and disable JavaScript and CSS for faster parsing. Then, get an instance of HtmlPage.

    WebClient webClient = new WebClient();
    webClient.getOptions().setCssEnabled(false);
    webClient.getOptions().setJavaScriptEnabled(false);
    HtmlPage page = webClient.getPage("https://librivox.org/the-first-men-in-the-moon-by-hg-wells");

Wrap your code in a try-catch, since getPage can throw an IOException. You can also wrap this logic in a function that returns an HtmlPage, for reuse elsewhere in your project.

    public static HtmlPage getDocument(String url) {
        HtmlPage page = null;
        try (final WebClient webClient = new WebClient()) {
            webClient.getOptions().setCssEnabled(false);
            webClient.getOptions().setJavaScriptEnabled(false);
            page = webClient.getPage(url);
        } catch (IOException e) {
            e.printStackTrace();
        }
        return page;
    }
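If you want HtmlUnit to go through a proxy (for example, a ProxyTee gateway), the WebClient options accept a ProxyConfig. The host and port below are placeholders, and the exact ProxyConfig constructor signature can vary between HtmlUnit versions, so check the Javadoc for the version in your pom.xml:

```java
import com.gargoylesoftware.htmlunit.ProxyConfig;
import com.gargoylesoftware.htmlunit.WebClient;

public class ProxyClient {
    public static WebClient newProxiedClient() {
        WebClient webClient = new WebClient();
        webClient.getOptions().setCssEnabled(false);
        webClient.getOptions().setJavaScriptEnabled(false);
        // Placeholder gateway; use the host/port from your proxy provider
        webClient.getOptions().setProxyConfig(new ProxyConfig("proxy.example.com", 8080));
        return webClient;
    }
}
```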

    4️⃣ Step 4: Query HTML

Use DOM methods, XPath expressions, or CSS selectors, choosing whichever best matches the structure of the page you are scraping.

For instance, using XPath to select a heading can look like this:

    HtmlElement book = page.getFirstByXPath("//div[@class=\"content-wrap clearfix\"]/h1");
    System.out.print(book.asNormalizedText());

Here’s an example that scrapes table contents with CSS selectors; each row is returned as a DomNode:

    String selector = ".chapter-download tbody tr";
    DomNodeList<DomNode> rows = page.querySelectorAll(selector);
    for (DomNode row : rows) {
        String chapter = row.querySelector("td:nth-child(2) a").asNormalizedText();
        String reader = row.querySelector("td:nth-child(3) a").asNormalizedText();
        String duration = row.querySelector("td:nth-child(4)").asNormalizedText();
        System.out.println(chapter + "\t " + reader + "\t " + duration);
    }
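The loop above prints tab-separated values. If you want to persist the rows instead, a small CSV helper (plain standard-library Java, independent of HtmlUnit, shown here as one possible approach) keeps commas and quotes inside fields from corrupting the file:

```java
import java.util.List;

public class CsvExport {
    // Quote a field when it contains a comma, quote, or newline (RFC 4180 style)
    static String escape(String field) {
        if (field.contains(",") || field.contains("\"") || field.contains("\n")) {
            return "\"" + field.replace("\"", "\"\"") + "\"";
        }
        return field;
    }

    // Join already-escaped fields into one CSV record
    static String toCsvLine(List<String> fields) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < fields.size(); i++) {
            if (i > 0) sb.append(',');
            sb.append(escape(fields.get(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // e.g. one scraped row: chapter, reader, duration
        System.out.println(toCsvLine(List.of("Chapter 1, Part 1", "Jane Doe", "00:12:34")));
        // → "Chapter 1, Part 1",Jane Doe,00:12:34
    }
}
```

Inside the scraping loop you would call toCsvLine(List.of(chapter, reader, duration)) and write each line to a file.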

Here’s the complete code for the HtmlUnit web scraper:

    package org.proxytee;
    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.html.DomNode;
    import com.gargoylesoftware.htmlunit.html.DomNodeList;
    import com.gargoylesoftware.htmlunit.html.HtmlElement;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;
    import java.io.IOException;
    public class HtmlUnitDemo {
        public static void main(String[] args) {
            HtmlPage page = HtmlUnitDemo.getDocument("https://librivox.org/the-first-men-in-the-moon-by-hg-wells");
            HtmlElement book = page.getFirstByXPath("//div[@class=\"content-wrap clearfix\"]/h1");
            System.out.print(book.asNormalizedText());
            String selector = ".chapter-download tbody tr";
            DomNodeList<DomNode> rows = page.querySelectorAll(selector);
            for (DomNode row : rows) {
                String chapter = row.querySelector("td:nth-child(2) a").asNormalizedText();
                String reader = row.querySelector("td:nth-child(3) a").asNormalizedText();
                String duration = row.querySelector("td:nth-child(4)").asNormalizedText();
                System.out.println(chapter + "\t " + reader + "\t " + duration);
            }
        }
        public static HtmlPage getDocument(String url) {
            HtmlPage page = null;
            try (final WebClient webClient = new WebClient()) {
                webClient.getOptions().setCssEnabled(false);
                webClient.getOptions().setJavaScriptEnabled(false);
                page = webClient.getPage(url);
            } catch (IOException e) {
                e.printStackTrace();
            }
            return page;
        }
    }

In each of these examples, rotating IPs are key to effective and reliable scraping. ProxyTee prevents blocks and keeps data collection uninterrupted: its unlimited residential proxies with auto-rotation are well suited to high-volume scraping that demands consistency, and its competitive pricing gives developers the tools they need at a lower cost than competitors.
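Even with rotating proxies, individual requests can still fail transiently (timeouts, rate limits), so pairing the proxy with a retry-and-backoff policy is a common pattern. The sketch below is illustrative only, not part of the ProxyTee API:

```java
public class Backoff {
    // Exponential backoff: base * 2^attempt, capped so delays stay bounded
    static long delayMillis(int attempt, long baseMillis, long capMillis) {
        long delay = baseMillis << Math.min(attempt, 16); // clamp the shift to avoid overflow
        return Math.min(delay, capMillis);
    }

    public static void main(String[] args) {
        for (int attempt = 0; attempt < 6; attempt++) {
            System.out.println("attempt " + attempt + " -> wait "
                    + delayMillis(attempt, 500, 30_000) + " ms");
        }
        // Delays: 500, 1000, 2000, 4000, 8000, 16000 ms
    }
}
```

A scraper would sleep for delayMillis(attempt, …) between retries of Jsoup.connect(...).get() or webClient.getPage(...), giving the proxy pool time to rotate to a fresh IP.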


    Advantages of ProxyTee

ProxyTee offers unlimited bandwidth, automatic IP rotation, and global IP coverage, so you can focus on scraping efficiently and reliably without unexpected expenses. Beyond the JSoup and HtmlUnit workflows shown above, ProxyTee’s simple API makes it easy to add proxies to your Java projects, and its residential proxies help bypass blocking mechanisms at a competitive cost, handling IP rotation and third-party connection or request limits for you. You can choose between unlimited residential, static, or datacenter proxies, depending on your project’s needs.

    • Data Extraction
    • HtmlUnit
    • Java
    • Jsoup
    • Web Scraping
