Web Scraping with Java: A ProxyTee Guide

Web scraping is a vital tool for businesses and individuals looking to gather data from the web. While various programming languages offer robust options for web scraping, Java stands out with its powerful libraries and capabilities. This guide will explore how to create a web scraper using Java, emphasizing how ProxyTee's features can enhance the process.

Why Choose Java for Web Scraping?

When selecting a programming language for web scraping, you might consider Python, JavaScript (Node.js), PHP, C#, or Java. Each has its strengths and limitations. This guide focuses on Java, highlighting its reliability and compatibility with ProxyTee's residential proxies for efficient and seamless data extraction.

Web Scraping Frameworks in Java

Two popular libraries for web scraping with Java are JSoup and HtmlUnit:

  1. JSoup
    • Designed for handling malformed HTML.
    • The name "JSoup" originates from “tag soup,” referring to non-standard HTML documents.
  2. HtmlUnit
    • A headless browser that emulates real browser behavior.
    • Useful for interacting with web elements like forms and buttons.
    • JavaScript and CSS rendering can be disabled for faster processing.

In this article, we’ll demonstrate the usage of both libraries and how to integrate ProxyTee for optimal performance.

Prerequisites

To follow along, ensure you have:

  • Basic knowledge of Java, HTML, and CSS selectors.
  • Familiarity with XPath for querying HTML documents.
  • Maven for dependency management.

Required Tools

  • Java 8 or newer (an LTS release is recommended)
  • Maven
  • Java IDE: Use IntelliJ IDEA or any IDE supporting Maven dependencies.

To verify installations, run these commands:

java -version
mvn -v

Web Scraping with JSoup

JSoup is a popular Java library for web scraping. The process involves four steps:

Step 1: Create a Project

Create a new Maven project in your IDE and add the dependencies to the pom.xml file.

Step 2: Get the JSoup Library

Add the following dependency to the pom.xml:

<dependencies>
    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.15.3</version>
    </dependency>
</dependencies>

Step 3: Get and Parse HTML

Import the required classes:

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

Use JSoup’s connect function:

Document doc = Jsoup.connect(url).get();

Handle exceptions and set HTTP headers such as User-Agent to reduce the chance of being blocked. ProxyTee's residential proxies are particularly helpful at this step: auto-rotation and global IP coverage provide rotating IP addresses, making your requests harder to detect and block (see the proxy sketch after the snippet below).

try {
    Document doc = Jsoup.connect("https://en.wikipedia.org/wiki/Jsoup")
        .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36")
        .header("Accept-Language", "*")
        .get();
} catch (IOException e) {
    throw new RuntimeException(e);
}
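
Here is a rough sketch of how a proxy could be plugged into this request. JSoup's Connection.proxy(host, port) routes the request through a forward proxy; the host, port, and credentials below are placeholders rather than real ProxyTee endpoints, so substitute the values from your ProxyTee dashboard.

import java.net.Authenticator;
import java.net.PasswordAuthentication;

// Placeholder credentials for an authenticated proxy; many providers,
// including ProxyTee, issue a username and password per endpoint.
Authenticator.setDefault(new Authenticator() {
    @Override
    protected PasswordAuthentication getPasswordAuthentication() {
        return new PasswordAuthentication("username", "password".toCharArray());
    }
});
// Newer JDKs block Basic auth for HTTPS proxy tunnels by default; clearing this
// property may be needed when the target site is served over HTTPS.
System.setProperty("jdk.http.auth.tunneling.disabledSchemes", "");

try {
    Document doc = Jsoup.connect("https://en.wikipedia.org/wiki/Jsoup")
        .proxy("proxy.example.com", 8080) // placeholder proxy host and port
        .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36")
        .header("Accept-Language", "*")
        .get();
    System.out.println(doc.title());
} catch (IOException e) {
    throw new RuntimeException(e);
}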

Step 4: Query HTML

Use JSoup methods such as getElementById, getElementsByClass, or select. The select method is the most general, since it accepts any CSS selector. For details on the other available methods, see the JSoup API docs.

For example, to find the page's heading element:

Element firstHeading = doc.selectFirst(".firstHeading");
System.out.println(firstHeading.text());
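
As a further illustration of the more general select method, the fragment below (continuing from the same doc instance) lists every link on the page; the a[href] selector is just an example:

// Select all anchor elements that have an href attribute.
Elements links = doc.select("a[href]");
for (Element link : links) {
    // "abs:href" resolves each link against the page's base URL.
    System.out.println(link.attr("abs:href") + " - " + link.text());
}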

Here's a combined code example:

package org.proxytee;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.IOException;

public class Main {
    public static void main(String[] args) {
        try {
            Document doc = Jsoup.connect("https://en.wikipedia.org/wiki/Jsoup")
                .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36")
                .header("Accept-Language", "*")
                .get();
            Element firstHeading = doc.selectFirst(".firstHeading");
            System.out.println(firstHeading.text());
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}

Web Scraping with HtmlUnit

HtmlUnit is a headless browser, making it suitable for interacting with web pages: reading text, filling out forms, and clicking buttons. In this guide, we'll focus on its methods for extracting data from web pages.
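
As a brief illustration of that interaction capability (it is not the focus of this guide), the sketch below fills in and submits a search form; the URL, form position, and field names are hypothetical placeholders:

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlForm;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.html.HtmlSubmitInput;
import com.gargoylesoftware.htmlunit.html.HtmlTextInput;
import java.io.IOException;

try (WebClient webClient = new WebClient()) {
    HtmlPage page = webClient.getPage("https://example.com/search"); // placeholder URL
    HtmlForm form = page.getForms().get(0);                          // first form on the page
    HtmlTextInput query = form.getInputByName("q");                  // text field named "q" (assumed)
    query.type("web scraping");
    HtmlSubmitInput submit = form.getInputByName("submit");          // submit button named "submit" (assumed)
    HtmlPage results = submit.click();                               // click() returns the resulting page
    System.out.println(results.getTitleText());
} catch (IOException e) {
    e.printStackTrace();
}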

Step 1: Create a Project

Create a new Maven project in your IDE, just as you did for the JSoup example.

Step 2: Get the HtmlUnit Library

Add the following dependency in the pom.xml file:

<dependency>
    <groupId>net.sourceforge.htmlunit</groupId>
    <artifactId>htmlunit</artifactId>
    <version>2.51.0</version>
</dependency>

Step 3: Get and Parse HTML

Import these:

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.DomNode;
import com.gargoylesoftware.htmlunit.html.DomNodeList;
import com.gargoylesoftware.htmlunit.html.HtmlElement;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

Create a WebClient and disable JavaScript and CSS for faster parsing. Then, get an instance of HtmlPage.

WebClient webClient = new WebClient();
webClient.getOptions().setCssEnabled(false);
webClient.getOptions().setJavaScriptEnabled(false);
HtmlPage page = webClient.getPage("https://librivox.org/the-first-men-in-the-moon-by-hg-wells");

Wrap the code in a try-catch block, since getPage can throw an IOException. You can also extract this logic into a method that returns an HtmlPage, so it can be reused in other parts of the project.

public static HtmlPage getDocument(String url) {
    HtmlPage page = null;
    try (final WebClient webClient = new WebClient()) {
        webClient.getOptions().setCssEnabled(false);
        webClient.getOptions().setJavaScriptEnabled(false);
        page = webClient.getPage(url);
    } catch (IOException e) {
        e.printStackTrace();
    }
    return page;
}

Step 4: Query HTML

You can query the parsed page with DOM methods, XPath expressions, or CSS selectors; choose whichever best matches the structure of the data in the HTML page. (A DOM-navigation sketch follows the CSS selector example below.)

For instance, selecting a heading with XPath can look like this:

HtmlElement book = page.getFirstByXPath("//div[@class=\"content-wrap clearfix\"]/h1");
System.out.print(book.asNormalizedText());

Here's an example that scrapes table contents with CSS selectors; each row is returned as a DomNode:

String selector = ".chapter-download tbody tr";
DomNodeList<DomNode> rows = page.querySelectorAll(selector);
for (DomNode row : rows) {
    String chapter = row.querySelector("td:nth-child(2) a").asNormalizedText();
    String reader = row.querySelector("td:nth-child(3) a").asNormalizedText();
    String duration = row.querySelector("td:nth-child(4)").asNormalizedText();
    System.out.println(chapter + "\t " + reader + "\t " + duration);
}
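
DOM-navigation methods are also available on the same page object. For example, getAnchors() returns every link on the page (shown here purely as an illustration):

import com.gargoylesoftware.htmlunit.html.HtmlAnchor;

// List every anchor element on the page along with its href attribute.
for (HtmlAnchor anchor : page.getAnchors()) {
    System.out.println(anchor.getHrefAttribute() + " - " + anchor.asNormalizedText());
}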

Here's the complete code for the HtmlUnit web scraper:

package org.proxytee;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.DomNode;
import com.gargoylesoftware.htmlunit.html.DomNodeList;
import com.gargoylesoftware.htmlunit.html.HtmlElement;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import java.io.IOException;
public class HtmlUnitDemo {
    public static void main(String[] args) {
        HtmlPage page = HtmlUnitDemo.getDocument("https://librivox.org/the-first-men-in-the-moon-by-hg-wells");
        HtmlElement book = page.getFirstByXPath("//div[@class=\"content-wrap clearfix\"]/h1");
        System.out.print(book.asNormalizedText());
        String selector = ".chapter-download tbody tr";
        DomNodeList<DomNode> rows = page.querySelectorAll(selector);
        for (DomNode row : rows) {
            String chapter = row.querySelector("td:nth-child(2) a").asNormalizedText();
            String reader = row.querySelector("td:nth-child(3) a").asNormalizedText();
            String duration = row.querySelector("td:nth-child(4)").asNormalizedText();
            System.out.println(chapter + "\t " + reader + "\t " + duration);
        }
    }
    public static HtmlPage getDocument(String url) {
        HtmlPage page = null;
        try (final WebClient webClient = new WebClient()) {
            webClient.getOptions().setCssEnabled(false);
            webClient.getOptions().setJavaScriptEnabled(false);
            page = webClient.getPage(url);
        } catch (IOException e) {
            e.printStackTrace();
        }
        return page;
    }
}

In both of these examples, rotating IPs are key to effective and reliable scraping. ProxyTee's unlimited residential proxies, with auto-rotation, prevent blocks and allow uninterrupted, high-volume data collection, and their competitive pricing gives developers better value than comparable services.
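
For reference, here is one way a proxy could be configured for the HtmlUnit scraper above. This is a sketch rather than an official integration, and the host, port, and credentials are placeholders for the values supplied by your ProxyTee account:

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.DefaultCredentialsProvider;
import com.gargoylesoftware.htmlunit.WebClient;

// Route all WebClient traffic through a proxy (placeholder host and port).
WebClient webClient = new WebClient(BrowserVersion.CHROME, "proxy.example.com", 8080);

// Supply proxy credentials if the endpoint is authenticated (placeholder values).
DefaultCredentialsProvider credentials =
    (DefaultCredentialsProvider) webClient.getCredentialsProvider();
credentials.addCredentials("username", "password");

webClient.getOptions().setCssEnabled(false);
webClient.getOptions().setJavaScriptEnabled(false);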

Conclusion

Web scraping is essential for competitive business intelligence. With a foundational understanding of how to scrape with Java, developers can build tailored solutions. In this guide, you've seen practical applications using both JSoup and HtmlUnit.

Choosing ProxyTee provides several advantages. With features like unlimited bandwidth, auto IP rotation, and global IP coverage, users can focus more on scraping efficiently and reliably without unexpected expenses.

Beyond the techniques shown above with JSoup and HtmlUnit, ProxyTee's simple API makes it easy to add proxies to existing Java projects and keep scraping uninterrupted. Its residential proxies handle IP rotation and help bypass blocking mechanisms and third-party request limits at a competitive cost, and you can choose between unlimited residential, static, or datacenter proxies according to your project's needs.