
Web Scraping With Java (Jsoup & HtmlUnit)

January 17, 2025 · Mike

    Web scraping with Java is a reliable and scalable solution for extracting data from websites, whether for competitive analysis, data aggregation, or automation. While Python may dominate the scraping world, Java offers enterprise-grade performance, powerful libraries like JSoup and HtmlUnit, and seamless integration with proxy services. In this article, developers will learn how to build real-world scrapers using Java, parse static and dynamic HTML, handle pagination, and integrate ProxyTee’s rotating residential proxies to avoid IP bans and maximize data throughput.

    Why Use Java for Web Scraping Projects

    Web scraping with Java is often overlooked, but it offers strong typing, thread management, and a mature ecosystem. Java’s JSoup library simplifies HTML parsing, while HtmlUnit simulates browser behavior for dynamic websites. These tools are stable, well-documented, and battle-tested for production use. When paired with modern proxy services, Java scrapers can handle large-scale, automated scraping pipelines without being detected or blocked.
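As an illustration of the thread-management point, here is a minimal sketch of fetching several pages in parallel with an ExecutorService (the URLs are placeholders, and JSoup, introduced below, does the fetching):

import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.jsoup.Jsoup;

public class ParallelFetch {
    public static void main(String[] args) throws Exception {
        List<String> urls = Arrays.asList(
                "https://example.com/a", "https://example.com/b");
        ExecutorService pool = Executors.newFixedThreadPool(4); // cap concurrent requests
        for (String url : urls) {
            pool.submit(() -> {
                try {
                    System.out.println(Jsoup.connect(url).get().title());
                } catch (Exception e) {
                    System.err.println("Failed: " + url + " (" + e.getMessage() + ")");
                }
            });
        }
        pool.shutdown(); // finish queued tasks, then stop the pool
    }
}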

    Setting Up Your Java Scraping Environment

    To follow along, ensure you have:

    • Basic knowledge of Java, HTML, and CSS selectors.
    • Familiarity with XPath for querying HTML documents.
    • Maven for dependency management.

    Required Tools

    • Java 8 or later (a current LTS release is recommended)
    • Maven
    • A Java IDE: IntelliJ IDEA or any IDE that supports Maven dependencies.

    To verify installations, run these commands:

     java -version
     mvn -v

    Scraping Static HTML with JSoup

    What is JSoup?

    JSoup is a Java library for working with real-world HTML. It provides a convenient API for:

    • Fetching and parsing HTML
    • Traversing the DOM
    • Extracting and manipulating data
    • Handling malformed HTML gracefully

    It’s especially useful when dealing with static pages that don’t require JavaScript to render content.

    Step 1️⃣: Set Up Your Java Project

    If you’re using Maven, add the JSoup dependency to your pom.xml:

    <dependencies>
        <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
            <version>1.17.2</version>
        </dependency>
    </dependencies>

    If you’re using Gradle:

    implementation 'org.jsoup:jsoup:1.17.2'

    Or download the jar directly from jsoup.org/download.

    Step 2️⃣: Load an HTML Page

    JSoup allows you to fetch a web page with a single line:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    
    public class Scraper {
        public static void main(String[] args) throws Exception {
            String url = "https://example.com";
            Document doc = Jsoup.connect(url).get();
            System.out.println(doc.title());
        }
    }

    Notes:

    • Jsoup.connect(url).get() fetches the page and parses it into a Document.
    • This works best for static pages—i.e., HTML content that is fully rendered on the server.
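    Jsoup.connect(url) actually returns a Connection you can configure before calling get(). A minimal sketch (the user-agent string and timeout values are illustrative):

    Document doc = Jsoup.connect("https://example.com")
            .userAgent("Mozilla/5.0 (compatible; MyScraper/1.0)") // some sites block Java's default agent
            .timeout(10_000) // give up after 10 seconds instead of hanging
            .get();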

    Step 3️⃣: Inspect and Target the HTML Elements

    Use browser DevTools (Right-click → Inspect) to find the elements you want to scrape. Look for tags, IDs, or class names.

    Suppose we’re targeting this HTML:

    <div class="product">
        <h2 class="title">Example Product</h2>
        <span class="price">$99.99</span>
    </div>

    We want to extract the product title and price.

    Step 4️⃣: Extract Content with CSS Selectors

    Use JSoup’s select() method (based on CSS selectors) to access elements.

    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;
    
    Elements products = doc.select("div.product");
    
    for (Element product : products) {
        String title = product.selectFirst("h2.title").text();
        String price = product.selectFirst("span.price").text();
    
        System.out.println("Product: " + title + " | Price: " + price);
    }

    Explanation:

    • doc.select("div.product"): selects all <div> elements with the product class.
    • selectFirst("h2.title"): gets the first <h2> child with the title class.
    • .text(): extracts the visible text content.
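
    The same selectors extend naturally to pagination. A minimal sketch that follows "next page" links (the a.next selector is an assumption; inspect the real link on your target site):

    String nextUrl = "https://example.com/products";
    while (nextUrl != null) {
        Document page = Jsoup.connect(nextUrl).get();
        for (Element product : page.select("div.product")) {
            System.out.println(product.selectFirst("h2.title").text());
        }
        // Follow the "next" link if present; absUrl resolves it against the page URL
        Element next = page.selectFirst("a.next");
        nextUrl = (next != null) ? next.absUrl("href") : null;
    }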

    Step 5️⃣: Scrape Attributes (e.g., URLs, Images)

    JSoup can also retrieve element attributes like href, src, or data-*:

    Given this HTML:

    <a class="download" href="/files/report.pdf">Download</a>
    <img src="image.png" alt="Preview">

    You can read those attributes like this:

    String downloadLink = doc.selectFirst("a.download").attr("href");
    String imageUrl = doc.selectFirst("img").attr("src");
    
    System.out.println("PDF: " + downloadLink);
    System.out.println("Image: " + imageUrl);

    Step 6️⃣: Handle Malformed or Incomplete HTML

    JSoup is designed to parse broken HTML. For example:

    <ul>
      <li>First item
      <li>Second item
    </ul>

    Even though the <li> tags are never closed, JSoup repairs the tree, so both items remain selectable:

    Elements items = doc.select("ul li");
    for (Element item : items) {
        System.out.println(item.text());
    }

    Step 7️⃣: Scrape Local or String-Based HTML

    You can also load HTML from a file or a string:

    File input = new File("page.html");
    Document doc = Jsoup.parse(input, "UTF-8");
    
    String html = "<div><p>Hello</p></div>";
    Document doc2 = Jsoup.parse(html);
    System.out.println(doc2.body().text()); // prints: Hello

    Best Practices

    • Respect robots.txt: always check whether scraping is allowed.
    • Set a user-agent header: some sites block Java's default user agent.
    • Add delays between requests to avoid overloading the server (see the sketch below).
    • Don't scrape JavaScript-heavy websites with JSoup alone; use HtmlUnit (covered next), Selenium, or Playwright.
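
    A minimal sketch of spacing out requests (the URLs are placeholders; combine this with the user-agent configuration shown earlier):

    String[] urls = { "https://example.com/page1", "https://example.com/page2" };
    for (String url : urls) {
        Document doc = Jsoup.connect(url).get();
        System.out.println(doc.title());
        Thread.sleep(2_000); // wait two seconds before the next request
    }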

    Scraping Dynamic Content with HtmlUnit

    What is HtmlUnit?

    HtmlUnit is a “GUI-less browser for Java programs.” It supports JavaScript execution, cookie handling, redirects, and even emulates different browser versions. It’s particularly useful for scraping SPAs (Single Page Applications) or any content rendered asynchronously with JavaScript.

    Step 1️⃣: Setting Up HtmlUnit

    Let’s create a Maven project and include HtmlUnit as a dependency.

    Add this to your pom.xml:

    <dependencies>
      <dependency>
        <groupId>org.htmlunit</groupId>
        <artifactId>htmlunit</artifactId>
        <version>3.11.0</version> <!-- or latest -->
      </dependency>
    </dependencies>

    Note: HtmlUnit 3.x lives under the org.htmlunit group ID and package namespace; releases up to 2.70 used net.sourceforge.htmlunit and the com.gargoylesoftware.htmlunit packages.

    If you’re not using Maven, you can download the .jar files from HtmlUnit Downloads.

    Step 2️⃣: Configure the WebClient

    HtmlUnit uses WebClient as its main browser emulator. You can configure it to act like Chrome, enable/disable JavaScript, manage cookies, and set timeouts.

    import org.htmlunit.WebClient;
    import org.htmlunit.BrowserVersion;
    
    WebClient webClient = new WebClient(BrowserVersion.CHROME);
    webClient.getOptions().setJavaScriptEnabled(true);
    webClient.getOptions().setCssEnabled(false);
    webClient.getOptions().setThrowExceptionOnScriptError(false);
    webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
    webClient.getOptions().setTimeout(10000); // 10 seconds timeout

    Step 3️⃣: Load a Page and Wait for JavaScript to Finish

    Unlike static HTML parsing, dynamic scraping with HtmlUnit means giving JavaScript time to execute.

    import org.htmlunit.html.HtmlPage;
    
    String url = "https://example.com/dynamic-page";
    HtmlPage page = webClient.getPage(url);
    
    // Wait for background JS (like AJAX calls) to finish
    webClient.waitForBackgroundJavaScript(5000); // Wait up to 5 seconds

    Step 4️⃣: Extract Data from the Final DOM

    Once JavaScript has run and the page is rendered, you can use XPath or DOM methods to extract content.

    Example: Extract all product names inside <div class="product-name">

    import java.util.List;
    import org.htmlunit.html.HtmlDivision;
    
    List<HtmlDivision> products = page.getByXPath("//div[@class='product-name']");
    
    for (HtmlDivision div : products) {
        System.out.println("Product Name: " + div.asNormalizedText());
    }

    You can also use querySelector or getElementById if you prefer DOM-style access:

    String price = page.getElementById("price").asNormalizedText();
    Step 5️⃣: Handle Login Forms or Cookies (Optional)

    Many dynamic sites require login. HtmlUnit can fill out forms just like a browser:

    import org.htmlunit.html.HtmlForm;

    HtmlPage loginPage = webClient.getPage("https://example.com/login");
    HtmlForm form = loginPage.getForms().get(0);
    
    // setValue mimics a user typing into the field (HtmlUnit 3.x)
    form.getInputByName("username").setValue("myUsername");
    form.getInputByName("password").setValue("myPassword");
    
    HtmlPage dashboard = form.getButtonByName("submit").click();
    webClient.waitForBackgroundJavaScript(3000);

    Example: Full HtmlUnit Scraper

    import java.util.List;

    import org.htmlunit.BrowserVersion;
    import org.htmlunit.WebClient;
    import org.htmlunit.html.HtmlAnchor;
    import org.htmlunit.html.HtmlPage;

    public class DynamicScraperExample {
        public static void main(String[] args) throws Exception {
            try (final WebClient webClient = new WebClient(BrowserVersion.CHROME)) {
                webClient.getOptions().setJavaScriptEnabled(true);
                webClient.getOptions().setCssEnabled(false);
                webClient.getOptions().setThrowExceptionOnScriptError(false);
    
                HtmlPage page = webClient.getPage("https://news.example.com/");
                webClient.waitForBackgroundJavaScript(3000);
    
                // The XPath matches anchor elements, so type them as HtmlAnchor
                List<HtmlAnchor> titles = page.getByXPath("//a[@class='titlelink']");
    
                for (HtmlAnchor title : titles) {
                    System.out.println(title.asNormalizedText());
                }
            }
        }
    }

    ProxyTee Integration for Scalable and Undetectable Scraping

    Using proxies is essential for large-scale scraping. ProxyTee’s residential proxies provide unlimited bandwidth, automatic IP rotation, and global coverage. These proxies help prevent blocks and bans when accessing public or rate-limited websites.

    You can configure JSoup or HtmlUnit to use ProxyTee proxies using Java system properties:

    System.setProperty("http.proxyHost", "55.66.77.88");
    System.setProperty("http.proxyPort", "10001"); 

    For authenticated proxies, note that the standard JDK does not honor http.proxyUser or http.proxyPassword; register a java.net.Authenticator with your credentials instead (see the sketch below).

    For more dynamic needs, ProxyTee provides API-based rotating proxies that change IPs every request or after set intervals, allowing true scaling.
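
    Alternatively, you can scope the proxy to a single client rather than the whole JVM, and supply credentials through java.net.Authenticator. A sketch (host, port, and credentials are placeholders):

    import java.net.Authenticator;
    import java.net.PasswordAuthentication;

    // Proxy credentials (placeholders; substitute your ProxyTee values)
    Authenticator.setDefault(new Authenticator() {
        @Override
        protected PasswordAuthentication getPasswordAuthentication() {
            return new PasswordAuthentication("username", "password".toCharArray());
        }
    });

    // JSoup: set the proxy per request
    Document doc = Jsoup.connect("https://example.com")
            .proxy("55.66.77.88", 10001)
            .get();

    // HtmlUnit: pass the proxy host and port to the WebClient constructor
    WebClient webClient = new WebClient(BrowserVersion.CHROME, "55.66.77.88", 10001);

    One caveat: since Java 8u111, basic authentication for HTTPS tunneling is disabled by default, so you may need to clear the jdk.http.auth.tunneling.disabledSchemes system property.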

    Real-World Use Cases for Java Web Scraping

    • Price Intelligence: Track competitor pricing across hundreds of product pages
    • Content Monitoring: Detect changes in public data or news feeds
    • SEO Audits: Analyze backlinks or metadata across large domains
    • Academic Research: Gather abstracts and citations from journals
    • Travel Aggregators: Scrape ticket and hotel data from multiple providers

    Each of these can benefit from ProxyTee’s proxy flexibility and Java’s reliability.

    Comparing JSoup and HtmlUnit for Scraping Tasks

    • Browser emulation: HtmlUnit mimics full browser behavior, including JavaScript execution; JSoup handles only static content
    • Language compatibility: Both libraries are written in Java and integrate seamlessly
    • Setup difficulty: JSoup has a simpler setup and learning curve
    • Speed and performance: JSoup is faster for static sites; HtmlUnit is heavier but handles JavaScript
    • Community and documentation: JSoup has more recent updates and a larger community

    For most sites, start with JSoup, and fall back to HtmlUnit only when dynamic interaction is needed.

    Next Steps for Building Smarter Scrapers

    Now that you’ve learned the technical details of web scraping with Java, it’s time to build out production-ready workflows. Consider using schedulers to automate scraping intervals, exporting data to databases or CSV files (a CSV sketch follows below), and implementing error handling for downtime and response anomalies. Integrating ProxyTee’s proxies allows your scraper to avoid detection and keep running continuously without intervention. Explore more advanced techniques such as rotating user agents, captcha solving, or cookie handling as your projects grow. Web scraping with Java is more than possible; it is powerful when paired with the right tools and architecture.
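
    As one example, here is a minimal sketch of exporting the product data from the JSoup section to a CSV file (the selectors reuse the earlier hypothetical HTML):

    import java.io.PrintWriter;

    try (PrintWriter out = new PrintWriter("products.csv")) {
        out.println("title,price"); // header row
        for (Element product : doc.select("div.product")) {
            String title = product.selectFirst("h2.title").text();
            String price = product.selectFirst("span.price").text();
            out.println(title + "," + price); // naive CSV; quote fields that can contain commas
        }
    }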

    Tags: Data Extraction, Java, Web Scraping
