Web Scraping with lxml: A Guide Using ProxyTee

Web scraping is the automated process of collecting data from websites, and it is essential for many purposes, such as data analysis and training AI models. Python is a popular language for web scraping, and lxml is a robust library for parsing HTML and XML documents. In this post, we'll explore how to leverage lxml for web scraping and how ProxyTee can enhance your scraping projects.
Introducing ProxyTee
ProxyTee offers Unlimited Residential Proxies, a powerful tool for web scraping. It is known for its reliability, affordability, and ease of use, making it a strong option for both businesses and individuals.
Key benefits of using ProxyTee for web scraping include:
- Unlimited Bandwidth: ProxyTee ensures that your high-traffic tasks will not be interrupted by bandwidth concerns.
- Global IP Coverage: Access over 20 million IPs across 100+ countries through ProxyTee’s extensive global network, enabling precise geo-targeting and local operations.
- Multiple Protocol Support: Supporting both HTTP and SOCKS5 protocols, ProxyTee ensures maximum compatibility with a range of tools and applications.
- Auto Rotation: Benefit from IP auto-rotation, which changes your IP address at intervals of 3 to 60 minutes to avoid IP blocks and restrictions; you can customize the interval to suit your needs.
- User-Friendly Interface: Start immediately, no technical skills required, thanks to a clean and easy-to-navigate GUI.
- Simple API: Automate proxy-related tasks with ProxyTee’s simple API for a seamless experience when integrating proxies into your applications.
- Affordable Pricing: Compared to competitors, ProxyTee’s unlimited residential proxies can cost up to 50% less without compromising on quality.
Getting Started with lxml for Web Scraping
Before starting, you’ll need to install lxml, requests, and cssselect:
pip install lxml requests cssselect
These libraries enable you to parse HTML/XML, fetch web pages, and extract HTML elements using CSS selectors.
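To see how these pieces fit together before scraping a real site, here is a minimal sketch (the HTML snippet is hard-coded purely for illustration) that locates the same element with both XPath and a CSS selector:
from lxml import html

# A tiny hand-written snippet, just for illustration
snippet = '<div><p class="price_color">£19.99</p></div>'
tree = html.fromstring(snippet)

# The same element, located two different ways
print(tree.xpath('//p[@class="price_color"]/text()')[0])  # £19.99
print(tree.cssselect("p.price_color")[0].text_content())  # £19.99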
Scraping Static Content
Static content is embedded in the HTML document, making it easy to scrape. Here’s how to extract data from a website with static content, like the Books to Scrape website:
import requests
from lxml import html
import json

URL = "https://books.toscrape.com/"

# Fetch the page and parse its HTML
content = requests.get(URL).text
parsed = html.fromstring(content)

# Each book sits in an <article class="product_pod"> element
all_books = parsed.xpath('//article[@class="product_pod"]')

books = []
for book in all_books:
    # xpath() returns a list of matches, so take the first one
    book_title = book.xpath('.//h3/a/@title')[0]
    price = book.cssselect("p.price_color")[0].text_content()
    books.append({"title": book_title, "price": price})

with open("books.json", "w", encoding="utf-8") as file:
    json.dump(books, file)
This code fetches the HTML, parses it, locates book data using XPath and CSS selectors, and saves titles and prices to a books.json file.
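The same approach extends naturally to multiple pages. The sketch below assumes the site’s actual pagination markup, a li.next link on each page (adjust the selector if the markup differs), and follows it until the last page:
import requests
from lxml import html
from urllib.parse import urljoin

url = "https://books.toscrape.com/"
books = []

while url:
    page = html.fromstring(requests.get(url).text)
    for book in page.xpath('//article[@class="product_pod"]'):
        books.append({
            "title": book.xpath('.//h3/a/@title')[0],
            "price": book.cssselect("p.price_color")[0].text_content(),
        })
    # Follow the "next" link, resolving it against the current page's URL
    next_link = page.cssselect("li.next a")
    url = urljoin(url, next_link[0].get("href")) if next_link else None

print(f"Scraped {len(books)} books")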
Scraping Dynamic Content
Dynamic content is rendered with JavaScript, making scraping a bit more complex. We will use selenium for this; recent versions of Selenium (4.6+) ship with Selenium Manager, which downloads a matching browser driver automatically:
pip install selenium
Here is an example using the YouTube channel freeCodeCamp.org:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from lxml import html
from time import sleep
import json

URL = "https://www.youtube.com/@freecodecamp/videos"
videos = []

driver = webdriver.Chrome()
driver.get(URL)
sleep(3)

# Press End a few times so the infinite scroll loads more videos
parent = driver.find_element(By.TAG_NAME, 'html')
for i in range(4):
    parent.send_keys(Keys.END)
    sleep(3)

# Hand the fully rendered page source to lxml for parsing
html_data = html.fromstring(driver.page_source)
videos_html = html_data.cssselect("a#video-title-link")
for video in videos_html:
    title = video.text_content()
    link = "https://www.youtube.com" + video.get("href")
    videos.append({"title": title, "link": link})

with open('videos.json', 'w') as file:
    json.dump(videos, file)

driver.quit()
This script uses Selenium to load the page, simulates scrolling so more videos load, and then parses the rendered HTML with lxml to extract the data.
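A note on robustness: fixed sleep() calls are fragile, since too short a wait misses content and too long a wait wastes time. As a sketch of an alternative, Selenium's explicit waits block only until the element you care about actually appears, and Chrome can run headless on a server:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = Options()
options.add_argument("--headless=new")  # run Chrome without a visible window
driver = webdriver.Chrome(options=options)
driver.get("https://www.youtube.com/@freecodecamp/videos")

# Wait up to 15 seconds for the first video link instead of sleeping blindly
WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "a#video-title-link"))
)
print(len(driver.find_elements(By.CSS_SELECTOR, "a#video-title-link")))
driver.quit()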
Enhancing Scraping with ProxyTee
Websites often implement anti-scraping measures. ProxyTee helps you bypass these restrictions by providing rotating residential IPs. Here’s how to integrate ProxyTee into the previous static scraping script:
import requests
from lxml import html
import json

URL = "https://books.toscrape.com/"

# ProxyTee credentials
username = "YOUR_MYPROXY_USERNAME"
password = "YOUR_MYPROXY_PASSWORD"
hostname = "YOUR_MYPROXY_HOSTNAME"

# Talk to the proxy over plain HTTP; the "http"/"https" keys refer to the
# scheme of the target URL, not of the proxy itself
proxies = {
    "http": f"http://{username}:{password}@{hostname}",
    "https": f"http://{username}:{password}@{hostname}",
}

content = requests.get(URL, proxies=proxies).text
parsed = html.fromstring(content)

all_books = parsed.xpath('//article[@class="product_pod"]')
books = []
for book in all_books:
    book_title = book.xpath('.//h3/a/@title')[0]
    price = book.cssselect("p.price_color")[0].text_content()
    books.append({"title": book_title, "price": price})

with open("books.json", "w", encoding="utf-8") as file:
    json.dump(books, file)
Replace YOUR_MYPROXY_USERNAME, YOUR_MYPROXY_PASSWORD, and YOUR_MYPROXY_HOSTNAME with your actual ProxyTee credentials. This code directs requests through ProxyTee, enabling anonymous and secure web scraping.
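Before running a full scrape, it is worth confirming that traffic really flows through the proxy. One quick check, using the public IP echo service api.ipify.org (any similar service works), is to compare your visible IP with and without the proxy:
import requests

# Same placeholder credentials as above
username = "YOUR_MYPROXY_USERNAME"
password = "YOUR_MYPROXY_PASSWORD"
hostname = "YOUR_MYPROXY_HOSTNAME"

proxies = {
    "http": f"http://{username}:{password}@{hostname}",
    "https": f"http://{username}:{password}@{hostname}",
}

# The two addresses should differ if the proxy is in the request path
print("Direct IP: ", requests.get("https://api.ipify.org").text)
print("Proxied IP:", requests.get("https://api.ipify.org", proxies=proxies).text)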