Web Scraping with lxml: A Guide Using ProxyTee

Web scraping is the automated process of collecting data from websites, and it is essential for many purposes, such as data analysis and training AI models. Python is a popular language for web scraping, and lxml is a robust library for parsing HTML and XML documents. In this post, we'll explore how to leverage lxml for web scraping and how ProxyTee can enhance your scraping projects.
Introducing ProxyTee
ProxyTee offers Unlimited Residential Proxies, a powerful tool for web scraping. It is known for its reliability, affordability, and ease of use, making it a strong option for both businesses and individuals.
Key benefits of using ProxyTee for web scraping include:
- Unlimited Bandwidth: ProxyTee ensures that your high-traffic tasks will not be interrupted by bandwidth concerns.
- Global IP Coverage: Access over 20 million IPs across 100+ countries through ProxyTee’s extensive global network, enabling precise geo-targeting and local operations.
- Multiple Protocol Support: Supporting both HTTP and SOCKS5 protocols, ProxyTee ensures maximum compatibility with a range of tools and applications.
- Auto Rotation: Benefit from IP auto-rotation, which changes your IP address at intervals of 3 to 60 minutes to avoid IP blocks and restrictions; you can customize the interval to suit your needs.
- User-Friendly Interface: Start immediately, no technical skills required, thanks to a clean and easy-to-navigate GUI.
- Simple API: Automate proxy-related tasks with ProxyTee’s simple API for a seamless experience when integrating proxies into your applications.
- Affordable Pricing: Compared to competitors, ProxyTee’s unlimited residential proxies can cost up to 50% less without compromising on quality.
Getting Started with lxml for Web Scraping
Before starting, you’ll need to install lxml, requests, and cssselect:
pip install lxml requests cssselect
These libraries enable you to parse HTML/XML, fetch web pages, and extract HTML elements using CSS selectors.
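To see how these pieces fit together before scraping a real site, here is a minimal sketch (the HTML snippet is hard-coded purely for illustration) that locates the same element with both XPath and a CSS selector:
from lxml import html

# A tiny hand-written snippet, just for illustration
snippet = '<div><p class="price_color">£19.99</p></div>'
tree = html.fromstring(snippet)

# The same element, located two different ways
print(tree.xpath('//p[@class="price_color"]/text()')[0])  # £19.99
print(tree.cssselect("p.price_color")[0].text_content())  # £19.99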
Scraping Static Content
Static content is embedded in the HTML document, making it easy to scrape. Here’s how to extract data from a website with static content, like the Books to Scrape website:
import requests
from lxml import html
import json

URL = "https://books.toscrape.com/"

# Fetch the page and parse its HTML
content = requests.get(URL).text
parsed = html.fromstring(content)

# Each book sits in an <article class="product_pod"> element
all_books = parsed.xpath('//article[@class="product_pod"]')

books = []
for book in all_books:
    # xpath() returns a list of matches, so take the first one
    book_title = book.xpath('.//h3/a/@title')[0]
    price = book.cssselect("p.price_color")[0].text_content()
    books.append({"title": book_title, "price": price})

with open("books.json", "w", encoding="utf-8") as file:
    json.dump(books, file)
This code fetches the HTML, parses it, locates book data using XPath and CSS selectors, and saves titles and prices to a books.json file.
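The same approach extends naturally to multiple pages. The sketch below assumes the site’s actual pagination markup, a li.next link on each page (adjust the selector if the markup differs), and follows it until the last page:
import requests
from lxml import html
from urllib.parse import urljoin

url = "https://books.toscrape.com/"
books = []

while url:
    page = html.fromstring(requests.get(url).text)
    for book in page.xpath('//article[@class="product_pod"]'):
        books.append({
            "title": book.xpath('.//h3/a/@title')[0],
            "price": book.cssselect("p.price_color")[0].text_content(),
        })
    # Follow the "next" link, resolving it against the current page's URL
    next_link = page.cssselect("li.next a")
    url = urljoin(url, next_link[0].get("href")) if next_link else None

print(f"Scraped {len(books)} books")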
Scraping Dynamic Content
Dynamic content is rendered with JavaScript, making scraping a bit more complex. We will use selenium for this; recent versions of Selenium (4.6+) ship with Selenium Manager, which downloads a matching browser driver automatically:
pip install selenium
Here is an example using the YouTube channel freeCodeCamp.org:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from lxml import html
from time import sleep
import json

URL = "https://www.youtube.com/@freecodecamp/videos"
videos = []

driver = webdriver.Chrome()
driver.get(URL)
sleep(3)

# Press End a few times so the infinite scroll loads more videos
parent = driver.find_element(By.TAG_NAME, 'html')
for i in range(4):
    parent.send_keys(Keys.END)
    sleep(3)

# Hand the fully rendered page source to lxml for parsing
html_data = html.fromstring(driver.page_source)
videos_html = html_data.cssselect("a#video-title-link")
for video in videos_html:
    title = video.text_content()
    link = "https://www.youtube.com" + video.get("href")
    videos.append({"title": title, "link": link})

with open('videos.json', 'w') as file:
    json.dump(videos, file)

driver.quit()
This script uses Selenium to load the page, simulates scrolling so more videos load, and then parses the rendered HTML with lxml to extract the data.
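A note on robustness: fixed sleep() calls are fragile, since too short a wait misses content and too long a wait wastes time. As a sketch of an alternative, Selenium's explicit waits block only until the element you care about actually appears, and Chrome can run headless on a server:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = Options()
options.add_argument("--headless=new")  # run Chrome without a visible window
driver = webdriver.Chrome(options=options)
driver.get("https://www.youtube.com/@freecodecamp/videos")

# Wait up to 15 seconds for the first video link instead of sleeping blindly
WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "a#video-title-link"))
)
print(len(driver.find_elements(By.CSS_SELECTOR, "a#video-title-link")))
driver.quit()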
Enhancing Scraping with ProxyTee
Websites often implement anti-scraping measures. ProxyTee helps you bypass these restrictions by providing rotating residential IPs. Here’s how to integrate ProxyTee into the previous static scraping script:
import requests
from lxml import html
import json

URL = "https://books.toscrape.com/"

# ProxyTee credentials
username = "YOUR_MYPROXY_USERNAME"
password = "YOUR_MYPROXY_PASSWORD"
hostname = "YOUR_MYPROXY_HOSTNAME"

# Talk to the proxy over plain HTTP; the "http"/"https" keys refer to the
# scheme of the target URL, not of the proxy itself
proxies = {
    "http": f"http://{username}:{password}@{hostname}",
    "https": f"http://{username}:{password}@{hostname}",
}

content = requests.get(URL, proxies=proxies).text
parsed = html.fromstring(content)

all_books = parsed.xpath('//article[@class="product_pod"]')
books = []
for book in all_books:
    book_title = book.xpath('.//h3/a/@title')[0]
    price = book.cssselect("p.price_color")[0].text_content()
    books.append({"title": book_title, "price": price})

with open("books.json", "w", encoding="utf-8") as file:
    json.dump(books, file)
Replace YOUR_MYPROXY_USERNAME, YOUR_MYPROXY_PASSWORD, and YOUR_MYPROXY_HOSTNAME with your actual ProxyTee credentials. This code directs requests through ProxyTee, enabling anonymous and secure web scraping.
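Before running a full scrape, it is worth confirming that traffic really flows through the proxy. One quick check, using the public IP echo service api.ipify.org (any similar service works), is to compare your visible IP with and without the proxy:
import requests

# Same placeholder credentials as above
username = "YOUR_MYPROXY_USERNAME"
password = "YOUR_MYPROXY_PASSWORD"
hostname = "YOUR_MYPROXY_HOSTNAME"

proxies = {
    "http": f"http://{username}:{password}@{hostname}",
    "https": f"http://{username}:{password}@{hostname}",
}

# The two addresses should differ if the proxy is in the request path
print("Direct IP: ", requests.get("https://api.ipify.org").text)
print("Proxied IP:", requests.get("https://api.ipify.org", proxies=proxies).text)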