Getting Started with Web Scraping Using Python and Beautiful Soup

Web scraping can be complex, but Python simplifies it with user-friendly libraries. One such library is Beautiful Soup, designed for parsing HTML and XML documents. This tutorial explores how to use Beautiful Soup to parse a sample HTML file: navigating HTML tags, extracting content, finding elements by ID, extracting text, and exporting data to CSV. Let’s dive in. And if you need reliable Residential Proxies, ProxyTee has you covered.
Understanding Data Parsing and Beautiful Soup
Data parsing means converting data into a format that is easy to read and analyze; a parser filters and combines information based on specific criteria, and ProxyTee helps keep this process seamless. Beautiful Soup is a Python package (compatible with Python 3.6 and above) that builds a parse tree from an HTML or XML document, making it easy to extract, navigate, and modify data, which is exactly what web scraping requires. This streamlines the data collection and parsing process.
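To make the idea of a parse tree concrete, here is a minimal sketch that parses a one-line HTML string (invented for illustration) and navigates it by tag name:
from bs4 import BeautifulSoup

# Parse a small HTML snippet into a tree of Tag objects
soup = BeautifulSoup("<html><body><p>Hello, proxies!</p></body></html>", "html.parser")

# Navigate the tree by tag name and read the text inside
print(soup.body.p.text)  # Hello, proxies!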
Installation and Setup
Before using Beautiful Soup, make sure Python and an IDE such as PyCharm are set up. When installing Python, check the ‘Add Python to PATH’ option so your operating system can find the pip and python commands. Then install Beautiful Soup by running this command in the terminal:
pip install beautifulsoup4
This tutorial works with a basic sample HTML file, so make sure you’re familiar with its structure before parsing it.
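To confirm the installation worked, you can import the package and print its version (a quick sanity check, assuming a standard install):
python -c "import bs4; print(bs4.__version__)"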
A Sample HTML Structure
Here’s a simple HTML document we will use:
<!DOCTYPE html>
<html>
<head>
<title>What is a Proxy?</title>
<meta charset="utf-8">
</head>
<body>
<h2>Proxy types</h2>
<p>
There are many different ways to categorize proxies.
However, two of the most popular types are residential and data center proxies. Here is a list of the most common types.
</p>
<ul id="proxytypes">
<li>Residential proxies</li>
<li>Datacenter proxies</li>
<li>Shared proxies</li>
<li>Semi-dedicated proxies</li>
<li>Private proxies</li>
</ul>
</body>
</html>
Copy this into a text editor and save it as ‘index.html’, or create a new HTML file in your IDE and paste it in. Now you’re all set to start practicing.
Parsing HTML with Beautiful Soup
1️⃣ Finding HTML Tags
Use soup.descendants to iterate over every element in the document and pull out the tag names. The code below does the following:
- Imports the Beautiful Soup Library.
- Opens and reads the HTML document.
- Creates a Beautiful Soup object to parse the document.
- Iterates through the elements and prints the tag names.
from bs4 import BeautifulSoup

with open('index.html', 'r') as f:
    contents = f.read()

soup = BeautifulSoup(contents, "html.parser")

for child in soup.descendants:
    if child.name:
        print(child.name)
This script will output:
html
head
title
meta
body
h2
p
ul
li
li
li
li
li
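If you only need a tag’s direct children rather than every nested element, Beautiful Soup also provides .children; here is a small sketch using the same soup object:
# .children yields only direct children (here: head and body)
for child in soup.html.children:
    if child.name:
        print(child.name)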
2️⃣ Extracting Full Content From HTML Tags
You can directly extract tag contents using the tag name:
from bs4 import BeautifulSoup

with open('index.html', 'r') as f:
    contents = f.read()

soup = BeautifulSoup(contents, "html.parser")

print(soup.h2)
print(soup.p)
print(soup.li)
This outputs each complete tag along with its content:
<h2>Proxy types</h2>
<p>
There are many different ways to categorize proxies.
However, two of the most popular types are residential and data center proxies.
Here is a list of the most common types.
</p>
<li>Residential proxies</li>
To get only the text, add .text:
print(soup.li.text)
Result:
Residential proxies
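Note that .text preserves the document’s whitespace, which is noticeable for the multi-line <p> tag. If you prefer trimmed output, get_text accepts a strip argument (a small aside, using the same soup object):
# Strip leading and trailing whitespace from the extracted text
print(soup.p.get_text(strip=True))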
3️⃣ Finding Elements by ID
Find HTML elements using their IDs:
print(soup.find('ul', attrs={'id': 'proxytypes'}))
print(soup.find('ul', id='proxytypes'))
Both of the above will yield the same result in your console:
<ul id="proxytypes">
<li>Residential proxies</li>
<li>Datacenter proxies</li>
<li>Shared proxies</li>
<li>Semi-dedicated proxies</li>
<li>Private proxies</li>
</ul>
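A handy follow-up: searching on a found element is scoped to that element rather than the whole document, so you can find the list by ID and then pull out only its items (using the find_all method covered in the next step). A minimal sketch:
proxy_list = soup.find('ul', id='proxytypes')

# find_all on a Tag searches only within that tag
for item in proxy_list.find_all('li'):
    print(item.text)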
4️⃣ Finding All Instances of a Tag
To get all instances of a specific tag, use the find_all method:
for tag in soup.find_all('li'):
    print(tag.text)
The above will return all list items from the HTML:
Residential proxies
Datacenter proxies
Shared proxies
Semi-dedicated proxies
Private proxies
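find_all also accepts filters beyond a single tag name, such as a list of names or a cap on the number of matches; a short sketch of both options, using the same soup object:
# Match several tag names at once
print([tag.name for tag in soup.find_all(['h2', 'li'])])

# Stop after the first two matches
for tag in soup.find_all('li', limit=2):
    print(tag.text)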
5️⃣ Parsing by CSS Selectors
Beautiful Soup supports CSS selectors via the ‘select’ and ‘select_one’ methods:
select_one: returns the first match. To get the first list item, use:
print(soup.select_one('body ul li'))
select: returns a list of all matches. To select the title, you can use:
print(soup.select('html head title'))
To extract the third list item, use the nth-of-type selector:
print(soup.select_one('body ul li:nth-of-type(3)'))
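Because select returns a list of every match, you can iterate over the results directly; a brief sketch using the same document:
# select returns all matching elements as a list
for item in soup.select('ul#proxytypes li'):
    print(item.text)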
Handling Dynamic Content
For content loaded dynamically with JavaScript, requests and Beautiful Soup alone aren’t enough. Use a browser automation library such as Selenium to interact with the DOM:
- Install Selenium:
pip install selenium
- Here’s the basic script:
from selenium import webdriver
from bs4 import BeautifulSoup

# Launch a Chrome browser (requires Chrome and a compatible driver)
driver = webdriver.Chrome()
driver.get("http://quotes.toscrape.com/js/")

# Grab the rendered HTML after JavaScript has run, then close the browser
js_content = driver.page_source
driver.quit()

soup = BeautifulSoup(js_content, "html.parser")
quote = soup.find("span", class_="text")
print(quote.text)
This fetches the page from a dummy quotes website and then selects the first quote using Beautiful Soup.
To bypass bot detection, consider using ProxyTee’s rotating residential proxies to mask your IP address.
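As a minimal sketch of how that could look, Chrome can be routed through a proxy with a command-line switch; the address below is a placeholder, not a real endpoint:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
# Placeholder address; substitute your actual proxy endpoint
options.add_argument('--proxy-server=http://proxy.example.com:8080')
driver = webdriver.Chrome(options=options)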
Exporting to a CSV File
Exporting to a CSV file is crucial for further analysis. Use pandas to do this:
- Install pandas using pip:
pip install pandas
Here is how to create a CSV file for the proxy list in the sample HTML used here:
from bs4 import BeautifulSoup
import pandas as pd

with open('index.html', 'r') as f:
    contents = f.read()

soup = BeautifulSoup(contents, "html.parser")

# Extract the text of each list item so the CSV holds names, not raw tags
results = [tag.text for tag in soup.find_all('li')]
df = pd.DataFrame({'Names': results})
df.to_csv('names.csv', index=False, encoding='utf-8')
This creates a file named `names.csv` containing your extracted proxy list.
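To double-check the export, you can read the file back with pandas (a quick verification step, not part of the original walkthrough):
import pandas as pd

# Load the CSV we just wrote and display its contents
print(pd.read_csv('names.csv'))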