Getting Started with Web Scraping Using Python and Beautiful Soup

Web scraping can be complex, but Python simplifies it with user-friendly libraries. One such library is Beautiful Soup, designed for parsing HTML and XML documents. This tutorial explores how to use Beautiful Soup to parse a sample HTML file: navigating HTML tags, extracting content, finding elements by ID, extracting text, and exporting data to CSV. Let’s dive in. And if you need reliable Residential Proxies, ProxyTee has you covered.
Understanding Data Parsing and Beautiful Soup
Data parsing means converting data into a format that is easy to read and analyze; a parser filters and combines information based on specific criteria, and ProxyTee helps keep this process seamless. Beautiful Soup is a Python package (compatible with Python 3.6 and above) that builds a parse tree from an HTML or XML document, making it easy to extract, navigate, and modify data, which is exactly what web scraping requires. This streamlines the data collection and parsing process.
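To make the idea of a parse tree concrete, here is a minimal sketch that parses a one-line HTML string (invented for illustration) and navigates it by tag name:
from bs4 import BeautifulSoup

# Parse a small HTML snippet into a tree of Tag objects
soup = BeautifulSoup("<html><body><p>Hello, proxies!</p></body></html>", "html.parser")

# Navigate the tree by tag name and read the text inside
print(soup.body.p.text)  # Hello, proxies!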
Installation and Setup
Before using Beautiful Soup, make sure Python and an IDE such as PyCharm are set up. When installing Python, check the ‘Add Python to PATH’ option so your operating system can find the pip and python commands. Then install Beautiful Soup by running this command in the terminal:
pip install beautifulsoup4
This tutorial works with a basic sample HTML file, so make sure you’re familiar with its structure before parsing it.
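To confirm the installation worked, you can import the package and print its version (a quick sanity check, assuming a standard install):
python -c "import bs4; print(bs4.__version__)"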
A Sample HTML Structure
Here’s a simple HTML document we will use:
<!DOCTYPE html>
<html>
<head>
<title>What is a Proxy?</title>
<meta charset="utf-8">
</head>
<body>
<h2>Proxy types</h2>
<p>
There are many different ways to categorize proxies.
However, two of the most popular types are residential and data center proxies. Here is a list of the most common types.
</p>
<ul id="proxytypes">
<li>Residential proxies</li>
<li>Datacenter proxies</li>
<li>Shared proxies</li>
<li>Semi-dedicated proxies</li>
<li>Private proxies</li>
</ul>
</body>
</html>
Copy this into a text editor and save it as ‘index.html’, or create a new HTML file in your IDE and paste it in. Now you’re all set to start practicing.
Parsing HTML with Beautiful Soup
1️⃣ Finding HTML Tags
Use soup.descendants to iterate over every element in the document and pull out the tag names. The code below does the following:
- Imports the Beautiful Soup Library.
- Opens and reads the HTML document.
- Creates a Beautiful Soup object to parse the document.
- Iterates through the elements and prints the tag names.
from bs4 import BeautifulSoup

with open('index.html', 'r') as f:
    contents = f.read()

soup = BeautifulSoup(contents, "html.parser")

for child in soup.descendants:
    if child.name:
        print(child.name)
This script will output:
html
head
title
meta
body
h2
p
ul
li
li
li
li
li
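If you only need a tag’s direct children rather than every nested element, Beautiful Soup also provides .children; here is a small sketch using the same soup object:
# .children yields only direct children (here: head and body)
for child in soup.html.children:
    if child.name:
        print(child.name)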
2️⃣ Extracting Full Content From HTML Tags
You can directly extract tag contents using the tag name:
from bs4 import BeautifulSoup

with open('index.html', 'r') as f:
    contents = f.read()

soup = BeautifulSoup(contents, "html.parser")

print(soup.h2)
print(soup.p)
print(soup.li)
This outputs each complete tag along with its content:
<h2>Proxy types</h2>
<p>
There are many different ways to categorize proxies.
However, two of the most popular types are residential and data center proxies.
Here is a list of the most common types.
</p>
<li>Residential proxies</li>
To get only the text, add .text:
print(soup.li.text)
Result:
Residential proxies
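Note that .text preserves the document’s whitespace, which is noticeable for the multi-line <p> tag. If you prefer trimmed output, get_text accepts a strip argument (a small aside, using the same soup object):
# Strip leading and trailing whitespace from the extracted text
print(soup.p.get_text(strip=True))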
3️⃣ Finding Elements by ID
Find HTML elements using their IDs:
print(soup.find('ul', attrs={'id': 'proxytypes'}))
print(soup.find('ul', id='proxytypes'))
Both of the above will yield the same result in your console:
<ul id="proxytypes">
<li>Residential proxies</li>
<li>Datacenter proxies</li>
<li>Shared proxies</li>
<li>Semi-dedicated proxies</li>
<li>Private proxies</li>
</ul>
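A handy follow-up: searching on a found element is scoped to that element rather than the whole document, so you can find the list by ID and then pull out only its items (using the find_all method covered in the next step). A minimal sketch:
proxy_list = soup.find('ul', id='proxytypes')

# find_all on a Tag searches only within that tag
for item in proxy_list.find_all('li'):
    print(item.text)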
4️⃣ Finding All Instances of a Tag
To get all instances of a specific tag, use the find_all method:
for tag in soup.find_all('li'):
    print(tag.text)
The above will return all list items from the HTML:
Residential proxies
Datacenter proxies
Shared proxies
Semi-dedicated proxies
Private proxies
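find_all also accepts filters beyond a single tag name, such as a list of names or a cap on the number of matches; a short sketch of both options, using the same soup object:
# Match several tag names at once
print([tag.name for tag in soup.find_all(['h2', 'li'])])

# Stop after the first two matches
for tag in soup.find_all('li', limit=2):
    print(tag.text)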
5️⃣ Parsing by CSS Selectors
Beautiful Soup supports CSS selectors via the ‘select’ and ‘select_one’ methods:
select_one: returns the first match. To get the first list item, use:
print(soup.select_one('body ul li'))
select: returns a list of all matches. To select the title, you can use:
print(soup.select('html head title'))
To extract the third list item, use the nth-of-type selector:
print(soup.select_one('body ul li:nth-of-type(3)'))
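Because select returns a list of every match, you can iterate over the results directly; a brief sketch using the same document:
# select returns all matching elements as a list
for item in soup.select('ul#proxytypes li'):
    print(item.text)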
Handling Dynamic Content
For content loaded dynamically with JavaScript, requests and Beautiful Soup alone aren’t enough. Use a browser automation library such as Selenium to interact with the DOM:
- Install Selenium:
pip install selenium
- Here’s the basic script:
from selenium import webdriver
from bs4 import BeautifulSoup

# Launch a Chrome browser (requires Chrome and a compatible driver)
driver = webdriver.Chrome()
driver.get("http://quotes.toscrape.com/js/")

# Grab the rendered HTML after JavaScript has run, then close the browser
js_content = driver.page_source
driver.quit()

soup = BeautifulSoup(js_content, "html.parser")
quote = soup.find("span", class_="text")
print(quote.text)
This fetches the page from a dummy quotes website and then selects the first quote using Beautiful Soup.
To bypass bot detection, consider using ProxyTee’s rotating residential proxies to mask your IP address.
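As a minimal sketch of how that could look, Chrome can be routed through a proxy with a command-line switch; the address below is a placeholder, not a real endpoint:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
# Placeholder address; substitute your actual proxy endpoint
options.add_argument('--proxy-server=http://proxy.example.com:8080')
driver = webdriver.Chrome(options=options)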
Exporting to a CSV File
Exporting to a CSV file is crucial for further analysis. Use pandas to do this:
- Install pandas using pip:
pip install pandas
Here is how to create a CSV file for the proxy list in the sample HTML used here:
from bs4 import BeautifulSoup
import pandas as pd

with open('index.html', 'r') as f:
    contents = f.read()

soup = BeautifulSoup(contents, "html.parser")

# Extract the text of each list item so the CSV holds names, not raw tags
results = [tag.text for tag in soup.find_all('li')]
df = pd.DataFrame({'Names': results})
df.to_csv('names.csv', index=False, encoding='utf-8')
This creates a file named `names.csv` containing your extracted proxy list.
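To double-check the export, you can read the file back with pandas (a quick verification step, not part of the original walkthrough):
import pandas as pd

# Load the CSV we just wrote and display its contents
print(pd.read_csv('names.csv'))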