How to Scrape Wikipedia Using Python and ProxyTee

Wikipedia is a vast and comprehensive resource, packed with millions of articles covering nearly every topic imaginable. For researchers, data scientists, and developers, this wealth of information opens doors to countless opportunities, from crafting machine learning datasets to conducting detailed academic research. In this post, we’ll explore how to scrape Wikipedia effectively, and how using ProxyTee can significantly enhance this process, making data extraction more efficient and reliable.

Enhancing Web Scraping with ProxyTee

Before diving into scraping techniques, let’s quickly introduce ProxyTee, your partner for web scraping. ProxyTee is a leading provider of rotating residential proxies, built for online tasks that demand anonymity and reliable IP rotation. Known for its affordability and efficiency, it offers unlimited bandwidth, a large pool of IP addresses, and tools designed for seamless integration, helping you run extensive scraping jobs without hitting blocks or restrictions. Here are the key features that make ProxyTee well suited to web scraping, followed by a short code sketch showing how such a proxy plugs into a Python workflow:

  1. Unlimited Bandwidth: With unlimited bandwidth, you can scrape as much data as needed without worrying about overage charges, perfect for data-heavy tasks. This feature ensures uninterrupted data flow even during peak demand, critical for large-scale data extractions.
  2. Global IP Coverage: Access over 20 million IP addresses in more than 100 countries. Global coverage allows you to perform location-specific tasks effectively.
  3. Multiple Protocol Support: Compatible with both HTTP and SOCKS5, multiple protocol support ensures smooth operation across diverse platforms and tools, offering the adaptability you need to handle a variety of scraping jobs, whether complex or straightforward.
  4. User-Friendly Interface: Get started quickly with an intuitive user interface that simplifies your setup process, making it easy even for users with minimal technical expertise. ProxyTee's clean GUI is designed to let you focus on data gathering rather than managing complicated software.
  5. Auto Rotation: Keep your scraping activity undetectable with automatic IP rotation that adjusts every 3 to 60 minutes. The auto-rotation function is customizable to meet varying scraping demands, making it invaluable for preventing website blocks and ensuring consistent access to target data.
  6. API Integration: Integrate ProxyTee into any workflow effortlessly through a comprehensive API, which is crucial for businesses or developers who automate web scraping tasks. The simple API makes it possible to streamline processes, enabling efficient handling of multiple projects.
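
To make this concrete, here is a minimal sketch of how a rotating residential proxy plugs into a Python workflow with the requests library. The gateway host, port, and credentials below are placeholders rather than real ProxyTee endpoints, so substitute the values from your own dashboard; the SOCKS5 variant additionally requires installing requests[socks].

import requests

# Placeholder gateway and credentials -- not real endpoints; replace them
# with the values from your own proxy dashboard.
HTTP_PROXY = "http://username:password@proxy-gateway.example.com:10000"
SOCKS5_PROXY = "socks5h://username:password@proxy-gateway.example.com:10001"

# Route both HTTP and HTTPS traffic through the HTTP proxy endpoint.
proxies = {"http": HTTP_PROXY, "https": HTTP_PROXY}

# Swap in the SOCKS5 endpoint if your tooling prefers SOCKS
# (requires the PySocks extra: pip install "requests[socks]").
# proxies = {"http": SOCKS5_PROXY, "https": SOCKS5_PROXY}

response = requests.get("https://en.wikipedia.org/wiki/Web_scraping", proxies=proxies, timeout=30)
print(response.status_code)

With auto rotation enabled on the provider side, repeated requests exit through different IP addresses without any change to this code.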

Scraping Wikipedia with Python

Let’s walk through the process of scraping Wikipedia using Python, combined with the power of ProxyTee.

1. Setup and Prerequisites

Before starting, make sure you have the following in place:

  • Install Python: Get the latest Python from the official Python website.
  • Choose an IDE: Use an IDE like PyCharm, Visual Studio Code, or Jupyter Notebook to write code.
  • Basic Knowledge: Get familiar with CSS selectors and how to inspect page elements using browser DevTools.

Let's create a project with Poetry, a tool that simplifies package management in Python:

poetry new wikipedia-scraper
cd wikipedia-scraper
poetry add requests beautifulsoup4 pandas lxml
poetry shell
code .

Confirm your project’s dependencies in pyproject.toml:

[tool.poetry.dependencies]
python = "^3.12"
requests = "^2.32.3"
beautifulsoup4 = "^4.12.3"
pandas = "^2.2.3"
lxml = "^5.3.0"

Finally, create a main.py file inside the wikipedia_scraper package for your scraping logic.

2. Connecting to the Target Wikipedia Page

Here’s code to connect to a Wikipedia page:

import requests
from bs4 import BeautifulSoup

def connect_to_wikipedia(url):
    response = requests.get(url)
    if response.status_code == 200:
        return BeautifulSoup(response.text, "html.parser")
    else:
        print(f"Failed to retrieve the page. Status code: {response.status_code}")
        return None

wikipedia_url = "https://en.wikipedia.org/wiki/Cristiano_Ronaldo"
soup = connect_to_wikipedia(wikipedia_url)
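
Wikipedia generally serves plain GET requests without issue, but Wikimedia’s guidelines ask automated clients to identify themselves. As an optional addition (the User-Agent string below is only an illustrative placeholder), you can send a descriptive header with the request:

# Illustrative placeholder -- use a string that identifies your own project.
headers = {"User-Agent": "wikipedia-scraper-tutorial/0.1 (contact: you@example.com)"}

response = requests.get(wikipedia_url, headers=headers)
soup = BeautifulSoup(response.text, "html.parser")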

3. Inspecting the Page

Understanding the page structure is vital for effective scraping. Inspecting the DOM will help you target the following:

  • Links: Target <a> tags to collect all linked URLs.
  • Images: Target <img> tags to extract src attributes.
  • Tables: Look for <table> tags with the class wikitable.
  • Paragraphs: Locate <p> tags for extracting main text.

4. Extracting Links

The function below extracts all the links from a Wikipedia page, converting relative URLs into absolute ones:

def extract_links(soup):
    links = []
    for link in soup.find_all("a", href=True):
        url = link["href"]
        if not url.startswith("http"):
            url = "https://en.wikipedia.org" + url
        links.append(url)
    return links

5. Extracting Paragraphs

To extract all the text within paragraphs, use this function:

def extract_paragraphs(soup):
    paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]
    return [p for p in paragraphs if p and len(p) > 10]

6. Extracting Tables

To grab all the tables on the page:

import pandas as pd
from io import StringIO

def extract_tables(soup):
    tables = []
    for table in soup.find_all("table", {"class": "wikitable"}):
        table_html = StringIO(str(table))
        df = pd.read_html(table_html)[0]
        tables.append(df)
    return tables

7. Extracting Images

Here’s how you can grab image URLs from the page, skipping Wikipedia’s static interface images:

def extract_images(soup):
    images = []
    for img in soup.find_all("img", src=True):
        img_url = img["src"]
        if not img_url.startswith("http"):
            img_url = "https:" + img_url
        if "static/images" not in img_url:
            images.append(img_url)
    return images

8. Saving the Scraped Data

To save the extracted data:

import json

def store_data(links, images, tables, paragraphs):
    with open("wikipedia_links.txt", "w", encoding="utf-8") as f:
        for link in links:
            f.write(f"{link}\n")
    with open("wikipedia_images.json", "w", encoding="utf-8") as f:
        json.dump(images, f, indent=4)
    with open("wikipedia_paragraphs.txt", "w", encoding="utf-8") as f:
        for para in paragraphs:
            f.write(f"{para}\n\n")
    for i, table in enumerate(tables):
        table.to_csv(f"wikipedia_table_{i+1}.csv", index=False, encoding="utf-8-sig")

Putting It All Together

Now combine everything:

import requests
from bs4 import BeautifulSoup
import pandas as pd
from io import StringIO
import json

def extract_links(soup):
    links = []
    for link in soup.find_all("a", href=True):
        url = link["href"]
        if not url.startswith("http"):
            url = "https://en.wikipedia.org" + url
        links.append(url)
    return links

def extract_images(soup):
    images = []
    for img in soup.find_all("img", src=True):
        img_url = img["src"]
        if not img_url.startswith("http"):
            img_url = "https:" + img_url
        if "static/images" not in img_url:
            images.append(img_url)
    return images

def extract_tables(soup):
    tables = []
    for table in soup.find_all("table", {"class": "wikitable"}):
        table_html = StringIO(str(table))
        df = pd.read_html(table_html)[0]
        tables.append(df)
    return tables

def extract_paragraphs(soup):
    paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]
    return [p for p in paragraphs if p and len(p) > 10]

def store_data(links, images, tables, paragraphs):
    with open("wikipedia_links.txt", "w", encoding="utf-8") as f:
        for link in links:
            f.write(f"{link}\n")
    with open("wikipedia_images.json", "w", encoding="utf-8") as f:
        json.dump(images, f, indent=4)
    with open("wikipedia_paragraphs.txt", "w", encoding="utf-8") as f:
        for para in paragraphs:
            f.write(f"{para}\n\n")
    for i, table in enumerate(tables):
        table.to_csv(f"wikipedia_table_{i+1}.csv", index=False, encoding="utf-8-sig")

def scrape_wikipedia(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    links = extract_links(soup)
    images = extract_images(soup)
    tables = extract_tables(soup)
    paragraphs = extract_paragraphs(soup)
    store_data(links, images, tables, paragraphs)

if __name__ == "__main__":
    scrape_wikipedia("https://en.wikipedia.org/wiki/Cristiano_Ronaldo")

Run the script, and you will find:

  • wikipedia_images.json: containing all image URLs
  • wikipedia_links.txt: containing all extracted links
  • wikipedia_paragraphs.txt: containing the extracted paragraphs
  • A CSV file for each extracted table.

Leveraging ProxyTee for Enhanced Scraping

While the code above shows how to scrape data from Wikipedia, consider leveraging ProxyTee for scalable and efficient data collection. With ProxyTee’s unlimited-bandwidth residential proxies, you can avoid the IP bans and geographical blocks that often hinder scraping projects. By integrating ProxyTee, this code can be enhanced to route all requests through a proxy and automate IP rotation, significantly improving success rates for large data extractions. Check ProxyTee’s pricing to see which plan best matches your web scraping goals.
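
As a rough sketch of that integration, the entry point can be adapted to send every request through a proxy gateway. The gateway address and credentials below are placeholders, not actual ProxyTee endpoints, and the function reuses the extraction and storage helpers defined earlier:

import requests
from bs4 import BeautifulSoup

# Placeholder gateway and credentials -- replace with the values from
# your own proxy dashboard.
PROXY_URL = "http://username:password@proxy-gateway.example.com:10000"
PROXIES = {"http": PROXY_URL, "https": PROXY_URL}

def scrape_wikipedia_via_proxy(url):
    # Same flow as scrape_wikipedia, but the request exits through the
    # rotating proxy, so repeated runs come from different IP addresses.
    response = requests.get(url, proxies=PROXIES, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    store_data(
        extract_links(soup),
        extract_images(soup),
        extract_tables(soup),
        extract_paragraphs(soup),
    )

if __name__ == "__main__":
    scrape_wikipedia_via_proxy("https://en.wikipedia.org/wiki/Cristiano_Ronaldo")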

Conclusion

This tutorial has shown you how to scrape Wikipedia with Python. By combining Beautiful Soup parsing, structured data extraction, and tools like ProxyTee, you can build robust, reliable scrapers and significantly enhance your data collection capabilities.

Ready to level up your web scraping skills? Explore ProxyTee's residential proxy solutions and experience seamless, efficient, and scalable web data gathering.