Web Scraping with AutoScraper: A Comprehensive Guide with ProxyTee

Web scraping is an essential technique for extracting data from websites, and ProxyTee provides the right tools to enhance this process. AutoScraper is a beginner-friendly Python library that simplifies web scraping by automatically identifying and extracting data from websites without manual HTML inspection. Unlike traditional tools, AutoScraper learns the structure of data elements from example values, making it an ideal choice for both beginners and experienced developers. Ideal for tasks like collecting product information, aggregating content, or conducting market research, AutoScraper handles structured websites with minimal setup. This article shows how to use AutoScraper and how ProxyTee can improve your web scraping results.
ProxyTee Overview
ProxyTee is a leading provider of rotating residential proxies designed to support various internet activities, including web scraping, streaming, and other tasks requiring anonymity and IP rotation. Known for its affordability and efficiency, ProxyTee offers solutions with unlimited bandwidth, a vast pool of IP addresses, and easy-to-integrate tools.
ProxyTee provides an affordable, reliable, and user-friendly solution for anyone needing rotating residential proxies. Features such as unlimited bandwidth, a global IP pool, protocol flexibility, auto-rotation, and API integration make it a great option for businesses and individuals involved in web scraping, streaming, or data gathering. With a focus on user-friendly design and competitive pricing, ProxyTee delivers strong value for those seeking effective proxy services.
Key Features of ProxyTee
- Unlimited Bandwidth: ProxyTee offers proxies with unlimited bandwidth, which is ideal for data-intensive tasks like web scraping or streaming. This feature eliminates concerns about data overages, ensuring that you can conduct your operations without worrying about extra costs.
- Global IP Coverage: With over 20 million IP addresses from more than 100 countries, ProxyTee ensures you can target specific regions or perform location-based tasks. This global coverage is especially useful for businesses needing access to varied geographical content.
- Multiple Protocol Support: ProxyTee supports both HTTP and SOCKS5 protocols, ensuring compatibility with a wide range of applications. This flexibility makes tasks such as web scraping and bypassing geo-blocks easier to manage (a minimal usage sketch follows this list). You can find more info on our multiple proxy protocols page.
- User-Friendly Interface: ProxyTee prioritizes simplicity with a clean, intuitive GUI. This easy-to-use platform allows you to get started quickly without requiring significant setup time or technical skills.
- Auto Rotation: The auto-rotation feature automatically changes IP addresses at intervals from 3 to 60 minutes. This is critical for web scraping, as it helps avoid detection and bans from target websites. The rotation interval can be customized to suit your requirements.
- API Integration: ProxyTee offers a simple API, enabling seamless integration with various applications and workflows. This API supports all service features, which makes it a great choice for developers and businesses seeking to automate their proxy-related tasks.
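As a quick, non-authoritative sketch of how the HTTP/SOCKS5 support above is typically used, the snippet below routes a plain Python requests call through a rotating proxy gateway. The hostname, port, and credentials are placeholders rather than real ProxyTee endpoint values; substitute whatever your ProxyTee dashboard provides.
import requests

# Placeholder gateway details: replace host, port, username, and password
# with the values shown in your ProxyTee dashboard.
proxy_url = "http://username:[email protected]:12345"

proxies = {
    "http": proxy_url,
    "https": proxy_url,
}

# Every request is sent through the proxy; with auto-rotation enabled,
# the exit IP changes on the schedule you configure.
response = requests.get("https://httpbin.org/ip", proxies=proxies, timeout=30)
print(response.json())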
Prerequisites for Web Scraping
Before diving into the code, make sure that you have the correct software set up.
First, you need a recent version of Python 3 installed locally.
Like any other Python web scraping project, you need to run a few commands to create a project directory and activate a virtual environment.
# Set up project directory
mkdir auto-scrape
cd auto-scrape
# Create virtual environment
python -m venv env
# For Mac & Linux users
source env/bin/activate
# For Windows users
env\Scripts\activate
Using a virtual environment simplifies dependency management for the project.
Next, install the autoscraper library by running the following command:
pip install autoscraper
You will also need to install pandas to save the scraping results to a CSV file. pandas is a Python library for easy-to-use data analysis and manipulation; it lets you process the results and save them in CSV, XLSX, or JSON format. Run the following command:
pip install pandas
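To confirm both packages installed correctly, you can run a quick import check from the command line (an optional sanity check, not part of the original setup):
python -c "import autoscraper, pandas; print('autoscraper and pandas imported successfully')"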
Select a Target Website
When scraping websites, make sure you check the site's Terms of Service (ToS) or robots.txt file to confirm that the site allows web scraping; this helps you avoid ethical or legal issues. Choose websites that provide data in a structured format (such as tables or lists), which makes extraction easier.
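If you want to automate that robots.txt check, Python's standard library ships a parser for it. Here is a minimal sketch, using the Scrape This Site domain that this tutorial targets later:
from urllib.robotparser import RobotFileParser

# Point the parser at the site's robots.txt file and download it
parser = RobotFileParser("https://www.scrapethissite.com/robots.txt")
parser.read()

# can_fetch returns True if the rules allow this user agent to fetch the URL
allowed = parser.can_fetch("*", "https://www.scrapethissite.com/pages/simple/")
print("Scraping allowed:", allowed)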
Traditional tools require a thorough analysis of the HTML structure to find the target data elements, which takes significant time and familiarity with the browser's developer console. AutoScraper automates this by learning the structure from example data points (known as the wanted_list), eliminating the need for manual analysis.
In this tutorial, you will start by scraping Scrape This Site's Countries of the World: A Simple Example page, which is beginner friendly, and then move on to the more complex Hockey Teams: Forms, Searching, and Pagination page. This progression will let you master the basic techniques first.
Scrape Simple Data with AutoScraper
Now let’s scrape!
The Countries of the World page is straightforward; the following script is used to scrape a list of countries, along with their capital, population, and area:
# 1. Import dependencies
from autoscraper import AutoScraper
import pandas as pd
# 2. Define the URL of the site to be scraped
url = "https://www.scrapethissite.com/pages/simple/"
# 3. Instantiate the AutoScraper
scraper = AutoScraper()
# 4. Define the wanted list by using an example from the web page
# This list should contain some text or values that you want to scrape
wanted_list = ["Andorra", "Andorra la Vella", "84000", "468.0"]
# 5. Build the scraper based on the wanted list and URL
scraper.build(url, wanted_list)
# 6. Get the results for all the elements matched
results = scraper.get_result_similar(url, grouped=True)
# 7. Display the keys and sample data to understand the structure
print("Keys found by the scraper:", results.keys())
# 8. Assign columns based on scraper keys and expected order of data
columns = ["Country Name", "Capital", "Area (sq km)", "Population"]
# 9. Create a DataFrame with the extracted data
data = {columns[i]: results[list(results.keys())[i]] for i in range(len(columns))}
df = pd.DataFrame(data)
# 10. Save the DataFrame to a CSV file
csv_filename = 'countries_data.csv'
df.to_csv(csv_filename, index=False)
print(f"Data has been successfully saved to {csv_filename}")
This code imports AutoScraper and pandas. The URL of the target website is defined and an instance of the scraper is created.
Here is the interesting part: instead of giving the scraper instructions about where the data lives on the page (as you would with XPath), you provide an example of the data you want, known as the wanted_list. With this list, you build the scraper by passing in the URL and the wanted_list. The scraper downloads the website and generates rules, stored internally in a stack list, that it can reuse to extract similar data in the future.
You then call the get_result_similar method on the scraper to extract all data matching those rules, and print out the rule IDs it found. The code under the 8th and 9th comments creates a header schema for your CSV file and formats the extracted data, and step 10 saves the data to a CSV file.
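As a side note, AutoScraper also provides a get_result_exact method, which returns only the elements matching the learned rules exactly instead of every similar element on the page. A minimal sketch, reusing the scraper and url defined above:
# Returns only exact matches for the learned rules, rather than
# every similar element found on the page
exact_results = scraper.get_result_exact(url)
print(exact_results)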
After running this script (saving it to a file and running it from the command line), you'll find a file called countries_data.csv that contains data like this:
Country Name,Capital,Population,Area (sq km)
Andorra,Andorra la Vella,84000,468.0
United Arab Emirates,Abu Dhabi,4975593,82880.0
...246 collapsed rows
Zambia,Lusaka,13460305,752614.0
Zimbabwe,Harare,11651858,390580.0
That's how simple it is to scrape websites with AutoScraper; the sections below show where ProxyTee's proxies come in.
Process and Extract Data From Websites With a Complex Design
For more complex sites, such as the Hockey Teams page, which contains a table of similar data, you need to be more precise to extract what you need. The technique demonstrated above may not work.
AutoScraper allows for fine model training by pruning collected rules during the build step. Below is the code to achieve this.
from autoscraper import AutoScraper
import pandas as pd
# Define the URL of the site to be scraped
url = "https://www.scrapethissite.com/pages/forms/"
def setup_model():
    # Instantiate the AutoScraper
    scraper = AutoScraper()
    # Define the wanted list by using an example from the web page
    # This list should contain some text or values that you want to scrape
    wanted_list = ["Boston Bruins", "1990", "44", "24", "0.55", "299", "264", "35"]
    # Build the scraper based on the wanted list and URL
    scraper.build(url, wanted_list)
    # Get the results for all the elements matched
    results = scraper.get_result_similar(url, grouped=True)
    # Display the data to understand the structure
    print(results)
    # Save the model
    scraper.save("teams_model.json")

def prune_rules():
    # Create an instance of AutoScraper
    scraper = AutoScraper()
    # Load the model saved earlier
    scraper.load("teams_model.json")
    # Update the model to only keep necessary rules
    scraper.keep_rules(['rule_hjk5', 'rule_9sty', 'rule_2hml', 'rule_3qvv', 'rule_e8x1', 'rule_mhl4', 'rule_h090', 'rule_xg34'])
    # Save the updated model again
    scraper.save("teams_model.json")

def load_and_run_model():
    # Create an instance of AutoScraper
    scraper = AutoScraper()
    # Load the model saved earlier
    scraper.load("teams_model.json")
    # Get the results for all the elements matched
    results = scraper.get_result_similar(url, grouped=True)
    # Assign columns based on scraper keys and expected order of data
    columns = ["Team Name", "Year", "Wins", "Losses", "Win %", "Goals For (GF)", "Goals Against (GA)", "+/-"]
    # Create a DataFrame with the extracted data
    data = {columns[i]: results[list(results.keys())[i]] for i in range(len(columns))}
    df = pd.DataFrame(data)
    # Save the DataFrame to a CSV file
    csv_filename = 'teams_data.csv'
    df.to_csv(csv_filename, index=False)
    print(f"Data has been successfully saved to {csv_filename}")

# setup_model()
# prune_rules()
# load_and_run_model()
This script defines three functions: setup_model, prune_rules, and load_and_run_model. setup_model is very similar to the previous example.
To run this, uncomment the setup_model() call and run the script with python script.py. The output shows everything the scraper collected from the target website. Because the page contains many similar-looking numbers, AutoScraper generates a large set of rules, which results in duplicated data. You must analyze this output and select the rules that extract exactly the data you need.
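To make that analysis easier, you can print a short preview of what each rule returned instead of reading through the full dictionary. This is a small sketch that assumes the grouped results dictionary produced inside setup_model:
# Print each rule ID together with the first few values it extracted,
# so you can see which rules map to which columns
for rule_id, values in results.items():
    print(rule_id, "->", values[:3])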
For this output, the following rules contain the correct data:
['rule_hjk5', 'rule_9sty', 'rule_2hml', 'rule_3qvv', 'rule_e8x1', 'rule_mhl4', 'rule_h090', 'rule_xg34']
Update the keep_rules call inside the prune_rules function with these rule IDs, then comment out setup_model() and uncomment prune_rules(). Running the script again loads the previously created model, removes every rule except the ones listed, and saves the model back to the same file.
Now run the load_and_run_model function by commenting out the previous steps and uncommenting load_and_run_model(). This extracts the correct data into teams_data.csv.
Here is the content of this file after a successful run.
Team Name,Year,Wins,Losses,Win %,Goals For (GF),Goals Against (GA),+/-
Boston Bruins,1990,44,0.55,24,299,264,35
Buffalo Sabres,1990,31,14,30,292,278,14
...21 more rows
Calgary Flames,1991,31,-9,37,296,305,-9
Chicago Blackhawks,1991,36,21,29,257,236,21
Common Challenges with AutoScraper
AutoScraper is efficient for simple use cases with small datasets. However, setting it up for complex data, like the table you saw earlier, can be tedious. AutoScraper also doesn't support JavaScript rendering, so for dynamic pages you need to pair it with a tool like Splash, Selenium, or Puppeteer.
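For example, a common workaround is to render the page with Selenium first and then hand the rendered HTML to AutoScraper, since build and get_result_similar both accept an html argument. This is a rough sketch, assuming Chrome is installed and using a placeholder URL and wanted value:
from autoscraper import AutoScraper
from selenium import webdriver

# Render the page in a real browser so JavaScript-generated content is present
driver = webdriver.Chrome()
driver.get("https://example.com/js-heavy-page")  # placeholder URL
rendered_html = driver.page_source
driver.quit()

# Feed the rendered HTML to AutoScraper instead of letting it fetch the URL itself
scraper = AutoScraper()
scraper.build(html=rendered_html, wanted_list=["Example value"])  # placeholder example
results = scraper.get_result_similar(html=rendered_html, grouped=True)
print(results)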
If you face IP blocks, or if you need to customize headers while scraping, you can pass extra request parameters to AutoScraper like this:
# build the scraper on an initial URL
scraper.build(
    url,
    wanted_list=wanted_list,
    request_args=dict(proxies=proxies)  # this is where you can pass in proxies or custom headers
)
For instance, the example below shows how to set a custom user agent and proxy:
request_args = {
    "headers": {
        # You can customize this value with your desired user agent. This value is the default used by AutoScraper.
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/84.0.4147.135 Safari/537.36"
    },
    "proxies": {
        # Example proxy showing how to use the host, port, username, and password values
        "http": "http://user:[email protected]:3128/"
    }
}

# build the scraper on an initial URL
scraper.build(
    url,
    wanted_list=wanted_list,
    request_args=request_args
)
To avoid blocks and ensure smooth operation, it's wise to use proxies optimized for web scraping. For this, consider ProxyTee's residential proxies, which offer over 20 million IPs from 100+ countries. They are affordable, reliable, and easy to use, and pricing can be up to 50% lower than comparable competitors, so do not hesitate to check out our pricing page!
AutoScraper does not support rate limiting out of the box. For this, you need to implement your own throttling or use a library like ratelimit.
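For instance, one simple approach is to wrap the per-page call with the ratelimit decorators when you need to scrape several pages. This is a minimal sketch, assuming the saved model from earlier and placeholder page URLs:
from autoscraper import AutoScraper
from ratelimit import limits, sleep_and_retry

scraper = AutoScraper()
scraper.load("teams_model.json")

@sleep_and_retry
@limits(calls=1, period=5)  # allow at most one request every 5 seconds
def scrape_page(page_url):
    return scraper.get_result_similar(page_url, grouped=True)

# Placeholder page URLs; substitute the pages you actually need to scrape
page_urls = [
    "https://www.scrapethissite.com/pages/forms/?page_num=1",
    "https://www.scrapethissite.com/pages/forms/?page_num=2",
]

for page_url in page_urls:
    print(scrape_page(page_url))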
Since AutoScraper cannot handle dynamic websites or CAPTCHA-protected sites on its own, a good alternative for those cases is a full solution like the Bright Data Web Scraping API. (Note that ProxyTee does not provide a similar service; we focus on proxy solutions.)