How to Scrape Baidu Search Results with ProxyTee

Scraping Baidu search results effectively takes the right tools and techniques. Baidu is the leading search engine in China, and extracting data from its search results can be valuable for market research, competitive analysis, and more. However, Baidu employs robust anti-scraping measures, making it challenging to gather information reliably. This post will guide you through the process of scraping Baidu’s organic search results using Python, with a focus on how ProxyTee can help overcome these challenges.
Baidu Search Engine: Overview
The Baidu Search Engine Results Page (SERP) includes organic results, paid advertisements, and related search suggestions. Organic results are listings that the search engine deems most relevant to the user’s query. Paid results, marked “广告” (advertisement), appear at the top of the page. Baidu also offers a “Related searches” section, usually at the bottom of the page.
Challenges of Scraping Baidu
Scraping Baidu is not straightforward. The platform employs several anti-scraping techniques:
- CAPTCHAs: Baidu frequently presents CAPTCHAs to block automated bots.
- User Agent and IP Blocking: Baidu blocks suspicious user agents and IP addresses; see the header sketch after this list.
- Dynamic Content: Baidu’s search result pages are dynamic, meaning their HTML structure often changes. This can cause web scrapers to break if they are not kept updated.
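As a small illustration of the user-agent point above, a plain `requests` call identifies itself as `python-requests`, which is trivial for Baidu to flag. Below is a minimal sketch of sending a browser-like User-Agent instead; the header string is just an example, and headers alone will not get past CAPTCHAs or IP blocks:
import requests

# A browser-like User-Agent; the default python-requests one is easy to flag.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/120.0.0.0 Safari/537.36'
}
response = requests.get('https://www.baidu.com/s?ie=utf-8&wd=nike', headers=headers, timeout=30)
print(response.status_code)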
Overcoming Challenges with ProxyTee
To scrape Baidu effectively, you need a robust proxy solution like ProxyTee, which offers rotating residential proxies designed to help you bypass anti-scraping measures. ProxyTee is known for providing affordable, reliable, and user-friendly proxy solutions, with key features that are particularly beneficial for web scraping tasks (a brief usage sketch follows the list):
- Unlimited Bandwidth: You can perform data-intensive scraping without worrying about additional costs.
- Global IP Coverage: With over 20 million IP addresses from more than 100 countries, ProxyTee enables you to target specific regions effectively.
- Auto Rotation: IP addresses change automatically at customizable intervals (3 to 60 minutes), making it hard for Baidu to detect and block you.
- Affordable Pricing: ProxyTee’s plans, especially our Unlimited Residential Proxies, can be as much as 50% cheaper than the competition while providing similar or better functionality.
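To make the proxy side concrete, here is a minimal sketch of routing `requests` traffic through a rotating residential proxy gateway. The hostname, port, and credentials below are placeholders, not ProxyTee’s actual gateway details; take the real values from your ProxyTee dashboard:
import requests

# Placeholder gateway details; substitute the values from your ProxyTee dashboard.
proxy_user = 'your_proxy_username'
proxy_pass = 'your_proxy_password'
proxy_gateway = 'gateway.proxytee.example:8000'  # hypothetical host:port

proxies = {
    'http': f'http://{proxy_user}:{proxy_pass}@{proxy_gateway}',
    'https': f'http://{proxy_user}:{proxy_pass}@{proxy_gateway}',
}

# With auto rotation enabled, successive requests can exit from different IPs.
response = requests.get('https://www.baidu.com/s?ie=utf-8&wd=nike', proxies=proxies, timeout=30)
print(response.status_code)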
Is it Legal to Scrape Baidu?
Collecting publicly available data, including Baidu’s search results, is often legal. However, ensure that your web scraper doesn’t breach any laws or collect copyrighted data. When in doubt, seek legal counsel.
Getting Started with Baidu Scraping
Before starting, make sure you have Python installed and set up a virtual environment, which keeps the libraries you will be using separate from your main install. Here are the basic commands, in order:
python -m venv env
(creates a virtual environment named “env”, or whatever name you like)
source env/bin/activate
(activates the virtual environment)
pip install requests
(installs the `requests` library)
deactivate
(ends the virtual environment session when you are done)
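On Windows, the activation command differs slightly:
env\Scripts\activate
(activates the virtual environment in a Windows command prompt)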
How to Form Baidu URLs
You’ll need to construct URLs for both desktop and mobile devices to target the content you require. Baidu uses the following formats:
Desktop Devices:
https://www.baidu.<domain>/s?ie=utf-8&wd=<query>&rn=<limit>&pn=<calculated_start_page>
Mobile Devices:
https://m.baidu.<domain>/s?ie=utf-8&word=<query>&rn=<limit>&pn=<calculated_start_page>
Where:
- `domain`: .com for English and .cn for Chinese content.
- `query`: the search keyword, with spaces replaced by %20. The parameter is `wd` on desktop and `word` on mobile.
- `limit`: the number of results per page.
- `calculated_start_page`: the starting offset, calculated as `limit * start_page – limit`. For example, to view the third page with five items per page, use a value of 10, as 5 * 3 – 5 = 10.
For example, to search for “nike shoes” on the fifth page, showing ten results, the URLs would be:
Desktop:
https://www.baidu.com/s?ie=utf-8&wd=nike%20shoes&rn=10&pn=40
Mobile:
https://m.baidu.com/s?ie=utf-8&word=nike%20shoes&rn=10&pn=40
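To avoid assembling these URLs by hand, a small helper can encode the query and compute the `pn` offset. This is a minimal sketch; `build_baidu_url` is our own illustration, not part of any library:
from urllib.parse import quote

def build_baidu_url(query, page=1, limit=10, mobile=False, domain='com'):
    # pn is the zero-based result offset: limit * page - limit
    pn = limit * page - limit
    host = ('m.baidu.' if mobile else 'www.baidu.') + domain
    param = 'word' if mobile else 'wd'  # mobile uses 'word', desktop uses 'wd'
    return f'https://{host}/s?ie=utf-8&{param}={quote(query)}&rn={limit}&pn={pn}'

print(build_baidu_url('nike shoes', page=5, limit=10))
# https://www.baidu.com/s?ie=utf-8&wd=nike%20shoes&rn=10&pn=40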
Tutorial on Scraping Baidu Search Results with Python
Here is a step-by-step guide to doing it with Python and ProxyTee:
1️⃣ Import Necessary Libraries
import requests
from pprint import pprint
2️⃣ Set your API Endpoint
Set up the URL for accessing data through ProxyTee’s API:
url = 'https://api.proxytee.com/v1/queries'
3️⃣ Configure Authentication
Obtain your credentials (API username and password) from ProxyTee and use them as follows:
auth = ('your_api_username', 'your_api_password')
4️⃣ Create Your Payload
The `payload` dictionary holds the parameters for your search request. Adjust them to fit your use case, paying particular attention to the search URL and the geo-targeting. Here’s an example:
payload = {
    'source': 'universal',
    'url': 'https://www.baidu.com/s?ie=utf-8&wd=nike&rn=50',
    'geo_location': 'United States',
    'user_agent_type': 'desktop_firefox'
}
5️⃣ Send POST Request
Pass the parameters in the body of the request using the `requests` library:
response = requests.post(url, json=payload, auth=auth, timeout=180)
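Before parsing the response, it is worth confirming that the request actually succeeded; a minimal guard:
# Raise an exception on 4xx/5xx responses instead of parsing an error body.
response.raise_for_status()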
6️⃣ Load and Print the Data
Parse the server’s response for more convenient use. The result arrives as JSON with the raw HTML of the page embedded inside it; the sample below shows both pretty-printing the JSON and saving the HTML to a file.
json_data = response.json()
pprint(json_data)
with open('baidu.html', 'w', encoding='utf-8') as f:
    f.write(json_data['results'][0]['content'])
7️⃣ Full Code Example
Here is the full snippet containing all the steps described above:
import requests
from pprint import pprint

# Search parameters: adjust the Baidu URL and geo-targeting to your use case.
payload = {
    'source': 'universal',
    'url': 'https://www.baidu.com/s?ie=utf-8&wd=nike&rn=50',
    'geo_location': 'United States',
    'user_agent_type': 'desktop_firefox'
}

# ProxyTee API endpoint and credentials.
url = 'https://api.proxytee.com/v1/queries'
auth = ('your_api_username', 'your_api_password')

# Send the request and parse the JSON response.
response = requests.post(url, json=payload, auth=auth, timeout=180)
json_data = response.json()
pprint(json_data)

# Save the returned HTML document to disk.
with open('baidu.html', 'w', encoding='utf-8') as f:
    f.write(json_data['results'][0]['content'])
Output Sample: When the code runs without errors, the request returns a 200 status code, the script prints a pretty JSON dump of the requested document, and a saved baidu.html file appears in your directory.
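From here you would typically parse the saved HTML into structured results. The sketch below uses BeautifulSoup (`pip install beautifulsoup4`) and looks for the `h3` title links that have commonly appeared in Baidu’s organic results; since Baidu’s markup changes often, treat that selector as an assumption to verify against a fresh page:
from bs4 import BeautifulSoup

with open('baidu.html', encoding='utf-8') as f:
    soup = BeautifulSoup(f.read(), 'html.parser')

# Organic result titles have historically sat in <h3> tags; verify on a live page.
for h3 in soup.select('h3'):
    link = h3.find('a')
    if link and link.get('href'):
        print(link.get_text(strip=True), '->', link['href'])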