Get Started with Web Scraping Using Scrapy

Web scraping is a core technique for developers who need to extract data from websites at scale. Whether you’re building price monitoring tools, collecting real-time analytics, or feeding your machine learning models with external data, scraping becomes essential. In this article, we will explore how to perform advanced web scraping with Scrapy, a powerful Python framework. You will learn how to integrate it with a serverless environment like AWS Lambda and see how proxies can dramatically improve your scraping performance and reliability. With hands-on code examples and practical insights, this guide equips you with a reusable foundation to scale your scraping infrastructure efficiently.
Why Scrapy Is a Great Choice for Web Scraping
Scrapy is a fast, extensible, and production-grade web crawling framework. It simplifies the process of writing spiders, parsing HTML, following links, and exporting data to different formats. Here’s why Scrapy stands out:
- Built-in support for CSS and XPath selectors and for middlewares (see the quick shell example after this list)
- Automatic request scheduling and throttling
- Pipeline support for post-processing scraped data
- Native handling of pagination and link traversal
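The first point is easy to see in practice. Once Scrapy is installed (covered in the next section), its bundled interactive shell lets you test selectors against a live page before writing any spider code. After running `scrapy shell 'https://example.com/products'` (the same placeholder URL used throughout this article), a minimal session might look like this:

```python
# Inside the Scrapy shell: try selectors before committing them to a spider
response.css('div.product h2::text').getall()   # all product titles on the page
response.css('span.price::text').get()          # the first price, or None if absent
```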
Setting Up Scrapy for Your Project
First, you need to install Scrapy and initialize your project. Make sure Python 3.7 or higher is installed.
```bash
# Installation
pip install scrapy

# Start a new project
scrapy startproject productscraper
cd productscraper
```
Scrapy generates a standard project layout. Edit the `items.py` file to define the structure of the data you plan to collect.
```python
# items.py
import scrapy

class ProductItem(scrapy.Item):
    title = scrapy.Field()
    price = scrapy.Field()
    stock = scrapy.Field()
```

Then create your spider inside the `spiders` directory:

```python
# spiders/product_spider.py
import scrapy
from productscraper.items import ProductItem

class ProductSpider(scrapy.Spider):
    name = 'product'
    start_urls = ['https://example.com/products']

    def parse(self, response):
        for product in response.css('div.product'):
            item = ProductItem()
            item['title'] = product.css('h2::text').get()
            item['price'] = product.css('span.price::text').get()
            item['stock'] = product.css('span.stock::text').get()
            yield item
```
Using Proxies for Reliable Web Scraping
Many websites implement rate limits or IP blocking to prevent scraping. Proxies help you rotate IP addresses, avoid detection, and maintain consistent data access. Here’s how to add proxy support in Scrapy:
```python
# middlewares.py
import random

class ProxyMiddleware:
    def __init__(self):
        self.proxies = [
            'http://55.66.77.88:10001',
            'http://55.66.77.88:10002',
            'http://55.66.77.88:10003',
        ]

    def process_request(self, request, spider):
        proxy = random.choice(self.proxies)
        request.meta['proxy'] = proxy
```
Enable the middleware in `settings.py`:
```python
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'productscraper.middlewares.ProxyMiddleware': 350,
}
```
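If your proxy provider requires credentials, a common pattern is to attach a `Proxy-Authorization` header from the same middleware. The sketch below uses placeholder credentials and builds the header with `basic_auth_header` from `w3lib`, a library Scrapy already depends on:

```python
# middlewares.py (authenticated-proxy variant; credentials are placeholders)
import random
from w3lib.http import basic_auth_header

class AuthenticatedProxyMiddleware:
    def __init__(self):
        self.proxies = [
            'http://55.66.77.88:10001',
            'http://55.66.77.88:10002',
        ]
        # Basic auth for the proxy itself, not for the target website
        self.auth = basic_auth_header('your-username', 'your-password')

    def process_request(self, request, spider):
        request.meta['proxy'] = random.choice(self.proxies)
        request.headers['Proxy-Authorization'] = self.auth
```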
Deploying Web Scraping with Scrapy on AWS Lambda
Running Scrapy spiders on AWS Lambda enables cost-effective, scalable scraping jobs that are easy to schedule and deploy without managing servers. Because Lambda imposes deployment-size limits and comes with cold-start quirks, we'll use a specialized tool, `Zappa`, which packages Python web applications and deploys them to Lambda with minimal configuration.
```bash
# Install Zappa
pip install zappa

# In your project directory
zappa init
```
Since Scrapy isn't a web framework, we'll wrap the crawl in a small Flask app so that Zappa has a WSGI application it can deploy as the Lambda-facing interface.
```python
# lambda_handler.py
from flask import Flask
import subprocess

app = Flask(__name__)

@app.route('/run-spider')
def run_spider():
    result = subprocess.run(
        ['scrapy', 'crawl', 'product'],
        capture_output=True,
        text=True
    )
    return result.stdout
```
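Calling the `scrapy` CLI through `subprocess` keeps the wrapper simple. If you prefer to run the crawl in-process, Scrapy's `CrawlerProcess` API can do the same job. The sketch below assumes the project layout from earlier; keep in mind that the underlying Twisted reactor can only be started once per process, which matters when a warm Lambda container serves several invocations:

```python
# run_crawler.py - in-process alternative to the subprocess call (a sketch)
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from productscraper.spiders.product_spider import ProductSpider

def run_spider_in_process():
    # Load settings.py from the Scrapy project, then run the spider
    process = CrawlerProcess(get_project_settings())
    process.crawl(ProductSpider)
    process.start()  # blocks until the crawl finishes
```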
Update `zappa_settings.json` to point the deployment at the Flask app in `lambda_handler.py`:
{ "production": { "app_function": "lambda_handler.app", "aws_region": "us-east-1", "project_name": "scrapy-lambda-project", "runtime": "python3.9", "s3_bucket": "your-zappa-deployments" } }
Then deploy to Lambda:
```bash
zappa deploy production
```
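Since the point of running on Lambda is scheduled scraping jobs, it is worth knowing that Zappa can also invoke a plain Python function on a cron-like schedule via the `events` key in `zappa_settings.json`. The function below is a hypothetical entry point, for example the `run_spider_in_process` sketch above:

```json
{
    "production": {
        "app_function": "lambda_handler.app",
        "aws_region": "us-east-1",
        "project_name": "scrapy-lambda-project",
        "runtime": "python3.9",
        "s3_bucket": "your-zappa-deployments",
        "events": [
            {
                "function": "run_crawler.run_spider_in_process",
                "expression": "rate(1 day)"
            }
        ]
    }
}
```

After adding the `events` block, `zappa schedule production` registers the schedule and `zappa unschedule production` removes it.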
Integrating Proxies in AWS Lambda
When deploying on Lambda, make sure your proxy list is stored securely, either in environment variables or in AWS Secrets Manager. Here is a simplified method using environment variables:
```python
# lambda_handler.py (modified)
# The Flask app and subprocess import stay the same as before
import os
import random

proxies = os.environ['PROXY_LIST'].split(',')

@app.route('/run-spider')
def run_spider():
    chosen_proxy = random.choice(proxies)
    # Scrapy's built-in HttpProxyMiddleware reads the standard
    # http_proxy/https_proxy environment variables, so hand the chosen
    # proxy to the crawl through the subprocess environment
    result = subprocess.run(
        ['scrapy', 'crawl', 'product'],
        capture_output=True,
        text=True,
        env={**os.environ, 'http_proxy': chosen_proxy, 'https_proxy': chosen_proxy}
    )
    return result.stdout
```
In AWS Lambda configuration, set an environment variable named `PROXY_LIST` containing comma-separated proxy URLs.
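If you would rather keep the proxy list out of plain environment variables, the same handler can fetch it from AWS Secrets Manager with `boto3`, which is already available in the Lambda runtime. This sketch assumes a hypothetical secret named `scrapy/proxy-list` whose value is the same comma-separated proxy string:

```python
# lambda_handler.py (Secrets Manager variant; the secret name is hypothetical)
import boto3

def load_proxies():
    client = boto3.client('secretsmanager')
    secret = client.get_secret_value(SecretId='scrapy/proxy-list')
    # The secret value is the comma-separated proxy string used above
    return secret['SecretString'].split(',')

proxies = load_proxies()  # replaces the os.environ['PROXY_LIST'] lookup
```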
Handling Pagination and Rate Limiting in Scrapy
Scrapy makes it easy to handle pagination dynamically using link extraction or selectors. Here’s how to parse multiple pages:
```python
# spiders/product_spider.py (add to the parse method)
next_page = response.css('a.next::attr(href)').get()
if next_page:
    yield response.follow(next_page, callback=self.parse)
```
To avoid hitting rate limits, set a download delay and enable auto-throttle in `settings.py`:
```python
DOWNLOAD_DELAY = 1.5
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 5
```
Exporting Data and Using Pipelines
Scrapy supports exporting data to JSON, CSV, or a database via pipelines. To export to JSON:
```bash
# Run from the terminal
scrapy crawl product -o data.json
```
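The `-o` flag is handy for one-off runs. In Scrapy 2.1 and later you can also configure exports permanently through the `FEEDS` setting in `settings.py`; a minimal sketch:

```python
# settings.py - declare feed exports instead of passing -o on every run
FEEDS = {
    'data.json': {'format': 'json'},
    'data.csv': {'format': 'csv'},
}
```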
To save data in a database, use a pipeline:
```python
# pipelines.py
import sqlite3

class SQLitePipeline:
    def open_spider(self, spider):
        self.conn = sqlite3.connect('products.db')
        self.cur = self.conn.cursor()
        self.cur.execute(
            'CREATE TABLE IF NOT EXISTS products (title TEXT, price TEXT, stock TEXT)'
        )

    def process_item(self, item, spider):
        self.cur.execute(
            'INSERT INTO products (title, price, stock) VALUES (?, ?, ?)',
            (item['title'], item['price'], item['stock'])
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()
```
Then enable it in `settings.py`:
```python
ITEM_PIPELINES = {
    'productscraper.pipelines.SQLitePipeline': 300,
}
```
Best Practices for Scalable and Ethical Scraping
- Respect robots.txt and site terms (see the settings sketch after this list)
- Use realistic User-Agent headers
- Throttle requests to avoid overloading servers
- Use proxy rotation to prevent IP bans
- Monitor response status codes and adapt to changes
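Most of these practices map directly onto Scrapy settings. Here is a minimal sketch of a polite baseline; the User-Agent string is a placeholder you should replace with something that honestly identifies your crawler:

```python
# settings.py - a polite baseline configuration (sketch)
ROBOTSTXT_OBEY = True          # respect robots.txt
USER_AGENT = 'productscraper (+https://example.com/contact)'  # placeholder contact URL
DOWNLOAD_DELAY = 1.5           # throttle requests
AUTOTHROTTLE_ENABLED = True    # back off automatically when the server slows down
RETRY_ENABLED = True
RETRY_HTTP_CODES = [429, 500, 502, 503, 504]  # react to rate-limit and server errors
```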
Next Steps and Optimization Ideas
Scrapy combined with serverless execution and proxy management provides a powerful solution for scalable scraping. However, there is always room for optimization. Consider using dynamic proxies with session persistence, driving a headless browser with a tool like Selenium for JavaScript-heavy pages, or distributing the work across message queues and multiple Lambda functions. With the right configuration, you can build a high-performance scraping infrastructure that adapts to nearly any use case and scales with your data demands.
Whether you’re building a personal data collector or a commercial scraper backend, mastering these techniques will set you apart and give you robust, production-ready tooling.