Get Started with Web Scraping Using Scrapy

Web scraping is a core technique for developers who need to extract data from websites at scale. Whether you’re building price monitoring tools, collecting real-time analytics, or feeding your machine learning models with external data, scraping becomes essential. In this article, we will explore how to perform advanced web scraping with Scrapy, a powerful Python framework. You will learn how to integrate it with a serverless environment like AWS Lambda and see how proxies can dramatically improve your scraping performance and reliability. With hands-on code examples and practical insights, this guide equips you with a reusable foundation to scale your scraping infrastructure efficiently.
Why Scrapy Is a Great Choice for Web Scraping
Scrapy is a fast, extensible, and production-grade web crawling framework. It simplifies the process of writing spiders, parsing HTML, following links, and exporting data to different formats. Here’s why Scrapy stands out:
- Built-in support for CSS and XPath selectors and for middlewares (see the quick shell example after this list)
- Automatic request scheduling and throttling
- Pipeline support for post-processing scraped data
- Native handling of pagination and link traversal
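The first point is easy to see in practice. Once Scrapy is installed (covered in the next section), its bundled interactive shell lets you test selectors against a live page before writing any spider code. After running `scrapy shell 'https://example.com/products'` (the same placeholder URL used throughout this article), a minimal session might look like this:

```python
# Inside the Scrapy shell: try selectors before committing them to a spider
response.css('div.product h2::text').getall()   # all product titles on the page
response.css('span.price::text').get()          # the first price, or None if absent
```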
Setting Up Scrapy for Your Project
First, you need to install Scrapy and initialize your project. Make sure Python 3.7 or higher is installed.
```bash
# Installation
pip install scrapy

# Start a new project
scrapy startproject productscraper
cd productscraper
```
Scrapy generates a standard project layout. Edit the `items.py` file to define the structure of the data you plan to collect.
```python
# items.py
import scrapy

class ProductItem(scrapy.Item):
    title = scrapy.Field()
    price = scrapy.Field()
    stock = scrapy.Field()
```

Then create your spider inside the `spiders` directory:

```python
# spiders/product_spider.py
import scrapy
from productscraper.items import ProductItem

class ProductSpider(scrapy.Spider):
    name = 'product'
    start_urls = ['https://example.com/products']

    def parse(self, response):
        for product in response.css('div.product'):
            item = ProductItem()
            item['title'] = product.css('h2::text').get()
            item['price'] = product.css('span.price::text').get()
            item['stock'] = product.css('span.stock::text').get()
            yield item
```
Using Proxies for Reliable Web Scraping
Many websites implement rate limits or IP blocking to prevent scraping. Proxies help you rotate IP addresses, avoid detection, and maintain consistent data access. Here’s how to add proxy support in Scrapy:
```python
# middlewares.py
import random

class ProxyMiddleware:
    def __init__(self):
        self.proxies = [
            'http://55.66.77.88:10001',
            'http://55.66.77.88:10002',
            'http://55.66.77.88:10003',
        ]

    def process_request(self, request, spider):
        proxy = random.choice(self.proxies)
        request.meta['proxy'] = proxy
```
Enable the middleware in `settings.py`:
```python
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'productscraper.middlewares.ProxyMiddleware': 350,
}
```
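If your proxy provider requires credentials, a common pattern is to attach a `Proxy-Authorization` header from the same middleware. The sketch below uses placeholder credentials and builds the header with `basic_auth_header` from `w3lib`, a library Scrapy already depends on:

```python
# middlewares.py (authenticated-proxy variant; credentials are placeholders)
import random
from w3lib.http import basic_auth_header

class AuthenticatedProxyMiddleware:
    def __init__(self):
        self.proxies = [
            'http://55.66.77.88:10001',
            'http://55.66.77.88:10002',
        ]
        # Basic auth for the proxy itself, not for the target website
        self.auth = basic_auth_header('your-username', 'your-password')

    def process_request(self, request, spider):
        request.meta['proxy'] = random.choice(self.proxies)
        request.headers['Proxy-Authorization'] = self.auth
```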
Deploying Web Scraping with Scrapy on AWS Lambda
Running Scrapy spiders on AWS Lambda enables cost-effective, scalable scraping jobs that are easy to schedule and deploy without managing servers. Because Lambda imposes deployment-size limits and comes with cold-start quirks, we'll use a specialized tool, `Zappa`, which packages Python web applications and deploys them to Lambda with minimal configuration.
```bash
# Install Zappa
pip install zappa

# In your project directory
zappa init
```
Since Scrapy isn't a web framework, we'll wrap the crawl in a small Flask app so that Zappa has a WSGI application it can deploy as the Lambda-facing interface.
```python
# lambda_handler.py
from flask import Flask
import subprocess

app = Flask(__name__)

@app.route('/run-spider')
def run_spider():
    result = subprocess.run(
        ['scrapy', 'crawl', 'product'],
        capture_output=True,
        text=True
    )
    return result.stdout
```
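Calling the `scrapy` CLI through `subprocess` keeps the wrapper simple. If you prefer to run the crawl in-process, Scrapy's `CrawlerProcess` API can do the same job. The sketch below assumes the project layout from earlier; keep in mind that the underlying Twisted reactor can only be started once per process, which matters when a warm Lambda container serves several invocations:

```python
# run_crawler.py - in-process alternative to the subprocess call (a sketch)
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from productscraper.spiders.product_spider import ProductSpider

def run_spider_in_process():
    # Load settings.py from the Scrapy project, then run the spider
    process = CrawlerProcess(get_project_settings())
    process.crawl(ProductSpider)
    process.start()  # blocks until the crawl finishes
```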
Update `zappa_settings.json` to point the deployment at the Flask app in `lambda_handler.py`:
{ "production": { "app_function": "lambda_handler.app", "aws_region": "us-east-1", "project_name": "scrapy-lambda-project", "runtime": "python3.9", "s3_bucket": "your-zappa-deployments" } }
Then deploy to Lambda:
```bash
zappa deploy production
```
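Since the point of running on Lambda is scheduled scraping jobs, it is worth knowing that Zappa can also invoke a plain Python function on a cron-like schedule via the `events` key in `zappa_settings.json`. The function below is a hypothetical entry point, for example the `run_spider_in_process` sketch above:

```json
{
    "production": {
        "app_function": "lambda_handler.app",
        "aws_region": "us-east-1",
        "project_name": "scrapy-lambda-project",
        "runtime": "python3.9",
        "s3_bucket": "your-zappa-deployments",
        "events": [
            {
                "function": "run_crawler.run_spider_in_process",
                "expression": "rate(1 day)"
            }
        ]
    }
}
```

After adding the `events` block, `zappa schedule production` registers the schedule and `zappa unschedule production` removes it.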
Integrating Proxies in AWS Lambda
When deploying on Lambda, make sure your proxy list is stored securely, either in environment variables or in AWS Secrets Manager. Here is a simplified method using environment variables:
```python
# lambda_handler.py (modified)
# The Flask app and subprocess import stay the same as before
import os
import random

proxies = os.environ['PROXY_LIST'].split(',')

@app.route('/run-spider')
def run_spider():
    chosen_proxy = random.choice(proxies)
    # Scrapy's built-in HttpProxyMiddleware reads the standard
    # http_proxy/https_proxy environment variables, so hand the chosen
    # proxy to the crawl through the subprocess environment
    result = subprocess.run(
        ['scrapy', 'crawl', 'product'],
        capture_output=True,
        text=True,
        env={**os.environ, 'http_proxy': chosen_proxy, 'https_proxy': chosen_proxy}
    )
    return result.stdout
```
In AWS Lambda configuration, set an environment variable named `PROXY_LIST` containing comma-separated proxy URLs.
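If you would rather keep the proxy list out of plain environment variables, the same handler can fetch it from AWS Secrets Manager with `boto3`, which is already available in the Lambda runtime. This sketch assumes a hypothetical secret named `scrapy/proxy-list` whose value is the same comma-separated proxy string:

```python
# lambda_handler.py (Secrets Manager variant; the secret name is hypothetical)
import boto3

def load_proxies():
    client = boto3.client('secretsmanager')
    secret = client.get_secret_value(SecretId='scrapy/proxy-list')
    # The secret value is the comma-separated proxy string used above
    return secret['SecretString'].split(',')

proxies = load_proxies()  # replaces the os.environ['PROXY_LIST'] lookup
```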
Handling Pagination and Rate Limiting in Scrapy
Scrapy makes it easy to handle pagination dynamically using link extraction or selectors. Here’s how to parse multiple pages:
```python
# spiders/product_spider.py (add to the parse method)
next_page = response.css('a.next::attr(href)').get()
if next_page:
    yield response.follow(next_page, callback=self.parse)
```
To avoid hitting rate limits, set a download delay and enable auto-throttle in `settings.py`:
```python
DOWNLOAD_DELAY = 1.5
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 5
```
Exporting Data and Using Pipelines
Scrapy supports exporting data to JSON, CSV, or a database via pipelines. To export to JSON:
```bash
# Run from the terminal
scrapy crawl product -o data.json
```
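The `-o` flag is handy for one-off runs. In Scrapy 2.1 and later you can also configure exports permanently through the `FEEDS` setting in `settings.py`; a minimal sketch:

```python
# settings.py - declare feed exports instead of passing -o on every run
FEEDS = {
    'data.json': {'format': 'json'},
    'data.csv': {'format': 'csv'},
}
```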
To save data in a database, use a pipeline:
```python
# pipelines.py
import sqlite3

class SQLitePipeline:
    def open_spider(self, spider):
        self.conn = sqlite3.connect('products.db')
        self.cur = self.conn.cursor()
        self.cur.execute(
            'CREATE TABLE IF NOT EXISTS products (title TEXT, price TEXT, stock TEXT)'
        )

    def process_item(self, item, spider):
        self.cur.execute(
            'INSERT INTO products (title, price, stock) VALUES (?, ?, ?)',
            (item['title'], item['price'], item['stock'])
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()
```
Then enable it in `settings.py`:
```python
ITEM_PIPELINES = {
    'productscraper.pipelines.SQLitePipeline': 300,
}
```
Best Practices for Scalable and Ethical Scraping
- Respect robots.txt and site terms (see the settings sketch after this list)
- Use realistic User-Agent headers
- Throttle requests to avoid overloading servers
- Use proxy rotation to prevent IP bans
- Monitor response status codes and adapt to changes
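Most of these practices map directly onto Scrapy settings. Here is a minimal sketch of a polite baseline; the User-Agent string is a placeholder you should replace with something that honestly identifies your crawler:

```python
# settings.py - a polite baseline configuration (sketch)
ROBOTSTXT_OBEY = True          # respect robots.txt
USER_AGENT = 'productscraper (+https://example.com/contact)'  # placeholder contact URL
DOWNLOAD_DELAY = 1.5           # throttle requests
AUTOTHROTTLE_ENABLED = True    # back off automatically when the server slows down
RETRY_ENABLED = True
RETRY_HTTP_CODES = [429, 500, 502, 503, 504]  # react to rate-limit and server errors
```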
Next Steps and Optimization Ideas
Scrapy combined with serverless execution and proxy management provides a powerful solution for scalable scraping. However, there is always room for optimization. Consider using dynamic proxies with session persistence, driving a headless browser with a tool like Selenium for JavaScript-heavy pages, or distributing the work across message queues and multiple Lambda functions. With the right configuration, you can build a high-performance scraping infrastructure that adapts to nearly any use case and scales with your data demands.
Whether you’re building a personal data collector or a commercial scraper backend, mastering these techniques will set you apart and give you robust, production-ready tooling.