    Serverless Web Scraping with Scrapy

    January 26, 2025 · Mike

    Web scraping is a core technique for developers who need to extract data from websites at scale. Whether you’re building price monitoring tools, collecting real-time analytics, or feeding your machine learning models with external data, scraping becomes essential. In this article, we will explore how to perform advanced web scraping with Scrapy, a powerful Python framework. You will learn how to integrate it with a serverless environment like AWS Lambda and see how proxies can dramatically improve your scraping performance and reliability. With hands-on code examples and practical insights, this guide equips you with a reusable foundation to scale your scraping infrastructure efficiently.

    Why Scrapy Is a Great Choice for Web Scraping

    Scrapy is a fast, extensible, and production-grade web crawling framework. It simplifies the process of writing spiders, parsing HTML, following links, and exporting data to different formats. Here’s why Scrapy stands out:

    • Built-in support for selectors and middlewares (see the shell example after this list)
    • Automatic request scheduling and throttling
    • Pipeline support for post-processing scraped data
    • Native handling of pagination and link traversal
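
    One quick way to see the selector support in action is Scrapy’s interactive shell; a minimal session (the URL and CSS classes are placeholders matching the spider we build later):

    # Explore a page interactively before writing a spider
    scrapy shell 'https://example.com/products'

    # Then, inside the shell:
    >>> response.css('div.product h2::text').getall()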

    Setting Up Scrapy for Your Project

    First, you need to install Scrapy and initialize your project. Make sure a recent Python 3 is installed; current Scrapy releases no longer support Python 3.7.

    # Installation
    pip install scrapy
    
    # Start a new project
    scrapy startproject productscraper
    cd productscraper
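
    Scrapy scaffolds a standard project layout; for this project it looks like this:

    productscraper/
        scrapy.cfg            # deploy configuration
        productscraper/
            __init__.py
            items.py          # item definitions
            middlewares.py    # downloader and spider middlewares
            pipelines.py      # item pipelines
            settings.py       # project settings
            spiders/          # spiders live here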

    Edit the `items.py` file to define the structure of the data you plan to collect.

    # items.py
    import scrapy
    
    class ProductItem(scrapy.Item):
        title = scrapy.Field()
        price = scrapy.Field()
        stock = scrapy.Field()
    
    Then create your spider inside the `spiders` directory:
    
    # spiders/product_spider.py
    import scrapy
    from productscraper.items import ProductItem
    
    class ProductSpider(scrapy.Spider):
        name = 'product'
        start_urls = ['https://example.com/products']
    
        def parse(self, response):
            for product in response.css('div.product'):
                item = ProductItem()
                item['title'] = product.css('h2::text').get()
                item['price'] = product.css('span.price::text').get()
                item['stock'] = product.css('span.stock::text').get()
                yield item
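
    With the spider in place, you can run it locally to confirm the selectors work before adding proxies:

    # Run from the project root
    scrapy crawl product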

    Using Proxies for Reliable Web Scraping

    Many websites implement rate limits or IP blocking to prevent scraping. Proxies help you rotate IP addresses, avoid detection, and maintain consistent data access. Here’s how to add proxy support in Scrapy:

    # middlewares.py
    import random
    
    class ProxyMiddleware:
        def __init__(self):
            # Placeholder proxy endpoints; replace with your provider's list
            self.proxies = [
                'http://55.66.77.88:10001',
                'http://55.66.77.88:10002',
                'http://55.66.77.88:10003'
            ]
    
        def process_request(self, request, spider):
            # Attach a randomly chosen proxy to every outgoing request
            proxy = random.choice(self.proxies)
            request.meta['proxy'] = proxy

    Enable the middleware in `settings.py`. The priority of 350 places it before Scrapy’s built-in HttpProxyMiddleware (priority 750) in the download chain:

    # Enable or disable downloader middlewares
    # See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
    DOWNLOADER_MIDDLEWARES = {
        'productscraper.middlewares.ProxyMiddleware': 350,
    }
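
    If your proxies require authentication, Scrapy accepts credentials embedded in the proxy URL itself, so the middleware above works unchanged (hypothetical credentials shown):

    # A proxy entry with embedded credentials
    'http://username:password@55.66.77.88:10001'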

    Deploying Web Scraping with Scrapy on AWS Lambda

    Running Scrapy spiders on AWS Lambda enables cost-effective, scalable scraping jobs that can be scheduled and deployed without managing servers. Because Lambda imposes deployment-size limits and adds cold-start latency, we’ll use `Zappa`, a tool that packages Python WSGI applications and deploys them to Lambda with minimal configuration.

    # Install zappa
    pip install zappa
    
    # In your project directory
    zappa init

    Since Scrapy isn’t a web framework, we’ll wrap the crawl in a small Flask app so Zappa has a WSGI entry point to deploy.

    # lambda_handler.py
    from flask import Flask
    import subprocess
    
    app = Flask(__name__)
    
    @app.route('/run-spider')
    def run_spider():
        # Run the spider as a subprocess; note that API Gateway caps request
        # duration at roughly 30 seconds, so long crawls should be scheduled
        # or run asynchronously instead
        result = subprocess.run(
            ['scrapy', 'crawl', 'product'],
            capture_output=True,
            text=True
        )
        return result.stdout

    Update `zappa_settings.json` to point to this handler:

    {
        "production": {
            "app_function": "lambda_handler.app",
            "aws_region": "us-east-1",
            "project_name": "scrapy-lambda-project",
            "runtime": "python3.9",
            "s3_bucket": "your-zappa-deployments"
        }
    }
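
    Scrapy’s dependency tree (Twisted, lxml) can push the package past Lambda’s deployment size limits. Zappa’s `slim_handler` option works around this by loading the bundle from S3 at runtime; an optional addition inside the "production" block above:

    "slim_handler": true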

    Then deploy to Lambda:

    zappa deploy production
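
    On success, Zappa prints the public API Gateway URL for the app. You can trigger a crawl with a plain HTTP request (the URL below is a placeholder for whatever Zappa prints), and push later code changes with `zappa update`:

    # Trigger the spider via the deployed endpoint
    curl https://abc123.execute-api.us-east-1.amazonaws.com/production/run-spider

    # Redeploy after code changes
    zappa update production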

    Integrating Proxies in AWS Lambda

    When deploying on Lambda, store your proxy list securely, either in environment variables or AWS Secrets Manager. Here is a simplified approach using environment variables:

    # lambda_handler.py (modified)
    import os
    import random
    import subprocess
    
    from flask import Flask
    
    app = Flask(__name__)
    
    # Comma-separated proxy URLs supplied through the Lambda configuration
    proxies = os.environ['PROXY_LIST'].split(',')
    
    @app.route('/run-spider')
    def run_spider():
        chosen_proxy = random.choice(proxies)
        # Scrapy's built-in HttpProxyMiddleware reads the standard
        # http_proxy/https_proxy environment variables of the subprocess
        env = {**os.environ,
               'http_proxy': chosen_proxy,
               'https_proxy': chosen_proxy}
        result = subprocess.run(
            ['scrapy', 'crawl', 'product'],
            capture_output=True,
            text=True,
            env=env
        )
        return result.stdout

    In AWS Lambda configuration, set an environment variable named `PROXY_LIST` containing comma-separated proxy URLs.
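
    Scheduled crawls are also straightforward: instead of calling the HTTP endpoint, Zappa can bind a plain Python function to a CloudWatch schedule. A sketch, assuming the subprocess call above is factored into a `run_spider_task()` function in `lambda_handler.py`:

    # zappa_settings.json (add inside the "production" block)
    "events": [{
        "function": "lambda_handler.run_spider_task",
        "expression": "rate(1 day)"
    }]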

    Handling Pagination and Rate Limiting in Scrapy

    Scrapy makes it easy to handle pagination dynamically using link extraction or selectors. Here’s how to parse multiple pages:

    # spiders/product_spider.py (add to the end of the parse method)
    # Queue the next page, if any, through the same callback
    next_page = response.css('a.next::attr(href)').get()
    if next_page:
        yield response.follow(next_page, callback=self.parse)
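
    For sites with richer link structures, the same traversal can be written declaratively with Scrapy’s `CrawlSpider` and `LinkExtractor`; a sketch reusing the item and selectors from earlier:

    # spiders/product_crawl_spider.py (declarative alternative)
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor
    from productscraper.items import ProductItem
    
    class ProductCrawlSpider(CrawlSpider):
        name = 'product_crawl'
        start_urls = ['https://example.com/products']
    
        # Follow every "next" link and parse each page it reaches
        rules = (
            Rule(LinkExtractor(restrict_css='a.next'), callback='parse_page', follow=True),
        )
    
        def parse_start_url(self, response):
            # CrawlSpider does not route start_urls through rule callbacks
            return self.parse_page(response)
    
        def parse_page(self, response):
            for product in response.css('div.product'):
                item = ProductItem()
                item['title'] = product.css('h2::text').get()
                item['price'] = product.css('span.price::text').get()
                item['stock'] = product.css('span.stock::text').get()
                yield item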

    To avoid hitting rate limits, set a download delay and enable auto-throttle in `settings.py`:

    DOWNLOAD_DELAY = 1.5          # seconds between requests to the same site
    AUTOTHROTTLE_ENABLED = True   # adapt delays to server response times
    AUTOTHROTTLE_START_DELAY = 1  # initial delay in seconds
    AUTOTHROTTLE_MAX_DELAY = 5    # upper bound in seconds

    Exporting Data and Using Pipelines

    Scrapy supports exporting data to JSON, CSV, or a database via pipelines. To export to JSON:

    # Run from the terminal (-o appends to an existing file; -O overwrites it)
    scrapy crawl product -o data.json

    To save data in a database, use a pipeline:

    # pipelines.py
    import sqlite3
    
    class SQLitePipeline:
        def open_spider(self, spider):
            # On AWS Lambda, only /tmp is writable, e.g. sqlite3.connect('/tmp/products.db')
            self.conn = sqlite3.connect('products.db')
            self.cur = self.conn.cursor()
            self.cur.execute('CREATE TABLE IF NOT EXISTS products (title TEXT, price TEXT, stock TEXT)')
        
        def process_item(self, item, spider):
            self.cur.execute(
                'INSERT INTO products (title, price, stock) VALUES (?, ?, ?)',
                (item['title'], item['price'], item['stock'])
            )
            self.conn.commit()
            return item
        
        def close_spider(self, spider):
            self.conn.close()

    Then enable it in `settings.py`:

    ITEM_PIPELINES = {
        'productscraper.pipelines.SQLitePipeline': 300,
    }

    Best Practices for Scalable and Ethical Scraping

    • Respect robots.txt and site terms (see the settings sketch after this list)
    • Use realistic User-Agent headers
    • Throttle requests to avoid overloading servers
    • Use proxy rotation to prevent IP bans
    • Monitor response status codes and adapt to changes
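
    Several of these map directly to Scrapy settings; a minimal sketch for `settings.py` (the User-Agent value is a placeholder to replace with a realistic one):

    # settings.py (excerpt)
    ROBOTSTXT_OBEY = True  # honor robots.txt rules
    USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'  # placeholder
    DOWNLOAD_DELAY = 1.5   # throttle requests, as configured earlier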

    Next Steps and Optimization Ideas

    Scrapy combined with serverless execution and proxy management provides a powerful solution for scalable scraping. However, there is always room for optimization. Consider using dynamic proxies with session persistence, incorporating headless browsers like Selenium for JavaScript-heavy pages, or integrating distributed scraping using message queues and multiple Lambda functions. With the right configuration, you can build a high-performance scraping infrastructure that adapts to nearly any use case and scales seamlessly with your data demands.

    Whether you’re building a personal data collector or a commercial scraper backend, mastering these techniques will set you apart and give you robust, production-ready tooling.

    • Python
    • Scrapy
    • Web Scraping
