    Tutorial

    Beginner’s Guide to Web Crawling with Python and Scrapy

    June 14, 2025 Mike

    This beginner's guide to web crawling with Python and Scrapy is for anyone who wants to learn how to automatically extract and organize data from websites. With data growing in importance across industries from marketing to research, knowing how to crawl the web with the right tools is a valuable skill. Python, combined with the Scrapy framework, offers a flexible and powerful way to build efficient crawlers. In this article, you'll learn how to get started with web crawling and create your own spider using Scrapy. Whether you're a complete beginner or a developer looking to sharpen your automation skills, this guide is for you.

    Why You Need Web Crawling

    Before we dive into the tutorial, it's important to understand the difference between web scraping and web crawling. While related, web scraping extracts specific data from web pages, whereas web crawling systematically browses pages, typically to discover links and gather information for indexing, as search engines do.

    Web crawling is useful in all kinds of scenarios, including the following:

    • Data extraction: Web crawling can be used to extract specific pieces of data from websites, which can then be used for analysis or research.
    • Website indexing: Search engines often use web crawling to index websites and make them searchable for users.
    • Monitoring: Web crawling can be used to monitor websites for changes or updates. This information is often useful in tracking competitors.
    • Content aggregation: Web crawling can be used to collect content from multiple websites and aggregate it into a single location for easy access.
    • Security testing: Web crawling can be used for security testing to identify vulnerabilities or weaknesses in websites and web applications.

    Understanding the Difference Between Crawling and Scraping

    Before jumping into coding, it's important to understand the distinction. Web crawling refers to navigating through web pages, often to discover and collect links or metadata. Web scraping, on the other hand, is about extracting structured data from those pages. Crawling is usually the first step before scraping: for instance, a crawler might collect links to all product pages on an ecommerce site, which are then scraped for price and title information.

    Web Crawling with Python

    Python is a popular choice for web crawling thanks to its readable, intuitive syntax and gentle learning curve. Additionally, Scrapy, one of the most popular web crawling frameworks, is built on Python. This powerful and flexible framework makes it easy to extract data from websites, follow links, and store the results.

    Scrapy is designed to handle large amounts of data and covers a wide range of scraping tasks. Its built-in components, such as the HTTP downloader, the spiders that crawl websites, the scheduler that manages pending requests, and the item pipeline that processes scraped data, make it well suited to most web crawling jobs. To get started with web crawling using Python, you need to install the Scrapy framework on your system.

    Open your terminal and run the following command:

    pip install scrapy

    After running this command, Scrapy is installed on your system. Scrapy provides classes called spiders that define how a crawl is performed. These spiders are responsible for navigating the website, sending requests, and extracting data from the site's HTML.
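
    To make the idea concrete before building the full project, here is a minimal sketch of a standalone spider. The spider name and the selector are illustrative and not part of the project you'll build below:

    import scrapy

    class MinimalSpider(scrapy.Spider):
        # Every spider needs a unique name; Scrapy uses it to launch the crawl
        name = "minimal"
        start_urls = ["https://books.toscrape.com/"]

        def parse(self, response):
            # parse() is the default callback: it receives each downloaded
            # response and yields the data you want to keep
            yield {"page_title": response.css("title::text").get()}

    You could save this as a .py file and run it with scrapy runspider, but the rest of this guide uses the more structured project layout instead.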

    Creating a Scrapy Project

    In this article, you’re going to crawl a website called Books to Scrape and save the name, category, and price of each book in a CSV file. This website was created to work as a sandbox for scraping projects.

    Once Scrapy is installed, you need to create a new project structure using the following command:

    scrapy startproject bookcrawler

    (Note: if you get a "command not found" error, restart your terminal.)

    The default directory structure provides a clear and organized framework, with separate files and directories for each component of the web scraping process. This makes it easy to write, test, and maintain your spider code, as well as process and store the extracted data in whatever way you prefer. This is what your directory structure looks like:

    bookcrawler
    ├── scrapy.cfg
    └── bookcrawler
        ├── __init__.py
        ├── items.py
        ├── middlewares.py
        ├── pipelines.py
        ├── settings.py
        └── spiders
            └── __init__.py

    To initiate the crawling process, create a new spider file in the bookcrawler/spiders directory, since that is the standard directory where Scrapy looks for spiders to run. Navigate to bookcrawler/spiders and create a file named bookspider.py, then add the following code to define your spider and its behavior:

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor
    
    class BookCrawler(CrawlSpider):
        name = 'bookspider'
        start_urls = [
            'https://books.toscrape.com/',
        ]
        rules = (
            Rule(LinkExtractor(allow='/catalogue/category/books/')),
        )
    

    This code defines BookCrawler, a subclass of the built-in CrawlSpider, which provides a convenient way to define rules for following links and extracting data. The start_urls attribute specifies a list of URLs to start crawling from; in this case it contains only one URL, the home page of the website.

    The rules attribute specifies a set of rules to determine which links the spider should follow. In this case, there is only one rule defined, which is created using the Rule class from the scrapy.spiders module. The rule is defined with a LinkExtractor instance that specifies the pattern of links that the spider should follow. The allow parameter of the LinkExtractor is set to /catalogue/category/books/, which means that the spider should only follow links that contain this string in their URL.
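
    For reference, LinkExtractor also accepts parameters such as deny and allow_domains if you need tighter control over which links are followed. The deny pattern below is a hypothetical example, not something the tutorial's spider requires:

    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import Rule

    # Follow category pages, but skip any URL containing "page-" (hypothetical)
    rule = Rule(
        LinkExtractor(
            allow='/catalogue/category/books/',
            deny='page-',
        )
    )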

    To run the spider, open your terminal and run the following command:

    scrapy crawl bookspider

    As soon as you run this, Scrapy initializes the BookCrawler spider class, creates a request for each URL in the start_urls attribute, and sends them to the scheduler. Each request is checked against the spider's allowed_domains attribute (if one is specified), and allowed requests are passed to the downloader, which makes the HTTP request to the server and retrieves the response.
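
    The allowed_domains attribute mentioned above is optional, but setting it keeps the spider from wandering off-site. As a sketch, the spider defined earlier could be extended with a single extra line:

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class BookCrawler(CrawlSpider):
        name = 'bookspider'
        # Optional: requests to any other domain are filtered out
        allowed_domains = ['books.toscrape.com']
        start_urls = [
            'https://books.toscrape.com/',
        ]
        rules = (
            Rule(LinkExtractor(allow='/catalogue/category/books/')),
        )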

    At this point, you should be able to see every URL your spider has crawled in your console window:

    [Screenshot: crawled URLs printed in the console window]

    The initial crawler only crawls a predefined set of URLs without extracting any information. To retrieve data during the crawling process, you need to define a parse_item function within the crawler class. This function receives the response from each request the crawler makes and returns the relevant data extracted from it.

    Please note: the parse_item callback is only invoked once you set the callback parameter on your Rule, as shown in the next section.

    To extract data from the responses your crawler receives, you can use CSS selectors (Scrapy also supports XPath). The next section provides a brief introduction to CSS selectors.

    A Bit about CSS Selectors

    CSS selectors let you pick out elements on a web page by tag, class, and attribute, and Scrapy can then extract data from whatever they match. For instance, here is a Scrapy shell session initialized with scrapy shell books.toscrape.com:

    # check if the response was successful
    >>> response
    <200 http://books.toscrape.com>
    
    # extract the title tag
    >>> response.css('title')
    [<Selector xpath='descendant-or-self::title' data='<title>\n    All products | Books to S...'>]
    

    In this session, the css method takes a selector (here, the title tag) and returns the matching Selector objects. To get the text within the title tag, you write the following query:

    >>> print(response.css('title::text').get())
        All products | Books to Scrape - Sandbox
    

    In this snippet, the ::text pseudo-selector strips the enclosing title tag and returns only the inner text, and the get method returns the first matching value as a plain string.
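
    Two related patterns come up constantly and are worth trying in the same shell session: getall() returns every match instead of just the first, and ::attr() extracts an attribute value rather than text. The exact output depends on the live page, so it's omitted here:

    >>> # every matching text node, as a list of strings
    >>> response.css('h3 a::text').getall()

    >>> # the href attribute of the first matching link
    >>> response.css('h3 a::attr(href)').get()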

    To find the classes of the elements you want, view the page's source by right-clicking it in your browser and selecting Inspect:

    [Screenshot: the browser's Inspect feature open on the page]

    Extracting Data Using Scrapy

    To extract elements from the response object, you need to define a callback function and pass it to the callback parameter of the Rule.

    Open bookspider.py and replace its contents with the following code:

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor
    
    class BookCrawler(CrawlSpider):
        name = 'bookspider'
        start_urls = [
            'https://books.toscrape.com/',
        ]
    
        rules = (
            Rule(LinkExtractor(allow='/catalogue/category/books/'), callback="parse_item"),
        )
        def parse_item(self, response):
            category = response.css('h1::text').get()
            book_titles = response.css('article.product_pod').css('h3').css('a::text').getall()
            book_prices = response.css('article.product_pod').css('p.price_color::text').getall()
            yield {
                "category": category,
                "books":list(zip(book_titles,book_prices))
            }
    

    The parse_item function in the BookCrawler class contains the extraction logic and yields each result. Using yield allows Scrapy to process the data as items, which can then be passed through item pipelines for further processing or storage, as sketched below.
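
    As a small illustration of that last point, an item pipeline is just a class with a process_item method. The sketch below is a hypothetical addition to pipelines.py that tidies each item before export; it only takes effect once registered in the project settings (shown in the settings sketch further below):

    # bookcrawler/pipelines.py -- a minimal, hypothetical pipeline
    class BookcrawlerPipeline:
        def process_item(self, item, spider):
            # Strip stray whitespace from the category before the item is exported
            item["category"] = (item["category"] or "").strip()
            return item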

    Back in parse_item, selecting the category is straightforward, since it sits inside a simple <h1> tag. The book_titles, however, are selected in several steps: first the <article> tag with class product_pod is selected, then the traversal continues to the <a> tag nested within its <h3> tag. The same approach is used for book_prices, retrieving the price text from each product card.
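
    The same multilevel selection can also be written as a single chained CSS selector, which is equivalent and a little more compact:

    # Equivalent one-liners: descend from article.product_pod to the nested tags
    book_titles = response.css('article.product_pod h3 a::text').getall()
    book_prices = response.css('article.product_pod p.price_color::text').getall()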

    At this point, you’ve created a spider that crawls a website and retrieves data. To run the spider, open the terminal and run the following command:

    scrapy crawl bookspider -o books.json

    When executed, the crawled pages and their extracted data are displayed on the console. The -o flag instructs Scrapy to store all the retrieved data in a file named books.json. Once the spider finishes, a new books.json file appears in the project directory containing all the book data retrieved by the crawler.
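
    The introduction mentioned saving the results to a CSV file; Scrapy's feed exporter picks the output format from the file extension, so the same spider can write CSV instead. Note that the nested books list is serialized into a single column in CSV, which is why JSON is often the more convenient format for this particular item shape:

    scrapy crawl bookspider -o books.csv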

    Advanced Features in Scrapy

    As your project grows, Scrapy lets you plug in more advanced tools:

    • Item pipelines for saving data to a database
    • Custom middlewares for user-agent and proxy management
    • Job directory to resume interrupted crawls
    • AutoThrottle to automatically manage crawl rate
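
    As a rough sketch (the values are illustrative, not tuned recommendations), a few of these features can be switched on from the project's settings.py:

    # bookcrawler/settings.py -- illustrative values only

    # AutoThrottle adapts the crawl rate to how quickly the server responds
    AUTOTHROTTLE_ENABLED = True
    AUTOTHROTTLE_START_DELAY = 1.0
    AUTOTHROTTLE_MAX_DELAY = 10.0

    # Persist scheduler state so an interrupted crawl can be resumed later
    JOBDIR = 'crawls/bookspider-run1'

    # Register the item pipeline sketched earlier (the number is its priority)
    ITEM_PIPELINES = {
        'bookcrawler.pipelines.BookcrawlerPipeline': 300,
    }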

    Scaling Up with ProxyTee

    Scrapy spiders are powerful, but without proxy support they often get blocked. ProxyTee is a residential proxy provider that can give your spider a new IP for each request. Routing requests through rotating proxies helps you avoid blocks, especially when crawling large or heavily protected websites. ProxyTee offers:

    • Unlimited bandwidth for uninterrupted crawling
    • Global IP rotation to bypass region locks
    • Support for both HTTP and SOCKS5 protocols
    • Easy API integration with Scrapy

    By configuring middleware or using third-party Scrapy proxy packages, you can route your crawler’s requests through ProxyTee’s proxy endpoints.
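
    As a minimal sketch of the middleware approach, a downloader middleware can set request.meta['proxy'] on every outgoing request. The host, port, and credentials below are placeholders, so substitute the actual endpoint and login details from your ProxyTee dashboard:

    # bookcrawler/middlewares.py -- placeholder proxy values, not real endpoints
    class ProxyMiddleware:
        def process_request(self, request, spider):
            # Scrapy routes the request through whatever proxy URL is set here
            request.meta['proxy'] = 'http://USERNAME:PASSWORD@proxy.example.com:8080'

    # bookcrawler/settings.py -- enable the middleware
    DOWNLOADER_MIDDLEWARES = {
        'bookcrawler.middlewares.ProxyMiddleware': 350,
    }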


    Final Thoughts on Building with Python and Scrapy

    Web crawling with Python and Scrapy is not just about writing a few lines of code. It's about building efficient systems that extract value from the web at scale. From setting up your first spider to managing large-scale crawls with proxies, this guide lays the foundation for a solid web crawling strategy. With Python's simplicity, Scrapy's robust tooling, and ProxyTee's infrastructure, you have everything you need to build reliable, scalable, and ethical web crawlers.

    Keep exploring Scrapy’s documentation, stay up to date with proxy best practices, and always ensure your data collection complies with website terms of service. Happy crawling!
