Running Puppeteer on AWS Lambda with ProxyTee

Using Puppeteer with AWS Lambda is straightforward once you understand the key constraints. Combined with ProxyTee's rotating residential proxies, which ProxyTee offers with unlimited bandwidth at a low cost compared to other providers, this setup gives you a flexible and scalable solution for tasks like automated testing or web scraping.
Introduction to Puppeteer and AWS Lambda
Puppeteer, developed by Google, is an open-source tool that lets you control a headless browser via a simple API. It is invaluable for automating browser interactions such as testing or, very commonly, web scraping.
AWS Lambda is a serverless compute service that lets you run code without managing servers. It’s a pay-as-you-go service, and with its ease of use, you can focus solely on your code rather than infrastructure management.
You can perform multiple operations with AWS Lambda. For instance, it’s frequently used for web scraping tasks and to insert data into databases. Even back-end API routes can be hosted with AWS Lambda.
Integrating ProxyTee with Puppeteer on AWS Lambda
Puppeteer and AWS Lambda are each powerful on their own, but they reach their full potential when paired with a reliable proxy provider. That is where ProxyTee comes into play. ProxyTee gives you access to unlimited-bandwidth residential proxies, so you can scrape data without worrying about bandwidth caps, and its auto-rotating proxies help you avoid IP bans and website blocks. Its API is easy to use, which makes it well suited to automating scraping jobs.
Challenges and Solutions for Puppeteer on AWS Lambda
Problem #1: Puppeteer is too big to push to Lambda
AWS Lambda has a 50 MB limit for zip files uploaded directly. Because Puppeteer bundles Chromium, its package exceeds this limit. The solution is to load your function code from S3, which bypasses the 50 MB direct-upload restriction.
To do this, create an S3 bucket, upload the built zip package to it, and then configure the Lambda function to pull its code from that bucket.
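The choice between a direct upload and the S3 route can be sketched as below. This is a minimal illustration rather than official tooling: the helper names `chooseDeployment` and `s3CodeParams` are hypothetical, and the 250 MB figure is Lambda's limit on the unzipped package. The returned object uses the `FunctionName`, `S3Bucket`, and `S3Key` fields accepted by Lambda's UpdateFunctionCode API.

```javascript
// Lambda package size limits, in bytes.
const DIRECT_UPLOAD_LIMIT = 50 * 1024 * 1024; // zipped, uploaded directly
const UNZIPPED_LIMIT = 250 * 1024 * 1024; // unzipped, regardless of source

// Hypothetical helper: pick a deployment route from the package sizes.
function chooseDeployment(zippedBytes, unzippedBytes) {
  if (unzippedBytes > UNZIPPED_LIMIT) return 'too-large';
  return zippedBytes <= DIRECT_UPLOAD_LIMIT ? 'direct' : 's3';
}

// Hypothetical helper: build the parameters that tell Lambda to pull
// the function's code from an object in S3.
function s3CodeParams(functionName, bucket, key) {
  return { FunctionName: functionName, S3Bucket: bucket, S3Key: key };
}
```

For example, a 70 MB zip that unpacks to 200 MB yields `'s3'`. After uploading the zip to your bucket, the object from `s3CodeParams('scraper', 'my-deploy-bucket', 'scraper.zip')` (placeholder names) could be passed to `UpdateFunctionCodeCommand` from `@aws-sdk/client-lambda`, assuming you deploy with the AWS SDK for JavaScript v3.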
Problem #2: Puppeteer on AWS Lambda doesn’t work
By default, the AWS Lambda runtime lacks the libraries that Puppeteer's Chromium needs, so a stock Puppeteer install fails to launch. The chrome-aws-lambda package addresses this by shipping a Chromium build prepared for Lambda. Install it together with puppeteer-core rather than the regular puppeteer package: the regular package downloads its own Chromium and could push you beyond Lambda's 250 MB unzipped size limit. Install both packages like so:
npm i --save chrome-aws-lambda puppeteer-core
When launching Puppeteer, use this setup:
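const chromium = require('chrome-aws-lambda');

// Run this inside an async function, such as your Lambda handler.
// It launches the Chromium binary provided by chrome-aws-lambda.
const browser = await chromium.puppeteer.launch({
  args: chromium.args,
  defaultViewport: chromium.defaultViewport,
  executablePath: await chromium.executablePath,
  headless: chromium.headless,
});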
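To show where this launch call might sit in a complete function, here is a hedged sketch of a Lambda handler built around it. The `createHandler` factory and the `event.url` input field are assumptions made for illustration. Passing `chromium` in as an argument, instead of requiring it inside the function, is also a choice made here so the sketch can be exercised with a stub outside Lambda.

```javascript
// Build a Lambda handler around a chrome-aws-lambda-style `chromium` object.
function createHandler(chromium) {
  return async function handler(event) {
    let browser = null;
    try {
      browser = await chromium.puppeteer.launch({
        args: chromium.args,
        defaultViewport: chromium.defaultViewport,
        executablePath: await chromium.executablePath,
        headless: chromium.headless,
      });
      const page = await browser.newPage();
      await page.goto(event.url); // `event.url` is a hypothetical input field
      return { title: await page.title() };
    } finally {
      // Always release Chromium, even if launch succeeded but navigation threw.
      if (browser !== null) await browser.close();
    }
  };
}
```

In a real deployment you would wire it up with something like `exports.handler = createHandler(require('chrome-aws-lambda'));`. The `finally` block ensures the browser is closed on both the success and the error path.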
Important Notes
- A headless browser consumes far more memory than a typical script. Allocate at least 512 MB of memory to your AWS Lambda function. Lambda assigns CPU in proportion to memory, so an under-provisioned function also runs slower and is more likely to hit its timeout.
- Ensure that your function calls await browser.close() when it finishes. A browser instance left open keeps the process busy and can cause the function to run until it times out.
- When using ProxyTee, route your requests through its residential proxies. They come with unlimited bandwidth, so data collection is not limited by bandwidth caps.
- Use ProxyTee's auto-rotating residential proxies to reduce the risk of detection.
- Use ProxyTee's geolocation targeting for residential proxies to reduce the chance of IP blocking.
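The proxy-related notes can be made concrete with a small sketch. It relies on two mechanisms that exist independently of any provider: Chromium's `--proxy-server` flag and Puppeteer's `page.authenticate()`. Everything else is an assumption: the `buildProxyConfig` helper is hypothetical, and the host, port, and credentials are placeholders, because the article does not state ProxyTee's endpoint format.

```javascript
// Hypothetical helper: turn proxy settings into Puppeteer launch and auth inputs.
function buildProxyConfig(baseArgs, proxy) {
  return {
    // Chromium's --proxy-server flag routes browser traffic through the proxy.
    args: [...baseArgs, `--proxy-server=${proxy.host}:${proxy.port}`],
    // Credentials are supplied per page via page.authenticate(), not in the flag.
    credentials: { username: proxy.username, password: proxy.password },
  };
}

// Placeholder values; substitute the endpoint and credentials from your own account.
const config = buildProxyConfig(['--no-sandbox'], {
  host: 'proxy.example.com',
  port: 8080,
  username: 'USER',
  password: 'PASS',
});
```

In a Lambda function you would likely call `buildProxyConfig(chromium.args, ...)`, pass `config.args` as the `args` launch option, and call `await page.authenticate(config.credentials)` before `page.goto()`.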