TL;DR: This post details how to get a web scraper running on AWS Lambda using Selenium and a headless Chrome browser, while using Docker to test locally. It’s based on this guide, but the guide didn’t work for me because the versions of Selenium, headless Chrome and chromedriver were incompatible. What did work was the following:
- chromedriver 2.37
- serverless-chrome 1.0.0-37
- selenium 2.53.6
EDIT: The versions above are no longer supported. According to this GitHub issue, these versions work well together:
- chromedriver 2.43
- serverless-chrome 1.0.0-55
- selenium 3.14
The full story
I recently spent several frustrating weeks trying to deploy a Selenium web scraper that runs every night on its own and saves the results to a database on Amazon S3. With this post, I hope to spare you from wanting to smash all computers with a sledgehammer.
First, some background
I wanted to scrape a government website that is updated every night, detect new additions, alert me by email when something is found, and save the results. I could have run the script on my computer with a cron job on Mac or a scheduled task on Windows.
But desktop computers are unreliable. They can get unplugged accidentally, or restart because of an update. I wanted my script to be run from a server that never turns off.
At the NICAR 2018 conference, I learned about serverless applications using AWS Lambda, so this seemed like an ideal solution. But the demo I saw, and almost all the documentation and blog posts about this use Node.js. I wanted to work in Python, which Lambda also supports.
How can something be serverless if it runs on an Amazon server? Well, it’s serverless for you. You don’t have to set up the software, maintain it, or make sure it’s still running. Amazon does it all for you. You just need to upload your scripts and tell it what to do. And it costs pennies a month, even for daily scrapes.
This guide is based mostly on this repo from 21Buttons, a fashion social network based in Barcelona. They did most of the heavy lifting to get a Selenium scraper using a headless Chrome browser working in Lambda using Python. I simply modified it a bit to work for me.
Download their repo onto your machine. The important files are these:
- Dockerfile
- docker-compose.yml
- Makefile
- requirements.txt
The rest you can delete.
Lambda is Amazon’s serverless application platform. It lets you write or upload a script that runs according to various triggers you give it. For example, it can be run at a certain time, or when a file is added or changed in an S3 bucket.
This is how I set up my Lambda instance.
1. Go to AWS Lambda, choose your preferred region and create a new function.
2. Choose “Author from scratch”. Give it a function name, and choose Python 3.6 as the runtime. Under “Role”, choose “Create new role from template”. Roles define the permissions your function has to access other AWS services, like S3. For now give this role a name like “scraper”. Under “Policy templates” choose “Basic Edge Lambda permissions”, which gives your function the ability to run and create logs. Hit “Create function”.
3. Now you’re in the main Lambda dashboard. Here is where you set up the triggers, environment variables, and access the logs. Since we want this to run on a schedule, we need to set up the trigger. On the “add trigger” menu on the left, choose CloudWatch Events.
Click on “Configuration required” to set up the time the script will run. A form will appear below it.
Under “Rule” choose “Create a new rule”. Give it a rule name. It can be anything, like “dailyscrape”. Give it a description so you know what it’s doing if you forget.
Now you have to write the weird cron expression that tells it when to run. I want the scraper to run at 11 PM EST every night. In Lambda, you need to enter the time in GMT, so that’s 3 AM. My expression therefore needs to be:
```
cron(0 3 * * ? *)
```
For more on cron syntax, read this.
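The timezone conversion is the easiest part to get wrong. Here's a quick sketch for double-checking it (the helper function is mine, not from the original guide, and uses `zoneinfo` from Python 3.9+):

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # standard library since Python 3.9

def utc_hour_for(local_hour, tz_name):
    """Return the UTC hour to use in a Lambda cron() expression for a
    job scheduled at local_hour wall-clock time in tz_name today."""
    local_now = datetime.now(ZoneInfo(tz_name))
    local_run = local_now.replace(hour=local_hour, minute=0,
                                  second=0, microsecond=0)
    return local_run.astimezone(ZoneInfo("UTC")).hour

# 11 PM Eastern is 3 AM UTC during daylight saving time, 4 AM otherwise
print(f"cron(0 {utc_hour_for(23, 'America/New_York')} * * ? *)")
```

Note that Lambda cron schedules don't follow daylight saving time, so a schedule set for 3 AM UTC will drift by an hour when the clocks change.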
Click on “Add” and that’s it. Your function is set to run at a scheduled time.
4. Now we need to add our actual function. But first, let’s set up some environment variables.
The “Function code” section lets you write the code in the inline text editor, upload a ZIP package, or upload from S3. A Selenium scraper is big, because you need to include a Chrome browser with it. So choose “Upload a file from Amazon S3”. We’ll come back to this later.
5. We need to set up our environment variables. Since we’ll be uploading a Chrome browser, we need to tell Lambda where to find it. So punch in these keys and values.
```
PATH = /var/task/bin
PYTHONPATH = /var/task/src:/var/task/lib
```
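These paths matter because the packaged chromedriver and headless Chrome binaries will live in `bin/`. If you later hit a "chromedriver not found" error, a small sanity check like this (the function name is mine) dropped into your handler can confirm the runtime sees them:

```python
import os
import shutil

def check_runtime_paths():
    """Report whether the Lambda environment variables point at the
    packaged binaries. Handy when debugging 'chromedriver not found'."""
    return {
        "bin_on_path": "/var/task/bin" in os.environ.get("PATH", ""),
        "chromedriver": shutil.which("chromedriver"),  # None if not on PATH
    }
```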
6. Finally, configure the final options. Under “Execution role”, choose the role you defined in the first step. We’ll need to configure this further in a future step.
Under “Basic settings” choose how much memory will be allocated to your scraper. My scraper rarely needs more than 1000 MB, but I give it a little more to be safe. If the memory used goes above this limit, Lambda will terminate the function.
Same with execution time. Lambda gives you a maximum of five minutes to run a function. If your scraper goes over that, you’ll need to rethink it. Maybe split it into two Lambda functions.
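If your scrape loops over many pages, one way to stay under the limit is to check the clock as you go. `context.get_remaining_time_in_millis()` is a real method on the Lambda context object; the work queue and the `FakeContext` class below are hypothetical stand-ins of mine for local testing:

```python
class FakeContext:
    """Stand-in for Lambda's context object, for local testing only."""
    def __init__(self, ms_left):
        self.ms_left = ms_left

    def get_remaining_time_in_millis(self):
        return self.ms_left

def lambda_handler(event, context):
    scraped = []
    for page in ["page-a", "page-b", "page-c"]:  # hypothetical work queue
        # Stop early rather than letting Lambda kill the function mid-scrape.
        if context.get_remaining_time_in_millis() < 30_000:
            break
        scraped.append(page)  # stand-in for the real scrape step
    return scraped
```

With more than 30 seconds left, all pages are scraped; with less, the loop bails out immediately and you can save whatever progress you made.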
And that’s all the configuration we need in Lambda for now.
Remember that we gave this function the most basic Lambda permissions? Well, if I want my script to write data to S3, I need to give it permission to do that. That’s where your role comes in. And you need to configure your role in IAM, Amazon’s Identity and Access Management system.
1. Go to IAM. Click on “Roles” on the left menu. You should see the role you created in the previous section. Click on it. You’ll see that it has one policy attached to it: AWSLambdaEdgeExecutionRole, the most basic Lambda permission.
2. Click on “Attach Policy”. Search for S3. Choose AmazonS3FullAccess. This will allow the function to read and write to S3. Click on “Attach policy”. That’s it. When you go back to Lambda, you should now see S3 as a resource.
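With the policy attached, writing results from your handler takes only a few lines of boto3. Here's a sketch; the bucket and key names are examples, and the optional `s3` parameter is my addition so the function can be exercised without AWS credentials:

```python
def save_results(bucket, key, body, s3=None):
    """Upload scraper output to S3. Requires an S3 write policy
    (e.g. AmazonS3FullAccess) attached to the function's role."""
    if s3 is None:
        import boto3  # listed in requirements.txt
        s3 = boto3.client("s3")
    s3.put_object(Bucket=bucket, Key=key, Body=body.encode("utf-8"))
    return f"s3://{bucket}/{key}"
```

For example, `save_results("my-scraper-bucket", "results.json", json_string)` writes the string to the bucket and returns the object's S3 URI.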
Now we’re ready to upload the scraper to Lambda.
But hold on. How do you know it will work? Lambda runs on Linux. It’s a different computing environment. You’ll want to test it locally on your machine simulating the Lambda environment. Otherwise it’s a pain to keep uploading files to Lambda with each change you make.
That’s where Docker comes in. It’s a tool that creates lightweight containers that simulate the environments you’ll deploy to. And there’s a handy Lambda container all ready for you to use.
So first, install Docker.
Then open the docker-compose.yml file you downloaded. This specifies the environment variables that you defined in Lambda earlier.
You’ll need to edit this if you want to read and write to S3 while testing from your local machine. Specifically, you need to add your S3 bucket name and AWS credentials.
```yaml
version: '3'

services:
  lambda:
    build: .
    environment:
      - PYTHONPATH=/var/task/src:/var/task/lib
      - PATH=/var/task/bin
      - AWS_BUCKET_NAME='YOUR-BUCKET-NAME'
      - AWS_ACCESS_KEY_ID='YOUR-AWS-KEY-ID'
      - AWS_SECRET_ACCESS_KEY='YOUR-AWS-ACCESS-KEY'
    volumes:
      - ./src/:/var/task/src/
```
Then continue below.
The gentlemen at 21Buttons created a lovely Makefile that does the job of automating the creation of a Docker container, running the scraper in the Docker environment, and packaging the files to upload to Lambda.
What’s a Makefile? It’s just a recipe of command-line commands that automates repetitive tasks. To make use of Makefiles on a Mac, you’ll need to install the command line tools. On Windows, it’s possible to run Makefiles with PowerShell, but I have no experience with this.
I won’t go through all the items in the Makefile, but you can read about it here.
You will need to change the Makefile to download the right versions of headless-chrome and chromedriver. Like I said in the intro, the versions used by 21Buttons didn’t work for me.
This is the Makefile that worked:
```makefile
clean:
	rm -rf build build.zip
	rm -rf __pycache__

fetch-dependencies:
	mkdir -p bin/

	# Get chromedriver
	curl -SL https://chromedriver.storage.googleapis.com/2.37/chromedriver_linux64.zip > chromedriver.zip
	unzip chromedriver.zip -d bin/

	# Get Headless-chrome
	curl -SL https://github.com/adieuadieu/serverless-chrome/releases/download/v1.0.0-37/stable-headless-chromium-amazonlinux-2017-03.zip > headless-chromium.zip
	unzip headless-chromium.zip -d bin/

	# Clean
	rm headless-chromium.zip chromedriver.zip

docker-build:
	docker-compose build

docker-run:
	docker-compose run lambda src/lambda_function.lambda_handler

build-lambda-package: clean fetch-dependencies
	mkdir build
	cp -r src build/.
	cp -r bin build/.
	cp -r lib build/.
	pip install -r requirements.txt -t build/lib/.
	cd build; zip -9qr build.zip .
	cp build/build.zip .
	rm -rf build
```
You’ll also need to edit the requirements.txt file to download the Python libraries that work with the Chrome browser and driver.
This is the minimum you need to run Selenium with chromedriver, plus read/write access to S3. That’s what the boto3 library is for.
```
boto3==1.6.18
botocore==1.9.18
selenium==2.53.6
chromedriver-installer==0.0.6
```
You have your scraper code all ready to go, but you can’t just upload it as is. You have to set a bunch of options for the Chrome driver so it works on Lambda. Again, 21Buttons did the hard work of figuring out the right options.
Usually, when you run a Selenium scraper on your machine, it suffices to start it like this:
```python
from selenium import webdriver

driver = webdriver.Chrome()
```
But we’re using a headless Chrome browser. That is, a browser that runs without a UI. So you need to specify this and a bunch of other things.
This is what you need to pass. Notice the very important line near the bottom that tells Lambda where to find the headless Chrome binary file. It will be packaged in a folder called bin.
```python
from selenium import webdriver
import os

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument('--window-size=1280x1696')
chrome_options.add_argument('--user-data-dir=/tmp/user-data')
chrome_options.add_argument('--hide-scrollbars')
chrome_options.add_argument('--enable-logging')
chrome_options.add_argument('--log-level=0')
chrome_options.add_argument('--v=99')
chrome_options.add_argument('--single-process')
chrome_options.add_argument('--data-path=/tmp/data-path')
chrome_options.add_argument('--ignore-certificate-errors')
chrome_options.add_argument('--homedir=/tmp')
chrome_options.add_argument('--disk-cache-dir=/tmp/cache-dir')
chrome_options.add_argument('user-agent=Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36')
chrome_options.binary_location = os.getcwd() + "/bin/headless-chromium"

driver = webdriver.Chrome(chrome_options=chrome_options)
```
Finally, you need to give your file and function a name that Lambda will know how to run. By default, Lambda will look for a file called lambda_function.py and run a function called lambda_handler. You can call it whatever you want, just make sure you change this in the Lambda dashboard (under “Handler”) and in your Makefile, under the “docker-run” command.
So if your file is called crime_scraper.py and your main function is called main(), you need to change these values to crime_scraper.main.
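Keeping the default names, a minimal `src/lambda_function.py` looks like this (the return value here is just an example):

```python
# src/lambda_function.py -- the default module and function names Lambda runs.
def lambda_handler(event, context):
    # event: details about the trigger (here, the CloudWatch Events payload);
    # context: runtime info such as remaining execution time.
    # Whatever you return shows up in the CloudWatch logs.
    return {"status": "ok"}
```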
Putting it all together
Now you’re ready to test your scraper in your local Lambda environment and package everything.
1. Start Docker.
2. In your terminal, cd into your folder that has your Makefile and Docker files. Create a folder called src and put your Python scraper file in there.
3. Run `make docker-build`.
This will create the Lambda environment based on the Dockerfile, the docker-compose.yml you edited earlier and your requirements.txt file. It will also download all the necessary libraries and binaries that your scraper needs to run.
4. Now that you have a Lambda instance running on your machine, run your script with `make docker-run`.
Does it work? Excellent! Your code is ready to be uploaded to Lambda. If not, you have some debugging to do.
5. Create the zip package with `make build-lambda-package`.
This will create three folders:
bin: the location of your Chrome binary and web driver
lib: the Python libraries you need (Selenium, etc.)
src: your Python script
Then it will zip everything into a file called build.zip.
The resulting file will be too big to upload directly to Lambda, which has a limit of 50 MB. So you’ll need to add it to an S3 bucket and tell Lambda where it is. Paste the link in the “S3 link URL” box you left blank earlier.
Save and observe
Your Lambda scraper is ready to be deployed! Click the big orange “Save” button at the top and that’s it! If there are errors, they will appear in your logs. Click on the “Monitoring” link at the top of your Lambda dashboard, and scroll to the Errors box.
Click on “Jump to logs” to see a detailed log of errors that might need debugging.