TL;DR: This post details how to get a web scraper running on AWS Lambda using Selenium and a headless Chrome browser, while using Docker to test locally. It’s based on this guide, but the guide didn’t work for me because the versions of Selenium, headless Chrome and chromedriver were incompatible. What did work was the following:
- chromedriver 2.37
- serverless-chrome 1.0.0-37
- selenium 2.53.6
EDIT: The versions above are no longer supported. According to this GitHub issue, these versions work well together:
- chromedriver 2.43
- serverless-chrome 1.0.0-55
- selenium 3.14
The full story
I recently spent several frustrating weeks trying to deploy a Selenium web scraper that runs every night on its own and saves the results to a database on Amazon S3. With this post, I hope to spare you from wanting to smash all computers with a sledgehammer.
First, some background
I wanted to scrape a government website that is updated every night, detect new additions, alert me by email when something is found, and save the results. I could have run the script on my computer with a cron job on Mac or a scheduled task on Windows.
But desktop computers are unreliable. They can get unplugged accidentally, or restart because of an update. I wanted my script to be run from a server that never turns off.
At the NICAR 2018 conference, I learned about serverless applications using AWS Lambda, so this seemed like an ideal solution. But the demo I saw, and almost all the documentation and blog posts about this use Node.js. I wanted to work in Python, which Lambda also supports.
How can something be serverless if it runs on an Amazon server? Well, it’s serverless for you. You don’t have to set up the software, maintain it, or make sure it’s still running. Amazon does it all for you. You just need to upload your scripts and tell it what to do. And it costs pennies a month, even for daily scrapes.
This guide is based mostly on this repo from 21Buttons, a fashion social network based in Barcelona. They did most of the heavy lifting to get a Selenium scraper using a headless Chrome browser working in Lambda using Python. I simply modified it a bit to work for me.
Download their repo onto your machine. The important files are these:
- Dockerfile
- docker-compose.yml
- Makefile
- requirements.txt
The rest you can delete.
Lambda is Amazon’s serverless application platform. It lets you write or upload a script that runs according to various triggers you give it. For example, it can be run at a certain time, or when a file is added or changed in an S3 bucket.
This is how I set up my Lambda instance.
1. Go to AWS Lambda, choose your preferred region and create a new function.
2. Choose “Author from scratch”. Give it a function name, and choose Python 3.6 as the runtime. Under “Role”, choose “Create new role from template”. Roles define the permissions your function has to access other AWS services, like S3. For now give this role a name like “scraper”. Under “Policy templates” choose “Basic Edge Lambda permissions”, which gives your function the ability to run and create logs. Hit “Create function”.
3. Now you’re in the main Lambda dashboard. Here is where you set up the triggers, environment variables, and access the logs. Since we want this to run on a schedule, we need to set up the trigger. On the “add trigger” menu on the left, choose CloudWatch Events.
Click on “Configuration required” to set up the time the script will run. A form will appear below it.
Under “Rule” choose “Create a new rule”. Give it a rule name. It can be anything, like “dailyscrape”. Give it a description so you know what it’s doing if you forget.
Now you have to write the weird cron expression that tells it when to run. I want the scraper to run at 11 PM EST every night. In Lambda, you need to enter the time in GMT, so that’s 3 AM. My expression therefore needs to be:
```
cron(0 3 * * ? *)
```
For more on cron syntax, read this.
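The timezone conversion is the easiest part to get wrong. Here's a quick sketch for double-checking it (the helper function is mine, not from the original guide, and uses `zoneinfo` from Python 3.9+):

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # standard library since Python 3.9

def utc_hour_for(local_hour, tz_name):
    """Return the UTC hour to use in a Lambda cron() expression for a
    job scheduled at local_hour wall-clock time in tz_name today."""
    local_now = datetime.now(ZoneInfo(tz_name))
    local_run = local_now.replace(hour=local_hour, minute=0,
                                  second=0, microsecond=0)
    return local_run.astimezone(ZoneInfo("UTC")).hour

# 11 PM Eastern is 3 AM UTC during daylight saving time, 4 AM otherwise
print(f"cron(0 {utc_hour_for(23, 'America/New_York')} * * ? *)")
```

Note that Lambda cron schedules don't follow daylight saving time, so a schedule set for 3 AM UTC will drift by an hour when the clocks change.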
Click on “Add” and that’s it. Your function is set to run at a scheduled time.
4. Now we need to add our actual function. But first, let’s set up some environment variables.
The “Function code” section lets you write the code in the inline text editor, upload a ZIP package, or upload from S3. A Selenium scraper is big, because you need to include a Chrome browser with it. So choose “Upload a file from Amazon S3”. We’ll come back to this later.
5. We need to set up our environment variables. Since we’ll be uploading a Chrome browser, we need to tell Lambda where to find it. So punch in these keys and values.
```
PATH = /var/task/bin
PYTHONPATH = /var/task/src:/var/task/lib
```
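These paths matter because the packaged chromedriver and headless Chrome binaries will live in `bin/`. If you later hit a "chromedriver not found" error, a small sanity check like this (the function name is mine) dropped into your handler can confirm the runtime sees them:

```python
import os
import shutil

def check_runtime_paths():
    """Report whether the Lambda environment variables point at the
    packaged binaries. Handy when debugging 'chromedriver not found'."""
    return {
        "bin_on_path": "/var/task/bin" in os.environ.get("PATH", ""),
        "chromedriver": shutil.which("chromedriver"),  # None if not on PATH
    }
```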
6. Finally, configure the final options. Under “Execution role”, choose the role you defined in the first step. We’ll need to configure this further in a future step.
Under “Basic settings” choose how much memory will be allocated to your scraper. My scraper rarely needs more than 1000 MB, but I give it a little more to be safe. If the memory used goes above this limit, Lambda will terminate the function.
Same with execution time. Lambda gives you a maximum of five minutes to run a function. If your scraper goes over that, you’ll need to rethink it. Maybe split it into two Lambda functions.
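If your scrape loops over many pages, one way to stay under the limit is to check the clock as you go. `context.get_remaining_time_in_millis()` is a real method on the Lambda context object; the work queue and the `FakeContext` class below are hypothetical stand-ins of mine for local testing:

```python
class FakeContext:
    """Stand-in for Lambda's context object, for local testing only."""
    def __init__(self, ms_left):
        self.ms_left = ms_left

    def get_remaining_time_in_millis(self):
        return self.ms_left

def lambda_handler(event, context):
    scraped = []
    for page in ["page-a", "page-b", "page-c"]:  # hypothetical work queue
        # Stop early rather than letting Lambda kill the function mid-scrape.
        if context.get_remaining_time_in_millis() < 30_000:
            break
        scraped.append(page)  # stand-in for the real scrape step
    return scraped
```

With more than 30 seconds left, all pages are scraped; with less, the loop bails out immediately and you can save whatever progress you made.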
And that’s all the configuration we need in Lambda for now.
Remember that we gave this function the most basic Lambda permissions? Well, if I want my script to write data to S3, I need to give it permission to do that. That’s where your role comes in. And you need to configure your role in IAM, Amazon’s Identity and Access Management system.
1. Go to IAM. Click on “Roles” on the left menu. You should see the role you created in the previous section. Click on it. You’ll see that it has one policy attached to it: AWSLambdaEdgeExecutionRole, the most basic Lambda permission.
2. Click on “Attach Policy”. Search for S3. Choose AmazonS3FullAccess. This will allow the function to read and write to S3. Click on “Attach policy”. That’s it. When you go back to Lambda, you should now see S3 as a resource.
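With the policy attached, writing results from your handler takes only a few lines of boto3. Here's a sketch; the bucket and key names are examples, and the optional `s3` parameter is my addition so the function can be exercised without AWS credentials:

```python
def save_results(bucket, key, body, s3=None):
    """Upload scraper output to S3. Requires an S3 write policy
    (e.g. AmazonS3FullAccess) attached to the function's role."""
    if s3 is None:
        import boto3  # listed in requirements.txt
        s3 = boto3.client("s3")
    s3.put_object(Bucket=bucket, Key=key, Body=body.encode("utf-8"))
    return f"s3://{bucket}/{key}"
```

For example, `save_results("my-scraper-bucket", "results.json", json_string)` writes the string to the bucket and returns the object's S3 URI.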
Now we’re ready to upload the scraper to Lambda.
But hold on. How do you know it will work? Lambda runs on Linux. It’s a different computing environment. You’ll want to test it locally on your machine simulating the Lambda environment. Otherwise it’s a pain to keep uploading files to Lambda with each change you make.
That’s where Docker comes in. It’s a tool that creates lightweight containers that simulate the environments you’ll deploy to. And there’s a handy Lambda container all ready for you to use.
So first, install Docker.
Then open the docker-compose.yml file you downloaded. This specifies the environment variables that you defined in Lambda earlier.
You’ll need to edit this if you want to read and write to S3 while testing from your local machine. Specifically, you need to add your S3 bucket name and AWS credentials.
```yaml
version: '3'

services:
  lambda:
    build: .
    environment:
      - PYTHONPATH=/var/task/src:/var/task/lib
      - PATH=/var/task/bin
      - AWS_BUCKET_NAME='YOUR-BUCKET-NAME'
      - AWS_ACCESS_KEY_ID='YOUR-AWS-KEY-ID'
      - AWS_SECRET_ACCESS_KEY='YOUR-AWS-ACCESS-KEY'
    volumes:
      - ./src/:/var/task/src/
```
Then continue below.
The gentlemen at 21Buttons created a lovely Makefile that does the job of automating the creation of a Docker container, running the scraper in the Docker environment, and packaging the files to upload to Lambda.
What’s a Makefile? It’s just a recipe of command-line commands that automates repetitive tasks. To make use of Makefiles on a Mac, you’ll need to install the command line tools. On Windows, it’s possible to run Makefiles with PowerShell, but I have no experience with this.
I won’t go through all the items in the Makefile, but you can read about it here.
You will need to change the Makefile to download the right versions of headless-chrome and chromedriver. Like I said in the intro, the versions used by 21Buttons didn’t work for me.
This is the Makefile that worked:
```makefile
clean:
	rm -rf build build.zip
	rm -rf __pycache__

fetch-dependencies:
	mkdir -p bin/

	# Get chromedriver
	curl -SL https://chromedriver.storage.googleapis.com/2.37/chromedriver_linux64.zip > chromedriver.zip
	unzip chromedriver.zip -d bin/

	# Get Headless-chrome
	curl -SL https://github.com/adieuadieu/serverless-chrome/releases/download/v1.0.0-37/stable-headless-chromium-amazonlinux-2017-03.zip > headless-chromium.zip
	unzip headless-chromium.zip -d bin/

	# Clean
	rm headless-chromium.zip chromedriver.zip

docker-build:
	docker-compose build

docker-run:
	docker-compose run lambda src/lambda_function.lambda_handler

build-lambda-package: clean fetch-dependencies
	mkdir build
	cp -r src build/.
	cp -r bin build/.
	cp -r lib build/.
	pip install -r requirements.txt -t build/lib/.
	cd build; zip -9qr build.zip .
	cp build/build.zip .
	rm -rf build
```
You’ll also need to edit the requirements.txt file to download the Python libraries that work with the Chrome browser and driver.
This is the minimum you need to run Selenium with chromedriver, plus read/write access to S3. That’s what the boto3 library is for.
```
boto3==1.6.18
botocore==1.9.18
selenium==2.53.6
chromedriver-installer==0.0.6
```
You have your scraper code all ready to go, but you can’t just upload it as is. You have to set a bunch of options for the Chrome driver so it works on Lambda. Again, 21Buttons did the hard work of figuring out the right options.
Usually, when you run a Selenium scraper on your machine, it suffices to start it like this:
```python
from selenium import webdriver

driver = webdriver.Chrome()
```
But we’re using a headless Chrome browser. That is, a browser that runs without a UI. So you need to specify this and a bunch of other things.
This is what you need to pass. Notice the very important line near the bottom that tells Lambda where to find the headless Chrome binary file. It will be packaged in a folder called bin.
```python
from selenium import webdriver
import os

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument('--window-size=1280x1696')
chrome_options.add_argument('--user-data-dir=/tmp/user-data')
chrome_options.add_argument('--hide-scrollbars')
chrome_options.add_argument('--enable-logging')
chrome_options.add_argument('--log-level=0')
chrome_options.add_argument('--v=99')
chrome_options.add_argument('--single-process')
chrome_options.add_argument('--data-path=/tmp/data-path')
chrome_options.add_argument('--ignore-certificate-errors')
chrome_options.add_argument('--homedir=/tmp')
chrome_options.add_argument('--disk-cache-dir=/tmp/cache-dir')
chrome_options.add_argument('user-agent=Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36')
chrome_options.binary_location = os.getcwd() + "/bin/headless-chromium"

driver = webdriver.Chrome(chrome_options=chrome_options)
```
Finally, you need to give your file and function a name that Lambda will know how to run. By default, Lambda will look for a file called lambda_function.py and run a function called lambda_handler. You can call it whatever you want, just make sure you change this in the Lambda dashboard (under “Handler”) and in your Makefile, under the “docker-run” command.
So if your file is called crime_scraper.py and your main function is called main(), you need to change these values to crime_scraper.main.
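Keeping the default names, a minimal `src/lambda_function.py` looks like this (the return value here is just an example):

```python
# src/lambda_function.py -- the default module and function names Lambda runs.
def lambda_handler(event, context):
    # event: details about the trigger (here, the CloudWatch Events payload);
    # context: runtime info such as remaining execution time.
    # Whatever you return shows up in the CloudWatch logs.
    return {"status": "ok"}
```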
Putting it all together
Now you’re ready to test your scraper in your local Lambda environment and package everything.
1. Start Docker.
2. In your terminal, cd into your folder that has your Makefile and Docker files. Create a folder called src and put your Python scraper file in there.
3. Run `make docker-build`.
This will create the Lambda environment based on the Dockerfile, the docker-compose.yml you edited earlier and your requirements.txt file. It will also download all the necessary libraries and binaries that your scraper needs to run.
4. Now that you have a Lambda instance running on your machine, run your script with `make docker-run`.
Does it work? Excellent! Your code is ready to be uploaded to Lambda. If not, you have some debugging to do.
5. Create the zip package with `make build-lambda-package`.
This will create three folders:
bin: the location of your Chrome binary and web driver
lib: the Python libraries you need (Selenium, etc.)
src: your Python script
Then it will zip everything into a file called build.zip.
The resulting file will be too big to upload directly to Lambda, which has a limit of 50 MB. So you’ll need to add it to an S3 bucket and tell Lambda where it is. Paste the link in the “S3 link URL” box you left blank earlier.
Save and observe
Your Lambda scraper is ready to be deployed! Click the big orange “Save” button at the top and that’s it! If there are errors, they will appear in your logs. Click on the “Monitoring” link at the top of your Lambda dashboard, and scroll to the Errors box.
Click on “Jump to logs” to see a detailed log of errors that might need debugging.