TL;DR: This post details how to get a web scraper running on AWS Lambda using Selenium and a headless Chrome browser, while using Docker to test locally. It’s based on this guide, but it didn’t work for me because the versions of Selenium, headless Chrome and chromedriver were incompatible. What did work was the following:

EDIT: The versions above are no longer supported. According to this GitHub issue, these versions work well together:

  • chromedriver 2.43
  • serverless-chrome 1.0.0-55
  • selenium 3.14

The full story

I recently spent several frustrating weeks trying to deploy a Selenium web scraper that runs every night on its own and saves the results to a database on Amazon S3. With this post, I hope to spare you from wanting to smash all computers with a sledgehammer.

First, some background

I wanted to scrape a government website that is regularly updated every night, detect new additions, alert me by email when something is found, and save the results. I could have run the script on my computer with a cron job on Mac or a scheduled task on Windows.

But desktop computers are unreliable. They can get unplugged accidentally, or restart because of an update. I wanted my script to be run from a server that never turns off.

At the NICAR 2018 conference, I learned about serverless applications using AWS Lambda, so this seemed like an ideal solution. But the demo I saw, and almost all the documentation and blog posts on the subject, use Node.js. I wanted to work in Python, which Lambda also supports.

Hello serverless

How can something be serverless if it runs on an Amazon server? Well, it’s serverless for you. You don’t have to set up the software, maintain it, or make sure it’s still running. Amazon does it all for you. You just need to upload your scripts and tell it what to do. And it costs pennies a month, even for daily scrapes.

Hello PyChromeless

This guide is based mostly off this repo from 21Buttons, a fashion social network based in Barcelona. They did most of the heavy work to get a Selenium scraper using a Chrome headless browser working in Lambda using Python. I simply modified it a bit to work for me.

Download their repo onto your machine. The important files are these:

  • Dockerfile
  • Makefile
  • docker-compose.yml
  • requirements.txt

The rest you can delete.

Hello Lambda

Lambda is Amazon’s serverless application platform. It lets you write or upload a script that runs according to various triggers you give it. For example, it can be run at a certain time, or when a file is added or changed in an S3 bucket.

Here’s more about it.

This is how I set up my Lambda instance.

1. Go to AWS Lambda, choose your preferred region and create a new function.

2. Choose “Author from scratch”. Give it a function name, and choose Python 3.6 as the runtime. Under “Role”, choose “Create new role from template”. Roles define the permissions your function has to access other AWS services, like S3. For now give this role a name like “scraper”. Under “Policy templates” choose “Basic Edge Lambda permissions”, which gives your function the ability to run and create logs. Hit “Create function”.

setting up AWS Lambda

3. Now you’re in the main Lambda dashboard. Here is where you set up the triggers, environment variables, and access the logs. Since we want this to run on a schedule, we need to set up the trigger. On the “add trigger” menu on the left, choose CloudWatch Events.

Click on “Configuration required” to set up the time the script will run. A form will appear below it.

Under “Rule” choose “Create a new rule”. Give it a rule name. It can be anything, like “dailyscrape”. Give it a description so you know what it’s doing if you forget.

Now you have to write the weird cron expression that tells it when to run. I want the scraper to run at 11 PM Eastern time every night. In Lambda, you need to enter the time in UTC (the same as GMT), so that’s 3 AM while daylight saving time is in effect (4 AM during standard time). My expression therefore needs to be:

cron(0 3 * * ? *)

For more on cron syntax, read this.
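
As a rough sketch of the time conversion above, here is a small helper that builds the daily expression from a local time and a UTC offset. The function name and its parameters are my own invention for illustration, not part of any AWS library, and it only handles whole-hour offsets.

```python
def daily_cron_utc(local_hour, local_minute, utc_offset_hours):
    """Return a daily CloudWatch Events cron expression for a local time.

    utc_offset_hours is the local zone's offset from UTC, for example
    -4 for US Eastern daylight time or -5 for Eastern standard time.
    Only whole-hour offsets are handled in this sketch.
    """
    utc_hour = (local_hour - utc_offset_hours) % 24
    return f"cron({local_minute} {utc_hour} * * ? *)"


print(daily_cron_utc(23, 0, -4))  # 11 PM Eastern daylight time
print(daily_cron_utc(23, 0, -5))  # 11 PM Eastern standard time
```

The schedule itself does not follow daylight saving changes, so a fixed UTC hour will drift by an hour in local time twice a year.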

lambda cron job

Click on “Add” and that’s it. Your function is set to run at a scheduled time.

4. Now we need to add our actual function. But first, let’s set up some environment variables.

The “Function code” section lets you write the code in the inline text editor, upload a ZIP package, or upload from S3. A Selenium scraper is big, because you need to include a Chrome browser with it. So choose “Upload file from Amazon S3”. We’ll come back to this later.

5. We need to set up our environment variables. Since we’ll be uploading a Chrome browser, we need to tell Lambda where to find it. So punch in these keys and values.

PATH = /var/task/bin

PYTHONPATH = /var/task/src:/var/task/lib
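
If you want to confirm that these settings actually let your code find the binaries, something like the following check might help. It is only a sketch: the function name is hypothetical, and the two file names are the ones the packaging step later in this post puts into the bin folder.

```python
import os
import shutil


def find_missing_binaries(names=("chromedriver", "headless-chromium"), path=None):
    """Return the names that cannot be found as executables on the search path."""
    # Fall back to the PATH environment variable when no path is given.
    search_path = path if path is not None else os.environ.get("PATH", "")
    return [name for name in names if shutil.which(name, path=search_path) is None]


if __name__ == "__main__":
    missing = find_missing_binaries()
    print("Missing binaries:", missing or "none")
```

Calling it at the top of your handler and printing the result puts the answer in the logs, which can save some guessing when you see an error like “executable needs to be in PATH”.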

lambda environment variables

6. Finally, configure the final options. Under “Execution role”, choose the role you defined in the first step. We’ll need to configure this further in a future step.

Under “Basic settings” choose how much memory will be allocated to your scraper. My scraper rarely needs more than 1000 MB, but I give it a little more to be safe. If the memory used goes above this limit, Lambda will terminate the function.

Same with execution time. Lambda gives you a maximum of five minutes to run a function. If your scraper goes over that, you’ll need to rethink it. Maybe split it into two Lambda functions.
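
One way to stay under the time limit is to check how much time is left before starting each page. The sketch below assumes your scraper can be expressed as a function called once per URL; scrape_until_deadline, scrape_one and the 30-second safety margin are all hypothetical choices of mine. The only Lambda-specific piece is get_remaining_time_in_millis(), a method of the context object that Lambda passes to Python handlers.

```python
def scrape_until_deadline(urls, scrape_one, context, safety_margin_ms=30000):
    """Scrape URLs in order, stopping before the Lambda timeout is reached.

    Returns (results, pending) so the leftover URLs can be handed to
    another invocation.
    """
    results = []
    pending = list(urls)
    # Only start another page while enough time remains to finish it.
    while pending and context.get_remaining_time_in_millis() > safety_margin_ms:
        results.append(scrape_one(pending.pop(0)))
    return results, pending
```

Whatever is left in pending could then be passed to a second function, which is one way to do the split mentioned above.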

And that’s all the configuration we need in Lambda for now.

other settings in Lambda

Hello IAM

Remember that we gave this function the most basic Lambda permissions? Well, if I want my script to write data to S3, I need to give it permission to do that. That’s where your role comes in. And you need to configure your role in IAM, Amazon’s Identity and Access Management system.

1. Go to IAM. Click on “Roles” on the left menu. You should see the role you created in the previous section. Click on it. You’ll see that it has one policy attached to it: AWSLambdaEdgeExecutionRole, the most basic Lambda permission.

2. Click on “Attach Policy”. Search for S3. Choose AmazonS3FullAccess. This will allow the function to read and write to S3. Click on “Attach policy”. That’s it. When you go back to Lambda, you should now see S3 as a resource.

Lambda with S3 access
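
With that permission in place, writing results from the function could look roughly like this. It is a minimal sketch, not the code from my scraper: save_results and the JSON format are assumptions, and the client is passed in as an argument so the function can be tried without AWS. put_object with Bucket, Key and Body is the standard boto3 S3 client call.

```python
import json


def save_results(s3_client, bucket, key, records):
    """Serialize records as JSON and write them to the given bucket and key."""
    body = json.dumps(records, indent=2).encode("utf-8")
    s3_client.put_object(Bucket=bucket, Key=key, Body=body)
    return len(body)  # number of bytes written
```

Inside the handler you would create the client with boto3.client("s3") and pass in your own bucket name and key.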

Hello Docker

Now we’re ready to upload the scraper to Lambda.

But hold on. How do you know it will work? Lambda runs on Linux. It’s a different computing environment. You’ll want to test it locally on your machine simulating the Lambda environment. Otherwise it’s a pain to keep uploading files to Lambda with each change you make.

That’s where Docker comes in. It’s a tool that creates virtual machine containers that simulate the environments that you’ll deploy to. And there’s a handy Lambda container all ready for you to use.

So first, install Docker.

Then open the docker-compose.yml file you downloaded. This specifies the environment variables that you defined in Lambda earlier.

You’ll need to edit this if you want to read and write to S3 while testing from your local machine. Specifically, you need to add your S3 bucket name and AWS credentials.

version: '3'

services:
  lambda:
    build: .
    environment:
      - PYTHONPATH=/var/task/src:/var/task/lib
      - PATH=/var/task/bin
      - AWS_BUCKET_NAME=YOUR-BUCKET-NAME
      - AWS_ACCESS_KEY_ID=YOUR-AWS-KEY-ID
      - AWS_SECRET_ACCESS_KEY=YOUR-AWS-ACCESS-KEY
    volumes:
      - ./src/:/var/task/src/

Then continue below.
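
Quotes typed around a value in this list form can end up as part of the value itself, which breaks credentials in a way that is hard to spot. As a hedged sketch, a check like the one below could be run at the start of your script while testing locally. The function name and the REQUIRED tuple are mine; the variable names are the ones from the file above.

```python
import os

REQUIRED = ("AWS_BUCKET_NAME", "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def check_env(env=None, required=REQUIRED):
    """Return {name: problem} for variables that are missing or still quoted."""
    env = os.environ if env is None else env
    problems = {}
    for name in required:
        value = env.get(name, "")
        if not value:
            problems[name] = "missing"
        elif value[0] in "'\"" or value[-1] in "'\"":
            # A leading or trailing quote usually means it was copied literally.
            problems[name] = "wrapped in quotes"
    return problems


if __name__ == "__main__":
    print(check_env() or "environment looks fine")
```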

Hello Makefile

The gentlemen at 21Buttons created a lovely Makefile that does the job of automating the creation of a Docker container, running the scraper in the Docker environment, and packaging the files to upload to Lambda.

What’s a Makefile? It’s just a recipe of command-line commands that automates repetitive tasks. On a Mac, you’ll need to install the Xcode command line tools to use Makefiles. On Windows, it’s possible to run Makefiles with PowerShell, but I have no experience with this.

I won’t go through all the items in the Makefile, but you can read about it here.

You will need to change the Makefile to download the right versions of headless-chrome and chromedriver. Like I said in the intro, the versions used by 21Buttons didn’t work for me.

This is the Makefile that worked:

clean:
	rm -rf build build.zip
	rm -rf __pycache__

fetch-dependencies:
	mkdir -p bin/

	# Get chromedriver
	curl -SL https://chromedriver.storage.googleapis.com/2.37/chromedriver_linux64.zip > chromedriver.zip
	unzip chromedriver.zip -d bin/

	# Get Headless-chrome
	curl -SL https://github.com/adieuadieu/serverless-chrome/releases/download/v1.0.0-37/stable-headless-chromium-amazonlinux-2017-03.zip > headless-chromium.zip
	unzip headless-chromium.zip -d bin/

	# Clean
	rm headless-chromium.zip chromedriver.zip

docker-build:
	docker-compose build

docker-run:
	docker-compose run lambda src/lambda_function.lambda_handler

build-lambda-package: clean fetch-dependencies
	mkdir build
	cp -r src build/.
	cp -r bin build/.
	cp -r lib build/.
	pip install -r requirements.txt -t build/lib/.
	cd build; zip -9qr build.zip .
	cp build/build.zip .
	rm -rf build

Hello requirements.txt

You’ll also need to edit the requirements.txt file to download the Python libraries that work with the Chrome browser and driver.

This is the minimum you need to run Selenium with chromedriver, plus the boto3 library if you want read/write access to S3.

boto3==1.6.18
botocore==1.9.18
selenium==2.53.6
chromedriver-installer==0.0.6

Hello Python

You have your scraper code all ready to go, but you can’t just upload it as is. You have to set a bunch of options for the Chrome driver so it works on Lambda. Again, 21Buttons did the hard work of figuring out the right options.

Usually, when you run a Selenium scraper on your machine, it suffices to start it like this:

from selenium import webdriver

driver = webdriver.Chrome()

But we’re using a headless Chrome browser. That is, a browser that runs without a UI. So you need to specify this and a bunch of other things.

This is what you need to pass. Notice the very important line near the bottom that tells Lambda where to find the headless Chrome binary file. It will be packaged in a folder called bin.

from selenium import webdriver
import os

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument('--window-size=1280x1696')
chrome_options.add_argument('--user-data-dir=/tmp/user-data')
chrome_options.add_argument('--hide-scrollbars')
chrome_options.add_argument('--enable-logging')
chrome_options.add_argument('--log-level=0')
chrome_options.add_argument('--v=99')
chrome_options.add_argument('--single-process')
chrome_options.add_argument('--data-path=/tmp/data-path')
chrome_options.add_argument('--ignore-certificate-errors')
chrome_options.add_argument('--homedir=/tmp')
chrome_options.add_argument('--disk-cache-dir=/tmp/cache-dir')
chrome_options.add_argument('user-agent=Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36')
chrome_options.binary_location = os.getcwd() + "/bin/headless-chromium"

driver = webdriver.Chrome(chrome_options=chrome_options)

Finally, you need to give your file and function a name that Lambda will know how to run. By default, Lambda will look for a file called lambda_function.py and run a function called lambda_handler. You can call it whatever you want, just make sure you change this in the Lambda dashboard (under “Handler”) and in your Makefile, under the “docker-run” command.

So if your file is called crime_scraper.py and your main function is called main(), you need to change these values to crime_scraper.main (src/crime_scraper.main in the Makefile).

aws lambda handler
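
To show how the pieces fit together, here is a minimal sketch of a complete handler file. It is not my actual scraper: the trimmed list of flags, the helper names and the example.com URL are placeholders. Two things are worth noticing. Lambda calls the handler with two arguments (an event and a context), so the function has to accept both, and the browser is closed in a finally block so it does not linger after an error.

```python
import os


def chrome_arguments():
    """Return a trimmed-down list of the Chrome flags shown above."""
    return [
        "--headless",
        "--no-sandbox",
        "--disable-gpu",
        "--single-process",
        "--window-size=1280x1696",
        "--user-data-dir=/tmp/user-data",
        "--homedir=/tmp",
        "--disk-cache-dir=/tmp/cache-dir",
    ]


def make_driver():
    # Imported here so the module can be loaded without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for argument in chrome_arguments():
        options.add_argument(argument)
    options.binary_location = os.path.join(os.getcwd(), "bin", "headless-chromium")
    # chrome_options= matches the Selenium versions pinned in this post.
    return webdriver.Chrome(chrome_options=options)


def lambda_handler(event, context):
    """Entry point: Lambda always passes an event and a context object."""
    driver = make_driver()
    try:
        driver.get("https://example.com")  # placeholder URL
        return {"title": driver.title}
    finally:
        driver.quit()  # always release the browser, even after an error
```

Saved as src/lambda_function.py, this matches the default handler name, so nothing else needs to change.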

 

Putting it all together

Now you’re ready to test your scraper in your local Lambda environment and package everything.

1. Start Docker.

2. In your terminal, cd into your folder that has your Makefile and Docker files. Create a folder called src and put your Python scraper file in there.

3. Run this:

make docker-build

This will create the Lambda environment based on the Dockerfile, the docker-compose.yml you edited earlier and your requirements.txt file. It will also download all the necessary libraries and binaries that your scraper needs to run.

4. Now that you have a Lambda instance running on your machine, run your script:

make docker-run

Does it work? Excellent! Your code is ready to be uploaded to Lambda. If not, you have some debugging to do.

5. Create the zip package:

make build-lambda-package

This will create three folders:

bin: the location of your Chrome binary and web driver
lib: the Python libraries you need (Selenium, etc.)
src: your Python script

Then it will zip everything into a file called build.zip.
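
Before uploading, it can be worth confirming that the archive really has that layout, since a missing folder tends to show up later as a confusing import error. The following is a small sketch using Python’s standard zipfile module; the function names are hypothetical.

```python
import zipfile


def top_level_entries(zip_path):
    """Return the set of top-level folder or file names inside a zip archive."""
    with zipfile.ZipFile(zip_path) as archive:
        return {name.split("/", 1)[0] for name in archive.namelist()}


def missing_folders(zip_path, expected=("bin", "lib", "src")):
    """Return the expected top-level folders that are absent from the archive."""
    present = top_level_entries(zip_path)
    return [folder for folder in expected if folder not in present]
```

An empty list from missing_folders("build.zip") means all three folders made it into the package.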

Hello S3

The resulting file will be too big to upload directly to Lambda, since it has a limit of 50MB. So you’ll need to add it to an S3 bucket and tell Lambda where it is. Paste the link in the “S3 link URL” box you left blank earlier.
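
If you add more libraries, the package can also outgrow Lambda’s limit on the unzipped size, which as far as I know is 250 MB. Here is a rough sketch for checking both numbers before uploading; the function and constant names are my own.

```python
import os
import zipfile

DIRECT_UPLOAD_LIMIT = 50 * 1024 * 1024  # direct upload limit mentioned above
UNZIPPED_LIMIT = 250 * 1024 * 1024      # assumed unzipped package limit


def package_sizes(zip_path):
    """Return (zipped_bytes, unzipped_bytes) for a deployment package."""
    zipped = os.path.getsize(zip_path)
    with zipfile.ZipFile(zip_path) as archive:
        # file_size is the uncompressed size of each archive member.
        unzipped = sum(info.file_size for info in archive.infolist())
    return zipped, unzipped


if __name__ == "__main__" and os.path.exists("build.zip"):
    zipped, unzipped = package_sizes("build.zip")
    print("needs S3 upload:", zipped > DIRECT_UPLOAD_LIMIT)
    print("too big once unzipped:", unzipped > UNZIPPED_LIMIT)
```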

Save and observe

Your Lambda scraper is ready to be deployed! Click the big orange “Save” button at the top and that’s it! If there are errors, they will appear in your logs. Click on the “Monitoring” link at the top of your Lambda dashboard, and scroll to the Errors box.

Click on “Jump to logs” to see a detailed log of errors that might need debugging.

Lambda logs

 

96 thoughts on “Setting up a Selenium web scraper on AWS Lambda with Python”

  1. At section “Pull it all together”, one more step is needed before step 3, $ make fetch-dependencies

    • The `fetch-dependencies` command is actually called by the `build-lambda-package` command in the Makefile, so no need to run it separately.

      • It’s true that *build-lambda-package* command (i.e. step 5) does not require the step, but I think step 3 and step 4 need it. Andrew’s problem below seems to explain why the additional step is needed

      • I was able to get it to create the build.zip with “make build-lambda-package”, which was the instruction from the 21Buttons git repo, so you could try that. The “Makefile:20” in the error refers to line 20 of the Makefile, if that helps you debug. Be sure your docker-compose.yml is filled out correctly, that you have the dependencies installed to run the Makefile (unzip, curl, pip, etc.), and that you indent with tabs rather than spaces, since the Makefile errors out on spaces. Good luck!

  2. Roberto, Thank you so much for the article. It’s great place to practise.
    During executing command – Python
    “make docker-build” stuck on error:

    Step 7/10 : COPY bin ./bin
    ERROR: Service ‘lambda’ failed to build: COPY failed: stat /var/lib/docker/tmp/docker-builder283874387/bin: no such file or directory
    Makefile:20: recipe for target ‘docker-build’ failed
    make: *** [docker-build] Error 1

    Maybe you have advice on how to overcome the current error. Would be grateful, for any one, thank you

    • I never encountered this error and I’m really not a Docker expert to be able to help. I was lucky enough that it worked for me.

    • hey Andrew. I had the same issue. I found that I needed the original /lib folder (which has libgconf-2.so.4, libORBit-2.so.0) in it, then I could do the build. hope it works for you

    • You may try running $ make fetch-dependencies before proceeding to step 3. Without this step, bin directory is not created and thus COPY would fail.

      • This is the solution. The bin directory should be created before running docker-build. Check the Makefile: under fetch-dependencies there is a shell command that creates the bin directory.

    • For anyone else out there who happens to be struggling with this problem, the fix was actually pretty simple.

      At this point in the build, inside of my directory where I had my Dockerfile etc, there was a bin folder created, but not a lib folder. So it was trying to copy something to a folder that didn’t exist.

      To fix it, I just made a lib folder in this directory.
      Builds just fine now.

  3. Hey mate,

    Thank you for writing this guide. It’s a good one. I have managed to get it working with your help. However, I want to point out that there is no need to delete any files after cloning the pychromeless repo. Perhaps you should remove this:

    Download their repo onto your machine. The important files are these:

    Dockerfile
    Makefile
    docker-compose.yml
    requirements.txt
    The rest you can delete.

  4. Thanks a lot for this article. I am having a problem with adding beautiful soup to the requirements.txt file. I have tried ‘bs4=4.4.1’ but it has the error:

    Could not find a version that satisfies the requirement bs4==4.4.1 (from -r requirements.txt (line 5)) (from versions: 0.0.0, 0.0.1)
    No matching distribution found for bs4==4.4.1 (from -r requirements.txt (line 5))

    If you know how to solve this, I would be grateful if you let me know.

    Thanks again!

  5. Hi Roberto,
    Thanks for the great article and solution. Have you noticed this issue when you run it on Lambda? Locally on Docker it runs fine, but on AWS I got the error below:

    START RequestId: b9486b8e-cd32-11e8-b275-31fda3785429 Version: $LATEST
    module initialization error: Message: unknown error: unable to discover open pages
    (Driver info: chromedriver=2.32.498513 (2c63aa53b2c658de596ed550eb5267ec5967b351),platform=Linux 4.14.67-66.56.amzn1.x86_64 x86_64)

    Thanks,

      • I’ve had the same problem, could make it work only after altering the options of the driver to the following:

        options.headless = True
        options.add_argument('start-maximized')
        options.add_argument('--no-sandbox')
        options.add_argument('--disable-extensions')
        options.add_argument('--single-process')
        options.add_argument('--disable-dev-shm-usage')
        options.add_argument('--disable-gpu')
        options.add_argument('--ignore-certificate-errors')
        options.add_argument('--hide-scrollbars')
        options.binary_location = os.getcwd() + "/bin/headless-chromium"
        options.add_argument('--window-size=1280x1696')
        options.add_argument('--hide-scrollbars')

        And adding the code:

        driver.maximize_window()

        I hope this also works for you.

        All the best.

    • Hi Tom. I am getting this error “unable to discover open pages” error as well using the following versions:
      selenium==3.14.0
      chromedriver==2.43
      serverless-chrome==1.0.0-55
      AND
      python=3.6

      Did you ever find a fix for this. Cheers.

  6. I followed these directions (thank you so much for this!), but I get this error on make docker-run and I can’t seem to figure it out:

    docker-compose run lambda src/scraper.main
    START RequestId: 048bb802-bc93-425c-8b97-46a2898ed4d3 Version: $LATEST
    Exception ignored in: <bound method Service.__del__ of >
    Traceback (most recent call last):
    File “/var/task/lib/selenium/webdriver/common/service.py”, line 151, in __del__
    self.stop()
    File “/var/task/lib/selenium/webdriver/common/service.py”, line 123, in stop
    if self.process is None:
    AttributeError: ‘Service’ object has no attribute ‘process’
    module initialization error: Message: ‘chromedriver’ executable needs to be in PATH. Please see https://sites.google.com/a/chromium.org/chromedriver/home

    END RequestId: 048bb802-bc93-425c-8b97-46a2898ed4d3
    REPORT RequestId: 048bb802-bc93-425c-8b97-46a2898ed4d3 Duration: 181 ms Billed Duration: 200 ms Memory Size: 1536 MB Max Memory Used: 20 MB

    {“errorMessage”: “module initialization error”}
    Makefile:23: recipe for target ‘docker-run’ failed
    make: *** [docker-run] Error 1

    Thoughts?
    Thank you

    • I think it is – I followed your instructions exactly using the same docker-compose.yml. Chromedriver is in bin, my code in src, etc. I started trying different PATH settings but nothing is working. 🙁 I don’t understand why it would not work when I used exactly the same files as you. Maybe it has to do with my Docker version..?

  7. Hello,

    Thank you for the blog, I got it to work on lambda however when i check the screenshot uploaded to S3, it is just a blank screen. Would you be able to assist? My code is below:

    from selenium import webdriver
    import time
    import os
    import boto3

    def lambda_handler(event, context):
        chrome_options = webdriver.ChromeOptions()
        chrome_options.add_argument('--no-sandbox')
        chrome_options.add_argument('--disable-gpu')
        chrome_options.add_argument('--window-size=1280x1696')
        chrome_options.add_argument('--user-data-dir=/tmp/user-data')
        chrome_options.add_argument('--hide-scrollbars')
        chrome_options.add_argument('--enable-logging')
        chrome_options.add_argument('--log-level=0')
        chrome_options.add_argument('--v=99')
        chrome_options.add_argument('--single-process')
        chrome_options.add_argument('--data-path=/tmp/data-path')
        chrome_options.add_argument("user-agent=Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'")
        chrome_options.add_argument('--headless')
        chrome_options.add_argument('--ignore-certificate-errors')
        chrome_options.add_argument('--homedir=/tmp')
        chrome_options.add_argument('--disk-cache-dir=/tmp/cache-dir')
        chrome_options.binary_location = os.getcwd() + "/bin/headless-chromium"
        browser = webdriver.Chrome(chrome_options=chrome_options)
        browser.get("https://www.civitekflorida.com/ocrs/app/search.xhtml")
        time.sleep(5)
        pic = browser.get_screenshot_as_file('/tmp/FL.png')
        s3 = boto3.client('s3')
        filename = '/tmp/FL.png'
        bucketname = 'cdbucketswordy'
        s3.upload_file(filename, bucketname, 'FL.png')

    Thank you!

    • I’ve never taken screenshots before, so it’s not something I can help with. But looking at your code, it seems you’re saving the screenshot to the variable pic and not using that variable later. You’re just saving a blank png filename.

    • Hi, I was wondering how you were even able to get that code to run? I tried running the exact code you ran it and am getting an error. I am not sure how to run this with regular selenium web driver. It only seems to work with webdriver wrapper.

  8. Thanks for putting this together. It was immensely helpful. For my project I had to remove the single quotes from the environment AWS variables in docker-compose.yml to get my credentials to work correctly.

  9. Thanks for putting this together, I was wondering if there would be a problem (my guess is there would be) if my scraper actually uses a UI. I’ve been working on a project using scrapy and selenium to send and retrieve data from a website, I’ve been running this on my computer but as you very well said, it would be more convenient to have it running on a server.

    • If you want your scraper to be serverless, there’s no reason for it to have a UI. You won’t be able to see it anyway. There are several headless browsers for this, like Chrome, Firefox, and PhantomJS.

  10. I followed the tutorial but I was lost in the last point. I don’t have a main function, just a Python script with selenium and I don’t need to save/write any object to S3. What should be the name of my handler? blank? Leave it as lambda_handler?

      • I guess I can refactor my script to make it inside a function (even if it has no parameters). My script is very simple, it just books a class at 7 am. I cannot be on the computer at that time and they get all booked in 5 minutes.
        A bigger problem is that I have windows 10 Home and I cannot install Docker either. Is there a way to install headless Chrome in Lambda with the few python libraries I need (selenium, pandas, datetime, time and numpy), and then just copy paste the code in the Lambda editor?

    • It looks like Google changed the URLs of the hosted file and versions below 2.46 are no longer offered. I’m afraid you’ll have to experiment with different versions until you find one that works.

      Consult the website that you linked to to see the currently supported versions.

  11. For anyone still reading this tutorial (which is REALLY well written, btw): I don’t think chromedriver-installer==0.0.6 is actually needed in your requirements.txt, since chromedriver-installer merely downloads the chromedriver into your bin folder.

    One because it will keep throwing the following error (at least for me) that it isn’t able to find the chromedriver to download from the chromedriver link

    Two because when you do make fetch-dependencies you are already creating /bin with chromedriver & headless-chromium, and you need to do make fetch-dependencies in order to avoid some of the errors previously listed.

    @For the author – is it possible to list in the tutorial some of the compatible versions when it comes to chromedriver, selenium & chromium. Looks like last stable version of headless-chromium is <70, so make sure to use an older version of chromedriver & older version of selenium as well.

    • Thanks for the comment. I’m no expert in this, I’m just sharing what worked for me. If this is causing problems for others, I hope this will help them. As for listing compatible versions, I don’t think I’ll be testing all the combinations of versions that work. But if this is something you do, I’d be happy to update the post with your solutions.

  12. I was using the WRONG Dockerfile! I was using the one from lambci instead of 21buttons. Ahhh!

    Anyway, there were a few other things, like I had to delete the last line in your requirements.txt file like the other guy said.

    And I had to fix the lambda_function.py file by removing some of what 21buttons used. But it worked!

  13. My code works all the way until the last step, and then the AWS logs say:

    Unable to import module ‘lambda_function’: No module named ‘lambda_function’

    Not sure why yet…

  14. OK I got it! I had started over with a 2nd function and forgot to add the Lambda Environment variables back in.

    Then I also misspelled “PYTHONPATH”. Then, I’d also seen advice on other forums to zip all of the files (not just the main folder), so I did that for the files in the src folder, which didn’t seem to hurt.

    Then I ran “make build-lambda-package” in command line terminal again, uploaded the build.zip to the S3 Management Console, copied its path for the Amazon S3 link url in the Lambda Management Console, and saved it. Voila!

    The error went away and instead, now it’s printing out the comments from my lambda_function.py file into the CloudWatch Management Console logs. I hope this helps someone else. It took me a week to set this up!

  15. chromedriver-installer==0.0.6
    is deprecated; replace it in requirements.txt with:
    chromedriver-install==1.0.3

  16. I tried following this, and also adding the library pandas.

    The problem is that the unzipped file is above the Lambda maximum file size. Do you have any advice?

  17. It works perfectly on my docker environment, but when i upload to S3 and try to test, i get:
    “errorMessage”: “Unable to import module ‘lambda_function'”
    Can you help?

  18. Hello Roberto, how to run the scraper without docker? (Since it’s just for testing, i guess)

    Do I just zip the 3 folders (src, lib, bin) and upload them to Lambda? How do I specify the environment variables in Lambda? Thank you

  19. selenium.common.exceptions.WebDriverException: Message: Service /build/bin/chromedriver unexpectedly exited. Status code was: 127
    I get the above error on the below versions:
    chromedriver 2.43
    serverless-chrome 1.0.0-55
    selenium 3.14
    and
    serverless-chrome v. 0.0-37
    chromedriver v. 2.37
    Selenium for Python v. 2.53.6

    For the following error:
    selenium.common.exceptions.WebDriverException: Message: Can not connect to the Service /build/bin/chromedriver
    I get it on version:
    serverless-chrome v. 0.0-37
    chromedriver v. 2.37
    Selenium for Python v. 2.53.0

    I would sincerely appreciate any help. Thank you.

  20. I continue to get the following error when running the Lambda in AWS: “Message: ‘chromedriver’ executable may have wrong permissions.”
    I’ve tried using chmod, but the environment is read-only so that doesn’t work. Has anyone else seen this problem?

    • Hey guys,

      I have solved this by moving the chromedriver and headless chromium to a tmp folder and changing the permissions. This will work for you

      from selenium import webdriver
      from shutil import copyfile
      import os
      import time

      def permissions(origin_path, destiny_path):
          copyfile(origin_path, destiny_path)
          os.chmod(destiny_path, 0o775)

      def lambda_handler(*args, **kwargs):
          chrome_options = webdriver.ChromeOptions()
          chrome_options.add_argument('--headless')
          chrome_options.add_argument('--no-sandbox')
          chrome_options.add_argument('--disable-gpu')
          chrome_options.add_argument('--window-size=1280x1696')
          chrome_options.add_argument('--user-data-dir=/tmp/user-data')
          chrome_options.add_argument('--hide-scrollbars')
          chrome_options.add_argument('--enable-logging')
          chrome_options.add_argument('--log-level=0')
          chrome_options.add_argument('--v=99')
          chrome_options.add_argument('--single-process')
          chrome_options.add_argument('--data-path=/tmp/data-path')
          chrome_options.add_argument('--ignore-certificate-errors')
          chrome_options.add_argument('--homedir=/tmp')
          chrome_options.add_argument('--disk-cache-dir=/tmp/cache-dir')
          chrome_options.add_argument('user-agent=Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36')
          try:
              permissions(os.getcwd() + '/bin/chromedriver', '/tmp/chromedriver')
              permissions(os.getcwd() + '/bin/headless-chromium', '/tmp/headless-chromium')
          except:
              print("file is busy")
          finally:

  21. Hi Roberto,

    thanks a lot for the very useful post.
    I’m following exactly your instructions but I’m getting error on AWS Lambda that says
    ‘chromedriver’ executable may have wrong permissions
    or
    [Errno 30] Read-only file system: ‘/var/task/bin/chromedriver’
    Do you maybe know how to solve this issue?

  22. Is there a way to pass the aws token to the docker container to run a lambda function that saves to s3 locally ?

  23. Hi there, I have the environment built and works perfectly except that I cannot download files.
    In the code, there is a section talks about enable chrome headless download. However, no downloaded files were found.
    Very sad, I guess I need to find other solutions other than 21buttons.

  24. I followed the new versions (chromedriver 2.43, serverless-chrome 1.0.0-55, selenium 3.14) and also added all the Chrome options you mentioned, but I’m getting the error below.
    I’m just trying to test my Python code on plain EC2, without Docker or even Lambda, because I got stuck on the Makefile.
    selenium.common.exceptions.WebDriverException: Message: unknown error: Chrome failed to start: exited abnormally
    (Driver info: chromedriver=2.37.544315 (730aa6a5fdba159ac9f4c1e8cbc59bf1b5ce12b7),platform=Linux 4.15.0-1057-aws x86_64)

    I also tried with the newest Chrome but got the same error. Any idea?

  25. Hi mate great post.

    Do you know if headless chrome works differently to normal chrome

    The reason I ask is that if I scrape a site for example, https://car.gocompare.com/vehicle
    and send keys for a vehicle, in normal chrome it produces the reg details, if I use headless chrome it cannot find it,
    Is this down to the chrome driver and headless, or is it down to the website? How can you spot whats caused, btw it works in firefox.

    • In other words, how can you configure this to use regular web driver commands? It seems to give an error when I do that.

      • I guess you already figured it out, but for others who just got this error and haven’t:
        the Python function inside your scraper should accept two arguments, like this:

        def lambda_handler(*args, **kwargs):
            ...  # your scraping code using the driver

  26. Hey all!

    So I’ve gotten to the point (finally!) of running docker-build and having this thing build successfully, but when I try docker-run, I end up getting a TypeError saying lambda_handler() takes 0 positional arguments but 2 were given. Anyone have any advice? I’m a total newbie to AWS, Selenium, and Docker, so I’m amazed I’ve made it this far.

    • I guess you already figured it out, but for others who hit this error and haven’t yet:
      the Python function inside your scraper should accept two arguments, like this:

      def lambda_handler(*args, **kwargs):
          ...  # your scraping code using the driver
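
Lambda invokes a Python handler with two positional arguments, the event and the context, which is why a handler defined with no parameters fails with that TypeError. A minimal sketch using the conventional parameter names; the body and the return value are placeholders, not the scraper from the post:

```python
def lambda_handler(event, context):
    """Entry point invoked with the event payload and a context object."""
    # ... scraping code using the driver would go here ...
    # Placeholder return value so the result of a test run is visible.
    return {"status": "ok", "event_keys": sorted(event or {})}
```

Accepting (*args, **kwargs) works for the same reason: it simply absorbs those two arguments without naming them.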

  27. Is it possible to put several functions inside the Python scraper in the src folder? It would be great if anyone could share what the scraper and Lambda function should look like. Also, how can we export the scraped data to AWS?

  28. Took a long time to get this working but you are the only person on the internet with a decent guide on how to do this.

    I highly recommend to anyone trying this out.
    Keep the downloaded repo from the start exactly the same.
    Make SURE you’re using python 3.6, that’s important.

    Ignore all of the stuff this guy said to do with the makefile/requirements.txt and whatever. Just leave everything exactly as it was in the repo.
    I wasted so much time trying to change versions and get them compatible, and in the end I just redownloaded the repo and changed nothing, and it worked.

  29. “2. Click on “Attach Policy”. Search for S3. Choose AmazonS3FullAccess. This will allow the function to read and write to S3. Click on “Attach policy”. That’s it. When you go back to Lambda, you should now see S3 as a resource.”

    When I attach AmazonS3FullAccess and go back to Lambda, there is no change. S3 isn’t displayed as a resource. What might I be doing wrong?

  30. The tutorial is not working for me when I run ‘docker compose build’. I get this error:

    ERROR: Service ‘lambda’ failed to build : COPY failed: file not found in build context or excluded by .dockerignore: stat lib: file does not exist

  31. I keep getting the message “Unable to import module ‘src/lambda_function’: No module named ‘selenium.webdriver.common’”. Did anyone else see this? I think it must have to do with the PATH not being specified correctly, since it can’t find the Selenium modules, but I can’t figure out where it’s going wrong.

      • Hey MN, I’m having the same issue:
        Unable to import module ‘src.lambda_function’: No module named ‘src’
        Found the issue? Thanks!!

        • src is the name of the folder with your script, not the name of the script file. You call the function by punching in ‘file_name.function’. So if your file is scraper.py and the function that runs everything is main(), your function call in Lambda should be scraper.main
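
The handler setting described above is read as a module path and a function name separated by the last dot. The sketch below only illustrates that naming convention; split_handler is a hypothetical helper, not Lambda's own code:

```python
def split_handler(handler):
    """Split a handler string into (module_path, function_name)."""
    module_path, _, function_name = handler.rpartition(".")
    if not module_path or not function_name:
        raise ValueError("expected 'file_name.function', got %r" % handler)
    return module_path, function_name


print(split_handler("scraper.main"))
print(split_handler("src.lambda_function.lambda_handler"))
```

With scraper.py containing main(), the setting is scraper.main. If the file is kept inside a src folder, as in the error messages quoted in this thread, the module part would presumably include that folder as well.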

  32. Our application has a login page with Google CAPTCHA enabled. When I try to log in, it pops up an “are you a robot” message. I am using a headless Chrome browser. Any help here? Why does it say “are you a robot” on this setup (AWS Lambda, Python, headless Chrome)?

  33. Hey Roberto, thank you very much.
    Everything worked fine and make docker-build runs successfully. But when I try to run make docker-run, I get the following error:
    {"errorMessage": "Unable to import module 'src.lambda_function'"}
    I do have the Python file (lambda_function.py) inside the src folder, so I don’t understand what is wrong. Help? Thanks!

  34. Hi, I have the environment built and it works perfectly, except that I cannot download files. The file does not start downloading. Could you please help? I have been struggling with this for quite a long time. Thanks

  35. Excellent guide, was able to get things working with the help of some comments here as well.

    One issue I ran into was getting through a login page generated by JS. I was able to come to the conclusion that the issue had something to do with serverless-chrome (no issues when testing locally without the --headless option).

    Changing from v1.0.0-37 to v1.0.0-57 solved my issues.

    Just by changing this line in the Makefile:

    curl -SL https://github.com/adieuadieu/serverless-chrome/releases/download/v1.0.0-57/stable-headless-chromium-amazonlinux-2.zip > headless-chromium.zip

  36. Firstly, thank you Roberto for taking the time out to share your work and for providing a guide for taking this approach. It is much appreciated.

    To get mine working, I did the following:
    1. Created a bin folder in the repo. This solved the issue for Lambda service failure which I saw some of the others encountered.
    2. I encountered issues with doing a pip install on chromedriver_install. It produced an error saying it couldn’t find the latest version. I removed it from requirements.txt, went directly to https://sites.google.com/a/chromium.org/chromedriver/downloads, manually downloaded chromedriver, and added it to the bin folder created earlier.
    3. Did the same for headless Chromium by manually running curl -SL https://github.com/adieuadieu/serverless-chrome/releases/download/v1.0.0-57/stable-headless-chromium-amazonlinux-2.zip > headless-chromium.zip, then unzipped that file into the bin directory created earlier as well.
    4. The versions I used are as follows:
    chromedriver: 86.0.4240.111
    headless-chromium: v1.0.0-57
    selenium: 3.141.0
