I’ve recently fallen in love with Python’s standard calendar module. It has lots of functions to make handling dates a breeze. And for scraping data based on dates, it couldn’t be more convenient.
Take Environment Canada’s historical hourly data for Montreal. Each page holds the 24 hours of data for a single day. If I want to get the data for every day since the start, I have to loop through each day of each month of each year.
This becomes a pain when you have to account for months that have 30 or 31 days. Leap years add to the hassle. Python’s calendar module handles all this for you.
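To give a sense of what the module takes off your hands, here is a minimal sketch using calendar.monthrange and calendar.isleap, both part of the standard library. The years are just examples I picked.

```python
import calendar

# monthrange returns a tuple: (weekday of the 1st, number of days in the month)
_, days_in_feb_2015 = calendar.monthrange(2015, 2)
_, days_in_feb_2016 = calendar.monthrange(2016, 2)

print(days_in_feb_2015)       # 28
print(days_in_feb_2016)       # 29, because 2016 is a leap year
print(calendar.isleap(2016))  # True
```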
WORD OF WARNING: whenever scraping a website, be a good internet citizen. See my post on ethical web scraping for some guidelines.
First, look at the URL structure to see where you have to cycle through the dates. This URL takes you to the data for Feb. 8, 2015:
When scraping, the year has to be added after ‘&Year=’, the month after ‘&Month=’, and the day after ‘&Day=’.
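As a rough sketch of that URL structure, the helper below glues the three date parts onto a fixed prefix. The prefix is the one used in the scraping code later in this post; the name build_url is hypothetical, made up for this illustration.

```python
# Fixed part of the URL, copied from the scraping code later in this post
BASE_URL = ('http://climate.weather.gc.ca/climateData/hourlydata_e.html'
            '?timeframe=1&Prov=QC&StationID=30165'
            '&hlyRange=2008-01-08|2015-03-08')


def build_url(year, month, day):
    """Hypothetical helper: return the hourly-data URL for a single day."""
    return f'{BASE_URL}&Year={year}&Month={month}&Day={day}'


print(build_url(2015, 2, 8))
```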
The days are the tricky part, because they depend on the month. Here’s how the calendar module helps.
To start, we create a Calendar object:
import calendar
cal = calendar.Calendar()
Now we have access to the Calendar methods, such as itermonthdays, which return iterators over the dates you specify.
In fact, itermonthdays is just the method we need. You feed it two arguments — a year and a month — and it returns an iterator of all days in that month in that year.
>>> [d for d in cal.itermonthdays(2015, 2)]
[0, 0, 0, 0, 0, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 0]
But wait, what are all those zeros? Well, calendar works a lot like your wall calendar: it includes the entire week, starting on Monday by default, even if some days belong to the previous or next month. Those out-of-month days come back as zeros.
Displayed as a wall calendar, it looks like this (zeroes added in by me):
   February 2015
Mo Tu We Th Fr Sa Su
 0  0  0  0  0  0  1
 2  3  4  5  6  7  8
 9 10 11 12 13 14 15
16 17 18 19 20 21 22
23 24 25 26 27 28  0
This display, by the way, comes from another neat feature of the calendar module: the calendar.TextCalendar class. Call its prmonth (print month) method with the year and month.
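Here is a small sketch of that, using formatmonth, which returns the same text that prmonth prints. The variable names are my own.

```python
import calendar

text_cal = calendar.TextCalendar()  # weeks start on Monday by default

# formatmonth returns the month as a string; prmonth prints that same text
feb_2015 = text_cal.formatmonth(2015, 2)
print(feb_2015)
```

Note that TextCalendar leaves the out-of-month days blank rather than printing zeros.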
We can’t feed the URL zero dates, but we can filter out the zeroes in the list comprehension.
>>> [d for d in cal.itermonthdays(2015, 2) if d != 0]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28]
And voilà, a list of all days in Feb. 2015 without the zeroes. In our scraping code, this would look like:
import calendar
import requests

cal = calendar.Calendar()

# Everything in the URL up to the year; the date parts get appended below
base_url = 'http://climate.weather.gc.ca/climateData/hourlydata_e.html?timeframe=1&Prov=QC&StationID=30165&hlyRange=2008-01-08|2015-03-08&Year='
month_url = '&Month='
day_url = '&Day='

for year in range(2008, 2016):
    for month in range(1, 13):
        # Days of this month, with the zero padding filtered out
        monthdays = [d for d in cal.itermonthdays(year, month) if d != 0]
        for day in monthdays:
            r = requests.get(base_url + str(year) + month_url + str(month) + day_url + str(day))
            # .... etc ....
This can be used on all sorts of websites that take dates as URL parameters or in forms.
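The loop above requests every day of every year from 2008 through 2015, including days outside the range given in the URL’s hlyRange parameter. One possible refinement, sketched here under the assumption that hlyRange marks the first and last days with data, is to generate only the dates inside that range. It uses itermonthdates, a Calendar method that yields datetime.date objects; the names START, END and days_in_range are hypothetical.

```python
import calendar
from datetime import date

cal = calendar.Calendar()

# Assumed bounds, read off the hlyRange parameter in the URL above
START = date(2008, 1, 8)
END = date(2015, 3, 8)


def days_in_range(start, end):
    """Yield every date from start to end inclusive, month by month."""
    for year in range(start.year, end.year + 1):
        for month in range(1, 13):
            # itermonthdates also yields padding days from the neighbouring
            # months, so keep only the dates that belong to this month
            for d in cal.itermonthdates(year, month):
                if d.month == month and start <= d <= end:
                    yield d


dates = list(days_in_range(START, END))
print(dates[0], dates[-1], len(dates))
```

Each date’s year, month and day attributes can then be dropped into the URL, ideally with a pause between requests (time.sleep, for instance) to go easy on the server.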
2 thoughts on “Using Python’s calendar module for scraping date-based data”
I wish you had not left off with “etc”!
I’m new-ish to this and it is anything but obvious to me what comes next.
I probably should have been less ambiguous. But where I put “etc.” would be the scraping code, which would vary depending on the website you’re scraping. Every website has a different structure and requires a unique scraping approach. So there was nothing I could have written that would generalize scraping for any website.