
I’ve recently fallen in love with Python’s standard calendar module. It has lots of functions to make handling dates a breeze. And for scraping data based on dates, it couldn’t be more convenient.

Take Environment Canada’s historical hourly data for Montreal. Each page holds the 24 hourly readings for a single day. If I want the data for every day since the records begin, I have to loop through each day of each month of each year.

This becomes a pain when you have to account for months that have 30 or 31 days. Leap years add to the hassle. Python’s calendar module handles all this for you.

WORD OF WARNING: whenever scraping a website, be a good internet citizen. See my post on ethical web scraping for some guidelines.

First, look at the URL structure to see where you have to cycle through the dates. This URL takes you to the data for Feb. 8, 2015:

When scraping, the year has to be added after ‘&Year=’, the month after ‘&Month=’, and the day after ‘&Day=’.

The days are the tricky part, because the number of days depends on the month. Here’s how the calendar module helps.

First, we initiate a calendar object:
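As a minimal sketch, assuming the default settings (the variable name `cal` is just my choice for illustration):

```python
import calendar

# Create a plain Calendar object. With no argument, weeks start on
# Monday (firstweekday=0); pass calendar.SUNDAY for Sunday-first weeks.
cal = calendar.Calendar()
```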

Now we have access to the calendar’s methods, such as itermonthdays, which return iterators over the dates you specify.

In fact, itermonthdays is just the method we need. You feed it two arguments — a year and a month — and it returns an iterator of all days in that month in that year.
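Here is a small sketch for February 2015, assuming a default Calendar object (the names `cal` and `feb_days` are mine):

```python
import calendar

cal = calendar.Calendar()

# itermonthdays yields day numbers for whole weeks, so it is wrapped
# in list() here to see everything at once.
feb_days = list(cal.itermonthdays(2015, 2))
print(feb_days)
```

With the default Monday-first week, the list should start with six zeros before the 1 and end with a single zero after the 28.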

But wait, what are all those zeros? Well, calendar works a lot like your wall calendar: it includes entire weeks, starting on Monday by default, even if some days belong to the previous or next month. Those outside days are returned as zeros.

Displayed as a wall calendar, it looks like this (zeroes added in by me):

This, by the way, is another neat feature of the calendar module, called TextCalendar. To use it, create a calendar.TextCalendar() object and call its prmonth (print month) method, passing in the year and month.
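A rough sketch of that call, assuming default settings (`text_cal` and `month_text` are illustrative names). Note that the printed calendar leaves blanks where itermonthdays returns zeros:

```python
import calendar

text_cal = calendar.TextCalendar()

# prmonth prints the month as a text grid and returns nothing.
text_cal.prmonth(2015, 2)

# formatmonth returns the same text as a string, handy if you want
# to keep it rather than print it.
month_text = text_cal.formatmonth(2015, 2)
```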

We can’t feed the URL zero dates, but we can filter out the zeroes with a list comprehension.
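One way to write that filter, as a sketch (the names `cal` and `days` are my own):

```python
import calendar

cal = calendar.Calendar()

# Keep only real day numbers; zeros stand for days of adjacent months.
days = [day for day in cal.itermonthdays(2015, 2) if day != 0]
print(days)
```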

And voilà, a list of all days in Feb. 2015 without the zeroes. In our scraping code, this would look like:
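The sketch below only builds the URLs and does not fetch anything. `BASE_URL` is a placeholder I made up, not the real Environment Canada address, and `daily_urls` is a hypothetical helper. The real URL may carry other parameters before the Year, Month and Day ones, so adapt the query accordingly.

```python
import calendar
from urllib.parse import urlencode

# Placeholder address for illustration only (an assumption).
BASE_URL = "https://example.com/hourly_data"


def daily_urls(first_year, last_year):
    """Yield one URL per real calendar day from first_year to last_year."""
    cal = calendar.Calendar()
    for year in range(first_year, last_year + 1):
        for month in range(1, 13):
            for day in cal.itermonthdays(year, month):
                if day == 0:
                    # Zero marks a day belonging to an adjacent month.
                    continue
                query = urlencode({"Year": year, "Month": month, "Day": day})
                yield f"{BASE_URL}?{query}"


urls_2015 = list(daily_urls(2015, 2015))
print(len(urls_2015))
```

Month lengths and leap years are handled by itermonthdays, so the loop never needs to know how many days a month has. In a real scraper you would request each URL in turn and pause between requests.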

This can be used on all sorts of websites that take in dates as URL parameters or in forms.

Happy date-scraping.
