How to extract entities from raw text with spaCy: 3 approaches using Canadian data

TL;DR: Use the en_core_web_trf transformer model with spaCy to get much more accurate named entity recognition with multilingual text. Entity recognition is one of the marvels of current technology, at least from a journalist’s perspective. There was a time when journalists had to read through hundreds, maybe thousands of documents, highlight names of people, companies and […]

4 ways to make self-updating Datawrapper charts

Datawrapper is right now the best tool for creating quick and simple charts. It’s so useful and feature-rich that news organizations that had their own in-house charting tool are switching over. One of its best features is the ability to connect a CSV file hosted on the web as a data source. This enables users […]

Using NLP to analyze open-ended responses in surveys

One of the final frontiers of data analysis is making sense of unstructured text like reports and open-ended responses in surveys. Natural language processing (NLP), with the help of AI, is making this kind of analysis more accessible. Libraries like spaCy and Gensim, although still code-based, are simplifying the process of getting insights out of […]

How I made the Montreal street history map

Click here to see the map at Huffington Post Québec. First of all, a clarification. I did not really make that map. I adapted the code from Noah Veltman’s San Francisco history map, and made one for Montreal. Compare both maps, and you’ll see they are very similar in many ways. That said, the data sources […]

The best PyCon 2015 videos for journalists

PyCon is the world’s biggest conference for Python programmers, with great talks for both veterans and newcomers. And every year, organizers publish free videos of the talks and workshops for all to enjoy. Here is my selection of videos from this year’s conference in Montreal that I believe are of value for journalists who use […]

Using Python’s calendar module for scraping date-based data

I’ve recently fallen in love with Python’s standard calendar module. It has lots of functions to make handling dates a breeze. And for scraping data based on dates, it couldn’t be more convenient. Take Environment Canada’s historical hourly data for Montreal. Each page has 24 hours of data in a single day. If I want […]
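
As a rough illustration of the idea in this excerpt, here is a minimal sketch of how the calendar module can drive a day-by-day scrape. It assumes one page per day; the URL template and the `daily_urls` helper are hypothetical placeholders, not Environment Canada's actual address or query parameters.

```python
import calendar

# Hypothetical URL template: the real site's address and parameters
# are not given in the excerpt, so this is only a placeholder.
URL_TEMPLATE = "https://example.com/hourly?year={year}&month={month}&day={day}"


def daily_urls(year, month):
    """Return one URL per day of the given month (hypothetical helper)."""
    # monthrange returns (weekday of the first day, number of days in the month),
    # so leap years and month lengths are handled for us.
    _, days_in_month = calendar.monthrange(year, month)
    return [
        URL_TEMPLATE.format(year=year, month=month, day=day)
        for day in range(1, days_in_month + 1)
    ]


if __name__ == "__main__":
    for url in daily_urls(2015, 2):
        print(url)
```

Each URL could then be passed to whatever fetching code the scraper uses. The point of the sketch is only that `calendar.monthrange` removes the need to hard-code how many days each month has.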