Wanna skip the blabla and get right to the code? Access the Colab notebook here. I recently took a short course on DeepLearning.ai called Pair Programming with LLMs, where you learn how to use Google’s PaLM2 language model to help write, debug, and explain code within a Jupyter Notebook environment. Well, PaLM is old news, […]
How to use ChatGPT Vision to turn handwritten forms into data
Takeaways: ChatGPT can turn handwritten forms into data, even with sloppy handwriting. Defining a schema for the desired output helps. It makes mistakes, so the output still needs to be validated and possibly fixed by hand. It can't be automated with the API yet; images still have to be uploaded manually to the web application, with a limit of four images per upload […]
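As a quick illustration of the schema idea, here is a minimal sketch of the kind of output definition you could paste into the ChatGPT web app along with the form images. The field names are hypothetical placeholders, not the ones used in the post.

```python
# Minimal sketch (hypothetical fields): a schema prompt to paste into the
# ChatGPT web app together with the uploaded form images, so the model
# returns data in a predictable shape.
schema_prompt = """
Extract each handwritten form into a JSON object with exactly these fields:
{
  "name": string,
  "date": "YYYY-MM-DD",
  "address": string,
  "amount": number,
  "signature_present": true or false
}
Return one JSON object per form and use null for unreadable fields.
"""
print(schema_prompt)
```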
Using ChatGPT to clean data: an experiment
One of the most annoying parts of data work is dealing with inconsistent entities: the same person's name spelled different ways, or company names that rebranded, merged, or carry varying suffixes like “Ltd.” and “Limited”. Standardizing data for accurate analysis can take days, sometimes weeks, even with powerful tools like OpenRefine and Dedupe, which were made […]
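To give a sense of the general approach, here is a minimal sketch of prompting ChatGPT to standardize messy company names. It is not the post's experiment: the sample names and prompt wording are invented, and it uses the pre-1.0 openai Python library interface.

```python
# Minimal sketch (invented sample data): asking ChatGPT to map messy company
# names to a canonical form. Uses the pre-1.0 openai library (pip install "openai<1").
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

messy_names = [
    "Acme Ltd.", "ACME Limited", "Acme Ltd", "Acme Incorporated", "Beta Corp."
]

prompt = (
    "Group the following company names that refer to the same entity and "
    "return a CSV with two columns: original_name, standardized_name.\n\n"
    + "\n".join(messy_names)
)

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,  # deterministic output is easier to validate
)

print(response.choices[0].message.content)
```

As with any LLM output, the resulting mapping still needs to be spot-checked by hand.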
How to extract entities from raw text with Spacy: 3 approaches using Canadian data
TL;DR: Use the en_core_web_trf transformer model with Spacy to get much more accurate named entity recognition with multilingual text. Entity recognition is one of the marvels of current technology, at least from a journalist’s perspective. There was a time journalists had to read through hundreds, maybe thousands of documents, highlighting names of people, companies and […]
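Here is a minimal sketch of the transformer-based approach: load the en_core_web_trf model and pull out the entities it finds. The sample sentence is my own placeholder, not one of the Canadian datasets discussed in the post.

```python
# Minimal sketch: named entity recognition with spaCy's transformer model.
# Requires: pip install spacy, then: python -m spacy download en_core_web_trf
import spacy

nlp = spacy.load("en_core_web_trf")

# Placeholder sentence, not from the post's data
text = (
    "The Canada Revenue Agency awarded a contract to Deloitte LLP "
    "in Ottawa on March 3, 2021."
)
doc = nlp(text)

# Print each detected entity with its label (PERSON, ORG, GPE, DATE, ...)
for ent in doc.ents:
    print(ent.text, ent.label_)
```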
Getting tabular data from unstructured text with GPT-3: an ongoing experiment
One of the most exciting applications of AI in journalism is the creation of structured data from unstructured text. Government reports, legal documents, emails, memos… these are rich with content like names, organizations, dates, and prices. But to get them into a format that can be analyzed and counted, like a spreadsheet, usually involves days […]
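As a rough illustration of the idea, here is a minimal sketch of prompting GPT-3 to return rows of structured data from a snippet of text. The sample text, prompt wording, and model name are my own assumptions, and it uses the pre-1.0 openai Python library's completions interface rather than whatever the ongoing experiment settles on.

```python
# Minimal sketch (invented sample text): asking GPT-3 to return a small table
# from unstructured text. Uses the pre-1.0 openai library (pip install "openai<1").
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

text = (
    "On June 4, the department paid Acme Consulting Ltd. $48,000 for IT services. "
    "On June 12, it paid Beta Corp $12,500 for staff training."
)

prompt = (
    "Extract a table from the text below with the columns "
    "date, vendor, amount, description. Return it as CSV.\n\n" + text
)

response = openai.Completion.create(
    model="text-davinci-003",  # GPT-3-era completions model
    prompt=prompt,
    max_tokens=200,
    temperature=0,  # deterministic output for data extraction
)

print(response.choices[0].text.strip())
```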
Someone at Shared Services Canada is REALLY into curling
And other insights from 7 years of anonymous Wikipedia edits by government employees
4 ways to make self-updating Datawrapper charts
Datawrapper is right now the best tool for creating quick and simple charts. It’s so useful and feature-rich that news organizations that built their own in-house charting tools are switching over. One of its best features is the ability to connect a CSV file hosted on the web as a data source. This enables users […]
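One minimal way to make such a chart self-update, assuming the chart is linked to a CSV at a URL you control, is to regenerate that CSV on a schedule. The data, filename, and hosting setup below are placeholders, not a specific recipe from the post.

```python
# Minimal sketch (placeholder data and path): regenerate the CSV that a
# Datawrapper chart uses as an external data source. When the file changes
# at its public URL, the linked chart picks up the new numbers.
import pandas as pd

# Hypothetical fresh data; in practice this would come from an API or database
data = pd.DataFrame({
    "date": ["2024-01-01", "2024-01-02", "2024-01-03"],
    "cases": [120, 134, 128],
})

# Write to a location that is served over the web, e.g. a folder published
# via GitHub Pages or an S3 bucket (the filename here is a placeholder)
data.to_csv("chart-data.csv", index=False)

# Run this script on a schedule (cron, GitHub Actions, etc.) so the chart
# always reflects the latest data without being re-edited in Datawrapper.
```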
Using NLP to analyze open-ended responses in surveys
One of the final frontiers of data analysis is making sense of unstructured text like reports and open-ended responses in surveys. Natural language processing (NLP), with the help of AI, is making this kind of analysis more accessible. Libraries like spaCy and Gensim, although still code-based, are simplifying the process of getting insights out of […]
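Here is a minimal sketch of that kind of pipeline: clean up open-ended answers with spaCy, then look for rough themes with Gensim's LDA topic model. The sample responses are invented, and a real survey would need far more answers and tuning.

```python
# Minimal sketch (invented responses): preprocess open-ended survey answers
# with spaCy, then find rough themes with Gensim's LDA topic model.
# Requires: pip install spacy gensim && python -m spacy download en_core_web_sm
import spacy
from gensim import corpora, models

responses = [
    "The wait times at the clinic were far too long.",
    "Staff were friendly but the waiting room was crowded.",
    "I would like more evening appointments to be available.",
]

nlp = spacy.load("en_core_web_sm")

# Keep lemmas of content words only (drop stop words and punctuation)
texts = [
    [tok.lemma_.lower() for tok in nlp(r) if not tok.is_stop and tok.is_alpha]
    for r in responses
]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

# A tiny LDA model; real surveys need many more responses and topics
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, random_state=0)
for topic_id, words in lda.print_topics():
    print(topic_id, words)
```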
How data and transparency can restore trust in journalism: a speech to the Concordia Library Research Forum
This is the text of the keynote speech I delivered at the 2018 Concordia Library Research Forum. It has been edited slightly. Thank you for this lovely opportunity to be among you this morning. I feel especially honoured to deliver this keynote because I believe librarians and journalists share a special kinship: our jobs […]
Setting up a Selenium web scraper on AWS Lambda with Python
IMPORTANT UPDATE: This post is outdated now that AWS Lambda allows users to create and distribute layers with all sorts of plugins and packages, including Selenium and chromedriver. This simplifies much of the process. Here’s a post on how to make such a layer, and here’s a list of useful pre-packaged layers. This post […]
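For orientation, here is a minimal sketch of what a Lambda handler using Selenium with headless Chrome can look like once a suitable layer is attached. The /opt/... binary paths are assumptions that depend entirely on the layer you use, and the constructor is the Selenium 3-style call; adjust both to match your setup.

```python
# Minimal sketch of a Lambda handler driving headless Chrome via Selenium.
# The binary paths below are assumptions; they depend on the Chrome/chromedriver
# layer attached to the function.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

def handler(event, context):
    options = Options()
    options.binary_location = "/opt/headless-chromium"  # assumed path from the layer
    options.add_argument("--headless")
    options.add_argument("--no-sandbox")
    options.add_argument("--single-process")
    options.add_argument("--disable-dev-shm-usage")

    # Selenium 3-style constructor; Selenium 4 wraps the driver path in a Service()
    driver = webdriver.Chrome("/opt/chromedriver", options=options)
    try:
        driver.get("https://example.com")  # placeholder URL
        title = driver.title
    finally:
        driver.quit()

    return {"statusCode": 200, "body": title}
```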