One of the most annoying parts of data work is dealing with inconsistent entities: names of the same person spelled differently. Company names that rebranded, merged, or have varying suffixes like “Ltd.” and “Limited”.
Standardizing data for accurate analysis can take days, sometime weeks, even with powerful tools like OpenRefine and Dedupe, which were made for this purpose.
With generative AI becoming so powerful, can it be used to clean up messy datasets with less human intervention?
I put ChatGPT to the test on several data cleaning tasks.
TL;DR: It shows great potential, but like with anything critical, it’s not quite ready to be trusted 100%, at least not on the current models (I’m using the free ChatGPT version with the older GPT-3.5 model). Accuracy might also improve with better-crafted prompts.
Standardizing company names
I gave ChatGPT a list of company names that I know have either changed or been acquired. I wanted to see how well it knows the more recent name and the parent company.
Today we are going to standardize company names. Create a table with the company names provided and another column with that company’s most recent name. Add a third column with the company’s parent company. Only provide the company name without legal abbreviations like “Ltd.”, “Corp.”, “LLC” or other suffixes.
Grand Trunk Western Railroad
General Dynamics Mission Systems–Canada
Nintendo of America
AWS web services
Grand River Railway
|Original Name||Most Recent Name||Parent Company|
|Bell Canada||Bell Canada||BCE Inc.|
|Rogers Media||Rogers Sports & Media||Rogers Communications|
|Koodo Mobile||Koodo Mobile||Telus Communications|
|Grand Trunk Western Railroad||Canadian National Railway||Canadian National Railway|
|Molson Brewery||Molson Coors Brewing Company||Molson Coors|
|General Dynamics Mission Systems–Canada||General Dynamics Mission Systems Canada||General Dynamics|
|Google Canada||Google Canada||Alphabet Inc.|
|Facebook Ltd.||Facebook Inc.|
|Nintendo of America||Nintendo of America||Nintendo Co., Ltd.|
|Rio Tinto||Rio Tinto||Rio Tinto Group|
|Teck Cominco||Teck Resources||Teck Resources Limited|
|CP Rail||Canadian Pacific Railway||Canadian Pacific Railway|
|Encana Corp.||Ovintiv Inc.||Ovintiv Inc.|
|Canadian Natural||Canadian Natural Resources Limited||Canadian Natural Resources|
|L3 Technologies||L3Harris Technologies||L3Harris Technologies|
|Storm Resources||ARC Resources Ltd.||ARC Resources Ltd.|
|Ubisoft Montreal||Ubisoft Montreal||Ubisoft Entertainment|
|AWS web services||Amazon Web Services||Amazon.com, Inc.|
|RIM||BlackBerry Limited||BlackBerry Limited|
|Abitibi Consolidated||Resolute Forest Products||Resolute Forest Products|
|Future Shop||Best Buy Canada||Best Buy Co., Inc.|
|Videotron Ltd.||Videotron||Quebecor Inc.|
|RBC Securities||RBC Capital Markets||Royal Bank of Canada|
|Zellers||Target Canada||Target Corporation|
|Grand River Railway||Canadian Pacific Railway||Canadian Pacific Railway|
For the most part, it was spot on. It knows that companies like Zellers and Future Shop were acquired and it knows their parent companies. It also knows that Encana rebranded to Ovintiv.
BUT it got one thing very wrong. Storm Resources was acquired by Canadian Natural Resources, not ARC. It also didn’t put Meta as Facebook’s parent company. Although minor, it didn’t follow the instructions to exclude legal suffixes like “Ltd.”
Standardizing people names
I gave ChatGPT a list of names. Some of them are correct, others have minor typos, middle initials, and nicknames. I asked it to standardize the names to the same form.
Below is a list of names. Some of those names are repeated but have spelling mistakes, nicknames and middle initials. Create a table with these names and add a column with a standardized version of the name: fix typos, remove middle initials, and convert nicknames to proper first names.
Navdeep Singh Bains
Jason T. Kenney
Helen A Rosenthal
Jean Marc Fournier
Response: (fixed names in bold)
|Original Name||Standardized Name|
|Navdeep Singh Bains||Navdeep Bains|
|Jason Kenney||Jason Kenney|
|Mary Ng||Mary Ng|
|Thomas McKenzie||Thomas McKenzie|
|Edward Gallagher||Edward Gallagher|
|Harjit Sajjan||Harjit Sajjan|
|Mary Ngg||Mary Ng|
|Helen Rosenthal||Helen Rosenthal|
|Sam Ho||Samuel Ho|
|Jason T. Kenney||Jason Kenney|
|Helen A Rosenthal||Helen Rosenthal|
|Tom McKenzie||Thomas McKenzie|
|Edward Gallager||Edward Gallagher|
|Harjitt Sajan||Harjit Sajjan|
|Helen Rosental||Helen Rosenthal|
|Samuel Ho||Samuel Ho|
|Francois Pelletier||François Pelletier|
|Jean Marc Fournier||Jean-Marc Fournier|
|Marisa Lopez||Marisa Lopez|
|Jean-Marc Fournier||Jean-Marc Fournier|
|François Pelletier||François Pelletier|
|Thomas Mackenzie||Thomas McKenzie|
|Marissa Lopez||Marisa Lopez|
For this small sample, ChatGPT did exactly what was asked of it. Typos were fixed, initials stripped, and nicknames converted to proper first names. French compound names were hyphenated correctly.
However, there’s a danger inherent in standardizing names like this. Helen Rosenthal and Helen A Rosenthal could indeed be different people. There could certainly be a Marisa Lopez and an unrelated Marissa. Although this could be a big time saver, it has to be done cautiously and respectful of the data’s context.
Enriching data: assigning NAICS codes to company names
ChatGPT is not only useful at fixing data, but it can also enrich data by adding new fields. This is called feature engineering in data science and it can be laborious if two datasets don’t match perfectly.
Suppose I have a list of companies and I want to know their industrial sector. The NAICS industry classification system is a great standardized list of company types, but there are no readily-available resources that map companies to sector codes. ChatGPT makes this trivial.
Now we’re going to classify companies by industrial sector. Below is a list of companies. Create a table with the company name and add a column with its three-digit NAICS code. Add a third column with the NAICS subsector name.
Canadian Natural Resources
|Company Name||NAICS Code||NAICS Subsector Name|
|Rogers Media||515||Broadcasting (except Internet)|
|General Dynamics||336||Aerospace Product and Parts Manufacturing|
|Google Canada||518||Data Processing, Hosting, and Related Services|
|Meta Ltd.||518||Data Processing, Hosting, and Related Services|
|Rio Tinto||212||Mining (except Oil and Gas)|
|CP Rail||482||Rail Transportation|
|Canadian Natural Resources||211||Oil and Gas Extraction|
|L3Harris||334||Computer and Electronic Product Manufacturing|
|Ubisoft||511||Publishing Industries (except Internet)|
|AWS||518||Data Processing, Hosting, and Related Services|
|Scotiabank||522||Credit Intermediation and Related Activities|
|CGI Inc.||541||Professional, Scientific, and Technical Services|
|Hewlett Packard||334||Computer and Electronic Product Manufacturing|
|Ikea||442||Furniture and Home Furnishing Merchant Wholesalers|
|Raytheon||334||Computer and Electronic Product Manufacturing|
|Broil King||333||Machinery Manufacturing|
|Volkswagen||336||Aerospace Product and Parts Manufacturing|
ChatGPT got the NAICS code for every company right. The subsector names are iffy, but this shows the importance of a good prompt. I asked ChatGPT to give me the subsector name, but didn’t specify the granularity. So sometimes it gave me a general three-digit descriptor, but sometimes it went deeper down the hierarchy.
Take Molson Coors. The 312 NAICS code is “Beverage and tobacco product manufacturing”, but ChatGPT chose the more granular five-digit subsector of “Breweries” for the descriptor.
Likewise, Volkswagen is certainly 336 (Transportation equipment manufacturing), and it’s primarily an auto manufacturer, although it has entered the electric vertical-takeoff-and-landing (eVTOL) market. Maybe this is why it got labelled as an aerospace manufacturer?
Also, large conglomerate can be tricky to categorize. A huge defense contractor like L3Harris, which makes military drones and satellite communications systems, was given the general “Computer and Electronic Product Manufacturing” label.
Processing data with ChatGPT shows a lot of promise, but it’s still not ready to be fully automated and trusted. Maybe future, better models will solve these problems, or maybe it’s all about crafting better prompts.
Also, these were small testing samples. Piping large datasets through the AI can be tricky, require the use of coding APIs, and can be a lot harder to validate.
Have you used LLMs to clean and enrich data? Share your experiences below.
One thought on “Using ChatGPT to clean data: an experiment”