a mop cleaning a wooden floor
Credit

One of the most annoying parts of data work is dealing with inconsistent entities: names of the same person spelled differently. Company names that rebranded, merged, or have varying suffixes like “Ltd.” and “Limited”.

Standardizing data for accurate analysis can take days, sometime weeks, even with powerful tools like OpenRefine and Dedupe, which were made for this purpose.

With generative AI becoming so powerful, can it be used to clean up messy datasets with less human intervention?

I put ChatGPT to the test on several data cleaning tasks.

TL;DR: It shows great potential, but like with anything critical, it’s not quite ready to be trusted 100%, at least not on the current models (I’m using the free ChatGPT version with the older GPT-3.5 model). Accuracy might also improve with better-crafted prompts.

Standardizing company names

I gave ChatGPT a list of company names that I know have either changed or been acquired. I wanted to see how well it knows the more recent name and the parent company.

Prompt:

Today we are going to standardize company names. Create a table with the company names provided and another column with that company’s most recent name. Add a third column with the company’s parent company. Only provide the company name without legal abbreviations like “Ltd.”, “Corp.”, “LLC” or other suffixes.

Bell Canada
Rogers Media
Koodo Mobile
Grand Trunk Western Railroad
Molson Brewery
General Dynamics Mission Systems–Canada
Google Canada
Facebook Ltd.
Nintendo of America
Rio Tinto
Teck Cominco
CP Rail
Encana Corp.
Canadian Natural
L3 Technologies
Storm Resources
Ubisoft Montreal
AWS web services
RIM
Abitibi Consolidated
Future Shop
Videotron Ltd.
RBC Secutiries
Zellers
Grand River Railway

Response:

Original NameMost Recent NameParent Company
Bell CanadaBell CanadaBCE Inc.
Rogers MediaRogers Sports & MediaRogers Communications
Koodo MobileKoodo MobileTelus Communications
Grand Trunk Western RailroadCanadian National RailwayCanadian National Railway
Molson BreweryMolson Coors Brewing CompanyMolson Coors
General Dynamics Mission Systems–CanadaGeneral Dynamics Mission Systems CanadaGeneral Dynamics
Google CanadaGoogle CanadaAlphabet Inc.
Facebook Ltd.FacebookFacebook Inc.
Nintendo of AmericaNintendo of AmericaNintendo Co., Ltd.
Rio TintoRio TintoRio Tinto Group
Teck ComincoTeck ResourcesTeck Resources Limited
CP RailCanadian Pacific RailwayCanadian Pacific Railway
Encana Corp.Ovintiv Inc.Ovintiv Inc.
Canadian NaturalCanadian Natural Resources LimitedCanadian Natural Resources
L3 TechnologiesL3Harris TechnologiesL3Harris Technologies
Storm ResourcesARC Resources Ltd.ARC Resources Ltd.
Ubisoft MontrealUbisoft MontrealUbisoft Entertainment
AWS web servicesAmazon Web ServicesAmazon.com, Inc.
RIMBlackBerry LimitedBlackBerry Limited
Abitibi ConsolidatedResolute Forest ProductsResolute Forest Products
Future ShopBest Buy CanadaBest Buy Co., Inc.
Videotron Ltd.VideotronQuebecor Inc.
RBC SecuritiesRBC Capital MarketsRoyal Bank of Canada
ZellersTarget CanadaTarget Corporation
Grand River RailwayCanadian Pacific RailwayCanadian Pacific Railway

Comments:

For the most part, it was spot on. It knows that companies like Zellers and Future Shop were acquired and it knows their parent companies. It also knows that Encana rebranded to Ovintiv.

BUT it got one thing very wrong. Storm Resources was acquired by Canadian Natural Resources, not ARC. It also didn’t put Meta as Facebook’s parent company. Although minor, it didn’t follow the instructions to exclude legal suffixes like “Ltd.”

Standardizing people names

I gave ChatGPT a list of names. Some of them are correct, others have minor typos, middle initials, and nicknames. I asked it to standardize the names to the same form.

Prompt:

Below is a list of names. Some of those names are repeated but have spelling mistakes, nicknames and middle initials. Create a table with these names and add a column with a standardized version of the name: fix typos, remove middle initials, and convert nicknames to proper first names.

Navdeep Singh Bains
Jason Kenney
Mary Ng
Thomas McKenzie
Edward Gallagher
Harjit Sajjan
Mary Ngg
Helen Rosenthal
Sam Ho
Jason T. Kenney
Helen A Rosenthal
Tom McKenzie
Navdeep Bains
Edward Gallager
Harjitt Sajan
Helen Rosental
Samuel Ho
Francois Pelletier
Jean Marc Fournier
Marisa Lopez
Jean-Marc Fournier
François Pelletier
Thomas Mackenzie
Marissa Lopez

Response: (fixed names in bold)

Original NameStandardized Name
Navdeep Singh BainsNavdeep Bains
Jason KenneyJason Kenney
Mary NgMary Ng
Thomas McKenzieThomas McKenzie
Edward GallagherEdward Gallagher
Harjit SajjanHarjit Sajjan
Mary NggMary Ng
Helen RosenthalHelen Rosenthal
Sam HoSamuel Ho
Jason T. KenneyJason Kenney
Helen A RosenthalHelen Rosenthal
Tom McKenzieThomas McKenzie
Edward GallagerEdward Gallagher
Harjitt SajanHarjit Sajjan
Helen RosentalHelen Rosenthal
Samuel HoSamuel Ho
Francois PelletierFrançois Pelletier
Jean Marc FournierJean-Marc Fournier
Marisa LopezMarisa Lopez
Jean-Marc FournierJean-Marc Fournier
François PelletierFrançois Pelletier
Thomas MackenzieThomas McKenzie
Marissa LopezMarisa Lopez

Comment:

For this small sample, ChatGPT did exactly what was asked of it. Typos were fixed, initials stripped, and nicknames converted to proper first names. French compound names were hyphenated correctly.

However, there’s a danger inherent in standardizing names like this. Helen Rosenthal and Helen A Rosenthal could indeed be different people. There could certainly be a Marisa Lopez and an unrelated Marissa. Although this could be a big time saver, it has to be done cautiously and respectful of the data’s context.

Enriching data: assigning NAICS codes to company names

ChatGPT is not only useful at fixing data, but it can also enrich data by adding new fields. This is called feature engineering in data science and it can be laborious if two datasets don’t match perfectly.

Suppose I have a list of companies and I want to know their industrial sector. The NAICS industry classification system is a great standardized list of company types, but there are no readily-available resources that map companies to sector codes. ChatGPT makes this trivial.

Prompt:

Now we’re going to classify companies by industrial sector. Below is a list of companies. Create a table with the company name and add a column with its three-digit NAICS code. Add a third column with the NAICS subsector name.

Bell Canada
Rogers Media
Molson Coors
General Dynamics
Google Canada
Meta Ltd.
Rio Tinto
CP Rail
Canadian Natural Resources
L3Harris
Ubisoft
AWS
Scotiabank
CGI Inc.
Hewlett Packard
Ikea
Raytheon
Broil King
Volkswagen

Response:

Company NameNAICS CodeNAICS Subsector Name
Bell Canada517Telecommunications
Rogers Media515Broadcasting (except Internet)
Molson Coors312Breweries
General Dynamics336Aerospace Product and Parts Manufacturing
Google Canada518Data Processing, Hosting, and Related Services
Meta Ltd.518Data Processing, Hosting, and Related Services
Rio Tinto212Mining (except Oil and Gas)
CP Rail482Rail Transportation
Canadian Natural Resources211Oil and Gas Extraction
L3Harris334Computer and Electronic Product Manufacturing
Ubisoft511Publishing Industries (except Internet)
AWS518Data Processing, Hosting, and Related Services
Scotiabank522Credit Intermediation and Related Activities
CGI Inc.541Professional, Scientific, and Technical Services
Hewlett Packard334Computer and Electronic Product Manufacturing
Ikea442Furniture and Home Furnishing Merchant Wholesalers
Raytheon334Computer and Electronic Product Manufacturing
Broil King333Machinery Manufacturing
Volkswagen336Aerospace Product and Parts Manufacturing

Comment:

ChatGPT got the NAICS code for every company right. The subsector names are iffy, but this shows the importance of a good prompt. I asked ChatGPT to give me the subsector name, but didn’t specify the granularity. So sometimes it gave me a general three-digit descriptor, but sometimes it went deeper down the hierarchy.

Take Molson Coors. The 312 NAICS code is “Beverage and tobacco product manufacturing”, but ChatGPT chose the more granular five-digit subsector of “Breweries” for the descriptor.

Likewise, Volkswagen is certainly 336 (Transportation equipment manufacturing), and it’s primarily an auto manufacturer, although it has entered the electric vertical-takeoff-and-landing (eVTOL) market. Maybe this is why it got labelled as an aerospace manufacturer?

Also, large conglomerate can be tricky to categorize. A huge defense contractor like L3Harris, which makes military drones and satellite communications systems, was given the general “Computer and Electronic Product Manufacturing” label.

Conclusion

Processing data with ChatGPT shows a lot of promise, but it’s still not ready to be fully automated and trusted. Maybe future, better models will solve these problems, or maybe it’s all about crafting better prompts.

Also, these were small testing samples. Piping large datasets through the AI can be tricky, require the use of coding APIs, and can be a lot harder to validate.

Have you used LLMs to clean and enrich data? Share your experiences below.

4 thoughts on “Using ChatGPT to clean data: an experiment

  1. Tried again with GPT 4 – I’m struggling with latest model at the moment with this task
    Here is a table classifying the listed companies by their respective three-digit NAICS codes and corresponding subsector names:

    Company Name NAICS Code NAICS Subsector
    Bell Canada 517 Telecommunications
    Rogers Media 517 Telecommunications
    Molson Coors 312 Beverage and Tobacco Product Manufacturing
    General Dynamics 336 Transportation Equipment Manufacturing
    Google Canada 518 Data Processing, Hosting, and Related Services
    Meta Ltd. 519 Other Information Services
    Rio Tinto 212 Mining (except Oil and Gas)
    CP Rail 482 Rail Transportation
    Canadian Natural Resources 211 Oil and Gas Extraction
    L3Harris 334 Computer and Electronic Product Manufacturing
    Ubisoft 511 Publishing Industries (except Internet)
    AWS 518 Data Processing, Hosting, and Related Services
    Scotiabank 522 Finance and Insurance
    CGI Inc. 541 Professional, Scientific, and Technical Services
    Hewlett Packard 334 Computer and Electronic Product Manufacturing
    Ikea 442 Furniture and Home Furnishings Stores
    Raytheon 336 Transportation Equipment Manufacturing
    Broil King 335 Electrical Equipment, Appliance, and Component Manufacturing
    Volkswagen 336 Transportation Equipment Manufacturing

    • This looks pretty good to me. What’s the issue you see? I also had better results with GPT-4, making sure to specify in the prompt the granularity of the NAICS code and its descriptor.

Leave a Reply

Your email address will not be published.