How to extract entities from raw text with Spacy: 3 approaches using Canadian data
TL;DR: Use the en_core_web_trf transformer model with Spacy to get much more accurate named entity recognition with multilingual text.
Entity recognition is one of the marvels of current technology, at least from a journalist’s perspective.
There was a time journalists had to read through hundreds, maybe thousands of documents, highlight names of people, companies and dates, and copy those to a spreadsheet before being able to do any kind of analysis.
Think of that scene in Spotlight when reporter Matt Carroll punches in names and dates of priests into Excel.
These days, software, with the help of AI models, can automatically detect these kinds of entities, dramatically speeding up investigations.
One handy tool is Spacy, a Python library for doing all kinds of natural language tasks. It has named entity recognition (NER) functions out of the box, which work a lot of the time, but sometimes fail. I found this to be the case particularly when using Canadian texts, which often contain French names and entities, and other non-English names that the standard English models don’t recognize.
The problem with unilingual models
Regular language models are trained on texts in a single language. To make NER tools more multilingual, you would need to further train a model with elements of other languages (for example, lots of French names). Or you’d need to run an English model over the text, then a French one, and figure out a way to combine the results later.
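To give a sense of what that workaround looks like, here’s a minimal sketch. It assumes fr_core_news_sm, Spacy’s standard small French model, is installed alongside the English one, and the merge is deliberately naive: it pools entities from both pipelines and keeps whichever label it sees first.

import spacy

nlp_en = spacy.load("en_core_web_sm")
nlp_fr = spacy.load("fr_core_news_sm")

def combined_entities(paragraph):
    # Run both pipelines over the same text and pool the entities,
    # deduplicating on the entity string
    ents = {}
    for nlp in (nlp_en, nlp_fr):
        for ent in nlp(paragraph).ents:
            ents.setdefault(ent.text, ent.label_)
    return ents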
This can be challenging and time-consuming, especially for a non-expert coder on deadline.
This is why transformer models are such a game-changer.
I don’t pretend to understand these models, but they’re a huge leap in NLP, and they’re revolutionizing everything from translation to chatbots to text generation.
In this post, I’ll compare two standard English models in Spacy with a transformer model in parsing Canadian government texts: the appointments of people to public positions, which contain names and places that standard models don’t recognize.
First, I’ll use the en_core_web_sm model, which is the smallest and probably most used model. Assuming I have a pandas Series text containing my seven paragraphs:
import spacy
from spacy import displacy

# Load the small English model and run all paragraphs through the pipeline at once
nlp = spacy.load("en_core_web_sm")
docs = list(nlp.pipe(text))

# Visualize the recognized entities with displaCy
for doc in docs:
    displacy.render(doc, style="ent")
This is the result:
[displaCy rendering of the entities found by en_core_web_sm]
Right away we see that most names were not recognized as names, even English ones! Some names with honorifics like “The Honourable” were tagged as products, while others were tagged as organizations. Roles like “Chairperson” were tagged as people, and organizations like the North Atlantic Salmon Conservation Organization as locations.
This won’t cut it when we’re dealing with hundreds of such paragraphs.
Here’s the result of the same code, but using the en_core_web_lg model, which is much larger.
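The only change from the previous snippet is the model name:

nlp = spacy.load("en_core_web_lg")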
[displaCy rendering of the entities found by en_core_web_lg]
More names were recognized, but not all. Fewer entities are mislabelled, but some names with titles are still being parsed as organizations. This is only marginally better than the small model.
Next-level transformers
And here’s the same data parsed with the en_core_web_trf model. Note that transformer models take a lot longer to run because they’re deep learning neural networks and need a good Nvidia GPU to run efficiently. On my machine, which only has an Intel graphics card, it took nearly 40 minutes to process a few hundred paragraphs.
Consider using Google Colab for this kind of work.
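If you do have a compatible GPU, you can tell Spacy to use it before loading the model. A minimal sketch (the GPU path also needs the cupy dependency installed):

import spacy

# Use the GPU if cupy and a CUDA device are available; otherwise fall back to CPU
spacy.prefer_gpu()
nlp = spacy.load("en_core_web_trf")
docs = list(nlp.pipe(text))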
[displaCy rendering of the entities found by en_core_web_trf]
Every name was recognized correctly. Honorifics were ignored. Even laws were recognized!
There are still some errors, like “the Council” being tagged as a separate organization, or “P.C.” tagged as a place. But these can be easily excluded during the data cleaning process.
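For example, a small post-processing step can filter them out. Here’s a minimal sketch, where the blocklist holds just the two strings mentioned above (you’d grow it as you spot more noise):

# Hypothetical blocklist of entity strings we know are noise
BLOCKLIST = {"the Council", "P.C."}

clean_ents = [ent for doc in docs for ent in doc.ents
              if ent.text not in BLOCKLIST]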
This is as good as NER gets without any additional tuning or retraining. This saved reporters days of work.
The last step is organizing the entity tags into structured data. This was my approach:
from itertools import groupby

# Group each document's entities by label, deduplicating the entity strings
parsed_data = []
for doc in docs:
    sorted_ents = sorted(doc.ents, key=lambda e: e.label_)
    doc_dict = {
        label: list(set(str(ent) for ent in ents))
        for label, ents in groupby(sorted_ents, key=lambda e: e.label_)
    }
    parsed_data.append(doc_dict)
This groups all entities into a list of dictionaries with the entity tags as keys. Example of one paragraph’s output:
{'DATE': ['one year'],
 'GPE': ['Montague', 'Prince Edward Island'],
 'ORG': ['the Independent Advisory Board for Senate Appointments'],
 'PERSON': ['Morley Scott Annear']}
This can then be loaded into a pandas DataFrame for analysis.
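That last step is a one-liner: pandas turns each dictionary into a row and each entity label into a column (labels missing from a paragraph simply come out as NaN).

import pandas as pd

# One row per paragraph, one column per entity label
df = pd.DataFrame(parsed_data)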