How to extract entities from raw text with spaCy: 3 approaches using Canadian data
TL;DR: Use the en_core_web_trf transformer model with spaCy to get much more accurate named entity recognition on multilingual text.
Entity recognition is one of the marvels of current technology, at least from a journalist’s perspective.
There was a time when journalists had to read through hundreds, maybe thousands of documents, highlight names of people, companies and dates, and copy those to a spreadsheet before being able to do any kind of analysis.
Think of that scene in Spotlight when reporter Matt Carroll punches in names and dates of priests into Excel.
These days, software, with the help of AI models, can automatically detect these kinds of entities, greatly speeding up investigations.
One handy tool is spaCy, a Python library for doing all kinds of natural language tasks. It has named entity recognition (NER) functions out of the box, which work a lot of the time, but sometimes fail. I found this to be the case particularly when using Canadian texts, which often contain French names and entities, and other non-English names that the standard English models don’t recognize.
The problem with unilingual models
Regular language models are trained on texts in a single language. To make an NER tool more multilingual, you would need to train it further on material from other languages (for example, lots of French names). Or you’d need to run an English model over the text, then a French one, and figure out a way to combine the results later.
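To give a sense of what that combining step involves, here is a minimal sketch. It assumes each model's entities have already been reduced to (start_char, end_char, label) tuples (spaCy exposes these values as ent.start_char, ent.end_char and ent.label_). The merge rule, the function names and the sample offsets are all hypothetical, chosen only for illustration: keep everything the first model found, then add entities from the second model only where they don't overlap.

```python
from typing import List, Tuple

Entity = Tuple[int, int, str]  # (start_char, end_char, label)


def overlaps(a: Entity, b: Entity) -> bool:
    """True when two character spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def merge_entities(primary: List[Entity], secondary: List[Entity]) -> List[Entity]:
    """Keep every primary entity; add secondary ones that overlap nothing kept so far."""
    merged = list(primary)
    for candidate in secondary:
        if not any(overlaps(candidate, kept) for kept in merged):
            merged.append(candidate)
    return sorted(merged)


# Hypothetical results for the sentence:
# "Appointment of LISA DE WILDE, C.M., of Oakville, Ontario"
english_ents = [(39, 47, "ORG"), (49, 56, "GPE")]  # "Oakville", "Ontario"
french_ents = [(15, 28, "PER"), (39, 47, "LOC")]   # "LISA DE WILDE", "Oakville"

combined = merge_entities(english_ents, french_ents)
print(combined)
```

Even this toy version shows the catch: the two models disagree about "Oakville", and their label names don't line up either ("PER" versus "PERSON"), so you would still have to decide which model to trust and how to reconcile the tags.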
This can be challenging and time consuming, especially for a non-expert coder on deadlines.
This is why transformer models are such a game-changer.
I don’t pretend to understand these models, but they’re a huge leap in NLP, and they’re revolutionizing everything from translation to chatbots to text generation.
In this post, I’ll compare two standard English models in spaCy with a transformer model in parsing Canadian government texts: notices of appointments to public positions, which contain names and places that standard models don’t recognize.
First, I’ll use the en_core_web_sm model, which is the smallest and probably most used model. Assuming I have a pandas Series text containing my seven paragraphs:
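import spacy
from spacy import displacy

# Load the small English pipeline
nlp = spacy.load("en_core_web_sm")

# Run every paragraph through the pipeline
docs = list(nlp.pipe(text))

# Highlight the recognized entities in each paragraph
for doc in docs:
    displacy.render(doc, style="ent")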
This is the result:
displaCy output (each recognized entity is shown in brackets, followed by its label):

Appointment of DOUGLAS G. BLISS of [Sackville | GPE], [New Brunswick | GPE], to hold office during pleasure, as a [Canadian | NORP] representative to [the Council of the North Atlantic Salmon Conservation Organization | ORG]; to [the North American Commission | ORG] of [the North Atlantic Salmon Conservation Organization | LOC]; and to [the West Greenland Commission | ORG] of [the North Atlantic Salmon Conservation Organization | LOC].

Appointment of [ROGER ALLEN FRASER | PERSON] of [Yellowknife | GPE], [Northwest Territories | GPE], to be a member to [the Gwich’in Renewable Resources Board | ORG], for a term of [five years | DATE], effective on the date of this Order in Council or the Order of [the Executive Council | ORG], whichever is later.

Reappointment of DELAINE M. DEW of [Edmonton | GPE], [Alberta | GPE], as a full time member of [the Prairies Regional Division of the Parole Board of Canada | ORG], to hold office during good behaviour for a period of [five years | DATE], effective [July 4, 2022 | DATE].

Re-appointment of [CONSTANCE SUGIYAMA | ORG], [C.M. | ORG], of [Toronto | GPE], [Ontario | GPE], as a director of [the Board of Directors | ORG] of the Asia-Pacific Foundation of [Canada | GPE], to hold office during pleasure for a term of [three years | DATE], effective [July 4, 2022 | DATE].

Re-appointment of LISA DE WILDE, [C.M. | ORG], of [Oakville | ORG], [Ontario | GPE], as a director of [the Board of Directors | ORG] of the Asia-Pacific Foundation of [Canada | GPE], to hold office during pleasure for a term of [three years | DATE], effective [July 4, 2022 | DATE].

Re-appointment of [the HONOURABLE PIERRE S. PETTIGREW | PRODUCT], [P.C. | GPE], of [Toronto | GPE], [Ontario | GPE], as [Chairperson | PERSON] of [the Board of Directors | ORG] of the Asia-Pacific Foundation of [Canada | GPE], to hold office during pleasure for a term of [three years | DATE], effective [July 1, 2022 | DATE].

Approval of the appointment by the Minister of [Housing and Diversity and Inclusion of CHRISTOPHER F. SICOTTE of | ORG] [Saskatoon | GPE], [Saskatchewan | GPE], to be a director of [the Board of Directors | ORG] of [the Canada Mortgage and Housing Corporation | ORG], to hold office during pleasure, on a part-time basis, for a term of [four years | DATE].
Right away we see that most names were not recognized as names, even English ones! Some names with honorifics like “The Honourable” were tagged as products, while others were tagged as organizations. Roles like “Chairperson” were tagged as people, and organizations like the North Atlantic Salmon Conservation Organization were tagged as locations.
This won’t cut it when we’re dealing with hundreds of such paragraphs.
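At that scale, eyeballing the highlighted text stops being practical, so it helps to count the misses instead. A rough sketch, assuming the entities have been pulled out of each doc as (text, label) pairs (for example from ent.text and ent.label_) and that you have a hand-checked list of the names you expect to find. The function name missed_names is hypothetical; the sample values come from the small model's output above.

```python
def missed_names(expected_names, entities):
    """Return the expected names that were not tagged as PERSON."""
    found = {text for text, label in entities if label == "PERSON"}
    return [name for name in expected_names if name not in found]


# The seven appointees named in the sample paragraphs
expected = [
    "DOUGLAS G. BLISS",
    "ROGER ALLEN FRASER",
    "DELAINE M. DEW",
    "CONSTANCE SUGIYAMA",
    "LISA DE WILDE",
    "PIERRE S. PETTIGREW",
    "CHRISTOPHER F. SICOTTE",
]

# Name-related tags that en_core_web_sm produced for those paragraphs
small_model_entities = [
    ("ROGER ALLEN FRASER", "PERSON"),
    ("CONSTANCE SUGIYAMA", "ORG"),
    ("the HONOURABLE PIERRE S. PETTIGREW", "PRODUCT"),
    ("Chairperson", "PERSON"),
]

missed = missed_names(expected, small_model_entities)
print(f"{len(missed)} of {len(expected)} names missed: {missed}")
```

The check is deliberately strict: a name only counts as found if the exact string was tagged PERSON, so a name swallowed into a longer entity still shows up as a miss.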
Here’s the result of the same code, but using the en_core_web_lg model, which is much larger.
displaCy output (each recognized entity is shown in brackets, followed by its label):

Appointment of [DOUGLAS G. BLISS | PERSON] of [Sackville | GPE], [New Brunswick | GPE], to hold office during pleasure, as a [Canadian | NORP] representative to [the Council of the North Atlantic Salmon Conservation Organization | ORG]; to [the North American Commission | ORG] of [the North Atlantic Salmon Conservation Organization | ORG]; and to [the West Greenland Commission | GPE] of [the North Atlantic Salmon Conservation Organization | ORG].

Appointment of [ROGER ALLEN FRASER | PERSON] of [Yellowknife | GPE], [Northwest Territories | GPE], to be a member to [the Gwich’in Renewable Resources Board | ORG], for a term of [five years | DATE], effective on the date of [this Order in Council | ORG] or [the Order of the Executive Council | ORG], whichever is later.

Reappointment of DELAINE M. DEW of [Edmonton | GPE], [Alberta | GPE], as a full time member of [the Prairies Regional Division of the Parole | ORG] [Board of Canada | ORG], to hold office during good behaviour for a period of [five years | DATE], effective [July 4, 2022 | DATE].

Re-appointment of [CONSTANCE SUGIYAMA | PERSON], [C.M. | ORG], of [Toronto | GPE], [Ontario | GPE], as a director of [the Board of Directors | ORG] of the Asia-Pacific Foundation of Canada, to hold office during pleasure for a term of [three years | DATE], effective [July 4, 2022 | DATE].

Re-appointment of [LISA DE WILDE | PERSON], [C.M. | ORG], of [Oakville | GPE], [Ontario | GPE], as a director of [the Board of Directors | ORG] of the Asia-Pacific Foundation of Canada, to hold office during pleasure for a term of [three years | DATE], effective [July 4, 2022 | DATE].

Re-appointment of the HONOURABLE [PIERRE S. PETTIGREW | PERSON], [P.C. | GPE], of [Toronto | GPE], [Ontario | GPE], as Chairperson of [the Board of Directors | ORG] of the Asia-Pacific Foundation of Canada, to hold office during pleasure for a term of [three years | DATE], effective [July 1, 2022 | DATE].

Approval of the appointment by [the Minister of Housing and Diversity and Inclusion of CHRISTOPHER F. SICOTTE | ORG] of [Saskatoon | GPE], [Saskatchewan | GPE], to be a director of [the Board of Directors | ORG] of [the Canada Mortgage and Housing Corporation | ORG], to hold office during pleasure, on a part-time basis, for a term of [four years | DATE].
More names were recognized, but not all. Fewer entities are mislabelled, but some names with titles are still being parsed as organizations. This is only marginally better than the small model.
Next-level transformers
And here’s the same data parsed with the en_core_web_trf model. Note that transformer models take a lot longer to run because they’re deep learning neural networks, and need good Nvidia GPUs to run efficiently. On my machine with an Intel graphics card, it took nearly 40 minutes to process a few hundred paragraphs.
Consider using Google Colab for this kind of work.
displaCy output (each recognized entity is shown in brackets, followed by its label):

Appointment of [DOUGLAS G. BLISS | PERSON] of [Sackville | GPE], [New Brunswick | GPE], to hold office during pleasure, as a [Canadian | NORP] representative to [the Council | ORG] of [the North Atlantic Salmon Conservation Organization | ORG]; to [the North American Commission | ORG] of [the North Atlantic Salmon Conservation Organization | ORG]; and to [the West Greenland Commission | ORG] of [the North Atlantic Salmon Conservation Organization | ORG].

Appointment of [ROGER ALLEN FRASER | PERSON] of [Yellowknife | GPE], [Northwest Territories | GPE], to be a member to [the Gwich’in Renewable Resources Board | ORG], for a term of [five years | DATE], effective on the date of this Order in Council or [the Order of the Executive Council | LAW], whichever is later.

Reappointment of [DELAINE M. DEW | PERSON] of [Edmonton | GPE], [Alberta | GPE], as a full time member of [the Prairies Regional Division | ORG] of [the Parole Board of Canada | ORG], to hold office during good behaviour for a period of [five years | DATE], effective [July 4, 2022 | DATE].

Re-appointment of [CONSTANCE SUGIYAMA | PERSON], C.M., of [Toronto | GPE], [Ontario | GPE], as a director of [the Board of Directors | ORG] of [the Asia-Pacific Foundation of Canada | ORG], to hold office during pleasure for a term of [three years | DATE], effective [July 4, 2022 | DATE].

Re-appointment of [LISA DE WILDE | PERSON], C.M., of [Oakville | GPE], [Ontario | GPE], as a director of [the Board of Directors | ORG] of [the Asia-Pacific Foundation of Canada | ORG], to hold office during pleasure for a term of [three years | DATE], effective [July 4, 2022 | DATE].

Re-appointment of the HONOURABLE [PIERRE S. PETTIGREW | PERSON], [P.C. | GPE], of [Toronto | GPE], [Ontario | GPE], as Chairperson of [the Board of Directors | ORG] of [the Asia-Pacific Foundation of Canada | ORG], to hold office during pleasure for a term of [three years | DATE], effective [July 1, 2022 | DATE].

Approval of the appointment by the Minister of Housing and Diversity and Inclusion of [CHRISTOPHER F. SICOTTE | PERSON] of [Saskatoon | GPE], [Saskatchewan | GPE], to be a director of [the Board of Directors | ORG] of [the Canada Mortgage and Housing Corporation | ORG], to hold office during pleasure, on a part-time basis, for a term of [four years | DATE].
Every name was recognized correctly. Honorifics were ignored. Even laws were recognized!
There are still some errors, like “the Council” being tagged as a separate organization, or “P.C.” being tagged as a place. But these can easily be excluded during the data cleaning process.
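That cleaning step could be as simple as a filter over the extracted entities. A minimal sketch, assuming the entities are available as (text, label) pairs and that you keep your own exclusion lists. The names POST_NOMINALS, GENERIC_FRAGMENTS and clean_entities are hypothetical, and the lists hold only the false positives mentioned in this post.

```python
# Hypothetical exclusion lists; extend them as new false positives turn up
POST_NOMINALS = {"P.C.", "C.M."}     # honours mistaken for places or organizations
GENERIC_FRAGMENTS = {"the Council"}  # fragments split off from a longer organization name


def clean_entities(entities):
    """Drop entities whose text is a known false positive."""
    return [
        (text, label)
        for text, label in entities
        if text not in POST_NOMINALS and text not in GENERIC_FRAGMENTS
    ]


# A few tags taken from the transformer model's output above
raw = [
    ("the Council", "ORG"),
    ("the North Atlantic Salmon Conservation Organization", "ORG"),
    ("PIERRE S. PETTIGREW", "PERSON"),
    ("P.C.", "GPE"),
    ("Toronto", "GPE"),
]

cleaned = clean_entities(raw)
print(cleaned)
```

Matching on exact text keeps the filter predictable, at the cost of having to add each new variant by hand.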
This is as good as NER gets without any additional tuning or retraining. This saved reporters days of work.
The last step is organizing the entity tags into structured data. This was my approach:
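from itertools import groupby

parsed_data = []
for doc in docs:
    # groupby only merges adjacent items, so sort the entities by label first
    ents_by_label = sorted(doc.ents, key=lambda ent: ent.label_)
    # Map each label to the unique entity strings found under it
    doc_dict = {
        label: list(set(map(str, group)))
        for label, group in groupby(ents_by_label, key=lambda ent: ent.label_)
    }
    parsed_data.append(doc_dict)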
This groups all entities into a list of dictionaries with the entity tags as keys. Example of one paragraph’s output:
{'DATE': ['one year'],
 'GPE': ['Montague', 'Prince Edward Island'],
 'ORG': ['the Independent Advisory Board for Senate Appointments'],
 'PERSON': ['Morley Scott Annear']}
This list of dictionaries can then be loaded into a pandas data frame for analysis.
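Because each value is a list, it can help to flatten the dictionaries into plain rows before building the data frame. A small sketch, assuming every dictionary maps an entity label to a list of strings and that one row per paragraph, with multiple values joined into a single string, is good enough for your analysis. The function name to_rows, the doc_id column and the choice of labels are hypothetical.

```python
def to_rows(parsed_data, labels=("PERSON", "ORG", "GPE", "DATE"), sep="; "):
    """Turn each label -> list-of-strings dict into one flat row per paragraph."""
    rows = []
    for doc_id, doc_dict in enumerate(parsed_data):
        row = {"doc_id": doc_id}
        for label in labels:
            # Sort so the joined string is stable; missing labels become ""
            row[label] = sep.join(sorted(doc_dict.get(label, [])))
        rows.append(row)
    return rows


# One paragraph's worth of parsed entities, as in the example output
parsed_data = [
    {
        "DATE": ["one year"],
        "GPE": ["Montague", "Prince Edward Island"],
        "ORG": ["the Independent Advisory Board for Senate Appointments"],
        "PERSON": ["Morley Scott Annear"],
    }
]

rows = to_rows(parsed_data)
print(rows[0])
```

The resulting list of dictionaries can be passed straight to pandas.DataFrame, which creates one column per key.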