How to extract entities from raw text with Spacy: 3 approaches using Canadian data
TL;DR: Use the en_core_web_trf transformer model with Spacy to get much more accurate named entity recognition with multilingual text.
Entity recognition is one of the marvels of current technology, at least from a journalist’s perspective.
There was a time journalists had to read through hundreds, maybe thousands of documents, highlight names of people, companies and dates, and copy those to a spreadsheet before being able to do any kind of analysis.
Think of that scene in Spotlight when reporter Matt Carroll punches in names and dates of priests into Excel.
These days, software, with the help of AI models, can automatically detect these kinds of entities, dramatically speeding up investigations.
One handy tool is Spacy, a Python library for doing all kinds of natural language tasks. It has named entity recognition (NER) functions out of the box, which work a lot of the time, but sometimes fail. I found this to be the case particularly when using Canadian texts, which often contain French names and entities, and other non-English names that the standard English models don’t recognize.
The problem with unilingual models
Regular language models are trained on texts in a single language. To make NER tools more multilingual, you would need to further train a model with elements of other languages (for example, lots of French names). Or you’d need to run an English model over the text, then a French one, and figure out a way to combine the results later.
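To give a sense of what that workaround looks like, here’s a minimal sketch. It assumes fr_core_news_sm, Spacy’s standard small French model, is installed alongside the English one, and the merge is deliberately naive: it pools entities from both pipelines and keeps whichever label it sees first.

import spacy

nlp_en = spacy.load("en_core_web_sm")
nlp_fr = spacy.load("fr_core_news_sm")

def combined_entities(paragraph):
    # Run both pipelines over the same text and pool the entities,
    # deduplicating on the entity string
    ents = {}
    for nlp in (nlp_en, nlp_fr):
        for ent in nlp(paragraph).ents:
            ents.setdefault(ent.text, ent.label_)
    return ents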
This can be challenging and time-consuming, especially for a non-expert coder on deadline.
This is why transformer models are such a game-changer.
I don’t pretend to understand these models, but they’re a huge leap in NLP, and they’re revolutionizing everything from translation to chatbots to text generation.
In this post, I’ll compare two standard English models in Spacy with a transformer model in parsing Canadian government texts: the appointments of people to public positions, which contain names and places that standard models don’t recognize.
First, I’ll use the en_core_web_sm model, which is the smallest and probably most used model. Assuming I have a pandas Series text containing my seven paragraphs:
import spacy
from spacy import displacy

# Load the small English model and run all paragraphs through the pipeline at once
nlp = spacy.load("en_core_web_sm")
docs = list(nlp.pipe(text))

# Visualize the recognized entities with displaCy
for doc in docs:
    displacy.render(doc, style="ent")
This is the result:
[displaCy rendering of the entities found by en_core_web_sm]
Right away we see that most names were not recognized as names, even English ones! Some names with honorifics like “The Honourable” were tagged as products, while others were tagged as organizations. Roles like “Chairperson” were tagged as people, and organizations like the North Atlantic Salmon Conservation Organization as locations.
This won’t cut it when we’re dealing with hundreds of such paragraphs.
Here’s the result of the same code, but using the en_core_web_lg model, which is much larger.
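The only change from the previous snippet is the model name:

nlp = spacy.load("en_core_web_lg")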
[displaCy rendering of the entities found by en_core_web_lg]
More names were recognized, but not all. Fewer entities are mislabelled, but some names with titles are still being parsed as organizations. This is only marginally better than the small model.
Next-level transformers
And here’s the same data parsed with the en_core_web_trf model. Note that transformer models take a lot longer to run because they’re deep learning neural networks and need a good Nvidia GPU to run efficiently. On my machine, which only has an Intel graphics card, it took nearly 40 minutes to process a few hundred paragraphs.
Consider using Google Colab for this kind of work.
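If you do have a compatible GPU, you can tell Spacy to use it before loading the model. A minimal sketch (the GPU path also needs the cupy dependency installed):

import spacy

# Use the GPU if cupy and a CUDA device are available; otherwise fall back to CPU
spacy.prefer_gpu()
nlp = spacy.load("en_core_web_trf")
docs = list(nlp.pipe(text))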
[displaCy rendering of the entities found by en_core_web_trf]
Every name was recognized correctly. Honorifics were ignored. Even laws were recognized!
There are still some errors, like “the Council” being tagged as a separate organization, or “P.C.” tagged as a place. But these can be easily excluded during the data cleaning process.
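For example, a small post-processing step can filter them out. Here’s a minimal sketch, where the blocklist holds just the two strings mentioned above (you’d grow it as you spot more noise):

# Hypothetical blocklist of entity strings we know are noise
BLOCKLIST = {"the Council", "P.C."}

clean_ents = [ent for doc in docs for ent in doc.ents
              if ent.text not in BLOCKLIST]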
This is as good as NER gets without any additional tuning or retraining. This saved reporters days of work.
The last step is organizing the entity tags into structured data. This was my approach:
from itertools import groupby

# Group each document's entities by label, deduplicating the entity strings
parsed_data = []
for doc in docs:
    sorted_ents = sorted(doc.ents, key=lambda e: e.label_)
    doc_dict = {
        label: list(set(str(ent) for ent in ents))
        for label, ents in groupby(sorted_ents, key=lambda e: e.label_)
    }
    parsed_data.append(doc_dict)
This groups all entities into a list of dictionaries with the entity tags as keys. Example of one paragraph’s output:
{'DATE': ['one year'],
 'GPE': ['Montague', 'Prince Edward Island'],
 'ORG': ['the Independent Advisory Board for Senate Appointments'],
 'PERSON': ['Morley Scott Annear']}
This can then be loaded into a pandas DataFrame for analysis.
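That last step is a one-liner: pandas turns each dictionary into a row and each entity label into a column (labels missing from a paragraph simply come out as NaN).

import pandas as pd

# One row per paragraph, one column per entity label
df = pd.DataFrame(parsed_data)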