One of the final frontiers of data analysis is making sense of unstructured text like reports and open-ended responses in surveys. Natural language processing (NLP), with the help of AI, is making this kind of analysis more accessible.
Libraries like spaCy and Gensim, although still code-based, are simplifying the process of getting insights out of raw text.
Recently, we at the CBC sent a questionnaire to more than 50,000 teachers across Canada. We got nearly 9,400 responses, and more than 4,000 of those included open-ended text responses. That’s a lot of text, loaded with information about teachers’ frustrations during the COVID-19 pandemic. You could read through them all and note the topics and sentiments expressed, but NLP greatly reduces the time it takes to get useful insights.
In this post, I show how I used the Python library spaCy to analyze all those responses and get a broad understanding of educators’ most common concerns.
Note: I can’t share the raw data used in this analysis for confidentiality reasons, but the code can be applied to any data with lots of unstructured text.
The analysis
First, import spaCy and load the small English language model. Raising the max_length attribute lets us go past spaCy’s default limit on how much text (measured in characters) it will process at once.
```python
import spacy

nlp = spacy.load("en_core_web_sm")
nlp.max_length = 1850000
```
This is a sample of what the data from the survey looks like, in a pandas dataframe df:
Response |
---|
Safety has been a facade. Education for education’s sake hasn’t been as important as parents having somewhere to send their children. |
More time should be spent with education around proper mask use, ensuring cohorting is adhered to, and more transparent information should be available regarding positive cases within the school. |
I am a special education teacher with autistic students. We have been in the classroom throughout the year. We are alone in the school. There are no supports, no enhanced PPE, no enhanced cleaning routines or any check ins, support or communication from our consultants or superintendents. We are invisible to all…government, media, community and teaching colleagues and unions. We have worked as custodians, medical aides, family counsellors and mental health aides…and finally teachers. Absolutely no one is aware of our double, triple and quadruple duties. It is the most discriminated event I have ever experienced in my over 30 years of teaching. We are truly invisible. This is my final year of teaching…and what a year it has been. |
It has been a very challenging year and I personally do NOT feel we have had the support of the Premier in any way. He has little understanding of the challenges teachers have faced. He doesn’t have a clue! I feel insulted and saddened by every remark he has made about teachers. His position is authoritarian and not inclusive in any way. I dont know of anyone in the school system who feels any support from him. |
I support the decision to keep students in school but there is a lack of transparency in the number of cases and transmission rates at each school. |
Teachers should have access to the vaccine as we effect so many people. Mental health, making sure kids are happy and health have been bigger focuses than curricular outcomes. |
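If you want to follow along without the original data, the rest of the code only assumes a pandas DataFrame called df with a Response column of free-text strings. Here is a minimal stand-in; the rows are made up purely for illustration.

```python
import pandas as pd

# Made-up responses purely for illustration; the real survey data isn't shareable.
df = pd.DataFrame({
    "Response": [
        "We need smaller classes and more support staff.",
        "I am exhausted and worried about my students' mental health.",
        "Online learning has been a challenge for everyone.",
    ]
})
```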
The first step is to join all the responses into a single mega string, since I want to analyze the responses as a whole. For this, I use pandas’ handy cat string method to concatenate all the strings in the Response column.
```python
all_text = df.Response.str.cat(sep=' ')
```
Now create a spaCy document with that text. I don’t need the named entity recognizer (NER) so I disable that to save on memory and computing time.
```python
doc = nlp(all_text, disable=['ner'])
```
This does a few things: it splits the text into individual words and tags them with their part-of-speech, like nouns, verbs, adjectives, etc. It also recognizes common words (stop-words) like “and”, “I”, and “with” that don’t have much meaning and can be excluded from word counts.
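To see what those annotations look like, here’s a quick sketch on a throwaway sentence (not the survey text); each token carries a part-of-speech tag, a lemma and a stop-word flag:

```python
# Illustration only: inspect spaCy's per-token annotations on a toy sentence
for token in nlp("Teachers are exhausted and need more support."):
    print(token.text, token.pos_, token.lemma_, token.is_stop)
```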
Now I can do an overall word frequency analysis to see the most common words that aren’t stop words or punctuation marks.
```python
from collections import Counter

words = [token.lemma_ for token in doc
         if not token.is_stop and not token.is_punct]
word_freq = Counter(words)
word_freq.most_common(20)
```
Note that I asked for the lemma_ attribute of each token, which is the lemmatized version of the word. That means inflected forms like “am”, “is” and “are” are all standardized to a root form like “be”, and plural nouns are reduced to their singular form.
Result:
[(' ', 5702),
('student', 4672),
('school', 4278),
('teacher', 2954),
('year', 1861),
('work', 1784),
('government', 1654),
('staff', 1337),
('class', 1259),
('feel', 1143),
('need', 1108),
('online', 1090),
('time', 1085),
('health', 935),
('support', 851),
('day', 849),
('learning', 838),
('teach', 809),
('classroom', 786),
('kid', 764),
('education', 743),
('mask', 698),
('child', 697),
('pandemic', 669)]
```
Already you can get a sense of what teachers talk about the most. The most commonly used verbs are quite telling: “feel” and “need”. What are teachers feeling? What do they need? spaCy can help us answer this with pattern matching.
Pattern matching in linguistics is a bit like regular expressions, but for language. Instead of matching a sequence of characters, you can match a sequence of word types. For example, what are the most common adjective-noun phrases?
```python
from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab)
pattern = [{'POS': 'ADJ'}, {'POS': 'NOUN'}]
matcher.add('ADJ_PHRASE', [pattern])
matches = matcher(doc, as_spans=True)

phrases = []
for span in matches:
    phrases.append(span.text.lower())

phrase_freq = Counter(phrases)
phrase_freq.most_common(30)
```
Note how pattern is defined: a list of dictionaries, each defining a part of speech (POS). In this case, it’s an adjective, then a noun. So spaCy will look for all instances of this pattern in the text.
Result:
```
[('mental health', 493),
('online learning', 248),
('social distancing', 140),
('high school', 133),
('many students', 121),
('provincial government', 116),
('front line', 86),
('next year', 82),
('last year', 72),
('special needs', 66),
('essential workers', 64),
('many teachers', 60),
('close contact', 57),
('public education', 57),
('special education', 57),
('remote learning', 56),
('public health', 56),
('same time', 48),
('social distance', 43),
('physical distancing', 39),
('general public', 39),
('difficult year', 38),
('high schools', 36),
('many people', 34),
('long term', 33),
('high risk', 31),
('last minute', 31),
('challenging year', 30),
('young children', 30),
('stressful year', 30)]
```
Incredible. “Mental health” was by far the most common phrase of this kind in teachers’ responses.
Now let’s look for the most common adjectives that follow phrases like “I am” or “I feel”. For this, the pattern has to be more complex, because these are all valid constructions we’d like to capture:
- I feel exhausted
- I really feel exhausted
- We’re pretty exhausted
For this, spaCy’s Matcher allows wildcards.
```python
feel_adj = []
matcher = Matcher(nlp.vocab)
pattern = [{'LOWER': {'IN': ['i', 'we']}},
           {'OP': '?'},
           {'LOWER': {'IN': ['feel', 'am', "'m", 'are', "'re"]}},
           {'OP': '?'},
           {'OP': '?'},
           {'POS': 'ADJ'}]
matcher.add("FeelAdj", [pattern])
matches = matcher(doc, as_spans=True)

for span in matches:
    feel_adj.extend([token.lemma_ for token in span if token.pos_ == 'ADJ'])

Counter(feel_adj).most_common(20)
```
The pattern now asks for the following: the lower-case version of “I” or “we” (so the match is case-insensitive), an optional word in between (the ‘?’ operator, as in regex), the lower-case version of any of “feel”, “am”, “are” and their contractions, up to two more optional filler words, and finally an adjective.
Then the code loops through the matches, keeps only the adjectives captured, and adds them to a list.
Result:
```
[('concerned', 88),
('worried', 42),
('tired', 41),
('disappointed', 37),
('able', 31),
('exhausted', 29),
('sure', 27),
('safe', 24),
('good', 22),
('essential', 21),
('scared', 20),
('happy', 19),
('online', 15),
('only', 14),
('grateful', 14),
('front', 14),
('frustrated', 13),
('glad', 12),
('proud', 11),
('elementary', 11),
('more', 11),
('thankful', 11),
('close', 10),
('lucky', 10),
```
This is incredibly informative. When we asked teachers to write whatever they wanted, so many expressed feelings of exhaustion, concern, and fear.
Here’s a pattern that looks for phrases that start with “I/we want/need”, followed by a noun, with optional filler words in between:
```python
want_adj = []
matcher = Matcher(nlp.vocab)
pattern = [{'LOWER': {'IN': ['i', 'we']}},
           {'IS_ALPHA': True, 'OP': '?'},
           {'LOWER': {'IN': ['need', 'want']}},
           {'IS_ALPHA': True, 'OP': '?'},
           {'IS_ALPHA': True, 'OP': '?'},
           {'POS': 'NOUN'}]
matcher.add("WantPhrase", [pattern])
matches = matcher(doc, as_spans=True)
```
Here’s a sample of matches:
```
I need a break,
I need more help,
We need more vaccines,
We need more adults,
I want to shout NO,
We need a public outcry,
I want schools,
We need to take school,
We need more online learning,
we need more resources,
We need the government,
we need uninterrupted time,
We need more support,
We desperately need the province,
we need continuity,
We need serious inputs,
We need stronger government policy,
I want my vaccine,
We need to show children,
I still need to find time,
I need time,
we want to support students,
We need small classes....
```
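The raw phrases are telling on their own, but if you want aggregate counts, a short sketch in the same spirit as the “feel” pattern above tallies the nouns captured by each match:

```python
# Sketch: count the nouns captured by the want/need matches above
want_nouns = [token.lemma_.lower() for span in matches
              for token in span if token.pos_ == 'NOUN']
Counter(want_nouns).most_common(20)
```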
Another flavour of spaCy’s Matcher is the PhraseMatcher, which looks for instances of a specific phrase that you define. Let’s say I want to find the words that most frequently occur near the phrase “mental health”:
```python
from spacy.matcher import PhraseMatcher

mental_health_colloc = []
# attr='LOWER' ensures matching is done on lower-cased text, so the search is case-insensitive
matcher = PhraseMatcher(nlp.vocab, attr='LOWER')
pattern = [nlp.make_doc('mental health')]
matcher.add('mentalHealth', pattern)
matches = matcher(doc)

for match_id, start, end in matches:
    # clamp at 0 so a match near the very start of the text doesn't wrap around
    span = doc[max(start - 10, 0) : end + 10]
    mental_health_colloc.extend([token.lemma_.lower() for token in span
                                 if not token.is_stop and not token.is_punct])

Counter(mental_health_colloc).most_common(20)
```
Look at how span is defined: it grabs the 10 tokens before and after “mental health”. Then I strip out stop words and punctuation and count the words that remain.
Result:
```
[('health', 544),
('mental', 532),
('student', 245),
('school', 106),
('teacher', 92),
('staff', 72),
('year', 62),
('need', 57),
('support', 54),
('issue', 46),
('pandemic', 39),
('time', 35),
('work', 35),
('learning', 34),
('government', 32),
('impact', 30),
('suffer', 29),
('child', 28),
('take', 27),
('concern', 26),
('kid', 23),
('struggle', 23),
('feel', 21),
('family', 21),
('education', 19)]
```
Impressive. Right away you can tell that, when talking about mental health, teachers mention students more than themselves or other teachers. With just a few lines of code, we were able to get insights that would have taken hours of reading to achieve.