Data (and computer) scientists have long been working on improving the ability of algorithms to derive meaning from natural (human) languages - whether they’re trying to create a bot that responds to users’ questions on their website or determine whether people love or hate their brand on Twitter.
The bad news is there is still a deep stack of concepts that you need to understand to tune your results. The good news is that with tools like BERT and ERNIE, getting good results from Natural Language Processing (NLP) is more accessible than ever - even with modestly sized data sets and computing budgets. Plus, who wouldn’t want to do NLP with the Sesame Street crew?!
An abbreviated history of NLP
Let’s start with a brief look at the history of the discipline. It’s possible to break down the development of NLP systems into three broad phases:
Rules engines - In the early days, most NLP systems were based on complex sets of hand-written rules. The good news is that they were easy to understand, but they didn’t do a very good job (they were interpretable, but not very accurate).
Statistical inference - In the 1980s, researchers started to use Hidden Markov Models for Part of Speech Tagging (tagging nouns, verbs, etc.), returning statistically likely meanings for, and relationships between, words.
Deep learning - In the last decade, neural networks have become the most common way to solve most non-trivial NLP problems, layering techniques such as CNNs, RNNs, and LSTMs to improve performance for specific classes of NLP tasks.
Deep learning has transformed the practice of NLP over the last decade. Whether you’re trying to implement machine translation, question answering, short text categorization or sentiment analysis, there are deep learning tools available to help solve those problems. However, historically, the process of creating the right network and then training it took a lot of time, expertise, a huge data set and a lot of computational power (which was expensive).
(Machine) Learning with Sesame Street
The whole “Sesame Street” revolution in NLP kicked off in early 2018 with a paper discussing ELMo representations (ELMo stands for Embeddings from Language Models). ELMo is a technique that uses a deep bidirectional language model, pre-trained on a large text corpus, to improve performance for a range of NLP tasks.
What does that mean? Let’s break it down. “Deep” refers to the fact that it’s using a multi-layer neural network (as in “deep learning”). Bidirectional? Well, historically most language models were unidirectional, so for English they’d read the words from left to right. With a bidirectional model, the words on both sides of a given word are taken into account (ELMo does this by combining a left-to-right model with a right-to-left one). This allows for context to be inferred more accurately given sufficient training. And pre-training means that the model is already trained on a very large general purpose language data set. Pre-training has been shown in both image recognition and NLP to substantially improve accuracy and/or reduce the time and cost required for final training of the model.
In November of 2018, Google open-sourced BERT. BERT stands for Bidirectional Encoder Representations from Transformers. It was a new technique for contextual pre-training. Contextual means that it takes into account the words around a given word, so unlike a context-free model like the popular Word2Vec models, with BERT, “bank” is not the same concept in “bank account” and “river bank.”
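To make the context-free versus contextual distinction concrete, here is a toy sketch. It is not how BERT works internally (BERT uses deep Transformer layers); it simply blends each word's vector with its neighbors' vectors to show that the same word can end up with different representations in different phrases. The vectors and the helper names (STATIC, context_free, contextual) are made up for illustration.

```python
# Toy illustration only: all vectors are invented, and neighbor averaging
# is a stand-in for the far richer context mixing a real model performs.
STATIC = {
    "bank": [1.0, 1.0],
    "account": [1.0, 0.0],
    "river": [0.0, 1.0],
}

def context_free(tokens):
    """Look each word up in a fixed table, ignoring its neighbors."""
    return [STATIC[t] for t in tokens]

def contextual(tokens):
    """Blend each word's vector with the mean of the other words in the phrase."""
    static = context_free(tokens)
    out = []
    for i, vec in enumerate(static):
        others = [v for j, v in enumerate(static) if j != i]
        mean = [sum(col) / len(others) for col in zip(*others)]
        out.append([(a + b) / 2 for a, b in zip(vec, mean)])
    return out

phrase_a = ["bank", "account"]
phrase_b = ["river", "bank"]

# Context-free: "bank" gets the same vector in both phrases.
static_a = context_free(phrase_a)[0]
static_b = context_free(phrase_b)[1]

# Contextual: "bank" gets a different vector depending on its neighbor.
ctx_a = contextual(phrase_a)[0]
ctx_b = contextual(phrase_b)[1]
```

In this sketch the context-free lookup cannot tell the two senses of “bank” apart, while the blended vectors differ because the neighboring word differs.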
BERT leverages concepts from a number of existing approaches including ELMo and ULMFiT. The core advance with BERT is that it masks different words in any given input phrase and then estimates the likelihood of various words that might be able to “fill that slot.”
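The masking idea can be sketched in a few lines. This is a simplified illustration, not Google's implementation: the 15% masking rate follows the BERT paper, but BERT's full recipe also sometimes substitutes a random word or leaves the chosen word unchanged, which is omitted here. The function names and the candidate scores are hypothetical; in a real model the scores would come from the network.

```python
import math
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels).

    labels holds the original word at each masked slot and None elsewhere,
    so the training loss is computed only where a word was hidden.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels

def fill_slot_probabilities(scores):
    """Softmax over candidate scores: how likely each word is to fill a slot."""
    top = max(scores.values())
    exps = {word: math.exp(s - top) for word, s in scores.items()}
    total = sum(exps.values())
    return {word: e / total for word, e in exps.items()}

tokens = "the boat drifted toward the river bank".split()
masked, labels = mask_tokens(tokens, mask_prob=0.5, seed=1)

# Made-up scores a model might assign to candidates for one masked slot.
probs = fill_slot_probabilities({"bank": 3.0, "account": 0.5, "street": 1.0})
best_word = max(probs, key=probs.get)
```

During pre-training, the model is rewarded for assigning high probability to the original word at each masked position.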
In addition to breaking a number of records for handling language-based tasks, including its performance on the Stanford Question Answering Dataset, BERT also substantially reduced the cost and complexity of training language models. As Google stated in its blog post, “With this release, anyone in the world can train their own state-of-the-art question answering system (or a variety of other models) in about 30 minutes on a single Cloud TPU, or in a few hours using a single GPU”.
To implement a classification task such as sentiment analysis (categorizing phrases by the primary sentiment that they express), you just need to add a classification layer on top of the Transformer output.
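As a rough sketch of what “adding a classification layer” means: the Transformer produces a vector summarizing the input (in BERT, the output for the special [CLS] token), and a single linear layer plus softmax turns that vector into label probabilities. The vector, weights, and labels below are invented for illustration; in practice the weights are learned during fine-tuning.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(pooled_vector, weights, bias):
    """One linear layer followed by softmax, applied to the Transformer output."""
    logits = [
        sum(w * x for w, x in zip(row, pooled_vector)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)

LABELS = ["negative", "positive"]

# Made-up stand-in for the Transformer output (a real one has hundreds of dimensions).
pooled_vector = [0.2, -1.0, 0.5]

# Made-up weights: one row per label. Fine-tuning would learn these.
weights = [[0.1, 0.8, -0.3], [-0.2, -0.9, 0.4]]
bias = [0.0, 0.0]

probs = classify(pooled_vector, weights, bias)
predicted = LABELS[probs.index(max(probs))]
```

Only this small layer is new; the rest of the network starts from the pre-trained weights, which is why fine-tuning needs comparatively little data.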
For question answering tasks where you have to map a question to the answer in a larger body of text, you add two extra vectors for the start point and the end point of the answer for any given question in the text.
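The two extra vectors can be illustrated as follows: each token's output vector is scored against a start vector and an end vector, and the answer is the span whose start score plus end score is highest, with the end not before the start. This is a minimal sketch with made-up two-dimensional vectors; the function name best_span and the max_len cap are assumptions for illustration.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def best_span(token_vectors, start_vec, end_vec, max_len=10):
    """Return (start, end) indices of the highest-scoring answer span."""
    start_scores = [dot(v, start_vec) for v in token_vectors]
    end_scores = [dot(v, end_vec) for v in token_vectors]
    best = None
    for i, s in enumerate(start_scores):
        # The end index may not precede the start index.
        for j in range(i, min(i + max_len, len(end_scores))):
            score = s + end_scores[j]
            if best is None or score > best[0]:
                best = (score, i, j)
    return best[1], best[2]

tokens = ["harbin", "is", "the", "capital", "of", "heilongjiang", "province"]

# Made-up token outputs: dimension 0 loosely means "looks like an answer start",
# dimension 1 "looks like an answer end". Real outputs are learned, not hand-set.
token_vectors = [
    [0.1, 0.0], [0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
    [0.0, 0.1], [0.9, 0.1], [0.1, 0.8],
]
start_vec = [1.0, 0.0]  # learned during fine-tuning in a real model
end_vec = [0.0, 1.0]    # likewise

start, end = best_span(token_vectors, start_vec, end_vec)
answer = tokens[start:end + 1]
```

For the question “What is Harbin the capital of?”, these invented scores pick out the span “heilongjiang province”.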
For Named Entity Recognition (NER - identifying specific entities such as people, companies or products), the model can be trained by feeding the output vector of each token into a classification layer that predicts the NER label - so it’s just another classifier. The bottom line is that even with a small data set and a limited budget and experience, with BERT you can create a state-of-the-art NLP model in a very small amount of time.
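The NER case differs from sentence classification mainly in that the classifier runs once per token rather than once per input. A minimal sketch, again with made-up vectors, weights, and a deliberately tiny label set:

```python
LABELS = ["O", "B-PER", "B-ORG"]  # simplified tag set for illustration

def tag_tokens(token_vectors, weights, bias):
    """Apply the same linear classifier to every token's output vector."""
    tags = []
    for vec in token_vectors:
        logits = [
            sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(weights, bias)
        ]
        tags.append(LABELS[logits.index(max(logits))])
    return tags

tokens = ["peter", "joined", "google"]

# Made-up token outputs: dimension 0 ~ "person-like", dimension 1 ~ "company-like".
token_vectors = [[0.9, 0.1], [0.1, 0.1], [0.0, 0.8]]

# Made-up weights, one row per label; fine-tuning would learn these.
weights = [[-1.0, -1.0], [1.0, 0.0], [0.0, 1.0]]
bias = [0.5, 0.0, 0.0]

tags = tag_tokens(token_vectors, weights, bias)
```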
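There are a couple of weaknesses in the way BERT operates. Because it predicts each masked word independently of the other masked words, it doesn’t learn as much as it could from the training data. And because the artificial [MASK] token appears during pre-training but never in the real text seen during fine-tuning, there is a mismatch between the two stages that can reduce the effectiveness of fine-tuning.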
In June 2019, researchers from Carnegie Mellon University and the Google Brain team published the XLNet paper. XLNet avoids the issues that BERT suffers from by using a technique called “permutation language modeling”. In permutation language modeling, the model is trained to predict one token given the preceding context, like a traditional language model, but instead of predicting the tokens sequentially, it predicts them in a random order. The bottom line is that XLNet outperformed BERT on a number of key NLP tasks and advanced the state of the art.
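The idea of predicting tokens in a random order can be sketched without any neural network. The snippet below samples one prediction order and, for each step, lists which positions the model is allowed to see. Tokens keep their original positions; only the order of prediction changes. The function name permutation_contexts is hypothetical, and this shows only the ordering logic, not XLNet's actual architecture.

```python
import random

def permutation_contexts(tokens, seed=0):
    """Sample one random prediction order.

    Returns (order, steps), where each step is (target_position, visible_positions):
    the target is predicted using only the positions that came earlier in the order.
    """
    rng = random.Random(seed)
    order = list(range(len(tokens)))
    rng.shuffle(order)
    steps = []
    for k, pos in enumerate(order):
        visible = sorted(order[:k])
        steps.append((pos, visible))
    return order, steps

tokens = ["new", "york", "is", "a", "city"]
order, steps = permutation_contexts(tokens, seed=42)

for target, visible in steps:
    context = [tokens[i] for i in visible]
    print(f"predict {tokens[target]!r} at position {target} given {context}")
```

Across many sampled orders, each token ends up being predicted from context on both its left and its right, without any [MASK] token being inserted into the input.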
Completing the lineup
Not to be outdone (in either computational effectiveness or Sesame Street references), in March 2019, the Baidu Research team unveiled ERNIE, following up with ERNIE 2.0 in July of 2019. ERNIE stands for the slightly convoluted Enhanced Representation through kNowledge IntEgration. It brings together many of the concepts used by BERT, but also matches information on semantic elements from other resources such as encyclopedias, news outlets, and online forums. Knowing that, for example, Harbin is the capital of Heilongjiang Province in China and that Harbin is a city which gets ice and snow in the winter, it can do a better job of performing many NLP tasks when compared to a model like BERT that limits its knowledge of the world to the text it is being trained on. While some of the drivers of the ERNIE approach were designed to deal with the unique challenges of working with the Chinese language, ERNIE 2.0 appears to outperform both BERT and XLNet in a number of key NLP tasks in both Chinese and English.
We’re in a period of rapid change in the field of NLP. In less than 18 months there have been at least four substantial breakthroughs in pre-trained deep learning solutions, and there’s no reason to believe that there won’t be more on the way.
At the moment, it still takes some time to download the source code, get everything running with TensorFlow, add the final layer(s) to the network, and train it with your data set. But it’s pretty clear that as the field matures, the barriers to entry in performing NLP are going to drop and the quality of results is going to continue to increase - especially for small data sets.
Head of Data Science
Peter is a veteran technologist, CTO, entrepreneur, and longtime educator, having taught digital literacy at Columbia and authored numerous programming books.