Trends in input representation for state-of-the-art NLP models (2019)

Ajit Rajasekharan
7 min read · Mar 23, 2019

The most natural and intuitive way to represent words when they are input to a language model (or any NLP task model) is to represent them as they are: each word as a single unit.

For example, if we are training a language model on a corpus, we would traditionally represent each word as a vector and have the model learn word embeddings, that is, values for each dimension of that vector. At test time, given a new sentence, the language model can then compute how likely that sentence is using the learned word embeddings. However, when we represent words as single units, we may come across a word at test time that we never saw at training time. We are then forced to treat it as an out-of-vocabulary (OOV) word, which hurts model performance. The OOV problem exists for any NLP model that represents inputs as words.
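To make the OOV problem concrete, here is a minimal sketch of a word-level vocabulary. The toy corpus, the `<unk>` placeholder token, and the helper names `build_vocab` and `encode` are all assumptions made for illustration, not part of any particular library.

```python
from collections import Counter

UNK = "<unk>"  # hypothetical placeholder token for unseen words


def build_vocab(sentences):
    """Assign an integer id to every word seen in the training sentences."""
    counts = Counter(word for s in sentences for word in s.split())
    vocab = {UNK: 0}  # id 0 is reserved for out-of-vocabulary words
    for word in counts:
        vocab[word] = len(vocab)
    return vocab


def encode(sentence, vocab):
    """Map each word to its id, falling back to the <unk> id when unseen."""
    return [vocab.get(word, vocab[UNK]) for word in sentence.split()]


# Toy training corpus (illustrative only)
train_sentences = ["the cat sat on the mat", "the dog sat on the rug"]
vocab = build_vocab(train_sentences)

# A test sentence containing a word that never appeared in training
test_sentence = "the hamster sat on the mat"
ids = encode(test_sentence, vocab)
oov_words = [w for w in test_sentence.split() if w not in vocab]

print(ids)
print(oov_words)
```

In this sketch every unseen word collapses to the same id, so the model cannot tell "hamster" apart from any other unknown word. That loss of information is what impacts performance.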

One solution to this problem is to treat the input as individual characters. This approach has been shown to yield good results for many NLP tasks, despite the additional computation and memory requirements that character-level input processing introduces (we have to train on long sequences at the character level and backpropagate gradients through time across those long sequences too). More importantly, a recent comparison (2018) of character-based language models has shown that they do not perform as well as word-based language models on large corpora.
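The trade-off can be sketched as follows: with character-level input the unseen word is no longer a problem, but the sequence the model has to process becomes much longer. The toy corpus and the helper names `build_char_vocab` and `encode_chars` are assumptions for illustration. The sketch also assumes that every character in the test sentence was seen in training, otherwise the lookup raises a `KeyError`.

```python
def build_char_vocab(sentences):
    """Assign an integer id to every character seen in the training sentences."""
    chars = sorted({ch for s in sentences for ch in s})
    return {ch: i for i, ch in enumerate(chars)}


def encode_chars(sentence, char_vocab):
    """Map each character (including spaces) to its id."""
    return [char_vocab[ch] for ch in sentence]


# Toy training corpus (illustrative only)
train_sentences = ["the cat sat on the mat", "the dog sat on the rug"]
char_vocab = build_char_vocab(train_sentences)

# "hamster" never appears as a word in training, but all of its
# characters do, so the sentence can be encoded without an unknown token.
test_sentence = "the hamster sat on the mat"
char_ids = encode_chars(test_sentence, char_vocab)

word_len = len(test_sentence.split())  # sequence length at word level
char_len = len(char_ids)               # sequence length at character level

print(word_len, char_len)
```

The character vocabulary is tiny compared with a word vocabulary, but the input sequence here is several times longer than its word-level counterpart. That extra length is the source of the additional computation and memory cost.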
