Trends in input representation for state-of-the-art NLP models (2019)

The most natural/intuitive way to represent words when they are input to a language model (or any NLP task model) is to just represent words as they are — as a single unit.

For example, if we are training a language model on a corpus, we would traditionally represent each word as a vector and have the model learn word embeddings, that is, values for each dimension of that vector. At test time, given a new sentence, the language model can compute how likely that sentence is using those learnt word embeddings. However, when we represent words as single units, we may come across a word at test time that we never saw at training time. We are then forced to treat it as an out-of-vocabulary (OOV) word, which hurts model performance. The OOV problem exists for any NLP model that represents inputs as words.
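To make the OOV problem concrete, here is a minimal sketch of a word-level vocabulary. The `<unk>` placeholder, the function names, and the toy sentences are assumptions made for illustration; they do not come from any particular model.

```python
UNK = "<unk>"  # hypothetical placeholder token for words unseen in training


def build_vocab(sentences):
    """Map every whitespace-separated training word to an integer id."""
    vocab = {UNK: 0}
    for sentence in sentences:
        for word in sentence.split():
            # Assign the next free id only if the word is new.
            vocab.setdefault(word, len(vocab))
    return vocab


def encode(sentence, vocab):
    """Replace each word by its id; unseen words collapse to the UNK id."""
    return [vocab.get(word, vocab[UNK]) for word in sentence.split()]


train_sentences = ["the cat sat", "the dog sat"]  # toy training corpus
vocab = build_vocab(train_sentences)

# "bird" never appeared in training, so it is mapped to the UNK id 0.
test_ids = encode("the bird sat", vocab)
print(test_ids)
```

Every unseen word ends up sharing the single `<unk>` embedding, so whatever distinguishes "bird" from any other unknown word is lost to the model.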

One solution to this problem is to treat input as individual characters. This approach has been shown to yield good results for many NLP tasks, despite the additional computation and memory requirements that character-level input processing introduces (we have to train on long character-level sequences and backpropagate gradients through time across those long sequences). More importantly, a recent comparison (2018) has shown that character-based language models do not perform as well as word-based language models on large corpora.

The current trend to address this is a middle-ground approach: subword representations of words that balance the advantage of character-based models (avoiding OOV) against the advantages of word-based models (efficiency, performance). The image below shows how input is fed to these three categories of models.

Input representation differences across word, character, and subword based models

Below we examine two state-of-the-art language models, BERT (2018) and GPT-2 (2019). Both represent inputs as subwords but adopt slightly different approaches. The problem of how to represent input is not restricted to language models; other models these days adopt subword representations too. We look at language models specifically because, of late:

  • The unsupervised learning done by language models has been used for subsequent task-specific learning that needs less labeled data. BERT, for instance, has demonstrated state-of-the-art performance on a wide variety of NLP tasks such as Q&A and NER with comparatively less training data.
  • The recent (Feb 2019) transformer-based language model GPT-2 has demonstrated state-of-the-art performance on a range of NLP tasks without needing any task-specific labeled training data. The model was trained in an unsupervised manner on a large, diverse corpus.

GPT-2’s input representation using subwords

  • Given that the objective of a language model is to compute the probability of any string (and also to be able to generate strings), the GPT-2 training corpus does not undergo typical pre-processing steps such as conversion to lowercase, to avoid restricting the space of strings that can be modeled. A consequence of this decision is a larger word vocab, since uppercase and lowercase variants of the same word are both kept. The subword scheme still constrains the vocab size to around 50,000 (for the model that was released).
  • The basic idea behind GPT’s subword scheme is fairly simple. Start off with an initial vocabulary of just characters, treat the corpus as a space-separated character stream, and iteratively replace the most frequent pair of adjacent symbols “A B” with the new symbol “AB”. Each such merge operation produces a new symbol. At the end of this process, the vocabulary contains the initial single characters and the merged symbols.
  • For example, if we start off with a few words representing a toy corpus, and we do the merge operation sequence as shown below,

we end up with a vocab that contains mergeable symbol pairs such as ‘wi’, ‘i’, etc. The last line shows how the input words are represented with the subwords.

  • The figure below shows a sample of the “mergeable symbol pairs” from the released model’s vocab file (note the duplicates of the same word, “cons sole” and “Cons sole”, due to case preservation).
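The merge procedure described above can be sketched in a few lines of Python. This is a character-level illustration of byte pair encoding on a hypothetical toy corpus (the words, their counts, and the choice of four merges are assumptions), not the released model's tokenizer code.

```python
import re
from collections import Counter


def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by the frequency of each word."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Replace every standalone occurrence of 'A B' with the new symbol 'AB'."""
    # The lookarounds ensure we only match whole symbols, not symbol fragments.
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    merged = "".join(pair)
    return {pattern.sub(lambda m: merged, word): freq for word, freq in vocab.items()}


# Toy corpus: each word is a space-separated character stream with a count.
vocab = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}

merges = []
for _ in range(4):  # number of merge operations (chosen arbitrarily here)
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)  # most frequent pair; ties go to the first seen
    vocab = merge_pair(best, vocab)
    merges.append(best)

print(merges)
print(vocab)
```

After four merges, the frequent suffix "est" and the word "low" have become single symbols, while rarer character sequences stay split into smaller units. The number of merges is what bounds the final vocabulary size.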

BERT’s input representation using subwords, or wordpieces (as they are referred to in the BERT paper)

  • BERT’s large model has a vocab size of 30,522 subwords. That is, any input word is represented as a sequence of one or more of these 30,522 subwords. In contrast, if the input were represented as single-unit words, the vocab of a large corpus could easily run into the millions. If the input corpus were pre-processed to convert words like new york into phrases, the size could even be in the tens of millions. The subword representation significantly cuts down the vocab size, in addition to being able to represent any input word using subword units.
  • For instance, an input of the form

how does bert handle my name ajit rajasekharan or a phrase like quantum electrodynamics

gets converted into the following sequence

how does bert handle my name aj ##it raja ##se ##khar ##an or a phrase like quantum electro ##dy ##nami ##cs

The vocab file that comes with the large model does have words like “how”, “bert”, etc., and it has subwords such as “##khar” and “##nami”. The “##” prefix marks a piece that continues the preceding piece within the same word, which makes decoding unambiguous.
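A minimal sketch of how a word can be split against a fixed wordpiece vocabulary, and joined back using the “##” marker, is shown below. It uses greedy longest-match-first lookup, the strategy used by BERT's reference tokenizer; the tiny vocabulary and the function names are assumptions for illustration, and the real tokenizer also handles casing, punctuation, and very long words.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split one word into wordpieces by greedy longest-match-first lookup."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the marker
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


def detokenize(pieces):
    """Glue '##' pieces back onto the piece that precedes them."""
    words = []
    for piece in pieces:
        if piece.startswith("##") and words:
            words[-1] += piece[2:]
        else:
            words.append(piece)
    return " ".join(words)


# Hypothetical toy vocabulary, a tiny stand-in for the real vocab file.
toy_vocab = {"quantum", "electro", "##dy", "##nami", "##cs", "aj", "##it"}

pieces = wordpiece_tokenize("electrodynamics", toy_vocab)
print(pieces)
print(detokenize(pieces))
```

Because only continuation pieces carry the “##” marker, `detokenize` can always tell where a word starts, which is the sense in which decoding is unambiguous.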

  • The subword vocabulary is generated as follows: given a corpus and a desired vocabulary size D, the optimization problem is to select D wordpieces such that the resulting corpus is minimal in the number of wordpieces when segmented according to the chosen wordpiece model.
  • When we examine the vocab file of BERT’s model, despite the vocabulary being just 30K, the approach seems to have many tokens with the same prefix. This is perhaps just the nature of the way the optimization is framed.

Despite the different subword approaches adopted by these models, they share a common attribute: both are attention-based models.

Transfer learning scenarios using learned representations

In both the language model use cases above,

  • the input was tokenized into subwords, and
  • the model learnt embeddings for those subwords.
  • These learnt embeddings are then used to represent any input at test time: the input is first tokenized into subwords, and the learnt embeddings represent those subwords.
  • These learnt embeddings are also used in downstream tasks to represent input after it is tokenized into subwords. An example of this is a relation extraction model that uses subwords for representing input.
  • However, there may be tasks where we want to reconstitute the words from their subwords. Three choices are possible: simply add the subword vectors up (or average them), use a CNN, or use an LSTM. This work shows the last of these gives the best results for a NER word tagging task. However, it is also the slowest of the three, so we need to make the trade-off based on our needs.
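The simplest of the three reconstitution choices, averaging, can be sketched as follows. The sketch assumes BERT-style “##” continuation markers and uses made-up two-dimensional vectors in place of learnt embeddings; the function name is hypothetical.

```python
def average_subword_vectors(pieces, vectors):
    """Group '##' continuation pieces with the piece before them and
    return one averaged vector per reconstituted word."""
    groups = []
    for piece in pieces:
        if piece.startswith("##") and groups:
            groups[-1].append(piece)
        else:
            groups.append([piece])

    word_vectors = []
    for group in groups:
        dim = len(vectors[group[0]])
        # Element-wise mean over all subword vectors of this word.
        mean = [sum(vectors[p][i] for p in group) / len(group) for i in range(dim)]
        word_vectors.append(mean)
    return word_vectors


# Made-up embeddings; a real model would supply the learnt vectors.
toy_vectors = {"aj": [1.0, 0.0], "##it": [3.0, 2.0], "or": [0.5, 0.5]}

word_vectors = average_subword_vectors(["aj", "##it", "or"], toy_vectors)
print(word_vectors)
```

Averaging ignores the order of the subwords within a word. A CNN or an LSTM over the same groups would take order into account, at a higher computational cost.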


References

  • GPT-2 model. The example of subwords shown above is from the GPT model that was released. GPT-2 has not been released.
  • BERT paper
  • Wordpiece approach used in BERT
  • Byte pair encoding of subwords used in GPT models. GitHub link for the byte pair code
  • N-gram based representations have been used for a while to address the OOV issue and also to learn good word embeddings even for rare words. For example, fasttext allows users to specify a range of n-gram sizes to represent a corpus. The subword approaches outlined above are superior to fasttext-style n-gram methods because the vocab size is bounded. The vocab size in fasttext can be quite large, depending on the minimum and maximum n-gram sizes we choose. Most importantly, fasttext does not attempt to reduce the number of subwords the way, for instance, the wordpiece approach does.
  • Ways to train character based language models to address its computation and memory needs
  • Relation extraction model using subwords to represent input. This model also tags entities. It leverages existing labeled data for NER by converting input to subwords and labeling the subwords accordingly. For instance, if New York became “<N> <ew> <Y> <ork>” the labeling would be transformed from B_LOC I_LOC for “New York” to “B_LOC I_LOC I_LOC I_LOC”.
  • This paper compares word based, character based, fasttext, and subword based representation of input for an entity tagging task. Subword level representation performs better than the other approaches. It also consumes less memory than fasttext (11 MB vs 6 GB for fasttext in one test). The performance seems to be best when subwords are reconstituted with an LSTM instead of just summing the subword vectors. CNN based reconstitution seems to be a reasonable middle-ground trade-off for speed. Summing/averaging has the poorest performance but is the fastest of the three approaches. Blog post link
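The label conversion mentioned in the relation extraction entry above (B_LOC I_LOC becoming B_LOC I_LOC I_LOC I_LOC) can be sketched as below. The function name, the `split` callback, and the toy segmentation are assumptions for illustration; the cited model's actual preprocessing may differ.

```python
def expand_labels(words, labels, split):
    """Project word-level B_/I_/O tags onto subword tokens.

    The first subword of a word keeps the word's tag; every later subword
    of the same word gets the matching I_ tag (or stays O).
    """
    sub_tokens, sub_labels = [], []
    for word, label in zip(words, labels):
        for i, piece in enumerate(split(word)):
            sub_tokens.append(piece)
            if i == 0 or label == "O":
                sub_labels.append(label)
            else:
                entity_type = label.split("_", 1)[1]  # e.g. "LOC"
                sub_labels.append("I_" + entity_type)
    return sub_tokens, sub_labels


# Hypothetical segmentation of the two words used in the example above.
toy_split = {"New": ["N", "ew"], "York": ["Y", "ork"]}

tokens, tags = expand_labels(
    ["New", "York"],
    ["B_LOC", "I_LOC"],
    lambda word: toy_split.get(word, [word]),
)
print(tokens)
print(tags)
```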
