GPT-2: A nascent transfer learning method that could eliminate supervised learning in some NLP tasks

Many of the current state-of-the-art models for supervised NLP tasks are models pre-trained on language modeling (an unsupervised task) and then fine-tuned (supervised) with labeled data specific to the task.
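To make that recipe concrete, here is a minimal sketch of the pretrain-then-fine-tune pipeline, assuming the Hugging Face transformers library (which postdates this post); the model name, task, and hyperparameters are illustrative choices, not the setup of any particular paper.

```python
# A minimal sketch of the pretrain-then-fine-tune recipe, assuming the
# Hugging Face `transformers` library. Model name, labels, and learning
# rate are illustrative assumptions, not any paper's exact setup.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Pretrained (unsupervised) weights plus a freshly initialized output layer.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)
model.train()

# A tiny labeled batch standing in for task-specific supervised data.
texts = ["great movie", "terrible plot"]
labels = torch.tensor([1, 0])
batch = tokenizer(texts, padding=True, return_tensors="pt")

# One supervised fine-tuning step: all pretrained parameters get updated.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss = model(**batch, labels=labels).loss
loss.backward()
optimizer.step()
```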

Fig 1. Transfer learning to downstream tasks started around 2013 with context-independent word vectors from unsupervised bag-of-words models (word2vec), moved on to context-dependent word vectors from sequence models (ELMo), and has now reached the direct use of trained transformer blocks with an additional output layer stacked on top for task-specific fine-tuning.

A question that naturally arises is: why, and how, does unsupervised learning through language modeling boost the performance of supervised models?

A paper published a week ago (~14 Feb 2019) offers some insight into this question.

  • What does this model do?

Fig 2. The GPT-2 paper, published last week, attempts to go beyond BERT in transfer learning, largely as a research initiative. GPT-2 is an unsupervised language model trained on a large, diverse corpus and used as is for downstream tasks, with no architecture change, no parameter updates, and, most importantly, no task-specific labeled data. The results are promising, but it remains to be seen whether the performance numbers will reach the levels attainable with BERT-style supervised fine-tuning on the same tasks.

  • How is this language model used as is for supervised tasks?

Fig 3. Results of using the GPT-2 model on supervised tasks, as is, without any architecture change or parameter updates.

• Even though, as mentioned earlier, the model's performance on supervised tasks is not at usable levels, it explains, at least in part, why unsupervised language models like BERT boost the performance of downstream supervised tasks with very little fine-tuning.

Below are test results on the released pre-trained model (GPT-2 with 117M parameters). This is the lowest-performing configuration on all the supervised tasks in the figure above, but it is useful for getting a sense of how supervised tasks can be recast as language-modeling problems.

All the tests below use the language model in conditional mode: we feed it some input, and the model conditions on that input to generate its output. The input we feed has patterns specific to the task being tested, which primes the model to generate output conditioned on those patterns.
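As a concrete sketch of this conditional mode, the pattern is simply prompt in, continuation out. The snippet below uses the 117M checkpoint through the Hugging Face transformers library as a convenience (the tests in this post were run with OpenAI's released code); the sampling settings are assumptions.

```python
# A hedged sketch of "conditional mode": feed a task-shaped prompt and let
# the model continue it. Uses the 117M GPT-2 checkpoint via Hugging Face
# `transformers`; sampling settings are assumptions.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")  # the 117M-parameter model
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def generate_continuation(prompt, max_new_tokens=60):
    """Condition the LM on `prompt` and sample a continuation."""
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(
            input_ids,
            max_length=input_ids.shape[1] + max_new_tokens,
            do_sample=True,
            top_k=40,  # top-k sampling, similar to the released sampler
        )
    # Return only the newly generated tokens, not the prompt.
    return tokenizer.decode(output_ids[0][input_ids.shape[1]:])

# Q&A framing: the trailing "A:" primes the model to emit an answer.
print(generate_continuation(
    "Tom and Jerry wanted to have a race. Jerry won the race.\n"
    "Q: Who won the race?\nA:"
))
```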

Example 1: Q & A task

Input:

Model prompt >>> John and Mary went to park after dinner with their dog. They walked around and came home. When they got back they realized their dog was not with them.

Q: What happened to the dog?

A: It was lost.

Q: Where is the dog?

A:

Output:

It is in the woods.

Q: What happened to the dog?

A: It was found by a neighbor.

Q: What happened to the dog?

A: The dog was found by a friend.

Q: Where is the dog?

A: The dog was found by a friend.

Q: Where did you get the dog?

A: It was found by the family.

Q: What happened to the dog?

A: It was found by the family.

Notes: In the response above, the model outputs the answer "It is in the woods", primed by the token "A:", and then continues to generate its own question-and-answer pairs. Some are repetitive. Answers such as "it is in the woods" and "it was found by a neighbor" do not come from the input passage at all, almost mimicking a model that has a "true world model".

Example 2: Q & A task

Input:

Model prompt >>> Cathy went to the garden with a pail. While she was watering her plants it started to rain and she rushed back in.

Q: Where did Cathy go?

A: She went to the garden.

Q: What did she do there ?

A: She watered plants

Q: Where is the pail?

A:

Output:

It was in the garden

Q: Where did she go?

A: She went to the garden.

Q: What did Cathy do there ?

A: She went to the garden

Q: What did Cathy do there ?

A: Her pail

Q: What did Cathy do there?

A: Her pail

Q: What did Cathy do there ?

A: She went to the garden.

Example 3: Q & A task

Input:

Model prompt >>>Tom and Jerry wanted to have a race. Percy joined them too. They raced to the end of the street. Jerry won the race. Tom was mad.

Q: What did Tom and Jerry do?

A: They raced.

Q: Who joined them? A:

Output:

Tom.

Q: Who was Tom?

A: Jerry.

Q: Who was Jerry?

A: Jerry’s father.

Q: What did Tom do?

A: He ran.

Q: What did Jerry do?

A: He ran.

Who was Tom?

A: Jerry’s father’s son.

Q: Who was Tom?

A: Jerry’s father’s brother.

Q: Who was Tom?

A: Jerry’s brother.

Q: Who was Tom?

A: Jerry’s father.

Q: Who was Tom?

Note: This generated sequence is comical but still not entirely off-context.

Example 4: Long-range dependency test (the LAMBADA test): predicting the final word of a passage with at least 50 tokens.

Input:

Model prompt >>> Yes, I thought I was going to lose the baby. I was scared too, he stated, sincerity flooding his eyes. You were ? Yes, of course. Why do you even ask? This baby wasn’t exactly planned for. Do you honestly think that I would want you to have a

Output:

baby? I don’t know, I don’t know. I just want to be with you. I don’t know if you want to be with me or not….

Example 5: Long-range dependency test (the LAMBADA test): predicting the final word of a passage with at least 50 tokens.

Input:

Model prompt >>> Preston had been the last person to wear those chains, and I knew what I’d see and feel if they were slipped onto my skin-the Reaper’s unending hatred of me. I’d felt enough of that emotion already in the amphitheater. I didn’t want to feel anymore. “Don’t put those on me,” I whispered. “Please.” Sergei looked at me, surprised by my low, raspy please, but he put down the

Output:

chains. I felt a chill run through my body. I didn’t want to feel it. I didn’t want to be alone. I didn’t want to feel the Reaper. I didn’t want to feel the Reaper.

Notes: The repetitive output is a common behavior. It is not clear whether this is a characteristic unique to this low-capacity model.
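For reference, a LAMBADA-style check does not need free-running generation at all: one can inspect the model's next-token distribution at the end of the passage. Here is a hedged sketch, reusing the model and tokenizer loaded earlier; the paper's actual evaluation scores the full final word, which may span several tokens, so this is a simplification.

```python
import torch

def next_token_guess(context):
    """Return the model's single most likely next token for `context`."""
    input_ids = tokenizer.encode(context, return_tensors="pt")
    with torch.no_grad():
        logits = model(input_ids).logits  # shape: (1, seq_len, vocab)
    next_id = int(logits[0, -1].argmax())
    return tokenizer.decode([next_id])

# The hoped-for completion here is " baby".
print(next_token_guess(
    "Do you honestly think that I would want you to have a"
))
```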

Example 6. Winograd schema challenge

Input:

Model prompt >>> The man could not lift his son because he was so

Output:

weak, and he could not lift his daughter because she was too weak to lift him.

Example 7. Winograd schema challenge

Input:

Frank was jealous when Bill said that he was the winner of the competition. Who was the winner?

Output:

“He was the winner of the competition. I’m the winner of the competition. I’m the winner of the competition. I’m the winner of the competition. I’m the winner of the competition. I’m the winner of the competition.”

The winner was a former football player who was a member of the team. He was also the first player to be named in the Football League’s Football League of England’s Player of the Year award.

Notes: This clearly didn't work, possibly because this is the small-capacity model. The paper reports state-of-the-art results on this test (about 70%), though the test set is small.

Example 8. Winograd schema challenge

Input:

Model prompt >>> The trophy did not fit in the box because it was too small. What was too small?

Output:

The trophy was too small.

The trophy did not fit in the box because it was too small. What was too small?

The trophy did not fit in the box because it was too small. What was too small?

The trophy did not fit in the box because it was too small. What was too small?

Notes: The model fails on this one.

Input:

Model prompt >>> The trophy did not fit in the box because it was too big. What was too big?

Output:

It’s a little bigger than the box, so I’m not sure if it’s a problem.

Rated 5 out of 5 by Anonymous from Great product I bought this to help with the size and shape of the box. I have a small child and I was worried that it would not fit in the box. I bought this to make sure that I didn’t accidentally put it in the wrong box. It fits in the box perfectly and I’m very happy with the quality. I will definitely be buying more.

Rated 5 out of 5 by Anonymous from Great product I bought the product to help with the size and shape of the box. It fits perfectly and I’m very happy with the quality. I will definitely be buying more.

Notes: This is an amusing response; the model drifts into what looks like memorized product-review text.
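It is worth noting that the paper does not evaluate Winograd schemas by free generation, as done here; it compares the model's probability for each candidate resolution of the pronoun. A rough sketch of that idea, reusing the model loaded earlier (the paper's full versus partial scoring details are simplified away):

```python
import torch
import torch.nn.functional as F

def sentence_logprob(sentence):
    """Sum of per-token log-probabilities of `sentence` under the LM."""
    input_ids = tokenizer.encode(sentence, return_tensors="pt")
    with torch.no_grad():
        logits = model(input_ids).logits
    log_probs = F.log_softmax(logits[0, :-1], dim=-1)  # predicts token i+1
    targets = input_ids[0, 1:]
    return log_probs[torch.arange(len(targets)), targets].sum().item()

# Substitute each candidate referent for the pronoun and keep the likelier one.
candidates = [
    "The trophy did not fit in the box because the trophy was too big.",
    "The trophy did not fit in the box because the box was too big.",
]
print(max(candidates, key=sentence_logprob))
```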

Example 9. Summarization

Input:

norway delivered a diplomatic protest to russia on monday after three norwegian fisheries research expeditions were barred from russian waters . the norwegian research ships were to continue an annual program of charting fish resources shared by the two countries in the barents sea region TL;DR:

Output:

Norwegian fishing boat was caught in a fishing accident in the Barents Sea.

The Norwegian fishing vessel was caught in a fishing accident in the barents sea.

Norway’s Fisheries Research Program (FPR) was suspended on Tuesday after a Norwegian ship carrying a Norwegian fishing boat was caught in a fishing accident in the barents sea region TL;DR: Norway’s Fisheries Research Program was suspended on Tuesday after a Norwegian ship carrying a Norwegian fishing boat was caught in a fishing accident in the barents sea region.

The Norwegian fishing vessel was caught in a fishing accident in the barents sea.

The Norwegian fishing vessel was caught in a fishing accident in the barents sea.

The Norwegian fishing vessel was caught in a fishing accident in the barents sea.

Notes: More imaginative than abstractive or extractive. Again, this could be because of low model capacity, though even the full GPT-2 did not do well on summarization compared to supervised models.
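The "TL;DR:" suffix is the entire trick here. With the generate_continuation helper sketched earlier, the summarization attempt is just:

```python
article = ("norway delivered a diplomatic protest to russia on monday "
           "after three norwegian fisheries research expeditions were "
           "barred from russian waters .")
# Appending "TL;DR:" primes the model to produce a summary-like continuation.
print(generate_continuation(article + " TL;DR:", max_new_tokens=40))
```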

Example 10. Translation

Input:

How are you = Wie geht es dir . Where are you going =

Output:

Wie wie wie ?”

“I am going to go to bed.”

“I am going to sleep.”

“You are going to sleep.”

“I am going to sleep.”

Notes: Starts off in German but then falls into a repetitive loop. This could also be due to small model capacity.
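Translation relies on the same induction trick: a worked "english = german" pair primes the model to continue the pattern. Again via the earlier helper:

```python
# One worked translation pair, then an unfinished one for the model to complete.
print(generate_continuation(
    "How are you = Wie geht es dir. Where are you going =",
    max_new_tokens=15,
))
```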

References

  • Blog post from OpenAI on the paper

Originally published at qr.ae.
