GPT-2: A nascent transfer learning method that could eliminate supervised learning in some NLP tasks
Many of the current state-of-the-art models for supervised NLP tasks are first pre-trained on language modeling (an unsupervised task) and then fine-tuned (supervised) on labeled data specific to the task.
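To make the two-stage recipe concrete, here is a minimal PyTorch sketch of the pattern. The names (TinyBackbone, the toy dimensions, the LSTM stand-in, the 2-class head) are illustrative assumptions, not the actual GPT-2 architecture or training code.

```python
import torch
import torch.nn as nn

class TinyBackbone(nn.Module):
    """Toy stand-in for the pre-trained network (GPT-2 uses stacked
    transformer blocks; an embedding + LSTM keeps the sketch short)."""
    def __init__(self, vocab_size=1000, d_model=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.rnn = nn.LSTM(d_model, d_model, batch_first=True)

    def forward(self, tokens):                     # tokens: (batch, seq_len) ids
        hidden, _ = self.rnn(self.embed(tokens))
        return hidden                              # (batch, seq_len, d_model)

backbone = TinyBackbone()

# Stage 1 -- unsupervised pre-training: predict the next token of unlabeled text.
lm_head = nn.Linear(64, 1000)
lm_opt = torch.optim.Adam(list(backbone.parameters()) + list(lm_head.parameters()))
unlabeled = torch.randint(0, 1000, (8, 20))        # stand-in for a large unlabeled corpus
lm_opt.zero_grad()
logits = lm_head(backbone(unlabeled[:, :-1]))      # predict token t+1 from tokens <= t
lm_loss = nn.functional.cross_entropy(logits.reshape(-1, 1000),
                                      unlabeled[:, 1:].reshape(-1))
lm_loss.backward()
lm_opt.step()

# Stage 2 -- supervised fine-tuning: reuse the pre-trained backbone, add a task head.
task_head = nn.Linear(64, 2)                       # e.g. a 2-class sentiment head
ft_opt = torch.optim.Adam(list(backbone.parameters()) + list(task_head.parameters()))
labeled_x = torch.randint(0, 1000, (4, 20))        # small labeled, task-specific dataset
labeled_y = torch.randint(0, 2, (4,))
ft_opt.zero_grad()
task_logits = task_head(backbone(labeled_x)[:, -1])  # classify from the last hidden state
ft_loss = nn.functional.cross_entropy(task_logits, labeled_y)
ft_loss.backward()
ft_opt.step()
```

The point of the pattern is that the weights learned in stage 1 from plain unlabeled text are carried over into stage 2, so the labeled data only has to teach a small output layer (plus some adjustment of the backbone) rather than a whole model from scratch.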
Fig 1. Transfer learning to downstream tasks began around 2013 with context-independent word vectors from unsupervised bag-of-words models (word2vec), moved to context-dependent word vectors from sequence models (ELMo), and has now reached the direct reuse of pre-trained transformer blocks with an additional output layer stacked on top for task-specific fine-tuning.
A question that naturally arises is:
Why/how does unsupervised learning through language modeling
- boost the performance of supervised models?
- reduce the amount of labeled data required to fine-tune them?
A paper published a week ago (~14 Feb 2019) offers some insights into these questions.
- What does this model do?
- It is trained on a large, diverse corpus (roughly equivalent to a Common Crawl subset) in an unsupervised manner. The training is the standard language-modeling approach: predict the next word, given the words seen so far (a minimal sketch follows below). A language model once…
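The objective above is exactly this next-word prediction; applied repeatedly, the same prediction step lets a trained model extend a piece of text one word at a time. Here is a minimal sketch, where ToyLM and greedy_generate are illustrative names rather than anything from the paper; GPT-2 itself stacks transformer blocks and typically samples from the predicted distribution instead of taking the single most likely word.

```python
import torch
import torch.nn as nn

class ToyLM(nn.Module):
    """Stand-in language model: embedding -> LSTM -> vocabulary logits."""
    def __init__(self, vocab_size=100, d_model=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.rnn = nn.LSTM(d_model, d_model, batch_first=True)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):                     # tokens: (batch, seq_len) ids
        hidden, _ = self.rnn(self.embed(tokens))
        return self.head(hidden)                   # (batch, seq_len, vocab_size)

@torch.no_grad()
def greedy_generate(model, prefix_ids, steps=10):
    """'Predict the next word, given the words seen so far', repeatedly."""
    tokens = list(prefix_ids)
    for _ in range(steps):
        logits = model(torch.tensor([tokens]))     # re-feed everything seen so far
        tokens.append(int(logits[0, -1].argmax())) # append the most likely next word
    return tokens

# With an untrained ToyLM this produces arbitrary ids; after language-model
# training, the same loop continues a prompt with plausible text.
print(greedy_generate(ToyLM(), prefix_ids=[5, 17, 42]))
```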