GPT-2: A nascent transfer learning method that could eliminate supervised learning in some NLP tasks

Many of the current state-of-the-art models for supervised NLP tasks are models pre-trained on language modeling (an unsupervised task) and then fine-tuned (supervised) with labeled data specific to the task.
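To make that recipe concrete, here is a minimal sketch of the pretrain-then-fine-tune pipeline, assuming the Hugging Face transformers library (which postdates this post); the model name, task, and hyperparameters are illustrative choices, not the setup of any particular paper.

```python
# A minimal sketch of the pretrain-then-fine-tune recipe, assuming the
# Hugging Face `transformers` library. Model name, labels, and learning
# rate are illustrative assumptions, not any paper's exact setup.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Pretrained (unsupervised) weights plus a freshly initialized output layer.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)
model.train()

# A tiny labeled batch standing in for task-specific supervised data.
texts = ["great movie", "terrible plot"]
labels = torch.tensor([1, 0])
batch = tokenizer(texts, padding=True, return_tensors="pt")

# One supervised fine-tuning step: all pretrained parameters get updated.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss = model(**batch, labels=labels).loss
loss.backward()
optimizer.step()
```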

Fig 1. Transfer learning to downstream tasks started around 2013 with context-independent word vectors from unsupervised bag-of-words models (word2vec), moved on to context-dependent word vectors from sequence models (ELMo), and has now reached the direct use of trained transformer blocks with an additional output layer stacked on top for task-specific fine-tuning.

A question that naturally arises is: why, and how, does unsupervised learning through language modeling boost the performance of supervised models?

A paper published a week ago (~14 Feb 2019) offers some insight into this question.

  • What does this model do?

Fig 2. The GPT-2 paper, published last week, attempts to go beyond BERT in transfer learning, largely as a research initiative. GPT-2 is an unsupervised language model trained on a large, diverse corpus and used as is for downstream tasks, with no architecture change, no parameter updates, and, most importantly, no task-specific labeled data. The results are promising, but it remains to be seen whether the performance numbers will reach the levels attainable with BERT-style supervised fine-tuning on the same tasks.

  • How is this language model used as is for supervised tasks?

Fig 3. Results of using the GPT-2 model on supervised tasks, as is, without any architecture change or parameter updates.

• Even though, as mentioned earlier, the model's performance on supervised tasks is not at usable levels, it explains, at least in part, why unsupervised language models like BERT boost the performance of downstream supervised tasks with very little fine-tuning.

Below are test results on the released pre-trained model (GPT-2 with 117M parameters). This is the lowest-performing configuration on all the supervised tasks in the figure above, but it is useful for getting a sense of how supervised tasks can be recast as language-modeling problems.

All the tests below use the language model in conditional mode: we feed it some input, and the model conditions on that input to generate its output. The input we feed has patterns specific to the task being tested, which primes the model to generate output conditioned on those patterns.
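As a concrete sketch of this conditional mode, the pattern is simply prompt in, continuation out. The snippet below uses the 117M checkpoint through the Hugging Face transformers library as a convenience (the tests in this post were run with OpenAI's released code); the sampling settings are assumptions.

```python
# A hedged sketch of "conditional mode": feed a task-shaped prompt and let
# the model continue it. Uses the 117M GPT-2 checkpoint via Hugging Face
# `transformers`; sampling settings are assumptions.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")  # the 117M-parameter model
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def generate_continuation(prompt, max_new_tokens=60):
    """Condition the LM on `prompt` and sample a continuation."""
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(
            input_ids,
            max_length=input_ids.shape[1] + max_new_tokens,
            do_sample=True,
            top_k=40,  # top-k sampling, similar to the released sampler
        )
    # Return only the newly generated tokens, not the prompt.
    return tokenizer.decode(output_ids[0][input_ids.shape[1]:])

# Q&A framing: the trailing "A:" primes the model to emit an answer.
print(generate_continuation(
    "Tom and Jerry wanted to have a race. Jerry won the race.\n"
    "Q: Who won the race?\nA:"
))
```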

Example 1: Q & A task

Input:

Model prompt >>> John and Mary went to park after dinner with their dog. They walked around and came home. When they got back they realized their dog was not with them.

Q: What happened to the dog?

A: It was lost.

Q: Where is the dog?

A:

Output:

It is in the woods.

Q: What happened to the dog?

A: It was found by a neighbor.

Q: What happened to the dog?

A: The dog was found by a friend.

Q: Where is the dog?

A: The dog was found by a friend.

Q: Where did you get the dog?

A: It was found by the family.

Q: What happened to the dog?

A: It was found by the family.

Notes: In the response above, the model outputs the answer "It is in the woods", primed by the token "A:", and then continues to generate its own question-and-answer pairs. Some are repetitive. Answers such as "it is in the woods" and "it was found by a neighbor" do not come from the input passage at all, almost mimicking a model that has a "true world model".

Example 2: Q & A task

Input:

Model prompt >>> Cathy went to the garden with a pail. While she was watering her plants it started to rain and she rushed back in.

Q: Where did Cathy go?

A: She went to the garden.

Q: What did she do there ?

A: She watered plants

Q: Where is the pail?

A:

Output:

It was in the garden

Q: Where did she go?

A: She went to the garden.

Q: What did Cathy do there ?

A: She went to the garden

Q: What did Cathy do there ?

A: Her pail

Q: What did Cathy do there?

A: Her pail

Q: What did Cathy do there ?

A: She went to the garden.

Example 3: Q & A task

Input:

Model prompt >>>Tom and Jerry wanted to have a race. Percy joined them too. They raced to the end of the street. Jerry won the race. Tom was mad.

Q: What did Tom and Jerry do?

A: They raced.

Q: Who joined them? A:

Output:

Tom.

Q: Who was Tom?

A: Jerry.

Q: Who was Jerry?

A: Jerry’s father.

Q: What did Tom do?

A: He ran.

Q: What did Jerry do?

A: He ran.

Who was Tom?

A: Jerry’s father’s son.

Q: Who was Tom?

A: Jerry’s father’s brother.

Q: Who was Tom?

A: Jerry’s brother.

Q: Who was Tom?

A: Jerry’s father.

Q: Who was Tom?

Note: This generated sequence is comical but still not entirely off-context.

Example 4: Long-range dependency test (the LAMBADA test): predicting the final word of a passage with at least 50 tokens.

Input:

Model prompt >>> Yes, I thought I was going to lose the baby. I was scared too, he stated, sincerity flooding his eyes. You were ? Yes, of course. Why do you even ask? This baby wasn’t exactly planned for. Do you honestly think that I would want you to have a

Output:

baby? I don’t know, I don’t know. I just want to be with you. I don’t know if you want to be with me or not….

Example 5: Long-range dependency test (the LAMBADA test): predicting the final word of a passage with at least 50 tokens.

Input:

Model prompt >>> Preston had been the last person to wear those chains, and I knew what I’d see and feel if they were slipped onto my skin-the Reaper’s unending hatred of me. I’d felt enough of that emotion already in the amphitheater. I didn’t want to feel anymore. “Don’t put those on me,” I whispered. “Please.” Sergei looked at me, surprised by my low, raspy please, but he put down the

Output:

chains. I felt a chill run through my body. I didn’t want to feel it. I didn’t want to be alone. I didn’t want to feel the Reaper. I didn’t want to feel the Reaper.

Notes: The repetitive output is a common behavior. It is not clear whether this is a characteristic unique to this low-capacity model.
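For reference, a LAMBADA-style check does not need free-running generation at all: one can inspect the model's next-token distribution at the end of the passage. Here is a hedged sketch, reusing the model and tokenizer loaded earlier; the paper's actual evaluation scores the full final word, which may span several tokens, so this is a simplification.

```python
import torch

def next_token_guess(context):
    """Return the model's single most likely next token for `context`."""
    input_ids = tokenizer.encode(context, return_tensors="pt")
    with torch.no_grad():
        logits = model(input_ids).logits  # shape: (1, seq_len, vocab)
    next_id = int(logits[0, -1].argmax())
    return tokenizer.decode([next_id])

# The hoped-for completion here is " baby".
print(next_token_guess(
    "Do you honestly think that I would want you to have a"
))
```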

Example 6. Winograd schema challenge

Input:

Model prompt >>> The man could not lift his son because he was so

Output:

weak, and he could not lift his daughter because she was too weak to lift him.

Example 7. Winograd schema challenge

Input:

Frank was jealous when Bill said that he was the winner of the competition. Who was the winner?

Output:

“He was the winner of the competition. I’m the winner of the competition. I’m the winner of the competition. I’m the winner of the competition. I’m the winner of the competition. I’m the winner of the competition.”

The winner was a former football player who was a member of the team. He was also the first player to be named in the Football League’s Football League of England’s Player of the Year award.

Notes: This clearly didn't work, possibly because this is the small-capacity model. The paper reports state-of-the-art results on this test (about 70%), though the test set is small.

Example 8. Winograd schema challenge

Input:

Model prompt >>> The trophy did not fit in the box because it was too small. What was too small?

Output:

The trophy was too small.

The trophy did not fit in the box because it was too small. What was too small?

The trophy did not fit in the box because it was too small. What was too small?

The trophy did not fit in the box because it was too small. What was too small?

Notes: The model fails on this one.

Input:

Model prompt >>> The trophy did not fit in the box because it was too big. What was too big?

Output:

It’s a little bigger than the box, so I’m not sure if it’s a problem.

Rated 5 out of 5 by Anonymous from Great product I bought this to help with the size and shape of the box. I have a small child and I was worried that it would not fit in the box. I bought this to make sure that I didn’t accidentally put it in the wrong box. It fits in the box perfectly and I’m very happy with the quality. I will definitely be buying more.

Rated 5 out of 5 by Anonymous from Great product I bought the product to help with the size and shape of the box. It fits perfectly and I’m very happy with the quality. I will definitely be buying more.

Notes: This is an amusing response; the model drifts into what looks like memorized product-review text.
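It is worth noting that the paper does not evaluate Winograd schemas by free generation, as done here; it compares the model's probability for each candidate resolution of the pronoun. A rough sketch of that idea, reusing the model loaded earlier (the paper's full versus partial scoring details are simplified away):

```python
import torch
import torch.nn.functional as F

def sentence_logprob(sentence):
    """Sum of per-token log-probabilities of `sentence` under the LM."""
    input_ids = tokenizer.encode(sentence, return_tensors="pt")
    with torch.no_grad():
        logits = model(input_ids).logits
    log_probs = F.log_softmax(logits[0, :-1], dim=-1)  # predicts token i+1
    targets = input_ids[0, 1:]
    return log_probs[torch.arange(len(targets)), targets].sum().item()

# Substitute each candidate referent for the pronoun and keep the likelier one.
candidates = [
    "The trophy did not fit in the box because the trophy was too big.",
    "The trophy did not fit in the box because the box was too big.",
]
print(max(candidates, key=sentence_logprob))
```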

Example 9. Summarization

Input:

norway delivered a diplomatic protest to russia on monday after three norwegian fisheries research expeditions were barred from russian waters . the norwegian research ships were to continue an annual program of charting fish resources shared by the two countries in the barents sea region TL;DR:

Output:

Norwegian fishing boat was caught in a fishing accident in the Barents Sea.

The Norwegian fishing vessel was caught in a fishing accident in the barents sea.

Norway’s Fisheries Research Program (FPR) was suspended on Tuesday after a Norwegian ship carrying a Norwegian fishing boat was caught in a fishing accident in the barents sea region TL;DR: Norway’s Fisheries Research Program was suspended on Tuesday after a Norwegian ship carrying a Norwegian fishing boat was caught in a fishing accident in the barents sea region.

The Norwegian fishing vessel was caught in a fishing accident in the barents sea.

The Norwegian fishing vessel was caught in a fishing accident in the barents sea.

The Norwegian fishing vessel was caught in a fishing accident in the barents sea.

Notes: More imaginative than abstractive or extractive. Again, this could be because of low model capacity, though even the full GPT-2 did not do well on summarization compared to supervised models.
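The "TL;DR:" suffix is the entire trick here. With the generate_continuation helper sketched earlier, the summarization attempt is just:

```python
article = ("norway delivered a diplomatic protest to russia on monday "
           "after three norwegian fisheries research expeditions were "
           "barred from russian waters .")
# Appending "TL;DR:" primes the model to produce a summary-like continuation.
print(generate_continuation(article + " TL;DR:", max_new_tokens=40))
```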

Example 10. Translation

Input:

How are you = Wie geht es dir . Where are you going =

Output:

Wie wie wie ?”

“I am going to go to bed.”

“I am going to sleep.”

“You are going to sleep.”

“I am going to sleep.”

Notes: Starts off in German but then falls into a repetitive loop. This could also be due to small model capacity.
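Translation relies on the same induction trick: a worked "english = german" pair primes the model to continue the pattern. Again via the earlier helper:

```python
# One worked translation pair, then an unfinished one for the model to complete.
print(generate_continuation(
    "How are you = Wie geht es dir. Where are you going =",
    max_new_tokens=15,
))
```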

References

  • Blog post from OpenAI on the paper

Originally published at qr.ae.
