Member-only story

Solving NER with BERT for any entity type with very little training data (compared to past approaches)

4 min readMay 20, 2019

One of the roadblocks to entity recognition for any entity type other than person, location, organization, disease, gene, drugs, and species is the absence of labeled training data.

BERT offers a solution that works in practice for entity recognition of a custom type with very little labeled data — sometimes even about 300 examples of labeled data may suffice to get a first cut working solution.

This is possible because we can leverage off the model’s unsupervised learning on a large corpus to then fine tune the model to recognize a specific entity type with very little labeled data.

Here are the sequence of steps to perform entity recognition with BERT

Labeled data acquisition/preparation. For entity types like location, person, disease etc. we can leverage off existing labeled data sets. However for a custom entity type, we can in some instances, get reasonable results with as little as 300 labeled examples. One advantage of using BERT model is that we can train it not just on sentences containing the entity of interest, but also train on instances of the entities themselves. For instance, for a person tagger, we can train the model on just the names of persons alone in addition to their mention in sentences. Since BERT composes words using subwords, we can leverage off the learning from single entity mentions to generalize to other entity instance that share the same subwords.
Input preparation to fine tune model. Since BERT composes words from subwords, it is best to minimally preprocess the input. The only preprocessing that is really needed is casing normalization. NER works better using case-sensitive BERT model. So we can get some additional boost by normalizing input to be cased appropriately. That is make the first letter of every noun in the sentence uppercase and force all others letter to be lowercase. This can be done by using a POS tagger to tag sentences and then use tags such as NN,NNP, JJ etc. to selectively uppercase first letters of words
Fine tune a BERT model for each entity type separately. For instance, if we are interested in tagging two entity types, say E1 and E2, we create two separate fine tuned models for them tagging entities with B, I, O tags. For example, if the entity types are person and disease, then we create two separate training/testing data sets each…

Solving NER with BERT for any entity type with very little training data (compared to past approaches)

Written by Ajit Rajasekharan

No responses yet