Solving NER with BERT for any entity type with very little training data (compared to past approaches)

One of the roadblocks to entity recognition for any entity type other than person, location, organization, disease, gene, drugs, and species is the absence of labeled training data.

BERT offers a practical solution for recognizing a custom entity type with very little labeled data. In some cases, about 300 labeled examples may suffice to get a first-cut working solution.

This is possible because we can leverage the model’s unsupervised learning on a large corpus and then fine-tune the model to recognize a specific entity type with very little labeled data.

Here is the sequence of steps to perform entity recognition with BERT:

  • Labeled data acquisition/preparation. For entity types like location, person, and disease, we can leverage existing labeled data sets. For a custom entity type, we can in some instances get reasonable results with as few as 300 labeled examples. One advantage of using a BERT model is that we can train it not only on sentences containing the entity of interest but also on instances of the entities themselves. For instance, for a person tagger, we can train the model on bare person names in addition to their mentions in sentences. Since BERT composes words from subwords, what the model learns from single entity mentions can generalize to other entity instances that share the same subwords.
  • Input preparation to fine-tune the model. Since BERT composes words from subwords, it is best to preprocess the input minimally. The only preprocessing that is really needed is casing normalization. NER works better with a case-sensitive (cased) BERT model, so we can get an additional boost by normalizing the input so that it is cased appropriately: make the first letter of every noun uppercase and force all other letters to lowercase. This can be done by running a POS tagger over the sentences and then using tags such as NN, NNP, JJ, etc. to selectively uppercase the first letters of words.
  • Fine-tune a BERT model for each entity type separately. For instance, if we are interested in tagging two entity types, say E1 and E2, we create two separate fine-tuned models, each tagging entities with B, I, O tags. For example, if the entity types are person and disease, we create two separate training/testing data sets, each tagged with the three tags. The labeled data set for person would then consist of sentences in which each token carries one of these tags: B for the first token of a person mention, I for the remaining tokens of that mention, and O for every other token.
  • This approach of fine-tuning a separate model for each entity type offers some advantages:
      • It allows us to tag the same term with different entity types. For instance, a term could have more than one entity type assigned to it.
      • It is often easier to generate labeled data for one entity type than to generate sentences labeled with more than one entity type.
      • We can use different base models to fine-tune. For instance, to recognize persons we might fine-tune the base BERT model, whereas to recognize diseases we might fine-tune a model like SciBERT, BioBERT, or our own BERT model trained on a domain-specific corpus. In practice, this gives better results than fine-tuning a single model for all entity types.
  • The approach outlined above is not limited to entities that are noun phrases. We can also use it to tag quantities or numbers of a specific kind that interest us, including ranges: for instance, stock prices in sentences or dosages in clinical notes. Again, the fact that about 300 labeled examples can be enough in some instances lets us get started and then add training data as we continue to procure it.
  • Here are some models and tools that we can use to get started:
      • Base BERT models to fine-tune: BERT, SciBERT, BioBERT. These are PyTorch-based models, which are much easier to work with in practice than TensorFlow-based models.
      • POS tagger for noun phrase tagging.
      • BERT-NER fine-tuning scripts. These may require some minor tweaking to restrict them to just the B, I, O tags.
  • Fine-tuning and testing time numbers:
      • Fine-tuning on about 10,000 labeled examples could take 30–50 minutes on a single-GPU machine.
      • Testing/deployment on CPU machines could take in the range of 200–500 ms on average. We can bring this down with some optimizations. Offline tagging is best done in batches.
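To make the B, I, O labeling from the fine-tuning step concrete, here is a minimal sketch of how labeled rows might be built for a person tagger. The function name `bio_tags`, the span format (token indices, end exclusive), and the example sentence are all assumptions made for illustration, not part of any particular fine-tuning script. It also shows that a bare entity mention can be labeled the same way as a full sentence.

```python
from typing import List, Tuple


def bio_tags(tokens: List[str], entity_spans: List[Tuple[int, int]]) -> List[str]:
    """Label tokens with B/I/O given entity spans.

    Each span is (start, end) in token indices, end exclusive (an assumed format).
    """
    tags = ["O"] * len(tokens)          # everything is outside an entity by default
    for start, end in entity_spans:
        tags[start] = "B"               # first token of the mention
        for i in range(start + 1, end):
            tags[i] = "I"               # remaining tokens of the mention
    return tags


# A mention inside a sentence (hypothetical example).
sentence = "Yesterday John Smith flew to Paris".split()
print(list(zip(sentence, bio_tags(sentence, [(1, 3)]))))

# A bare entity mention, usable as an extra training row.
name_only = "John Smith".split()
print(list(zip(name_only, bio_tags(name_only, [(0, 2)]))))
```

For a disease tagger, the same sentences would be labeled again with a separate set of spans, giving one data set per entity type.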
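The casing normalization described in the input-preparation step could look roughly like the sketch below. It assumes the sentence has already been run through some POS tagger that emits Penn Treebank tags; the tagger itself is not shown, and the exact set of tags to capitalize (`CAPITALIZE_TAGS`) is an assumption extrapolated from the "NN, NNP, JJ etc." mentioned above.

```python
from typing import List, Tuple

# Penn Treebank tags whose words get an uppercase first letter.
# This set is an assumption; adjust it to the tagger and domain in use.
CAPITALIZE_TAGS = {"NN", "NNS", "NNP", "NNPS", "JJ"}


def normalize_casing(tagged: List[Tuple[str, str]]) -> List[str]:
    """Capitalize words with selected POS tags and lowercase all other letters."""
    normalized = []
    for word, tag in tagged:
        if tag in CAPITALIZE_TAGS:
            # Uppercase the first letter, lowercase the rest of the word.
            normalized.append(word[:1].upper() + word[1:].lower())
        else:
            normalized.append(word.lower())
    return normalized


# Hand-written (word, tag) pairs standing in for real POS tagger output.
tagged_sentence = [("JOHN", "NNP"), ("Has", "VBZ"), ("chronic", "JJ"), ("ASTHMA", "NN")]
print(" ".join(normalize_casing(tagged_sentence)))
```

The same normalization would need to be applied both to the fine-tuning data and to text at tagging time so that the model sees consistently cased input.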
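Because each entity type has its own model, the per-model B, I, O outputs have to be combined afterwards, and this is where one term can end up with more than one entity type. The sketch below is one possible way to do that, not a prescribed method: `bio_to_spans` and `merge_entity_types` are hypothetical helper names, and the tag lists are hand-written stand-ins for real model output.

```python
from typing import Dict, List, Tuple


def bio_to_spans(tags: List[str]) -> List[Tuple[int, int]]:
    """Convert B/I/O tags to (start, end) token spans, end exclusive."""
    spans = []
    start = None
    for i, tag in enumerate(tags):
        # A stray I with no open span is treated as a span start (a design choice).
        if tag == "B" or (tag == "I" and start is None):
            if start is not None:
                spans.append((start, i))    # close the previous span
            start = i
        elif tag == "O" and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(tags)))    # span runs to the end of the sentence
    return spans


def merge_entity_types(tokens: List[str],
                       tags_by_type: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Map each tagged phrase to the entity types whose model tagged it."""
    merged: Dict[str, List[str]] = {}
    for entity_type, tags in tags_by_type.items():
        for start, end in bio_to_spans(tags):
            phrase = " ".join(tokens[start:end])
            merged.setdefault(phrase, []).append(entity_type)
    return merged


tokens = "Alzheimer described the disease".split()
outputs = {
    "person": ["B", "O", "O", "O"],   # stand-in for the person model's output
    "disease": ["B", "O", "O", "O"],  # stand-in for the disease model's output
}
print(merge_entity_types(tokens, outputs))
```

How to resolve a term that receives several types (keep all of them, or pick one by model confidence) is left open here, since it depends on the application.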

Originally published at

Machine learning practitioner
