Thoughts and Theory

A prerequisite to using a pre-trained model as is, without fine-tuning

Figure 1. Quantitative evaluation of a pre-trained BERT model. The test quantitatively evaluates (a) the quality of a pre-trained model's context-sensitive vectors through the model's ability to predict a masked position, and (b) the quality of the [CLS] vector by examining the vector for the masked phrase. The clustering quality of the underlying vocabulary vectors, particularly the separation of entity types into clusters, plays an implicit role in this. The test uses a dataset of triples: a sentence with a masked phrase, the masked phrase itself, and the entity type of the masked phrase in the context of the sentence. Performance on a sentence is determined by the entity type of the predictions for the masked position and by the [CLS] vector for the masked phrase. The entity type of the predictions for a masked position, or of a [CLS] vector, is determined by the clusters of context-independent vectors, whose quality is judged qualitatively by how well separated the entity types are across clusters. The quantitative test yields a confusion matrix and an F1-score for each entity type. Image created by Author.
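The scoring at the end of the test above reduces to computing per-entity-type F1 from (gold type, predicted type) pairs, i.e. from the cells of the confusion matrix. A minimal stdlib-only sketch of that scoring step (the function name and input shape are illustrative, not taken from the article's code):

```python
from collections import defaultdict

def per_type_f1(pairs):
    """Compute per-entity-type F1 from (gold_type, predicted_type)
    pairs, i.e. a flattened confusion matrix over the test triples."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for gold, pred in pairs:
        if gold == pred:
            tp[gold] += 1          # diagonal cell of the confusion matrix
        else:
            fp[pred] += 1          # off-diagonal: counted against the prediction
            fn[gold] += 1          # ...and as a miss for the gold type
    scores = {}
    for t in set(tp) | set(fp) | set(fn):
        p = tp[t] / (tp[t] + fp[t]) if tp[t] + fp[t] else 0.0
        r = tp[t] / (tp[t] + fn[t]) if tp[t] + fn[t] else 0.0
        scores[t] = 2 * p * r / (p + r) if p + r else 0.0
    return scores
```

For example, two correct GENE predictions, one correct DRUG prediction, and one DRUG instance mispredicted as GENE give GENE an F1 of 0.8 and DRUG an F1 of about 0.67.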


BERT is a prize addition to the practitioner’s toolbox

Figure 1. A few reasons why BERT is a valuable addition to a practitioner's toolbox, beyond its well-known use of fine-tuning for downstream tasks. (1) BERT's learned vocabulary of vectors (in, say, 768-dimensional space) serves as the set of targets that masked output vectors predict, learning from prediction errors during training. After training, these moving targets settle into landmarks that can be clustered and annotated (a one-time step) and then used to classify model output vectors in a variety of tasks, such as NER and relation extraction. (2) A model pre-trained enough to achieve a low next-sentence-prediction loss (in addition to a low masked-word-prediction loss) yields quality [CLS] vectors representing any input term/phrase/sentence. The [CLS] vector needs to be harvested from the MLM head, not from the topmost layer, to get the best possible representation of the input (figure below). (3) The MLM head decoder bias value is a useful score of the importance of a vocabulary term: the equivalent of a TF-IDF score for vocabulary terms. (4) BERT's capacity to predict, in most cases, the entity type of a word in a sentence, indirectly through vocabulary word alternatives for that position, can be quite handy in addition to its use for NER tagging. Occasionally the predictions for a position may even include the correct instance, but this is typically too unreliable for direct practical use. (5) Vector representations for any input term/phrase (and their misspelled variants), either harvested directly from BERT's learned vocabulary or created using [CLS], to a large degree subsume the context-independent vectors of prior models like word2vec and fastText, making BERT a one-stop shop for harvesting both context-dependent and context-independent vector representations. The only exception is input containing characters absent from BERT's vocabulary (e.g., a custom BERT vocabulary carefully chosen to avoid characters from languages outside the application domain, such as Chinese or Tamil). Central to harvesting the most of all these benefits is how well a model is pre-trained with a custom vocabulary on a domain-specific corpus of interest to our application. Image created by author
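Point (1) above, classifying a model output vector against annotated vocabulary landmarks, amounts to a nearest-centroid lookup by cosine similarity. A minimal illustrative sketch (toy 2-d vectors stand in for 768-dimensional BERT vectors; the function names are hypothetical):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def entity_type(output_vec, landmarks):
    """landmarks maps an entity type to the centroid of an annotated
    vocabulary cluster; return the type of the closest centroid."""
    return max(landmarks, key=lambda t: cosine(output_vec, landmarks[t]))
```

With landmarks {"GENE": [1, 0], "DRUG": [0, 1]}, an output vector of [0.9, 0.1] is classified as GENE; the same lookup works unchanged for [CLS] vectors harvested from the MLM head.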


A hybrid approach combining symbolic processing with distributed representations

Figure 1. A hybrid approach combining symbolic processing with distributed representations for unsupervised synonym harvesting (acronyms are also harvested during this process). The input is a domain-specific corpus of interest, and the output for each extracted synonym-candidate sentence fragment is a synonym pair. The aggregate output across sentences is a set with a pivot term and its synonym variations. An additional output is the superset family a synonym set belongs to, if any. The steps are described in detail below. Image created by author
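The aggregation step above, merging per-sentence synonym pairs into sets keyed by a pivot term, can be sketched as computing connected components over the pairs. This is an illustrative stdlib-only union-find, not the article's implementation; choosing the lexicographically smallest member as pivot is a stand-in for whatever pivot-selection rule the pipeline uses:

```python
def synonym_sets(pairs):
    """Merge (term, variant) synonym pairs into connected sets,
    returning {pivot: members} with the smallest member as pivot."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)           # union the two components

    groups = {}
    for term in parent:
        groups.setdefault(find(term), set()).add(term)
    return {min(g): g for g in groups.values()}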


An approach to evaluating a pre-trained BERT model to improve performance

Figure 1. Training pathways to maximize BERT model performance. For application domains where people, locations, organizations, etc. are the dominant entity types, training pathways 1a-1d would suffice. That is, we start off with a publicly released BERT model (bert-base/large, cased/uncased, or the tiny BERT versions) and optionally train it further (1c, continual pre-training) before fine-tuning it for a specific task (1d, a supervised task with labeled data). For a domain where people, locations, organizations, etc. are not the dominant entity types, using the original BERT model for continual pre-training (1c) with a domain-specific corpus, followed by fine-tuning, may not boost performance as much as pathway 2a-2d, given that the vocabulary in the 1a-1d pathway is still the original BERT vocabulary, with an entity bias towards people, locations, organizations, etc. Pathway 2a-2d trains a BERT model from scratch using a vocabulary generated from the domain-specific corpus. Note: any form of model training (pre-training, continual pre-training, or fine-tuning) modifies both the model weights and the vocabulary vectors; the different shades of the same color for the model (shades of beige) and the vocabulary (shades of blue/green) across the training stages, from left to right, illustrate this fact. The box labeled with a "?" is the focus of this article: evaluating a pre-trained or continually pre-trained model to improve model performance. Image by Author
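The first step of pathway 2a, generating a vocabulary from the domain-specific corpus, is in practice done with a subword algorithm such as WordPiece or BPE. A deliberately simplified frequency-based sketch of the principle (illustrative only; the function name and the whole-word simplification are assumptions, not the article's tooling):

```python
import re
from collections import Counter

def domain_vocab(corpus, size):
    """Toy frequency-based vocabulary builder: the vocabulary is
    derived from the domain corpus rather than inherited from the
    original BERT release. Real pipelines use WordPiece/BPE subwords."""
    counts = Counter(
        w for doc in corpus for w in re.findall(r"[a-z0-9']+", doc.lower())
    )
    return [w for w, _ in counts.most_common(size)]
```

On a biomedical corpus, domain terms like drug names dominate the resulting vocabulary, which is exactly the entity-bias shift that distinguishes pathway 2a-2d from 1a-1d.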


  • The biomedical space has a lot of terms or phrases unique to that domain, e.g., names of drugs, diseases, genes, etc. These terms…

For sentence similarity/document search applications

Figure 1. Unsupervised creation of sentence representation signatures for sentence similarity tasks. The illustration uses a BERT (bert-large-cased) model.


Viruses (COVID-19) — from a computational perspective

Figure 1. Information replication strategies of coronavirus. The coronavirus information tape and code layout are shown in the left inset. The COVID-19 tape is ~30,000 letters long, encoding about 27 functions. Portions of these code segments are evolving faster than others, driven by selection pressure. These functions can be broadly classified as encoding structural components and nonstructural/accessory components. The main flow of entry and replication is illustrated in the labeled sequence: (1) The docking apparatus on the virus facilitates attachment to the host cell and injection of its information tape into the cell. (2) The host cell's component-construction machinery constructs the virus's self-replication machine encoded in the virus tape, since it cannot distinguish between host code and virus code. (3) The virus's self-replication machine then creates copies of the virus tape as well as code fragments encoding virus components such as the docking apparatus. (4) The host cell's component-construction machinery then assembles the virus components. (5) Steps 2 and 3 happen inside compartments made up of host cell material whose assembly is initiated by the virus, enabling it to concentrate materials needed for replication and shielding it from defense mechanisms in the host cell. Step 4 happens separately, though the process is still poorly understood. Coronavirus illustration created at CDC. Essentially, the virus enters the host cell with nothing but a linear information sequence and tricks the host cell into bootstrapping to life what is encoded in that sequence.

COVID-19 questions — a use case for improving sentence fragment search

Figure 1. Illustrates embeddings-driven fragment search used to answer specific questions (left panel) as well as broader questions (right panel). The text fragments highlighted in yellow are document matches to the search input, obtained using BERT embeddings. The right panel is a sample of animals with literature evidence for the presence of coronavirus; the font size is a qualitative measure of reference counts in the literature. Bats (in general, and Chinese horseshoe bats specifically) and birds have been mentioned as sources of coronavirus: bats as the gene source of alpha- and betacoronaviruses, and birds as the gene source of gamma- and deltacoronaviruses. Zoonotic transmission of coronavirus from civet cats and pangolins (betacoronavirus) has also been reported. All the information above was obtained automatically using machine learning models, without human curation. For the broad question in the right panel, a bootstrap list was created by searching for the term "animals" and clustering results in the neighborhood of its word2vec embedding. This list was then filtered for biological entity types using unsupervised NER with BERT, and the result was used to create the final list of animals with literature evidence captured in fragments as extractive summaries of the corresponding documents. The animal source of COVID-19 has not been confirmed to date. Coronavirus illustration created at CDC
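At its core, the fragment search above ranks candidate fragments by the similarity of their embeddings to the query embedding. A minimal illustrative sketch with toy 2-d vectors standing in for BERT embeddings (the function and variable names are hypothetical):

```python
import math

def top_fragments(query_vec, fragment_vecs, k=2):
    """Rank fragments by cosine similarity of their embedding to the
    query embedding and return the k best matches."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.sqrt(sum(a * a for a in u)) *
                      math.sqrt(sum(b * b for b in v)))

    ranked = sorted(fragment_vecs.items(),
                    key=lambda kv: cos(query_vec, kv[1]), reverse=True)
    return [frag for frag, _ in ranked[:k]]
```

In the real pipeline the vectors would come from a BERT model and the top-ranked fragments would be highlighted as the extractive answers shown in yellow.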


Figure 1. Illustrates tagged sentence samples of unsupervised NER performed using BERT (bert-large-cased) with no fine-tuning. The examples highlight just a few of the entity types tagged by this approach. Tagging 500 sentences yielded about 1,000 unique entity types, of which a select few were mapped to the synthetic labels shown above. The bert-large-cased model is unable to distinguish between GENE and PROTEIN because the descriptors for these entities fall within the same tail of the predicted distributions for masked terms (they are not distinguishable in the base vocabulary either). Distinguishing closely related entities like these may require MLM fine-tuning on a domain-specific corpus or pre-training a model from scratch using a custom vocabulary (examined below).
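The tagging step described above can be reduced to a vote: map the model's top predictions for a masked position to entity types via the one-time cluster annotation of the vocabulary, then take the majority type. An illustrative stdlib-only sketch (the term-to-type map and function name are hypothetical placeholders for the annotated clusters):

```python
from collections import Counter

def entity_type_from_predictions(predicted_terms, term_to_type):
    """Vote over the entity types of the vocabulary terms predicted
    for a masked position; terms outside the annotation abstain."""
    votes = Counter(term_to_type[t] for t in predicted_terms
                    if t in term_to_type)
    return votes.most_common(1)[0][0] if votes else "UNKNOWN"
```

Since GENE and PROTEIN descriptors fall in the same tail of the predicted distribution, this vote cannot separate them with the base model, which is the limitation the caption points out.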


This figure is synthesized from recent talks by Yoshua Bengio (NeurIPS 2019), Yann LeCun, and Leon Bottou. The acronym IID in the figure expands to Independent and Identically Distributed (random variables); OOD expands to Out Of Distribution.


  1. Self-supervised learning — learning by predicting input
  2. Leverage power of compositionality in distributed representations
  3. Drop the IID (Independent and Identically Distributed random variables) assumption
  4. Approaches for self-supervised representation learning
  5. Role of attention
  6. Lifelong Learning at multiple time scales
  7. Architecture priors

Deep Learning 1.0 — a quick recap of limitations

Ajit Rajasekharan

Machine learning practitioner
