Thoughts and Theory

Figure 1. Quantitative evaluation of a pre-trained BERT model. The test quantitatively evaluates a pre-trained model’s (a) context-sensitive vectors, by the model’s ability to predict a masked position, and (b) [CLS] vector quality, by examining the [CLS] vector of the masked phrase. The clustering quality of the underlying vocabulary vectors, particularly the separation of entity types into clusters, plays an implicit role in this. The test uses a data set of triples (sentence with a masked phrase, masked phrase in the sentence, entity type of the masked phrase in the context of the sentence). Performance of the model on a sentence is determined by the entity type of the predictions for a masked position and of the [CLS] vector for the masked phrase. The entity type of the predictions for a masked position or of a [CLS] vector is determined by the clusters of context-independent vectors, whose quality is assessed qualitatively by the nature of the clusters (how well separated the entity types are). The quantitative test yields a confusion matrix and F1-scores for each entity type. Image created by Author.
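
As a rough illustration of the last step in the caption above, the sketch below computes a confusion matrix and per-entity-type F1-scores from parallel lists of gold and predicted entity types. The function name `confusion_and_f1` and the toy labels are assumptions made for this example; they are not taken from the article.

```python
from collections import Counter

def confusion_and_f1(gold, predicted):
    """Return (confusion counts, per-type F1) for two parallel label lists."""
    # confusion[(gold_type, predicted_type)] -> number of test sentences
    confusion = Counter(zip(gold, predicted))
    labels = sorted(set(gold) | set(predicted))
    f1 = {}
    for label in labels:
        tp = confusion[(label, label)]
        fp = sum(c for (g, p), c in confusion.items() if p == label and g != label)
        fn = sum(c for (g, p), c in confusion.items() if g == label and p != label)
        denom = 2 * tp + fp + fn
        f1[label] = 2 * tp / denom if denom else 0.0
    return confusion, f1

# Hypothetical toy data: entity type of the masked phrase (gold) versus
# the entity type inferred from the model's predictions.
gold = ["DRUG", "DRUG", "GENE", "DISEASE", "GENE"]
pred = ["DRUG", "GENE", "GENE", "DISEASE", "GENE"]

confusion, f1 = confusion_and_f1(gold, pred)
for label, score in f1.items():
    print(f"{label}: F1 = {score:.2f}")
```

Here one DRUG sentence is mistaken for GENE, which lowers recall for DRUG and precision for GENE while leaving DISEASE untouched.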


Self-supervised learning is being leveraged at scale using transformers, not only for text but lately also for images (CLIP, ALIGN), to solve traditionally supervised tasks (e.g. classification), either as is or with subsequent fine-tuning. …

Figure 1. A few reasons why BERT is a valuable addition to a practitioner’s toolbox, beyond its well-known use of fine-tuning for downstream tasks. (1) BERT’s learned vocabulary of vectors (in, say, 768-dimensional space) serves as the set of targets that masked output vectors predict, learning from prediction errors during training. After training, these moving targets settle into landmarks that can be clustered and annotated (a one-time step) and used for classifying model output vectors in a variety of tasks such as NER and relation extraction. (2) A model pre-trained enough to achieve a low next sentence prediction loss (in addition to the masked word prediction loss) yields quality CLS vectors representing any input term/phrase/sentence. The CLS vector needs to be harvested from the MLM head and not from the topmost layer to get the best possible representation of the input (figure below). (3) The MLM head decoder bias value is a useful score of the importance of a vocabulary term and is the equivalent of a TF-IDF score for vocabulary terms. (4) BERT’s capacity to predict, in most cases, the entity type of a word in a sentence indirectly, through vocabulary word alternatives for that position, can be quite handy in addition to its use for NER tagging. Occasionally the predictions for a position may even be the correct instance, but this is typically too unreliable to use directly in practice. (5) Vector representations for any input term/phrase (and their misspelled variants), either harvested directly from BERT’s learned vocabulary or created using CLS, to a large degree subsume the context-independent vectors of prior models like word2vec and Fasttext, making BERT a one-stop shop for harvesting both context-dependent and context-independent vector representations. The only exception is representations for input involving characters not present in BERT’s vocabulary (e.g. a custom BERT vocabulary carefully chosen to avoid characters from out-of-application-domain languages like Chinese, Tamil etc.). Central to harvesting the most from all these benefits is how well a model is pre-trained with a custom vocabulary on a domain-specific corpus of interest to our application. Image created by author.
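
To make item (1) of the caption above concrete, here is a minimal sketch of classifying a model output vector against annotated cluster landmarks using cosine similarity. The three-dimensional vectors, the cluster labels, and the helper names `cosine` and `classify` are all made up for illustration. A real setup would use centroids of clustered BERT vocabulary vectors (e.g. 768-dimensional), and the article's actual classification procedure may differ.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def classify(vector, centroids):
    """Return the label of the annotated cluster centroid closest to `vector`."""
    return max(centroids, key=lambda label: cosine(vector, centroids[label]))

# Hypothetical annotated landmarks (one centroid per entity-type cluster).
centroids = {
    "DRUG": [0.9, 0.1, 0.0],
    "GENE": [0.1, 0.9, 0.1],
    "DISEASE": [0.0, 0.2, 0.9],
}

# Hypothetical model output vector for a masked position.
output_vector = [0.8, 0.3, 0.1]
label = classify(output_vector, centroids)
print(label)
```

Because the clustering and annotation are a one-time step, the per-query cost is only a similarity lookup against the labeled landmarks.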


Natural language processing tasks that traditionally require labeled data could be solved entirely or in part, subject to a few constraints, without labeled data by leveraging the self-supervised learning of a BERT model, provided those tasks lend themselves to being viewed, entirely or in part, as a similarity…

Figure 1. A hybrid approach combining symbolic processing with distributed representations for unsupervised synonym harvesting (acronyms are also harvested during this process). The input is a domain-specific corpus of interest, and the output for each extracted synonym candidate sentence fragment is a synonym pair. The aggregate output across sentences is a set consisting of a pivot term and its synonym variations. An additional output is the superset family that a synonym set belongs to, if there is one. The steps are described in detail below. Image created by author.
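
The aggregation step mentioned in the caption above (per-sentence synonym pairs merged into sets around a pivot term) could be sketched as follows. This covers only the merging, not the extraction of pairs from text. The union-find grouping, the rule of picking the most frequently paired term as the pivot, and the name `build_synonym_sets` are assumptions for this example rather than the article's stated method.

```python
from collections import Counter, defaultdict

def build_synonym_sets(pairs):
    """Merge synonym pairs into sets; return {pivot_term: set_of_variants}."""
    parent = {}

    def find(term):
        # Union-find lookup with path halving.
        parent.setdefault(term, term)
        while parent[term] != term:
            parent[term] = parent[parent[term]]
            term = parent[term]
        return term

    for a, b in pairs:
        parent[find(a)] = find(b)

    groups = defaultdict(set)
    for term in list(parent):
        groups[find(term)].add(term)

    # Assumed pivot rule: the term that occurs in the most pairs
    # (ties broken by sort order).
    counts = Counter(term for pair in pairs for term in pair)
    result = {}
    for members in groups.values():
        pivot = max(sorted(members), key=lambda t: counts[t])
        result[pivot] = members - {pivot}
    return result

# Hypothetical pairs, as if extracted from separate sentence fragments.
pairs = [
    ("acetaminophen", "paracetamol"),
    ("acetaminophen", "APAP"),
    ("tumor necrosis factor", "TNF"),
]
synonym_sets = build_synonym_sets(pairs)
print(synonym_sets)
```

Pairs that share a term end up in the same set even when they come from different sentences, which is what lets an acronym and a full name found separately be linked.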


Extracting all the different ways a particular term can be referred to (synonym harvesting) is key for applications in the biomedical domain, where drugs, genes etc. have many synonyms. While there are human-curated knowledge bases for synonyms in the biomedical domain, they are generally incomplete, continually trying to play catch-up…

Figure 1. Training pathways to maximize BERT model performance. For application domains where people, location, organization etc. are the dominant entity types, training pathways 1a-1d would suffice. That is, we start with a publicly released BERT model (bert-base/large-cased/uncased, or the tiny bert versions) and optionally train it further (1c, continual pre-training) before fine-tuning it for a specific task (1d, a supervised task with labeled data). For a domain where person, location, organization etc. are not the dominant entity types, using the original BERT model for continual pre-training (1c) with a domain-specific corpus, followed by fine-tuning, may not boost performance as much as pathway 2a-2d, given that the vocabulary in the 1a-1d pathway is still the original BERT model vocabulary, with an entity bias towards people, location, organization etc. Pathway 2a-2d trains a BERT model from scratch using a vocabulary generated from the domain-specific corpus. Note: any form of model training (pre-training, continual pre-training or fine-tuning) modifies both the model weights and the vocabulary vectors; the different shades of the same color for the model (shades of beige) and for the vocabulary (shades of blue/green) across the training stages from left to right illustrate this. The box labeled with a “?” is the focus of this article: evaluating a pre-trained or continually pre-trained model to improve model performance. Image by Author.


Training a BERT model from scratch on a domain-specific corpus, such as the biomedical space, with a custom vocabulary generated specifically for that space has proven critical to maximizing model performance in the biomedical domain. This is largely because of language characteristics that are unique to the biomedical space which…

Figure 1. Unsupervised creation of sentence representation signatures for sentence similarity tasks. The illustration uses the BERT (bert-large-cased) model.


To date, models learn fixed-size representations of sentences, typically with some form of supervision, which are then used for sentence similarity or other downstream tasks. Examples of this are Google’s Universal Sentence Encoder (2018) and Sentence Transformers (2019). Supervised learning of fixed-size representations tends to outperform unsupervised creation…

Figure 1. Information replication strategies of Coronavirus. The Coronavirus information tape and code layout are shown in the left inset. The COVID-19 tape is ~30,000 letters long, encoding about 27 functions. Portions of these code segments are evolving faster than others, driven by selection pressure. These functions can be broadly classified as encoding structural components and nonstructural/accessory components. The main flow of entry and replication is illustrated in the labeled sequence: (1) The docking apparatus on the virus facilitates attachment to the host cell and injection of its information tape into the cell. (2) The host cell’s component construction machinery constructs the virus’s self-replication machine encoded in the virus tape, since it cannot distinguish between host code and virus code. (3) The virus’s self-replication machine then creates copies of the virus tape as well as code fragments encoding virus components such as the docking apparatus. (4) The host cell’s component construction machinery then assembles virus components. (5) Steps 2 and 3 happen inside compartments made of host cell material but whose assembly is initiated by the virus, enabling it to concentrate materials needed for replication as well as shielding it from defense mechanisms in the host cell. Step 4 happens separately, though the process is still poorly understood. Essentially, the virus enters the host cell with nothing but a linear information sequence and tricks the host cell into bootstrapping to life what is encoded in that sequence. Coronavirus illustration created at CDC.

Dr. Britt Glaunsinger (virologist) offers an in-depth biological perspective of COVID-19, in her recent video and talks. This post is a computational perspective largely based on the substance of her talks. A computational perspective of an evolving self-replicating linear sequence of data, particularly its manifestations in three dimensions both structurally…

Figure 1. Embeddings-driven fragment search used to answer specific questions (left panel) as well as broader questions (right panel). The text fragments highlighted in yellow are document matches to the search input, obtained using BERT embeddings. The right panel is a sample of animals with literature evidence for the presence of coronavirus; the font size is a qualitative measure of reference counts in the literature. Bats (in general, and Chinese horseshoe bats specifically) and birds have been mentioned as sources of coronavirus: bats as the gene source of alpha and beta coronaviruses, and birds as the gene source of gammacoronaviruses and deltacoronaviruses. Zoonotic transmission of coronavirus from civet cats and pangolins (betacoronavirus) has also been reported. All the information above was obtained automatically using machine learning models, without human curation. For the broad question in the right panel, a bootstrap list was created by searching for the term “animals” and clustering the results in the neighborhood of Word2vec embeddings. This list was then filtered for biological entity types using unsupervised NER with BERT, and the filtered list was used to create the final list of animals with literature evidence, captured in fragments as extractive summaries of the corresponding documents. The animal source of COVID-19 is not confirmed to date. Coronavirus illustration created at CDC.
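
The core of an embeddings-driven fragment search like the one in the caption above can be sketched as ranking stored fragment vectors by cosine similarity to the query vector. The vectors below are invented three-dimensional stand-ins for BERT embeddings, and `search` is a hypothetical helper. How the article actually builds and indexes its embeddings is not shown in this excerpt.

```python
import math

def normalize(v):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else list(v)

def search(query_vec, fragment_vecs, top_k=2):
    """Return the top_k fragments ranked by cosine similarity to the query."""
    q = normalize(query_vec)
    scored = []
    for fragment, vec in fragment_vecs.items():
        score = sum(a * b for a, b in zip(q, normalize(vec)))
        scored.append((score, fragment))
    scored.sort(reverse=True)
    return [fragment for _, fragment in scored[:top_k]]

# Hypothetical fragment embeddings (real ones would come from a model).
fragments = {
    "bats as the gene source of alpha and beta coronaviruses": [0.9, 0.2, 0.1],
    "birds as the gene source of gammacoronavirus": [0.7, 0.6, 0.1],
    "font size is a qualitative measure of reference counts": [0.0, 0.1, 0.9],
}

query = [1.0, 0.1, 0.0]  # hypothetical embedding of the user's input fragment
hits = search(query, fragments)
print(hits)
```

A linear scan like this is only practical for small collections; a larger corpus would typically sit behind an approximate nearest-neighbor index.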


Embeddings for sentence fragments harvested from a document can serve as extractive summary facets of that document and potentially accelerate its discovery, particularly when user input is a sentence fragment. These fragment embeddings not only yield better quality results than traditional text matching systems, but also circumvent a problem inherent…

Figure 1. Tagged sentence samples of unsupervised NER performed using BERT (bert-large-cased) with no fine-tuning. The examples highlight just a few of the entity types tagged by this approach. Tagging 500 sentences yielded about 1000 unique entity types, of which a select few were mapped to the synthetic labels shown above. The bert-large-cased model is unable to distinguish between GENE and PROTEIN because descriptors for these entities fall within the same tail of the predicted distributions for masked terms (they are not distinguishable in the base vocabulary either). Distinguishing closely related entities like these may require MLM fine-tuning on a domain-specific corpus or pre-training a model from scratch using a custom vocabulary (examined below).
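
One simple way to picture how predictions for a masked position can yield an entity type is a majority vote over the entity types of the predicted vocabulary terms. The sketch below assumes a hand-made `term_to_type` mapping and a list of predicted terms. Both are hypothetical, and the article's actual tagging procedure is not given in this excerpt.

```python
from collections import Counter

def entity_type_from_predictions(predicted_terms, term_to_type, default="OTHER"):
    """Pick the most common entity type among the predicted terms."""
    votes = Counter(term_to_type[t] for t in predicted_terms if t in term_to_type)
    return votes.most_common(1)[0][0] if votes else default

# Hypothetical mapping from vocabulary terms to synthetic labels. GENE and
# PROTEIN share one label here, mirroring the caption's observation that
# the base model cannot separate them.
term_to_type = {
    "gene": "GENE/PROTEIN",
    "protein": "GENE/PROTEIN",
    "drug": "DRUG",
    "city": "LOCATION",
}

# Hypothetical top predictions for a masked position in a sentence.
predictions = ["gene", "protein", "drug", "receptor"]
entity_type = entity_type_from_predictions(predictions, term_to_type)
print(entity_type)
```

Terms with no known type, such as "receptor" here, are simply ignored by the vote.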


In natural language processing, identifying entities of interest (NER) in a sentence, such as person, location, organization etc., requires labeled data. We need sentences labeled with entities of interest, where the labeling of each sentence is done either manually or by some automated method (often using heuristics to create a…

Progress in (slow) conscious task solving?

This figure is synthesized from recent talks by Yoshua Bengio (NeurIPS 2019 talk), Yann LeCun and Leon Bottou. In the figure, the acronym IID expands to Independent and Identically Distributed random variables; OOD expands to Out Of Distribution.


  1. Self-supervised learning — learning by predicting input
  2. Leverage power of compositionality in distributed representations
  3. Drop IID (Independent and Identically Distributed random variables) assumption
  4. Approaches for self-supervised representation learning
  5. Role of attention
  6. Lifelong Learning at multiple time scales
  7. Architecture priors

While Deep Learning (DL) models continued…

Ajit Rajasekharan

Machine learning practitioner
