Very glad to hear you enjoyed reading it. The histogram is computed as follows. For each of the 1000 sampled terms, bucket its cosine similarities to all vocabulary terms, rounding each cosine value to 2 decimal places. This yields a dictionary with roughly 120 key-value pairs, where each key is a rounded cosine value and each value is the number of neighbors with that cosine value; the values across those ~120 keys sum to the vocabulary size. We do this for all 1000 terms and aggregate the counts per cosine key, keeping track of how many times each bucket was added to. Then we average each bucket to get the mean count for that cosine bucket across the 1000 terms. In essence, each curve is a plot of the average number of neighbors at a particular cosine value across all 1000 sampled terms. Hope this clarifies. The code is in the function gen_dist_for_vocabs(self) in dist_v2.py in the GitHub repository (git clone https://github.com/ajitrajasekharan/bert_vector_clustering.git) in case you want to look at it directly. Please let me know if this does not answer your question.
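In case it helps, here is a minimal sketch of that bucket-and-average procedure. This is not the code from dist_v2.py, just an illustration: the function name cosine_histograms and its arguments are made up for this example, and it assumes the embeddings are rows of a NumPy array.

```python
import numpy as np
from collections import defaultdict

def cosine_histograms(vectors, sample_indices):
    """For each sampled term, bucket its cosine similarities to every
    vocabulary term into 2-decimal-place bins, then average each
    bucket's count over the number of times that bucket was filled."""
    # Normalize rows so plain dot products become cosine similarities.
    normed = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    bucket_sums = defaultdict(float)
    bucket_hits = defaultdict(int)
    for i in sample_indices:
        cosines = normed @ normed[i]            # cosine to every vocab term
        buckets = np.round(cosines, 2)          # 2-decimal-place bucketing
        keys, counts = np.unique(buckets, return_counts=True)
        # counts for one term sum to the vocabulary size
        for key, count in zip(keys, counts):
            bucket_sums[key] += count
            bucket_hits[key] += 1
    # Average each bucket only over the terms that contributed to it.
    return {key: bucket_sums[key] / bucket_hits[key] for key in bucket_sums}
```

Each returned key is a rounded cosine value and each value is the average neighbor count for that bucket, which is what gets plotted as one curve.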

--

Machine learning practitioner
