Thank you very much.

1) Sorry I was not clear in that line. I will fix that - thank you for pointing it out.

When pre-training, the custom vocabulary was built from scratch using the Transformers library's implementation of WordPiece training for subword creation. Continual pre-training simply starts off with the vocabulary from pre-training. As you rightly pointed out, we could add tokens to the vocabulary during continual pre-training if we wanted to. I did not do that since I had already created a custom vocabulary from scratch during pre-training.
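To make the "add tokens during continual pre-training" option concrete, here is a minimal sketch in plain Python. It is not what I did, and the names (`extend_vocab`, the toy vocabulary, the `std=0.02` value) are assumptions for illustration only. The idea is that each new token gets the next free id and one new randomly initialized embedding row, while existing rows are left untouched. In the Transformers library this corresponds roughly to calling `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`.

```python
import random


def extend_vocab(vocab, embeddings, new_tokens, std=0.02, seed=0):
    """Add unseen tokens to `vocab` and append one random embedding row per token.

    `vocab` maps token -> id, `embeddings` is a list of rows (one per id).
    `std` is an illustrative choice, not a value taken from the discussion above.
    Returns the number of tokens actually added.
    """
    rng = random.Random(seed)
    dim = len(embeddings[0])
    added = 0
    for token in new_tokens:
        if token in vocab:
            continue  # already known: keep its id and its trained vector
        vocab[token] = len(vocab)  # next free id
        embeddings.append([rng.gauss(0.0, std) for _ in range(dim)])
        added += 1
    return added


# Toy vocabulary standing in for the one built during pre-training.
vocab = {"[PAD]": 0, "[UNK]": 1, "protein": 2, "##ase": 3}
embeddings = [[0.0] * 8 for _ in vocab]

num_added = extend_vocab(vocab, embeddings, ["kinase", "protein", "ligand"])
print(num_added, len(vocab), len(embeddings))
```

The new rows start out untrained, so they only become useful after enough continual pre-training steps on text that contains those tokens.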

2) Breaking down your question into two parts: "When we pre-train from scratch, what is the impact of random initialization on (a) vocab vectors and (b) model layer weights?"

(a) Vocab vectors are nearly orthogonal simply by virtue of random initialization, because of their high dimensionality. (b) I didn't even think of this point - how are the weights initialized in BERT? I don't know the answer. My guess is, as usual, random initialization too, but I will check the code to see if there is any specific way of doing it - thank you for asking this.
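The near-orthogonality claim in (a) is easy to check numerically. The sketch below is a small illustration, not taken from any BERT code: it draws random Gaussian vectors and reports the largest absolute pairwise cosine similarity. For independent random vectors of dimension d, the cosine is typically on the order of 1/sqrt(d), so it shrinks as the dimension grows. The helper names, the 20-vector sample, and the seed are my own assumptions; 768 is used as the high dimension because it is the hidden size of BERT-base.

```python
import math
import random


def random_vector(dim, rng):
    """Draw one vector with independent standard-normal components."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def max_abs_cosine(dim, num_vectors=20, seed=0):
    """Largest |cosine| over all pairs among `num_vectors` random vectors."""
    rng = random.Random(seed)
    vectors = [random_vector(dim, rng) for _ in range(num_vectors)]
    return max(
        abs(cosine(vectors[i], vectors[j]))
        for i in range(num_vectors)
        for j in range(i + 1, num_vectors)
    )


low_dim = max_abs_cosine(dim=4)     # few dimensions: some pairs are far from orthogonal
high_dim = max_abs_cosine(dim=768)  # BERT-base hidden size: all pairs close to orthogonal
print(f"max |cos| at d=4:   {low_dim:.3f}")
print(f"max |cos| at d=768: {high_dim:.3f}")
```

Cosine similarity does not depend on the scale of the vectors, so the standard deviation used for the initialization does not change this picture.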

Regarding NSP loss, it is only implemented in BERT's original code release. I don't see it in the Transformers implementation - this could be because of evidence that NSP loss is not necessary and is even considered detrimental (I can't remember the reference) - hence models like RoBERTa don't even have NSP loss.

Machine learning practitioner