One key issue, which some newer architecture variants are addressing, is the quadratic scaling of attention: for a sentence with n tokens, computing each token's representation depends on all n tokens (including itself), so the cost grows as n^2. The other is not so much a disadvantage as a deficiency: despite the remarkable success of transformers within and across modalities (text, image, audio, etc.), getting closer to the kind of general intelligence that biological life has may require an external memory beyond what is baked into the model's parameter weights. Such memory would be essential for a world model. While some think scaling language models in size will get us to general intelligence, that seems unlikely despite the remarkable emergent capabilities produced by pure scaling.
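The quadratic cost is easy to see in a minimal sketch of scaled dot-product self-attention (written here in NumPy; the function name and shapes are illustrative, not from any particular library): the score matrix Q·Kᵀ has one entry per token pair, so its size is n × n.

```python
import numpy as np

def naive_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over n tokens.

    The score matrix Q @ K.T has shape (n, n), so both the memory
    and the compute for this step grow quadratically with n.
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n, n) -- the quadratic term
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # (n, d_v)

# Doubling the sequence length quadruples the score-matrix size:
rng = np.random.default_rng(0)
d = 8
for n in (64, 128):
    X = rng.normal(size=(n, d))
    W = [rng.normal(size=(d, d)) for _ in range(3)]
    out = naive_attention(X, *W)
    print(n, out.shape, n * n)  # the score matrix holds n*n entries
```

Sub-quadratic variants (sparse, linearized, or recurrent attention) all work by avoiding materializing that full n × n matrix.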


Machine learning practitioner
