One key issue, which some newer architecture variants are addressing, is the quadratic scaling of attention: for a sentence with n tokens, computing each token's representation depends on all n tokens (including itself), so the cost grows as n^2. The other is not so much a disadvantage as a deficiency: despite the remarkable success of transformers within and across modalities (text, image, audio, etc.), getting closer to the kind of general intelligence that biological life has may require an external memory beyond what is baked into the model's parameter weights. Such memory would be essential for a world model. While some think scaling language models in size will get us to general intelligence, that seems unlikely despite the remarkable emergent capabilities produced by pure scaling.
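The quadratic cost is easy to see in a minimal sketch of scaled dot-product self-attention (written here in NumPy; the function name and shapes are illustrative, not from any particular library): the score matrix Q·Kᵀ has one entry per token pair, so its size is n × n.

```python
import numpy as np

def naive_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over n tokens.

    The score matrix Q @ K.T has shape (n, n), so both the memory
    and the compute for this step grow quadratically with n.
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n, n) -- the quadratic term
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # (n, d_v)

# Doubling the sequence length quadruples the score-matrix size:
rng = np.random.default_rng(0)
d = 8
for n in (64, 128):
    X = rng.normal(size=(n, d))
    W = [rng.normal(size=(d, d)) for _ in range(3)]
    out = naive_attention(X, *W)
    print(n, out.shape, n * n)  # the score matrix holds n*n entries
```

Sub-quadratic variants (sparse, linearized, or recurrent attention) all work by avoiding materializing that full n × n matrix.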


Machine learning practitioner
