This paper has a very clear presentation of different attention architectures in transformers. I’d be thankful if people could share their experience in trying multi-query vs standard multi-head attention. Thanks https://t.co/aY1AW5etWI
— Nando de Freitas 🏳️🌈 (@NandoDF) March 13, 2022
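For context on the comparison the tweet raises, here is a minimal sketch of the structural difference between the two variants. This is not code from the linked paper; the class names, hyperparameters, and the omission of masking, dropout, and a KV cache are all simplifications for illustration. The only change from standard multi-head attention to multi-query attention is that the key and value projections produce a single head that is shared (broadcast) across all query heads.

```python
import math
import torch
import torch.nn as nn


class MultiHeadAttention(nn.Module):
    """Standard multi-head attention: every head has its own K and V projection."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)  # n_heads * d_head keys
        self.v_proj = nn.Linear(d_model, d_model)  # n_heads * d_head values
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        # (b, t, d_model) -> (b, n_heads, t, d_head)
        q = self.q_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        att = (q @ k.transpose(-2, -1)) / math.sqrt(self.d_head)
        out = att.softmax(dim=-1) @ v  # (b, n_heads, t, d_head)
        return self.out_proj(out.transpose(1, 2).reshape(b, t, -1))


class MultiQueryAttention(nn.Module):
    """Multi-query attention: all query heads share a single K/V head, which
    shrinks the K/V projections and the inference-time KV cache."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, self.d_head)  # one shared key head
        self.v_proj = nn.Linear(d_model, self.d_head)  # one shared value head
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        # keep a head dimension of size 1 so K/V broadcast across query heads
        k = self.k_proj(x).view(b, t, 1, self.d_head).transpose(1, 2)
        v = self.v_proj(x).view(b, t, 1, self.d_head).transpose(1, 2)
        att = (q @ k.transpose(-2, -1)) / math.sqrt(self.d_head)  # (b, n_heads, t, t)
        out = att.softmax(dim=-1) @ v  # (b, n_heads, t, d_head)
        return self.out_proj(out.transpose(1, 2).reshape(b, t, -1))


if __name__ == "__main__":
    x = torch.randn(2, 16, 512)
    print(MultiHeadAttention(512, 8)(x).shape)   # torch.Size([2, 16, 512])
    print(MultiQueryAttention(512, 8)(x).shape)  # torch.Size([2, 16, 512])
```

Both modules produce outputs of the same shape; the practical difference shows up in parameter count for the K/V projections and, during autoregressive decoding, in how much key/value state has to be cached per token.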