Tweeted By @NandoDF
This paper has a very clear presentation of different attention architectures in transformers. I'd be thankful if people could share their experience in trying multi-query vs standard multi-head attention. Thanks https://t.co/aY1AW5etWI
— Nando de Freitas 🏳️‍🌈 (@NandoDF) March 13, 2022
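For readers unfamiliar with the distinction the tweet raises, here is a minimal NumPy sketch of it (an illustration of the general technique, not code from the linked paper): standard multi-head attention gives every head its own key and value projections, while multi-query attention keeps per-head queries but shares a single key head and a single value head across all query heads, shrinking the K/V cache by a factor of the head count.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # q, k, v: (seq_len, d_head); scaled dot-product attention.
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

rng = np.random.default_rng(0)
T, d_model, n_heads = 4, 16, 4          # toy sizes, chosen for illustration
d_head = d_model // n_heads
x = rng.standard_normal((T, d_model))

# Standard multi-head attention: per-head Q, K, and V projections.
Wq = rng.standard_normal((n_heads, d_model, d_head))
Wk = rng.standard_normal((n_heads, d_model, d_head))
Wv = rng.standard_normal((n_heads, d_model, d_head))
mha_out = np.concatenate(
    [attention(x @ Wq[h], x @ Wk[h], x @ Wv[h]) for h in range(n_heads)],
    axis=-1,
)

# Multi-query attention: per-head Q, but one shared K and one shared V head.
Wk_shared = rng.standard_normal((d_model, d_head))
Wv_shared = rng.standard_normal((d_model, d_head))
k_shared, v_shared = x @ Wk_shared, x @ Wv_shared
mqa_out = np.concatenate(
    [attention(x @ Wq[h], k_shared, v_shared) for h in range(n_heads)],
    axis=-1,
)

# Same output shape either way; MQA caches n_heads x fewer K/V tensors.
print(mha_out.shape, mqa_out.shape)
```

The output shapes match, so multi-query attention is a drop-in change at the layer interface; the trade-off discussed in the literature is decoding speed and memory (one K/V head to cache) against some loss of per-head expressiveness.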