What happens when you remove most of BERT's heads? Answer: surprisingly little! Check out @pmichelX's new preprint on pruning heads from multi-head attention models, with interesting analysis and inference-time speed gains on BERT-based models! https://t.co/XtnUoP9stc
— Graham Neubig (@gneubig) May 28, 2019
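
To make the idea of "pruning heads" concrete, here is a minimal sketch of masking attention heads at inference time. This is not the preprint's implementation; the `MultiHeadSelfAttention` class and the `head_mask` argument are illustrative assumptions, written in plain PyTorch, just to show that zeroing a head's output removes its contribution while leaving the rest of the layer intact.

```python
# Illustrative sketch of head masking in multi-head self-attention.
# Assumption: a per-head 0/1 mask multiplied into each head's output
# before the output projection, mimicking what "pruning a head" does.
import torch
import torch.nn as nn


class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor, head_mask: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); head_mask: (n_heads,) of 0/1 values
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        def split(z: torch.Tensor) -> torch.Tensor:
            # reshape to (batch, heads, seq, d_head)
            return z.view(b, t, self.n_heads, self.d_head).transpose(1, 2)

        q, k, v = split(q), split(k), split(v)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        ctx = attn @ v  # (batch, heads, seq, d_head)
        # zero out the output of pruned heads before the output projection
        ctx = ctx * head_mask.view(1, -1, 1, 1)
        ctx = ctx.transpose(1, 2).reshape(b, t, self.n_heads * self.d_head)
        return self.out(ctx)


if __name__ == "__main__":
    layer = MultiHeadSelfAttention(d_model=768, n_heads=12)
    x = torch.randn(2, 16, 768)
    mask = torch.ones(12)
    mask[3:] = 0.0  # "prune" all but the first three heads
    print(layer(x, mask).shape)  # torch.Size([2, 16, 768])
```

In an actual pruned model the masked heads' parameters can be dropped entirely, which is where the inference-time savings come from; the mask above only simulates that effect.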