What happens when you remove most of BERT's heads? Answer: surprisingly little! Check out @pmichelX's new preprint on pruning heads from multi-head attention models, with interesting analysis and inference-time speed gains on BERT-based models! https://t.co/XtnUoP9stc
— Graham Neubig (@gneubig) May 28, 2019
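
To make the idea of "pruning heads" concrete, here is a minimal sketch of masking attention heads at inference time. This is not the preprint's implementation; the `MultiHeadSelfAttention` class and the `head_mask` argument are illustrative assumptions, written in plain PyTorch, just to show that zeroing a head's output removes its contribution while leaving the rest of the layer intact.

```python
# Illustrative sketch of head masking in multi-head self-attention.
# Assumption: a per-head 0/1 mask multiplied into each head's output
# before the output projection, mimicking what "pruning a head" does.
import torch
import torch.nn as nn


class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor, head_mask: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); head_mask: (n_heads,) of 0/1 values
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        def split(z: torch.Tensor) -> torch.Tensor:
            # reshape to (batch, heads, seq, d_head)
            return z.view(b, t, self.n_heads, self.d_head).transpose(1, 2)

        q, k, v = split(q), split(k), split(v)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        ctx = attn @ v  # (batch, heads, seq, d_head)
        # zero out the output of pruned heads before the output projection
        ctx = ctx * head_mask.view(1, -1, 1, 1)
        ctx = ctx.transpose(1, 2).reshape(b, t, self.n_heads * self.d_head)
        return self.out(ctx)


if __name__ == "__main__":
    layer = MultiHeadSelfAttention(d_model=768, n_heads=12)
    x = torch.randn(2, 16, 768)
    mask = torch.ones(12)
    mask[3:] = 0.0  # "prune" all but the first three heads
    print(layer(x, mask).shape)  # torch.Size([2, 16, 768])
```

In an actual pruned model the masked heads' parameters can be dropped entirely, which is where the inference-time savings come from; the mask above only simulates that effect.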