The most dramatic performance gain comes from discrete embedding dropout: You embed as usual, but now with a probability p you zero the entire word vector. This is akin to masked language modeling but the goal is not to predict the mask — just regular LM with uncertain context.
— Tim Dettmers (@Tim_Dettmers) April 8, 2020
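The tweet itself contains no code, but a minimal PyTorch sketch of the idea might look like the following. Everything here is an assumption rather than part of the original: the class name `EmbeddingDropout`, the per-token masking (one Bernoulli draw per position, rather than zeroing all occurrences of a word type), and the usual inverted-dropout rescaling by 1/(1 - p).

```python
import torch
import torch.nn as nn

class EmbeddingDropout(nn.Module):
    """Sketch of discrete embedding dropout (names and details assumed).

    With probability p, the *entire* embedding vector at a token
    position is zeroed, instead of dropping individual components
    as standard dropout would.
    """
    def __init__(self, num_embeddings, embedding_dim, p=0.1):
        super().__init__()
        self.embed = nn.Embedding(num_embeddings, embedding_dim)
        self.p = p

    def forward(self, tokens):
        x = self.embed(tokens)  # (batch, seq, dim)
        if self.training and self.p > 0:
            # One keep/drop decision per token position, broadcast
            # over the embedding dimension.
            keep = torch.rand(x.shape[:-1], device=x.device) >= self.p
            # Inverted-dropout rescaling (an assumption; the tweet
            # only says the vector is zeroed).
            x = x * keep.unsqueeze(-1).to(x.dtype) / (1 - self.p)
        return x

# Usage: during training, some positions see a zero vector, so the
# model must predict the next token from uncertain context.
emb = EmbeddingDropout(num_embeddings=10000, embedding_dim=512, p=0.1)
tokens = torch.randint(0, 10000, (4, 128))
vectors = emb(tokens)
```

An alternative reading of "zero the entire word vector" is per-vocabulary-type dropout, where a row of the embedding matrix is zeroed so every occurrence of that word is masked for the batch; which variant the tweet intends is not specified.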