Another point of reference from @Smerity https://t.co/H9KcMKcdKd
— Sasha Rush (@srush_nlp) April 2, 2020
Winning response. Props to the author for responding.
— Sasha Rush (@srush_nlp) April 3, 2020
I will accept other submissions if others are motivated to find a different solution. https://t.co/Hft21inqhL
Very interesting explanation for why this is so difficult, and why it should arguably not be used in the future. pic.twitter.com/IyvtzVNkDH
The Transformer-XL results from Google Brain on language modeling could not be reproduced by some top NLP researchers (and the authors are not helping).@srush_nlp offers a bounty for whoever can reproduce the results.
— Yann LeCun (@ylecun) April 3, 2020
(I assume the authors are excluded from the challenge!). https://t.co/ssnMjSVxdd
How can you successfully train transformers on small datasets like PTB and WikiText-2? Are LSTMs better on small datasets? I ran 339 experiments worth 568 GPU hours and came up with some answers. I do not have time to write a blog post, so here is a Twitter thread instead. 1/n
— Tim Dettmers (@Tim_Dettmers) April 8, 2020
The key insight is the following: In the small dataset regime, it is all about dataset augmentation. The analog in computer vision is that you get much better results, particularly on small datasets, if you do certain dataset augmentations. This also regularizes the model.
— Tim Dettmers (@Tim_Dettmers) April 8, 2020
The most dramatic performance gain comes from discrete embedding dropout: You embed as usual, but now with a probability p you zero the entire word vector. This is akin to masked language modeling but the goal is not to predict the mask — just regular LM with uncertain context.
— Tim Dettmers (@Tim_Dettmers) April 8, 2020
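A minimal NumPy sketch of discrete embedding dropout as described above: with probability p, the entire word vector for a token is zeroed, so the model must predict from an uncertain context. The function name, tensor shapes, and the inverted-dropout rescaling are my assumptions, not code from the thread.

```python
import numpy as np

def discrete_embedding_dropout(embedded, p=0.2, rng=None, training=True):
    """Zero out entire word vectors with probability p (illustrative sketch).

    embedded: array of shape (batch, seq_len, dim) — already-embedded tokens.
    One Bernoulli draw per token position is broadcast over the embedding
    dimension, so a token is either fully kept or fully dropped.
    """
    if not training or p == 0.0:
        return embedded
    if rng is None:
        rng = np.random.default_rng()
    # One keep/drop decision per (batch, position), not per element
    keep = rng.random(embedded.shape[:2]) >= p
    # Rescale survivors so the expected activation magnitude is unchanged,
    # as in standard inverted dropout
    return embedded * keep[..., None] / (1.0 - p)
```

The broadcast over the last axis is what makes this "discrete": a whole token disappears at once, which resembles masking in masked language modeling, except the training objective stays plain language modeling.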
The second most important factor is regular input dropout: You take the embeddings and dropout elements with probability p. This also has a data augmentation effect very similar to dropping out random pixels for images. What is a good way to think about this? 1/2
— Tim Dettmers (@Tim_Dettmers) April 8, 2020
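For contrast, regular input dropout as described in the tweet above operates element-wise on the embedding tensor, dropping individual dimensions independently, much like dropping random pixels in an image. Again a sketch under my own naming and shape assumptions:

```python
import numpy as np

def input_dropout(embedded, p=0.1, rng=None, training=True):
    """Element-wise dropout on the embedding tensor (illustrative sketch).

    Unlike discrete embedding dropout, each element of each word vector is
    zeroed independently with probability p, so a token's vector is
    partially corrupted rather than removed entirely.
    """
    if not training or p == 0.0:
        return embedded
    if rng is None:
        rng = np.random.default_rng()
    # Independent keep/drop decision for every element of every vector
    keep = rng.random(embedded.shape) >= p
    # Inverted-dropout rescaling keeps the expected magnitude constant
    return embedded * keep / (1.0 - p)
```

The two functions differ only in the shape of the Bernoulli mask: per-token for the discrete variant, per-element here, which is what gives each its distinct data-augmentation effect.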