How can you successfully train transformers on small datasets like PTB and WikiText-2? Are LSTMs better on small datasets? I ran 339 experiments worth 568 GPU hours and came up with some answers. I do not have time to write a blog post, so here's a Twitter thread instead. 1/n
— Tim Dettmers (@Tim_Dettmers) April 8, 2020