by chipro on 2019-05-17 (UTC).

SOTA for PTB without extra data is 46.54 perplexity (Transformer-XL: 54.5). On Papers with Code, all top models on WikiText-103 & the 1 Billion Word benchmark are Transformers, and all top models on small datasets are LSTMs. Could just be hyperparameters, but could also be something else. https://t.co/Vtms96ScKd

— Chip Huyen (@chipro) May 17, 2019
nlp research
by jeremyphoward on 2019-05-17 (UTC).

AWD-LSTM benefits from all the work done on regularization by @Smerity. Not sure there's the same richness of regularization available just yet for transformer architectures? It's particularly important for small datasets.

— Jeremy Howard (@jeremyphoward) May 17, 2019
nlp thought
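
A minimal sketch of the kind of regularization the tweet above credits to @Smerity's AWD-LSTM work: DropConnect applied to the recurrent hidden-to-hidden weights (the "weight-dropped" LSTM), with one mask shared across all timesteps. This is an illustrative PyTorch approximation, not the original awd-lstm-lm code; the class name, initialization, and dropout rate are assumptions, and the full recipe also uses embedding dropout, variational dropout, and NT-ASGD.

import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightDropLSTM(nn.Module):
    """One-layer LSTM with DropConnect on the hidden-to-hidden weights."""
    def __init__(self, input_size, hidden_size, weight_p=0.5):
        super().__init__()
        self.hidden_size = hidden_size
        self.weight_p = weight_p  # illustrative default; AWD-LSTM tunes this per dataset
        self.w_ih = nn.Parameter(torch.empty(4 * hidden_size, input_size))
        self.w_hh = nn.Parameter(torch.empty(4 * hidden_size, hidden_size))
        self.bias = nn.Parameter(torch.zeros(4 * hidden_size))
        for w in (self.w_ih, self.w_hh):
            nn.init.uniform_(w, -0.1, 0.1)

    def forward(self, x, state=None):
        # x: (batch, seq_len, input_size)
        batch, seq_len, _ = x.shape
        if state is None:
            h = x.new_zeros(batch, self.hidden_size)
            c = x.new_zeros(batch, self.hidden_size)
        else:
            h, c = state
        # DropConnect: sample one dropout mask over the recurrent weight matrix itself
        # (not the activations) and reuse it for every timestep of the sequence.
        w_hh = F.dropout(self.w_hh, p=self.weight_p, training=self.training)
        outputs = []
        for t in range(seq_len):
            gates = x[:, t] @ self.w_ih.t() + h @ w_hh.t() + self.bias
            i, f, g, o = gates.chunk(4, dim=-1)
            c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
            h = torch.sigmoid(o) * torch.tanh(c)
            outputs.append(h)
        return torch.stack(outputs, dim=1), (h, c)

# Example: a batch of 8 sequences, 35 timesteps, 100-dim inputs.
# out, (h, c) = WeightDropLSTM(100, 256)(torch.randn(8, 35, 100))
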
by m__dehghani on 2019-05-17 (UTC).

The "recurrent inductive bias" of RNNs usually helps them be more data efficient, compared to vanilla Transformer. If you introduce such a bias to Transformers (like recurrence in depth in Universal Transformers), they generalize better on small datasets: https://t.co/gWzKXz8xRU

— Mostafa Dehghani (@m__dehghani) May 17, 2019
nlp research
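
To make the "recurrence in depth" idea concrete: a Universal Transformer applies one shared block repeatedly over depth instead of stacking independently parameterized layers, which is the recurrent inductive bias the tweet refers to. The sketch below is a simplified illustration using PyTorch's stock nn.TransformerEncoderLayer; the module name, step count, and sizes are assumptions, and it leaves out the paper's per-step timing signals and adaptive computation time (ACT) halting.

import torch
import torch.nn as nn

class DepthRecurrentEncoder(nn.Module):
    """Applies the same Transformer layer repeatedly: recurrence over depth."""
    def __init__(self, d_model=256, nhead=4, num_steps=6):
        super().__init__()
        # One shared layer, so the parameter count does not grow with depth.
        self.shared_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead,
            dim_feedforward=4 * d_model, batch_first=True)
        self.num_steps = num_steps

    def forward(self, x):
        # The *same* weights process the representation num_steps times, unlike a
        # vanilla encoder where each of the 6 layers gets its own parameters
        # (e.g. nn.TransformerEncoder(layer, num_layers=6), which deep-copies the layer).
        for _ in range(self.num_steps):
            x = self.shared_layer(x)
        return x

# Example: y = DepthRecurrentEncoder()(torch.randn(8, 128, 256))
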
