by srush_nlp on 2020-04-02 (UTC).

Another point of reference from @Smerity https://t.co/H9KcMKcdKd

— Sasha Rush (@srush_nlp) April 2, 2020
nlp research
by srush_nlp on 2020-04-03 (UTC).

Winning response. Props to the author for responding.

I will accept other submissions if others are motivated to find a different solution. https://t.co/Hft21inqhL

Very interesting explanation for why this is so difficult, and why it should arguably not be used in the future. pic.twitter.com/IyvtzVNkDH

— Sasha Rush (@srush_nlp) April 3, 2020
nlp research
by ylecun on 2020-04-03 (UTC).

The Transformer-XL results from Google Brain on language modeling could not be reproduced by some top NLP researchers (and the authors are not helping). @srush_nlp offers a bounty for whoever can reproduce the results.
(I assume the authors are excluded from the challenge!). https://t.co/ssnMjSVxdd

— Yann LeCun (@ylecun) April 3, 2020
nlp research
by Tim_Dettmers on 2020-04-08 (UTC).

How can you successfully train transformers on small datasets like PTB and WikiText-2? Are LSTMs better on small datasets? I ran 339 experiments worth 568 GPU hours and came up with some answers. I do not have time to write a blog post, so here is a Twitter thread instead. 1/n

— Tim Dettmers (@Tim_Dettmers) April 8, 2020
research
by Tim_Dettmers on 2020-04-08 (UTC).

The key insight is the following: In the small dataset regime, it is all about dataset augmentation. The analog in computer vision is that you get much better results, particularly on small datasets, if you do certain dataset augmentations. This also regularizes the model.

— Tim Dettmers (@Tim_Dettmers) April 8, 2020
research nlp
by Tim_Dettmers on 2020-04-08 (UTC).

The most dramatic performance gain comes from discrete embedding dropout: You embed as usual, but now with a probability p you zero the entire word vector. This is akin to masked language modeling but the goal is not to predict the mask — just regular LM with uncertain context.

— Tim Dettmers (@Tim_Dettmers) April 8, 2020
research tip nlp
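A minimal PyTorch sketch of the discrete embedding dropout described above (module and parameter names are illustrative, not from the thread): during training, each token's entire embedding vector is zeroed with probability p, and the surviving vectors are rescaled by 1/(1-p).

import torch
import torch.nn as nn

class TokenEmbeddingDropout(nn.Module):
    """Zero whole word vectors with probability p (illustrative sketch)."""

    def __init__(self, vocab_size, dim, p=0.1):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, dim)
        self.p = p

    def forward(self, token_ids):
        emb = self.embedding(token_ids)          # (batch, seq, dim)
        if self.training and self.p > 0:
            # One Bernoulli draw per token; the mask covers the whole vector,
            # so a dropped word contributes no information at all.
            keep = (torch.rand(emb.shape[:-1], device=emb.device) >= self.p).float()
            emb = emb * keep.unsqueeze(-1) / (1.0 - self.p)
        return emb

Unlike masked language modeling, no prediction target is attached to the dropped positions; the language model simply has to cope with missing context.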
by Tim_Dettmers on 2020-04-08 (UTC).

The second most important factor is regular input dropout: You take the embeddings and drop out elements with probability p. This also has a data augmentation effect very similar to dropping out random pixels for images. What is a good way to think about this? 1/2

— Tim Dettmers (@Tim_Dettmers) April 8, 2020
nlp research tip
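A short sketch of the regular input dropout mentioned above (sizes and names are illustrative): standard element-wise dropout applied to the embedding output, which zeroes individual coordinates of each word vector, much like dropping random pixels in an image.

import torch
import torch.nn as nn

vocab_size, dim = 10000, 512          # illustrative sizes
embedding = nn.Embedding(vocab_size, dim)
input_dropout = nn.Dropout(p=0.1)     # zeroes individual embedding elements

tokens = torch.randint(0, vocab_size, (8, 128))   # (batch, seq)
x = input_dropout(embedding(tokens))  # noisy embeddings fed to the LSTM/transformer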
