Tweeted By @hardmaru
It turns out that vanilla optimizers such as Nesterov momentum and Adam work just fine for large batch sizes.
— hardmaru (@hardmaru) April 12, 2021
Paper by @zacharynado, @jmgilmer, Chris Shallue, @_arohan_ and George Dahl conducted extensive ablations training vision and language models. https://t.co/KU8pkgAPfn https://t.co/u23KigmXTb