Tweeted By @Thom_Wolf
I've spent most of 2018 training models that could barely fit 1-4 samples/GPU.
But SGD usually needs more than a few samples/batch for decent results.
I wrote a post gathering practical tips I use, from simple tricks to multi-GPU code & distributed setups: https://t.co/oLe6JlxcVw pic.twitter.com/pQTXQ9X7Ug
— Thomas Wolf (@Thom_Wolf) October 15, 2018