by Thom_Wolf on 2018-10-15 (UTC).

I've spent most of 2018 training models that could barely fit 1-4 samples/GPU.
But SGD usually needs more than a few samples per batch for decent results.
I wrote a post gathering practical tips I use, from simple tricks to multi-GPU code & distributed setups: https://t.co/oLe6JlxcVw pic.twitter.com/pQTXQ9X7Ug

— Thomas Wolf (@Thom_Wolf) October 15, 2018
research learning survey
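The simplest trick the post gathers is gradient accumulation: run several small forward/backward passes, letting gradients sum up, and only step the optimizer once the desired effective batch size has been reached. A minimal PyTorch sketch with a toy model and random data (the model, sizes, and step counts are illustrative, not taken from the post):

    import torch
    from torch import nn

    # Toy setup: a micro-batch of 4 samples fits in memory, but we want an
    # effective batch of 32 for SGD to behave well.
    model = nn.Linear(32, 2)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = nn.CrossEntropyLoss()
    micro_batches = [(torch.randn(4, 32), torch.randint(0, 2, (4,)))
                     for _ in range(16)]

    accumulation_steps = 8  # effective batch size = 4 samples * 8 steps = 32

    optimizer.zero_grad()
    for step, (inputs, targets) in enumerate(micro_batches):
        loss = loss_fn(model(inputs), targets)
        # Scale the loss so the accumulated gradient averages over the
        # effective batch rather than summing over it.
        (loss / accumulation_steps).backward()
        if (step + 1) % accumulation_steps == 0:
            optimizer.step()       # one weight update per 8 micro-batches
            optimizer.zero_grad()  # clear gradients for the next window

Because gradients accumulate in the parameters' .grad buffers rather than in stored activations, the memory cost stays at the micro-batch level while the optimizer sees the larger effective batch.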
by jeremyphoward on 2018-10-15 (UTC).

These are extremely important techniques that I haven't seen written up elsewhere before.

Many people still think batch size is limited by GPU RAM, but that's not true. https://t.co/1WqizgTSJP

— Jeremy Howard (@jeremyphoward) October 15, 2018
misc
by karpathy on 2018-10-16 (UTC).

good post & links! Touches on gradient accumulation, gradient checkpointing (no, not the normal checkpointing), the nearly unambiguous superiority of distributed data parallel container in PyTorch, and the overall importance of understanding what's under the hood. https://t.co/2WYZRz9a2X

— Andrej Karpathy (@karpathy) October 16, 2018
misc
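The "gradient checkpointing" singled out here is not weight checkpointing to disk: selected activations are dropped during the forward pass and recomputed during backward, trading extra compute for memory. A minimal sketch using torch.utils.checkpoint on a toy sequential model (the layer sizes and segment count are illustrative, not from the thread):

    import torch
    from torch import nn
    from torch.utils.checkpoint import checkpoint_sequential

    # Toy 8-layer model split into 2 checkpointed segments: activations inside
    # the segments are freed after forward and recomputed during backward.
    model = nn.Sequential(*[nn.Linear(128, 128) for _ in range(8)])
    inputs = torch.randn(4, 128, requires_grad=True)

    out = checkpoint_sequential(model, 2, inputs)  # (functions, segments, input)
    out.sum().backward()

The "distributed data parallel container" is torch.nn.parallel.DistributedDataParallel, which runs one process per GPU and overlaps gradient all-reduce with the backward pass, which is why it generally outperforms nn.DataParallel.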
