by Thom_Wolf on 2018-10-15 (UTC).

I've spent most of 2018 training models that could barely fit 1-4 samples/GPU.
But SGD usually needs more than a few samples per batch for decent results.
I wrote a post gathering practical tips I use, from simple tricks to multi-GPU code & distributed setups: https://t.co/oLe6JlxcVw pic.twitter.com/pQTXQ9X7Ug

— Thomas Wolf (@Thom_Wolf) October 15, 2018
research learning survey
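The simplest trick the post gathers is gradient accumulation: run several small forward/backward passes, letting gradients sum up, and only step the optimizer once the desired effective batch size has been reached. A minimal PyTorch sketch with a toy model and random data (the model, sizes, and step counts are illustrative, not taken from the post):

    import torch
    from torch import nn

    # Toy setup: a micro-batch of 4 samples fits in memory, but we want an
    # effective batch of 32 for SGD to behave well.
    model = nn.Linear(32, 2)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = nn.CrossEntropyLoss()
    micro_batches = [(torch.randn(4, 32), torch.randint(0, 2, (4,)))
                     for _ in range(16)]

    accumulation_steps = 8  # effective batch size = 4 samples * 8 steps = 32

    optimizer.zero_grad()
    for step, (inputs, targets) in enumerate(micro_batches):
        loss = loss_fn(model(inputs), targets)
        # Scale the loss so the accumulated gradient averages over the
        # effective batch rather than summing over it.
        (loss / accumulation_steps).backward()
        if (step + 1) % accumulation_steps == 0:
            optimizer.step()       # one weight update per 8 micro-batches
            optimizer.zero_grad()  # clear gradients for the next window

Because gradients accumulate in the parameters' .grad buffers rather than in stored activations, the memory cost stays at the micro-batch level while the optimizer sees the larger effective batch.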
by jeremyphoward on 2018-10-15 (UTC).

These are extremely important techniques that I haven't seen written up elsewhere before.

Many people still think batch size is limited by GPU RAM, but that's not true. https://t.co/1WqizgTSJP

— Jeremy Howard (@jeremyphoward) October 15, 2018
misc
by karpathy on 2018-10-16 (UTC).

good post & links! Touches on gradient accumulation, gradient checkpointing (no, not the normal checkpointing), the nearly unambiguous superiority of distributed data parallel container in PyTorch, and the overall importance of understanding what's under the hood. https://t.co/2WYZRz9a2X

— Andrej Karpathy (@karpathy) October 16, 2018
misc
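The "gradient checkpointing" singled out here is not weight checkpointing to disk: selected activations are dropped during the forward pass and recomputed during backward, trading extra compute for memory. A minimal sketch using torch.utils.checkpoint on a toy sequential model (the layer sizes and segment count are illustrative, not from the thread):

    import torch
    from torch import nn
    from torch.utils.checkpoint import checkpoint_sequential

    # Toy 8-layer model split into 2 checkpointed segments: activations inside
    # the segments are freed after forward and recomputed during backward.
    model = nn.Sequential(*[nn.Linear(128, 128) for _ in range(8)])
    inputs = torch.randn(4, 128, requires_grad=True)

    out = checkpoint_sequential(model, 2, inputs)  # (functions, segments, input)
    out.sum().backward()

The "distributed data parallel container" is torch.nn.parallel.DistributedDataParallel, which runs one process per GPU and overlaps gradient all-reduce with the backward pass, which is why it generally outperforms nn.DataParallel.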
