good post & links! Touches on gradient accumulation, gradient checkpointing (no, not the normal checkpointing), the nearly unambiguous superiority of the distributed data parallel container in PyTorch, and the overall importance of understanding what's under the hood. https://t.co/2WYZRz9a2X
— Andrej Karpathy (@karpathy) October 16, 2018
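The techniques the tweet names map to concrete PyTorch idioms. Below is a minimal sketch of gradient accumulation, assuming a toy linear model, random stand-in data, and a hypothetical accumulation factor of 4 (none of these come from the tweet itself); gradient checkpointing is separately available via torch.utils.checkpoint, and multi-GPU training via torch.nn.parallel.DistributedDataParallel.

```python
import torch
import torch.nn as nn

# Hypothetical model, optimizer, and hyperparameters for illustration only.
model = nn.Linear(128, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()
accum_steps = 4  # effective batch size = per-step batch size * accum_steps

optimizer.zero_grad()
for step in range(100):
    x = torch.randn(8, 128)            # stand-in mini-batch
    y = torch.randint(0, 10, (8,))
    loss = loss_fn(model(x), y)
    # Scale the loss so the accumulated gradient matches one large-batch step.
    (loss / accum_steps).backward()    # .backward() adds into .grad buffers
    if (step + 1) % accum_steps == 0:
        optimizer.step()               # apply the accumulated gradient
        optimizer.zero_grad()          # reset before the next accumulation window
```

The point of the scaling line is that backward passes sum gradients into `.grad`, so dividing each loss by `accum_steps` makes four small-batch backward passes equivalent (in expectation) to a single backward pass over a batch four times larger, trading extra compute time for lower peak memory.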