Tweeted By @DeepMind
Why does Stochastic Gradient Descent generalise well in deep networks? Our team shows that if the learning rate is small but finite, the mean iterate of random shuffling SGD stays close to the path of gradient flow, but on a modified loss landscape https://t.co/JUAzPujWfP pic.twitter.com/NtlbVaMfLC
— DeepMind (@DeepMind) February 3, 2021
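For context, a minimal sketch of the kind of modified loss landscape the linked paper describes, stated to leading order in the learning rate; the symbols below are our notation rather than anything given in the tweet: C is the full-batch training loss, Ĉ_k the loss on the k-th of m mini-batches in the epoch, ε the learning rate, and ω the parameters.

\[
\tilde{C}_{\mathrm{SGD}}(\omega) \;\approx\; C(\omega) \;+\; \frac{\epsilon}{4m} \sum_{k=1}^{m} \big\| \nabla \hat{C}_k(\omega) \big\|^2
\]

Under this reading, the mean iterate of random-shuffling SGD follows gradient flow on a loss penalised by the squared norms of the mini-batch gradients, which is one candidate explanation for the implicit regularisation the tweet alludes to.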