Interestingly, the hyperparameters seem to equilibrate over a shorter timescale than the weights, allowing us to learn a schedule. E.g., start with low dropout, then crank it up once the network starts overfitting. Works better than any fixed value! pic.twitter.com/Mw3mp7ph3f
— Roger Grosse (@RogerGrosse) March 8, 2019
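The tweet describes a *learned* dropout schedule (hyperparameters adapting on a faster timescale than the weights). As a rough illustration of the shape of such a schedule, here is a minimal hand-crafted sketch: dropout stays low during early training, then ramps up once overfitting becomes a risk. The function name, parameters, and the linear-ramp form are all assumptions for illustration, not the method from the tweet.

```python
def dropout_schedule(epoch, total_epochs, p_min=0.0, p_max=0.5, warmup_frac=0.3):
    """Hypothetical fixed schedule mimicking the pattern described in the tweet:
    low dropout early, cranked up later in training.

    Note: the tweet refers to a schedule *learned* by hyperparameter
    optimization, not this hand-tuned linear ramp.
    """
    warmup = warmup_frac * total_epochs
    if epoch < warmup:
        # early training: little regularization needed yet
        return p_min
    # ramp linearly from p_min to p_max over the remaining epochs
    frac = (epoch - warmup) / max(total_epochs - warmup, 1)
    return p_min + (p_max - p_min) * min(frac, 1.0)
```

In a training loop, the returned probability would be assigned to the model's dropout layers (e.g. a `torch.nn.Dropout` module's `p` attribute) at the start of each epoch.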