Tweeted By @Smerity
- Your model runs will have the exact same perplexity spikes (hits confusing data at the same time)
- You can compare timestamp / batch results in early training as a pseudo-estimate of convergence
- Improved gradient flow visibly helps the same init do better
— Smerity (@Smerity) September 19, 2019
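
A minimal sketch of the behaviour these bullets describe, assuming the setup is a fixed random seed so that every run shares the same initialization and data order (the tweet excerpt here does not say so explicitly). The toy model, the `toy_run` and `spike_step` names, the planted outlier, and the use of learning rate as a stand-in for a "variant" are all hypothetical illustrations, not anything from the original thread.

```python
import random


def toy_run(seed, lr, n_steps=20):
    """Fit a single scalar weight with SGD on a toy stream; return per-step losses."""
    rng = random.Random(seed)  # one seeded generator drives data, order, and init
    data = [rng.uniform(-1.0, 1.0) for _ in range(n_steps)]
    data[0] = 50.0             # planted "confusing" example (hypothetical outlier)
    rng.shuffle(data)          # data order depends only on the seed
    w = rng.uniform(-1.0, 1.0) # initialization depends only on the seed
    losses = []
    for x in data:
        loss = (w - x) ** 2    # squared error, recorded before the update
        losses.append(loss)
        w -= lr * 2.0 * (w - x)  # plain gradient step
    return losses


def spike_step(losses):
    """Index of the step with the largest loss."""
    return max(range(len(losses)), key=losses.__getitem__)


if __name__ == "__main__":
    baseline = toy_run(seed=0, lr=0.1)
    variant = toy_run(seed=0, lr=0.2)  # same seed, different training setup
    print("spike at step", spike_step(baseline), "and", spike_step(variant))
    print("early losses:", baseline[:3], variant[:3])
```

With the seed held fixed, repeated runs are identical, the loss spike lands on the same step for both variants, and any difference in the early losses comes from the change being tested rather than from a luckier initialization or data order.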