Tweeted By @Smerity

on 2019-09-19 (UTC)
thought tip


- Your model runs will have the exact same perplexity spikes (hits confusing data at the same time)
- You can compare timestamp / batch results in early training as a pseudo-estimate of convergence
- Improved gradient flow visibly helps the same init do better
— Smerity (@Smerity) September 19, 2019
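
The points above read as a description of training runs that share a random seed. That is an assumption here, since the tweet as captured does not say so explicitly. A minimal sketch of the idea in plain Python follows: a toy one-weight SGD loop whose init, data, and data order all come from one seeded generator. The function name `train` and its parameters are hypothetical, chosen only for illustration.

```python
import random


def train(seed, lr=0.05, steps=50):
    """Toy SGD on y = 3x + noise; returns the loss recorded at each step."""
    rng = random.Random(seed)      # one seed fixes init, data, and data order
    w = rng.uniform(-1.0, 1.0)     # weight init drawn from the seeded generator

    # Build the training stream up front so its order depends only on the seed.
    data = []
    for _ in range(steps):
        x = rng.uniform(-1.0, 1.0)
        y = 3.0 * x + rng.gauss(0.0, 0.1)
        data.append((x, y))

    losses = []
    for x, y in data:
        err = w * x - y
        losses.append(err * err)   # loss is recorded before the update
        w -= lr * 2.0 * err * x    # gradient step for squared error
    return losses


baseline = train(seed=0)
repeat = train(seed=0)             # same seed: identical loss curve
variant = train(seed=0, lr=0.1)    # same seed, one change: same data at each step
```

With the seed held fixed, `baseline` and `variant` see the same example at every step, so their losses can be compared step by step early in training, and any hard example shows up at the same index in both curves.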
