Tweeted By @srush_nlp
Understanding the Difficulty of Training Transformers (https://t.co/lN5JccNOlz, @LiyuanLucas)
— Sasha Rush (@srush_nlp) November 11, 2020
Studies instability of transformer. Dives into the dark arts of NN stability, impact of layer norm / residuals. Isolates residual paths as a main cause. Trains 60 layers transformers. pic.twitter.com/SChDwCo4xa