Tweeted by @gneubig:
The experiments here were really informative to me; I now understand well why (and when) knowledge distillation works in sequence generation: it creates "easy-to-learn" data with more lexical consistency and less reordering. Also: weak child models prefer weaker teacher models. https://t.co/k9P3py0ewB
— Graham Neubig (@gneubig) December 27, 2019
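For readers unfamiliar with the setup the tweet refers to, the sketch below illustrates sequence-level knowledge distillation for machine translation in its generic form: the teacher re-translates the training sources, and the student is then trained on those teacher outputs instead of the original references. This is a minimal illustration, not the paper's exact pipeline; the teacher checkpoint, example sentences, and decoding settings are assumptions chosen for the example.

```python
# Minimal sketch of sequence-level knowledge distillation data creation.
# Assumptions: a MarianMT teacher from Hugging Face and toy example
# sentences; a real setup would decode the full training set and then
# train a (possibly weaker, e.g. non-autoregressive) student on it.
from transformers import MarianMTModel, MarianTokenizer

teacher_name = "Helsinki-NLP/opus-mt-de-en"  # illustrative teacher checkpoint
tokenizer = MarianTokenizer.from_pretrained(teacher_name)
teacher = MarianMTModel.from_pretrained(teacher_name)

# Original training sources (their human references get replaced below).
train_sources = [
    "Das ist ein Beispiel.",
    "Wissensdestillation macht die Daten leichter lernbar.",
]

# Step 1: the teacher decodes the training sources with beam search.
batch = tokenizer(train_sources, return_tensors="pt", padding=True)
teacher_outputs = teacher.generate(**batch, num_beams=5, max_length=128)
distilled_targets = tokenizer.batch_decode(teacher_outputs, skip_special_tokens=True)

# Step 2: pair each source with the teacher's output. These distilled
# targets tend to be more lexically consistent and less reordered than
# human references, which is the "easy-to-learn" effect the tweet describes.
distilled_corpus = list(zip(train_sources, distilled_targets))
for src, tgt in distilled_corpus:
    print(src, "->", tgt)
```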