Tweeted by @gneubig:
The experiments here were really informative to me; I now understand well why (and when) knowledge distillation works in sequence generation: it creates "easy-to-learn" data with more lexical consistency and less reordering. Also: weak child models prefer weaker teacher models. https://t.co/k9P3py0ewB
— Graham Neubig (@gneubig) December 27, 2019
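For readers unfamiliar with the setup the tweet refers to, the sketch below illustrates sequence-level knowledge distillation for machine translation in its generic form: the teacher re-translates the training sources, and the student is then trained on those teacher outputs instead of the original references. This is a minimal illustration, not the paper's exact pipeline; the teacher checkpoint, example sentences, and decoding settings are assumptions chosen for the example.

```python
# Minimal sketch of sequence-level knowledge distillation data creation.
# Assumptions: a MarianMT teacher from Hugging Face and toy example
# sentences; a real setup would decode the full training set and then
# train a (possibly weaker, e.g. non-autoregressive) student on it.
from transformers import MarianMTModel, MarianTokenizer

teacher_name = "Helsinki-NLP/opus-mt-de-en"  # illustrative teacher checkpoint
tokenizer = MarianTokenizer.from_pretrained(teacher_name)
teacher = MarianMTModel.from_pretrained(teacher_name)

# Original training sources (their human references get replaced below).
train_sources = [
    "Das ist ein Beispiel.",
    "Wissensdestillation macht die Daten leichter lernbar.",
]

# Step 1: the teacher decodes the training sources with beam search.
batch = tokenizer(train_sources, return_tensors="pt", padding=True)
teacher_outputs = teacher.generate(**batch, num_beams=5, max_length=128)
distilled_targets = tokenizer.batch_decode(teacher_outputs, skip_special_tokens=True)

# Step 2: pair each source with the teacher's output. These distilled
# targets tend to be more lexically consistent and less reordered than
# human references, which is the "easy-to-learn" effect the tweet describes.
distilled_corpus = list(zip(train_sources, distilled_targets))
for src, tgt in distilled_corpus:
    print(src, "->", tgt)
```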