Now that we know it's possible to achieve comparable results to BERT using only 66M parameters, can someone find a way to train a 66M param model from scratch instead of distilling? https://t.co/ycJjMwSwsr
— Chip Huyen (@chipro) August 28, 2019