Another comment on the GPT-2 data: the WMT 2019 training data this year for English-German consists of 28GB of English and 58GB(!!!) of German plain text news data with document boundaries. So, similar to @OpenAI Webtext, news-domain but bilingual: https://t.co/EHOD3ZvGL7
— Marian NMT (@marian_nmt) February 27, 2019