Tweeted By @alex_conneau
DATASET RELEASE: "CC100", the CommonCrawl dataset of 2.5TB of clean unsupervised text from 100 languages (used to train XLM-R) is now publicly available.
— Alexis Conneau (@alex_conneau) October 28, 2020
You can find the data and script below:
Data: https://t.co/KDnynkH6hX
Script: https://t.co/BY906YXEHg
By @VishravC et al.