Tweeted By @alex_conneau

on 2020-10-28 (UTC)
Tags: dataset, nlp

DATASET RELEASE: "CC100", the CommonCrawl dataset of 2.5TB of clean unsupervised text from 100 languages (used to train XLM-R) is now publicly available.

You can find below the

Data: https://t.co/KDnynkH6hX
Script: https://t.co/BY906YXEHg

By @VishravC et al. pic.twitter.com/KrjcdeRG6P
— Alexis Conneau (@alex_conneau) October 28, 2020
