Neat corpus of typos: https://t.co/FhRJlQIdWj (HT @jsvine's "Data is Plural" newsletter)
— Rachael Tatman (@rctatman) December 11, 2019
Deepfake Detection Challenge on #Kaggle: Identify videos with facial manipulations
- $1,000,000 USD in prize money
- Kaggle notebook-only submission
- over 470GB of training data
- 1GB limit on external data
- no custom packages or internet access https://t.co/BUU9NFqyaz pic.twitter.com/FpBVvcu9kG
— Alexandr Kalinin (@alxndrkalinin) December 11, 2019
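Since submissions run in Kaggle notebooks with no internet access, most pipelines start by sampling frames from each video and training a frame-level classifier. A minimal sketch of that first step with OpenCV; the `train_videos` folder and `.mp4` layout are assumptions for illustration, not the official challenge structure:

```python
# A minimal sketch (not the official starter kit): pulling a few frames
# from each training video with OpenCV so a frame-level classifier can be
# trained. Directory names here are assumptions, not the challenge layout.
from pathlib import Path

import cv2

def sample_frames(video_path: Path, every_n: int = 30, max_frames: int = 10):
    """Yield up to max_frames BGR frames, taking one every every_n frames."""
    cap = cv2.VideoCapture(str(video_path))
    taken, idx = 0, 0
    while taken < max_frames:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n == 0:
            yield frame
            taken += 1
        idx += 1
    cap.release()

Path("frames").mkdir(exist_ok=True)
for video in Path("train_videos").glob("*.mp4"):  # hypothetical folder
    for i, frame in enumerate(sample_frames(video)):
        cv2.imwrite(f"frames/{video.stem}_{i}.jpg", frame)
```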
If you're looking for a dataset to quickly try out your ideas for semi-supervised and unbalanced data classification, you might be interested in this new dataset: "Image网" (pronounced "Imagewang"). https://t.co/Xmmz4cd9jI pic.twitter.com/mcaD3BqJuI
— Jeremy Howard (@jeremyphoward) December 11, 2019
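For readers who want to poke at it right away: a minimal sketch of loading the labeled split with torchvision, assuming the dataset unpacks into Imagenette-style `train/` and `val/` class folders (check the repo for the exact layout, including where the unlabeled images for the semi-supervised track live):

```python
# A minimal sketch, not fastai's official loader: reading the labeled split
# with torchvision. Directory names are assumptions based on the usual
# Imagenette-style convention; see the dataset repo for the real layout.
import torch
from torchvision import datasets, transforms

tfm = transforms.Compose([
    transforms.Resize(160),
    transforms.CenterCrop(128),
    transforms.ToTensor(),
])

train_ds = datasets.ImageFolder("imagewang/train", transform=tfm)
val_ds = datasets.ImageFolder("imagewang/val", transform=tfm)

# Class imbalance is part of the benchmark, so inspect the label counts:
counts = torch.bincount(torch.tensor(train_ds.targets))
print(dict(zip(train_ds.classes, counts.tolist())))

train_dl = torch.utils.data.DataLoader(train_ds, batch_size=64, shuffle=True)
```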
AI needs better datasets (not just the most convenient or easy to collect data):
- data that better reflects scope of human imagination
- better practices for data collection
- data collected by specialists
- takes into account ethics & consent @ctnzr #FantasticFutures2019 pic.twitter.com/JPohaa7c7a
— Rachel Thomas @ #NeurIPS2019 (@math_rachel) December 4, 2019
SimpleBooks is a long-term-dependency dataset that is 90% the size of WikiText-103 but has 1/3 the vocabulary and 1/4 the OOV rate. I created it last year to test, benchmark, & do tutorials for word-level language models but didn't publish it because small datasets get 0 love 😅 https://t.co/3TNA2xoz5Z https://t.co/8zCdf1ovdd
— Chip Huyen @ NeurIPS (@chipro) December 2, 2019
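Vocabulary size and out-of-vocabulary (OOV) rate are the two numbers the tweet compares. A quick sketch of how they are typically computed for a word-level corpus; whitespace tokenization, the frequency cutoff, and the file names are simplifying assumptions, not SimpleBooks' actual preprocessing:

```python
# Compute vocab size and OOV rate for a word-level corpus. Whitespace
# tokenization, the min_freq cutoff, and file names are assumptions.
from collections import Counter

def vocab_and_oov(train_path: str, test_path: str, min_freq: int = 3):
    with open(train_path, encoding="utf-8") as f:
        train_counts = Counter(f.read().split())
    # Words below min_freq are mapped to <unk>, as in WikiText-style setups.
    vocab = {w for w, c in train_counts.items() if c >= min_freq}
    with open(test_path, encoding="utf-8") as f:
        test_tokens = f.read().split()
    oov = sum(tok not in vocab for tok in test_tokens)
    return len(vocab), oov / len(test_tokens)

size, oov_rate = vocab_and_oov("train.txt", "test.txt")
print(f"vocab size: {size}, OOV rate: {oov_rate:.2%}")
```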
Large language models are starting to capture larger swaths of English grammar, and several of us at NYU have gotten interested in trying to get a broad overview of where models are succeeding and failing. [new dataset alert; thread] pic.twitter.com/9DcqCEAtUR
— Sam Bowman (@sleepinyourhat) December 1, 2019
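Benchmarks in this vein typically pair a grammatical sentence with a minimally different ungrammatical one and check whether a language model assigns the grammatical version higher probability. A sketch of that scoring loop with the `transformers` library; the sentence pair is invented and this is not the dataset's official evaluation script:

```python
# Score a grammatical/ungrammatical minimal pair under GPT-2. The example
# pair is invented; this illustrates the idea, not the benchmark's own code.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def log_prob(sentence: str) -> float:
    ids = tokenizer.encode(sentence, return_tensors="pt")
    with torch.no_grad():
        # With labels=input_ids, the model returns mean token NLL as loss.
        loss = model(ids, labels=ids)[0]
    # Multiply by the number of predicted tokens to get total log-prob.
    return -loss.item() * (ids.shape[1] - 1)

good = "The cats that chased the dog were hungry."
bad = "The cats that chased the dog was hungry."
print(log_prob(good) > log_prob(bad))  # does the model prefer the good form?
```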
Some more information about the M5 Competition. It will start on the 2nd of March, 2020 and end on the 30th of June, 2020. It will be run using the @Kaggle platform. There will be about 100K time series of sales data, made generously available by @Walmart, as shown in the attached table. pic.twitter.com/cB59xI2WL7
— Spyros Makridakis (@spyrosmakrid) November 28, 2019
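Retail sales series like these are strongly weekly-seasonal, so the usual first baseline is a seasonal naive forecast: repeat the value from seven days earlier. A sketch with pandas; the column names, file name, and 28-day horizon are assumptions for illustration, not the competition's actual schema:

```python
# Seasonal-naive baseline for a daily sales series. Column names, file
# name, and the 28-day horizon are assumptions, not the M5 schema.
import pandas as pd

def seasonal_naive(history: pd.Series, horizon: int = 28, season: int = 7):
    """Forecast by repeating the last full season of observations."""
    last_season = history.iloc[-season:].to_numpy()
    reps = -(-horizon // season)  # ceiling division
    values = (list(last_season) * reps)[:horizon]
    idx = pd.date_range(history.index[-1] + pd.Timedelta(days=1),
                        periods=horizon, freq="D")
    return pd.Series(values, index=idx)

sales = pd.read_csv("sales.csv", parse_dates=["date"], index_col="date")
forecast = seasonal_naive(sales["units_sold"])
```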
PlantDoc: A Dataset for Visual Plant Disease Detection. https://t.co/Bhbzw9ZTHL pic.twitter.com/2FWVx6qDmn
— arxiv (@arxiv_org) November 28, 2019
We’re sharing a new benchmark called MLQA to help extend performance improvements in extractive question-answering (QA) to more languages. It contains thousands of QA instances in Arabic, German, Hindi, Spanish, Vietnamese, and Simplified Chinese. https://t.co/qGSfOc30Co pic.twitter.com/XIITMxpWt8
— Facebook AI (@facebookai) November 23, 2019
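Extractive QA benchmarks like MLQA are usually scored with SQuAD-style token-overlap F1 between the predicted and gold answer spans. A minimal sketch of that metric; MLQA's official evaluation adds per-language answer normalization (punctuation, articles) that this omits:

```python
# SQuAD-style token-overlap F1 between a predicted and a gold answer span.
# MLQA's official script adds per-language normalization not shown here.
from collections import Counter

def f1_score(prediction: str, gold: str) -> float:
    pred_toks = prediction.lower().split()
    gold_toks = gold.lower().split()
    common = Counter(pred_toks) & Counter(gold_toks)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

print(f1_score("the Eiffel Tower", "Eiffel Tower"))  # 0.8
```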
Today we also started open-sourcing some of our datasets & NLP example projects!
Includes 1k+ annotated examples each, train/eval scripts, results, data visualizers & some powerful tok2vec weights trained on Reddit to initialize models.
💝 Repo: https://t.co/xHLVaMRc69 https://t.co/8pkn2xz0AG pic.twitter.com/SCfWX2ahby
— Ines Montani 〰️ (@_inesmontani) November 22, 2019
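For context on what train scripts over annotated examples like these look like, here is a minimal spaCy v2-style training loop for a text classifier. The two training examples and the label are invented; the repo's projects ship their own train/eval scripts:

```python
# Minimal spaCy v2-style training loop for a text classifier. The training
# examples and label are invented for illustration; the repo's projects
# come with their own scripts.
import random

import spacy

train_data = [
    ("This keyboard is fantastic", {"cats": {"POSITIVE": 1.0}}),
    ("Broke after two days", {"cats": {"POSITIVE": 0.0}}),
]

nlp = spacy.blank("en")
textcat = nlp.create_pipe("textcat")
textcat.add_label("POSITIVE")
nlp.add_pipe(textcat)

optimizer = nlp.begin_training()
for epoch in range(10):
    random.shuffle(train_data)
    losses = {}
    for text, annotations in train_data:
        nlp.update([text], [annotations], sgd=optimizer, losses=losses)
    print(epoch, losses)

print(nlp("Works great so far").cats)
```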
This is an amazing dataset and open challenge - a dataset with 700,000 building assessment annotations, with the goal of improving disaster recovery.
A fantastic project for any deep learning practitioner looking to deepen and test out their skills. https://t.co/gWRVutzG0S pic.twitter.com/xFlw410cbv
— Jeremy Howard (@jeremyphoward) November 20, 2019
Interesting work (and a nice large and clean dataset as well, looking forward to seeing it released):
"Compressive Transformers for Long-Range Sequence Modelling"
by Jack W. Rae, Anna Potapenko, Siddhant M. Jayakumar, Timothy P. Lillicrap (at DeepMind)
Paper: https://t.co/CV3ThAAweg pic.twitter.com/JQMMjsPJcX
— Thomas Wolf (@Thom_Wolf) November 16, 2019