Amazon Web Services is collaborating with the Allen Institute for Artificial Intelligence to distribute benchmark datasets used for solving the hardest problems in artificial intelligence.
If you want to add a dataset, or an example of how to use a dataset, to this registry, please follow the instructions on the Registry of Open Data on AWS GitHub repository.
Unless specifically stated in the applicable dataset documentation, datasets available through the Registry of Open Data on AWS are not provided and maintained by AWS. Datasets are provided and maintained by a variety of third parties under a variety of licenses. Please check dataset licenses and related documentation to determine if a dataset may be used for your application.
If you have a project using a listed dataset, please tell us about it. We may work with you to feature your project in a blog post.
machine learning
4,817 illustrative diagrams for research on diagram understanding and associated question answering.
csv, machine learning
630 paper annotations
csv, json, machine learning
7,787 multiple-choice science questions and associated corpora
coronavirus, COVID-19, life sciences, MERS, SARS
Full-text and metadata dataset of COVID-19 and coronavirus-related research articles optimized for machine readability.
machine learning, natural language processing
The DROP dataset contains 96K question-answer (QA) pairs over 6.7K paragraphs, split between train (77K QAs), development (9.5K QAs), and a hidden test partition (9.5K QAs).
machine learning, natural language processing
24K question-answer (QA) pairs over 4.7K paragraphs, split between train (19K QAs), development (2.4K QAs), and a hidden test partition (2.5K QAs).
json, machine learning, natural language processing
14K QA pairs over 1.7K paragraphs, split between train (10K QAs), development (1.6K QAs), and a hidden test partition (1.7K QAs).
machine learning, natural language processing
9,092 crowd-sourced science questions and 68 tables of curated facts
machine learning, natural language processing
68 tables of curated facts
machine learning
1,076 textbook lessons, 26,260 questions, and 6,229 images
machine learning, natural language processing
ZEST is a benchmark for zero-shot generalization to unseen NLP tasks, with 25K labeled instances across 1,251 different tasks.