Registry of Open Data on AWS

Allen Institute for Artificial Intelligence (AI2)

Amazon Web Services is collaborating with the Allen Institute for Artificial Intelligence to distribute benchmark datasets used for solving the hardest problems in artificial intelligence.

Search datasets (currently 13 matching datasets)

Add to this registry

If you want to add a dataset or example of how to use a dataset to this registry, please follow the instructions on the Registry of Open Data on AWS GitHub repository.

Unless specifically stated in the applicable dataset documentation, datasets available through the Registry of Open Data on AWS are not provided and maintained by AWS. Datasets are provided and maintained by a variety of third parties under a variety of licenses. Please check dataset licenses and related documentation to determine if a dataset may be used for your application.

Tell us about your project

If you have a project using a listed dataset, please tell us about it. We may work with you to feature your project in a blog post.

AI2 Diagram Dataset (AI2D)

machine learning

4,817 illustrative diagrams for research on diagram understanding and associated question answering.

Usage examples

A Diagram is Worth a Dozen Images by Aniruddha Kembhavi, Michael Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, Ali Farhadi

AI2 Meaningful Citations Data Set

csv, machine learning

630 paper annotations

Usage examples

Identifying Meaningful Citations by Marco Valenzuela, Vu A. Ha, Oren Etzioni

AI2 Reasoning Challenge (ARC) 2018

csv, json, machine learning

7,787 multiple choice science questions and associated corpora

Usage examples

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge by Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, Oyvind Tafjord

COVID-19 Open Research Dataset (CORD-19)

coronavirus, COVID-19, life sciences, MERS, SARS

Full-text and metadata dataset of COVID-19 and coronavirus-related research articles optimized for machine readability.

Usage examples

COVID-19 Open Research Dataset Challenge (CORD-19) by Kaggle

Discrete Reasoning Over the content of Paragraphs (DROP)

machine learning, natural language processing

The DROP dataset contains 96k question-answer pairs (QAs) over 6.7k paragraphs, split among train (77k QAs), development (9.5k QAs), and a hidden test partition (9.5k QAs).
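
As a quick sanity check on the DROP split sizes quoted above, the following minimal sketch computes each partition's share of the total. The figures are the rounded counts from the description, not exact record counts. The names `SPLITS_K` and `split_fractions` are hypothetical helpers introduced here for illustration and are not part of any DROP tooling.

```python
# Split sizes in thousands of QA pairs, copied from the rounded figures
# in the dataset description (assumption: these are approximate counts).
SPLITS_K = {"train": 77.0, "development": 9.5, "test (hidden)": 9.5}


def split_fractions(splits):
    """Return each split's share of the total as a fraction between 0 and 1."""
    total = sum(splits.values())
    return {name: size / total for name, size in splits.items()}


if __name__ == "__main__":
    # The three partitions should add up to the quoted 96k total.
    print(f"total: {sum(SPLITS_K.values()):g}k QAs")
    for name, fraction in split_fractions(SPLITS_K).items():
        print(f"{name}: {fraction:.1%}")
```

With these rounded figures the partitions sum to the stated 96k, and the split works out to roughly 80% train and about 10% each for development and test.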
