
Usage examples for all datasets in the Registry of Open Data on AWS that are tagged with natural language processing.


Common Crawl

Tutorials
Tools & Applications
Publications

Sudachi Language Resources

Tutorials
Tools & Applications
Publications

Synthea synthetic patient generator data in OMOP Common Data Model

Tutorials
Tools & Applications

Japanese Tokenizer Dictionaries

Tutorials
Tools & Applications
Publications

MIMIC-III (‘Medical Information Mart for Intensive Care’)

Tutorials
Tools & Applications

REDASA COVID-19 Open Data

Tools & Applications
Publications

CMS 2008-2010 Data Entrepreneurs’ Synthetic Public Use File (DE-SynPUF) in OMOP Common Data Model

Tutorials
Tools & Applications

Common Screens

Tutorials

Discrete Reasoning Over the content of Paragraphs (DROP)

Publications

End of Term Web Archive Dataset

Publications

MultiCoNER Datasets

Publications

Quoref

Publications

Reasoning Over Paragraph Effects in Situations (ROPES)

Publications

Gretel Synthetic Safety Alignment Dataset

Tutorials
Tools & Applications

ABEJA CC JA

Tutorials
Publications

Amazon-PQA

Publications

Answer Reformulation

Publications

Automatic Speech Recognition (ASR) Error Robustness

Publications

DialoGLUE: A Natural Language Understanding Benchmark for Task-Oriented Dialogue

Publications

Enriched Topical-Chat Dataset for Knowledge-Grounded Dialogue Systems

Publications

Helpful Sentences from Reviews

Publications

Humor Detection from Product Question Answering Systems

Publications

Humor patterns used for querying Alexa traffic

Publications

Learning to Rank and Filter - community question answering

Publications

Low Context Name Entity Recognition (NER) Datasets with Gazetteer

Publications

Multi Token Completion

Publications

Multilingual Name Entity Recognition (NER) Datasets with Gazetteer

Publications

PASS: Perturb-and-Select Summarizer for Product Reviews

Publications

Phrase Clustering Dataset (PCD)

Publications

Pre- and post-purchase product questions

Publications

Product Comparison Dataset for Online Shopping

Publications

Shopping Humor Generation

Publications

WikiSum: Coherent Summarization Dataset for Efficient Human-Evaluation

Publications

Wizard of Tasks

Publications

If you want to add a dataset or usage example to this registry, please follow the instructions on the Registry of Open Data on AWS GitHub repository or tell us about your project.
