
NLP - datasets

deep learning, machine learning, natural language processing


Some of the most important datasets for NLP, with a focus on classification: IMDb, AG-News, Amazon Reviews (polarity and full), Yelp Reviews (polarity and full), DBpedia, Sogou News (Pinyin), Yahoo Answers, WikiText-2 and WikiText-103, and the ACL-2010 French-English 10^9 corpus. It is part of the collection of datasets hosted by AWS for the convenience of students. See the documentation link for citation and license details for each dataset.

Update Frequency

As required


License

Varies by dataset - see documentation link


Managed By



How to Cite

NLP - datasets was accessed on DATE from

Resources on AWS

  • Description
    Resource type
    S3 Bucket
    Amazon Resource Name (ARN)
    AWS Region
    AWS CLI Access (No AWS account required)
    aws s3 ls --no-sign-request s3://fast-ai-nlp/
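
For readers who prefer to browse the bucket from a script rather than the AWS CLI, the following is a minimal sketch of the same anonymous listing using only the Python standard library. It assumes the bucket allows unsigned ListObjectsV2 requests over HTTPS (which is what the --no-sign-request command above relies on) and that the global s3.amazonaws.com endpoint is reachable. The helper names build_list_url, parse_listing, and list_bucket are hypothetical and chosen for illustration; only the bucket name comes from this page.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

BUCKET = "fast-ai-nlp"  # bucket name taken from the CLI command above
# XML namespace used by Amazon S3 REST API responses
S3_NS = {"s3": "http://s3.amazonaws.com/doc/2006-03-01/"}


def build_list_url(bucket, prefix="", max_keys=100):
    """Build an unsigned ListObjectsV2 URL (hypothetical helper)."""
    query = urllib.parse.urlencode(
        {"list-type": "2", "prefix": prefix, "max-keys": str(max_keys)}
    )
    return f"https://{bucket}.s3.amazonaws.com/?{query}"


def parse_listing(xml_data):
    """Return (key, size_in_bytes) pairs from a ListObjectsV2 XML response."""
    root = ET.fromstring(xml_data)
    return [
        (
            item.findtext("s3:Key", namespaces=S3_NS),
            int(item.findtext("s3:Size", namespaces=S3_NS)),
        )
        for item in root.findall("s3:Contents", S3_NS)
    ]


def list_bucket(bucket=BUCKET, prefix="", max_keys=100):
    """Fetch one page of the listing anonymously (needs network access)."""
    url = build_list_url(bucket, prefix, max_keys)
    with urllib.request.urlopen(url, timeout=30) as resp:
        return parse_listing(resp.read())
```

Calling list_bucket() returns at most one page of results (up to max_keys entries); a fuller version would follow the continuation token in the response to page through the whole bucket. No AWS account or credentials are involved, since the request is unsigned.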
