Name: Google Books Ngrams
License: Creative Commons Attribution 3.0 Unported License

Description

N-grams are fixed-size tuples of items. In this case the items are words extracted from the Google Books corpus. The n specifies the number of elements in the tuple, so a 5-gram contains five words or characters. The n-grams in this dataset were produced by passing a sliding window over the text of books and outputting a record for each new token.
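
The sliding-window idea described above can be illustrated with a minimal sketch. It assumes the text has already been split into word tokens on whitespace; the function name `ngrams` and the sample sentence are illustrative only and do not reflect how the dataset itself was tokenized or produced.

```python
from typing import Iterator, Sequence, Tuple


def ngrams(tokens: Sequence[str], n: int) -> Iterator[Tuple[str, ...]]:
    """Yield every run of n consecutive tokens, moving one token at a time."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # The window starts at each position that still leaves n tokens to read.
    for start in range(len(tokens) - n + 1):
        yield tuple(tokens[start:start + n])


# Illustrative input: naive whitespace tokenization (an assumption).
words = "the quick brown fox jumps over the lazy dog".split()

for gram in ngrams(words, 5):
    print(" ".join(gram))
```

Each step of the window drops the oldest token and takes in one new token, so a text of k tokens yields k - n + 1 n-grams, and none when k is smaller than n.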

Update Frequency

Not updated

Documentation

http://books.google.com/ngrams/

Managed By

Not managed

Contact

https://books.google.com/ngrams

How to Cite

Google Books Ngrams was accessed on DATE from https://registry.opendata.aws/google-ngrams.

Resources on AWS

Description

A dataset containing the Google Books n-gram corpora in a Hadoop-friendly file format.

Resource type

S3 Bucket

Amazon Resource Name (ARN)

arn:aws:s3:::datasets.elasticmapreduce/ngrams/books/

AWS Region

us-east-1

AWS CLI Access (No AWS account required)

aws s3 ls --no-sign-request s3://datasets.elasticmapreduce/ngrams/books/
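
For readers who prefer a script to the CLI, the same anonymous listing can be sketched in Python using only the standard library. This is a hedged illustration, not an official access method: it assumes the bucket answers unsigned S3 ListObjectsV2 requests over path-style HTTPS at `s3.amazonaws.com`, and the helper names `build_list_url`, `parse_listing`, and `fetch_listing` are hypothetical.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

BUCKET = "datasets.elasticmapreduce"   # from the ARN above
PREFIX = "ngrams/books/"               # from the ARN above
S3_NS = "{http://s3.amazonaws.com/doc/2006-03-01/}"  # S3 XML namespace


def build_list_url(bucket: str, prefix: str) -> str:
    """Build a path-style ListObjectsV2 URL (assumed endpoint for us-east-1)."""
    query = urllib.parse.urlencode(
        {"list-type": "2", "prefix": prefix, "delimiter": "/"}
    )
    return f"https://s3.amazonaws.com/{bucket}?{query}"


def parse_listing(xml_text: str):
    """Return (sub-prefixes, object keys) from a ListObjectsV2 XML response."""
    root = ET.fromstring(xml_text)
    prefixes = [e.text for e in root.findall(f"{S3_NS}CommonPrefixes/{S3_NS}Prefix")]
    keys = [e.text for e in root.findall(f"{S3_NS}Contents/{S3_NS}Key")]
    return prefixes, keys


def fetch_listing(bucket: str = BUCKET, prefix: str = PREFIX):
    """Send one unsigned request; this performs network I/O when called."""
    with urllib.request.urlopen(build_list_url(bucket, prefix), timeout=30) as resp:
        return parse_listing(resp.read().decode("utf-8"))
```

Calling `fetch_listing()` returns the first page of results only; a fuller client would follow the continuation token in the response to page through longer listings.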