natural language processing
N-grams are fixed-size tuples of items. In this case the items are words extracted from the Google Books corpus. The n specifies the number of elements in the tuple, so a 5-gram contains five words or characters. The n-grams in this dataset were produced by passing a sliding window over the text of books and outputting a record for each new token.
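The sliding-window idea can be illustrated with a minimal sketch. This is not the pipeline used to build the dataset: the function name `sliding_ngrams` is hypothetical, and splitting on whitespace is a simplifying assumption standing in for whatever tokenization was applied to the corpus.

```python
from typing import Iterator, Sequence, Tuple


def sliding_ngrams(tokens: Sequence[str], n: int) -> Iterator[Tuple[str, ...]]:
    """Yield every run of n consecutive tokens, advancing one token at a time."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # The window starts at each position that still leaves room for n tokens.
    for start in range(len(tokens) - n + 1):
        yield tuple(tokens[start:start + n])


# Assumption: whitespace splitting, used here only for illustration.
tokens = "the quick brown fox jumps over the lazy dog".split()
five_grams = list(sliding_ngrams(tokens, 5))

for gram in five_grams:
    print(" ".join(gram))
```

Each step of the window drops the oldest token and takes in the next one, so a text of k tokens yields k - n + 1 n-grams, and none when the text is shorter than n.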
Creative Commons Attribution 3.0 Unported License
Google Books Ngrams was accessed on DATE from https://registry.opendata.aws/google-ngrams.
aws s3 ls --no-sign-request s3://datasets.elasticmapreduce/ngrams/books/