Google Books Ngrams

natural language processing

Description

N-grams are fixed-size tuples of items. In this case the items are words extracted from the Google Books corpus. The n specifies the number of elements in the tuple, so a 5-gram contains five items, here five words. The n-grams in this dataset were produced by passing a sliding window over the text of books and outputting a record for each new token.
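The sliding-window idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `sliding_ngrams` is hypothetical, and splitting on whitespace is a stand-in for whatever tokenization was actually used to build the corpus, which this entry does not describe.

```python
from collections import deque
from typing import Iterable, Iterator, Tuple


def sliding_ngrams(tokens: Iterable[str], n: int) -> Iterator[Tuple[str, ...]]:
    """Yield word n-grams by sliding a window of size n over the tokens.

    Hypothetical helper for illustration only; it is not the pipeline
    that produced the Google Books n-gram files.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    # A bounded deque drops the oldest token automatically when a new one arrives.
    window = deque(maxlen=n)
    for token in tokens:
        window.append(token)
        # Once the window is full, every new token produces one record.
        if len(window) == n:
            yield tuple(window)


if __name__ == "__main__":
    # Assumption: naive whitespace tokenization, used only for this demo.
    text = "the quick brown fox jumps over the lazy dog"
    for gram in sliding_ngrams(text.split(), 5):
        print(" ".join(gram))
```

With the nine-word sentence in the demo, the window fills after the fifth word and then emits one 5-gram per additional word, for five records in total. Inputs shorter than n produce no records.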

Update Frequency

Not updated

License

Creative Commons Attribution 3.0 Unported License

Documentation

http://books.google.com/ngrams/

Contact

https://books.google.com/ngrams

Resources on AWS

  • Description
    A dataset containing Google Books n-gram corpora in a Hadoop-friendly file format.
    Resource type
    S3 Bucket
    Amazon Resource Name (ARN)
    arn:aws:s3:::datasets.elasticmapreduce/ngrams/books/
    AWS Region
    us-east-1
