Google Books Ngrams

amazon.science natural language processing

Description

N-grams are fixed-size tuples of items. In this dataset the items are words extracted from the Google Books corpus. The n specifies the number of elements in the tuple, so a 5-gram contains five words. The n-grams in this dataset were produced by passing a sliding window over the text of books and outputting a record for each new token.
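
The sliding-window idea described above can be illustrated with a minimal Python sketch. It is not the pipeline used to build the dataset: the function name `word_ngrams`, the whitespace tokenization, and the sample sentence are all assumptions made for illustration.

```python
from typing import Iterator, List, Tuple


def word_ngrams(tokens: List[str], n: int) -> Iterator[Tuple[str, ...]]:
    """Yield every run of n consecutive tokens, advancing one token at a time.

    Hypothetical helper for illustration; the real corpus tokenization
    rules are not described in this entry.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    # The window starts at each position that still leaves n tokens available.
    for start in range(len(tokens) - n + 1):
        yield tuple(tokens[start:start + n])


# Assumed example input, split on whitespace only.
tokens = "the quick brown fox jumps over the lazy dog".split()
five_grams = list(word_ngrams(tokens, 5))

for gram in five_grams:
    print(" ".join(gram))
```

A text with fewer than n tokens yields no n-grams under this sketch, since no full window fits.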

Update Frequency

Not updated

License

Creative Commons Attribution 3.0 Unported License

Documentation

http://books.google.com/ngrams/

Managed By

Not managed


Contact

https://books.google.com/ngrams

How to Cite

Google Books Ngrams was accessed on DATE from https://registry.opendata.aws/google-ngrams.

Resources on AWS

  • Description
    A data set containing Google Books n-gram corpora in a Hadoop friendly file format.
    Resource type
    S3 Bucket
    Amazon Resource Name (ARN)
    arn:aws:s3:::datasets.elasticmapreduce/ngrams/books/
    AWS Region
    us-east-1
    AWS CLI Access (No AWS account required)
    aws s3 ls --no-sign-request s3://datasets.elasticmapreduce/ngrams/books/
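
The anonymous CLI listing above could be approximated from Python roughly as follows. This is a sketch under assumptions: it relies on the third-party boto3 package being installed, the helper names `split_s3_arn` and `list_prefix` are hypothetical, and it assumes the bucket still permits unsigned requests as the CLI example implies.

```python
from typing import List, Tuple


def split_s3_arn(arn: str) -> Tuple[str, str]:
    """Split an S3 ARN into (bucket, key prefix). Hypothetical helper."""
    marker = "arn:aws:s3:::"
    if not arn.startswith(marker):
        raise ValueError("not an S3 ARN: " + arn)
    bucket, _, key_prefix = arn[len(marker):].partition("/")
    return bucket, key_prefix


def list_prefix(arn: str, region: str = "us-east-1") -> List[str]:
    """List the immediate sub-prefixes and keys under the ARN, unsigned."""
    # Imported lazily so the ARN helper works without boto3 installed.
    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    bucket, key_prefix = split_s3_arn(arn)
    # UNSIGNED mirrors the CLI's --no-sign-request flag.
    client = boto3.client(
        "s3", region_name=region, config=Config(signature_version=UNSIGNED)
    )
    response = client.list_objects_v2(
        Bucket=bucket, Prefix=key_prefix, Delimiter="/"
    )
    names = [p["Prefix"] for p in response.get("CommonPrefixes", [])]
    names += [obj["Key"] for obj in response.get("Contents", [])]
    return names


# Example call (requires network access and boto3):
# list_prefix("arn:aws:s3:::datasets.elasticmapreduce/ngrams/books/")
```

Only the first page of results is returned here; a longer listing would need pagination.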
