Essential-Web v1.0: 24T tokens of organized web data

machine learning, natural language processing, text analysis, web archive

Description

A 24-trillion-token dataset in which every document is annotated with a twelve-category taxonomy covering topic, format, content complexity, and quality.

Update Frequency

Not updated

License

Essential-Web v1.0 contributions are made available under the ODC Attribution License. Users should also abide by the Common Crawl Terms of Use. We do not alter the license of any of the underlying data.

Documentation

https://huggingface.co/datasets/EssentialAI/essential-web-v1.0

Managed By

EssentialAI

Contact

research@essential.ai

How to Cite

Essential-Web v1.0: 24T tokens of organized web data was accessed on DATE from https://registry.opendata.aws/eai-essential-web-v1.

Usage Examples

Publications

Resources on AWS

  • Description: Essential-Web v1.0: 24T tokens of organized web data
    Resource type: S3 Bucket
    Amazon Resource Name (ARN): arn:aws:s3:::essential-web-v1.0
    AWS Region: us-west-2
    AWS CLI Access (No AWS account required):
    aws s3 ls --no-sign-request s3://essential-web-v1.0/
  • Description: Notifications for new Essential-Web v1.0 data
    Resource type: SNS Topic
    Amazon Resource Name (ARN): arn:aws:sns:us-west-2:021391128517:essential-web-v10-object_created
    AWS Region: us-west-2
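
For readers without the AWS CLI, the same unsigned listing can be sketched with only the Python standard library. This is a minimal sketch, not an official client. It assumes the bucket answers anonymous ListObjectsV2 requests on the regional path-style endpoint (path-style is used here because the bucket name contains a dot). The bucket name and region come from the entries above; the helper names list_url, parse_listing, and list_bucket are hypothetical.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

BUCKET = "essential-web-v1.0"  # from the S3 ARN listed above
REGION = "us-west-2"           # from the AWS Region listed above
S3_NS = "{http://s3.amazonaws.com/doc/2006-03-01/}"  # namespace of S3 list responses


def list_url(prefix="", max_keys=100):
    """Build an unsigned ListObjectsV2 URL (path-style endpoint is an assumption)."""
    query = urllib.parse.urlencode(
        {
            "list-type": "2",   # select the ListObjectsV2 API
            "delimiter": "/",   # group keys by "directory" level
            "prefix": prefix,
            "max-keys": str(max_keys),
        }
    )
    return f"https://s3.{REGION}.amazonaws.com/{BUCKET}?{query}"


def parse_listing(xml_text):
    """Return (sub-prefixes, object keys) from a ListObjectsV2 XML response."""
    root = ET.fromstring(xml_text)
    prefixes = [
        e.findtext(f"{S3_NS}Prefix") for e in root.findall(f"{S3_NS}CommonPrefixes")
    ]
    keys = [e.findtext(f"{S3_NS}Key") for e in root.findall(f"{S3_NS}Contents")]
    return prefixes, keys


def list_bucket(prefix=""):
    """Fetch one page of the listing anonymously (requires network access)."""
    with urllib.request.urlopen(list_url(prefix), timeout=30) as resp:
        return parse_listing(resp.read())
```

Calling list_bucket() with no arguments should return roughly what the CLI command above prints: a list of top-level prefixes and a list of top-level object keys. Only the first page of results is fetched; pagination with continuation tokens is left out to keep the sketch short.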
