Description

A 24-trillion-token dataset in which every document is annotated with a twelve-category taxonomy covering topic, format, content complexity, and quality.

Update Frequency

Not updated

License

Essential-Web-v1.0 contributions are made available under the ODC attribution license; however, users should also abide by the Common Crawl - Terms of Use. We do not alter the license of any of the underlying data.

Documentation

https://huggingface.co/datasets/EssentialAI/essential-web-v1.0

Managed By

EssentialAI

Contact

research@essential.ai

How to Cite

Essential-Web v1.0: 24T tokens of organized web data was accessed on DATE from https://registry.opendata.aws/eai-essential-web-v1.

Usage Examples

Publications

Essential-Web v1.0: 24T tokens of organized web data by Andrew Hojel, Michael Pust, Tim Romanski, Yash Vanjani, Ritvik Kapila, Mohit Parmar et al.

Resources on AWS

Description

Essential-Web v1.0: 24T tokens of organized web data

Resource type

S3 Bucket

Amazon Resource Name (ARN)

arn:aws:s3:::essential-web-v1.0

AWS Region

us-west-2

AWS CLI Access (No AWS account required)

aws s3 ls --no-sign-request s3://essential-web-v1.0/
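
For readers without the AWS CLI, the same anonymous listing can be sketched with only the Python standard library by calling the S3 ListObjectsV2 REST endpoint without a signature. This is a minimal sketch, not an official client: the helper names (`build_list_url`, `parse_listing`, `list_bucket`) are hypothetical, and it assumes the bucket allows unsigned path-style requests (path-style is used here because the bucket name contains a dot, which does not match the wildcard TLS certificate of virtual-hosted URLs).

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

BUCKET = "essential-web-v1.0"  # bucket name taken from the ARN above
REGION = "us-west-2"           # region listed above
S3_NS = "{http://s3.amazonaws.com/doc/2006-03-01/}"  # S3 XML namespace


def build_list_url(prefix="", delimiter="/", max_keys=100, token=None):
    """Build an unsigned, path-style ListObjectsV2 URL (hypothetical helper)."""
    params = {
        "list-type": "2",
        "prefix": prefix,
        "delimiter": delimiter,
        "max-keys": str(max_keys),
    }
    if token:
        params["continuation-token"] = token
    return f"https://s3.{REGION}.amazonaws.com/{BUCKET}?{urllib.parse.urlencode(params)}"


def parse_listing(xml_data):
    """Extract (key, size) pairs, sub-prefixes, and the next page token."""
    root = ET.fromstring(xml_data)
    keys = [
        (item.findtext(f"{S3_NS}Key"), int(item.findtext(f"{S3_NS}Size")))
        for item in root.findall(f"{S3_NS}Contents")
    ]
    prefixes = [
        item.findtext(f"{S3_NS}Prefix")
        for item in root.findall(f"{S3_NS}CommonPrefixes")
    ]
    next_token = root.findtext(f"{S3_NS}NextContinuationToken")
    return keys, prefixes, next_token


def list_bucket(prefix=""):
    """Fetch one page of the listing over HTTPS (requires network access)."""
    with urllib.request.urlopen(build_list_url(prefix), timeout=30) as resp:
        return parse_listing(resp.read())
```

Calling `list_bucket("")` should return the top-level prefixes of the bucket, roughly what the CLI command prints; pass the returned token back through `build_list_url(token=...)` to page through longer listings.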

Description

Notifications for new Essential-Web v1.0 data

Resource type

SNS Topic

Amazon Resource Name (ARN)

arn:aws:sns:us-west-2:021391128517:essential-web-v10-object_created

AWS Region

us-west-2
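
A consumer subscribed to this topic (for example through an SQS queue or a Lambda function in the subscriber's own AWS account) would need to unpack each notification to find the new object keys. The sketch below assumes the topic forwards standard S3 event notifications inside the usual SNS JSON envelope, which this listing does not confirm; `keys_from_sns_message` is a hypothetical helper name.

```python
import json
from urllib.parse import unquote_plus


def keys_from_sns_message(sns_body):
    """Return (bucket, key, size) tuples for objects created in one message.

    Assumes `sns_body` is the JSON envelope SNS delivers, whose "Message"
    field holds a standard S3 event notification serialized as a string.
    """
    envelope = json.loads(sns_body)
    event = json.loads(envelope["Message"])
    created = []
    # Test events sent by S3 carry no "Records" list, hence the default.
    for record in event.get("Records", []):
        if not record.get("eventName", "").startswith("ObjectCreated"):
            continue
        s3_info = record["s3"]
        # Object keys in S3 events are URL-encoded, with "+" for spaces.
        key = unquote_plus(s3_info["object"]["key"])
        created.append(
            (s3_info["bucket"]["name"], key, s3_info["object"].get("size"))
        )
    return created
```

If the subscription uses raw message delivery, the body is the S3 event itself, and the outer `json.loads(envelope["Message"])` step would be skipped.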