machine learning, natural language processing, text analysis, web archive
A 24-trillion-token dataset in which every document is annotated with a twelve-category taxonomy covering topic, format, content complexity, and quality.
Not updated
Essential-Web-v1.0 contributions are made available under the ODC Attribution License (ODC-By); users should also abide by the Common Crawl Terms of Use. We do not alter the license of any of the underlying data.
https://huggingface.co/datasets/EssentialAI/essential-web-v1.0
Managed by EssentialAI.
Essential-Web v1.0: 24T tokens of organized web data was accessed on DATE
from https://registry.opendata.aws/eai-essential-web-v1.
arn:aws:s3:::essential-web-v1.0
us-west-2
aws s3 ls --no-sign-request s3://essential-web-v1.0/
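The CLI command above lists the top level of the bucket without credentials. Where the AWS CLI is not available, the same unsigned listing can be sketched with Python's standard library against the S3 REST API (ListObjectsV2). This is a minimal sketch, not an official client: it assumes the bucket accepts anonymous list requests (which is what --no-sign-request relies on), it uses a path-style URL because the bucket name contains a dot, and the function names are illustrative. It makes no assumption about how keys are laid out inside the bucket.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

BUCKET = "essential-web-v1.0"  # resource part of arn:aws:s3:::essential-web-v1.0
REGION = "us-west-2"
S3_NS = {"s3": "http://s3.amazonaws.com/doc/2006-03-01/"}  # S3 XML namespace


def listing_url(prefix="", delimiter="/", max_keys=100):
    """Build an unsigned ListObjectsV2 URL in path style."""
    query = urllib.parse.urlencode(
        {
            "list-type": "2",
            "prefix": prefix,
            "delimiter": delimiter,
            "max-keys": str(max_keys),
        }
    )
    return f"https://s3.{REGION}.amazonaws.com/{BUCKET}/?{query}"


def parse_listing(xml_text):
    """Return (common_prefixes, object_keys) from a ListObjectsV2 XML response."""
    root = ET.fromstring(xml_text)
    prefixes = [e.text or "" for e in root.findall("s3:CommonPrefixes/s3:Prefix", S3_NS)]
    keys = [e.text or "" for e in root.findall("s3:Contents/s3:Key", S3_NS)]
    return prefixes, keys


def list_bucket(prefix=""):
    """Fetch one page of the listing anonymously (needs network access)."""
    with urllib.request.urlopen(listing_url(prefix), timeout=30) as resp:
        return parse_listing(resp.read().decode("utf-8"))
```

Calling `list_bucket()` returns the "directory" prefixes and object keys for one page of results, roughly what `aws s3 ls` prints. The sketch does not follow `NextContinuationToken`, so listings longer than `max_keys` entries would need an added pagination loop.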
arn:aws:sns:us-west-2:021391128517:essential-web-v10-object_created
us-west-2
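The two ARNs in this listing follow the general layout arn:partition:service:region:account-id:resource. A small sketch of splitting them into fields shows why the region is listed separately for the bucket: S3 bucket ARNs leave the region and account fields empty, while the SNS topic ARN carries both. The `Arn` type and `parse_arn` helper are illustrative names, not part of any AWS SDK.

```python
from typing import NamedTuple


class Arn(NamedTuple):
    partition: str
    service: str
    region: str
    account_id: str
    resource: str


def parse_arn(arn):
    """Split an ARN into its five fields after the literal 'arn' prefix."""
    # Limit the split so colons inside the resource part are preserved.
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn":
        raise ValueError(f"not a valid ARN: {arn!r}")
    return Arn(*parts[1:])


bucket = parse_arn("arn:aws:s3:::essential-web-v1.0")
topic = parse_arn("arn:aws:sns:us-west-2:021391128517:essential-web-v10-object_created")
```

Here `bucket.resource` is the bucket name used in the `s3://` URL, and `topic.region` matches the region given for the SNS topic.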