Common Crawl

encyclopedic, internet, machine learning, natural language processing

Description

A corpus of web crawl data composed of over 50 billion web pages.

Update Frequency

Monthly

License

This data is available for anyone to use under the Common Crawl Terms of Use.

Documentation

https://commoncrawl.org/the-data/get-started/

Managed By

Common Crawl

Contact

https://commoncrawl.org/connect/contact-us/

How to Cite

Common Crawl was accessed on DATE from https://registry.opendata.aws/commoncrawl.

Resources on AWS

  • Description: Crawl data (WARC and ARC format)
    Resource type: S3 Bucket
    Amazon Resource Name (ARN): arn:aws:s3:::commoncrawl
    AWS Region: us-east-1
    AWS CLI Access (No AWS account required): aws s3 ls s3://commoncrawl/ --no-sign-request
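
The CLI command above can also be driven from a script. The following is a minimal sketch, not an official client. It assumes the AWS CLI is installed and on the PATH, and that anonymous listing is enabled as this entry states. The helper names (build_ls_command, parse_ls_output, list_bucket) are hypothetical and introduced only for illustration. The parser relies on the usual `aws s3 ls` output layout, where prefixes appear as `PRE name/` and objects as `date time size key`.

```python
import subprocess

# Bucket taken from the resource listing above.
BUCKET_URI = "s3://commoncrawl/"


def build_ls_command(prefix=""):
    """Return the argv list for an unsigned `aws s3 ls` under the given prefix."""
    return ["aws", "s3", "ls", BUCKET_URI + prefix.lstrip("/"), "--no-sign-request"]


def parse_ls_output(text):
    """Parse `aws s3 ls` output into (kind, name) tuples.

    Prefix lines look like `PRE name/`; object lines look like
    `YYYY-MM-DD HH:MM:SS <size> <key>`. Other lines are skipped.
    """
    entries = []
    for line in text.splitlines():
        parts = line.split(None, 3)
        if len(parts) >= 2 and parts[0] == "PRE":
            # Keep everything after "PRE" so names containing spaces survive.
            entries.append(("prefix", line.split(None, 1)[1]))
        elif len(parts) == 4:
            entries.append(("object", parts[3]))
    return entries


def list_bucket(prefix=""):
    """Run the CLI (assumed to be installed) and return the parsed entries."""
    result = subprocess.run(
        build_ls_command(prefix), capture_output=True, text=True, check=True
    )
    return parse_ls_output(result.stdout)
```

Calling list_bucket() with no argument mirrors the command in the listing. Passing a prefix string narrows the listing to keys under that prefix. Network access is only needed by list_bucket; the other two helpers are pure functions and can be checked offline.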
