Common Crawl

encyclopedic, internet, natural language processing, web archive

Description

A corpus of web crawl data composed of over 50 billion web pages.

Update Frequency

Monthly

License

This data is available for anyone to use under the Common Crawl Terms of Use.

Documentation

https://commoncrawl.org/the-data/get-started/

Managed By

Common Crawl

Contact

https://commoncrawl.org/connect/contact-us/

How to Cite

Common Crawl was accessed on DATE from https://registry.opendata.aws/commoncrawl.

Resources on AWS

  • Description: Crawl data (WARC and ARC format)
    Resource type: S3 Bucket (Account Required)
    Amazon Resource Name (ARN): arn:aws:s3:::commoncrawl
    AWS Region: us-east-1
    AWS CLI Access: aws s3 ls s3://commoncrawl/
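
The ARN and the CLI command listed above are related: the last field of the ARN is the bucket name that appears in the s3:// URI. The following is a minimal Python sketch of that mapping, not an official tool. The function names parse_s3_arn and s3_ls_command are hypothetical, as is the optional prefix argument. The sketch only builds the command and does not run it. Because the entry marks the bucket as "Account Required", running the command presumably needs configured AWS credentials.

```python
def parse_s3_arn(arn: str) -> str:
    """Return the bucket name from an S3 ARN such as arn:aws:s3:::commoncrawl."""
    # An ARN has six colon-separated fields:
    # arn:partition:service:region:account-id:resource
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn" or parts[2] != "s3":
        raise ValueError(f"not an S3 ARN: {arn!r}")
    # For S3 the resource is "bucket" or "bucket/key"; keep only the bucket.
    bucket = parts[5].split("/", 1)[0]
    if not bucket:
        raise ValueError(f"ARN has no bucket name: {arn!r}")
    return bucket


def s3_ls_command(arn: str, prefix: str = "") -> list[str]:
    """Build the argument list for `aws s3 ls` on the bucket named by the ARN.

    `prefix` is a hypothetical convenience for listing below the bucket root.
    """
    bucket = parse_s3_arn(arn)
    return ["aws", "s3", "ls", f"s3://{bucket}/{prefix}"]


if __name__ == "__main__":
    # Prints the command from the listing; pass the list to subprocess.run
    # to execute it on a machine with the AWS CLI installed.
    print(" ".join(s3_ls_command("arn:aws:s3:::commoncrawl")))
```

Returning the command as a list of arguments rather than a single string lets it be passed to subprocess.run without shell quoting.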
