Common Crawl

encyclopedic, machine learning, internet

Description

A corpus of web crawl data composed of over 5 billion web pages.

Update Frequency

Monthly

License

This data is available for anyone to use under the Common Crawl Terms of Use.

Documentation

http://commoncrawl.org/the-data/get-started/

Contact

http://commoncrawl.org/connect/contact-us/

Usage Examples

Resources on AWS

  • Description: Crawl data (WARC and ARC format)
    Resource type: S3 Bucket
    Amazon Resource Name (ARN): arn:aws:s3:::commoncrawl
    AWS Region: us-east-1
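
The ARN and region listed above are enough to locate the bucket programmatically. The following is a minimal sketch, not an official client: `parse_s3_bucket_arn` and `list_top_level_prefixes` are hypothetical helper names introduced here for illustration. The listing helper assumes that boto3 is installed and that the bucket accepts unsigned (anonymous) requests, which this entry does not confirm.

```python
from typing import List


def parse_s3_bucket_arn(arn: str) -> str:
    """Return the bucket name from an S3 ARN such as arn:aws:s3:::commoncrawl."""
    # General ARN layout: arn:partition:service:region:account-id:resource.
    # S3 bucket ARNs leave the region and account-id fields empty.
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn" or parts[2] != "s3":
        raise ValueError(f"not an S3 ARN: {arn!r}")
    # The resource field may carry an object key after the bucket name.
    return parts[5].split("/", 1)[0]


def list_top_level_prefixes(bucket: str, region: str, max_keys: int = 20) -> List[str]:
    """List top-level prefixes in a bucket (hypothetical helper; needs boto3 and network).

    Assumption: the bucket permits unsigned requests. If it does not,
    drop the UNSIGNED config and supply AWS credentials instead.
    """
    # Imported lazily so the ARN parser above works without boto3 installed.
    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    client = boto3.client(
        "s3",
        region_name=region,
        config=Config(signature_version=UNSIGNED),
    )
    # Delimiter="/" groups keys by their first path segment.
    response = client.list_objects_v2(Bucket=bucket, Delimiter="/", MaxKeys=max_keys)
    return [entry["Prefix"] for entry in response.get("CommonPrefixes", [])]
```

For example, `parse_s3_bucket_arn("arn:aws:s3:::commoncrawl")` yields the bucket name, which can then be passed to `list_top_level_prefixes` together with the region `us-east-1` from the entry above.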
