Common Crawl

Tags: encyclopedic, internet, machine learning, natural language processing

Resources on AWS

  • Description: Crawl data (WARC and ARC format)
    Resource type: S3 Bucket
    Amazon Resource Name (ARN): arn:aws:s3:::commoncrawl
    AWS Region: us-east-1
    AWS CLI Access (No AWS account required): aws s3 ls s3://commoncrawl/ --no-sign-request
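
The ARN and the CLI command above refer to the same bucket. As a minimal sketch of how they relate, the Python below derives the bucket name from the ARN and rebuilds the listing command, assuming the AWS CLI is installed and on PATH. The helper names (bucket_from_arn, build_ls_command, list_bucket) are hypothetical and are not part of any AWS tooling.

```python
import shutil
import subprocess

# ARN as given in the resource entry above.
ARN = "arn:aws:s3:::commoncrawl"


def bucket_from_arn(arn):
    """Return the bucket name from an S3 ARN (arn:<partition>:s3:::<bucket>)."""
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn" or parts[2] != "s3":
        raise ValueError("not an S3 ARN: %r" % arn)
    # The resource part may be "bucket" or "bucket/key"; keep only the bucket.
    return parts[5].split("/", 1)[0]


def build_ls_command(arn, prefix=""):
    """Build the anonymous listing command as an argument list."""
    bucket = bucket_from_arn(arn)
    return ["aws", "s3", "ls", "s3://%s/%s" % (bucket, prefix), "--no-sign-request"]


def list_bucket(arn, prefix=""):
    """Run the listing and return its raw text output (needs network access)."""
    if shutil.which("aws") is None:
        raise RuntimeError("AWS CLI not found on PATH")
    result = subprocess.run(
        build_ls_command(arn, prefix),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Print the command only; call list_bucket(ARN) to actually run it.
    print(" ".join(build_ls_command(ARN)))
```

Passing the command as an argument list rather than a single string avoids shell quoting issues if a prefix is supplied. The --no-sign-request flag is what allows the listing without AWS credentials.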
