ABEJA CC JA

internet, japanese, natural language processing, web archive

Description

A large Japanese-language corpus created by preprocessing Common Crawl data.

Update Frequency

None

License

This data is available for anyone to use under the Common Crawl Terms of Use.

Documentation

https://github.com/abeja-inc/Megatron-LM/blob/main/docs/dataset/about_data.md

Managed By

ABEJA inc.

Contact

abeja-datascience@abejainc.com

How to Cite

ABEJA CC JA was accessed on DATE from https://registry.opendata.aws/abeja-cc-ja.

Resources on AWS

  • Description: Text corpus
    Resource type: S3 Bucket
    Amazon Resource Name (ARN): arn:aws:s3:::abeja-cc-ja
    AWS Region: ap-northeast-1
    AWS CLI Access (No AWS account required): aws s3 ls --no-sign-request s3://abeja-cc-ja/
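
The same unsigned listing can be sketched in Python with only the standard library, for readers who do not have the AWS CLI installed. This is a minimal sketch, not an official client: the bucket name and region come from the entry above, while the function names (`build_list_url`, `parse_listing`, `list_bucket`) are hypothetical helpers introduced for illustration. It assumes the bucket permits anonymous listing over HTTPS, as the `--no-sign-request` command suggests, and makes no assumption about the key layout inside the bucket.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

BUCKET = "abeja-cc-ja"      # taken from the ARN listed in this entry
REGION = "ap-northeast-1"   # taken from the AWS Region listed in this entry
S3_NS = {"s3": "http://s3.amazonaws.com/doc/2006-03-01/"}  # S3 XML namespace


def build_list_url(prefix="", max_keys=20):
    """Build an unsigned S3 ListObjectsV2 URL (virtual-hosted style)."""
    query = urllib.parse.urlencode({
        "list-type": "2",      # selects the ListObjectsV2 API
        "delimiter": "/",      # group keys by "directory", like `aws s3 ls`
        "prefix": prefix,
        "max-keys": str(max_keys),
    })
    return f"https://{BUCKET}.s3.{REGION}.amazonaws.com/?{query}"


def parse_listing(xml_data):
    """Return (prefixes, [(key, size_in_bytes), ...]) from a listing response."""
    root = ET.fromstring(xml_data)
    prefixes = [
        node.findtext("s3:Prefix", namespaces=S3_NS)
        for node in root.findall("s3:CommonPrefixes", S3_NS)
    ]
    objects = [
        (
            node.findtext("s3:Key", namespaces=S3_NS),
            int(node.findtext("s3:Size", namespaces=S3_NS)),
        )
        for node in root.findall("s3:Contents", S3_NS)
    ]
    return prefixes, objects


def list_bucket(prefix="", max_keys=20):
    """Fetch one page of the listing anonymously (requires network access)."""
    url = build_list_url(prefix, max_keys)
    with urllib.request.urlopen(url, timeout=30) as response:
        return parse_listing(response.read())
```

Calling `list_bucket()` with no arguments returns the top-level prefixes and objects for the first page only. A fuller client would follow the `NextContinuationToken` field of the response to page through larger listings.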
