internet, japanese, natural language processing, web archive
A large Japanese-language corpus created by preprocessing Common Crawl data.
None
This data is available for anyone to use under the Common Crawl Terms of Use.
https://github.com/abeja-inc/Megatron-LM/blob/main/docs/dataset/about_data.md
See all datasets managed by ABEJA inc.
abeja-datascience@abejainc.com
ABEJA CC JA was accessed on DATE from https://registry.opendata.aws/abeja-cc-ja.
arn:aws:s3:::abeja-cc-ja
ap-northeast-1
aws s3 ls --no-sign-request s3://abeja-cc-ja/
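
The same listing can also be approached without the AWS CLI. The following is a minimal Python sketch using only the standard library. It assumes that the bucket accepts anonymous list requests over HTTPS (suggested, but not guaranteed, by the `--no-sign-request` command above) and that the standard S3 virtual-hosted endpoint form applies to this bucket. The helper names `bucket_from_arn`, `list_url`, and `list_top_level` are hypothetical and exist only for illustration.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

# Values copied from the entry above.
ARN = "arn:aws:s3:::abeja-cc-ja"
REGION = "ap-northeast-1"

# XML namespace used by S3 list responses.
S3_NS = {"s3": "http://s3.amazonaws.com/doc/2006-03-01/"}


def bucket_from_arn(arn: str) -> str:
    """Extract the bucket name from an S3 ARN of the form arn:aws:s3:::<bucket>."""
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn" or parts[2] != "s3":
        raise ValueError(f"not an S3 ARN: {arn!r}")
    # Drop any object key that may follow the bucket name.
    return parts[5].split("/", 1)[0]


def list_url(bucket: str, region: str, prefix: str = "", max_keys: int = 20) -> str:
    """Build an unsigned ListObjectsV2 URL (virtual-hosted style; assumed to apply here)."""
    query = urlencode(
        {
            "list-type": "2",
            "delimiter": "/",
            "prefix": prefix,
            "max-keys": str(max_keys),
        }
    )
    return f"https://{bucket}.s3.{region}.amazonaws.com/?{query}"


def list_top_level(bucket: str, region: str, prefix: str = "", max_keys: int = 20) -> list:
    """Fetch one page of the listing anonymously. Requires network access."""
    with urlopen(list_url(bucket, region, prefix, max_keys)) as resp:
        root = ET.fromstring(resp.read())
    prefixes = [e.text for e in root.findall("s3:CommonPrefixes/s3:Prefix", S3_NS)]
    keys = [e.text for e in root.findall("s3:Contents/s3:Key", S3_NS)]
    return prefixes + keys

# Example call (needs network; the result depends on the bucket's current contents):
#   print(list_top_level(bucket_from_arn(ARN), REGION))
```

Only the first page of results is fetched, which is enough to inspect the top-level layout. For bulk downloads the AWS CLI command above remains the simpler route.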