ABEJA CC JA

internet japanese natural language processing web archive

Description

A large Japanese-language corpus created by preprocessing Common Crawl data.

Update Frequency

None

License

This data is available for anyone to use under the Common Crawl Terms of Use.

Documentation

https://github.com/abeja-inc/Megatron-LM/blob/main/docs/dataset/about_data.md

Managed By

ABEJA inc.


Contact

abeja-datascience@abejainc.com

How to Cite

ABEJA CC JA was accessed on DATE from https://registry.opendata.aws/abeja-cc-ja.


Resources on AWS

  • Description: Text corpus
    Resource type: S3 Bucket
    Amazon Resource Name (ARN): arn:aws:s3:::abeja-cc-ja
    AWS Region: ap-northeast-1
    AWS CLI Access (No AWS account required): aws s3 ls --no-sign-request s3://abeja-cc-ja/
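
The same anonymous listing can also be approximated from Python without the AWS CLI. The following is a minimal sketch, not an official access method: it assumes the bucket allows unsigned ListObjectsV2 requests over HTTPS (as the --no-sign-request command above suggests) and uses only the standard library. The helper names build_list_url, parse_keys, and list_keys are hypothetical; the bucket name and region are taken from the resource details above.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

BUCKET = "abeja-cc-ja"      # bucket name from the ARN listed above
REGION = "ap-northeast-1"   # region listed above
S3_NS = "{http://s3.amazonaws.com/doc/2006-03-01/}"  # namespace of S3 XML responses


def build_list_url(bucket=BUCKET, region=REGION, prefix="", max_keys=10):
    """Build an unsigned ListObjectsV2 URL (virtual-hosted style)."""
    query = urllib.parse.urlencode(
        {"list-type": "2", "prefix": prefix, "max-keys": str(max_keys)}
    )
    return f"https://{bucket}.s3.{region}.amazonaws.com/?{query}"


def parse_keys(xml_data):
    """Extract (key, size_in_bytes) pairs from a ListObjectsV2 XML response."""
    root = ET.fromstring(xml_data)
    return [
        (item.findtext(f"{S3_NS}Key"), int(item.findtext(f"{S3_NS}Size")))
        for item in root.iter(f"{S3_NS}Contents")
    ]


def list_keys(prefix="", max_keys=10):
    """Fetch one page of the listing anonymously (assumes network access)."""
    url = build_list_url(prefix=prefix, max_keys=max_keys)
    with urllib.request.urlopen(url) as resp:
        return parse_keys(resp.read())
```

Calling list_keys(max_keys=5) would return up to five (key, size) pairs from the top of the bucket. Because each request returns only one page, a full listing would need to follow the continuation token in the response, which this sketch omits.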
