ABEJA CC JA

internet, japanese, natural language processing, web archive

Description

A large Japanese-language corpus created by preprocessing Common Crawl data.

Update Frequency

None

License

This data is available for anyone to use under the Common Crawl Terms of Use.

Documentation

https://github.com/abeja-inc/Megatron-LM/blob/main/docs/dataset/about_data.md

Managed By

See all datasets managed by ABEJA inc.

Contact

abeja-datascience@abejainc.com

How to Cite

ABEJA CC JA was accessed on DATE from https://registry.opendata.aws/abeja-cc-ja.

Usage Examples

Tutorials

Tutorial of ABEJA CC JA dataset by Kyo Hattori

Publications

Building a Large-Scale Japanese Corpus from Common Crawl and Its Preprocessing by Kyo Hattori

Resources on AWS

Description

Text corpus

Resource type

S3 Bucket

Amazon Resource Name (ARN)

arn:aws:s3:::abeja-cc-ja

AWS Region

ap-northeast-1

AWS CLI Access (No AWS account required)

aws s3 ls --no-sign-request s3://abeja-cc-ja/
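
For readers without the AWS CLI, the same anonymous listing can be sketched in Python with only the standard library. This is a minimal sketch, not an official client: it assumes the bucket accepts unsigned ListObjectsV2 requests over HTTPS (as the `--no-sign-request` command above suggests), and the helper names `build_list_url`, `parse_listing`, and `list_bucket` are hypothetical. The bucket name and region are taken from the ARN and AWS Region fields above.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

BUCKET = "abeja-cc-ja"      # from the ARN listed above
REGION = "ap-northeast-1"   # from the AWS Region listed above
S3_NS = {"s3": "http://s3.amazonaws.com/doc/2006-03-01/"}  # S3 XML namespace


def build_list_url(prefix="", max_keys=20):
    """Build an unsigned ListObjectsV2 URL (virtual-hosted style)."""
    query = urllib.parse.urlencode({
        "list-type": "2",       # selects the ListObjectsV2 API
        "delimiter": "/",       # group keys by "directory"
        "prefix": prefix,
        "max-keys": str(max_keys),
    })
    return f"https://{BUCKET}.s3.{REGION}.amazonaws.com/?{query}"


def parse_listing(xml_data):
    """Return (prefixes, objects) parsed from a ListObjectsV2 XML response."""
    root = ET.fromstring(xml_data)
    prefixes = [
        node.findtext("s3:Prefix", namespaces=S3_NS)
        for node in root.findall("s3:CommonPrefixes", S3_NS)
    ]
    objects = [
        (node.findtext("s3:Key", namespaces=S3_NS),
         int(node.findtext("s3:Size", namespaces=S3_NS)))
        for node in root.findall("s3:Contents", S3_NS)
    ]
    return prefixes, objects


def list_bucket(prefix="", max_keys=20):
    """Fetch one page of the listing anonymously (assumes public list access)."""
    url = build_list_url(prefix, max_keys)
    with urllib.request.urlopen(url, timeout=30) as resp:
        return parse_listing(resp.read())


if __name__ == "__main__":
    prefixes, objects = list_bucket()
    for p in prefixes:
        print("PRE", p)
    for key, size in objects:
        print(size, key)
```

Only one page of results is fetched here; a longer listing would need to follow the `NextContinuationToken` element in the response, which this sketch leaves out for brevity.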
