Sudachi Language Resources

natural language processing

Description

Japanese dictionaries and pretrained models (word embeddings and language models) for natural language processing. SudachiDict is the dictionary for Sudachi, a Japanese tokenizer (morphological analyzer). chiVe is a set of pretrained Japanese word embeddings (word vectors), trained on NWJC, the ultra-large-scale web corpus from the National Institute for Japanese Language and Linguistics, after tokenization with Sudachi. chiTra is a library for using large-scale pretrained language models with the Japanese tokenizer SudachiPy.
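
As a rough illustration of how the tokenizer and dictionary fit together, the following minimal sketch splits a sentence into surface forms with SudachiPy. It assumes the `sudachipy` package and a dictionary package such as `sudachidict_core` are installed from PyPI; the helper name `tokenize_surfaces` is hypothetical and not part of any Sudachi API.

```python
def tokenize_surfaces(text: str) -> list[str]:
    """Split Japanese text into surface forms using SudachiPy (sketch)."""
    # Imported lazily so this module still loads when SudachiPy is absent.
    from sudachipy import dictionary, tokenizer

    # Dictionary() loads the installed default dictionary
    # (assumption: the sudachidict_core package is available).
    tokenizer_obj = dictionary.Dictionary().create()

    # Split mode C yields the longest units; mode A yields the shortest.
    mode = tokenizer.Tokenizer.SplitMode.C
    return [m.surface() for m in tokenizer_obj.tokenize(text, mode)]
```

Calling `tokenize_surfaces("国家公務員")` returns a list of strings; the exact segmentation depends on the split mode and the dictionary edition in use.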

Update Frequency

The dictionaries are updated every few months to add neologisms and to fix existing entries.

License

Apache-2.0

Documentation

https://worksapplications.github.io/Sudachi/

Managed By

Works Applications


Contact

sudachi@worksap.co.jp

How to Cite

Sudachi Language Resources was accessed on DATE from https://registry.opendata.aws/sudachi.

Resources on AWS

  • Description
    SudachiDict: binary format of the morphological analysis dictionaries. chiVe: pretrained word embeddings in various formats.
    Resource type
    S3 Bucket
    Amazon Resource Name (ARN)
    arn:aws:s3:::sudachi
    AWS Region
    ap-northeast-1
    AWS CLI Access (No AWS account required)
    aws s3 ls --no-sign-request s3://sudachi/
  • Description
    CloudFront CDN mirror
    Resource type
    CloudFront Distribution
    Hostname
    d2ej7fkh96fzlu.cloudfront.net
    AWS Region
    ap-northeast-1
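
The resources above can also be reached over plain HTTPS without the AWS CLI. The sketch below uses only the Python standard library: it builds object URLs for the bucket and the CloudFront mirror, and lists keys through the S3 ListObjectsV2 REST endpoint. The bucket name, region, and hostname come from this entry; the function names are hypothetical, and two things are assumptions rather than facts stated here: that the bucket permits anonymous listing over HTTPS (suggested by the `--no-sign-request` command), and that the mirror serves objects under the same keys as the bucket.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

BUCKET = "sudachi"                          # from arn:aws:s3:::sudachi
REGION = "ap-northeast-1"                   # region listed in this entry
CDN_HOST = "d2ej7fkh96fzlu.cloudfront.net"  # CloudFront mirror hostname
S3_NS = "{http://s3.amazonaws.com/doc/2006-03-01/}"  # S3 XML namespace


def s3_url(key: str = "") -> str:
    """Virtual-hosted-style HTTPS URL for an object key in the bucket."""
    return f"https://{BUCKET}.s3.{REGION}.amazonaws.com/{urllib.parse.quote(key)}"


def cdn_url(key: str) -> str:
    """URL on the CloudFront mirror (assumes the same key layout as S3)."""
    return f"https://{CDN_HOST}/{urllib.parse.quote(key)}"


def parse_keys(xml_text: str) -> list[str]:
    """Extract object keys from a ListObjectsV2 XML response body."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.iter(f"{S3_NS}Key")]


def list_keys(prefix: str = "", max_keys: int = 50) -> list[str]:
    """Anonymously list up to max_keys object keys under prefix (network call)."""
    query = urllib.parse.urlencode(
        {"list-type": "2", "prefix": prefix, "max-keys": str(max_keys)}
    )
    with urllib.request.urlopen(f"{s3_url()}?{query}", timeout=30) as resp:
        return parse_keys(resp.read().decode("utf-8"))
```

For example, `for key in list_keys(): print(cdn_url(key))` would print one mirror URL per listed object. No specific key names are shown here because the bucket layout is not described in this entry.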
