Sudachi Language Resources - Registry of Open Data on AWS

Description

Japanese dictionaries and pre-trained models (word embeddings and language models) for natural language processing. SudachiDict is the dictionary for Sudachi, a Japanese tokenizer (morphological analyzer). chiVe is a set of pretrained Japanese word embeddings (word vectors), trained on NWJC, the ultra-large-scale web corpus built by the National Institute for Japanese Language and Linguistics, as analyzed by Sudachi. chiTra is a library for using large-scale pre-trained language models with the Japanese tokenizer SudachiPy.

Update Frequency

The dictionaries are updated every few months to add neologisms and fix existing entries.

License

Apache-2.0

Documentation

https://worksapplications.github.io/Sudachi/

Managed By

Works Applications

Contact

sudachi@worksap.co.jp

How to Cite

Sudachi Language Resources was accessed on DATE from https://registry.opendata.aws/sudachi.

Usage Examples

Tutorials

analysis-sudachi Tutorial by Works Applications
chiTra Tutorial by Works Applications
chiVe Tutorial by Works Applications
Sudachi Tutorial by Works Applications
SudachiPy Tutorial by Works Applications

Tools & Applications

analysis-sudachi: Sudachi plugin for Elasticsearch by Works Applications
chiTra: SudachiPy for Hugging Face Transformers by Works Applications
jdartsclone: TRIE Data Structure using Double-Array by Works Applications
Kintoki: Dependency Parser by Works Applications
Sudachi: Japanese Tokenizer for Business by Works Applications
sudachidict_core on pypi.python.org - a Python module to download and install SudachiDict for the Python tokenizer by Works Applications
sudachidict_full on pypi.python.org - a Python module to download and install SudachiDict for the Python tokenizer by Works Applications
sudachidict_small on pypi.python.org - a Python module to download and install SudachiDict for the Python tokenizer by Works Applications
SudachiPy: Python version of Sudachi by Works Applications

Publications

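chiVe 2.0: SudachiとNWJCを用いた実用的な日本語単語ベクトルの実現に向けて [chiVe 2.0: Toward practical Japanese word vectors using Sudachi and NWJC] by 河村宗一郎, 久本空海, 真鍋陽俊, 髙岡一馬, 内田佳孝, 岡照晃, 浅原正幸
chiVe: 製品利用可能な日本語単語ベクトル資源の実現へ向けて～形態素解析器Sudachiと超大規模ウェブコーパスNWJCによる分散表現の獲得と改良～ [chiVe: Toward Japanese word vector resources usable in products: acquiring and improving distributed representations with the morphological analyzer Sudachi and the ultra-large-scale web corpus NWJC] by 久本空海, 山村崇, 勝田哲弘, 竹林佑斗, 髙岡一馬, 内田佳孝, 岡照晃, 浅原正幸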
Sudachi: a Japanese Tokenizer for Business by Kazuma Takaoka, Sorami Hisamoto, Noriko Kawahara, Miho Sakamoto, Yoshitaka Uchida, Yuji Matsumoto
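形態素解析器『Sudachi』のための大規模辞書開発 [Development of a large-scale dictionary for the morphological analyzer Sudachi] by 坂本美保, 川原典子, 久本空海, 髙岡一馬, 内田佳孝
複数粒度の分割結果に基づく日本語単語分散表現 [Japanese word embeddings based on multi-granularity segmentation results] by 真鍋陽俊, 岡照晃, 海川祥毅, 髙岡一馬, 内田佳孝, 浅原正幸
詳細化した同義関係をもつ同義語辞書の作成 [Building a synonym dictionary with fine-grained synonym relations] by 高岡一馬, 岡部裕子, 川原典子, 坂本美保, 内田佳孝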

Resources on AWS

Description

SudachiDict: Binary format of the morphological analysis dictionaries
chiVe: Pretrained word embeddings in various formats

Resource type

S3 Bucket

Amazon Resource Name (ARN)

arn:aws:s3:::sudachi

AWS Region

ap-northeast-1

AWS CLI Access (No AWS account required)

aws s3 ls --no-sign-request s3://sudachi/
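
The same unsigned listing can be sketched in Python using only the standard library. This is a minimal sketch, assuming the bucket answers anonymous ListObjectsV2 requests over the standard S3 virtual-hosted REST endpoint (as the `--no-sign-request` command above suggests). The bucket name and region come from this page; the function names `list_url`, `parse_listing`, and `list_objects` are illustrative, not part of any Sudachi tool.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

BUCKET = "sudachi"          # from the ARN arn:aws:s3:::sudachi
REGION = "ap-northeast-1"   # from the AWS Region entry
# XML namespace used by S3 list responses.
S3_NS = {"s3": "http://s3.amazonaws.com/doc/2006-03-01/"}


def list_url(prefix="", max_keys=100):
    """Build an unsigned ListObjectsV2 URL for the bucket."""
    query = urllib.parse.urlencode(
        {"list-type": "2", "prefix": prefix, "max-keys": str(max_keys)}
    )
    return f"https://{BUCKET}.s3.{REGION}.amazonaws.com/?{query}"


def parse_listing(xml_text):
    """Extract (key, size in bytes) pairs from a ListBucketResult document."""
    root = ET.fromstring(xml_text)
    return [
        (
            item.findtext("s3:Key", namespaces=S3_NS),
            int(item.findtext("s3:Size", namespaces=S3_NS)),
        )
        for item in root.findall("s3:Contents", S3_NS)
    ]


def list_objects(prefix="", max_keys=100):
    """Fetch one page of the listing (assumes anonymous access is allowed)."""
    with urllib.request.urlopen(list_url(prefix, max_keys), timeout=30) as resp:
        return parse_listing(resp.read().decode("utf-8"))
```

Calling `list_objects()` returns only the first page of results (S3 caps a page at 1000 keys); a complete listing would need to follow the continuation token in the response, which this sketch omits.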

Description

CloudFront CDN mirror

Resource type

CloudFront Distribution

Hostname

d2ej7fkh96fzlu.cloudfront.net

AWS Region

ap-northeast-1
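
For downloads through the CDN, a small helper can turn an object key into a mirror URL. This is a hedged sketch: it assumes the CloudFront distribution serves the same object keys as the S3 bucket, which this page does not state explicitly. The hostname is taken from the entry above; `mirror_url`, `download`, and any key passed to them are hypothetical names for illustration.

```python
import shutil
import urllib.parse
import urllib.request

CDN_HOST = "d2ej7fkh96fzlu.cloudfront.net"  # hostname listed on this page


def mirror_url(key):
    """Map an object key to an HTTPS URL on the CDN mirror.

    Assumption: the mirror exposes objects under the same keys as the bucket.
    """
    # Percent-encode the key but keep "/" so the path structure is preserved.
    return f"https://{CDN_HOST}/{urllib.parse.quote(key.lstrip('/'))}"


def download(key, dest_path):
    """Stream one object from the mirror into a local file."""
    with urllib.request.urlopen(mirror_url(key), timeout=60) as resp:
        with open(dest_path, "wb") as out:
            shutil.copyfileobj(resp, out)
```

Real keys should be taken from a bucket listing rather than guessed, since the object layout is not described here.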