natural language processing
Japanese dictionaries and pre-trained models (word embeddings and language models) for natural language processing. SudachiDict is the dictionary for Sudachi, a Japanese tokenizer (morphological analyzer). chiVe is a set of pre-trained Japanese word embeddings (word vectors), trained on NWJC, the ultra-large-scale web corpus from the National Institute for Japanese Language and Linguistics, as analyzed by Sudachi. chiTra is a library for using large-scale pre-trained language models with the Japanese tokenizer SudachiPy.
The dictionaries are updated every few months to add neologisms and to fix existing entries.
Apache-2.0
https://worksapplications.github.io/Sudachi/
See all datasets managed by Works Applications.
Sudachi Language Resources was accessed on DATE from https://registry.opendata.aws/sudachi.
arn:aws:s3:::sudachi
ap-northeast-1
aws s3 ls --no-sign-request s3://sudachi/
d2ej7fkh96fzlu.cloudfront.net
ap-northeast-1
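For readers who prefer not to install the AWS CLI, the unsigned `aws s3 ls` listing above can be approximated with the Python standard library. The following is a minimal sketch, not an official client: it assumes the bucket accepts anonymous requests to the S3 REST `ListObjectsV2` endpoint (as the `--no-sign-request` flag suggests), and the function names `listing_url`, `parse_listing`, and `list_bucket` are hypothetical helpers introduced for illustration. Only the bucket name and region come from the listing itself.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

BUCKET = "sudachi"          # from the ARN arn:aws:s3:::sudachi
REGION = "ap-northeast-1"   # region listed for the bucket
S3_NS = "{http://s3.amazonaws.com/doc/2006-03-01/}"  # S3 XML namespace


def listing_url(prefix=""):
    """Build an unsigned ListObjectsV2 URL for one 'directory' level."""
    query = urllib.parse.urlencode(
        {"list-type": "2", "delimiter": "/", "prefix": prefix}
    )
    return f"https://{BUCKET}.s3.{REGION}.amazonaws.com/?{query}"


def parse_listing(xml_text):
    """Return (sub-prefixes, object keys) from a ListBucketResult document."""
    root = ET.fromstring(xml_text)
    prefixes = [
        e.text for e in root.findall(f"{S3_NS}CommonPrefixes/{S3_NS}Prefix")
    ]
    keys = [e.text for e in root.findall(f"{S3_NS}Contents/{S3_NS}Key")]
    return prefixes, keys


def list_bucket(prefix=""):
    """Fetch and parse the first page of results (assumes anonymous access)."""
    with urllib.request.urlopen(listing_url(prefix), timeout=30) as resp:
        return parse_listing(resp.read())


if __name__ == "__main__":
    dirs, files = list_bucket()
    for name in dirs + files:
        print(name)
```

This sketch reads only the first page of results (S3 returns at most 1,000 entries per request), so a complete listing would also need to follow the continuation token in the response.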