Dictionaries in kagome are "sets of morphemes" of type dict.Dict; they differ in the information each dictionary contains.
IPADIC has a vocabulary of about 400,000 words and UniDIC about 750,000. IPADIC suits memory-limited environments and most use cases, while UniDIC, which uses shorter lexical units, is better suited to splitting words for search.
For the pros and cons of each dictionary, see "About the dictionary" | kagome | Wiki @ GitHub
- From issue #293
t.Tokenize(s) is an alias of t.Analyze(s, tokenizer.Normal). The argument tokenizer.Normal specifies the segmentation mode used during analysis.
kagome has three segmentation modes:
- Normal: Regular segmentation
- Search: Use a heuristic to do additional segmentation useful for search
- Extended: Similar to Search mode, but also splits unknown words into uni-grams
- From issue #274
As you may know, Japanese and several other Asian languages are written without spaces between words. The word "wakati" means "word divide" in Japanese. Thus, wakati splits the text into word tokens. Imagine the following.
-
Wakati("thistextwritingissomewhatsimilartotheasianstyle.")
--> this text writing is somewhat similar to the asian style.
Tokenizer.Wakati() simply divides the text into words, which can then be space-separated. It is typically used to build metadata for full-text search, e.g. FTS5 in SQLite3.
Tokenizer.Tokenize() is similar to Wakati(), but each "wakatized" chunk carries more information. It is mostly used for grammatical analysis, text linting, and so on.
- From issue #274
To do the wakati thing, a word dictionary is needed to identify the proper names, nouns, etc. in the text.
The difference between dictionaries is simply the number of words. The default built-in dictionary supports most of the important proper names, nouns, verbs, etc.
The "pro" of using a larger dictionary is, therefore, that it can separate words more accurately. Imagine the following.
-
Mr.McIntoshandMr.McNamara
--> Mr. Mc Into sh and Mr. Mc Namara
or  Mr. McIntosh and Mr. McNamara
The "cons" are higher memory usage and slower processing.