- From issue #293
`t.Tokenize(s)` is an alias of `t.Analyze(s, tokenizer.Normal)`. The argument `tokenizer.Normal` specifies the segmentation mode used during analysis. `kagome` supports several segmentation modes (a usage sketch follows the list):
- Normal: Regular segmentation
- Search: Use a heuristic to do additional segmentation useful for search
- Extended: Similar to search mode, but also uni-gram unknown words
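A minimal sketch of trying each mode, assuming kagome v2 with the bundled IPA dictionary; the sample sentence and the `OmitBosEos` option follow the upstream examples, and the exact output is illustrative only:

```go
package main

import (
	"fmt"
	"strings"

	"github.com/ikawaha/kagome-dict/ipa"
	"github.com/ikawaha/kagome/v2/tokenizer"
)

func main() {
	t, err := tokenizer.New(ipa.Dict(), tokenizer.OmitBosEos())
	if err != nil {
		panic(err)
	}
	// "関西国際空港" (Kansai International Airport) is a compound noun that
	// the three modes segment differently.
	const s = "関西国際空港"
	for _, mode := range []tokenizer.TokenizeMode{tokenizer.Normal, tokenizer.Search, tokenizer.Extended} {
		var surfaces []string
		for _, token := range t.Analyze(s, mode) {
			surfaces = append(surfaces, token.Surface)
		}
		fmt.Printf("%v:\t%s\n", mode, strings.Join(surfaces, " / "))
	}
}
```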
- From issue #274
As you may know, most Asian texts are not word-separated. The word "wakati" means "word division" in Japanese; thus, wakati divides the text into word tokens. Imagine the following:
- `Wakati("thistextwritingissomewhatsimilartotheasianstyle.")`
  --> this text writing is somewhat similar to the asian style.
`Tokenizer.Wakati()` simply divides the text into space-separated words. It is typically used to create metadata for a full-text search, e.g. FTS5 in SQLite3.
`Tokenizer.Tokenize()` is similar to `Wakati()`, but each segmented chunk (token) carries more information, such as part-of-speech features. It is mostly used for grammar analysis, text linting, and so on.
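A short sketch contrasting the two calls, again assuming kagome v2 with the IPA dictionary (the sentence is the upstream sample; output shown here is indicative):

```go
package main

import (
	"fmt"
	"strings"

	"github.com/ikawaha/kagome-dict/ipa"
	"github.com/ikawaha/kagome/v2/tokenizer"
)

func main() {
	t, err := tokenizer.New(ipa.Dict(), tokenizer.OmitBosEos())
	if err != nil {
		panic(err)
	}

	const s = "すもももももももものうち"

	// Wakati: just the space-separable word surfaces, e.g. for an FTS index.
	fmt.Println(strings.Join(t.Wakati(s), " "))

	// Tokenize: the same segmentation, but each token also carries
	// extra information such as part-of-speech features.
	for _, token := range t.Tokenize(s) {
		fmt.Printf("%s\t%s\n", token.Surface, strings.Join(token.Features(), ","))
	}
}
```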
- From issue #274
To do the wakati segmentation, a word dictionary is needed to determine the proper names, nouns, etc. of a word.
The difference between dictionaries is simply the number of words. The default built-in dictionary covers most of the important proper names, nouns, verbs, etc.
The "pro" of using a different dictionary is, therefore, that it can separate words more accurately. Imagine the following:
- `Mr.McIntoshandMr.McNamara`
  --> Mr. Mc Into sh and Mr. Mc Namara
  or  Mr. McIntosh and Mr. McNamara
And the "cons" would be increased memory usage and slower processing.
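As a sketch of how a different dictionary is plugged in: only the argument to `tokenizer.New` changes. This assumes the separately published `github.com/ikawaha/kagome-dict/uni` package (a UniDic-based dictionary); the common default would use `ipa.Dict()` from `github.com/ikawaha/kagome-dict/ipa` instead.

```go
package main

import (
	"fmt"
	"strings"

	"github.com/ikawaha/kagome-dict/uni" // UniDic-based dictionary; larger download than the default IPA dictionary
	"github.com/ikawaha/kagome/v2/tokenizer"
)

func main() {
	// Same tokenizer API, different dictionary backing the segmentation.
	t, err := tokenizer.New(uni.Dict(), tokenizer.OmitBosEos())
	if err != nil {
		panic(err)
	}
	fmt.Println(strings.Join(t.Wakati("すもももももももものうち"), " "))
}
```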