FAQ

What is the difference between `Tokenizer.Analyze()` and `Tokenizer.Tokenize()`?

From issue #293

`t.Tokenize(s)` is an alias of `t.Analyze(s, tokenizer.Normal)`. The `tokenizer.Normal` argument selects the segmentation mode used during analysis.

kagome supports three segmentation modes:

Normal: Regular segmentation
Search: Use a heuristic to do additional segmentation useful for search
Extended: Similar to Search mode, but also splits unknown words into uni-grams

What is the difference between `Tokenizer.Wakati()` and `Tokenizer.Tokenize()`?

From issue #274

As you may know, text in Japanese and several other Asian languages is written without spaces between words. The word "wakati" means "word divide" in Japanese. Thus, `Wakati()` divides the text into word tokens. Imagine the following:

Wakati("thistextwritingissomewhatsimilartotheasianstyle.") --> this text writing is somewhat similar to the asian style.

`Tokenizer.Wakati()` simply divides the text into words, returning only their surface strings, which can then be joined with spaces. It is typically used to build metadata for full-text search, e.g. FTS5 in SQLite3.

`Tokenizer.Tokenize()` is similar to `Wakati()`, but each chunk it returns carries more information, such as part-of-speech features. It is mostly used for grammatical analysis, text linting, and similar tasks.

What are the pros/cons of using the different dictionaries?

From issue #274

To split text into words (the "wakati" step), a word dictionary is needed to recognize each word and determine what it is: a proper name, a noun, etc.

The difference between dictionaries is simply the number of words. The default built-in dictionary supports most of the important proper names, nouns, verbs, etc.

The "pro" of using a larger dictionary is, therefore, that it can separate words more accurately. Imagine the following:

Mr.McIntoshandMr.McNamara --> "Mr. Mc Into sh and Mr. Mc Namara" (smaller dictionary) or "Mr. McIntosh and Mr. McNamara" (larger dictionary)

The "cons" are higher memory usage and slower processing.
