KEINOS edited this page Dec 6, 2022 · 5 revisions

What is the difference between Tokenizer.Analyze() and Tokenizer.Tokenize()?

t.Tokenize(s) is shorthand for t.Analyze(s, tokenizer.Normal). The second argument, tokenizer.Normal, selects the segmentation mode used during analysis.

kagome provides three segmentation modes:

  • Normal: Regular segmentation
  • Search: Use a heuristic to do additional segmentation useful for search
  • Extended: Similar to Search mode, but also splits unknown words into uni-grams (single characters)

What is the difference between Tokenizer.Wakati() and Tokenizer.Tokenize()?

Japanese, like several other Asian languages, is written without spaces between words. The word "wakati" means "word dividing" in Japanese, so wakati splits the text into word tokens. For example:

  • Wakati("thistextwritingissomewhatsimilartotheasianstyle.") --> this text writing is somewhat similar to the asian style.

Tokenizer.Wakati() simply divides the text into words. A typical use is preparing space-separated text to index for full-text search, e.g. FTS5 in SQLite3.

Tokenizer.Tokenize() segments the text in the same way, but each resulting token carries additional information, such as its part of speech. It is mostly used for grammatical analysis, text linting, and similar tasks.
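The difference between the two kinds of output can be sketched without kagome at all. The toy program below is only an illustration under stated assumptions: it uses a greedy longest-match against a tiny hand-made English dictionary, which is not how kagome segments text, and the names toyDict, token, tokenize and wakati are hypothetical, not part of the kagome API.

```go
package main

import (
	"fmt"
	"strings"
)

// token is a hypothetical stand-in for a richer analysis result:
// the surface form plus a part-of-speech label.
type token struct {
	Surface string
	POS     string
}

// toyDict maps known words to a part-of-speech label (illustrative data only).
var toyDict = map[string]string{
	"this": "determiner", "text": "noun", "writing": "noun",
	"is": "verb", "somewhat": "adverb", "similar": "adjective",
	"to": "preposition", "the": "article", "asian": "adjective",
	"style": "noun",
}

// tokenize splits text by greedy longest match against dict.
// A character that starts no known word becomes a one-character "unknown" token.
func tokenize(text string, dict map[string]string) []token {
	runes := []rune(text)
	var out []token
	for i := 0; i < len(runes); {
		end, pos := i+1, "unknown"
		// Try the longest candidate first and shrink until a word matches.
		for j := len(runes); j > i; j-- {
			if p, ok := dict[string(runes[i:j])]; ok {
				end, pos = j, p
				break
			}
		}
		out = append(out, token{Surface: string(runes[i:end]), POS: pos})
		i = end
	}
	return out
}

// wakati keeps only the surface forms and joins them with spaces.
func wakati(text string, dict map[string]string) string {
	toks := tokenize(text, dict)
	words := make([]string, len(toks))
	for i, t := range toks {
		words[i] = t.Surface
	}
	return strings.Join(words, " ")
}

func main() {
	const s = "thistextwritingissomewhatsimilartotheasianstyle"

	// Wakati-style output: words only.
	fmt.Println(wakati(s, toyDict))
	// this text writing is somewhat similar to the asian style

	// Tokenize-style output: each word with extra information attached.
	for _, t := range tokenize(s, toyDict) {
		fmt.Printf("%s\t%s\n", t.Surface, t.POS)
	}
}
```

The wakati-style string is what you would feed to a full-text index, while the token list is what a grammar checker or linter would inspect.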

What are the pros/cons of using the different dictionaries?

To segment text, the tokenizer needs a word dictionary, which it uses to recognize words and classify them as proper nouns, common nouns, verbs, and so on.

The main difference between dictionaries is the number of words they contain. The default built-in dictionary covers most of the important proper nouns, common nouns, verbs, etc.

The advantage of a larger dictionary is, therefore, that it can separate words more accurately. For example:

  • Mr.McIntoshandMr.McNamara --> "Mr. Mc Into sh and Mr. Mc Namara" with a small dictionary, or "Mr. McIntosh and Mr. McNamara" with a larger one

The disadvantages are higher memory usage and slower processing.
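The effect of dictionary size can be reproduced with a toy segmenter. This is a minimal sketch, not kagome's implementation: it uses greedy longest-match, groups consecutive unmatched characters into one chunk, and the segment function and both dictionaries are made up for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// segment splits text by greedy longest match against dict.
// Runs of characters that match no entry are emitted as one unknown chunk.
func segment(text string, dict map[string]bool) string {
	runes := []rune(text)
	var words []string
	var unknown []rune

	// flush emits any pending unknown characters as a single chunk.
	flush := func() {
		if len(unknown) > 0 {
			words = append(words, string(unknown))
			unknown = nil
		}
	}

	for i := 0; i < len(runes); {
		matched := 0
		// Try the longest candidate first and shrink until a word matches.
		for j := len(runes); j > i; j-- {
			if dict[string(runes[i:j])] {
				matched = j - i
				break
			}
		}
		if matched == 0 {
			unknown = append(unknown, runes[i])
			i++
			continue
		}
		flush()
		words = append(words, string(runes[i:i+matched]))
		i += matched
	}
	flush()
	return strings.Join(words, " ")
}

func main() {
	// A small dictionary that lacks the two surnames.
	small := map[string]bool{"Mr.": true, "Mc": true, "Into": true, "and": true}

	// A larger dictionary: everything in small, plus the surnames.
	large := map[string]bool{"McIntosh": true, "McNamara": true}
	for w := range small {
		large[w] = true
	}

	const s = "Mr.McIntoshandMr.McNamara"
	fmt.Println(segment(s, small)) // Mr. Mc Into sh and Mr. Mc Namara
	fmt.Println(segment(s, large)) // Mr. McIntosh and Mr. McNamara
}
```

The larger dictionary gives the better split, at the cost of more entries to hold in memory and to search.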
