
Chinese language support #39

Closed
znzn40 opened this issue Jul 20, 2014 · 16 comments

Comments

@znzn40

znzn40 commented Jul 20, 2014

Is there any plan to support the Chinese language in forage?

@fergiemcdowall
Owner

That would be really cool. We would need a contributor who speaks Chinese.

@RobinQu

RobinQu commented Aug 11, 2014

+1. I'm trying to find an API for adding a custom analyzer.

@fergiemcdowall
Owner

Do you know of any good libraries that can analyse Chinese text?

@RobinQu

RobinQu commented Aug 11, 2014

These are very stable implementations:

@RobinQu

RobinQu commented Aug 12, 2014

Any progress? Or is there anything I can help with?
Could you tell me where to start if we are adding a word analyzer for Chinese?

@fergiemcdowall
Owner

The following things need to be done:

  1. Find a Chinese dataset that can be used for testing, and format it so that it is similar to the file at test/testdata/reuters-004.json.
  2. Find a library that generates term frequency for Chinese text, and verify that it works on the test data.
  3. Create an option in search-index for indexing different languages, and add Chinese and English as options.

To make progress on this issue we need a Chinese (Mandarin?) speaker to do 1 and 2.
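
For item 1, a rough idea of the shape such a test file could take (the field names below are illustrative guesses, not the actual schema of reuters-004.json):

    var testBatch = [
      {
        id: '1',
        title: '天气很好',               // "Nice weather!"
        body: '今天天气很好，适合出门。'   // "The weather is nice today, good for going out."
      }
    ];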

@RobinQu

RobinQu commented Aug 14, 2014

I am a native Chinese speaker and I will try what you suggest this weekend.

@fergiemcdowall
Owner

Great! Even if you could point to a test dataset in Chinese that already exists, that would be a great help.

@RobinQu

RobinQu commented Aug 17, 2014

Hi,

  1. The test data is ready. It's generated from the RSS feed of the Chinese edition of reuters.com.

    https://github.com/RobinQu/forage/blob/master/test/testdata/reuters-001.json

  2. The three open-source projects mentioned above don't have any interface for computing TF directly from a given text. I'm afraid we have to do text segmentation as a first step. As you know, Chinese words are not separated by whitespace like ' '; e.g. "天气很好" means "Nice weather!". Before any further processing we need to extract the word units from the whole text, which is what segmentation does. Assuming we have a segment function:

    var words = segment("天气很好");
    // words = ["天气", "很", "好"]

    We would then calculate TF from the segmented word list (see the sketch after this list). I am still researching this and trying to contact the authors of those libraries.

  3. I haven't had enough time to read all your code yet........LOL
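
To make item 2 concrete, here is a minimal sketch of that pipeline: segment first, then count term frequency over the segmented word list. It assumes a segmentation library such as nodejieba, but any of the projects mentioned above could stand in for it:

    // Sketch only: nodejieba is one possible Chinese segmenter.
    var nodejieba = require('nodejieba');

    // Split raw Chinese text into word units, since there is no whitespace
    // between words to tokenize on.
    function segment(text) {
      return nodejieba.cut(text); // e.g. "天气很好" -> something like ["天气", "很", "好"]
    }

    // Plain term-frequency count over a segmented word list.
    function termFrequency(words) {
      var tf = {};
      words.forEach(function (word) {
        tf[word] = (tf[word] || 0) + 1;
      });
      return tf;
    }

    console.log(termFrequency(segment('天气很好')));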

@fergiemcdowall
Owner

Great! Your dataset is probably worthy of its own project and GitHub repo.

The fastest way to get Chinese text indexed by Forage is to improve the tf-idf functionality of https://github.com/NaturalNode/natural . If natural can be made to produce document vectors for Chinese text, then Forage can index Chinese text. In fact it might work already: I notice that natural can produce Chinese n-grams, so maybe the tf-idf also works for Chinese?
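
A rough way to probe that (a sketch only; it assumes natural's TfIdf addDocument() accepts a pre-tokenized array, which sidesteps its English-oriented tokenizer):

    var natural = require('natural');
    var tfidf = new natural.TfIdf();

    // Pre-segmented word lists, written out by hand here; in practice they
    // would come from a Chinese segmentation library.
    tfidf.addDocument(['天气', '很', '好']);         // "天气很好"
    tfidf.addDocument(['今天', '天气', '不', '好']); // "今天天气不好"

    // List the weighted terms of the first document.
    tfidf.listTerms(0).forEach(function (item) {
      console.log(item.term + ': ' + item.tfidf);
    });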

@RobinQu

RobinQu commented Aug 19, 2014

I skimmed the source code of lib/natural/tfidf; the culprit is here: https://github.com/NaturalNode/natural/blob/master/lib/natural/tfidf/tfidf.js#L26

They only use English stop words.

The whole project is not designed with international languages in mind; there are a lot of hard-coded assumptions. See NaturalNode/natural#159 and NaturalNode/natural#177.

The major concern is that natural lacks Chinese stop words, and most of its API neglects the fact that the world has many commonly used languages other than English.

I'll try to hack the TF-IDF module of natural first.
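
In the meantime, one workaround that avoids patching natural is to strip Chinese stop words from the segmented tokens before they ever reach the TF-IDF step. A rough sketch (the stop-word list below is a tiny illustrative sample, not a curated one):

    // Tiny illustrative Chinese stop-word list; a real one would be much larger.
    var chineseStopwords = ['的', '了', '是', '在', '和', '很'];

    // Filter stop words out of an already-segmented word list so that natural's
    // hard-coded English stop-word handling never has to deal with them.
    function removeStopwords(words) {
      return words.filter(function (word) {
        return chineseStopwords.indexOf(word) === -1;
      });
    }

    console.log(removeStopwords(['天气', '很', '好'])); // [ '天气', '好' ]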

@RobinQu

RobinQu commented Aug 19, 2014

I think forage is a strong competitor to Elasticsearch, which is too heavy and not as portable as they claim.

Have you ever done any benchmarks against Elasticsearch?

@fergiemcdowall
Owner

Re hacking tf-idf in Natural: that sounds like a good plan!

Yes, Forage could be a competitor to Elasticsearch for some use cases. I work a bit with Elasticsearch, but haven't yet done any benchmarks; I should definitely do that.

@fergiemcdowall
Owner

I recently rewrote a lot of this code, and some of my Chinese colleagues tell me that it now works for Chinese text.

If there are still things that don't work, please submit a test case :)

@dzcpy

dzcpy commented Jan 6, 2017

How can I define a Chinese word dictionary?
Chinese (and Japanese) aren't like most Western languages, which use spaces to separate words; instead, a word segmentation tool is needed. A good reference is this patch, olivernn/lunr.js@master...codepiano:master, which makes lunr.js able to support Chinese, and it doesn't look very complicated. Do you think you will apply something similar to norch?

@fergiemcdowall
Owner

@andyhu yes, the best strategy is to insert a separator into the text before you index it, and then specify that separator when you index it.
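
A rough illustration of that strategy (the segment() helper is a hypothetical stand-in for a real Chinese segmentation library, and the exact name of the separator option depends on the search-index / norch version in use):

    // Hypothetical stand-in for a real Chinese word-segmentation library.
    function segment(text) {
      return ['天气', '很', '好']; // stand-in output for "天气很好"
    }

    // Join the word units with an explicit separator before indexing.
    function insertSeparator(text, separator) {
      return segment(text).join(separator);
    }

    console.log(insertSeparator('天气很好', ' ')); // "天气 很 好"
    // Index the result as ordinary whitespace-separated text; if a separator
    // other than whitespace is chosen, pass it to the indexer via its
    // separator option (check the docs of the version in use for the exact name).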
