
Chinese language support #39

Closed
znzn40 opened this issue Jul 20, 2014 · 16 comments

Comments

@znzn40

znzn40 commented Jul 20, 2014

Is there any plan to support the Chinese language in forage?

@fergiemcdowall
Owner

That would be really cool. We would need a contributor who speaks Chinese.

@RobinQu

RobinQu commented Aug 11, 2014

+1. I'm trying to find an API for adding a custom analyzer.

@fergiemcdowall
Owner

Do you know of any good libraries that can analyse Chinese text?

@RobinQu

RobinQu commented Aug 11, 2014

These are very stable implementations:

@RobinQu

RobinQu commented Aug 12, 2014

Any progress? Or is there anything I can help with?
Could you tell me where to start if we are adding a word analyzer for Chinese?

@fergiemcdowall
Owner

The following things need to be done:

  1. Find a Chinese dataset that can be used for testing, and format it so that it is similar to the file at test/testdata/reuters-004.json.
  2. Find a library that generates term frequency for Chinese text, and verify that it works on the test data.
  3. Create an option in search-index for indexing different languages, and add Chinese and English as options.

To make progress on this issue we need a Chinese (Mandarin?) speaker to do 1 and 2.
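
For item 1, a rough idea of the shape such a test file could take (the field names below are illustrative guesses, not the actual schema of reuters-004.json):

    var testBatch = [
      {
        id: '1',
        title: '天气很好',               // "Nice weather!"
        body: '今天天气很好，适合出门。'   // "The weather is nice today, good for going out."
      }
    ];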

@RobinQu

RobinQu commented Aug 14, 2014

I am a native Chinese speaker and I will try what you suggest this weekend.

@fergiemcdowall
Owner

Great! Even if you could point to a test dataset in Chinese that already exists, that would be a great help.

@RobinQu

RobinQu commented Aug 17, 2014

Hi,

  1. The test data is ready. It's generated from the RSS feed of the Chinese edition of reuters.com.

    https://github.com/RobinQu/forage/blob/master/test/testdata/reuters-001.json

  2. The three open-source projects mentioned above don't have any interface for computing TF directly from a given text. I'm afraid we have to do text segmentation as a first step. As you know, Chinese words are not separated by whitespace like ' '; e.g. "天气很好" means "Nice weather!". Before any further processing we need to extract the word units from the whole text, which is what segmentation does. Assuming we have a segment function:

    var words = segment("天气很好");
    // words = ["天气", "很", "好"]

    We would then calculate TF from the segmented word list (see the sketch after this list). I am still researching this and trying to contact the authors of those libraries.

  3. I haven't had enough time to read all your code yet........LOL
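
To make item 2 concrete, here is a minimal sketch of that pipeline: segment first, then count term frequency over the segmented word list. It assumes a segmentation library such as nodejieba, but any of the projects mentioned above could stand in for it:

    // Sketch only: nodejieba is one possible Chinese segmenter.
    var nodejieba = require('nodejieba');

    // Split raw Chinese text into word units, since there is no whitespace
    // between words to tokenize on.
    function segment(text) {
      return nodejieba.cut(text); // e.g. "天气很好" -> something like ["天气", "很", "好"]
    }

    // Plain term-frequency count over a segmented word list.
    function termFrequency(words) {
      var tf = {};
      words.forEach(function (word) {
        tf[word] = (tf[word] || 0) + 1;
      });
      return tf;
    }

    console.log(termFrequency(segment('天气很好')));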

@fergiemcdowall
Owner

Great! Your dataset is probably worthy of its own project and GitHub repo.

The fastest way to get Chinese text indexed by Forage is to improve the tf-idf functionality of https://github.com/NaturalNode/natural . If natural can be made to produce document vectors for Chinese text, then Forage can index Chinese text. In fact it might work already: I notice that natural can produce Chinese n-grams, so maybe the tf-idf also works for Chinese?
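
A rough way to probe that (a sketch only; it assumes natural's TfIdf addDocument() accepts a pre-tokenized array, which sidesteps its English-oriented tokenizer):

    var natural = require('natural');
    var tfidf = new natural.TfIdf();

    // Pre-segmented word lists, written out by hand here; in practice they
    // would come from a Chinese segmentation library.
    tfidf.addDocument(['天气', '很', '好']);         // "天气很好"
    tfidf.addDocument(['今天', '天气', '不', '好']); // "今天天气不好"

    // List the weighted terms of the first document.
    tfidf.listTerms(0).forEach(function (item) {
      console.log(item.term + ': ' + item.tfidf);
    });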

@RobinQu

RobinQu commented Aug 19, 2014

I skimmed the source code of lib/natural/tfidf; the culprit is here: https://github.com/NaturalNode/natural/blob/master/lib/natural/tfidf/tfidf.js#L26

They only use English stop words.

The whole project is not designed with international languages in mind; there are a lot of hard-coded assumptions. See NaturalNode/natural#159 and NaturalNode/natural#177.

The major concern is that natural lacks Chinese stop words, and most of its API neglects the fact that the world has many commonly used languages other than English.

I'll try to hack the TF-IDF module of natural first.
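
In the meantime, one workaround that avoids patching natural is to strip Chinese stop words from the segmented tokens before they ever reach the TF-IDF step. A rough sketch (the stop-word list below is a tiny illustrative sample, not a curated one):

    // Tiny illustrative Chinese stop-word list; a real one would be much larger.
    var chineseStopwords = ['的', '了', '是', '在', '和', '很'];

    // Filter stop words out of an already-segmented word list so that natural's
    // hard-coded English stop-word handling never has to deal with them.
    function removeStopwords(words) {
      return words.filter(function (word) {
        return chineseStopwords.indexOf(word) === -1;
      });
    }

    console.log(removeStopwords(['天气', '很', '好'])); // [ '天气', '好' ]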

@RobinQu

RobinQu commented Aug 19, 2014

I think forage is a strong competitor to Elasticsearch, which is too heavy and not as portable as they claim.

Have you ever done any benchmarks against Elasticsearch?

@fergiemcdowall
Owner

Re hacking tf-idf in Natural: that sounds like a good plan!

Yes, Forage could be a competitor to Elasticsearch for some use cases. I work a bit with Elasticsearch, but haven't yet done any benchmarks; I should definitely do that.

@fergiemcdowall
Owner

I recently rewrote a lot of this code, and some of my Chinese colleagues tell me that it now works for Chinese text.

If there are still things that don't work, please submit a test case :)

@dzcpy

dzcpy commented Jan 6, 2017

How can I define a Chinese word dictionary?
Chinese (and Japanese) aren't like most Western languages, which use spaces to separate words; instead, a word segmentation tool is needed. A good reference is this patch, olivernn/lunr.js@master...codepiano:master, which makes lunr.js able to support Chinese, and it doesn't look very complicated. Do you think you will apply something similar to norch?

@fergiemcdowall
Owner

@andyhu yes, the best strategy is to insert a separator into the text before you index it, and then specify that separator when you index it.
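
A rough illustration of that strategy (the segment() helper is a hypothetical stand-in for a real Chinese segmentation library, and the exact name of the separator option depends on the search-index / norch version in use):

    // Hypothetical stand-in for a real Chinese word-segmentation library.
    function segment(text) {
      return ['天气', '很', '好']; // stand-in output for "天气很好"
    }

    // Join the word units with an explicit separator before indexing.
    function insertSeparator(text, separator) {
      return segment(text).join(separator);
    }

    console.log(insertSeparator('天气很好', ' ')); // "天气 很 好"
    // Index the result as ordinary whitespace-separated text; if a separator
    // other than whitespace is chosen, pass it to the indexer via its
    // separator option (check the docs of the version in use for the exact name).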
