
How can I make it calculate faster? #66

Open · A11en0 opened this issue Oct 26, 2021 · 8 comments

A11en0 commented Oct 26, 2021

The calculation is a little bit slow. Is there some way to speed it up? Can I use a GPU instead of the CPU?

MichaelRoeder (Member) commented Oct 26, 2021

That depends a lot on how you actually use it.

The main bottleneck is reading the index. Using a GPU instead of a CPU most probably won't help, since there are no expensive matrix operations 😉

Do you have a large set of topics that you would like to evaluate at once?

A11en0 (Author) commented Oct 26, 2021

Yes, I embedded it into my training code to evaluate C_V every 10 epochs, and it apparently slows down my training.

A11en0 (Author) commented Oct 26, 2021

By the way, I set the number of topics to 20 and the number of topic words to 15. I guess calculating all 20 topics at once, instead of one by one, would boost the speed, since it would only need to read the index file once. Can this be achieved?

MichaelRoeder (Member) commented:

  1. You add an additional step to your training that tries to evaluate your topics based on large statistics that it has to gather, so it is expected to take more time 😉
    However, I understand that the longer training is annoying.
  2. I am not fully clear about your setup. I assume that you have a Python program and that you run the Palmetto web service in parallel. Is that right? Or do you use Palmetto on the command line? 🤔

A11en0 (Author) commented Oct 27, 2021

I use the palmetto-py API palmetto.get_coherence() in my Python training code, evaluating every 10 epochs. In my opinion, it loads the index file once each time I call the API, but it can only evaluate one topic per call. So when I calculate the whole topic distribution (K topics), I need to call it K times, which takes too much time.

So I suggest adding a new API that can calculate all K topics in a single call, since it would only need to load the index file once.
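The K-call pattern described above can be sketched as follows; `get_coherence_stub` is a made-up stand-in for `palmetto.get_coherence()` from palmetto-py, so the example runs without a Palmetto endpoint:

```python
# Sketch of the per-topic call pattern. get_coherence_stub stands in for
# palmetto.get_coherence(); the real call sends the topic's top words to
# the web service, which searches the index for each request.

def get_coherence_stub(top_words, coherence_type="cv"):
    # Placeholder for the actual web-service request.
    return 0.0

def evaluate_all_topics(topics, coherence_type="cv"):
    # K topics -> K separate calls, i.e., K rounds of requests.
    return [get_coherence_stub(words, coherence_type) for words in topics]

topics = [["apple", "banana", "fruit"], ["car", "road", "engine"]]
scores = evaluate_all_topics(topics)  # one score per topic
```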

MichaelRoeder (Member) commented:

Thanks for clarifying your setup. Your assumption is not correct: you start the web service only once, at the very beginning of your program (at least I assume so). The calls to the web service will simply always cause a search on the index, no matter how many topics you send at once. So although I think the extension of the API might be a good idea, it won't change the runtime.

The only change that I can think of at the moment with respect to runtime might be the implementation of a cache. The cache could be implemented as a decorator of the WindowSupportingLuceneCorpusAdapter class and cache the result of the requestDocumentsWithWord method. However, it may consume a lot of memory. Apart from that, I simply won't have time during the next months to look into that. Feel free to create a Pull Request or ask for some guidance if you want to give it a try as it is not as trivial as it might seem.
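The caching idea can be illustrated as a memoizing wrapper. Python is used here only for illustration; the actual implementation would be a Java decorator of `WindowSupportingLuceneCorpusAdapter`, and `CountingAdapter` is a made-up stand-in for the Lucene-backed adapter:

```python
# Illustration of the proposed cache: memoize requestDocumentsWithWord-
# style lookups so repeated words skip the index search. CountingAdapter
# is a made-up stand-in for the real (Java) adapter.

class CountingAdapter:
    def __init__(self):
        self.index_hits = 0

    def request_documents_with_word(self, word):
        self.index_hits += 1     # simulate an expensive index search
        return ["doc1", "doc2"]  # dummy result

class CachingAdapter:
    def __init__(self, adapter):
        self.adapter = adapter
        self.cache = {}  # word -> cached result (may consume a lot of memory)

    def request_documents_with_word(self, word):
        if word not in self.cache:
            self.cache[word] = self.adapter.request_documents_with_word(word)
        return self.cache[word]

backend = CountingAdapter()
cached = CachingAdapter(backend)
cached.request_documents_with_word("apple")
cached.request_documents_with_word("apple")  # served from the cache
```

The memory trade-off the comment mentions is visible here: the cache grows with the vocabulary and is never evicted.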

I may have two suggestions that could improve the runtime (I guess you already thought of them):

  1. Try to avoid unnecessary calls. You can store the coherence values that have been calculated in previous epochs. If a topic has the same top words as in the previous evaluation run, you don't have to evaluate it again. Depending on the topic coherence, the order of the top words may not have an influence on the coherence value (e.g., for C_V, the order of the words doesn't matter).
  2. Depending on how far along you already are with your approach, you may want to think about using a "cheaper" topic coherence. Most of the coherences in our paper use a window-based approach and are pretty costly. However, the UMass coherence is quite fast, since it does not make use of the positions of words within the documents. So for testing whether your overall approach works, it might be an easy alternative for getting some fast, first results. However, the quality of its coherence results is not as good.
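Suggestion 1 above can be sketched as a small score cache keyed on the set of top words, which is valid for order-insensitive coherences such as C_V; `compute_coherence` is a hypothetical stand-in for the actual palmetto-py call:

```python
# Sketch of suggestion 1: reuse scores across epochs when a topic's top
# words did not change. A frozenset key ignores word order, which is fine
# for order-insensitive coherences such as C_V. compute_coherence is a
# hypothetical stand-in for palmetto.get_coherence().

calls = []

def compute_coherence(top_words):
    calls.append(tuple(sorted(top_words)))  # record the expensive call
    return 0.5                              # dummy score

score_cache = {}

def coherence_cached(top_words):
    key = frozenset(top_words)
    if key not in score_cache:
        score_cache[key] = compute_coherence(top_words)
    return score_cache[key]

# Epoch 10 and epoch 20 yield the same top words in a different order:
coherence_cached(["apple", "fruit", "banana"])
coherence_cached(["banana", "apple", "fruit"])  # no new call
```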

A11en0 (Author) commented Oct 27, 2021

Thanks for your careful reply! I get your advice, but it would probably require changing the project's Java code, and I'm afraid I don't have much time to do that. I just use this wonderful tool to test my own topic model, so I can't spend too much time on the coherence calculation methods.

I have a new problem: the Python interface API often gives me an "endpoint down" error when I use the backend server locally. I built the Tomcat-based server following the instructions at https://github.com/dice-group/Palmetto/wiki/How-Palmetto-can-be-used. Does the problem come from the Python interface or the Java backend? I have no idea.

MichaelRoeder (Member) commented:

Yes, I can understand that. Seems like nobody has a lot of time these days 😉

You can increase the time the Python client waits by setting the timeout attribute.
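A defensive pattern for the endpoint-down errors could look like the sketch below, under the assumption that a failed request raises an exception; `get_coherence_stub` is a made-up stand-in for the palmetto-py call so the example runs offline:

```python
import time

# Retry wrapper for flaky "endpoint down" errors. Assumes a failed
# request raises an exception; get_coherence_stub stands in for
# palmetto.get_coherence() so the example runs offline.

def get_coherence_stub(top_words):
    return 0.42  # dummy score

def coherence_with_retry(top_words, retries=3, delay=1.0):
    for attempt in range(retries):
        try:
            return get_coherence_stub(top_words)
        except Exception:
            if attempt == retries - 1:
                raise          # give up after the last attempt
            time.sleep(delay)  # wait before retrying
```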

As an alternative, you could also run Palmetto from the command line and read its result from there. This would ensure that your program waits and that you get rid of the HTTP-based communication. However, I am not sure how much effort it would be to implement that in Python. So maybe it is just another weird idea 😄
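The command-line alternative could be driven from Python with `subprocess`; the jar name and the argument order below are assumptions, so check the Palmetto wiki for the exact invocation:

```python
import subprocess

# Sketch of running Palmetto as a CLI tool from Python. The jar path and
# the argument order (index dir, coherence name, file with top words)
# are assumptions -- check the Palmetto wiki for the exact usage.

def build_palmetto_cmd(jar_path, index_dir, coherence, topic_file):
    return ["java", "-jar", jar_path, index_dir, coherence, topic_file]

def run_palmetto(jar_path, index_dir, coherence, topic_file):
    # Blocks until the JVM exits, avoiding the HTTP round trip entirely.
    result = subprocess.run(
        build_palmetto_cmd(jar_path, index_dir, coherence, topic_file),
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

This trades the always-on web service for one JVM start-up per evaluation run, so it pays off mainly when evaluations are infrequent.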
