
After converting to a single tf graph, the prediction time becomes longer. #4

Open
birdmu opened this issue May 13, 2024 · 2 comments

Comments


birdmu commented May 13, 2024

Hello,
After converting the model (https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v2) with tf_exporter, I deployed the converted model in TensorFlow Serving.
However, there is an issue: with the original model, one predict request takes around 10 ms from input to output, whereas with the converted model the same request takes around 100 ms, even when TensorFlow Serving is queried locally so that network latency can be ignored.
Is this 100 ms latency expected, and what changes would bring the latency back in line with the original model?

Thanks a lot.
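For anyone who wants to reproduce the measurement, something like the following can be used to time predict requests against TensorFlow Serving's REST API. This is only a minimal sketch, not the exact benchmark used above: the port (8501), the model name (`distiluse`) and a payload of raw strings are assumptions that depend on how the model was exported and served.

```python
# Minimal latency check against TensorFlow Serving's REST API.
# Assumptions (adjust to your setup): the model is served locally on port 8501
# under the name "distiluse", and the exported signature accepts raw strings.
import time

import requests

URL = "http://localhost:8501/v1/models/distiluse:predict"
payload = {"instances": ["This is a short test sentence."]}

# Warm-up call so one-off graph initialisation is not counted.
requests.post(URL, json=payload).raise_for_status()

timings = []
for _ in range(50):
    start = time.perf_counter()
    response = requests.post(URL, json=payload)
    response.raise_for_status()
    timings.append((time.perf_counter() - start) * 1000.0)

timings.sort()
print(f"median: {timings[len(timings) // 2]:.1f} ms, p95: {timings[int(len(timings) * 0.95)]:.1f} ms")
```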

balikasg (Owner) commented May 13, 2024 via email

Hello, I am not actively working on this project at the moment. I will try to reproduce this later, though I am not sure when. In the meantime, I would suggest adding timing/debug statements around the individual steps (tokenisation, forward pass, normalisation, …) to see where the time is spent and optimising from there. I hope this helps! I would be happy to review any fixes!
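One way to start acting on this suggestion is to take TensorFlow Serving out of the loop and time the exported graph directly: if the graph itself already needs around 100 ms, the regression is in the export rather than in serving. The sketch below is an assumption-laden starting point; the SavedModel path, the `serving_default` signature and the `text` input key are all hypothetical and depend on how tf_exporter wrote the model out.

```python
# Rough sketch: call the exported SavedModel directly, without TensorFlow
# Serving in between, to see whether the ~100 ms lives in the graph itself
# or in the serving layer. EXPORT_DIR and the input key "text" are
# hypothetical; adjust them to however tf_exporter wrote your model out.
import time

import tensorflow as tf

EXPORT_DIR = "exported_model/1"  # hypothetical path to the SavedModel

model = tf.saved_model.load(EXPORT_DIR)
infer = model.signatures["serving_default"]

# Print the real input name(s) of the exported signature; "text" below is a guess.
print(infer.structured_input_signature)

sentences = tf.constant(["This is a short test sentence."])

infer(text=sentences)  # warm-up: the first call pays tracing/initialisation costs

runs = 50
start = time.perf_counter()
for _ in range(runs):
    infer(text=sentences)
avg_ms = (time.perf_counter() - start) * 1000.0 / runs
print(f"average graph-only latency: {avg_ms:.1f} ms")
```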

birdmu (Author) commented May 13, 2024

Thanks for replying. I am a beginner when it comes to transformers and the internals of TensorFlow Serving, so at the moment I can hardly act on your advice about debugging the individual steps.
Anyway, thanks a lot.
