After converting to a single tf graph, the prediction time becomes longer. #4
Comments
Hello,
I am not actively working on this project at the moment. I will try to reproduce this later, though I am not sure when.
I would suggest adding debugging statements within the steps (tokenisation, forward pass, normalization, …) to see where the time is spent and optimising from there. I hope this helps! I would be happy to review any fixes!
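For illustration, here is a minimal sketch of that kind of stage-by-stage timing. It re-creates the pipeline outside the exported graph using the Hugging Face tokenizer and TF model (the model name is taken from the report below; the mean-pooling and L2-normalisation step is a simplified stand-in for what the sentence-transformers model actually does, which also includes a dense projection):

```python
# Rough timing sketch (not part of tf-exporter): time each stage separately
# to see which one dominates. Adjust the model name and pooling to match
# your actual export.
import time

import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModel

MODEL = "sentence-transformers/distiluse-base-multilingual-cased-v2"

tokenizer = AutoTokenizer.from_pretrained(MODEL)
# from_pt=True loads the PyTorch weights into TF and needs torch installed;
# drop it if TF weights are available for your model.
model = TFAutoModel.from_pretrained(MODEL, from_pt=True)

sentences = ["A quick latency check."] * 8

t0 = time.perf_counter()
enc = tokenizer(sentences, padding=True, truncation=True, return_tensors="tf")
t1 = time.perf_counter()
out = model(**enc)  # forward pass
t2 = time.perf_counter()
# simplified mean pooling + L2 normalisation
mask = tf.cast(tf.expand_dims(enc["attention_mask"], -1), tf.float32)
emb = tf.reduce_sum(out.last_hidden_state * mask, axis=1) / tf.reduce_sum(mask, axis=1)
emb = tf.math.l2_normalize(emb, axis=1)
t3 = time.perf_counter()

print(f"tokenisation : {(t1 - t0) * 1000:.1f} ms")
print(f"forward pass : {(t2 - t1) * 1000:.1f} ms")
print(f"pooling/norm : {(t3 - t2) * 1000:.1f} ms")
```

If one of these stages is far slower than the others, that is the place to look first when comparing against the single exported graph.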
Thanks for replying. I am a beginner when it comes to transformers and TensorFlow Serving internals, so at the moment I can hardly make use of your debugging advice.
Hello,
After converting the model ( https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v2 ) with tf_exporter, I loaded the converted model into TensorFlow Serving.
However, there is an issue: with the original model, one predict request takes around 10 ms from input to output, but with the converted model the same request takes around 100 ms, even when accessing TensorFlow Serving locally and ignoring network latency.
Is this 100 ms latency normal, and what changes would bring the latency back in line with the original model?
Thanks a lot.
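For what it's worth, a minimal sketch for measuring end-to-end latency against a local TensorFlow Serving instance over REST (the model name `distiluse`, version, port, and the raw-string input format are assumptions; adjust them to match how the exported SavedModel was deployed):

```python
# Measure end-to-end predict latency against a local TensorFlow Serving
# instance via the REST API. Model name, version, port, and input format
# are placeholders for this sketch.
import json
import time

import requests

URL = "http://localhost:8501/v1/models/distiluse/versions/1:predict"
payload = {"instances": ["A quick latency check."]}

# warm up once so one-time graph tracing/caching does not skew the numbers
requests.post(URL, data=json.dumps(payload)).raise_for_status()

timings = []
for _ in range(50):
    t0 = time.perf_counter()
    resp = requests.post(URL, data=json.dumps(payload))
    resp.raise_for_status()
    timings.append((time.perf_counter() - t0) * 1000)

timings.sort()
print(f"p50: {timings[len(timings) // 2]:.1f} ms, "
      f"p95: {timings[int(len(timings) * 0.95)]:.1f} ms")
```

Comparing these numbers for the original and the converted model, after a warm-up request, helps separate a genuine graph slowdown from one-off startup costs.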