
After converting to a single tf graph, the prediction time becomes longer. #4

Open
birdmu opened this issue May 13, 2024 · 2 comments

Comments


birdmu commented May 13, 2024

Hello,
After converting the model (https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v2) with tf_exporter, I deployed the converted model in TensorFlow Serving.
However, there is an issue: with the original model, one predict request takes around 10 ms from input to output, whereas with the converted model the same request takes around 100 ms, even when TensorFlow Serving is queried locally so that network latency can be ignored.
Is this 100 ms latency expected, and what changes would bring the latency back in line with the original model?

Thanks a lot.
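For anyone who wants to reproduce the measurement, something like the following can be used to time predict requests against TensorFlow Serving's REST API. This is only a minimal sketch, not the exact benchmark used above: the port (8501), the model name (`distiluse`) and a payload of raw strings are assumptions that depend on how the model was exported and served.

```python
# Minimal latency check against TensorFlow Serving's REST API.
# Assumptions (adjust to your setup): the model is served locally on port 8501
# under the name "distiluse", and the exported signature accepts raw strings.
import time

import requests

URL = "http://localhost:8501/v1/models/distiluse:predict"
payload = {"instances": ["This is a short test sentence."]}

# Warm-up call so one-off graph initialisation is not counted.
requests.post(URL, json=payload).raise_for_status()

timings = []
for _ in range(50):
    start = time.perf_counter()
    response = requests.post(URL, json=payload)
    response.raise_for_status()
    timings.append((time.perf_counter() - start) * 1000.0)

timings.sort()
print(f"median: {timings[len(timings) // 2]:.1f} ms, p95: {timings[int(len(timings) * 0.95)]:.1f} ms")
```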

balikasg (Owner) commented May 13, 2024 via email

Hello, I am not actively working on this project at the moment. I will try to reproduce this later, though I am not sure when. In the meantime, I would suggest adding timing/debug statements around the individual steps (tokenisation, forward pass, normalisation, …) to see where the time is spent and optimising from there. I hope this helps! I would be happy to review any fixes!
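One way to start acting on this suggestion is to take TensorFlow Serving out of the loop and time the exported graph directly: if the graph itself already needs around 100 ms, the regression is in the export rather than in serving. The sketch below is an assumption-laden starting point; the SavedModel path, the `serving_default` signature and the `text` input key are all hypothetical and depend on how tf_exporter wrote the model out.

```python
# Rough sketch: call the exported SavedModel directly, without TensorFlow
# Serving in between, to see whether the ~100 ms lives in the graph itself
# or in the serving layer. EXPORT_DIR and the input key "text" are
# hypothetical; adjust them to however tf_exporter wrote your model out.
import time

import tensorflow as tf

EXPORT_DIR = "exported_model/1"  # hypothetical path to the SavedModel

model = tf.saved_model.load(EXPORT_DIR)
infer = model.signatures["serving_default"]

# Print the real input name(s) of the exported signature; "text" below is a guess.
print(infer.structured_input_signature)

sentences = tf.constant(["This is a short test sentence."])

infer(text=sentences)  # warm-up: the first call pays tracing/initialisation costs

runs = 50
start = time.perf_counter()
for _ in range(runs):
    infer(text=sentences)
avg_ms = (time.perf_counter() - start) * 1000.0 / runs
print(f"average graph-only latency: {avg_ms:.1f} ms")
```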

birdmu (Author) commented May 13, 2024

Thanks for replying. I am a beginner when it comes to transformers and the internals of TensorFlow Serving, so at the moment I can hardly act on your advice about debugging the individual steps.
Anyway, thanks a lot.
