-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix mountPath
to use /tmp
instead of /data
#1584
base: main
Are you sure you want to change the base?
Fix mountPath
to use /tmp
instead of /data
#1584
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi team,
I'm currently reviewing this pull request and will provide feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly understand the changes:
This pull request, authored by alvarobartt, addresses an issue where the mountPath
in several Kubernetes YAML files was incorrectly set to /data
instead of /tmp
. The issue stemmed from a mismatch between the default image and a Text Generation Inference (TGI) Docker image, where the HF_HOME
environment variable points to /tmp
.
Intent: To correct the mountPath
to /tmp
for consistency with the TGI DLCs and resolve issues encountered when using the TGI DLCs in Kubernetes manifests.
Changes:
The pull request modifies four YAML files, updating the mountPath
from /data
to /tmp
within the volumeMounts
section:
ai-ml/llm-multiple-gpus/falcon-40b/text-generation-inference.yaml
(line 54)ai-ml/llm-multiple-gpus/llama2-70b/text-generation-inference.yaml
(line 59)ai-ml/llm-multiple-gpus/llama3-70b/text-generation-inference.yaml
(line 61)ai-ml/llm-multiple-gpus/mixtral-8x7b/text-generation-inference.yaml
(line 59)
Activity: The PR description mentions that this fix has already been implemented in another repository and references related pull requests (GoogleCloudPlatform/ai-on-gke#816 and GoogleCloudPlatform/ai-on-gke#852).
This change ensures consistency across different environments and resolves potential runtime issues.
A little haiku to lighten the mood:
/data
path is wrong,
/tmp
now holds the secrets,
Code flows smoothly on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR updates the mountPath
for several LLM deployments to use /tmp
instead of /data
, aligning with the HF_HOME
environment variable setting in the TGI DLC images. This change is necessary to avoid issues when using TGI DLC images as the container URI in Kubernetes manifests. The PR description clearly explains the problem, its cause, and the solution. The references to related PRs provide helpful context. Overall, the changes are well-justified and effectively address the identified issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi team,
I'm currently reviewing this pull request and will provide my feedback shortly. In the meantime, here's a summary for you and other reviewers to quickly get up to speed:
This pull request, authored by alvarobartt, aims to fix an issue where the mountPath
in several Kubernetes YAML files was incorrectly set to /data
instead of /tmp
. This was causing problems because the HF_HOME
environment variable in the Text Generation Inference (TGI) Docker images is set to /tmp
.
The issue stemmed from using the TGI DLC (Docker container image) instead of the default TGI image. While the DLC already used /tmp
, the Kubernetes manifests weren't updated to reflect this change.
Here's a breakdown of the changes:
- Intent: Correct the
mountPath
to/tmp
for consistency with the TGI DLC'sHF_HOME
environment variable. - Changes: The
mountPath
in the following files has been updated from/data
to/tmp
:ai-ml/llm-multiple-gpus/falcon-40b/text-generation-inference.yaml
(line 54)ai-ml/llm-multiple-gpus/llama2-70b/text-generation-inference.yaml
(line 59)ai-ml/llm-multiple-gpus/llama3-70b/text-generation-inference.yaml
(line 61)ai-ml/llm-multiple-gpus/mixtral-8x7b/text-generation-inference.yaml
(line 59)
- Related Issues: This PR closes issue Serve an LLM with multiple GPUs in GKE is doesn't work and fails with The node was low on resource: ephemeral-storage. #1581 and is related to Update references to Hugging Face DLC for TGI ai-on-gke#816 and Update HuggingFace DLC for TGI URI to 2.3 (latest) ai-on-gke#852.
This fix ensures that the examples run correctly when using the TGI DLC.
And now, a little haiku to lighten the mood:
/data
path is wrong,
/tmp
now holds the secrets,
Code compiles at last!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR updates the mountPath
to /tmp
to align with the HF_HOME
environment variable in TGI DLCs. This change is necessary for compatibility with the TGI DLC images and resolves issues encountered when using these images in Kubernetes deployments. The PR description clearly outlines the problem, its cause, the solution, and related fixes in other repositories. Overall, the changes are well-justified and implemented correctly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi team,
I'm currently reviewing this pull request and will provide feedback shortly. In the meantime, here's a summary for you and other reviewers to quickly get up to speed:
This pull request, submitted by alvarobartt, addresses an issue where the mountPath
in several Kubernetes YAML files was incorrectly set to /data
instead of /tmp
. The issue arose due to a mismatch between the default image and a Text Generation Inference (TGI) Docker image used in the examples. The TGI DLC uses /tmp
as its HF_HOME
environment variable, while the default image uses /data
. This PR corrects the mountPath
to /tmp
for consistency.
Here's a breakdown of the changes:
- Intent: To fix inconsistencies in
mountPath
settings across different LLM examples, ensuring compatibility with the TGI DLC. - Changes: The
mountPath
in the following files has been updated from/data
to/tmp
:ai-ml/llm-multiple-gpus/falcon-40b/text-generation-inference.yaml
(line 54)ai-ml/llm-multiple-gpus/llama2-70b/text-generation-inference.yaml
(line 59)ai-ml/llm-multiple-gpus/llama3-70b/text-generation-inference.yaml
(line 61)ai-ml/llm-multiple-gpus/mixtral-8x7b/text-generation-inference.yaml
(line 59)
- Related Issues: This PR closes issue Serve an LLM with multiple GPUs in GKE is doesn't work and fails with The node was low on resource: ephemeral-storage. #1581 and is related to PRs Update references to Hugging Face DLC for TGI ai-on-gke#816 and Update HuggingFace DLC for TGI URI to 2.3 (latest) ai-on-gke#852.
I'll provide a more detailed review shortly.
A little haiku to brighten your day:
/tmp
now takes stage,
Data's path, a thing of the past,
Smooth code flows again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR correctly addresses the mountPath issue with TGI DLCs. The change ensures consistency with the HF_HOME environment variable and resolves the issues encountered when using TGI DLC images. The PR description is well-written and provides sufficient context, including references to related PRs. The suggested comments enhance clarity and maintainability across the affected YAML files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi team,
I'm currently reviewing this pull request and will provide my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly understand the changes:
This pull request, submitted by alvarobartt, addresses an issue where the mountPath
in several Kubernetes YAML files was incorrectly set to /data
instead of /tmp
. The issue arose when switching to a new Docker image where the HF_HOME
environment variable points to /tmp
.
Intent: The primary goal is to correct the mountPath
to /tmp
for consistency with the new Docker image and to resolve issues caused by the mismatch.
Changes:
The pull request modifies four YAML files, updating the mountPath
from /data
to /tmp
within the volumeMounts
section:
ai-ml/llm-multiple-gpus/falcon-40b/text-generation-inference.yaml
(lines 54-55)ai-ml/llm-multiple-gpus/llama2-70b/text-generation-inference.yaml
(lines 59-60)ai-ml/llm-multiple-gpus/llama3-70b/text-generation-inference.yaml
(lines 61-62)ai-ml/llm-multiple-gpus/mixtral-8x7b/text-generation-inference.yaml
(lines 59-60)
Activity: The PR description mentions that this fix has already been implemented in other related PRs (links provided in the description).
This PR is a straightforward fix that ensures consistency and resolves existing issues. I'll provide a more detailed review shortly.
A tiny poem for your coding pleasure:
A path mislaid,
From/data
to/tmp
's shade,
Now all is made right.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR updates the mountPath
for Text Generation Inference deployments to /tmp
to align with the HF_HOME
environment variable in TGI DLC images. This change is necessary to resolve issues encountered when using TGI DLC images in Kubernetes deployments. The PR description provides a clear explanation of the issue, its cause, and the proposed solution. Referencing related PRs adds valuable context.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi team,
I'm currently reviewing this pull request and will provide feedback shortly. In the meantime, here's a summary for you and other reviewers to quickly get up to speed:
This pull request, authored by alvarobartt, addresses an issue where the mountPath
in several Kubernetes YAML files was incorrectly set to /data
instead of /tmp
. The issue stems from a discrepancy between the default TGI image and the TGI DLC (Docker container image). The DLC uses /tmp
as the location for the HF_HOME
environment variable, where the Hugging Face model is stored. This PR updates the mountPath
to /tmp
to resolve compatibility problems.
Here's a breakdown of the changes:
- Intent: Correct the
mountPath
in Kubernetes YAML files to point to/tmp
instead of/data
for consistency with the TGI DLC'sHF_HOME
environment variable. - Changes: The
mountPath
in the following files has been updated from/data
to/tmp
:ai-ml/llm-multiple-gpus/falcon-40b/text-generation-inference.yaml
(lines 54-57)ai-ml/llm-multiple-gpus/llama2-70b/text-generation-inference.yaml
(lines 59-62)ai-ml/llm-multiple-gpus/llama3-70b/text-generation-inference.yaml
(lines 61-64)ai-ml/llm-multiple-gpus/mixtral-8x7b/text-generation-inference.yaml
(lines 59-62)
- Activity: The PR description mentions related work in other repositories and closes issue Serve an LLM with multiple GPUs in GKE is doesn't work and fails with The node was low on resource: ephemeral-storage. #1581.
I'll provide a more detailed review shortly. In the meantime, let me know if you have any questions.
A little haiku to lighten the mood:
/tmp
now takes stage,
Models load, a happy sight,
Fixes flow like streams.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR makes a necessary change to ensure compatibility between the Kubernetes manifests and the TGI DLC image. The change is well-explained and the provided context is helpful. A minor suggestion would be to standardize the comments across all affected files for better consistency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There still some issue in this sample. It fails with
"Traceback (most recent call last):
File ""/opt/conda/lib/python3.11/site-packages/text_generation_server/models/flash_causal_lm.py"", line 1269, in warmup
_, batch, _ = self.generate_token(batch)
File ""/opt/conda/lib/python3.11/contextlib.py"", line 81, in inner
return func(*args, **kwds)
File ""/opt/conda/lib/python3.11/site-packages/text_generation_server/models/flash_causal_lm.py"", line 1730, in generate_token
prefill_logprobs_tensor = torch.log_softmax(out, -1)
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 514.00 MiB. GPU 0 has a total capacity of 21.96 GiB of which 393.12 MiB is free. Including non-PyTorch memory, this process has 0 bytes memory in use. Of the allocated memory 20.87 GiB is allocated by PyTorch, and 382.89 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I even tried with g2-standard-48
and it still keeps on crashing. Can you please run some tests to validate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For all logs feel free to request access. But the error above should be enough to find the issue and diff between TGI default and TGI DLC image, as the only difference is the image and mountpath
https://docs.google.com/spreadsheets/d/1hKZP9X2ueP-Zvnb9zIXMLk6LeGXvfr-mGR6z-NGfA3s/edit?gid=1556804789#gid=1556804789
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It worked with a2-highgpu-2g
which has 40G (A100) GPU memory. Which means there is something not right with the DLC image and breaks using LLAMA3-70B running on L4 GPU.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the detailed report @raushan2016, let me run some tests on our end to investigate and I'll ping you as soon as those are completed!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure sounds fair, I'll try to investigate the issue today! Thanks for your time!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alvarobartt Somehow things are working now, maybe the issue was fixed in the image since the image is pointing to latest ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the confirmation, but yes that is odd indeed, because AFAIK us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu124.2-3.ubuntu2204.py311
was released some time ago (not sure if they rolled an update over that image, but doesn't look like it as per https://cloud.google.com/deep-learning-containers/docs/choosing-container#hugging-face).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry
For LLAMA3 this image works us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu121.2-1.ubuntu2204.py310
But the current image us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu124.2-3.ubuntu2204.py311 doesn't work.
So something related to the changed TGI or Cuda version. Will really be helpful if you can look into this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is the PR with right images which are working
#1591
@@ -51,7 +51,10 @@ spec: | |||
volumeMounts: | |||
- mountPath: /dev/shm | |||
name: dshm | |||
- mountPath: /data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alvarobartt Falcon model has this error. Tried couple of different version > 2 all of them failed.
INFO 2025-01-16T23:04:59.120810571Z [resource.labels.containerName: llm] �[2m2025-01-16T23:04:59.120688Z�[0m �[31mERROR�[0m �[2mtext_generation_launcher�[0m�[2m:�[0m Error when initializing model
INFO 2025-01-16T23:04:59.120839184Z [resource.labels.containerName: llm] Traceback (most recent call last): File "/opt/conda/bin/text-generation-server", line 8, in sys.exit(app()) File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in call return get_command(self)(*args, **kwargs) File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in call return self.main(*args, **kwargs) File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main return _main( File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main rv = self.invoke(ctx) File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke return _process_result(sub_ctx.command.invoke(sub_ctx)) File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke return ctx.invoke(self.callback, **ctx.params) File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke return __callback(*args, **kwargs) File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper return callback(**use_params) # type: ignore File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 106, in serve server.serve( File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 297, in serve asyncio.run( File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run return loop.run_until_complete(main) File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete self.run_forever() File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever self._run_once() File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once handle._run() File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run self._context.run(self._callback, *self._args)
INFO 2025-01-16T23:04:59.120940322Z [resource.labels.containerName: llm] > File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 231, in serve_inner
INFO 2025-01-16T23:04:59.120942462Z [resource.labels.containerName: llm] model = get_model(
INFO 2025-01-16T23:04:59.120944599Z [resource.labels.containerName: llm] File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/init.py", line 714, in get_model
INFO 2025-01-16T23:04:59.120946477Z [resource.labels.containerName: llm] return FlashRWSharded(
INFO 2025-01-16T23:04:59.120948410Z [resource.labels.containerName: llm] File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_rw.py", line 77, in init
INFO 2025-01-16T23:04:59.120950825Z [resource.labels.containerName: llm] model = FlashRWForCausalLM(config, weights)
INFO 2025-01-16T23:04:59.120953217Z [resource.labels.containerName: llm] File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 659, in init
INFO 2025-01-16T23:04:59.120955289Z [resource.labels.containerName: llm] self.transformer = FlashRWModel(config, weights)
INFO 2025-01-16T23:04:59.120957337Z [resource.labels.containerName: llm] File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 593, in init
INFO 2025-01-16T23:04:59.120959345Z [resource.labels.containerName: llm] [
INFO 2025-01-16T23:04:59.120961371Z [resource.labels.containerName: llm] File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 594, in
INFO 2025-01-16T23:04:59.120963316Z [resource.labels.containerName: llm] FlashRWLargeLayer(layer_id, config, weights)
INFO 2025-01-16T23:04:59.120965300Z [resource.labels.containerName: llm] File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 525, in init
INFO 2025-01-16T23:04:59.120967342Z [resource.labels.containerName: llm] self.ln_layer = FlashRWLayerNorm(config, prefix, weights)
INFO 2025-01-16T23:04:59.120969434Z [resource.labels.containerName: llm] File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 484, in init
INFO 2025-01-16T23:04:59.120971427Z [resource.labels.containerName: llm] self.num_ln = config.num_ln_in_parallel_attn
INFO 2025-01-16T23:04:59.120973477Z [resource.labels.containerName: llm] File "/opt/conda/lib/python3.10/site-packages/transformers/configuration_utils.py", line 264, in getattribute
INFO 2025-01-16T23:04:59.120975858Z [resource.labels.containerName: llm] return super().getattribute(key)
INFO 2025-01-16T23:04:59.120977917Z [resource.labels.containerName: llm] AttributeError: 'RWConfig' object has no attribute 'num_ln_in_parallel_attn'
INFO 2025-01-16T23:04:59.160341634Z [resource.labels.containerName: llm] �[2m2025-01-16T23:04:59.160221Z�[0m �[31mERROR�[0m �[2mtext_generation_launcher�[0m�[2m:�[0m Error when initializing model
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Found an issue but its resolved > 6 months back
huggingface/text-generation-inference#2349
Description
This PR updates the
mountPath
so as to point to/tmp
instead of/data
, as theHF_HOME
environment variable in the TGI DLCs is set to/tmp
instead of the default/data
set within the TGI default image (see Dockerfile); the change has always been there for the TGI DLC, but when setting the container URI i.e. theimage
within the Kubernetes manifests to the TGI DLC instead of the default TGI image (hosted on GitHub) as per GoogleCloudPlatform/ai-on-gke#816, themountPath
were not updated, and it was still using/data
which was leading to some issues when running the examples.This has already been fixed within this repository in this PR, and also within GoogleCloudPlatform/ai-on-gke#852.
cc @annapendleton @raushan2016
Closes #1581
Tasks