Fix mountPath to use /tmp instead of /data #1584

Open · wants to merge 2 commits into main

Conversation

alvarobartt (Contributor)

Description

This PR updates the mountPath so that it points to /tmp instead of /data, since the HF_HOME environment variable in the TGI DLCs is set to /tmp rather than the default /data used in the upstream TGI image (see the Dockerfile). That difference has always existed in the TGI DLC, but when the container URI, i.e. the image in the Kubernetes manifests, was switched from the default TGI image (hosted on GitHub) to the TGI DLC in GoogleCloudPlatform/ai-on-gke#816, the mountPath was not updated and still pointed to /data, which was causing issues when running the examples.
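
For illustration, the relevant part of each manifest ends up looking roughly like this (a minimal sketch, not an exact excerpt: the volume names and the emptyDir backing are assumptions based on the samples, and only the mountPath line is what this PR changes):

    containers:
      - name: llm
        image: us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu124.2-3.ubuntu2204.py311
        volumeMounts:
          - mountPath: /dev/shm
            name: dshm
          - mountPath: /tmp   # was /data; HF_HOME inside the TGI DLC points to /tmp
            name: data
    volumes:
      - name: dshm
        emptyDir:
          medium: Memory
      - name: data
        emptyDir: {}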

This has already been fixed within this repository in this PR, and also within GoogleCloudPlatform/ai-on-gke#852.

cc @annapendleton @raushan2016

Closes #1581

Tasks

  • The contributing guide has been read and followed.
  • The samples added / modified have been fully tested.
  • Workflow files have been added / modified, if applicable.
  • Region tags have been properly added, if new samples.
  • Editable variables have been used, where applicable.
  • All dependencies are set to up-to-date versions, as applicable.
  • Merge this pull request for me once it is approved.

@code-review-assist (bot) left a comment

Hi team,

I'm currently reviewing this pull request and will provide feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly understand the changes:

This pull request, authored by alvarobartt, addresses an issue where the mountPath in several Kubernetes YAML files was incorrectly set to /data instead of /tmp. The issue stems from a mismatch between the upstream TGI image and the TGI Deep Learning Container (DLC) image, in which the HF_HOME environment variable points to /tmp.

Intent: To correct the mountPath to /tmp for consistency with the TGI DLCs and resolve issues encountered when using the TGI DLCs in Kubernetes manifests.

Changes:

The pull request modifies four YAML files, updating the mountPath from /data to /tmp within the volumeMounts section:

  • ai-ml/llm-multiple-gpus/falcon-40b/text-generation-inference.yaml (line 54)
  • ai-ml/llm-multiple-gpus/llama2-70b/text-generation-inference.yaml (line 59)
  • ai-ml/llm-multiple-gpus/llama3-70b/text-generation-inference.yaml (line 61)
  • ai-ml/llm-multiple-gpus/mixtral-8x7b/text-generation-inference.yaml (line 59)

Activity: The PR description mentions that this fix has already been implemented in another repository and references related pull requests (GoogleCloudPlatform/ai-on-gke#816 and GoogleCloudPlatform/ai-on-gke#852).

This change ensures consistency across different environments and resolves potential runtime issues.


A little haiku to lighten the mood:

/data path is wrong,
/tmp now holds the secrets,
Code flows smoothly on.

@code-review-assist (bot) left a comment

This PR updates the mountPath for several LLM deployments to use /tmp instead of /data, aligning with the HF_HOME environment variable setting in the TGI DLC images. This change is necessary to avoid issues when using TGI DLC images as the container URI in Kubernetes manifests. The PR description clearly explains the problem, its cause, and the solution. The references to related PRs provide helpful context. Overall, the changes are well-justified and effectively address the identified issue.


@code-review-assist (bot) left a comment

This PR makes a necessary change to ensure compatibility between the Kubernetes manifests and the TGI DLC image. The change is well-explained and the provided context is helpful. A minor suggestion would be to standardize the comments across all affected files for better consistency.

raushan2016 (Member) · Jan 8, 2025

There's still some issue in this sample. It fails with:

"Traceback (most recent call last):
  File ""/opt/conda/lib/python3.11/site-packages/text_generation_server/models/flash_causal_lm.py"", line 1269, in warmup
    _, batch, _ = self.generate_token(batch)
  File ""/opt/conda/lib/python3.11/contextlib.py"", line 81, in inner
    return func(*args, **kwds)
  File ""/opt/conda/lib/python3.11/site-packages/text_generation_server/models/flash_causal_lm.py"", line 1730, in generate_token
    prefill_logprobs_tensor = torch.log_softmax(out, -1)
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 514.00 MiB. GPU 0 has a total capacity of 21.96 GiB of which 393.12 MiB is free. Including non-PyTorch memory, this process has 0 bytes memory in use. Of the allocated memory 20.87 GiB is allocated by PyTorch, and 382.89 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)"

[screenshot attached]
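
As an aside, if anyone wants to try the mitigation the error message itself suggests (PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True), it would be set via the container env in the manifest; this is a sketch only, and whether it actually avoids this particular OOM is untested:

    containers:
      - name: llm
        env:
          - name: PYTORCH_CUDA_ALLOC_CONF
            value: expandable_segments:True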

raushan2016 (Member)

I even tried with g2-standard-48 and it still keeps on crashing. Can you please run some tests to validate?

raushan2016 (Member)

For all logs, feel free to request access. But the error above should be enough to find the issue and diff between the TGI default and TGI DLC images, as the only difference is the image and mountPath:
https://docs.google.com/spreadsheets/d/1hKZP9X2ueP-Zvnb9zIXMLk6LeGXvfr-mGR6z-NGfA3s/edit?gid=1556804789#gid=1556804789

raushan2016 (Member) · Jan 9, 2025

It worked with a2-highgpu-2g, which has 40 GB (A100) of GPU memory. Which means there is something not right with the DLC image, and it breaks running LLAMA3-70B on L4 GPUs.
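
In manifest terms, that workaround amounts to scheduling onto A100 nodes instead of L4, e.g. something along these lines (a sketch only; the exact nodeSelector value and GPU count used by the llama3-70b sample are assumptions here):

    nodeSelector:
      cloud.google.com/gke-accelerator: nvidia-tesla-a100
    containers:
      - name: llm
        resources:
          limits:
            nvidia.com/gpu: "2"   # a2-highgpu-2g exposes 2 x A100 40GB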

alvarobartt (Contributor, Author)

Thanks for the detailed report @raushan2016, let me run some tests on our end to investigate and I'll ping you as soon as those are completed!

alvarobartt (Contributor, Author)

Sure, sounds fair. I'll try to investigate the issue today! Thanks for your time!

raushan2016 (Member)

@alvarobartt Somehow things are working now. Maybe the issue was fixed in the image, since the image is pointing to latest?

alvarobartt (Contributor, Author)

Thanks for the confirmation, but yes, that is odd indeed, because AFAIK us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu124.2-3.ubuntu2204.py311 was released some time ago (not sure if they rolled an update over that image, but it doesn't look like it, as per https://cloud.google.com/deep-learning-containers/docs/choosing-container#hugging-face).

raushan2016 (Member)

Sorry, for LLAMA3 this image works: us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu121.2-1.ubuntu2204.py310

But the current image, us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu124.2-3.ubuntu2204.py311, doesn't work.

So it's something related to the changed TGI or CUDA version. It would really be helpful if you can look into this.
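
In the manifest that is just a matter of pinning the image field to the older DLC tag, e.g. (a sketch; it assumes the container is named llm as in the samples, and everything else stays the same):

    containers:
      - name: llm
        image: us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu121.2-1.ubuntu2204.py310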

raushan2016 (Member)

Here is the PR with the right images, which are working: #1591

@@ -51,7 +51,10 @@ spec:
volumeMounts:
- mountPath: /dev/shm
name: dshm
- mountPath: /data
raushan2016 (Member)

@alvarobartt The Falcon model has this error. Tried a couple of different versions > 2; all of them failed.

2025-01-16T23:04:59.120688Z ERROR text_generation_launcher: Error when initializing model
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 106, in serve
    server.serve(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 297, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
> File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 231, in serve_inner
    model = get_model(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 714, in get_model
    return FlashRWSharded(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_rw.py", line 77, in __init__
    model = FlashRWForCausalLM(config, weights)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 659, in __init__
    self.transformer = FlashRWModel(config, weights)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 593, in __init__
    [
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 594, in <listcomp>
    FlashRWLargeLayer(layer_id, config, weights)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 525, in __init__
    self.ln_layer = FlashRWLayerNorm(config, prefix, weights)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 484, in __init__
    self.num_ln = config.num_ln_in_parallel_attn
  File "/opt/conda/lib/python3.10/site-packages/transformers/configuration_utils.py", line 264, in __getattribute__
    return super().__getattribute__(key)
AttributeError: 'RWConfig' object has no attribute 'num_ln_in_parallel_attn'
2025-01-16T23:04:59.160221Z ERROR text_generation_launcher: Error when initializing model

raushan2016 (Member)

Found an issue, but it was resolved > 6 months back: huggingface/text-generation-inference#2349

Labels: None yet
Projects: None yet
2 participants