Issue with GPU memory allocation for pangu and graphcast #48

Open · Lucas-Hardy opened this issue Jul 5, 2024 · 5 comments

@Lucas-Hardy

Hi

I'm trying to set up GraphCast and Pangu to run on a 3060 12GB GPU and am getting memory allocation errors for both models.

Pangu:

2024-07-05 14:59:18,484 INFO Writing results to pangu_outputs/20240626_1200_6h_pangu.grib
2024-07-05 14:59:18,485 INFO Loading pressure fields from CDS
2024-07-05 14:59:18,814 INFO Loading surface fields from CDS
2024-07-05 14:59:18,840 INFO Using device 'GPU'. The speed of inference depends greatly on the device.
2024-07-05 14:59:18,840 INFO ONNXRuntime providers: ['CUDAExecutionProvider', 'CPUExecutionProvider']
2024-07-05 14:59:24,177 INFO Loading pangu_assets/pangu_weather_24.onnx: 5 seconds.
2024-07-05 14:59:29,314 INFO Loading pangu_assets/pangu_weather_6.onnx: 5 seconds.
2024-07-05 14:59:29,822 INFO Writing step 0: 0.5 second.
2024-07-05 14:59:29,822 INFO Model initialisation: 11 seconds
2024-07-05 14:59:29,822 INFO Starting inference for 1 steps (6h).
2024-07-05 14:59:30.391755898 [E:onnxruntime:, sequential_executor.cc:516 ExecuteKernel] Non-zero status code returned while running FusedMatMul node. Name:'/b1/MatMul/MatmulTransposeFusion//MatMulScaleFusion/' Status Message: /onnxruntime_src/onnxruntime/core/framework/bfc_arena.cc:376 void* onnxruntime::BFCArena::AllocateRawInternal(size_t, bool, onnxruntime::Stream*, bool, onnxruntime::WaitNotificationFn) Failed to allocate memory for requested buffer of size 1851310080

2024-07-05 14:59:30,391 INFO Elapsed: 0.6 second.
2024-07-05 14:59:30,391 INFO Average: 0.6 second per step.
2024-07-05 14:59:30,391 INFO Total time: 11 seconds.
Traceback (most recent call last):
  File "/home/ock/anaconda3/envs/ai-models-pangu/bin/ai-models", line 8, in <module>
    sys.exit(main())
  File "/home/ock/anaconda3/envs/ai-models-pangu/lib/python3.10/site-packages/ai_models/__main__.py", line 358, in main
    _main(sys.argv[1:])
  File "/home/ock/anaconda3/envs/ai-models-pangu/lib/python3.10/site-packages/ai_models/__main__.py", line 306, in _main
    run(vars(args), unknownargs)
  File "/home/ock/anaconda3/envs/ai-models-pangu/lib/python3.10/site-packages/ai_models/__main__.py", line 331, in run
    model.run()
  File "/home/ock/anaconda3/envs/ai-models-pangu/lib/python3.10/site-packages/ai_models_panguweather/model.py", line 107, in run
    output, output_surface = ort_session_6.run(
  File "/home/ock/anaconda3/envs/ai-models-pangu/lib/python3.10/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 220, in run
    return self._sess.run(output_names, input_feed, run_options)
onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Non-zero status code returned while running FusedMatMul node. Name:'/b1/MatMul/MatmulTransposeFusion//MatMulScaleFusion/' Status Message: /onnxruntime_src/onnxruntime/core/framework/bfc_arena.cc:376 void* onnxruntime::BFCArena::AllocateRawInternal(size_t, bool, onnxruntime::Stream*, bool, onnxruntime::WaitNotificationFn) Failed to allocate memory for requested buffer of size 1851310080
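
The Pangu failure is ONNX Runtime's BFC arena refusing a ~1.85 GB buffer on top of what the two loaded sessions already hold. One thing worth trying is CUDA provider options that make the arena less greedy. ai-models-panguweather builds its sessions internally, so the following is only a sketch of the onnxruntime API as one might patch it into model.py, not an ai-models option; the 10 GiB cap is an illustrative value for a 12 GB card:

import onnxruntime as ort

# Cap the CUDA memory arena and grow it only as much as each request
# needs, instead of the default power-of-two extension.
providers = [
    ("CUDAExecutionProvider", {
        "gpu_mem_limit": 10 * 1024**3,                # assumed cap for a 12 GB card
        "arena_extend_strategy": "kSameAsRequested",  # avoid over-allocation
    }),
    "CPUExecutionProvider",
]
session = ort.InferenceSession("pangu_assets/pangu_weather_24.onnx", providers=providers)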

GraphCast:

2024-07-05 14:42:35.208814: I external/xla/xla/stream_executor/cuda/cuda_driver.cc:1558] failed to allocate 2.97GiB (3189473280 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-07-05 14:42:35,226 INFO Doing full rollout prediction in JAX: 57 seconds.
2024-07-05 14:42:35,226 INFO Total time: 1 minute.
jax.errors.SimplifiedTraceback: For simplicity, JAX has removed its internal frames from the traceback of the following exception. Set JAX_TRACEBACK_FILTERING=off to include these.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/ock/anaconda3/envs/ai-models-graphcast/bin/ai-models", line 8, in <module>
    sys.exit(main())
  File "/home/ock/anaconda3/envs/ai-models-graphcast/lib/python3.10/site-packages/ai_models/__main__.py", line 358, in main
    _main(sys.argv[1:])
  File "/home/ock/anaconda3/envs/ai-models-graphcast/lib/python3.10/site-packages/ai_models/__main__.py", line 306, in _main
    run(vars(args), unknownargs)
  File "/home/ock/anaconda3/envs/ai-models-graphcast/lib/python3.10/site-packages/ai_models/__main__.py", line 331, in run
    model.run()
  File "/home/ock/anaconda3/envs/ai-models-graphcast/lib/python3.10/site-packages/ai_models_graphcast/model.py", line 240, in run
    output = self.model(
  File "/home/ock/anaconda3/envs/ai-models-graphcast/lib/python3.10/site-packages/ai_models_graphcast/model.py", line 114, in <lambda>
    return lambda **kw: fn(**kw)[0]
MemoryError: std::bad_alloc

I am using CUDA 12.4 with Pangu and CUDA 12.3 with GraphCast; I also tried CUDA 11, but it does not recognise my GPU. I am using cudnn=8.9.7.29. I have also tried setting XLA_PYTHON_CLIENT_PREALLOCATE=false, setting XLA_PYTHON_CLIENT_MEM_FRACTION to smaller values, and setting XLA_PYTHON_CLIENT_ALLOCATOR=platform. The models also run fine on the CPU, just very slowly. Is there a fix for this, or is it simply that my GPU does not have enough VRAM?
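
For what it's worth, those XLA flags only take effect if they are in the environment before JAX initialises its GPU backend, i.e. before the first jax import in the process. A minimal sketch of a wrapper script (the 0.5 fraction is just an illustrative value):

# Set the allocator flags before anything imports jax.
import os

os.environ["XLA_PYTHON_CLIENT_PREALLOCATE"] = "false"  # no big upfront arena
os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = "0.5"   # illustrative fraction
# os.environ["XLA_PYTHON_CLIENT_ALLOCATOR"] = "platform"  # allocate on demand, slower

import jax  # the backend reads the flags above on first use
print(jax.devices())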

Thanks.

@decadeneo

I ran into the same problem with Pangu today on my device (4060, 32G GPU), but yesterday the model worked. So weird.

@decadeneo

I tried creating a new Python environment: after pip install ai-models, I installed onnxruntime via conda, suspecting that the issue might be related to the numpy version (the version that runs smoothly for me is 2.0.0).
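
A quick way to compare a working environment against a failing one is to print the versions it actually resolved to (nothing here is specific to ai-models):

# Report package versions and ONNX Runtime providers in the
# active environment, for comparing good and bad installs.
import numpy
import onnxruntime

print("numpy:", numpy.__version__)
print("onnxruntime:", onnxruntime.__version__)
print("providers:", onnxruntime.get_available_providers())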

@YUTAIPAN

I also ran into insufficient GPU memory. I wonder whether there is any way to decrease the batch size when doing the prediction.

@YUTAIPAN

I fixed the problem: you need a single GPU with large enough memory.

@YUTAIPAN

For Pangu, 27 GiB of GPU memory is required.
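
Taking that figure at face value, it is easy to check how much VRAM is actually free before launching. A minimal sketch using the NVML bindings (pip install nvidia-ml-py; the 27 GiB threshold is just the number quoted above):

# Query free/total memory on GPU 0 via NVML.
from pynvml import nvmlDeviceGetHandleByIndex, nvmlDeviceGetMemoryInfo, nvmlInit

nvmlInit()
info = nvmlDeviceGetMemoryInfo(nvmlDeviceGetHandleByIndex(0))
free_gib = info.free / 2**30
print(f"free: {free_gib:.1f} GiB / total: {info.total / 2**30:.1f} GiB")
if free_gib < 27:  # threshold from the comment above
    print("Probably not enough VRAM to run Pangu on this GPU.")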
