Issue with GPU memory allocation for pangu and graphcast #48

Open · Lucas-Hardy opened this issue Jul 5, 2024 · 5 comments

@Lucas-Hardy

Hi

I'm trying to set up GraphCast and Pangu to run on a 3060 12GB GPU and am getting memory allocation errors for both models.

Pangu:

2024-07-05 14:59:18,484 INFO Writing results to pangu_outputs/20240626_1200_6h_pangu.grib
2024-07-05 14:59:18,485 INFO Loading pressure fields from CDS
2024-07-05 14:59:18,814 INFO Loading surface fields from CDS
2024-07-05 14:59:18,840 INFO Using device 'GPU'. The speed of inference depends greatly on the device.
2024-07-05 14:59:18,840 INFO ONNXRuntime providers: ['CUDAExecutionProvider', 'CPUExecutionProvider']
2024-07-05 14:59:24,177 INFO Loading pangu_assets/pangu_weather_24.onnx: 5 seconds.
2024-07-05 14:59:29,314 INFO Loading pangu_assets/pangu_weather_6.onnx: 5 seconds.
2024-07-05 14:59:29,822 INFO Writing step 0: 0.5 second.
2024-07-05 14:59:29,822 INFO Model initialisation: 11 seconds
2024-07-05 14:59:29,822 INFO Starting inference for 1 steps (6h).
2024-07-05 14:59:30.391755898 [E:onnxruntime:, sequential_executor.cc:516 ExecuteKernel] Non-zero status code returned while running FusedMatMul node. Name:'/b1/MatMul/MatmulTransposeFusion//MatMulScaleFusion/' Status Message: /onnxruntime_src/onnxruntime/core/framework/bfc_arena.cc:376 void* onnxruntime::BFCArena::AllocateRawInternal(size_t, bool, onnxruntime::Stream*, bool, onnxruntime::WaitNotificationFn) Failed to allocate memory for requested buffer of size 1851310080

2024-07-05 14:59:30,391 INFO Elapsed: 0.6 second.
2024-07-05 14:59:30,391 INFO Average: 0.6 second per step.
2024-07-05 14:59:30,391 INFO Total time: 11 seconds.
Traceback (most recent call last):
  File "/home/ock/anaconda3/envs/ai-models-pangu/bin/ai-models", line 8, in <module>
    sys.exit(main())
  File "/home/ock/anaconda3/envs/ai-models-pangu/lib/python3.10/site-packages/ai_models/__main__.py", line 358, in main
    _main(sys.argv[1:])
  File "/home/ock/anaconda3/envs/ai-models-pangu/lib/python3.10/site-packages/ai_models/__main__.py", line 306, in _main
    run(vars(args), unknownargs)
  File "/home/ock/anaconda3/envs/ai-models-pangu/lib/python3.10/site-packages/ai_models/__main__.py", line 331, in run
    model.run()
  File "/home/ock/anaconda3/envs/ai-models-pangu/lib/python3.10/site-packages/ai_models_panguweather/model.py", line 107, in run
    output, output_surface = ort_session_6.run(
  File "/home/ock/anaconda3/envs/ai-models-pangu/lib/python3.10/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 220, in run
    return self._sess.run(output_names, input_feed, run_options)
onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Non-zero status code returned while running FusedMatMul node. Name:'/b1/MatMul/MatmulTransposeFusion//MatMulScaleFusion/' Status Message: /onnxruntime_src/onnxruntime/core/framework/bfc_arena.cc:376 void* onnxruntime::BFCArena::AllocateRawInternal(size_t, bool, onnxruntime::Stream*, bool, onnxruntime::WaitNotificationFn) Failed to allocate memory for requested buffer of size 1851310080
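
The Pangu failure is ONNX Runtime's BFC arena refusing a ~1.85 GB buffer on top of what the two loaded sessions already hold. One thing worth trying is CUDA provider options that make the arena less greedy. ai-models-panguweather builds its sessions internally, so the following is only a sketch of the onnxruntime API as one might patch it into model.py, not an ai-models option; the 10 GiB cap is an illustrative value for a 12 GB card:

import onnxruntime as ort

# Cap the CUDA memory arena and grow it only as much as each request
# needs, instead of the default power-of-two extension.
providers = [
    ("CUDAExecutionProvider", {
        "gpu_mem_limit": 10 * 1024**3,                # assumed cap for a 12 GB card
        "arena_extend_strategy": "kSameAsRequested",  # avoid over-allocation
    }),
    "CPUExecutionProvider",
]
session = ort.InferenceSession("pangu_assets/pangu_weather_24.onnx", providers=providers)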

GraphCast:

2024-07-05 14:42:35.208814: I external/xla/xla/stream_executor/cuda/cuda_driver.cc:1558] failed to allocate 2.97GiB (3189473280 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-07-05 14:42:35,226 INFO Doing full rollout prediction in JAX: 57 seconds.
2024-07-05 14:42:35,226 INFO Total time: 1 minute.
jax.errors.SimplifiedTraceback: For simplicity, JAX has removed its internal frames from the traceback of the following exception. Set JAX_TRACEBACK_FILTERING=off to include these.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/ock/anaconda3/envs/ai-models-graphcast/bin/ai-models", line 8, in <module>
    sys.exit(main())
  File "/home/ock/anaconda3/envs/ai-models-graphcast/lib/python3.10/site-packages/ai_models/__main__.py", line 358, in main
    _main(sys.argv[1:])
  File "/home/ock/anaconda3/envs/ai-models-graphcast/lib/python3.10/site-packages/ai_models/__main__.py", line 306, in _main
    run(vars(args), unknownargs)
  File "/home/ock/anaconda3/envs/ai-models-graphcast/lib/python3.10/site-packages/ai_models/__main__.py", line 331, in run
    model.run()
  File "/home/ock/anaconda3/envs/ai-models-graphcast/lib/python3.10/site-packages/ai_models_graphcast/model.py", line 240, in run
    output = self.model(
  File "/home/ock/anaconda3/envs/ai-models-graphcast/lib/python3.10/site-packages/ai_models_graphcast/model.py", line 114, in <lambda>
    return lambda **kw: fn(**kw)[0]
MemoryError: std::bad_alloc

I am using CUDA 12.4 with Pangu and CUDA 12.3 with GraphCast; I also tried CUDA 11, but it does not recognise my GPU. I am using cudnn=8.9.7.29. I have also tried setting XLA_PYTHON_CLIENT_PREALLOCATE=false, setting XLA_PYTHON_CLIENT_MEM_FRACTION to smaller values, and setting XLA_PYTHON_CLIENT_ALLOCATOR=platform. The models also run fine on the CPU, just very slowly. Is there a fix for this, or is it simply that my GPU does not have enough VRAM?
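
For what it's worth, those XLA flags only take effect if they are in the environment before JAX initialises its GPU backend, i.e. before the first jax import in the process. A minimal sketch of a wrapper script (the 0.5 fraction is just an illustrative value):

# Set the allocator flags before anything imports jax.
import os

os.environ["XLA_PYTHON_CLIENT_PREALLOCATE"] = "false"  # no big upfront arena
os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = "0.5"   # illustrative fraction
# os.environ["XLA_PYTHON_CLIENT_ALLOCATOR"] = "platform"  # allocate on demand, slower

import jax  # the backend reads the flags above on first use
print(jax.devices())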

Thanks.

@decadeneo

I ran into the same problem with Pangu today on my device (4060, 32G GPU), but yesterday the model worked. So weird.

@decadeneo

I tried creating a new Python environment: after pip install ai-models, I installed onnxruntime via conda, suspecting that the issue might be related to the numpy version (the version that runs smoothly for me is 2.0.0).
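
A quick way to compare a working environment against a failing one is to print the versions it actually resolved to (nothing here is specific to ai-models):

# Report package versions and ONNX Runtime providers in the
# active environment, for comparing good and bad installs.
import numpy
import onnxruntime

print("numpy:", numpy.__version__)
print("onnxruntime:", onnxruntime.__version__)
print("providers:", onnxruntime.get_available_providers())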

@YUTAIPAN

I also ran into insufficient GPU memory. I wonder whether there is any way to decrease the batch size when doing the prediction.

@YUTAIPAN

I fixed the problem: you need a single GPU with large enough memory.

@YUTAIPAN

For Pangu, 27 GiB of GPU memory is required.
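
Taking that figure at face value, it is easy to check how much VRAM is actually free before launching. A minimal sketch using the NVML bindings (pip install nvidia-ml-py; the 27 GiB threshold is just the number quoted above):

# Query free/total memory on GPU 0 via NVML.
from pynvml import nvmlDeviceGetHandleByIndex, nvmlDeviceGetMemoryInfo, nvmlInit

nvmlInit()
info = nvmlDeviceGetMemoryInfo(nvmlDeviceGetHandleByIndex(0))
free_gib = info.free / 2**30
print(f"free: {free_gib:.1f} GiB / total: {info.total / 2**30:.1f} GiB")
if free_gib < 27:  # threshold from the comment above
    print("Probably not enough VRAM to run Pangu on this GPU.")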
