-
Notifications
You must be signed in to change notification settings - Fork 50
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow Nvidia driver script to set recommendations for LD_PRELOAD
#754
Allow Nvidia driver script to set recommendations for LD_PRELOAD
#754
Conversation
Instance
|
Instance
|
Instance
|
Example output: [rocky@ip-172-31-27-81 software-layer]$ ./scripts/gpu_support/nvidia/link_nvidia_host_libraries.sh --ld-preload --no-download
Found NVIDIA GPU driver version 545.23.08
Found host CUDA version 12.3
Using default list of libraries
Matched 48 CUDA Libraries
When attempting to use LD_PRELOAD we exclude anything related to graphics
libXext.so.6 is NOT in the provided preload list, filtering /lib64/libGL.so.1.
libXext.so.6 is NOT in the provided preload list, filtering /lib64/libGL.so.
libXext.so.6 is NOT in the provided preload list, filtering /lib64/libGLX_nvidia.so.0.
libXext.so.6 is NOT in the provided preload list, filtering /lib64/libGLX.so.0.
libXext.so.6 is NOT in the provided preload list, filtering /lib64/libGLX.so.
libwayland-server.so.0 is NOT in the provided preload list, filtering /lib64/libnvidia-egl-wayland.so.1.
libXext.so.6 is NOT in the provided preload list, filtering /lib64/libnvidia-fbc.so.1.
libXext.so.6 is NOT in the provided preload list, filtering /lib64/libnvidia-fbc.so.
libXNVCtrl.so.0 is NOT in the provided preload list, filtering /lib64/libnvidia-gtk3.so.545.23.08.
The recommended way to use LD_PRELOAD is to only use it when you need to:
export EESSI_GPU_LD_PRELOAD="/lib64/libcuda.so.1:/lib64/libcuda.so:/lib64/libcudadebugger.so.1:/lib64/libnvcuvid.so.1:/lib64/libnvcuvid.so:/lib64/libnvidia-cfg.so.1:/lib64/libnvidia-cfg.so:/lib64/libnvidia-eglcore.so.545.23.08:/lib64/libnvidia-encode.so.1:/lib64/libnvidia-encode.so:/lib64/libnvidia-glcore.so.545.23.08:/lib64/libnvidia-glsi.so.545.23.08:/lib64/libnvidia-glvkspirv.so.545.23.08:/lib64/libnvidia-gpucomp.so.545.23.08:/lib64/libnvidia-ml.so.1:/lib64/libnvidia-ml.so:/lib64/libnvidia-nvvm.so.4:/lib64/libnvidia-nvvm.so:/lib64/libnvidia-opencl.so.1:/lib64/libnvidia-opticalflow.so.1:/lib64/libnvidia-ptxjitcompiler.so.1:/lib64/libnvidia-ptxjitcompiler.so:/lib64/libnvidia-rtcore.so.545.23.08:/lib64/libnvidia-tls.so.545.23.08:/lib64/libnvoptix.so.1:/lib64/libOpenCL.so.1"
export EESSI_OVERRIDE_GPU_CHECK="1"
Then you can set LD_PRELOAD only when you want to run a GPU application, e.g.,
LD_PRELOAD="$EESSI_GPU_LD_PRELOAD" device_query |
@ocaisa There's duplicate entries here, |
# Filter out all symlinks and libraries that have missing library dependencies under EESSI | ||
filtered_libraries=() | ||
for library in "${matched_libraries[@]}"; do | ||
if [ ! -L "$library" ]; then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is too aggressive, instead we should just resolve the symlink and remove duplicate entries
bot: build repo:eessi.io-2023.06-software arch:x86_64/generic |
Updates by the bot instance
|
Updates by the bot instance
|
Updates by the bot instance
|
New job on instance
|
LD_PRELOAD
Also tested the script within
|
Accepted all except one Co-authored-by: TopRichard <[email protected]>
@TopRichard This will need to be re-tested now to make sure the changes haven't had an unintended impact |
bot: build repo:eessi.io-2023.06-software arch:x86_64/generic |
Updates by the bot instance
|
Updates by the bot instance
|
Updates by the bot instance
|
New job on instance
|
re-testing: Apptainer> /cvmfs/software.eessi.io/versions/2023.06/scripts/gpu_support/nvidia/link_nvidia_host_libraries.sh
Found host CUDA version 9.0
Found NVIDIA GPU driver version 535.129.03
Using downloaded list of libraries
Matched 41 CUDA Libraries
Successfully created symlink between latest and host in /cvmfs/software.eessi.io/host_injections/nvidia/aarch64
Successfully created symlink between /cvmfs/software.eessi.io/host_injections/nvidia/aarch64/latest and lib in /cvmfs/software.eessi.io/host_injections/2023.06/compat/linux/aarch64
Host NVIDIA GPU drivers linked successfully for EESSI |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know this has just been deployed, so this review comes way too late, but I had a bunch of draft remarks that I somehow never submitted.
Take it as input for a follow-up PR...
@@ -1,144 +1,416 @@ | |||
#!/bin/bash | |||
|
|||
# This script links host libraries related to GPU drivers to a location where | |||
# they can be found by the EESSI linker | |||
# they can be found by the EESSI linker (or sets LD_PRELOAD as an |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't set $LD_PRELOAD
, it prints how $EESSI_*LD_PRELOAD
can be set
|
||
get_nvlib_list() { | ||
local nvliblist_url="https://raw.githubusercontent.com/apptainer/apptainer/main/etc/nvliblist.conf" | ||
local default_nvlib_list=( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add a comment here mentioning how this list was compiled, and with which driver version this current list corresponds?
"tls_test_.so" | ||
) | ||
|
||
# Check if the function was called with the "default" argument |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit-picking: please fix indent of the comment
|
||
# Check if curl failed (i.e., the content is empty) | ||
if [ -z "$nvliblist_content" ]; then | ||
# Failed to download nvliblist.conf, using default list instead |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we print a warning here (via echo_yellow
) that the download failed, and that we're doing a fallback?
|
||
# Check if umask allows global read | ||
if [ "$umask_octal" -gt 022 ]; then | ||
fatal_error "The current umask ($current_umask) does not allow global read permissions, you'll want everyone to be able to read the created directory." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make a suggestion on how to fix this in the error message?
Also, can't we set $UMASK
here to dance around this ourselves?
filtered_libraries=() | ||
compat_filtered_libraries=() | ||
for library in "${matched_libraries[@]}"; do | ||
# Run ldd on the given binary and filter for "not found" libraries |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not binary
, but library
compat_filtered_libraries=() | ||
for library in "${matched_libraries[@]}"; do | ||
# Run ldd on the given binary and filter for "not found" libraries | ||
not_found_libs=$(ldd "$library" 2>/dev/null | grep "not found" | awk '{print $1}') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should avoid relying on awk
, it may not be there, cut
is way more likely to be there
# Find the host ldconfig | ||
host_ldconfig=$(get_host_ldconfig) | ||
# Gather libraries on the host (_must_ be host ldconfig) | ||
host_libraries=$($host_ldconfig -p | awk '{print $NF}') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we can, we should avoid relying on awk
, it may not be installed, use cut
instead?
# Check if it is missing an so dep under EESSI | ||
if [[ -z "$not_found_libs" ]]; then | ||
# Resolve any symlink | ||
realpath_library=$(realpath "$library") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a small chance that realpath
is not available, so perhaps we should check for this early and exit with an error if it's not available
# Resolve any symlink | ||
realpath_library=$(realpath "$library") | ||
if [[ ! " ${filtered_libraries[@]} " =~ " $realpath_library " ]]; then | ||
filtered_libraries+=("$realpath_library") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
line above assumes there's spaces?
filtered_libraries+=("$realpath_library") | |
filtered_libraries+=(" ${realpath_library} ") |
@bedroge This was deployed, so PR should be merged too? |
Yes, the tarball has been ingested. |
This has already been deployed, requested changes can be done in a follow-up PR.
PR merged! Moved |
PR merged! Moved |
No description provided.