gpu_fdinfo bug fixes, almost full support for Intel is finally here #1499
src/gpu_fdinfo.h
@@ -67,16 +67,28 @@ class GPU_fdinfo {
    : module(module)
    , pci_dev(pci_dev)
{
    SPDLOG_DEBUG("GPU driver is \"{}\"", module);
    SPDLOG_INFO("GPU driver is \"{}\"", module);
We don't want to print things on every launch unless something unexpected happens; this should be debug.
The GPU driver is printed only once per GPU.
As mentioned earlier, the problem is that nothing is printed just once when launched from Steam etc.; it will spam the logs.
src/gpu_fdinfo.h
    fdinfo_data.size() > 0 &&
    fdinfo_data[0].find(drm_memory_type) == fdinfo_data[0].end()
) {
    SPDLOG_INFO(
Should also be debug
Also printed only once per gpu
Since static variables are shared across all instances, two GPUs end up storing their power usage at the same memory location, which breaks the power usage calculations. The same applies to get_gpu_load.
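A minimal sketch of the failure mode described above. `BrokenGpu` and `FixedGpu` are hypothetical classes written for illustration, not MangoHud code; the point is only that a function-local `static` is a single object shared by every instance, while a plain data member is per instance.

```cpp
#include <cstdint>

// Hypothetical illustration, not MangoHud code.
struct BrokenGpu {
    // Returns the energy consumed since the previous call.
    std::uint64_t power_delta(std::uint64_t energy_now) {
        static std::uint64_t previous = 0;  // ONE copy for all BrokenGpu objects
        std::uint64_t delta = energy_now - previous;
        previous = energy_now;
        return delta;
    }
};

struct FixedGpu {
    std::uint64_t previous = 0;             // one copy per GPU instance

    std::uint64_t power_delta(std::uint64_t energy_now) {
        std::uint64_t delta = energy_now - previous;
        previous = energy_now;
        return delta;
    }
};
```

With two `BrokenGpu` objects, the second GPU's first reading is subtracted from the first GPU's last reading, so its delta is meaningless (and wraps around if the counter is smaller). With `FixedGpu`, each object only ever sees its own history.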
Now opens only unique fdinfo file descriptors to avoid duplicates; otherwise GPU usage, VRAM usage, and so on can be counted two, three, or four times over.
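One way to sketch the de-duplication, under the assumption that duplicates are recognised by the `drm-client-id` key that DRM drivers report in fdinfo. The commit message does not say which key the PR uses, so `FdinfoEntry` and `unique_clients` are hypothetical names for illustration only.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical record for one /proc/<pid>/fdinfo/<fd> entry.
struct FdinfoEntry {
    std::string path;       // e.g. "/proc/self/fdinfo/7"
    std::string client_id;  // value of the "drm-client-id" key (assumed key)
};

// Keep only the first entry seen for each client id, so a DRM client
// reachable through several duplicated fds is counted once.
std::vector<FdinfoEntry> unique_clients(const std::vector<FdinfoEntry>& entries) {
    std::set<std::string> seen;
    std::vector<FdinfoEntry> result;
    for (const auto& entry : entries) {
        // insert() reports via .second whether the id was new.
        if (seen.insert(entry.client_id).second)
            result.push_back(entry);
    }
    return result;
}
```

If two fds report the same client id, summing both would double every counter for that client, which is the inflation the commit message describes.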
I wrote an explanation below of why the current method is the correct one. But really, you can ignore that, because I have a better reason: Intel's engineers do it this way in their gputop program. So I just copied them, because they know better.
https://gitlab.freedesktop.org/drm/igt-gpu-tools/-/blob/master/tools/gputop.c#L249
========================================================================
Currently xe load is calculated like this:
    sum_of_all_deltas_cycles / sum_of_all_deltas_total_cycles
This is fine as long as you have one fd open. With more than one fd, it leads to incorrect results. Imagine this scenario:
    fd1: delta_cycles = 3152315 delta_total_cycles = 9611144 fd_load = 0.327985409 = 33%
    fd2: delta_cycles = 1132858 delta_total_cycles = 9607938 fd_load = 0.117908546 = 12%
    Total load: 33 + 12 = 45%
If you calculated this the old way, you would get:
    (3152315 + 1132858) / (9611144 + 9607938) = 0.2229645 = 22%
Co-authored-by: Ibrahim Ansari <[email protected]>
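The two ways of combining per-fd samples can be compared with a small sketch. `FdSample`, `load_sum_of_ratios`, and `load_ratio_of_sums` are hypothetical names chosen for this example; the numbers are the ones from the commit message.

```cpp
#include <cstdint>
#include <vector>

// One fd's counters between two samples (hypothetical struct).
struct FdSample {
    std::uint64_t delta_cycles;
    std::uint64_t delta_total_cycles;
};

// Per-fd method: compute each fd's load, then add the loads.
double load_sum_of_ratios(const std::vector<FdSample>& fds) {
    double load = 0.0;
    for (const auto& fd : fds) {
        if (fd.delta_total_cycles != 0)
            load += static_cast<double>(fd.delta_cycles) /
                    static_cast<double>(fd.delta_total_cycles);
    }
    return load;
}

// Old method: sum numerators and denominators first, then divide once.
double load_ratio_of_sums(const std::vector<FdSample>& fds) {
    std::uint64_t cycles = 0;
    std::uint64_t total = 0;
    for (const auto& fd : fds) {
        cycles += fd.delta_cycles;
        total += fd.delta_total_cycles;
    }
    return total != 0 ? static_cast<double>(cycles) / static_cast<double>(total)
                      : 0.0;
}
```

For the two samples above, the per-fd method gives roughly 0.446 and the old method roughly 0.223. Because every fd observes about the same `delta_total_cycles` over the interval, adding the denominators approximately divides the result by the number of fds.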
Also change the level of some messages in gpu.cpp from debug to info. I think users deserve to see a little more useful info, which might also help during troubleshooting. This is what I get, for example:
    [MANGOHUD] [error] [cpu.cpp:675] Failed to initialize CPU power data
    [MANGOHUD] [info] [gl_renderer.cpp:422] GL version: 4.6
    [MANGOHUD] [error] [cpu.cpp:675] Failed to initialize CPU power data
    [MANGOHUD] [info] [gpu_fdinfo.h:69] GPU driver is "xe"
    [MANGOHUD] [info] [gpu.cpp:69] GPU Found: node_name: renderD128, vendor_id: 8086 device_id: 56a0 pci_dev: 0000:03:00.0
I don't think this can be considered too much info.
This commit removes the hardcoding of specific hwmon sensor IDs to that of i915 and Xe KMD, thus making `find_intel_hwmon` vendor-independent. Instead, it iterates through available hwmon sensors and selects the first available sensor. This should allow fan speed and temp monitoring to work out of the box on Xe DRM in the future as well, once Xe DRM exposes the necessary interfaces via hwmon. Co-authored-by: Ibrahim Ansari <[email protected]>
stop being so mean to intel, they can now display fan speeds too since 6.12 kernel.
If the user has a dual-GPU setup and wants to select only one GPU, they can use either pci_dev or gpu_list.
If both pci_dev and gpu_list are specified, use only gpu_list and print a warning.
Since some code still relies on the active GPU (FPS logging, throttling, and the VRAM graph), let the user select the active GPU. If no GPU is active, pick the last one from the list of available GPUs.
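The selection rules can be sketched as two small functions. Everything here is hypothetical: `Options`, `Selection`, `select_gpus`, and `pick_active` are illustration-only names, and treating `gpu_list` as a list of indices into the detected GPUs is an assumption, not something the commit message states.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

struct Options {
    std::optional<std::string> pci_dev;  // e.g. "0000:03:00.0"
    std::vector<std::size_t> gpu_list;   // assumed: indices of detected GPUs
};

struct Selection {
    std::vector<std::size_t> shown;      // indices of GPUs to display
    bool warn_both_set = false;          // caller should log a warning
};

Selection select_gpus(const std::vector<std::string>& detected_pci,
                      const Options& opt) {
    Selection sel;
    if (!opt.gpu_list.empty()) {
        // gpu_list wins; remember to warn if pci_dev was given as well.
        sel.warn_both_set = opt.pci_dev.has_value();
        for (std::size_t idx : opt.gpu_list)
            if (idx < detected_pci.size())
                sel.shown.push_back(idx);
    } else if (opt.pci_dev) {
        for (std::size_t i = 0; i < detected_pci.size(); ++i)
            if (detected_pci[i] == *opt.pci_dev)
                sel.shown.push_back(i);
    } else {
        for (std::size_t i = 0; i < detected_pci.size(); ++i)
            sel.shown.push_back(i);
    }
    return sel;
}

// Use the requested GPU if it is among the shown ones; otherwise fall
// back to the last GPU in the list. Empty list means no active GPU.
std::optional<std::size_t> pick_active(const std::vector<std::size_t>& shown,
                                       std::optional<std::size_t> requested) {
    if (requested)
        for (std::size_t idx : shown)
            if (idx == *requested)
                return idx;
    if (shown.empty())
        return std::nullopt;
    return shown.back();
}
```

Keeping the warning as a flag rather than logging inside the function makes the rule easy to test without a logger.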
Resolves #1082
Resolves #1454
First, I would like to thank @PerAstraAdDeum, @nokia8801 and @retrixe for testing patches and helping with development.
In short, this pr fixes:
pci_dev or gpu_list
To summarize: pretty much every vital metric is available now for intel.
struct gpu_metrics:
Example metrics (using linux kernel 6.13):