gpu_fdinfo bug fixes, almost full support for Intel is finally here #1499

17314642 · 2024-12-07T22:16:12Z

Resolves #1082
Resolves #1454

First I would like to thank you these guys for testing patches: @PerAstraAdDeum, @nokia8801 and @retrixe for helping in development.

In short, this pr fixes:

wrong power usage for some people (MangoHud Intel GPU metrics #1082 (comment))
wrong gpu usage in some cases
vram support for integrated gpus in xe (forgot about those 😿)
1 million watt gpu power at start (which confused one user when testing)
some info messages.
gpu clock support for i915 and xe
fan speed support (tested only on i915 and xe, should have linux 6.12+)
generic hwmon support (tested only on i915 and xe)
throttling support (only i915 and xe)
temperature support (only i915, should have linux 6.13+)
shows all gpus by default, otherwise user can choose which gpus he wants to see by using pci_dev or gpu_list

To summarize: pretty much every vital metric is available now for intel.

struct gpu_metrics:

Example metrics (using linux kernel 6.13):

flightlessmango · 2024-12-15T22:37:06Z

src/gpu_fdinfo.h

@@ -67,16 +67,28 @@ class GPU_fdinfo {
        : module(module)
        , pci_dev(pci_dev)
    {
-        SPDLOG_DEBUG("GPU driver is \"{}\"", module);
+        SPDLOG_INFO("GPU driver is \"{}\"", module);


We don't want to print things every time unless it's not expected, this should be debug

Gpu driver is printend only once per every gpu

As mentioned earlier, the problem is that nothing is mentioned once when launched from steam etc, it will spam the logs

flightlessmango · 2024-12-15T22:37:15Z

src/gpu_fdinfo.h

+                fdinfo_data.size() > 0 &&
+                fdinfo_data[0].find(drm_memory_type) == fdinfo_data[0].end()
+            ) {
+                SPDLOG_INFO(


Should also be debug

Also printed only once per gpu

src/gpu.cpp

src/hud_elements.cpp

since static variables are shared across all instances, this leads to cases like this: two gpus store their power usage at the same memory location, and power usage calculations become broken. same thing for get_gpu_load

now opens only unique fdinfo file descriptors to avoid duplicates otherwise this can lead to double or triple or quadruple gpu usage, vram usage, and so on...

I wrote an explanation down below why current method is the correct one. But really, you can ignore that, because I have a better reason. Because Intel's engineers do like this in their gputop program. So I just copied them, because they know better. https://gitlab.freedesktop.org/drm/igt-gpu-tools/-/blob/master/tools/gputop.c#L249 ======================================================================== Currently xe load is calculated like this: sum_of_all_deltas_cycles / sum_of_all_deltas_total_cycles This is fine as long as you have on fd open. But if not, this will lead to incorrect results. Imagine this scenario: fd1: delta_cycles = 3152315 delta_total_cycles = 9611144 fd_load = 0.327985409 = 33% fd2: delta_cycles = 1132858 delta_total_cycles = 9607938 fd_load = 0.117908546 = 12% Total load: 33 + 12 = 45% If you calculated this the old way, you would get: (3152315 / 1132858) / (9611144 / 9607938) = 0.2229645 = 22% Co-authored-by: Ibrahim Ansari <[email protected]>

and change some messages level from debug to info in gpu.cpp i think users deserve to see a little bit more of useful info, which also might be actually useful during troubleshooting. This is what I get for example: [MANGOHUD] [error] [cpu.cpp:675] Failed to initialize CPU power data [MANGOHUD] [info] [gl_renderer.cpp:422] GL version: 4.6 [MANGOHUD] [error] [cpu.cpp:675] Failed to initialize CPU power data [MANGOHUD] [info] [gpu_fdinfo.h:69] GPU driver is "xe" [MANGOHUD] [info] [gpu.cpp:69] GPU Found: node_name: renderD128, vendor_id: 8086 device_id: 56a0 pci_dev: 0000:03:00.0 I don't think that this can be considered too much info.

This commit removes the hardcoding of specific hwmon sensor IDs to that of i915 and Xe KMD, thus making `find_intel_hwmon` vendor-independent. Instead, it iterates through available hwmon sensors and selects the first available sensor. This should allow fan speed and temp monitoring to work out of the box on Xe DRM in the future as well, once Xe DRM exposes the necessary interfaces via hwmon. Co-authored-by: Ibrahim Ansari <[email protected]>

stop being so mean to intel, they can now display fan speeds too since 6.12 kernel.

if user has dual gpu setup and wants to select only one gpu, he can use either pci_dev or gpu_list if both pci_dev and gpu_list are specified, use only gpu_list and print a warning since some code still relies on active gpu like fps logging, throttling and vram graph, make user be able to select active gpu. if no gpu is active, pick last from list of available gpus.

17314642 changed the title ~~bug fixes and support for integrated intel gpus~~ gpu_fdinfo bug fixes and support for integrated intel gpus Dec 7, 2024

17314642 force-pushed the master branch 3 times, most recently from 14d5c11 to ffd9c19 Compare December 8, 2024 00:09

17314642 mentioned this pull request Dec 8, 2024

MangoHud Intel GPU metrics #1082

Closed

17314642 force-pushed the master branch from 0a0527c to a81d9e0 Compare December 8, 2024 01:59

17314642 changed the title ~~gpu_fdinfo bug fixes and support for integrated intel gpus~~ gpu_fdinfo bug fixes gpu clock for i915 and xe, support for integrated intel gpus Dec 8, 2024

17314642 changed the title ~~gpu_fdinfo bug fixes gpu clock for i915 and xe, support for integrated intel gpus~~ gpu_fdinfo bug fixes, gpu clock for i915 and xe, support for integrated intel gpus Dec 8, 2024

17314642 changed the title ~~gpu_fdinfo bug fixes, gpu clock for i915 and xe, support for integrated intel gpus~~ gpu_fdinfo bug fixes, almost full support for Intel is finally here Dec 10, 2024

17314642 force-pushed the master branch 7 times, most recently from a565fd0 to f5d7c0c Compare December 15, 2024 03:45