Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gpu_fdinfo bug fixes, almost full support for Intel is finally here #1499

Merged
merged 12 commits into from
Dec 16, 2024

Conversation

17314642
Copy link
Contributor

@17314642 17314642 commented Dec 7, 2024

Resolves #1082
Resolves #1454

First I would like to thank you these guys for testing patches: @PerAstraAdDeum, @nokia8801 and @retrixe for helping in development.

In short, this pr fixes:

  1. wrong power usage for some people (MangoHud Intel GPU metrics #1082 (comment))
  2. wrong gpu usage in some cases
  3. vram support for integrated gpus in xe (forgot about those 😿)
  4. 1 million watt gpu power at start (which confused one user when testing)
  5. some info messages.
  6. gpu clock support for i915 and xe
  7. fan speed support (tested only on i915 and xe, should have linux 6.12+)
  8. generic hwmon support (tested only on i915 and xe)
  9. throttling support (only i915 and xe)
  10. temperature support (only i915, should have linux 6.13+)
  11. shows all gpus by default, otherwise user can choose which gpus he wants to see by using pci_dev or gpu_list

To summarize: pretty much every vital metric is available now for intel.

struct gpu_metrics:

  • int load;
  • int temp; (only i915, no xe support by intel yet)
  • int junction_temp {-1}; (isn't exposed by intel)
  • int memory_temp {-1}; (isn't exposed by intel)
  • float memoryUsed;
  • float memoryTotal; (isn't exposed by intel)
  • int MemClock; (isn't exposed by intel)
  • int CoreClock;
  • float powerUsage;
  • float apu_cpu_power;
  • int apu_cpu_temp;
  • bool is_power_throttled;
  • bool is_current_throttled;
  • bool is_temp_throttled;
  • bool is_other_throttled;
  • float gtt_used; (not used by mangohud)
  • int fan_speed; (only i915, no xe support by intel yet)
  • int voltage;

Example metrics (using linux kernel 6.13):
image

@17314642 17314642 changed the title bug fixes and support for integrated intel gpus gpu_fdinfo bug fixes and support for integrated intel gpus Dec 7, 2024
@17314642 17314642 force-pushed the master branch 3 times, most recently from 14d5c11 to ffd9c19 Compare December 8, 2024 00:09
@17314642 17314642 changed the title gpu_fdinfo bug fixes and support for integrated intel gpus gpu_fdinfo bug fixes gpu clock for i915 and xe, support for integrated intel gpus Dec 8, 2024
@17314642 17314642 changed the title gpu_fdinfo bug fixes gpu clock for i915 and xe, support for integrated intel gpus gpu_fdinfo bug fixes, gpu clock for i915 and xe, support for integrated intel gpus Dec 8, 2024
@17314642 17314642 changed the title gpu_fdinfo bug fixes, gpu clock for i915 and xe, support for integrated intel gpus gpu_fdinfo bug fixes, almost full support for Intel is finally here Dec 10, 2024
@17314642 17314642 force-pushed the master branch 7 times, most recently from a565fd0 to f5d7c0c Compare December 15, 2024 03:45
src/gpu_fdinfo.h Outdated
@@ -67,16 +67,28 @@ class GPU_fdinfo {
: module(module)
, pci_dev(pci_dev)
{
SPDLOG_DEBUG("GPU driver is \"{}\"", module);
SPDLOG_INFO("GPU driver is \"{}\"", module);
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't want to print things every time unless it's not expected, this should be debug

Copy link
Contributor Author

@17314642 17314642 Dec 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gpu driver is printend only once per every gpu

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned earlier, the problem is that nothing is mentioned once when launched from steam etc, it will spam the logs

src/gpu_fdinfo.h Outdated
fdinfo_data.size() > 0 &&
fdinfo_data[0].find(drm_memory_type) == fdinfo_data[0].end()
) {
SPDLOG_INFO(
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should also be debug

Copy link
Contributor Author

@17314642 17314642 Dec 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also printed only once per gpu

src/gpu.cpp Outdated Show resolved Hide resolved
@17314642 17314642 force-pushed the master branch 2 times, most recently from e652a3b to f6de29b Compare December 15, 2024 23:16
17314642 and others added 5 commits December 16, 2024 03:26
since static variables are shared across all instances, this leads
to cases like this: two gpus store their power usage at the same
memory location, and power usage calculations become broken.

same thing for get_gpu_load
now opens only unique fdinfo file descriptors to avoid duplicates
otherwise this can lead to double or triple or quadruple gpu
usage, vram usage, and so on...
I wrote an explanation down below why current method is the correct one.
But really, you can ignore that, because I have a better reason.

Because Intel's engineers do like this in their gputop program. So I
just copied them, because they know better.

https://gitlab.freedesktop.org/drm/igt-gpu-tools/-/blob/master/tools/gputop.c#L249

========================================================================

Currently xe load is calculated like this:

sum_of_all_deltas_cycles / sum_of_all_deltas_total_cycles

This is fine as long as you have on fd open. But if not, this will lead
to incorrect results.

Imagine this scenario:

fd1:
    delta_cycles = 3152315
    delta_total_cycles = 9611144
    fd_load = 0.327985409 = 33%

fd2:
    delta_cycles = 1132858
    delta_total_cycles = 9607938
    fd_load = 0.117908546 = 12%

Total load: 33 + 12 = 45%

If you calculated this the old way, you would get:
(3152315 / 1132858) / (9611144 / 9607938) = 0.2229645 = 22%

Co-authored-by: Ibrahim Ansari <[email protected]>
17314642 and others added 7 commits December 16, 2024 03:26
and change some messages level from debug to info in gpu.cpp

i think users deserve to see a little bit more of useful info,
which also might be actually useful during  troubleshooting.

This is what I get for example:

[MANGOHUD] [error] [cpu.cpp:675] Failed to initialize CPU power data
[MANGOHUD] [info] [gl_renderer.cpp:422] GL version: 4.6
[MANGOHUD] [error] [cpu.cpp:675] Failed to initialize CPU power data
[MANGOHUD] [info] [gpu_fdinfo.h:69] GPU driver is "xe"
[MANGOHUD] [info] [gpu.cpp:69] GPU Found: node_name: renderD128,
vendor_id: 8086 device_id: 56a0 pci_dev: 0000:03:00.0

I don't think that this can be considered too much info.
This commit removes the hardcoding of specific hwmon sensor IDs to that
of i915 and Xe KMD, thus making `find_intel_hwmon` vendor-independent.

Instead, it iterates through available hwmon sensors and selects the
first available sensor. This should allow fan speed and temp monitoring
to work out of the box on Xe DRM in the future as well, once Xe DRM
exposes the necessary interfaces via hwmon.

Co-authored-by: Ibrahim Ansari <[email protected]>
stop being so mean to intel, they can now display fan speeds too since
6.12 kernel.
if user has dual gpu setup and wants to select only one gpu,
he can use either pci_dev or gpu_list

if both pci_dev and gpu_list are specified, use only gpu_list
and print a warning

since some code still relies on active gpu like fps logging,
throttling and vram graph, make user be able to
select active gpu. if no gpu is active, pick last from list
of available gpus.
@flightlessmango flightlessmango merged commit 81a2ad8 into flightlessmango:master Dec 16, 2024
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants