scx_bpfland: improvements #1224

Open
arighi wants to merge 8 commits into main
Conversation

@arighi (Contributor) commented Jan 19, 2025

A set of changes to improve bpfland performance and stability:

The "balance_performance" energy profile is intended to provide a middle
ground between power savings and performance. However, when running with
this profile, there is no strong reason to enforce power-saving
measures, especially in scenarios where performance is the primary
concern.

For laptops, this choice is particularly relevant because
balance_performance is typically used when the system is plugged into AC
power. In this case, restricting power consumption is unnecessary, as
battery life is not a constraint. Instead, we should prioritize higher
performance for better responsiveness and throughput.

Therefore, use the maximum performance level with the
balance_performance energy profile.
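
A minimal sketch of the resulting mapping, assuming the usual scx BPF
includes; the enum and function names are illustrative, and
SCX_CPUPERF_ONE is sched_ext's maximum cpuperf value:

  /*
   * Illustrative mapping from the selected energy profile to the
   * cpuperf level applied to the CPUs.
   */
  enum energy_profile {
          PROFILE_POWERSAVE,
          PROFILE_BALANCE_PERFORMANCE,
          PROFILE_PERFORMANCE,
  };

  static u64 profile_to_cpuperf(enum energy_profile profile)
  {
          switch (profile) {
          case PROFILE_POWERSAVE:
                  return 0;                       /* minimum level */
          case PROFILE_BALANCE_PERFORMANCE:
                  /*
                   * Typically used on AC power: treat it like the
                   * performance profile and don't restrict performance.
                   */
          case PROFILE_PERFORMANCE:
          default:
                  return SCX_CPUPERF_ONE;         /* maximum level */
          }
  }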

Signed-off-by: Andrea Righi <[email protected]>
When the cpufreq target is constant, there is no need to keep
refreshing it from ops.running().
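
A minimal sketch of the idea, assuming the usual scx BPF includes and a
hypothetical per-CPU context that caches the last applied level;
scx_bpf_cpuperf_set() is sched_ext's kfunc to set the cpuperf level:

  /* Hypothetical per-CPU context caching the last applied cpuperf level. */
  struct cpu_ctx {
          u32 perf_lvl;
  };

  static void refresh_cpuperf(struct cpu_ctx *cctx, s32 cpu, u32 target)
  {
          /*
           * Skip the update from ops.running() when the target level
           * hasn't changed since it was last applied.
           */
          if (cctx->perf_lvl == target)
                  return;

          cctx->perf_lvl = target;
          scx_bpf_cpuperf_set(cpu, target);
  }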

Signed-off-by: Andrea Righi <[email protected]>
Scaling the slice lag with task weight can amplify task prioritization
and may even cause stalls when there is a significant difference in nice
values among tasks.

To prevent this, always use a constant slice lag to determine the
maximum vruntime budget a task can accumulate.
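
A minimal, self-contained sketch of the clamping; vtime_now is the
global vruntime reference and slice_lag is a fixed constant, no longer
scaled by the task's weight:

  typedef unsigned long long u64;

  /*
   * Cap how far behind the global vruntime a task can be when it becomes
   * runnable: the maximum budget is a constant slice_lag, independent of
   * the task's weight, so large nice differences can't amplify the boost.
   */
  static u64 clamp_vruntime(u64 task_vruntime, u64 vtime_now, u64 slice_lag)
  {
          u64 vtime_min = vtime_now > slice_lag ? vtime_now - slice_lag : 0;

          return task_vruntime < vtime_min ? vtime_min : task_vruntime;
  }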

Signed-off-by: Andrea Righi <[email protected]>
Evaluate the deadline of a task as follows:

  deadline = vruntime + exec_vruntime

Here, vruntime represents the task's total runtime, scaled inversely by
its weight, while exec_vruntime accounts for the vruntime accumulated
from the moment the task becomes runnable until it voluntarily releases
the CPU.

Fairness is ensured through vruntime, whereas exec_vruntime helps in
prioritizing latency-sensitive tasks: tasks that are frequently blocked
waiting for an event (typically latency sensitive) will accumulate a
smaller exec_vruntime, compared to tasks that continuously consume CPU
without interruption.

As a result, tasks with a smaller exec_vruntime will have a shorter
deadline and will be dispatched earlier, ensuring better responsiveness
for latency-sensitive tasks.
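
A minimal, self-contained sketch of the deadline evaluation (the field
names are illustrative, not the scheduler's actual ones):

  typedef unsigned long long u64;

  /* Illustrative per-task state. */
  struct task_state {
          u64 vruntime;           /* total runtime, scaled inversely by
                                   * the task's weight */
          u64 exec_vruntime;      /* vruntime accumulated since the task
                                   * became runnable, reset when it
                                   * voluntarily releases the CPU */
  };

  /*
   * Tasks that frequently block waiting for events accumulate a small
   * exec_vruntime, so they get an earlier deadline and are dispatched
   * sooner than tasks that continuously consume CPU.
   */
  static u64 task_deadline(const struct task_state *ts)
  {
          return ts->vruntime + ts->exec_vruntime;
  }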

Signed-off-by: Andrea Righi <[email protected]>
Allow tasks to overflow beyond the primary domain more aggressively,
using all available idle CPUs as a last resort while still prioritizing
idle CPUs within the primary domain.

This should address issue #1145.
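
A minimal sketch of the intended selection order, assuming the usual scx
BPF includes and sched_ext's scx_bpf_pick_idle_cpu() kfunc; the cpumask
arguments are illustrative:

  static s32 pick_idle_cpu(const struct cpumask *primary_mask,
                           const struct cpumask *allowed_mask)
  {
          s32 cpu;

          /* Prefer an idle CPU within the primary domain. */
          cpu = scx_bpf_pick_idle_cpu(primary_mask, 0);
          if (cpu >= 0)
                  return cpu;

          /* Last resort: overflow to any usable idle CPU in the system. */
          return scx_bpf_pick_idle_cpu(allowed_mask, 0);
  }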

Signed-off-by: Andrea Righi <[email protected]>
The interactive task classification based on the average number of
voluntary context switches is only used in the cpufreq logic.
Maintaining this complexity is overkill, as it often provides little
benefit.

Signed-off-by: Andrea Righi <[email protected]>
Per-CPU tasks tend to be de-prioritized, since they can't be migrated
when their only usable CPU is busy.

To mitigate this, introduce a new option `--local-pcpu`. If enabled, all
per-CPU tasks are dispatched directly.

This can introduce unfairness and potentially trigger stalls, but it can
help improve the performance of server-type workloads, such as parallel
builds.

With this option in place we can now deprecate `--nvcsw-max-thresh`
(which at this point was only used to prioritize per-CPU tasks).

Moreover, change the `--local-kthreads` behavior to prioritize all
kthreads, not just the per-CPU ones, since per-CPU kthreads can now be
prioritized using `--local-pcpu`.
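
A minimal sketch of the `--local-pcpu` fast path, assuming the usual scx
BPF includes; `local_pcpu` is set from the new option, and the task is
dispatched to the CPU's local DSQ via sched_ext's scx_bpf_dsq_insert()
kfunc (the surrounding enqueue logic is omitted):

  const volatile bool local_pcpu;         /* set from --local-pcpu */

  /*
   * Dispatch per-CPU tasks (tasks that can only run on a single CPU)
   * directly, so they aren't penalized by being unable to migrate when
   * their only usable CPU is busy. Returns true if the task was
   * dispatched.
   */
  static bool try_local_pcpu_dispatch(struct task_struct *p, u64 slice_ns,
                                      u64 enq_flags)
  {
          if (!local_pcpu || p->nr_cpus_allowed != 1)
                  return false;

          scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, slice_ns, enq_flags);
          return true;
  }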

Signed-off-by: Andrea Righi <[email protected]>
Enabling CPU frequency scaling by default may result in suboptimal
performance now that interactive task classification has been removed:
interactive tasks can no longer boost the CPU frequency, so the
frequency is now determined solely by CPU load.

This can lead to sub-optimal performance with the schedutil governor,
especially when running benchmarks and comparing performance across
different schedulers.

Therefore, introduce the new option --cpufreq to explicitly enable CPU
frequency scaling and change the logic as follows (sketched after this
list):
 - by default, boost the CPU frequency to the maximum level
 - if powersave mode is enabled, set the CPU frequency to the minimum
   level
 - if performance mode is enabled, boost the CPU frequency to the
   maximum level
 - if --cpufreq is enabled, adjust the CPU frequency dynamically based
   on the load
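
A minimal sketch of the resulting cpuperf selection, assuming the usual
scx BPF includes; the mode flags and the load-derived level are
illustrative:

  /*
   * Pick the cpuperf level to apply according to the configured mode.
   */
  static u64 pick_cpuperf_level(bool powersave, bool performance,
                                bool cpufreq, u64 load_level)
  {
          if (powersave)
                  return 0;                       /* minimum level */
          if (performance || !cpufreq)
                  return SCX_CPUPERF_ONE;         /* maximum level (default) */

          return load_level;                      /* scale with CPU load */
  }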

Signed-off-by: Andrea Righi <[email protected]>
@arighi arighi requested review from htejun and multics69 January 19, 2025 17:34