scx_bpfland: improvements #1224

Open
arighi wants to merge 8 commits into main
Conversation

@arighi (Contributor) commented Jan 19, 2025

A set of changes to improve bpfland performance and stability:

The "balance_performance" energy profile is intended to provide a middle
ground between power savings and performance. However, when running with
this profile, there is no strong reason to enforce power-saving
measures, especially in scenarios where performance is the primary
concern.

For laptops, this choice is particularly relevant because
balance_performance is typically used when the system is plugged into AC
power. In this case, restricting power consumption is unnecessary, as
battery life is not a constraint. Instead, we should prioritize higher
performance for better responsiveness and throughput.

Therefore, use the maximum performance level with the
balance_performance energy profile.
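
A minimal sketch of the resulting mapping, assuming the usual scx BPF
includes; the enum and function names are illustrative, and
SCX_CPUPERF_ONE is sched_ext's maximum cpuperf value:

  /*
   * Illustrative mapping from the selected energy profile to the
   * cpuperf level applied to the CPUs.
   */
  enum energy_profile {
          PROFILE_POWERSAVE,
          PROFILE_BALANCE_PERFORMANCE,
          PROFILE_PERFORMANCE,
  };

  static u64 profile_to_cpuperf(enum energy_profile profile)
  {
          switch (profile) {
          case PROFILE_POWERSAVE:
                  return 0;                       /* minimum level */
          case PROFILE_BALANCE_PERFORMANCE:
                  /*
                   * Typically used on AC power: treat it like the
                   * performance profile and don't restrict performance.
                   */
          case PROFILE_PERFORMANCE:
          default:
                  return SCX_CPUPERF_ONE;         /* maximum level */
          }
  }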

Signed-off-by: Andrea Righi <[email protected]>
When the cpufreq target is constant, there is no need to keep
refreshing it from ops.running().
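
A minimal sketch of the idea, assuming the usual scx BPF includes and a
hypothetical per-CPU context that caches the last applied level;
scx_bpf_cpuperf_set() is sched_ext's kfunc to set the cpuperf level:

  /* Hypothetical per-CPU context caching the last applied cpuperf level. */
  struct cpu_ctx {
          u32 perf_lvl;
  };

  static void refresh_cpuperf(struct cpu_ctx *cctx, s32 cpu, u32 target)
  {
          /*
           * Skip the update from ops.running() when the target level
           * hasn't changed since it was last applied.
           */
          if (cctx->perf_lvl == target)
                  return;

          cctx->perf_lvl = target;
          scx_bpf_cpuperf_set(cpu, target);
  }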

Signed-off-by: Andrea Righi <[email protected]>
Scaling the slice lag with task weight can amplify task prioritization
and may even cause stalls when there is a significant difference in nice
values among tasks.

To prevent this, always use a constant slice lag to determine the
maximum vruntime budget a task can accumulate.
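
A minimal, self-contained sketch of the clamping; vtime_now is the
global vruntime reference and slice_lag is a fixed constant, no longer
scaled by the task's weight:

  typedef unsigned long long u64;

  /*
   * Cap how far behind the global vruntime a task can be when it becomes
   * runnable: the maximum budget is a constant slice_lag, independent of
   * the task's weight, so large nice differences can't amplify the boost.
   */
  static u64 clamp_vruntime(u64 task_vruntime, u64 vtime_now, u64 slice_lag)
  {
          u64 vtime_min = vtime_now > slice_lag ? vtime_now - slice_lag : 0;

          return task_vruntime < vtime_min ? vtime_min : task_vruntime;
  }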

Signed-off-by: Andrea Righi <[email protected]>
Evaluate the deadline of a task as follows:

  deadline = vruntime + exec_vruntime

Here, vruntime represents the task's total runtime, scaled inversely by
its weight, while exec_vruntime accounts for the vruntime accumulated
from the moment the task becomes runnable until it voluntarily releases
the CPU.

Fairness is ensured through vruntime, whereas exec_vruntime helps in
prioritizing latency-sensitive tasks: tasks that are frequently blocked
waiting for an event (typically latency sensitive) will accumulate a
smaller exec_vruntime, compared to tasks that continuously consume CPU
without interruption.

As a result, tasks with a smaller exec_vruntime will have a shorter
deadline and will be dispatched earlier, ensuring better responsiveness
for latency-sensitive tasks.
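
A minimal, self-contained sketch of the deadline evaluation (the field
names are illustrative, not the scheduler's actual ones):

  typedef unsigned long long u64;

  /* Illustrative per-task state. */
  struct task_state {
          u64 vruntime;           /* total runtime, scaled inversely by
                                   * the task's weight */
          u64 exec_vruntime;      /* vruntime accumulated since the task
                                   * became runnable, reset when it
                                   * voluntarily releases the CPU */
  };

  /*
   * Tasks that frequently block waiting for events accumulate a small
   * exec_vruntime, so they get an earlier deadline and are dispatched
   * sooner than tasks that continuously consume CPU.
   */
  static u64 task_deadline(const struct task_state *ts)
  {
          return ts->vruntime + ts->exec_vruntime;
  }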

Signed-off-by: Andrea Righi <[email protected]>
Allow tasks to overflow beyond the primary domain more aggressively,
using all available idle CPUs as a last resort while still prioritizing
idle CPUs within the primary domain.

This should address issue #1145.
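
A minimal sketch of the intended selection order, assuming the usual scx
BPF includes and sched_ext's scx_bpf_pick_idle_cpu() kfunc; the cpumask
arguments are illustrative:

  static s32 pick_idle_cpu(const struct cpumask *primary_mask,
                           const struct cpumask *allowed_mask)
  {
          s32 cpu;

          /* Prefer an idle CPU within the primary domain. */
          cpu = scx_bpf_pick_idle_cpu(primary_mask, 0);
          if (cpu >= 0)
                  return cpu;

          /* Last resort: overflow to any usable idle CPU in the system. */
          return scx_bpf_pick_idle_cpu(allowed_mask, 0);
  }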

Signed-off-by: Andrea Righi <[email protected]>
The interactive task classification based on the average number of
voluntary context switches is only used in the cpufreq logic.
Maintaining this complexity is overkill, as it often provides little
benefit.

Signed-off-by: Andrea Righi <[email protected]>
Per-CPU tasks tend to be de-prioritized, since they can't be migrated
when their only usable CPU is busy.

To mitigate this, introduce a new option `--local-pcpu`. If enabled, all
per-CPU tasks are dispatched directly.

This can introduce unfairness and potentially trigger stalls, but it can
help improve the performance of server-type workloads, such as parallel
builds.

With this option in place we can now deprecate `--nvcsw-max-thresh`
(which at this point was only used to prioritize per-CPU tasks).

Moreover, change the `--local-kthreads` behavior to prioritize all
kthreads, not just the per-CPU ones, since per-CPU kthreads can now be
prioritized using `--local-pcpu`.
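
A minimal sketch of the `--local-pcpu` fast path, assuming the usual scx
BPF includes; `local_pcpu` is set from the new option, and the task is
dispatched to the CPU's local DSQ via sched_ext's scx_bpf_dsq_insert()
kfunc (the surrounding enqueue logic is omitted):

  const volatile bool local_pcpu;         /* set from --local-pcpu */

  /*
   * Dispatch per-CPU tasks (tasks that can only run on a single CPU)
   * directly, so they aren't penalized by being unable to migrate when
   * their only usable CPU is busy. Returns true if the task was
   * dispatched.
   */
  static bool try_local_pcpu_dispatch(struct task_struct *p, u64 slice_ns,
                                      u64 enq_flags)
  {
          if (!local_pcpu || p->nr_cpus_allowed != 1)
                  return false;

          scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, slice_ns, enq_flags);
          return true;
  }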

Signed-off-by: Andrea Righi <[email protected]>
Enabling CPU frequency scaling by default may result in suboptimal
performance now that interactive task classification has been removed:
interactive tasks can no longer boost the CPU frequency, so the
frequency is now determined solely by CPU load.

This can lead to sub-optimal performance with the schedutil governor,
especially when running benchmarks and comparing performance across
different schedulers.

Therefore, introduce the new option --cpufreq to explicitly enable CPU
frequency scaling and change the logic as follows (sketched after this
list):
 - by default, boost the CPU frequency to the maximum level
 - if powersave mode is enabled, set the CPU frequency to the minimum
   level
 - if performance mode is enabled, boost the CPU frequency to the
   maximum level
 - if --cpufreq is enabled, adjust the CPU frequency dynamically based
   on the load
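
A minimal sketch of the resulting cpuperf selection, assuming the usual
scx BPF includes; the mode flags and the load-derived level are
illustrative:

  /*
   * Pick the cpuperf level to apply according to the configured mode.
   */
  static u64 pick_cpuperf_level(bool powersave, bool performance,
                                bool cpufreq, u64 load_level)
  {
          if (powersave)
                  return 0;                       /* minimum level */
          if (performance || !cpufreq)
                  return SCX_CPUPERF_ONE;         /* maximum level (default) */

          return load_level;                      /* scale with CPU load */
  }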

Signed-off-by: Andrea Righi <[email protected]>
@arighi arighi requested review from htejun and multics69 January 19, 2025 17:34