
[6.12] Track gaming patches #33

Draft · kakra wants to merge 51 commits into base-6.12 from rebase-6.12/gaming-patches
Conversation

@kakra commented Nov 23, 2024

Export patch series: https://github.com/kakra/linux/pull/33.patch

  • ntsync7: experimental ntsync device driver used by some Proton versions (add the udev rule ACTION=="add|change", KERNEL=="ntsync", SUBSYSTEM=="misc", MODE="0666" to set proper permissions; see the example rules file right after this list)
  • hugepages background reclaim: patch cherry-picked from ZEN
  • memory management: improved scheduling for huge memory pages and memory zones, cherry-picked from ZEN
  • more ZEN cherry-picks, including the ZEN_INTERACTIVE configure option
  • threaded IRQs by default: cherry-picked from CK
  • always use the BFQ IO scheduler by default: it may benchmark at lower throughput, but it almost always gives more consistent desktop IO latency under heavy write loads
  • memory soft-dirty flag: used by Proton to support Windows memory-write monitoring with better performance
  • futex backward-compatibility patch: properly supports older Proton versions on top of the latest futex kernel functions
  • lower-latency scheduling: EEVDF patches cherry-picked
  • mgLRU thrash protection: enabled by default to protect the active working set under memory pressure
  • raised vm.max_map_count: as suggested by Valve (on the Steam Deck) and TKG, cherry-picked from ZEN
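
For reference, the ntsync rule above can be shipped as a udev rules file; the
rule text is as given in the list, only the file path below is an example:

	# /etc/udev/rules.d/70-ntsync.rules (example path)
	ACTION=="add|change", KERNEL=="ntsync", SUBSYSTEM=="misc", MODE="0666"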

@kakra marked this pull request as draft November 23, 2024 21:00
@orbea commented Nov 23, 2024

With 6.12.1 it fails to build.

block/elevator.c: In function 'elevator_get_default':
block/elevator.c:570:42: error: passing argument 1 of 'elevator_find_get' from incompatible pointer type [-Werror=incompatible-pointer-types]
  570 |                 return elevator_find_get(q, "kyber");
      |                                          ^
      |                                          |
      |                                          struct request_queue *
block/elevator.c:109:60: note: expected 'const char *' but argument is of type 'struct request_queue *'
  109 | static struct elevator_type *elevator_find_get(const char *name)
      |                                                ~~~~~~~~~~~~^~~~
block/elevator.c:570:24: error: too many arguments to function 'elevator_find_get'
  570 |                 return elevator_find_get(q, "kyber");
      |                        ^~~~~~~~~~~~~~~~~
block/elevator.c:109:30: note: declared here
  109 | static struct elevator_type *elevator_find_get(const char *name)
      |                              ^~~~~~~~~~~~~~~~~
  CC      init/version.o
cc1: some warnings being treated as errors
  LDS     arch/x86/kernel/vmlinux.lds
make[3]: *** [scripts/Makefile.build:229: block/elevator.o] Error 1
make[3]: *** Waiting for unfinished jobs....

It appears to have been introduced by this commit: bd1a0d1

@kakra commented Nov 24, 2024

Strange, it works for me... But yes, the q parameter is gone. I'll look into it.

@kakra force-pushed the rebase-6.12/gaming-patches branch from ed21f9c to a26daa8 on November 24, 2024 00:33
@kakra commented Nov 24, 2024

@orbea Don't ask me how that broken patch got there... I didn't notice because my ZEN_INTERACTIVE=y was gone somehow. It's fixed now. Thanks for reporting.

@orbea commented Nov 24, 2024

@kakra Thanks for the quick fix, I can confirm it also builds for me.

@kakra force-pushed the rebase-6.12/gaming-patches branch from a26daa8 to a8a1a45 on December 7, 2024 15:38
Elizabeth Figura added 22 commits December 15, 2024 09:24
Simplify the user API a bit by returning the fd as return value from the ioctl
instead of through the argument pointer.

Signed-off-by: Elizabeth Figura <[email protected]>
Use the more common "release" terminology, which is also the term used by NT,
instead of "post" (which is used by POSIX).

Signed-off-by: Elizabeth Figura <[email protected]>
This corresponds to part of the functionality of the NT syscall
NtWaitForMultipleObjects(). Specifically, it implements the behaviour where
the third argument (wait_any) is TRUE, and it does not handle alertable waits.
Those features have been split out into separate patches to ease review.

This patch therefore implements the wait/wake infrastructure which comprises the
core of ntsync's functionality.

NTSYNC_IOC_WAIT_ANY is a vectored wait function similar to poll(). Unlike
poll(), it "consumes" objects when they are signaled. For semaphores, this
means decrementing the internal counter by one. At most one object can be
consumed by this function.

This wait/wake model is fundamentally different from that used anywhere else in
the kernel, and for that reason ntsync does not use any existing infrastructure,
such as futexes, kernel mutexes or semaphores, or wait_event().

Up to 64 objects can be waited on at once. As soon as one is signaled, the
object with the lowest index is consumed, and that index is returned via the
"index" field.

A timeout is supported. The timeout is passed as a u64 nanosecond value, which
represents absolute time measured against either the MONOTONIC or REALTIME clock
(controlled by the flags argument). If U64_MAX is passed, the ioctl waits
indefinitely.

This ioctl validates that all objects belong to the relevant device. This is not
necessary for any technical reason related to NTSYNC_IOC_WAIT_ANY, but will be
necessary for NTSYNC_IOC_WAIT_ALL introduced in the following patch.

Some padding fields are added for alignment and for fields which will be added
in future patches (split out to ease review).

Signed-off-by: Elizabeth Figura <[email protected]>
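
As a rough illustration of the interface described above, here is a minimal
user-space sketch. It assumes the UAPI header <linux/ntsync.h> from this
series; the struct and ioctl names follow the upstream driver, but treat the
exact field layout as an assumption rather than something quoted from this PR.

	#include <fcntl.h>
	#include <stdint.h>
	#include <stdio.h>
	#include <sys/ioctl.h>
	#include <unistd.h>
	#include <linux/ntsync.h>

	int main(void)
	{
		/* Semaphore starts at 1 with a maximum count of 2. */
		struct ntsync_sem_args sem = { .count = 1, .max = 2 };
		struct ntsync_wait_args wait = { 0 };
		int dev, obj, ret;

		dev = open("/dev/ntsync", O_RDWR);
		if (dev == -1)
			return 1;

		/* Object fds come back as the ioctl return value. */
		obj = ioctl(dev, NTSYNC_IOC_CREATE_SEM, &sem);
		if (obj == -1)
			return 1;

		wait.timeout = UINT64_MAX;   /* U64_MAX waits indefinitely */
		wait.objs = (uintptr_t)&obj; /* array of object fds, max 64 */
		wait.count = 1;
		wait.owner = getpid();       /* TID is managed by user space */

		/* Consumes the semaphore (count 1 -> 0); the winning
		 * object's index is returned through the "index" field. */
		ret = ioctl(dev, NTSYNC_IOC_WAIT_ANY, &wait);
		printf("ret=%d index=%u\n", ret, wait.index);

		close(obj);
		close(dev);
		return 0;
	}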
This is similar to NTSYNC_IOC_WAIT_ANY, but waits until all of the objects are
simultaneously signaled, and then acquires all of them as a single atomic
operation.

Because acquisition of multiple objects is atomic, some complex locking is
required. We cannot simply spin-lock multiple objects simultaneously, as that
may disable preemption for a problematically long time.

Instead, modifying any object which may be involved in a wait-all operation takes
a device-wide sleeping mutex, "wait_all_lock", instead of the normal object
spinlock.

Because wait-for-all is a rare operation, in order to optimize wait-for-any,
this lock is only taken when necessary. "all_hint" is used to mark objects which
are involved in a wait-for-all operation, and if an object is not, only its
spinlock is taken.

The locking scheme used here was written by Peter Zijlstra.

Signed-off-by: Elizabeth Figura <[email protected]>
This corresponds to the NT syscall NtCreateMutant().

An NT mutex is recursive, with a 32-bit recursion counter. When acquired via
NtWaitForMultipleObjects(), the recursion counter is incremented by one.

The OS records the thread which acquired it. However, in order to keep this
driver self-contained, the owning thread ID is managed by user-space, and passed
as a parameter to all relevant ioctls.

The initial owner and recursion count, if any, are specified when the mutex is
created.

Signed-off-by: Elizabeth Figura <[email protected]>
This corresponds to the NT syscall NtReleaseMutant().

This syscall decrements the mutex's recursion count by one, and returns the
previous value. If the mutex is not owned by the current task, the function
instead fails and returns -EPERM.

Signed-off-by: Elizabeth Figura <[email protected]>
This does not correspond to any NT syscall. Rather, when a thread dies, it
should be called by the NT emulator for each mutex, with the TID of the dying
thread.

NT mutexes are robust (in the pthread sense). When an NT thread dies, any
mutexes it owned are immediately released. Acquisition of those mutexes by other
threads will return a special value indicating that the mutex was abandoned,
like EOWNERDEAD returned from pthread_mutex_lock(), and EOWNERDEAD is indeed
used here for that purpose.

Signed-off-by: Elizabeth Figura <[email protected]>
This corresponds to the NT syscall NtCreateEvent().

An NT event holds a single bit of state denoting whether it is signaled or
unsignaled.

There are two types of events: manual-reset and automatic-reset. When an
automatic-reset event is acquired via a wait function, its state is reset to
unsignaled. Manual-reset events are not affected by wait functions.

Whether the event is manual-reset, and its initial state, are specified at
creation time.

Signed-off-by: Elizabeth Figura <[email protected]>
This corresponds to the NT syscall NtSetEvent().

This sets the event to the signaled state, and returns its previous state.

Signed-off-by: Elizabeth Figura <[email protected]>
This corresponds to the NT syscall NtResetEvent().

This sets the event to the unsignaled state, and returns its previous state.

Signed-off-by: Elizabeth Figura <[email protected]>
This corresponds to the NT syscall NtPulseEvent().

This wakes up any waiters as if the event had been set, but does not set the
event, instead resetting it if it had been signaled. Thus, for a manual-reset
event, all waiters are woken, whereas for an auto-reset event, at most one
waiter is woken.

Signed-off-by: Elizabeth Figura <[email protected]>
This corresponds to the NT syscall NtQuerySemaphore().

This returns the current count and maximum count of the semaphore.

Signed-off-by: Elizabeth Figura <[email protected]>
This corresponds to the NT syscall NtQueryMutant().

This returns the recursion count, owner, and abandoned state of the mutex.

Signed-off-by: Elizabeth Figura <[email protected]>
This corresponds to the NT syscall NtQueryEvent().

This returns the signaled state of the event and whether it is manual-reset.

Signed-off-by: Elizabeth Figura <[email protected]>
NT waits can optionally be made "alertable". This is a special channel for
thread wakeup that is mildly similar to SIGIO. A thread has an internal single
bit of "alerted" state, and if a thread is alerted while an alertable wait, the
wait will return a special value, consume the "alerted" state, and will not
consume any of its objects.

Alerts are implemented using events; the user-space NT emulator is expected to
create an internal ntsync event for each thread and pass that event to wait
functions.

Signed-off-by: Elizabeth Figura <[email protected]>
Wine has tests for its synchronization primitives, but these are more accessible
to kernel developers, and also allow us to test some edge cases that Wine does
not care about.

This patch adds tests for semaphore-specific ioctls NTSYNC_IOC_SEM_POST and
NTSYNC_IOC_SEM_READ, and waiting on semaphores.

Signed-off-by: Elizabeth Figura <[email protected]>
Test mutex-specific ioctls NTSYNC_IOC_MUTEX_UNLOCK and NTSYNC_IOC_MUTEX_READ,
and waiting on mutexes.

Signed-off-by: Elizabeth Figura <[email protected]>
Test basic synchronous functionality of NTSYNC_IOC_WAIT_ANY, when objects are
considered signaled or not signaled, and how they are affected by a successful
wait.

Signed-off-by: Elizabeth Figura <[email protected]>
Test basic synchronous functionality of NTSYNC_IOC_WAIT_ALL, and when objects
are considered simultaneously signaled.

Signed-off-by: Elizabeth Figura <[email protected]>
…IOC_WAIT_ANY.

Test contended "wait-for-any" waits, to make sure that scheduling and wakeup
logic works correctly.

Signed-off-by: Elizabeth Figura <[email protected]>
…IOC_WAIT_ALL.

Test contended "wait-for-all" waits, to make sure that scheduling and wakeup
logic works correctly, and that the wait only exits once objects are all
simultaneously signaled.

Signed-off-by: Elizabeth Figura <[email protected]>
Test event-specific ioctls NTSYNC_IOC_EVENT_SET, NTSYNC_IOC_EVENT_RESET,
NTSYNC_IOC_EVENT_PULSE, NTSYNC_IOC_EVENT_READ for manual-reset events, and
waiting on manual-reset events.

Signed-off-by: Elizabeth Figura <[email protected]>
Elizabeth Figura and others added 29 commits December 15, 2024 09:24
Test event-specific ioctls NTSYNC_IOC_EVENT_SET, NTSYNC_IOC_EVENT_RESET,
NTSYNC_IOC_EVENT_PULSE, NTSYNC_IOC_EVENT_READ for auto-reset events, and
waiting on auto-reset events.

Signed-off-by: Elizabeth Figura <[email protected]>
Expand the contended wait tests, which previously only covered mutexes and
semaphores, to cover events as well.

Signed-off-by: Elizabeth Figura <[email protected]>
Test the "alert" functionality of NTSYNC_IOC_WAIT_ALL and NTSYNC_IOC_WAIT_ANY,
when a wait is woken with an alert and when it is woken by an object.

Signed-off-by: Elizabeth Figura <[email protected]>
Expand the alert tests to cover alerting a thread mid-wait, to test that the
relevant scheduling logic works correctly.

Signed-off-by: Elizabeth Figura <[email protected]>
Test a more realistic usage pattern, and one with heavy contention, in order to
actually exercise ntsync's internal synchronization.

This test has several threads in a tight loop acquiring a mutex, modifying some
shared data, and then releasing the mutex. At the end we check if the data is
consistent.

Signed-off-by: Elizabeth Figura <[email protected]>
Add myself as maintainer, supported by CodeWeavers.

Signed-off-by: Elizabeth Figura <[email protected]>
Add an overall explanation of the driver architecture, and a complete and
precise specification of its intended behaviour.

Signed-off-by: Elizabeth Figura <[email protected]>
f5b335d ("misc: ntsync: mark driver as "broken" to prevent from building") was
committed to avoid the driver being used while only part of its functionality
was released. Since the rest of the functionality has now been committed,
revert this.

Signed-off-by: Elizabeth Figura <[email protected]>
v2: ported from 6.1 to 6.6

Signed-off-by: Kai Krakow <[email protected]>
v2: ported from 6.1 to 6.6

Signed-off-by: Kai Krakow <[email protected]>
Add an option to wait on multiple futexes using the old interface, which uses
opcode 31 of the futex() syscall. Do that by just translating the old
interface to use the new code. This allows old and stable versions of Proton
to still use fsync in new kernel releases.

Signed-off-by: André Almeida <[email protected]>
There's plenty of room on the stack for a few more inlined bytes here
and there. The measured stack usage at runtime is still safe without
this, and performance is surely improved at a microscopic level, so
remove it.

Signed-off-by: Sultan Alsawaf <[email protected]>
ATA init is the long pole in the boot process, and it's asynchronous. Move the
graphics init after it so that ATA and graphics initialize in parallel.

Signed-off-by: Kai Krakow <[email protected]>
Significant time was spent on synchronize_rcu in evdev_detach_client
when applications closed evdev devices. Switching VT away from a
graphical environment commonly leads to mass input device closures,
which could lead to noticeable delays on systems with many input devices.

Replace synchronize_rcu with call_rcu, deferring reclaim of the evdev
client struct till after the RCU grace period instead of blocking the
calling application.

While this does not solve all slow evdev fd closures, it takes care of a
good portion of them, including this simple test:

	#include <fcntl.h>
	#include <unistd.h>

	/* Repeatedly open and close an evdev node; before this change, each
	 * close() blocked in synchronize_rcu() via evdev_detach_client(). */
	int main(int argc, char *argv[])
	{
		int idx, fd;
		const char *path = "/dev/input/event0";
		for (idx = 0; idx < 1000; idx++) {
			if ((fd = open(path, O_RDWR)) == -1) {
				return -1;
			}
			close(fd);
		}
		return 0;
	}

Time to completion of above test when run locally:

	Before: 0m27.111s
	After:  0m0.018s

Signed-off-by: Kenny Levinsen <[email protected]>
Per [Fedora][1], they intend to change the default max map count for their
distribution to improve out-of-the-box compatibility with games played through
Steam/Proton.  The value they picked comes from the Steam Deck, which defaults
to INT_MAX - MAPCOUNT_ELF_CORE_MARGIN.

Since most ZEN and Liquorix users probably play games, follow Valve's
lead and raise this value to their default.

[1]: https://fedoraproject.org/wiki/Changes/IncreaseVmMaxMapCount

Signed-off-by: Kai Krakow <[email protected]>
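
For comparison, a stock kernel can be given the same value through sysctl. The
number below assumes MAPCOUNT_ELF_CORE_MARGIN is 5, as in current mainline, so
the Steam Deck default works out to INT_MAX - 5:

	# /etc/sysctl.d/99-max-map-count.conf (example path)
	vm.max_map_count = 2147483642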
Contains:
  - mm: Stop kswapd early when nothing's waiting for it to free pages

    Keeping kswapd running when all the failed allocations that invoked it
    are satisfied incurs a high overhead due to unnecessary page eviction
    and writeback, as well as spurious VM pressure events to various
    registered shrinkers. When kswapd doesn't need to work to make an
    allocation succeed anymore, stop it prematurely to save resources.

    Signed-off-by: Sultan Alsawaf <[email protected]>

  - mm: Don't stop kswapd on a per-node basis when there are no waiters

    The page allocator wakes all kswapds in an allocation context's allowed
    nodemask in the slow path, so it doesn't make sense to have the kswapd-
    waiter count per each NUMA node. Instead, it should be a global counter
    to stop all kswapds when there are no failed allocation requests.

    Signed-off-by: Sultan Alsawaf <[email protected]>

  - mm: Increment kswapd_waiters for throttled direct reclaimers

    Throttled direct reclaimers will wake up kswapd and wait for kswapd to
    satisfy their page allocation request, even when the failed allocation
    lacks the __GFP_KSWAPD_RECLAIM flag in its gfp mask. As a result, kswapd
    may think that there are no waiters and thus exit prematurely, causing
    throttled direct reclaimers lacking __GFP_KSWAPD_RECLAIM to stall on
    waiting for kswapd to wake them up. Incrementing the kswapd_waiters
    counter when such direct reclaimers become throttled fixes the problem.

    Signed-off-by: Sultan Alsawaf <[email protected]>

Signed-off-by: Kai Krakow <[email protected]>
Use [defer+madvise] as default khugepaged defrag strategy:

For some reason, the default strategy to respond to THP fault fallbacks
is still just madvise, meaning stall if the program wants transparent
hugepages, but don't trigger a background reclaim / compaction if THP
begins to fail allocations.  This creates a snowball effect where we
still use the THP code paths, but we almost always fail once a system
has been active and busy for a while.

The option "defer" was created for interactive systems where THP can
still improve performance.  If we have to fallback to a regular page due
to an allocation failure or anything else, we will trigger a background
reclaim and compaction so future THP attempts succeed and previous
attempts eventually have their smaller pages combined without stalling
running applications.

We still want madvise to stall applications that explicitly want THP,
so defer+madvise _does_ make a ton of sense.  Make it the default for
interactive systems, especially if the kernel maintainer left
transparent hugepages on "always".

Reasoning and details in the original patch: https://lwn.net/Articles/711248/

Signed-off-by: Kai Krakow <[email protected]>
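
The same policy can be tried on a running kernel through the standard THP
sysfs interface, no patch required:

	echo defer+madvise > /sys/kernel/mm/transparent_hugepage/defrag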
Although not identical to the le9 patches that protect a byte-amount of
cache through tunables, multigenerational LRU now supports protecting
cache accessed in the last X milliseconds.

In torvalds#218, Yu recommends starting with 1000ms and tuning as needed.  This
looks like a safe default, and turning this feature on should help users who
don't know they need it.

Signed-off-by: Kai Krakow <[email protected]>
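
On kernels built with CONFIG_LRU_GEN, the equivalent runtime setting is the
mgLRU sysfs knob (1000 ms protects anything accessed in the last second):

	echo 1000 > /sys/kernel/mm/lru_gen/min_ttl_ms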
5.7:
Take the "sysctl_sched_nr_migrate" tuning of 128 from early XanMod builds. As
of 5.7, XanMod uses 256, but that may affect applications that require timely
response to IRQs.

5.15:
Per [a comment][1] on our ZEN INTERACTIVE commit, reducing the cost of
migration makes the system less responsive under high load.  Most likely the
combination of reduced migration cost and the higher number of tasks that can
be migrated at once contributes to this.

To better handle this situation, restore the mainline migration cost
value and also reduce the max number of tasks that can be migrated in
batch from 128 to 64.

If this doesn't help, we'll restore the reduced migration cost and cap the
total number of tasks that can be migrated at once at 32.

[1]: zen-kernel@be5ba23#commitcomment-63159674

6.6:
Port the tuning to EEVDF, which removed a couple of settings.

6.7:
Instead of increasing the number of tasks that migrate at once, migrate
the amount acceptable for PREEMPT_RT, but reduce the cost so migrations
occur more often.

This should make CFS/EEVDF behave more like out-of-tree schedulers that
aggressively use idle cores to reduce latency, but without the jank
caused by rebalancing too many tasks at once.

Signed-off-by: Kai Krakow <[email protected]>
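
For experimentation, recent mainline kernels expose the migration batch size
through the sched debugfs interface (path assumed from mainline's sched debug
files; 64 is just an example value):

	cat /sys/kernel/debug/sched/nr_migrate
	echo 64 > /sys/kernel/debug/sched/nr_migrate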
4.10:
During some personal testing with the Dolphin emulator, MuQSS has
serious problems scaling its frequencies causing poor performance where
boosting the CPU frequencies would have fixed them.  Reducing the
up_threshold to 45 with MuQSS appears to fix the issue, letting the
introduction to "Star Wars: Rogue Leader" run at 100% speed versus about
80% on my test system.

Also, let's refactor the definitions and add some indentation to help the
reader discern the scope of all the macros.

5.4:
On the last custom kernel benchmark from Phoronix with XanMod, Michael
configured all the kernels to run using ondemand instead of the kernel's
[default selection][1].  This reminded me that another option outside of the
kernel's control is the user's choice of cpufreq governor, for better or for
worse.

In Liquorix, performance is the default governor whether you're running
acpi-cpufreq or intel-pstate.  I expect laptop users to install TLP or
LMT to control the power balance on their system, especially when
they're plugged in or on battery.  However, it's pretty clear to me a
lot of people would choose ondemand over performance since it's not
obvious it has huge performance ramifications with MuQSS, and ondemand
otherwise is "good enough" for most people.

Let's codify lower up thresholds for MuQSS to more closely synergize with
its aggressive thread migration behavior.  This way when ondemand is
configured, you get sort of a "performance-lite" type of result but with
the power savings you expect when leaving the running system idle.

[1]: https://www.phoronix.com/scan.php?page=article&item=xanmod-2020-kernel

5.14:
Although CFS and similar schedulers (BMQ, PDS, and CacULE) reuse a lot more
of mainline scheduling and do a good job of pinning single-threaded tasks to
their respective cores, there are still applications that confusingly run
steady near 50% and benefit from going full speed or turbo when they need to
run (emulators for more recent consoles come to mind).

Drop the up threshold for all non-MuQSS schedulers from 80/95 to 55/60.

5.15:
Remove MuQSS cpufreq configuration.

Signed-off-by: Kai Krakow <[email protected]>
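
When the ondemand governor is active, the threshold can also be changed at
runtime through its sysfs tunables (global path shown; 55 matches the
non-MuQSS value above):

	echo 55 > /sys/devices/system/cpu/cpufreq/ondemand/up_threshold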
This option is already disabled when CONFIG_PREEMPT_RT is enabled; let's turn
it off when CONFIG_ZEN_INTERACTIVE is set as well.

Signed-off-by: Kai Krakow <[email protected]>
What watermark boosting does is preemptively fire up kswapd to free
memory when there hasn't been an allocation failure. It does this by
increasing kswapd's high watermark goal and then firing up kswapd. The
reason why this causes freezes is because, with the increased high
watermark goal, kswapd will steal memory from processes that need it in
order to make forward progress. These processes will, in turn, try to
allocate memory again, which will cause kswapd to steal necessary pages
from those processes again, in a positive feedback loop known as page
thrashing. When page thrashing occurs, your system is essentially
livelocked until the necessary forward progress can be made to stop
processes from trying to continuously allocate memory and trigger
kswapd to steal it back.

This problem already occurs with kswapd *without* watermark boosting,
but it's usually only encountered on machines with a small amount of
memory and/or a slow CPU. Watermark boosting just makes the existing
problem worse enough to notice on higher spec'd machines.

Disable watermark boosting by default since it's a total dumpster fire.
I can't imagine why anyone would want to explicitly enable it, but the
option is there in case someone does.

Signed-off-by: Sultan Alsawaf <[email protected]>
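
Stock kernels expose this through sysctl, so the same behaviour is available
without the patch (the default is 15000; 0 disables boosting):

	vm.watermark_boost_factor = 0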
…uce scheduling delays

The page allocator processes free pages in groups of pageblocks, where
the size of a pageblock is typically quite large (1024 pages without
hugetlbpage support). Pageblocks are processed atomically with the zone
lock held, which can cause severe scheduling delays on both the CPU
going through the pageblock and any other CPUs waiting to acquire the
zone lock. A frequent offender is move_freepages_block(), which is used
by rmqueue() for page allocation.

As it turns out, there's no requirement for pageblocks to be so large,
so the pageblock order can simply be reduced to ease the scheduling
delays and zone lock contention. PAGE_ALLOC_COSTLY_ORDER is used as a
reasonable setting to ensure non-costly page allocation requests can
still be serviced without always needing to free up more than one
pageblock's worth of pages at a time.

This has a noticeable effect on overall system latency when memory
pressure is elevated. The various mm functions which operate on
pageblocks no longer appear in the preemptoff tracer, where previously
they would spend up to 100 ms on a mobile arm64 CPU processing a
pageblock with preemption disabled and the zone lock held.

Signed-off-by: Sultan Alsawaf <[email protected]>
Queueing in dm-crypt for crypto operations reduces performance on modern
systems.  As discussed in an article from Cloudflare, they discovered
that queueing was introduced because the crypto subsystem used to be
synchronous.  Since it's now asynchronous, we get double queueing when
using the subsystem through dm-crypt.  This is obviously undesirable and
reduces throughput and increases latency.

Disable queueing when using our Zen Interactive configuration.

Fixes: zen-kernel#282
Signed-off-by: Kai Krakow <[email protected]>
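
On stock kernels, the same behaviour can be requested per device with
cryptsetup's performance flags (available in recent cryptsetup releases); the
device and mapping name below are placeholders:

	cryptsetup open --perf-no_read_workqueue --perf-no_write_workqueue \
		--persistent /dev/sdX cryptroot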
Per an [issue][1] on the chromium project, swap-in readahead causes more jank
than not.  This might be caused by poor optimization in the swapping code, or
the fact that under memory pressure, we're pulling in pages we don't need,
causing more swapping.

Either way, this is mainline/upstream in Chromium, and ChromeOS developers
care a lot about system responsiveness.  Let's implement the same change so
Zen Kernel users benefit.

[1]: https://bugs.chromium.org/p/chromium/issues/detail?id=263561

Signed-off-by: Kai Krakow <[email protected]>
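
Assuming this maps to the usual vm.page-cluster knob (swap-in readahead pulls
in 2^page-cluster pages at a time), the same effect is available on stock
kernels via sysctl:

	vm.page-cluster = 0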
@kakra force-pushed the rebase-6.12/gaming-patches branch from a8a1a45 to 7abf354 on December 18, 2024 00:23