-
Notifications
You must be signed in to change notification settings - Fork 612
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sk-inet: Support IP_TRANSPARENT and IPV6_TRANSPARENT socket options #2357
base: criu-dev
Are you sure you want to change the base?
sk-inet: Support IP_TRANSPARENT and IPV6_TRANSPARENT socket options #2357
Conversation
Change made through this commit: - Include copy of flog as a seperate tree. - Modify the makefile to add and compile flog code. Signed-off-by: prakritigoyal19 <[email protected]>
CID 302713 (checkpoint-restore#1 of 1): Missing varargs init or cleanup (VARARGS) va_end was not called for argptr. Signed-off-by: Adrian Reber <[email protected]>
Separate commit for easier criu-dev <-> master transfer. Acked-by: Mike Rapoport <[email protected]> Signed-off-by: Adrian Reber <[email protected]>
It is mapped, not maped. Same applies for mmap I guess. Found by codespell, except it wants to change it to mapped, which will make it less specific. Signed-off-by: Kir Kolyshkin <[email protected]>
Brought to you by codespell -w (using codespell v2.1.0). [v2: use "make indent" on the result] Signed-off-by: Kir Kolyshkin <[email protected]>
Fixes: checkpoint-restore#2121 Signed-off-by: Pengda Yang <[email protected]>
The TOS(type of service) field in the ip header allows you specify the priority of the socket data. Signed-off-by: Suraj Shirvankar <[email protected]>
Signed-off-by: Suraj Shirvankar <[email protected]>
The pipe_size type is unsigned int, when the fcntl call fails and return -1, it will cause a negative rollover problem. Signed-off-by: zhoujie <[email protected]>
Newer Intel CPUs (Sapphire Rapids) have a much larger xsave area than before. Looking at older CPUs I see 2440 bytes. # cpuid -1 -l 0xd -s 0 ... bytes required by XSAVE/XRSTOR area = 0x00000988 (2440) On newer CPUs (Sapphire Rapids) it grows to 11008 bytes. # cpuid -1 -l 0xd -s 0 ... bytes required by XSAVE/XRSTOR area = 0x00002b00 (11008) This increase the xsave area from one page to four pages. Without this patch the fpu03 test fails, with this patch it works again. Signed-off-by: Adrian Reber <[email protected]>
Signed-off-by: Adrian Reber <[email protected]>
Signed-off-by: Adrian Reber <[email protected]>
Using the fact that we know criu_pid and criu is a parent of restored process we can create pidfile with pid on caller pidns level. We need to move mount namespace creation to child so that criu-ns can see caller pidns proc. Signed-off-by: Pavel Tikhomirov <[email protected]>
By default, the file name 'amdgpu_plugin.txt' is used also as the name for the corresponding man page (`man amdgpu_plugin`). However, when this man page is installed system-wide it would be more appropriate to have a prefix 'criu-' (e.g., `man criu-amdgpu-plugin`). Signed-off-by: Radostin Stoyanov <[email protected]>
crun wants to set empty_ns and this interface is missing from the library. This adds it to libcriu. Signed-off-by: Adrian Reber <[email protected]>
--criu-binary argument provides a way to supply the CRIU binary location to run_criu(). Related to: checkpoint-restore#1909 Signed-off-by: Dhanuka Warusadura <[email protected]>
These changes remove and update the changes introduced in 7177938 in favor of the Python version in CI. os.waitstatus_to_exitcode() function appeared in Python 3.9 Related to: checkpoint-restore#1909 Signed-off-by: Dhanuka Warusadura <[email protected]>
These changes add test implementations for criu-ns script. Fixes: checkpoint-restore#1909 Signed-off-by: Dhanuka Warusadura <[email protected]>
These changes fix the `ImportError: No module named pathlib` error when executing criu-ns tests located at criu/test/others/criu-ns Signed-off-by: Dhanuka Warusadura <[email protected]>
CentOS 7 CI environment uses Python 2. To execute criu-ns script in CentOS 7 changing the current shebang line to python is required. This reverse the changes made in a15a63f Signed-off-by: Dhanuka Warusadura <[email protected]>
This is a patch proposed by Thomas here: https://lore.kernel.org/all/87ilczc7d9.ffs@tglx/ It removes (created id > desired id) "sanity" check and adds proper checking that ids start at zero and increment by one each time when we create/delete a posix timer. First purpose of it is to fix infinite looping in create_posix_timers on old pre 3.11 kernels. Second purpose is to allow kernel interface of creating posix timers with desired id change from iterating with predictable next id to just setting next id directly. And at the same time removing predictable next id so that criu with this patch would not get to infinite loop in create_posix_timers if this happens. Thanks a lot to Thomas! Signed-off-by: Pavel Tikhomirov <[email protected]>
This hook allows to start image streamer process from an action script. Signed-off-by: Radostin Stoyanov <[email protected]>
…tions This does cgroup namespace creation separately from joining task cgroups. This makes the code more logical, because creating cgroup namespace also involves joining cgroups but these cgroups can be different to task's cgroups as they are cgroup namespace roots (cgns_prefix), and mixing all of them together may lead to misunderstanding. Another positive thing is that we consolidate !item->parent checks in one place in restore_task_with_children. Signed-off-by: Valeriy Vdovin <[email protected]> Signed-off-by: Pavel Tikhomirov <[email protected]>
4.15-based kernels don't allow F_*SEAL for memfds created with MFD_HUGETLB. Since seals are not possible in this case, fake F_GETSEALS result as if it was queried for a non-sealing-enabled memfd. Signed-off-by: Michał Mirosław <[email protected]>
Linux 4.15 doesn't like empty string for cgroup2 mount options. Pass NULL then to satisfy the kernel check. Log the options for easier debugging. Signed-off-by: Michał Mirosław <[email protected]>
The original commit added saving THP_DISABLED flag value, but missed restoring it. There is restoring code, but used only when --lazy_pages mode is enabled. Restore the prctl flag always. While at it, rename the `has_thp_enabled` -> `!thp_disabled` for consistency. Fixes: bbbd597 (2017-06-28 "mem: add dump state of THP_DISABLED prctl") Signed-off-by: Michał Mirosław <[email protected]>
If prctl(SET_THP_DISABLE) is not used due to bad semantics, log it for easier debugging. Signed-off-by: Michał Mirosław <[email protected]>
While at it, don't carry over stale errno to the fail() message. Signed-off-by: Michał Mirosław <[email protected]>
Signed-off-by: Michał Mirosław <[email protected]>
Add a sanity check for THP_DISABLE. This discovered a broken commit in Google's kernel tree. Signed-off-by: Michał Mirosław <[email protected]>
Signed-off-by: Mike Rapoport (IBM) <[email protected]>
To support sigreturn with CET enabled parasite must rewind its stack before calling sigreturn so that shadow stack will be compatible with actual calling sequence. In addition, calling sigreturn from top level routine (__export_parasite_head_start) will significantly simplify the shadow stack manipulations required to execute sigreturn. For x86 make fini_sigreturn() return the stack pointer for the signal frame that will be used by sigreturn and propagate that return value up to __export_parasite_head_start. In non-daemon mode parasite_trap_cmd() returns non-positive value which allows to distinguish daemon and non-daemon mode and properly stop at int3 in non-daemon mode. Architectures other than x86 remain unchanged and will still call sigreturn from fini_sigreturn(). Signed-off-by: Mike Rapoport (IBM) <[email protected]>
When calling sigreturn with CET enabled, the kernel verifies that the shadow stack has proper address of sa_restorer and a "restore token". Normally, they pushed to the shadow stack when signal processing is started. Since compel calls sigreturn directly, the shadow stack should be updated to match the kernel expectations for sigreturn invocation. Add parasite_setup_shstk() that sets up the shadow stack with the address of __export_parasite_head_start as sa_restorer and with the required restore token. Signed-off-by: Mike Rapoport (IBM) <[email protected]>
The shadow stack VMAs require special care because they can only be created and populated using special system calls. Add VMA_AREA_SHSTK flag and set it for VMAs that are marked as "ss" in /proc/pid/smaps Signed-off-by: Mike Rapoport (IBM) <[email protected]>
Shadow stack VMAs cannot be mmap()ed, they must be created using map_shadow_stack() system call and populated using special wrss instruction available only when shadow stack is enabled. Premap them to reserve virtual address space and populate it to have there contents available for later copying after enabling shadow stack. Along with the space required by shadow stack VMAs also reserve an extra page that will be later used as a temporary shadow stack. Signed-off-by: Mike Rapoport (IBM) <[email protected]>
Shadow stacks must be populated using special WRSS instruction. This instruction is only available when shadow stack is enabled, calling it with disabled shadow stack causes #UD. Moreover, shadow stack VMAs cannot be mremap()ed and they must be created using map_shadow_stack() system call. This requires delaying the restore of shadow stacks to restorer blob after the CRIU mappings are cleared. Introduce rst_shstk_info structure to hold shadow stack parameters required in the restorer blob and populate this structure in arch_prepare_shstk() method. Signed-off-by: Mike Rapoport (IBM) <[email protected]> Signed-off-by: Andrei Vagin <[email protected]>
Detect if CRIU runs with shadow stack enabled and store the result in kerndat. Unlike most kerndat knobs, kdat_has_shstk() does not check for availability of the shadow stack in the kernel, but rather checks if criu runs with shadow stack enabled. This depends on hardware availabilty, kernel and glibc support, compiler options and glibc tunables, so kdat_has_shstk() must be called every time CRIU starts and its result cannot be cached. The result will be used by the code that controls shadow stack enablement in the next commit. Signed-off-by: Mike Rapoport (IBM) <[email protected]>
There are several gotachs when restoring a task with shadow stack: * depending on the compiler options, glibc version and glibc tunables CRIU can run with or without shadow stack. * shadow stack VMAs are special, they must be created using a dedicated map_shadow_stack() system call and can be modified only by a special instruction (wrss) that is only available when shadow stack is enabled. * once shadow stack is enabled, it is not writable even with wrss; writes to shadow stack can be only enabled with ptrace() and only when shadow stack is enabled in the tracee. * if the shadow stack is enabled during restore rather than by glibc, calling retq after arch_prctl() that enables the shadow stack causes #CP, so the function that enables shadow stack can never return. Add the infrastructure required to cope with all of those: * modify the restore code to allow trampoline (arch_shstk_trampoline) that will enable shadow stack and call restore_task_with_children(). * add call to arch_shstk_unlock() right after the tasks are clone()ed; this will allow unlocking shadow stack features and making shadow stack writable. * add stubs for architectures that do not support shadow stacks * add implementation of arch_shstk_trampoline() and arch_shstk_unlock() for x86, but keep it disabled; it will be enabled along with addtion of the code that will restore shadow stack in the restorer blob Signed-off-by: Mike Rapoport (IBM) <[email protected]>
The restore of a task with shadow stack enabled adds these steps: * switch from the default shadow stack to a temporary shadow stack allocated in the premmaped area * unmap CRIU mappings; nothing changed here, but it's important that CRIU mappings can be removed only after switching to a temporary shadow stack * create shadow stack VMA with map_shadow_stack() * restore shadow stack contents with wrss * switch to "real" shadow stack * lock shadow stack features Signed-off-by: Mike Rapoport (IBM) <[email protected]>
Signed-off-by: Juntong Deng <[email protected]>
IP_TRANSPARENT and IPV6_TRANSPARENT are used in transparent proxies. Support for these two socket options is useful for restoring transparent proxies. Signed-off-by: Juntong Deng <[email protected]>
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## criu-dev #2357 +/- ##
============================================
+ Coverage 70.21% 70.72% +0.50%
============================================
Files 132 135 +3
Lines 34372 32907 -1465
============================================
- Hits 24134 23272 -862
+ Misses 10238 9635 -603 ☔ View full report in Codecov by Sentry. |
Upgrade to 22.04 base image and use the existing version of docker. Signed-off-by: Adrian Reber <[email protected]>
A memory interval is a half-open interval, so the condition when pr->pe->vaddr == vma->e->end should not be interpreted as an intersection and should cause vma to be marked with VMA_NO_PROT_WRITE. Fixes: checkpoint-restore#2364 Signed-off-by: Artem Trushkin <[email protected]>
This patch extends CRIU with support for SCHED_RESET_ON_FORK. When the SCHED_RESET_ON_FORK flag is set, the following rules apply for subsequently created children: - If the calling thread has a scheduling policy of SCHED_FIFO or SCHED_RR, the policy is reset to SCHED_OTHER in child processes. - If the calling process has a negative nice value, the nice value is reset to zero in child processes. (See 'man 7 sched') Fixes: checkpoint-restore#2359 Signed-off-by: Radostin Stoyanov <[email protected]>
This patch extends the sched_policy00 test case to verify that the SCHED_RESET_ON_FORK flag is restored correctly. Signed-off-by: Radostin Stoyanov <[email protected]>
Fixes: checkpoint-restore#335 Signed-off-by: ccccrrrr <[email protected]>
Please reorder patches, c/r support must be first patch and a test for it must be second patch. This way you would not break tests on bisect. |
CircleCI currently prints out the following warning: This job is using a deprecated image 'ubuntu-2004:202010-01', please update to a newer image According to https://discuss.circleci.com/t/linux-image-deprecations-and-eol-for-2024/ the recommended image name is: "image: default" Signed-off-by: Adrian Reber <[email protected]>
Currently we have tabs + spaces on the wrapped line but the wrapped part is not alligned to the opening bracket. Fixes: bbe26d1b7 ("timer: fix allignment in function definition") Signed-off-by: Pavel Tikhomirov <[email protected]>
@@ -21,6 +21,7 @@ message ip_opts_entry { | |||
optional bool pktinfo = 5; | |||
optional uint32 tos = 6; | |||
optional uint32 ttl = 7; | |||
optional bool transparent = 8; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excess tab -> bad allignment.
This PR tries to do a good thing. But actually there it misses a bigger problem with IP_TRANSPARENT option, this option allows to bind to non-local address, so if we modify the test a bit to actually do the bind to non-local address:
We see that nothing works:
Just dumping and restoring IP_TRANSPARENT is not enough, it should be properly integrated with the bind stage (I doubt that any app which uses it uses it without bind to non-local adress). Note that we have exactly the same problem with IP_FREEBIND (on ipv4) and to actually check IP_TRANSPARENT you need to:
|
I think the cause of this problem is that in the current open_inet_sk() of sk-inet.c, the socket options are restored at the end, and listen(), connect(), bind() are all called before setsockopt(). I think we should change this order and set all socket options after creating the socket, before listen(), connect(), bind(). What do you think? If you agree, I can create a new pull request. |
Also one more possible complication: 1) set IP_TRANSPARENT 2) bind to non-local ip 3) unset IP_TRANSPARENT, this way we somehow need to guess to set this before bind and then unset it to restore properly. So maybe we just need to extend 73a739b ("inet: set IP_FREEBIND before binding socket to an ipv6 address (v2)") to ipv4 to fix that. |
A friendly reminder that this PR had no activity for 30 days. |
Support for IP_TRANSPARENT and IPV6_TRANSPARENT socket options is
useful for restoring transparent proxies.