bun run build hangs while running in a gVisor sandbox #16063

Open
azliu0 opened this issue Dec 30, 2024 · 8 comments
Labels
atw bug Something isn't working

Comments

@azliu0

azliu0 commented Dec 30, 2024

What version of Bun is running?

1.1.42+50eec0025

What platform is your computer?

gVisor sandbox

What steps can reproduce the bug?

Dockerfile:

FROM oven/bun:latest

RUN apt-get update -yqq && \
    apt-get install -yqq git && \
    bun install -g turbo

WORKDIR /march
RUN git clone --depth 1 https://github.com/marchhq/march.git . && \
    git fetch --depth 1 origin c0f0d196c4f7f94e10a66747be9bf96ce0135967 && \
    git checkout -f FETCH_HEAD
RUN bun install
CMD ["bun", "run", "build"]

Follow the instructions to install runsc here, and set it up with docker here.

Then build and run the container with --runtime=runsc.
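The build-and-run step can be scripted so that a hang shows up as a timeout instead of a stuck terminal. This is only a sketch: the image tag `bun-gvisor-repro` and the 10-minute timeout are hypothetical choices, and it assumes Docker is already configured with the runsc runtime and that the Dockerfile above is in the current directory.

```python
import subprocess
from typing import List

IMAGE = "bun-gvisor-repro"  # hypothetical image tag, not from the issue


def build_cmd(image: str = IMAGE, context: str = ".") -> List[str]:
    """Command line that builds the repro image from the Dockerfile."""
    return ["docker", "build", "-t", image, context]


def run_cmd(image: str = IMAGE, runtime: str = "runsc") -> List[str]:
    """Command line that runs the image under the given OCI runtime."""
    return ["docker", "run", "--rm", f"--runtime={runtime}", image]


def reproduce(timeout_s: float = 600.0) -> bool:
    """Return True if `bun run build` finished, False if it timed out.

    The timeout value is an arbitrary assumption; a timeout is treated
    as the hang described in this issue.
    """
    subprocess.run(build_cmd(), check=True)
    try:
        subprocess.run(run_cmd(), check=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return False
    return True


if __name__ == "__main__":
    print("finished" if reproduce() else "hung (timed out)")
```

On timeout, only the `docker run` client process is killed; the container may keep running and need to be stopped separately. Passing `runtime="runc"` to `run_cmd` gives the non-sandboxed baseline for comparison.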

What is the expected behavior?

bun run build finishes relatively quickly.

What do you see instead?

bun run build hangs at the step Generating static pages (21/21).

Additional information

This behavior does not show up in normal runc. runsc does not support pidfd_open, but a workaround was implemented in bun (see here).

@Jarred-Sumner
Collaborator

Do you know what errno it's returning? We currently look for:

  • ENOSYS
  • EOPNOTSUPP
  • EPERM
  • EACCES
  • EINVAL

Maybe another one needs to be added to that list?

if (err == .INVAL) {
    if (pidfd_flags != 0) {
        rc = std.os.linux.pidfd_open(
            @intCast(this.pid),
            0,
        );
        pidfd_flags = 0;
        continue;
    }
}
// No such process can happen if it exited between the time we got the pid and called pidfd_open
// Until we switch to clone3, this needs to be handled separately.
if (err == .SRCH) {
    return .{ .err = bun.sys.Error.fromCode(err, .pidfd_open) };
}
// seccomp filters can be used to block this system call or pidfd's altogether
// https://github.com/moby/moby/issues/42680
// so let's treat a bunch of these as actually meaning we should use the waiter thread fallback instead.
if (err == .NOSYS or err == .OPNOTSUPP or err == .PERM or err == .ACCES or err == .INVAL) {
    WaiterThread.setShouldUseWaiterThread();
    return .{ .err = bun.sys.Error.fromCode(err, .pidfd_open) };
}
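
The decision order in the snippet above can be restated as a small table-driven sketch, which makes it easy to check which bucket a given errno lands in. This is an illustration in Python, not Bun's code: the function name and the returned labels are hypothetical, and the `other_error` case is an assumption because the quoted snippet does not show what happens to errnos outside the list.

```python
import errno

# Errnos the quoted snippet treats as "pidfd_open is unavailable here".
FALLBACK_ERRNOS = frozenset(
    {errno.ENOSYS, errno.EOPNOTSUPP, errno.EPERM, errno.EACCES, errno.EINVAL}
)


def classify_pidfd_open_error(err: int, flags: int) -> str:
    """Mirror the order of checks in the quoted Zig snippet."""
    # EINVAL with non-zero flags: retry once with flags cleared.
    if err == errno.EINVAL and flags != 0:
        return "retry_without_flags"
    # The process exited before pidfd_open was called.
    if err == errno.ESRCH:
        return "process_gone"
    # The syscall is blocked or unsupported: use the waiter thread.
    if err in FALLBACK_ERRNOS:
        return "use_waiter_thread"
    # Assumption: anything else is reported as a plain error.
    return "other_error"
```

For example, `classify_pidfd_open_error(errno.ENOSYS, 0)` returns `"use_waiter_thread"`; ENOSYS is errno 38 on Linux.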

@azliu0
Author

azliu0 commented Dec 31, 2024

@Jarred-Sumner
Collaborator

Could it be a missing errno in pwritev2? We attempt to use that sometimes.

bun/src/sys.zig

Lines 3141 to 3153 in babd8b6

if (Maybe(usize).errnoSysFd(rc, .write, fd)) |err| {
    switch (err.getErrno()) {
        .OPNOTSUPP, .NOSYS => {
            bun.C.linux.RWFFlagSupport.disable();
            switch (bun.isWritable(fd)) {
                .hup, .ready => return write(fd, buf),
                else => return .{ .err = Error.retry },
            }
        },
        .INTR => continue,
        else => return err,
    }
}

Would you be able to paste the output of perf trace?

@azliu0
Author

azliu0 commented Dec 31, 2024

I've uploaded a sample logfile here: https://modal-public-assets.s3.us-east-1.amazonaws.com/vendor/bun-debug-logs.zip

There's a single instance of pwritev2:

I1228 04:04:31.520962       1 strace.go:573] [  57:  57] node E pwritev2(0x10d pipe:[19], 0x7e91083860f0 {base=0x2aa092244000, len=12, "hi from api\n"}, 0x1, 0xffffffffffffffff, 0x0)
I1228 04:04:31.520986       1 strace.go:611] [  57:  57] node X pwritev2(0x10d pipe:[19], ..., 0x1, 0xffffffffffffffff, 0x0) = 12 (0xc) (4.93µs)
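
To check a specific syscall across a whole logfile, the two lines above can be parsed into records. The sketch below is inferred only from those two sample lines: it assumes the bracketed pair is process ID and thread ID, that `E` and `X` mark syscall entry and exit, and that a successful exit line ends with `= <ret> (<hex>)`. Other line shapes, such as failed syscalls, are not handled and simply yield no return value.

```python
import re
from typing import NamedTuple, Optional

# Patterns inferred from the two sample lines in this thread only.
ENTRY_RE = re.compile(
    r"\] \[\s*(?P<pid>\d+):\s*(?P<tid>\d+)\] "
    r"(?P<comm>\S+) (?P<phase>[EX]) (?P<syscall>\w+)\("
)
RET_RE = re.compile(r"\) = (?P<ret>-?\d+) \(0x[0-9a-f]+\)")


class Event(NamedTuple):
    pid: int
    tid: int
    comm: str
    phase: str  # "E" for entry, "X" for exit (assumed meaning)
    syscall: str
    ret: Optional[int]  # only set on exit lines that match RET_RE


def parse_line(line: str) -> Optional[Event]:
    """Parse one strace log line, or return None if it does not match."""
    m = ENTRY_RE.search(line)
    if m is None:
        return None
    ret = None
    if m["phase"] == "X":
        r = RET_RE.search(line)
        if r is not None:
            ret = int(r["ret"])
    return Event(
        int(m["pid"]), int(m["tid"]), m["comm"], m["phase"], m["syscall"], ret
    )
```

Filtering the parsed events for `syscall == "pwritev2"` and `phase == "X"` then shows whether any call returned something other than the expected byte count.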

@Jarred-Sumner
Collaborator

I think the 2,795,330 calls to sched_yield look suspicious. It sounds like some threads are never going to sleep.

Is SIGUSR1 being sent anywhere from the parent process? JavaScriptCore uses that to force the thread to enter stop-the-world GC
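
One way to follow up on the sched_yield observation is to break the call count down per thread, which shows whether a few threads account for most of the spinning. The sketch below assumes the log line shape seen in the pwritev2 sample lines in this thread (`[ pid: tid] comm E syscall(`), with the second bracketed number taken to be the thread ID; the function name is hypothetical.

```python
import re
from collections import Counter
from typing import Iterable

# Matches syscall-entry lines only, so each call is counted once.
ENTER_RE = re.compile(
    r"\] \[\s*\d+:\s*(?P<tid>\d+)\] \S+ E (?P<syscall>\w+)\("
)


def calls_per_thread(lines: Iterable[str], syscall: str = "sched_yield") -> Counter:
    """Count entries of `syscall` per thread ID."""
    counts: Counter = Counter()
    for line in lines:
        m = ENTER_RE.search(line)
        if m and m["syscall"] == syscall:
            counts[int(m["tid"])] += 1
    return counts
```

Passing an open logfile as `lines` and calling `.most_common(5)` on the result lists the busiest threads.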

@azliu0
Author

azliu0 commented Dec 31, 2024

Sounds like some threads are never going to sleep

This sounds relevant to me too.

The buggy behavior (i.e., the observed hang) doesn't always show up. Across multiple buggy and non-buggy logfiles, I notice a pattern of groups of 15 SIGINTs sent by node roughly when the static pages should finish generating.

When hanging is not observed, there are two groups of 15 SIGINTs, where the second group is sent ~40ms after the first. When hanging is observed, only the first group shows up, while the threads that expect (?) to be killed instead appear to be polled indefinitely.
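
The one-group versus two-group pattern can be checked mechanically once the SIGINT timestamps have been extracted from a logfile. The sketch below takes timestamps in seconds and splits them wherever the gap to the previous signal exceeds a threshold. The 10 ms threshold is an assumption chosen to sit below the roughly 40 ms spacing described above, and extracting the timestamps from the log is left out because the signal line format is not shown in this thread.

```python
from typing import List, Sequence


def group_by_gap(timestamps: Sequence[float], max_gap: float = 0.010) -> List[List[float]]:
    """Split timestamps (seconds) into bursts separated by more than max_gap."""
    groups: List[List[float]] = []
    for t in sorted(timestamps):
        if groups and t - groups[-1][-1] <= max_gap:
            # Close enough to the previous signal: same burst.
            groups[-1].append(t)
        else:
            # First signal, or a gap larger than max_gap: new burst.
            groups.append([t])
    return groups
```

Under the pattern described above, a non-hanging run would produce two groups of 15 and a hanging run only one.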

Example of non-buggy logfile: https://modal-public-assets.s3.us-east-1.amazonaws.com/vendor/bun-good-logs.zip

Is SIGUSR1 being sent anywhere from the parent process?

I don't see this in the logs anywhere?

@milantracy

gVisor dev here. Using the Dockerfile without the pidfd_open patch, I can complete the build.

The details can be found at google/gvisor#11331 (comment).

I also don't think pidfd_open is needed, at least for the bun build here.

@azliu0
Author

azliu0 commented Jan 18, 2025

@Jarred-Sumner @heimskr Sorry to bump this, but I'm still encountering this issue. Any updates here? I've collected all the errors I can see from the logs:

name,errno,description,freq
access,2,no such file or directory,9699
access,20,not a directory,987
epoll_ctl,17,file exists,1075
epoll_pwait,4,request was interrupted,36
futex,11,try again,3457
futex,110,connection timed out,2709
futex,512,to be restarted if SA_RESTART is set,19
getrandom,22,invalid argument,3
ioctl,25,not a typewriter,234
ioctl,25,inappropriate ioctl for device,5
madvise,22,invalid argument,26
mkdir,17,file exists,1
newfstatat,2,no such file or directory,32256
newfstatat,20,not a directory,56
openat,2,no such file or directory,20106
pidfd_open,38,invalid system call number,6
prctl,22,invalid argument,76
readlink,22,invalid argument,178
recvfrom,11,request would block,2
renameat,2,no such file or directory,1
rt_sigaction,22,invalid argument,16
rt_sigsuspend,514,to be restarted if no handler,22
sched_setscheduler,22,invalid argument,422
setsockopt,92,protocol not available,35
stat,2,no such file or directory,65
symlink,17,file exists,2
wait4,512,to be restarted if SA_RESTART is set,2
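
Because the table above is plain CSV, it can be aggregated directly, for example to total the failures per syscall across errnos. The sketch below embeds only a few rows copied from the table to stay short; the function name is hypothetical, and the full table can be passed in the same way.

```python
import csv
import io
from collections import defaultdict
from typing import Dict

# A few rows copied verbatim from the table in this comment.
SAMPLE = """name,errno,description,freq
futex,11,try again,3457
futex,110,connection timed out,2709
futex,512,to be restarted if SA_RESTART is set,19
newfstatat,2,no such file or directory,32256
newfstatat,20,not a directory,56
pidfd_open,38,invalid system call number,6
"""


def totals_by_syscall(text: str) -> Dict[str, int]:
    """Sum the freq column per syscall name."""
    totals: Dict[str, int] = defaultdict(int)
    for row in csv.DictReader(io.StringIO(text)):
        totals[row["name"]] += int(row["freq"])
    return dict(totals)


if __name__ == "__main__":
    # Print syscalls from most to least frequent failure.
    for name, freq in sorted(totals_by_syscall(SAMPLE).items(), key=lambda kv: -kv[1]):
        print(name, freq)
```

Sorting by total puts the frequent but usually benign lookups (such as newfstatat returning "no such file or directory") at the top, which makes the rarer entries easier to pick out for closer inspection.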


4 participants