Upload never finishes (or takes a **very** long time) #1298
It just failed because it ran out of disk space. This was on a GitHub Actions macos-latest-xlarge runner which, according to the docs, should have 14 GiB of SSD storage available. So it seems it used up all the space when trying to upload the node_modules directory.
@avdv is it possible to create an isolated project setup that fails or point to something open source that uses buck2 / node that exhibits this failure?
Hi @adam-singer, here's a repository which contains a reproducer: https://github.com/avdv/nl-repro/ I cancelled the workflow run after ~1h, see https://github.com/avdv/nl-repro/actions/runs/10664811435/job/29556846974
BTW, these numbers are from my local Linux system. There were no symlinks involved, but maybe that is different on macOS?
Note, I also created a workflow running on Linux. It shows the same symptom: https://github.com/avdv/nl-repro/actions/runs/10679208408/job/29597917819
Are we sure this is a NativeLink issue?
We are currently using bazel-remote-worker and BuildBuddy RE with our project. Both work fine for us (although bazel-remote-worker is very slow). That leads me to think the problem is related to NativeLink...
@avdv I'm not totally sure about what happened here because we haven't seen it. My best guess would be a regression that we saw in Can you attempt the same operation with If it works, feel free to close this issue.
On second thought, we'll attempt to run the reproducer on
Already did that today, I cancelled it after ~50 mins: https://github.com/avdv/nl-repro/actions/runs/10723077069/job/29735546354
I started looking into this a couple of days ago, but then had a conference to attend. I'll try to get some time tomorrow or this weekend to deep-dive into this. These kinds of things usually end up being a config problem somewhere, but I'll keep you posted.
It appears the issue is that you are using the default nativelink container to run the jobs in, but this container is bare-bones and has pretty much nothing installed in it, not even
It ends up failing because Buck2 did get an "I am not able to run this command" response, but seems to keep waiting anyway.
Thank you for looking into it, @allada
OK, good point. I have now removed the hello_world target (which was just a left-over from Also, I am now extracting the nativelink binary from the docker image, see https://github.com/avdv/nl-repro/actions/runs/10744420726 (BTW, it reports version 0.5.1 although I am using the 0.5.3 Docker image). There are no errors reported, so I suspect that the action is just not run but waiting for the upload to finish... Note, in my actual project we were running nativelink on macOS and in my initial repro on Linux (https://github.com/avdv/nl-repro/actions/runs/10679208408) I was building nativelink with nix (avdv/nl-repro@fef5774#diff-fde0e5d64aae13964fdda6d47af304cf1a7015cbc17e440ac4a5e662ee1d875eR25)
I've been looking into this and when I run it locally I get:
I want to make sure I'm debugging the right thing here, so just to clarify, you are not even able to get to this stage, correct? If that is the case, one thing I do see is just shy of 100k files, 16k directories, and 677 symlinks, for a total of ~750 MB. This leads me to think it is possible that the worker is bound by IO and/or kernel calls. GitHub runners have famously slow disk performance. Given this, what I might suggest is mounting a tmpfs or ramfs and then configuring the nativelink worker to use that mount instead. This may not be the best long-term solution, but at least it'll tell us whether the GitHub runners are disk-IO bound.
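For anyone who wants to reproduce the file/directory/symlink counts quoted above on their own output tree, here is a minimal sketch. The function name `summarize_tree` and the `node_modules` path in the usage line are illustrative assumptions, not part of nativelink or buck2, and the exact numbers may differ from what other tools report.

```python
import os
from typing import Dict


def summarize_tree(root: str) -> Dict[str, int]:
    """Count files, directories and symlinks below root (root itself excluded).

    Symlinks are counted as symlinks and never followed; "bytes" sums the
    sizes of non-symlink, non-directory entries only.
    """
    stats = {"files": 0, "dirs": 0, "symlinks": 0, "bytes": 0}
    for dirpath, dirnames, filenames in os.walk(root, followlinks=False):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                stats["symlinks"] += 1
            elif os.path.isdir(path):
                stats["dirs"] += 1
            else:
                stats["files"] += 1
                stats["bytes"] += os.lstat(path).st_size
    return stats


if __name__ == "__main__":
    # Hypothetical path; point this at the directory the genrule produces.
    print(summarize_tree("node_modules"))
```

Running it once against the tree on local disk and once against a copy on a tmpfs mount should give identical counts, which is a cheap sanity check before comparing upload times.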
This is what I would expect to see too.
Yes, correct 👍 . Running the
(see avdv/nl-repro@6a98651) I ran
$ buck2 build -v2,stderr --prefer-remote :tsc_generated
File changed: root//.#BUCK
File changed: root//BUCK
Build ID: 03766e8d-435a-4427-b54e-6798674ffcc3
Network: (GRPC-SESSION-ID)
Command: build. Remaining: 1/8. Time elapsed: 23:08:26.1s
--------------------------------------------------------------------------------
root//:tsc_generated -- action (genrule) [re_upload] 23:08:26.1s
Running the command again, it succeeded in about 55 seconds:
$ buck2 build -v2,stderr --prefer-remote :tsc_generated
stderr for root//:tsc_generated (genrule):
total 0
drwxr-xr-x 1 claudio users 18 Sep 8 16:29 @aashutoshrathi
drwxr-xr-x 1 claudio users 18 Sep 8 16:29 @ampproject
drwxr-xr-x 1 claudio users 2.8K Sep 8 16:29 @babel
drwxr-xr-x 1 claudio users 92 Sep 8 16:29 @bazel
...
Build ID: 207239c9-5bea-45de-8edc-970b0cfae932
Network: (GRPC-SESSION-ID)
Jobs completed: 4. Time elapsed: 54.0s.
Cache hits: 0%. Commands: 1 (cached: 0, remote: 1, local: 0)
BUILD SUCCEEDED
Since it also happens on my machine, I don't think this is the issue here. Rather it looks like it got stuck somehow, although everything is already uploaded, since a subsequent run "immediately" succeeds... Here's the latest run on GitHub: https://github.com/avdv/nl-repro/actions/runs/10760776248/job/29839142839
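Since a stuck upload can sit for hours, it may help to wrap the build in a deadline when reproducing this. The following is only a sketch: `run_with_deadline` is a made-up helper name and the 600-second value in the usage comment is an arbitrary assumption. Note that `subprocess.run` kills only the direct child on timeout, so a background buck2 daemon would likely keep running.

```python
import subprocess
from typing import List, Optional


def run_with_deadline(cmd: List[str], timeout_s: float) -> Optional[int]:
    """Run cmd and return its exit code, or None if it exceeded timeout_s.

    On timeout, subprocess.run kills the direct child process and waits for
    it before raising TimeoutExpired, which is caught here.
    """
    try:
        return subprocess.run(cmd, timeout=timeout_s).returncode
    except subprocess.TimeoutExpired:
        return None


# Usage with the command from this thread (the timeout value is an assumption):
# code = run_with_deadline(
#     ["buck2", "build", "-v2,stderr", "--prefer-remote", ":tsc_generated"], 600
# )
# if code is None:
#     print("build hit the deadline, possibly stuck in re_upload")
```

Running the wrapper twice in a row would show the pattern described above: a first run that hits the deadline, followed by a second run that finishes quickly.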
Oh, I wonder if you are hitting the max open files limit. Try running (in the same shell):
Then also increase
You are probably running on a Mac, and I believe it has much lower default limits than Linux.
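If the server is started from a script rather than an interactive shell, the open-files limit can also be inspected and raised there. This is a rough sketch using Python's standard `resource` module (Unix only). The helper name and the default target of 65000 are illustrative assumptions; the new limit applies only to the calling process and anything it spawns afterwards.

```python
import resource
from typing import Tuple


def raise_nofile_soft_limit(target: int = 65000) -> Tuple[int, int]:
    """Raise the soft RLIMIT_NOFILE towards target, never above the hard limit.

    Returns the (soft, hard) limits in effect afterwards. If the OS rejects
    the new value (macOS may apply an extra per-process cap), the limits are
    left unchanged.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    wanted = target if hard == resource.RLIM_INFINITY else min(target, hard)
    # An unlimited soft limit needs no raising; otherwise only ever go up.
    if soft != resource.RLIM_INFINITY and wanted > soft:
        try:
            resource.setrlimit(resource.RLIMIT_NOFILE, (wanted, hard))
        except (ValueError, OSError):
            pass
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    print("before:", resource.getrlimit(resource.RLIMIT_NOFILE))
    print("after: ", raise_nofile_soft_limit())
```

A process launched from the same script after this call (for example via `subprocess`) inherits the raised soft limit, which is roughly equivalent to running `ulimit -S -n` in the shell first.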
No, I am using Linux (NixOS actually). Open file descriptor limits are already set quite high, at least the hard limit is:
$ ulimit -H -n
524288
$ ulimit -S -n
1024
I tried to set the soft limit to 65000 anyway, and also changed the
BTW, I also tried to increase the limit on macOS (https://github.com/avdv/nl-repro/commits/main/). On the first run (https://github.com/avdv/nl-repro/actions/runs/10768327053/job/29857259444) it helped and I got:
So I replaced the yarn command with
Remember you now need to install
Edit: It appears that eventually buck2 does upload the files and it does execute. FYI: The fact that it even got to:
Means that nativelink did try to execute the command. I suggest running nativelink with
I am currently trying to use nativelink 0.5.1 with buck2. It works fine when building our backend, but building the frontend on MacOS aarch64 never finishes since it is stuck in the re_upload phase:
(it's still running and I'll see if it finishes at all; I'll report back what happened)
I have started nativelink locally, using the basic config, just setting additional_environment. In the build, there are these genrules:
So, yes the node_modules output is indeed large: