Performance in large projects #301

Open · anthonymahshigian opened this issue Nov 26, 2024 · 3 comments

@anthonymahshigian

Hey there, I started using DWYU recently and I love it! One issue I've run into while bringing it into a large project is performance, in particular the analysis action taking 10+ minutes in some cases.

The project I am referring to has gigantic (100+ files) cc_* targets that depend on each other. I experimented with splitting that action up to run on a per-file basis rather than per-target, and then concatenating the results for all the files in a target. This has worked great for my humongous targets by allowing me to parallelize a lot of the C preprocessing and file parsing work. Unfortunately, it does seem to slow down simple cases, or cases where you cannot spawn enough concurrent actions, e.g. because only local execution is used. Does it sound like a good idea to parallelize this action? If not for the general case, what about behind a feature flag of some sort, such as an argument to the aspect factory?
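To make the concatenation step concrete, here is a minimal sketch of what the merge could look like, assuming each per-file action emits a JSON list of findings — the report schema and the script's command-line interface are made up for illustration, not DWYU's actual format:

```python
import json
import sys
from pathlib import Path


def merge_reports(per_file_reports: list[Path], output: Path) -> None:
    """Concatenate per-file analysis results into one report for the whole target."""
    merged = []
    for report in per_file_reports:
        # Assumption: each per-file action wrote a JSON list of findings.
        merged.extend(json.loads(report.read_text()))
    output.write_text(json.dumps(merged, indent=2))


if __name__ == "__main__":
    # Hypothetical CLI: merge_reports.py <output> <per_file_report>...
    merge_reports([Path(p) for p in sys.argv[2:]], Path(sys.argv[1]))
```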

I appreciate the feedback, thanks!

@martis42 (Owner)

In general I am willing to work on this. There are for sure several large projects out there which would profit from it. However, it is hard to implement performance improvements without proper benchmarks.
Any chance you can point me to a publicly visible repository where one can see such drastic slowdowns?

@anthonymahshigian (Author)

I can go do this, but before that, are there any notable ones you know of that have an existing setup? Anyway, I'm wondering if it should be an optional feature that gets turned on rather than something that has to speed up the normal case. By parallelizing the action I linked, we're not doing any less work, just spreading it out across many more actions. I think this would only be useful in specific settings, such as large repos (like I mentioned) and environments where you're using remote execution.

@martis42 (Owner)

> are there any notable ones you know of that have an existing setup?

Unfortunately, I am not aware of such a repo.


There are two possible approaches to speeding things up via parallelism.

One approach would be to perform multithreading inside the analysis action in the Python code.
The good thing about this is that it would be easy to implement. The downside is that the performance gains are limited: Bazel is already executing actions in parallel, so while we could increase CPU utilization, we would also have to be careful not to bog the system down with too many threads.
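As a rough sketch of this first approach — `parse_file` stands in for the real preprocessing and parsing logic, and the worker cap is an arbitrary choice for illustration:

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def parse_file(source: Path) -> list[str]:
    # Stand-in for the real preprocessing and include-parsing work.
    return [
        line for line in source.read_text().splitlines()
        if line.lstrip().startswith("#include")
    ]


def analyze_target(sources: list[Path], max_workers: int = 4) -> list[list[str]]:
    # Keep the pool small: Bazel already runs actions in parallel, so an
    # unbounded pool here would bog the machine down with too many threads.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(parse_file, sources))
```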

The other option is what you proposed, i.e. analyzing each file in a separate action.
The good thing here is that this scales very well when remote execution is used. It is also closer to the Bazel best practice that one action should not occupy more than one CPU.
The downside is that DWYU then has to create many more output files (one per analyzed source file) and spawn an extra action just to combine those into the final result for the whole target.
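For illustration, a rough Starlark sketch of that fan-out inside an aspect implementation. The `_analyzer` and `_merger` tools, the attribute names, and the report layout are all hypothetical, not DWYU's actual code; a real implementation would also need to handle headers and the compilation context, which are omitted here for brevity:

```starlark
def _dwyu_aspect_impl(target, ctx):
    # Hypothetical per-file fan-out: one analysis action per source file.
    per_file_reports = []
    for src in ctx.rule.files.srcs:
        report = ctx.actions.declare_file(src.basename + ".dwyu.json")
        ctx.actions.run(
            executable = ctx.executable._analyzer,  # hypothetical analysis tool
            arguments = [src.path, report.path],
            inputs = [src],
            outputs = [report],
            mnemonic = "DwyuAnalyzeFile",
        )
        per_file_reports.append(report)

    # The extra action mentioned above: combine the per-file reports into
    # the final result for the whole target.
    final_report = ctx.actions.declare_file(target.label.name + ".dwyu.json")
    ctx.actions.run(
        executable = ctx.executable._merger,  # hypothetical merge tool
        arguments = [final_report.path] + [r.path for r in per_file_reports],
        inputs = per_file_reports,
        outputs = [final_report],
        mnemonic = "DwyuMergeReports",
    )
    return [OutputGroupInfo(dwyu = depset([final_report]))]
```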

If we do this, we should definitely hide it behind a flag. I would even make it an experimental_foo flag, just in case behavior tweaks are required after the initial release.
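Tying this back to the aspect factory argument suggested above, opting in could look something like this — the `experimental_parallel_analysis` name and its existence are purely an assumption, not part of DWYU's current API:

```starlark
load("@depend_on_what_you_use//:defs.bzl", "dwyu_aspect_factory")

# Hypothetical opt-in flag; only dwyu_aspect_factory itself is real DWYU API.
your_dwyu_aspect = dwyu_aspect_factory(
    experimental_parallel_analysis = True,  # one analysis action per source file
)
```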
