Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor zoekt-webserver flags parsing #756

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

dgruzd
Copy link
Contributor

@dgruzd dgruzd commented Apr 16, 2024

This PR is a small flag parsing refactoring to start extracting functions from cmd/zoekt-webserver/main.go to a newwebserver package.

The underlying idea here is to make the creation of webserver-like binaries easier. This is something we're considering doing at GitLab for the Zoekt integration. That way we can reduce the amount of duplicate code between the binaries.

Please let us know what do you think. Thank you!

Copy link

cla-bot bot commented Apr 16, 2024

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Dmitry Gruzd.
This is most likely caused by a git client misconfiguration; please make sure to:

  1. check if your git client is configured with an email to sign commits git config --list | grep email
  2. If not, set it up using git config --global user.email [email protected]
  3. Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails

@dgruzd dgruzd force-pushed the webserver-refactor-options branch from 879e443 to dc2a60e Compare April 16, 2024 16:46
Copy link

cla-bot bot commented Apr 16, 2024

We require contributors to sign our Contributor License Agreement (CLA), and we don't have yours on file. In order for us to review and merge your code, please sign CLA to get yourself added.

Sourcegraph teammates should refer to Accepting contributions for guidance.

@dgruzd dgruzd force-pushed the webserver-refactor-options branch from dc2a60e to 829dc48 Compare April 16, 2024 16:49
Copy link

cla-bot bot commented Apr 16, 2024

We require contributors to sign our Contributor License Agreement (CLA), and we don't have yours on file. In order for us to review and merge your code, please sign CLA to get yourself added.

Sourcegraph teammates should refer to Accepting contributions for guidance.

@dgruzd
Copy link
Contributor Author

dgruzd commented Apr 16, 2024

@DylanGriffith Could I ask you to review this PR and assign it to a maintainer if you're happy? Thanks!

Copy link
Member

@keegancsmith keegancsmith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is already the web package which exists for this purpose, I suspect a bit too much has grown into the zoekt-webserver command? What key things do you need at GitLab that we should maybe move to the web package instead?

@dgruzd
Copy link
Contributor Author

dgruzd commented Apr 17, 2024

@keegancsmith Thank you for taking a look!

There is already the web package which exists for this purpose, I suspect a bit too much has grown into the zoekt-webserver command?

That's my impression as well. I'd like to make the code responsible for the command itself much more compact and reusable.

What key things do you need at GitLab that we should maybe move to the web package instead?

I'd say it's almost all the code from cmd/zoekt-webserver/main.go, but a lot of it is quite command-specific, so maybe we should introduce another package. It could be webserver (like in this PR) or web_cmd.

Overall, these are our current draft plans:

  • Add webserver mode to the gitlab-zoekt-indexer binary. That way we can ship one binary to self-managed customers, but still keep reads and writes as separate processes. For example, gitlab-zoekt -mode indexer and gitlab-zoekt -mode webserver. With the current webserver code, we'd need to copy a lot of cmd/zoekt-webserver/main.go, which is not ideal from the maintainability perspective.
  • Introduce features that could potentially be added to the upstream (if they are useful for the wider community):
    • JWT auth
    • Add federated search to the webserver. Right now for global searches we spawn threads on GitLab's side to send requests to multiple Zoekt pods, but we would like to offload that to the webserver instead.
    • Support subdirectories. We'd like to have one directory per repoID to allow efficient deletions. Probably without fsnotify, but with some kind of signaling from the indexer process to avoid exhausting inotify limits.

Please let me know what you think 🤝

@keegancsmith
Copy link
Member

There is already the web package which exists for this purpose, I suspect a bit too much has grown into the zoekt-webserver command?

That's my impression as well. I'd like to make the code responsible for the command itself much more compact and reusable.

Ok great. Let me tackle this command a bit, it's been on the back of my mind for a while to clean this up since it has grown so much. We may end up in a place where you don't need to introduce your new package.

Is your main goal to have all the same flags? Do you actually expect at gitlab for people to change most of those values? If it is like the sourcegraph use case, the flags we specify are pretty much static.

  • Add webserver mode to the gitlab-zoekt-indexer binary. That way we can ship one binary to self-managed customers, but still keep reads and writes as separate processes. For example, gitlab-zoekt -mode indexer and gitlab-zoekt -mode webserver. With the current webserver code, we'd need to copy a lot of cmd/zoekt-webserver/main.go, which is not ideal from the maintainability perspective.

Cool, we have thought about this in the past as well to simplify the release pipeline.

  • JWT auth

Not exactly sure if that is something we should have in upstream. I'd recommend if you need that you use the web.Server handler and wrap it to add in the authorization checks in your own binary.

  • Add federated search to the webserver. Right now for global searches we spawn threads on GitLab's side to send requests to multiple Zoekt pods, but we would like to offload that to the webserver instead.

FYI we have code to do this in Sourcegraph but its quite tied to how we do service discovery/telemetry. Additionally the license for it right now is not permissively licensed, but going back in history to a time when it was may be a viable way to copy-paste :) At the end of the day the code is quite straightforward to do TBH, you can probably copy-paste the code used to aggregate across shards inside of the shards pkg.

  • Support subdirectories. We'd like to have one directory per repoID to allow efficient deletions. Probably without fsnotify, but with some kind of signaling from the indexer process to avoid exhausting inotify limits.

I'm interested to learn more about what problems you are observing here. Do you have examples of repositories with large number of shards and have found an issue around deletion? At the end of the day though we have thought a bit about introducing more synchronization between indexer and webserver, but have always just decided to keep it simple.

@dgruzd
Copy link
Contributor Author

dgruzd commented Apr 18, 2024

@keegancsmith

Ok great. Let me tackle this command a bit, it's been on the back of my mind for a while to clean this up since it has grown so much. We may end up in a place where you don't need to introduce your new package.

Sounds great, thank you!

Is your main goal to have all the same flags? Do you actually expect at gitlab for people to change most of those values? If it is like the sourcegraph use case, the flags we specify are pretty much static.

We'll probably have our own flags. I've opened this PR to add some value by the refactoring (options struct) as well as start this discussion.

FYI we have code to do this in Sourcegraph but its quite tied to how we do service discovery/telemetry. Additionally the license for it right now is not permissively licensed, but going back in history to a time when it was may be a viable way to copy-paste :) At the end of the day the code is quite straightforward to do TBH, you can probably copy-paste the code used to aggregate across shards inside of the shards pkg.

I'll take a look 🤝

I'm interested to learn more about what problems you are observing here. Do you have examples of repositories with large number of shards and have found an issue around deletion? At the end of the day though we have thought a bit about introducing more synchronization between indexer and webserver, but have always just decided to keep it simple.

It started with this incident https://gitlab.com/gitlab-com/gl-infra/production/-/issues/17770. We had to reallocate a lot of repositories from one Zoekt node to another. That generated a ton of deletions. Each deletion needed to scan through more than 500K index files (filepath.Glob), which was quite slow and resource intensive.

We have some additional details like cloud profiling graphs from the time of the incident in this MR's description https://gitlab.com/gitlab-org/gitlab-zoekt-indexer/-/merge_requests/173. This MR should alleviate the pain, so in general I'd say that this problem is going to have lower priority after this fix is merged.

I'm pretty sure this problem only affects gitlab-zoekt-indexer since zoekt-sourcegraph-indexserver performs deletions/cleanup differently. We might also pursue other optimization avenues instead of adding subdirectories support.

@jtibshirani
Copy link
Member

@dgruzd @keegancsmith what's the latest on this PR? I'm doing some "spring cleaning" and going through all open PRs :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants