Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Add retry on locks #4997

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

Fabianoshz
Copy link
Contributor

what

I'm opening this as a draft to receive feedback early, I don't expect this to break anything but I believe it could be hidden behind a flag and with better default values for timeout and retries (maybe exponential retry?).

This adds a retry logic to the lock mechanism to mitigate the issue described in here and also in this ADR.

The locking issue itself is more complex and requires much more work, this is just a small step so users don't have to see the error anymore effectively making the code wait instead of asking the user to retry.

why

Currently the user has to rerun any operations that fail because a certain workspace path is locked, this tries just to automate the process.

tests

Will add if this approach receive support.

references

Did my best to try to understand which issues this would affect.

Relates to #3345
Relates to #2921
Relates to #2882

Relates to #4489
Relates to #305
Relates to #4829
Relates to #1847
Relates to #4566

Closes #1618
Closes #2200
Closes #3785
Closes #4489
Closes #4368

@Fabianoshz Fabianoshz force-pushed the add-rety-on-locks branch 2 times, most recently from 284e6f9 to d313d5a Compare October 11, 2024 12:14
@Fabianoshz Fabianoshz changed the title Add retry on locks fix: Add retry on locks Oct 11, 2024
@Fabianoshz Fabianoshz force-pushed the add-rety-on-locks branch 2 times, most recently from e6d6135 to 6fdaa09 Compare October 12, 2024 11:23
@Fabianoshz
Copy link
Contributor Author

After some testing, this is what I'm seeing if I run 2 plan at the same time.
image

This doesn't really fix the issue, but at least reduces the amount of times users have to deal with the error messages.

@Fabianoshz Fabianoshz marked this pull request as ready for review October 12, 2024 11:31
@Fabianoshz Fabianoshz requested review from a team as code owners October 12, 2024 11:31
@Fabianoshz Fabianoshz requested review from GenPage, nitrocode and X-Guardian and removed request for a team October 12, 2024 11:31
@github-actions github-actions bot added the go Pull requests that update Go code label Oct 12, 2024
@X-Guardian
Copy link
Contributor

I don't see that this change is adding much value or helping on any of the issues you have linked. The only scenario where it will help is in the example you have given, i.e. running two plan commands consecutively on the same commit, which is not what most people have issues with.

@jamengual
Copy link
Contributor

I don't see that this change is adding much value or helping on any of the issues you have linked. The only scenario where it will help is in the example you have given, i.e. running two plan commands consecutively on the same commit, which is not what most people have issues with.

There are a lot of people using terragrunt, atmos and other tools that can run many projects/plans at once, so I see how this help those users.

@X-Guardian
Copy link
Contributor

I use Terragrunt extensively with Atlantis running parallel plans/applies in a PR, and don't have any locking issues that this change would make any difference to.

@Fabianoshz
Copy link
Contributor Author

Fabianoshz commented Nov 4, 2024

I don't see that this change is adding much value or helping on any of the issues you have linked. The only scenario where it will help is in the example you have given, i.e. running two plan commands consecutively on the same commit, which is not what most people have issues with.

I truly believe we still have much to do to make sure that the locking issue is dealt with. Although I'm not so sure that this code only is applicable for simultaneous commands, I've only used it to showcase the issue since is the same mechanism (TryLock). Maybe I'm overly optimistic about this...

Buy I may argue that the UX will be even better for most cases. The current behavior is:

  • John tries to plan common-dir and dir-1
  • While John still has the locks for common-dir and dir-1 Mary tries to plan common-dir and dir-2
  • Mary receives The default workspace is currently locked by another command that is running for this pull request.

This PR will change this to:

  • John tries to plan common-dir and dir-1
  • While John has the locks for common-dir and dir-1 Mary tries to plan common-dir and dir-2
  • When John finishes using the lock, Mary acquires the lock (given it didn't reached the timeout)

As I stated here most users are suffering from seeing the message that just says "try again later", I'm automating this step so users only see this message if we are more or less sure that there is a real issue with the plan taking too long (which can be configured by the timeout setting).

@Fabianoshz
Copy link
Contributor Author

I've used to work in a company that had a pretty big Atlantis install, unfortunately I don't anymore so I can't really test this in large scale, I invite anyone who can test this PR to give it a try.

Two areas of improvement I already can see:

  • We could evaluate if a user is trying to trigger the same command while another one is already being executed, this would allow us to avoid this scenario while proving a better UX, as I believe it's not good to let duplicated commands reach this locking mechanism.
  • We could queue executions instead of just blindly retrying it as I'm doing it here now if we want to have FIFO, queueing is a bit harder though because we might have issues if the queue gets to big because external issues.

@lukemassa
Copy link
Contributor

I've seen this once or twice (for example if you're impatient and run atlantis plan when it was already auto-planning); I'm not against the idea of having essentially automating the "try again later" step, I feel like spinning a full 15m by default with no way to disable seems aggressive? Maybe the default behavior should be disabled? I feel like the possibility of starving threads/livelock/etc is high with so many people's different use cases. Maybe I'm being paranoid

@Fabianoshz
Copy link
Contributor Author

Fabianoshz commented Dec 18, 2024

I feel like spinning a full 15m by default with no way to disable seems aggressive? Maybe the default behavior should be disabled? I feel like the possibility of starving threads/livelock/etc is high with so many people's different use cases. Maybe I'm being paranoid

I believe a lot of other things would break before this, AFAIK tickers are pretty lightweight so I don't think they would consume a lot of resources and if the system is too slow the tickers might fail and we might skip a few checks but that's not a big issue for our logic.

But to make this more clear and resilient we could add another return value in this line, this value could represent the status of the lock acquire process (acquired, pull running and workspace running). This would leave the decision to the caller, let's say the caller is the manual plan requester, if the return is pull running we could return a more appropriated message, something like: "There is another plan request being executed for this PR, wait until it is finished", or if the return is worspace running: "There is another plan request using this workspace, your plan execution is queued" .

This would mean that each caller would need to implement either a queue system or the retry mechanism like the one I've added in this PR. We could also inject an interface with functions to send the messages, I'm fine with either, I'm also open to other suggestions.

@jamengual jamengual added the waiting-on-review Waiting for a review from a maintainer label Dec 31, 2024
@lukemassa
Copy link
Contributor

I don't know; I'm still worried about the complexity. 15m is a long time after I type "atlantis apply" for something to happen.

For example in your scenario, when Mary goes to run, if she immediately gets an error saying that another PR is locked, she could look and see it's John, then reach out to John, who might have completely lost track of the PR and say "yeah go ahead and close that" at which point Mary can apply her PR. This happens a lot in my workflows, and so adding an additional 15m delay before Mary knows even what the problem is would actually slow this process down for me. My preference would be for this to be disabled by default (timeout = 0) and configurable.

As for having the TryLock return a more specific message I think that could be useful, it would especially help with the "it's spinning and I don't know why" problem from above. As you said though that affects the callers as well, so it might be worth doing as a separate follow up.

@anryko
Copy link
Contributor

anryko commented Jan 16, 2025

Nice feature. I would suggest to make it optional via repo config. Analogous to automerge and atlantis apply --auto-merge-disabled. Something like queue_plan: bool (default false), and allow users to queue their plan if they wish by commenting with atlantis plan --queue.

@Fabianoshz
Copy link
Contributor Author

Hey everyone, I will start working on this in the next few days whenever I have free time, I'm expecting to do the following:

  • Change the default behavior to have a 0 timeout (effectively disabling the retry)
  • Improve the retryLock mechanism to be more explicit, either by changing the interface to expect another interface with contracts used to communicate the state of the lock or by changing/expanding the current implementation to expose the reasons why the lock failed, I expect that this will at least allow us to show a better message to the end user.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
go Pull requests that update Go code waiting-on-review Waiting for a review from a maintainer
Projects
None yet
5 participants