Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MM-61881: Use os.MkdirAll, which is thread-safe, to ensure Terraform state dir #849

Merged
merged 1 commit into from
Nov 19, 2024

Conversation

agarciamontoro
Copy link
Member

@agarciamontoro agarciamontoro commented Nov 19, 2024

Summary

As explained in the ticket, we concurrently run every deployment inside a comparison run. Each of those deployments call terraform.New(), which in turn calls ensureTerraformStateDir. This function is not thread-safe, since it first calls os.Stat to know whether the directory exists, and if it doesn’t, it then calls os.Makedir. It can happen that:

  1. Routine A checks os.Stat and sees the directory doesn’t exist
  2. Then routine B does the exact same thing
  3. Then routine A runs os.Makedir, and succeeds, creating the directory
  4. Now routine B runs os.Makedir, and fails, since the directory already exists

Technically, the same race condition does happen in the implementation of os.MkdirAll, but that code takes it into account and re-checks the second error (the one on Mkdir), and it ends up returning nil if the directory was already created, thus avoiding the race condition altogether and making it thread-safe.

Also, the code is way cleaner this way, of course :)

Ticket Link

https://mattermost.atlassian.net/browse/MM-61881

The same race condition does happen in the implementation of
os.MkdirAll[0], but that code there takes it into account and re-checks
the second error (the one on Mkdir), and it ends up returning nil if the
directory is indeed created, thus avoiding the race condition and making
os.MkdirAll thread-safe.

[0] https://github.com/golang/go/blob/e33f7c42b084182a3a88ef79857e33c11627159a/src/os/path.go#L19-L66
Copy link
Member

@cpoile cpoile left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool!

@streamer45
Copy link
Contributor

Yeah, the idea with comparison was to have completely decoupled state paths to avoid this problem altogether. But I guess we rely on the same parent directory.

@agarciamontoro
Copy link
Member Author

@streamer45 The states are still different: they're different files, suffixed by an index. But the containing directory is the same, yes.

Copy link
Contributor

@streamer45 streamer45 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be technically pedantic, I think the FS should generally ensure concurrency safety of these operations. This is more of a high-level race condition caused by applying a set of thread safe operations in a non-atomic way.

@agarciamontoro
Copy link
Member Author

To be technically pedantic, I think the FS should generally ensure concurrency safety of these operations. This is more of a high-level race condition caused by applying a set of thread safe operations in a non-atomic way.

Not sure I understand your comment. What was non-thread safe (and now is) is the ensureTerraformStateDir function.

@agarciamontoro agarciamontoro merged commit 7939612 into master Nov 19, 2024
1 check passed
@agarciamontoro agarciamontoro deleted the MM-61881.ensureTerraformStateDir branch November 19, 2024 22:40
@streamer45
Copy link
Contributor

To be technically pedantic, I think the FS should generally ensure concurrency safety of these operations. This is more of a high-level race condition caused by applying a set of thread safe operations in a non-atomic way.

Not sure I understand your comment. What was non-thread safe (and now is) is the ensureTerraformStateDir function.

I mean, there was nothing strictly unsafe about ensureTerraformStateDir. It's not like you could get into a corrupted state or undefined behavior because all of its child functions should be thread-safe. What could happen was that you could get an error from the file system due to the order of potentially concurrent operations.

@agarciamontoro
Copy link
Member Author

agarciamontoro commented Nov 19, 2024

But getting an error due to the order of potentially concurrent operations is a race condition! 😂 (I will stop discussing this here, just wanted to force you to be even a bit more pedantic 😛)

@streamer45
Copy link
Contributor

lol, I guess it's getting into semantics. From a high level, I agree with you that using ensureTerraformStateDir concurrently was not totally safe mostly because we didn't handle the case in which the directory was already created (e.g. by a competing goroutine), something os.MkdirAll does.

I guess I (personally) can't fully consider at the same level a function that returns a proper error vs one that causes data corruption or undefined behaviour.

Good discussion though, especially before bed time 👍

@agnivade
Copy link
Member

All that Rust experience is getting into your head Claudio. It's not healthy.

agarciamontoro added a commit that referenced this pull request Nov 20, 2024
The same race condition does happen in the implementation of
os.MkdirAll[0], but that code there takes it into account and re-checks
the second error (the one on Mkdir), and it ends up returning nil if the
directory is indeed created, thus avoiding the race condition and making
os.MkdirAll thread-safe.

[0] https://github.com/golang/go/blob/e33f7c42b084182a3a88ef79857e33c11627159a/src/os/path.go#L19-L66
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants