Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix race condition on ssh client connection startup #7549

Closed

Conversation

alexandrejbr
Copy link
Contributor

There were 2 race conditions that can happen on the startup of a SSH client connection:

  • two process race to create the ssh_system_sup for a given address, which can cause the start of the ssh_system_sup to return {error, {already_started, Pid}};
  • one process race to create the ssh_system_sup when it's being brought down due to all significant processes have exited.

This commit fixes these race conditions and additionally fixes a third issue that would happen when two processes race to create the ssh_system_sup, an ssh_acceptor_sup would be created for the address the client is trying to connect.

@CLAassistant
Copy link

CLAassistant commented Aug 11, 2023

CLA assistant check
All committers have signed the CLA.

@github-actions
Copy link
Contributor

github-actions bot commented Aug 11, 2023

CT Test Results

       2 files       29 suites   30m 58s ⏱️
   453 tests    446 ✔️   5 💤 2
1 667 runs  1 638 ✔️ 27 💤 2

For more details on these failures, see this check.

Results for commit 7d8e6a9.

♻️ This comment has been updated with latest results.

To speed up review, make sure that you have read Contributing to Erlang/OTP and that all checks pass.

See the TESTING and DEVELOPMENT HowTo guides for details about how to run test locally.

Artifacts

// Erlang/OTP Github Action Bot

@rickard-green rickard-green added the team:PS Assigned to OTP team PS label Aug 14, 2023
@IngelaAndin IngelaAndin added the testing currently being tested, tag is used by OTP internal CI label Aug 16, 2023
@u3s u3s self-assigned this Aug 16, 2023
@u3s u3s linked an issue Aug 22, 2023 that may be closed by this pull request
@IngelaAndin IngelaAndin added the Planned Focus issue added in sprint planning label Aug 25, 2023
@u3s
Copy link
Contributor

u3s commented Sep 29, 2023

please change the target branch to maint. I think, we would like to have a fix in OTP-26.2.

@u3s u3s added this to the OTP-26.2 milestone Sep 29, 2023
@u3s u3s removed the testing currently being tested, tag is used by OTP internal CI label Sep 29, 2023
@alexandrejbr alexandrejbr changed the base branch from master to maint October 3, 2023 09:14
There were 2 race conditions that can happen on the startup of
a SSH client connection:
- two process race to create the ssh_system_sup for a given
address, which can cause the start of the ssh_system_sup to return
{error, {already_started, Pid}};
- one process race to create the ssh_system_sup when it's being
brought down due to all significant processes have exited.

This commit fixes these race conditions and additionally fixes a third issue
that would happen when two processes race to create the ssh_system_sup, an
ssh_acceptor_sup would be created for the address the client is trying to
connect.
@alexandrejbr alexandrejbr force-pushed the fix-ssh-connect-races branch from b5969be to 7d8e6a9 Compare October 3, 2023 09:37
@alexandrejbr
Copy link
Contributor Author

@u3s I think I have done it now.

@alexandrejbr
Copy link
Contributor Author

BTW since this is a bug that exists in other releases, for instance OTP 25, can it be backported there as well?

Is that something you have to do or can I help as well with that?

@u3s u3s added the testing currently being tested, tag is used by OTP internal CI label Oct 9, 2023
@u3s
Copy link
Contributor

u3s commented Oct 9, 2023

BTW since this is a bug that exists in other releases, for instance OTP 25, can it be backported there as well?

Yes it can be backported to OTP 25.

Is that something you have to do or can I help as well with that?

Not really. Way of working is that first it has to be merged to maint and later OTP team prepares a patch.

Sorry for delays on that one. I'm working on it but got disrupted last week.
I tried to reproduce some of those race issues but without success. Is there Erlang ssh server in your setup?

@alexandrejbr
Copy link
Contributor Author

In our tests the SSH server is also implemented in Erlang.

This is indeed hard to reproduce unless you somehow influence the interleaving of the 2 processes. If you just want to see the error you can use the debugger. If you'd like to have an automatic test, then the only way that comes to my mind is to use Concuerror.

@u3s
Copy link
Contributor

u3s commented Nov 3, 2023

sorry for delay. I've created a diagram for ssh supervision tree
please note that on client side:

  • each connection has its own supervision tree starting from ssh_system_sup
  • ssh_system_sup is registered with id being address record (defined in ssh.hrl) - record combines: local address, local port and profile

There were 2 race conditions that can happen on the startup of a SSH client connection:

  • two process race to create the ssh_system_sup for a given address, which can cause the start of the ssh_system_sup to return {error, {already_started, Pid}};

I will refer to it as RACE1

  • one process race to create the ssh_system_sup when it's being brought down due to all significant processes have exited.

I will refer to it as RACE2

This commit fixes these race conditions and additionally fixes a third issue that would happen when two processes race to create the ssh_system_sup, an ssh_acceptor_sup would be created for the address the client is trying to connect.

I will refer to it as RACE3

Copy link
Contributor

@u3s u3s left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain and maybe verify TCP sockets used for connections?
I think RACE1 can only happen if ssh attempts are made for colliding sockets - this means local address, local port and profile (referred as Address in code since it holds address record) is used more than once.

I currently don't think re-using SysPid on client side is a good idea. I think that instead new connection attempt should be made by user code (hoping that better socket will be provided).

Some ideas for more checks:

  • check SO_REUSEPORT and related OS settings
  • monitor socket related stats on OS level and match them with occurrence of RACE1
  • trace ssh_system_sup:start_system/3 and check if same Address(including port) is used more than once?
  • running modified ssh_system_sup so that in case of already_started more information is gathered (like netstat -na | grep Port or equivalent)
  • maybe socket collisions could be monitored in your environment in some other way ?

lib/ssh/src/ssh_system_sup.erl Show resolved Hide resolved
Comment on lines +70 to +73
{error, {already_started, SysPid}} ->
%% There was other connection that created the supervisor while
%% this process was trying to create it as well
{ok, SysPid};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we might get into troubles here. Each connection on client side (identified by socket local address and local port) is expected to have its own ssh_system_sup process.

I'm afraid this might create unexpected behaviors, because more than 1 connection will be using same ssh_system_sup. Was RACE2 observable without adjustment for RACE1?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I totally missed that address had the local port and not the destination port. Good catch. Given that I think I am back to the drawing board with the RACE2.

I haven't actually tried just to run the adjustments for RACE1 and RACE3 without RACE2. I can try that but it will take me quite some time to get results I think. One thing is for certain, we were getting already_started errors from the ssh subsystem and we no longer get those after these changes. And even though I don't have the logs any longer I am quite certain that the already_started error came from start_system.

Shall I remove the changes regarding RACE2 from this PR, so we can proceed with the fixes for RACE1 and RACE3?

@u3s u3s removed the testing currently being tested, tag is used by OTP internal CI label Nov 3, 2023
@u3s
Copy link
Contributor

u3s commented Nov 6, 2023

Can you also explain:

  1. is the given address meaning - there is single ssh deamon listening on single port and handling multiple connection attempts?
  2. are sockets provided to ssh as an argument or ssh receives Host+port and establishes TCP connection internally?
  3. is it Linux environment?
  4. is loopback interface or 127.0.0.1 used as ssh connection target ?

@alexandrejbr
Copy link
Contributor Author

Can you also explain:

  1. is the given address meaning - there is single ssh deamon listening on single port and handling multiple connection attempts?
  2. are sockets provided to ssh as an argument or ssh receives Host+port and establishes TCP connection internally?
  3. is it Linux environment?
  4. is loopback interface or 127.0.0.1 used as ssh connection target ?
  1. I am a bit confused with the first part of your question, these errors happen when the client tries to connect to some ssh servers. Are you asking me about the ssh daemon of the servers?
  2. Host+port and establishes the TCP connection internally;
  3. yes, I am not sure I ever reproduced this one on my Mac, but the noproc one I did.
  4. yes.

I am sorry for the delay replying, my email client was not fetching emails from the email address I use for GitHub so I missed the notifications.

@@ -129,13 +131,30 @@ start_subsystem(Role, Address=#address{}, Socket, Options0) ->
supervisor:terminate_child(SysPid, Id),
{error, connection_start_timeout}
end;
{error,noproc} ->
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@u3s do you also agree that this change is needed to fix RACE1?

@IngelaAndin IngelaAndin added the stalled waiting for input by the Erlang/OTP team label Dec 12, 2023
@garazdawi garazdawi removed this from the OTP-26.2 milestone Dec 15, 2023
@u3s
Copy link
Contributor

u3s commented Jan 3, 2024

sorry for delay. I hope to come back to this soon.

@u3s u3s removed the stalled waiting for input by the Erlang/OTP team label Feb 6, 2024
@u3s
Copy link
Contributor

u3s commented Feb 9, 2024

I've created PR-8107 which won't start acceptor for ssh when in client role.

@u3s
Copy link
Contributor

u3s commented Feb 21, 2024

It was decided to postpone work related to ssh supervision tree.
I would like to revise how supervision tree is implemented and then decide how it should be fixed.

@alexandrejbr
Copy link
Contributor Author

Let's close this PR. If I manage to gather more information about the remaining issues I will come back to you.

@u3s
Copy link
Contributor

u3s commented Feb 22, 2024

OK. In general I agree there are issues with supervision tree appearing in some border cases.
But I'm not sure if the solutions proposed at this stage are the optimal ones ... and need more time for checking that.

@u3s
Copy link
Contributor

u3s commented Sep 10, 2024

replaced with #8766

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Planned Focus issue added in sprint planning priority:low team:PS Assigned to OTP team PS
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Race conditions on startup of an ssh connection with the client role
6 participants