Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix race condition on ssh client connection startup #7549

Closed
Closed
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
43 changes: 31 additions & 12 deletions lib/ssh/src/ssh_system_sup.erl
Original file line number Diff line number Diff line change
Expand Up @@ -53,15 +53,26 @@

start_system(Role, Address0, Options) ->
case find_system_sup(Role, Address0) of
{ok,{SysPid,Address}} ->
{ok,{SysPid,Address}} when Role =:= server ->
restart_acceptor(SysPid, Address, Options);
{ok,{SysPid,_Address}} ->
{ok, SysPid};
u3s marked this conversation as resolved.
Show resolved Hide resolved
{error,not_found} ->
supervisor:start_child(sup(Role),
case supervisor:start_child(sup(Role),
#{id => {?MODULE,Address0},
start => {?MODULE, start_link, [Role, Address0, Options]},
restart => temporary,
type => supervisor
})
of
{ok, SysPid} ->
{ok, SysPid};
{error, {already_started, SysPid}} ->
%% There was other connection that created the supervisor while
%% this process was trying to create it as well
{ok, SysPid};
Comment on lines +70 to +73
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we might get into troubles here. Each connection on client side (identified by socket local address and local port) is expected to have its own ssh_system_sup process.

I'm afraid this might create unexpected behaviors, because more than 1 connection will be using same ssh_system_sup. Was RACE2 observable without adjustment for RACE1?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I totally missed that address had the local port and not the destination port. Good catch. Given that I think I am back to the drawing board with the RACE2.

I haven't actually tried just to run the adjustments for RACE1 and RACE3 without RACE2. I can try that but it will take me quite some time to get results I think. One thing is for certain, we were getting already_started errors from the ssh subsystem and we no longer get those after these changes. And even though I don't have the logs any longer I am quite certain that the already_started error came from start_system.

Shall I remove the changes regarding RACE2 from this PR, so we can proceed with the fixes for RACE1 and RACE3?

Others -> Others
end
end.

%%%----------------------------------------------------------------
Expand Down Expand Up @@ -100,16 +111,7 @@ start_subsystem(Role, Address=#address{}, Socket, Options0) ->
Id = make_ref(),
case get_system_sup(Role, Address, Options) of
{ok,SysPid} ->
case supervisor:start_child(SysPid,
#{id => Id,
start => {ssh_subsystem_sup, start_link,
[Role,Address,Id,Socket,Options]
},
restart => temporary,
significant => true,
type => supervisor
})
of
case start_subsystem_sup(SysPid, Id, [Role,Address,Id,Socket,Options]) of
{ok,_SubSysPid} ->
try
receive
Expand All @@ -129,13 +131,30 @@ start_subsystem(Role, Address=#address{}, Socket, Options0) ->
supervisor:terminate_child(SysPid, Id),
{error, connection_start_timeout}
end;
{error,noproc} ->
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@u3s do you also agree that this change is needed to fix RACE1?

%% The supervisor has shut itself down due to the termination of
%% all the significant children
start_subsystem(Role, Address, Socket, Options0);
Others ->
Others
end;
Others ->
Others
end.

start_subsystem_sup(SysPid, Id, StartArgs) ->
try
supervisor:start_child(SysPid,
#{id => Id,
start => {ssh_subsystem_sup, start_link, StartArgs},
restart => temporary,
significant => true,
type => supervisor
})
catch
exit:{noproc,_} ->
{error, noproc}
end.

%%%----------------------------------------------------------------
start_link(Role, Address, Options) ->
Expand Down