Graceful shutdown on ctrl-c #9
Conversation
Cargo.toml (Outdated)
@@ -21,3 +21,4 @@ handlebars = "3"
colored = "2"
colourado = "0.2.0"
memchr = "2.4.1"
ctrlc = "3.2.1"
You can use tokio::signal::ctrl_c instead, by enabling the signal feature on the tokio dependency. It removes the need for tokio::sync::Mutex<bool> or tokio::sync::Notify altogether.
Oh god, tokio has a lot of things. Are you thinking of something like this?

tokio::spawn(async move {
    loop {
        tokio::select!(
            _ = tokio::signal::ctrl_c() => break,
            _ = async { /* scheduling loop */ } => continue,
        );
    }
});
I suppose this kind of loop is totally OK, if everything in it is cancel safe.
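For reference, a minimal sketch of such a loop, assuming (hypothetically) that the scheduling loop pulls commands from a tokio::sync::mpsc channel; mpsc::Receiver::recv is documented as cancel safe, so losing the select! race to ctrl_c does not drop a queued command:

use tokio::sync::mpsc;

// Hypothetical names: `rx` and the String command type are placeholders,
// not Pegasus's actual types.
async fn scheduling_loop(mut rx: mpsc::Receiver<String>) {
    loop {
        tokio::select! {
            // Resolves once a ctrl-c signal arrives.
            _ = tokio::signal::ctrl_c() => break,
            // recv() is cancel safe: if ctrl_c wins this iteration,
            // no queued command is lost.
            maybe_cmd = rx.recv() => match maybe_cmd {
                Some(cmd) => println!("would run: {cmd}"),
                None => break, // all senders dropped
            },
        }
    }
}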
I think tokio::signal::ctrl_c() can be called multiple times, and every .await on it would return on receiving one ctrl-c signal. However, since I don't have much experience with tokio::signal, I am not so sure.
The scheduling loop is probably not cancel safe. It is designed to check ctrl-c cancellation before fetching a new command from the queue, and I feel like awaiting on Notify::notified or ctrl_c will end up being async with respect to the operations of the scheduling loop, making things difficult. Plus, if I'm going to refactor it as a task, then I should figure out a way to drop(command_txs), etc. when the loop exits. There's a high chance I'm not getting how you imagined ctrl_c would be used?
Very interesting suggestion! I think this will work for broadcast mode, but I'm not sure about queue mode. All in all, this is a major rework of the core logic, so if the current one at least works, I think the refactoring should be done in a separate PR. Also, I feel like the suggestions about notify + tokio::sync::watch and using tokio::signal::ctrl_c are super nice, and I'll try to incorporate those into the current architecture!
> All Sessions are stored in an Arc<HashMap<String, Session>>, with the session name as the key. Since the hosts are loaded at startup, it is perfectly OK to store them in an Arc<HashMap>.

I suppose it can even be an Arc<Vec<Session>>, because sessions aren't really distinguished from each other.
> Then, upon receiving new tasks, search for the corresponding Session and spawn a new task to execute the job.

Similarly, there is no "corresponding" session. In broadcast mode all sessions will be used, so that's not a problem, but in queue mode we need to keep track of which sessions are currently free, and whenever a new command is fetched or a session becomes free, we execute the command on a free session. To do so, we need to track the status (in use or free) of the sessions. Off the top of my head, keeping is_free: Arc<Mutex<Vec<bool>>> and setting the flags at the beginning and end of each task would be an option, but that would require the scheduling loop to poll the vector.
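For concreteness, a rough sketch of that flag-vector idea (all names hypothetical; this is the variant being argued against, since the loop would still have to poll the vector):

use std::sync::Arc;
use tokio::sync::Mutex;

// Mark session `idx` busy for the duration of one command, then free it again.
async fn run_on_session(is_free: Arc<Mutex<Vec<bool>>>, idx: usize) {
    is_free.lock().await[idx] = false;
    // ... run the command on session `idx` ...
    is_free.lock().await[idx] = true;
}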
> The main task can then call tokio::sync::Semaphore::acquire_many to wait for the completion of n tasks and use tokio::sync::SemaphorePermit::forget to prevent the permits from being returned to the tokio::sync::Semaphore.

Definitely works for broadcast mode, but we need something more fine-grained than acquire_many for queue mode.
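As a sketch of the acquire_many + forget idea for broadcast mode (task bodies elided, names hypothetical):

use std::sync::Arc;
use tokio::sync::Semaphore;

async fn wait_for_all_broadcast_tasks(n: u32) {
    // Start with zero permits; each finished task adds one.
    let sem = Arc::new(Semaphore::new(0));

    for _ in 0..n {
        let sem = Arc::clone(&sem);
        tokio::spawn(async move {
            // ... run the broadcast command on one session ...
            sem.add_permits(1);
        });
    }

    // Resolves once all n tasks have reported completion; forget() keeps
    // the permits from being returned to the semaphore.
    let permit = sem.acquire_many(n).await.expect("semaphore closed");
    permit.forget();
}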
> By doing each job in a task instead of running every Session in a task, we can do graceful shutdown easily using tokio::signal::ctrl_c.

I suppose I should start by ditching ctrlc and using tokio::signal::ctrl_c right now:
// Replacing ctrlc::set_handler
tokio::spawn(async move {
    tokio::signal::ctrl_c().await;
    *cancelled_handler.lock().await = true;
});
> Using the architecture I described above, I believe it is also possible to use notify in tokio::spawn combined with tokio::sync::watch to monitor changes of queue.yaml. It is possible since tokio::sync::watch::Receiver::changed is cancel safe.

This is great. Now, instead of just waiting three seconds when queue.yaml is empty, I can await watcher_rx.changed() and, whenever it notifies me, continue to the next iteration, which will immediately call job_queue.next().
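A minimal sketch of what that could look like, assuming some watcher task calls send(()) on the tokio::sync::watch sender whenever queue.yaml changes (the file watcher itself and the names here are hypothetical):

use tokio::sync::watch;

async fn scheduling_loop(mut watcher_rx: watch::Receiver<()>) {
    loop {
        tokio::select! {
            _ = tokio::signal::ctrl_c() => break,
            // changed() is cancel safe, so losing this race to ctrl_c
            // does not permanently miss a notification.
            changed = watcher_rx.changed() => {
                if changed.is_err() {
                    break; // the sender side was dropped
                }
                // queue.yaml changed: call job_queue.next() here instead of sleeping.
            }
        }
    }
}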
> Upon inspection of the implementation of tokio::signal::ctrl_c, I found that it just uses tokio::sync::watch::Receiver to receive the signal, so it should also be cancel safe.

That is good to know, but I didn't quite follow why cancel safety for ctrl_c is needed in your proposed architecture.
> If we actually need to unload/load new hosts dynamically, then I would use HashMap<String, Arc<Session>> to store the sessions.

This could be a pretty good potential feature. Although I currently don't have a need for it, I'll definitely agree if someone adds it.
> Similarly, there is no "corresponding" session.

Oh right, I forgot that in broadcast mode the name of the session doesn't matter. And it doesn't matter in queue mode either.
> Definitely works for broadcast mode, but we need something more fine-grained than acquire_many for queue mode.

We can have a Vec<Arc<Session>> storing all the sessions and create a tokio::sync::mpsc::channel holding every Session that is ready. I would initialize the channel like this:
let mut sessions = Vec::new();
sessions.push(Arc::new(...));
// ...
let (sender, mut receiver) = channel(sessions.len());
for session in sessions.iter() {
    sender.send(session.clone()).await.unwrap();
}
When creating a new task:

// Get a session that is free.
let session = receiver.recv().await.unwrap();
let sender = sender.clone();
tokio::spawn(async move {
    // do something...
    let res = session.run(...).await;
    // handle the error...
    sender.send(session).await.unwrap();
});
This way, we have fine-grained notification of which Session is free for queue mode.
> I suppose I should start by ditching ctrlc and using tokio::signal::ctrl_c right now:

Agreed.

> That is good to know, but I didn't quite follow why cancel safety for ctrl_c is needed in your proposed architecture.

I imagined we would be doing something like this:

tokio::select! {
    _ = ctrl_c() => { /* graceful shutdown */ }
    _ = wait_for_queue_yaml_to_change() => { /* do more work */ }
}
So we're basically tossing around the Session object! That's elegant. And we only have to use one channel. We can also keep a vector of Option<JoinHandle>s where we store the handle returned from tokio::spawn whenever a task begins, and remove it when it's done. Then, on cancellation, we call join_all on that handle vector and then iterate over the sessions vector to gracefully close all the sessions. I acknowledge that switching to your architecture would benefit Pegasus.
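A rough sketch of that shutdown path; Session and close() below are stand-ins for Pegasus's own type and teardown method, so the real signatures may differ:

use std::sync::Arc;
use futures::future::join_all;
use tokio::task::JoinHandle;

// Placeholder for the project's Session type.
struct Session;
impl Session {
    async fn close(&self) { /* tear down the SSH connection */ }
}

async fn graceful_shutdown(handles: Vec<JoinHandle<()>>, sessions: Vec<Arc<Session>>) {
    // Wait for every in-flight command task to finish.
    join_all(handles).await;
    // Then close each SSH session.
    for session in &sessions {
        session.close().await;
    }
}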
But still, as far as I can tell, the performance would not be significantly better (maybe spawning a new task for each command might even slightly degrade performance?), and it doesn't enable anything we couldn't accomplish with the current architecture. Since your architecture is quite a bit simpler, I will definitely keep it in mind and consider refactoring to it when features are added and the code gets too dirty.
You know, I would have gone through the refactoring process if it were Friday night, but... it's Monday soon.. haha.. 😢
> But still, as far as I can tell, the performance would not be significantly better, and it doesn't enable anything we couldn't accomplish with the current architecture.

This is true, but I think a simpler architecture will always make things easier, whether it is to support new features, fix existing bugs, or optimize performance.

> You know, I would have gone through the refactoring process if it were Friday night, but... it's Monday soon.. haha..

Totally understandable.
> This is true, but I think a simpler architecture will always make things easier, whether it is to support new features, fix existing bugs, or optimize performance.

100% agree! Your proposal will not go to waste.
@jaywonchung I recommend explicitly closing …
Thanks for the review! About explicitly closing …
Mostly network disconnected error.
Yes, sorry I mixed up the name.
// terminated.
while let Ok(cmd) = command_rx.recv().await {
    let cmd = cmd.fill_template(&mut registry, &session.host);
    let result = session.run(cmd).await;
I am pretty sure session.run(cmd) is cancel safe.
Cargo.toml (Outdated)
@@ -21,3 +21,4 @@ handlebars = "3"
colored = "2"
colourado = "0.2.0"
memchr = "2.4.1"
ctrlc = "3.2.1"
I read Session::run and I am pretty sure it is cancel safe. Also, you don't have to call ctrl_c in these tasks: you can call tokio::task::JoinHandle::abort to abort them.
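Something like this, I assume (the handle bookkeeping is hypothetical):

use tokio::task::JoinHandle;

async fn abort_all(handles: Vec<JoinHandle<()>>) {
    for handle in &handles {
        handle.abort();
    }
    // Awaiting an aborted handle yields a JoinError with is_cancelled() == true,
    // unless the task had already finished.
    for handle in handles {
        if let Err(e) = handle.await {
            assert!(e.is_cancelled() || e.is_panic());
        }
    }
}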
I just pushed out a commit that defines a …
Since you already …
Now I tried to use …
I also realized this. We could use …
@NobodyXu Thanks a lot for reviewing. If you don't have further comments, mind if I merge?
LGTM
This PR attempts to gracefully shut down SSH sessions when the user presses ctrl-c.
The overall flow of cancellation propagation is:
1. The ctrl-c handler (registered with ctrlc::set_handler) assigns true to the variable cancelled: Arc<Mutex<bool>>.
2. The scheduling loop checks cancelled at the beginning of every iteration. When true, it breaks.
3. The command channels (command_txs) are dropped.
4. When a task sends or recvs on any of its channels, the channel returns an Err and the task breaks out of its command execution loop.
5. The session object is dropped, terminating the SSH session.
As a side note, along the way, this PR fixes stream in Session by explicitly locking stdout. Without this, multiple calls to print! and println! are not coalesced, which led to output lines from different commands mixing with each other.
Closes #3.
@NobodyXu Would you be interested in reviewing? Just asking since you have reviewed a lot of my code recently (and it has benefited both me and Pegasus so much). Plus, I can add you as a collaborator if you'd like :)