Deadlock of the pool when destroying a replica #1737

Open
tiagolobocastro opened this issue Sep 12, 2024 · 6 comments

@tiagolobocastro
Contributor

Describe the bug
The pool lock was taken and never released. This means all gRPC calls for that pool will fail!

To Reproduce
This seems to happen if we try to delete a replica that is still part of a nexus!

Expected behavior
Don't lock the pool forever...

Additional context

2024-09-12T06:54:14.717272539+02:00 stdout F [2024-09-12T04:54:14.717133800+00:00  INFO io_engine::grpc::v1::replica:replica.rs:402] DestroyReplicaRequest { uuid: "4711b421-0210-4db5-b88f-c2c55cac52da", pool: Some(PoolName("alex-cloud-sn-2-pool")) }
2024-09-12T06:54:14.719461813+02:00 stdout F [2024-09-12T04:54:14.719375762+00:00  INFO io_engine::lvs::lvs_lvol:lvs_lvol.rs:247] Lvol 'alex-cloud-sn-2-pool/3a6b6004-dbc8-4613-b316-f1f35fce24e0/4711b421-0210-4db5-b88f-c2c55cac52da' [50.00 GiB]: unshared
2024-09-12T06:54:14.719767776+02:00 stdout F [2024-09-12T04:54:14.719701943+00:00  INFO io_engine::bdev::device:device.rs:785] Received SPDK remove event for bdev '4711b421-0210-4db5-b88f-c2c55cac52da'
2024-09-12T06:54:14.719783986+02:00 stdout F [2024-09-12T04:54:14.719730236+00:00  INFO io_engine::bdev::nexus::nexus_bdev_children:nexus_bdev_children.rs:899] Unplugging nexus child device nexus_name="331c0652-0a75-4a2c-8946-3caa0590af06" child_device="4711b421-0210-4db5-b88f-c2c55cac52da"
2024-09-12T06:54:14.719851202+02:00 stdout F [2024-09-12T04:54:14.719744462+00:00  INFO io_engine::bdev::nexus::nexus_child:nexus_child.rs:1113] Child 'bdev:///4711b421-0210-4db5-b88f-c2c55cac52da?uuid=4711b421-0210-4db5-b88f-c2c55cac52da @ 331c0652-0a75-4a2c-8946-3caa0590af06' [open synced]: unplugging child...
2024-09-12T06:54:14.720345689+02:00 stdout F [2024-09-12T04:54:14.719979192+00:00  INFO io_engine::bdev::nexus::nexus_bdev:nexus_bdev.rs:657] Nexus '331c0652-0a75-4a2c-8946-3caa0590af06' [open]: dynamic reconfiguration event: unplug, reconfiguring I/O channels...
2024-09-12T06:54:14.720361719+02:00 stdout F [2024-09-12T04:54:14.720206068+00:00  INFO io_engine::bdev::nexus::nexus_bdev:nexus_bdev.rs:680] Nexus '331c0652-0a75-4a2c-8946-3caa0590af06' [open]: dynamic reconfiguration event: unplug, reconfiguring I/O channels completed with result: Ok
2024-09-12T06:54:14.7203678+02:00 stdout F [2024-09-12T04:54:14.720225935+00:00  INFO io_engine::bdev::nexus::nexus_child:nexus_child.rs:1157] Child 'bdev:///4711b421-0210-4db5-b88f-c2c55cac52da?uuid=4711b421-0210-4db5-b88f-c2c55cac52da @ 331c0652-0a75-4a2c-8946-3caa0590af06' [closed synced]: child successfully unplugged

This was found on another report: #1734

@tiagolobocastro
Contributor Author

It looks like there was a heartbeat failure, which caused the control-plane to mark the node as offline.
In turn, this means we didn't set the nexus node shutdown request.
Nonetheless, we should not have attempted to destroy the replica, because the nexus was not verified as shut down!
Also, could we add a check on the data-plane that the replica is not in use, to avoid getting into trouble?
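A rough sketch of what such a data-plane check could look like. Everything here is hypothetical and for illustration only: `ReplicaRegistry`, `DestroyError` and the "in use by" bookkeeping are not io-engine types, and the real check would have to ask the nexus/bdev layer whether the replica is still claimed.

```rust
use std::collections::HashMap;

/// Hypothetical error type for this sketch (not the io-engine error enum).
#[derive(Debug, PartialEq)]
pub enum DestroyError {
    NotFound { uuid: String },
    InUse { uuid: String, nexus: String },
}

/// Hypothetical bookkeeping: replica UUID -> nexus currently using it, if any.
#[derive(Default)]
pub struct ReplicaRegistry {
    replicas: HashMap<String, Option<String>>,
}

impl ReplicaRegistry {
    /// Registers a replica, optionally recording the nexus that uses it.
    pub fn add(&mut self, uuid: &str, nexus: Option<&str>) {
        self.replicas
            .insert(uuid.to_string(), nexus.map(str::to_string));
    }

    /// Refuses to destroy a replica that is still a child of a nexus,
    /// returning an error instead of starting the destroy.
    pub fn destroy(&mut self, uuid: &str) -> Result<(), DestroyError> {
        let in_use_by = match self.replicas.get(uuid) {
            None => {
                return Err(DestroyError::NotFound {
                    uuid: uuid.to_string(),
                })
            }
            Some(owner) => owner.clone(),
        };
        if let Some(nexus) = in_use_by {
            return Err(DestroyError::InUse {
                uuid: uuid.to_string(),
                nexus,
            });
        }
        self.replicas.remove(uuid);
        Ok(())
    }
}
```

With a guard like this, the destroy request from the logs above would be rejected up front while the replica is still a nexus child, rather than proceeding to unshare and unplug it.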

@dsharma-dc
Contributor

> Pool lock was taken and never released

The DestroyReplica call is the one getting starved of the lock, but who is holding the pool lock here?

@tiagolobocastro
Contributor Author

> Pool lock was taken and never released
>
> The DestroyReplica call is the one getting starved of lock, but who is holding the pool lock here?

The first DestroyReplica call

@tiagolobocastro
Contributor Author

tiagolobocastro commented Sep 16, 2024

This ticket needs two fixes:

  • Control-plane should not issue a replica destroy if the replica is still part of a nexus
  • Data-plane should not lock up the pool anyway, even in case of a control-plane bug
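For the data-plane side, one possible mitigation (only a sketch, not the actual fix) is to bound how long a request waits for the pool lock, so that later calls fail fast with an error instead of hanging behind a stuck holder. The example uses `std::sync::Mutex` purely for illustration; the real pool lock in io-engine may well be an async primitive, and `lock_with_timeout` and `LockError` are hypothetical names. Note that this does not free a lock that is already stuck, it only stops other requests from blocking forever.

```rust
use std::sync::{Mutex, MutexGuard, TryLockError};
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
pub enum LockError {
    Timeout,
    Poisoned,
}

/// Tries to take `lock`, polling until `timeout` has elapsed.
/// Returns the guard on success, or `LockError::Timeout` if the
/// lock is still held by someone else when the deadline passes.
pub fn lock_with_timeout<T>(
    lock: &Mutex<T>,
    timeout: Duration,
) -> Result<MutexGuard<'_, T>, LockError> {
    let deadline = Instant::now() + timeout;
    loop {
        match lock.try_lock() {
            Ok(guard) => return Ok(guard),
            Err(TryLockError::Poisoned(_)) => return Err(LockError::Poisoned),
            Err(TryLockError::WouldBlock) => {
                if Instant::now() >= deadline {
                    return Err(LockError::Timeout);
                }
                // Back off briefly before retrying.
                thread::sleep(Duration::from_millis(1));
            }
        }
    }
}
```

A gRPC handler could then map the timeout to an error status for the caller, which at least makes the stuck pool visible rather than leaving every request for that pool waiting.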

@tiagolobocastro
Contributor Author

Control-plane changes: openebs/mayastor-control-plane#862

@tiagolobocastro
Contributor Author

The control-plane fix is released in 2.7.1, but we should also ensure from the data-plane that this can't happen, so I'm leaving this issue open.
