Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feat[BMQ]: Add an admin command to remove a domain #541

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

emelialei88
Copy link
Collaborator

@emelialei88 emelialei88 commented Dec 9, 2024

The existing admin command "DOMAINS DOMAIN <name> PURGE" only add purge record to the journal files for the queues belonging to this domain. This may cause race condition where it's possible for a user to connect to a queue that's purged.

This PR adds a new admin command "DOMAINS DOMAIN <name> REMOVE" to:

  • Fail this admin command if there's any opened or opening queue
  • Mark domain to block any upcoming openQueue requests
  • Purge inactive queues
  • Force GC queues
  • Wait until queues are unregistered and unassigned
  • Clean cache in domainResolver and configProvider

With a temporary second pass "DOMAINS DOMAIN <name> REMOVE FINALIZE" to:

  • Remove the Domain object from DomainManager (this can be done only after the first pass succeed and the domain config file is removed from the disk)

@emelialei88 emelialei88 force-pushed the admin-domain-remove branch 2 times, most recently from a7cb475 to d760dbb Compare December 10, 2024 15:16
@emelialei88 emelialei88 force-pushed the admin-domain-remove branch 2 times, most recently from fcd0589 to 40eb300 Compare December 10, 2024 23:15
@emelialei88 emelialei88 force-pushed the admin-domain-remove branch 4 times, most recently from 035193e to bc06ff4 Compare December 26, 2024 19:48
@emelialei88 emelialei88 force-pushed the admin-domain-remove branch 4 times, most recently from 434b216 to 12e54ee Compare January 6, 2025 15:58
@emelialei88 emelialei88 marked this pull request as ready for review January 6, 2025 16:26
@emelialei88 emelialei88 requested a review from a team as a code owner January 6, 2025 16:26
@emelialei88 emelialei88 force-pushed the admin-domain-remove branch 4 times, most recently from 3f686e0 to bd5cec4 Compare January 7, 2025 17:30
Copy link
Collaborator

@pniedzielski pniedzielski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the only substantial change we need to make is to clear the domain resolver cache in the second phase as well (since loading the domain will repopulate the cache).

Great!

src/groups/mqb/mqbcmd/mqbcmd_commandlist.cpp Outdated Show resolved Hide resolved
@@ -73,6 +73,14 @@ struct CommandDefinition {
"Clear the domain resolution cache entry of the optionally specified "
"'domain', or clear all domain resolution cache entries if 'ALL' is "
"specified."},
{"DOMAINS REMOVE <domain> [finalize]",
"Remove a domain with an optional keyword 'finalize'",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this description should reveal less of the implementation:

Remove a domain from the cluster.  If the keyword 'FINALIZE' is not supplied, perform the first phase of domain deletion, failing if there is an open queue on this domain and blocking any open queue requests subsequent to the command being issued.  After the first phase of domain deletion, it is safe to remove the domain configuration file from disk.  If the keyword 'FINALIZE' is specified, perform the second phase of domain deletion, failing if the domain has not had the first phase of domain deletion performed.  After the second phase of domain deletion, it is safe to create a new domain reusing the same name and to connect to it.

Open questions from the above:

  1. Your text here reads like it is undefined behavior if the second pass occurs without the first pass. If this is true, I think we should change it to just fail in that case.
  2. Reading this written out, I think we need to clean the domain resolver cache and config provider cache in the second phase, not the first. If someone connects to a domain that has been deleted in the first pass, doesn't the broker need to go through the domain resolver to know what Domain object to talk to, so it can find out the domain has been deleted and reject the request? There will still be a tiny race condition, but I think clearing these caches right before you delete the domain object is the correct behavior.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. We do check the state of the domain to make sure we only continue the second round when the first is completed.
  2. Good catch!

src/groups/mqb/mqbcmd/mqbcmd_parseutil.cpp Outdated Show resolved Hide resolved
src/groups/mqb/mqbcmd/mqbcmd_parseutil.cpp Outdated Show resolved Hide resolved

DomainSp domainSp;

if (0 != locateOrCreateDomain(&domainSp, name)) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We talked about this, but just to record it down: the reason we do this here is due to the lazy loading of domains that BlazingMQ does. We don't want to fail if a user creates a domain configuration file, happens not to connect, and then issues this admin command, or creates a domain configuration file, uses the domain, purges the domain, and restarts the broker before issuing this admin command.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But, this will load the domain into the resolver cache, so we do need to clear that on the second pass.

src/integration-tests/test_domain_remove.py Outdated Show resolved Hide resolved
src/integration-tests/test_domain_remove.py Outdated Show resolved Hide resolved
src/groups/mqb/mqba/mqba_commandrouter.cpp Outdated Show resolved Hide resolved
src/groups/mqb/mqbblp/mqbblp_cluster.cpp Show resolved Hide resolved
src/groups/mqb/mqba/mqba_domainmanager.cpp Show resolved Hide resolved
implement the first pass where the config file of the domain is still on the disk,
and the admin command can only be sent to leader/primary

Signed-off-by: Emelia Lei <[email protected]>
Remove the domain object in the second pass

Signed-off-by: Emelia Lei <[email protected]>
…d pass.

If an open queue request is issued right fter the first pass,
the domain will be loaded into domainResolver and configProvider.
We need to move cache cleaning to right before the remove of the domain object.

Signed-off-by: Emelia Lei <[email protected]>
@pniedzielski
Copy link
Collaborator

I think we should add a test-case or two just to verify the change in that final commit. Then should be good to go.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants