-
Notifications
You must be signed in to change notification settings - Fork 395
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make util_ns fields: run and is_initialized atomics #10570
Open
dsciebu
wants to merge
1
commit into
ofiwg:main
Choose a base branch
from
dsciebu:dariuszs/nr_run
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
+13
−13
Open
Changes from all commits
Commits
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name service is basically a simple way to support running fabtests. It's not something we expect to have enabled as a general purpose feature.
If it were, the changes here are not sufficient to support multiple threads. Locks would be needed. It's ultimately the responsibility of the caller to have appropriate serialization to the name service calls.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the tip about the name service. I found that it is by default enabled for verbs provider, which, in case of a multithreaded application, leads to data races triggered by e.g. parallel fi_close and util_ns_accept_handler execution. Don't you think, that by as you said: "not expecting this option to be enabled as a general purpose feature" you should default this option to OFF?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you describe what sort of endpoints the app is using? It looks like the name service is related to dgram ep's, and only started if a verbs domain is opened for dgram ep's. Note that this can occur through the rxd provider for rdm ep's.
@j-xiong - I see that the name server is enabled by default, but disabled if specific, unrelated environment variables are set (OMPI_COMM_WORLD_RANK or PMI_RANK). Should the default be to disable it, and use an environment variable to enable instead?
I'm unsure how frequently the verbs dgram provider is used in practice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@shefty The name server is needed for verbs dgram to resolve "node" into dst_addr in fi_getinfo(). By default this is a feature an application may use. MPIs (OpenMPI, MPICH, Intel MPI) don't use this feature so the name server can be safely disabled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My experiments that revealed the issue were made on a Mellanox RDMA card. Using that hardware and setting up the hints the way that rdm fabtests do, I ended up in getting an RXD protocol 'based' endpoint.
When checking my libfabric based app's test, I noticed that Thread Sanitizer reports data race in libfabric and ended up here. Turning off the 'name service' with the proper env variable solves the problem completely.
So, IMO it should be:
a) turned off by default if not necessary or
b) made thread safe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dsciebu - I agree. Can you open an issue to track this? The name service should be made thread safe, which I'm guessing won't be that difficult. Likely just needs a lock around most of the functions you identified.