comm: re-implement dynamic processes using mpir-layer lpid #7240
Open. hzhou wants to merge 59 commits into pmodels:main from hzhou:2412_dynamic_am
hzhou force-pushed the 2412_dynamic_am branch from e07760f to fda47b7 (December 20, 2024 17:24)
Miscellaneous typo fixes to appease the spellchecker.
This test requires access to MPICH internals, so it won't be used with the current design.
We no longer use this file.
Hide the internal fields of MPIR_Group from unnecessary access. Outside group_util.c and group_impl.c, code only needs to assume the MPIR_Lpid integer type, the creation routines based on an lpid map or an lpid stride description, and the access routine that looks up an lpid from a group rank.
For most external usages, we only need MPIR_Group_rank_to_lpid.
Avoid accessing group internal fields.
Group similar functions together to facilitate refactoring. There are no changes in this commit other than moving functions around. The 4 incl/excl functions are very similar, as are the 3 difference/intersection/union functions.
Use MPIR_Group_{rank_to_lpid,lpid_to_rank} to avoid directly accessing MPIR_Group internal fields. For most group creation routines, just populate an lpid lookup map and call MPIR_Group_create_map to create the group.
* add option to use stride to describe group composition
* remove the linked list design
This is the same as MPID_Comm_get_lpid. NOTE: we will remove MPID_Comm_get_lpid as well once we move the ownership of lpid to the MPIR-layer.
There is no real difference between lpid and gpid. Thus rename gpid in the device layer to lpid for clarity. Replace uint64_t with MPIR_Lpid as the type of lpid. This improves consistency.
We need a device-independent way of identifying processes. One way is to use the combination of (world_idx, world_rank). Thus, we need to maintain a list of worlds so that world_idx points to the world record. This may not fit the concept of an MPI group, but since a group needs a way to identify processes, it seems the most closely related place. The first world, world_idx 0, is always initialized at init. Because of session re-init, we need to make sure to reset num_worlds to 0 at finalize. New worlds will be added upon spawning or connecting dynamic processes (to be implemented).
We need to reset num_worlds so that Session re-init will work.
Add builtin MPIR_GROUP_WORLD and MPIR_GROUP_SELF, so we can create builtin communicators from builtin groups.
Internally the only reason to duplicate a group is to copy from NULL session to a new session. Otherwise, we can just use the same group and increment the reference count.
Since builtin groups can be returned to users, users should be allowed to free them. They are reference counted anyway.
To make the MPI group a first-class citizen, we will always have a group before creating communicators, so that when the device layer activates communicators, e.g. in MPID_Comm_commit_pre_hook, it can rely on the group to look up the involved processes. It also removes the need to maintain any other process addressing schemes.
In many places we just return MPIR_Group_empty without incrementing the ref_count. This is fixable, but for now, let's avoid freeing it.
The init_comm does the release manually.
Add assertions to make sure the local_group and remote_group (for inter communicators) are always set before MPID_Comm_commit_pre_hook.
Otherwise, the MPI_T functions may not be able to convert builtin datatypes.
When we run tests as functions, stray output in MPI_Finalize, such as the debug messages in debug builds, was not captured previously. This patch makes sure we report such stray output as failures.
Now that we always have group inside a communicator, we can simply return the lpid from the group. Because this will be used in the hot path, make it inline.
Add the following macros: MPIR_LPID_WORLD_INDEX, MPIR_LPID_WORLD_RANK, MPIR_LPID_FROM.
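As a minimal sketch of how macros like these could pack a (world_idx, world_rank) pair into a single signed 64-bit lpid: the bit layout below (world index in the upper 32 bits, world rank in the lower 32 bits) and all `SKETCH_`/`sketch_` names are assumptions for illustration, not the definitions from this PR.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for MPIR_Lpid, which this PR describes as a signed 64-bit integer. */
typedef int64_t sketch_lpid_t;

/* Assumed layout: upper 32 bits hold the world index, lower 32 bits the world rank. */
#define SKETCH_LPID_FROM(world_idx, world_rank) \
    ((sketch_lpid_t) (((uint64_t) (world_idx) << 32) | (uint32_t) (world_rank)))

/* Extract the two components again. */
#define SKETCH_LPID_WORLD_INDEX(lpid) ((int) ((uint64_t) (lpid) >> 32))
#define SKETCH_LPID_WORLD_RANK(lpid)  ((int) ((uint64_t) (lpid) & 0xffffffffu))
```

With this layout, processes of the first world (world_idx 0) have an lpid equal to their world rank, so the encoding and decoding cost is negligible in the common single-world case.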
Fix a typo in setting the size of MPIR_GROUP_SELF. Add ref_count if we return MPIR_GROUP_EMPTY to prevent freeing the builtin when it is released internally. Unfortunately, since users can directly use MPI_GROUP_EMPTY, we can't keep ref_count accurate. But at least we can keep it positive to prevent an actual free.
The builtin groups are in session NULL. We need to duplicate the groups in MPIR_Group_from_session_pset_impl to return a group in the correct session.
Groups are a natural place to host the vcrt (virtual connection reference table). When communicators are duplicated, groups are simply inherited and reference counted. Thus we won't end up duplicating the vcrt.
Add a macro that tracks local memory allocation from other routines.
Because we need to access MPIR_Lpid definitions in mpidpre headers, we need to move the worlds and lpid definitions to device-independent headers. Add macro MPIR_LPID_INVALID. Make MPIR_Lpid signed. Since we are going to perform arithmetic on MPIR_Lpid, e.g. when using a strided pmap, make MPIR_Lpid int64_t instead of uint64_t to avoid accidental conversion errors.
* Add check_map_is_strided to detect a strided pattern and convert a map into a strided pmap.
* In MPIR_Group_check_subset, use MPIR_Group_lpid_to_rank rather than a manual linear search.
* Move internal static routines to the bottom of grouputil.c.
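The strided-pattern detection mentioned above can be sketched as follows. This is a hypothetical helper, not the PR's check_map_is_strided: the function name, its signature, and the choice to accept only positive strides are assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: returns true and fills *offset and *stride when
 * map[i] == *offset + i * (*stride) holds for every i in [0, size).
 * Assumption: only positive strides count as "strided". */
bool sketch_map_is_strided(const int64_t *map, int size,
                           int64_t *offset, int64_t *stride)
{
    if (size < 1) {
        return false;
    }
    /* A single-entry map is trivially strided; use stride 1 by convention. */
    int64_t s = (size > 1) ? map[1] - map[0] : 1;
    if (s <= 0) {
        return false;
    }
    for (int i = 2; i < size; i++) {
        if (map[i] - map[i - 1] != s) {
            return false;
        }
    }
    *offset = map[0];
    *stride = s;
    return true;
}
```

A group whose map passes this check only needs to store two integers instead of one lpid per rank.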
A strided group with nontrivial blocksize is rare. By removing the blocksize parameter (i.e. blocksize is always 1), we greatly simplify the code and also improve the performance of lpid lookup in a more common strided group (such as a typical comm_world group or node group).
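To illustrate why dropping blocksize simplifies lookup, here is a minimal sketch of both directions of the mapping for a blocksize-1 strided group. The struct, the function names, and the use of -1 as the "not found" value are assumptions; the sketch also assumes a positive stride.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical description of a strided group: lpid = offset + rank * stride. */
typedef struct {
    int64_t offset;
    int64_t stride;             /* assumed > 0 */
    int size;
} sketch_stride_t;

/* Rank to lpid is one multiply and one add. Returns -1 for an invalid rank. */
int64_t sketch_rank_to_lpid(const sketch_stride_t *g, int rank)
{
    if (rank < 0 || rank >= g->size) {
        return -1;
    }
    return g->offset + (int64_t) rank * g->stride;
}

/* Lpid to rank is one subtract, one remainder check, and one divide.
 * Returns -1 when the lpid is not a member of the group. */
int sketch_lpid_to_rank(const sketch_stride_t *g, int64_t lpid)
{
    int64_t d = lpid - g->offset;
    if (d < 0 || d % g->stride != 0) {
        return -1;
    }
    int64_t r = d / g->stride;
    return (r < g->size) ? (int) r : -1;
}
```

The subtraction `lpid - g->offset` can go negative, which is one place where a signed lpid type avoids wrap-around surprises.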
The pmap is always used inside MPIR_Group, and its size is always the same as group->size. Having a duplicated field creates more opportunities for bugs from inconsistency.
Replace MPIR_Assert with a better error message.
We'll create av tables in ch4 according to world_idx and world_rank. MPIDIU_lpid_to_av can look up the av entry from an lpid in the communication path. MPIDIU_lpid_to_av_slow, used in communicator creation paths, will check and allocate the corresponding av table as needed.
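The split between a fast lookup and an allocating slow lookup could look roughly like the sketch below. Everything here is hypothetical: the fixed-size table array, the entry type, the `world_size` parameter (real code would presumably take it from the world record), and the assumed lpid layout with the world index in the upper 32 bits.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_MAX_WORLDS 8

typedef struct { uint64_t addr; } sketch_av_t;          /* placeholder address entry */
typedef struct { int size; sketch_av_t *entries; } sketch_av_table_t;

/* One table per world, indexed by world_idx; NULL until allocated. */
sketch_av_table_t *sketch_av_tables[SKETCH_MAX_WORLDS];

/* Fast path for communication: assumes the table already exists. */
sketch_av_t *sketch_lpid_to_av(int64_t lpid)
{
    int world_idx = (int) (lpid >> 32);
    int world_rank = (int) (lpid & INT64_C(0xffffffff));
    return &sketch_av_tables[world_idx]->entries[world_rank];
}

/* Slow path for communicator creation: validates and allocates on demand. */
sketch_av_t *sketch_lpid_to_av_slow(int64_t lpid, int world_size)
{
    int world_idx = (int) (lpid >> 32);
    int world_rank = (int) (lpid & INT64_C(0xffffffff));
    if (world_idx < 0 || world_idx >= SKETCH_MAX_WORLDS || world_rank >= world_size) {
        return NULL;
    }
    if (sketch_av_tables[world_idx] == NULL) {
        sketch_av_table_t *t = malloc(sizeof(*t));
        if (t == NULL) {
            return NULL;
        }
        /* Zero-initialized entries, matching the description of ch4 av tables. */
        t->entries = calloc((size_t) world_size, sizeof(sketch_av_t));
        if (t->entries == NULL) {
            free(t);
            return NULL;
        }
        t->size = world_size;
        sketch_av_tables[world_idx] = t;
    }
    return sketch_lpid_to_av(lpid);
}
```

Keeping the checks and allocation out of the fast path means the communication path pays only for two index operations.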
Dynamic av will be used to support MPID_Comm_connect/accept when we need to create the leader av before we know the correct lpid entries. They are expected to be freed at the end of inter communicator creation.
Add:
* MPIDI_NM_insert_upid - insert an av entry so the lpid is ready for communication. The lpid can be allocated from a dynamic av table, which supports temporary communication between intercomm leaders. When the upid is later inserted again into the regular av tables, the dynamic entries are checked and copied over if they already exist.
* MPIDI_NM_dynamic_sendrecv - used by local group leaders to exchange data over dynamic_av. The dynamic handshakes are susceptible to concurrent interference. Thus the upper layer is assumed to hold the vci-0 critical section.
We can easily exchange the context_id along with the rest of the remote info rather than do it in a separate step. We can determine is_low_group by comparing world namespace and world_rank entirely in the MPIR layer, so it is no longer needed in MPID_Intercomm_exchange. Rename MPID_Intercomm_exchange_map to MPID_Intercomm_exchange to better reflect that it is not just exchanging maps.
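One way the is_low_group decision could be made from the two leaders' world namespace and world_rank is sketched below. The function name, the lexicographic ordering of namespaces, and the tie-break on rank are assumptions; the sketch also assumes the two leaders are distinct processes, so the pairs never compare equal.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: decide which side is the "low" group by ordering the
 * leaders first by world namespace string, then by world rank. Both sides
 * evaluate the same comparison with the arguments swapped, so exactly one
 * side gets true as long as the two leaders differ. */
bool sketch_is_low_group(const char *local_ns, int local_rank,
                         const char *remote_ns, int remote_rank)
{
    int cmp = strcmp(local_ns, remote_ns);
    if (cmp != 0) {
        return cmp < 0;
    }
    return local_rank < remote_rank;
}
```

Because the inputs are already part of the exchanged world information, no extra communication step is needed to agree on the result.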
This is fully replaced with MPIR_comm_rank_to_lpid or MPIR_Group_rank_to_lpid.
Refactor MPID_Intercomm_exchange to maximize the common parts for MPI_Intercomm_create, MPI_Comm_connect/accept, and MPI_Intercomm_create_from_group. They differ in the first step: how to establish leader-to-leader communication. In ch4, this means establishing an av for the remote leader. Once the av is established, the intercomm exchange parts are common. We no longer generate lpids in the ch4 layer. Rather, we exchange world information and convert lpids by swapping world_idx. The lpids are used directly as indices into the ch4 av tables, and upids (address names) are inserted into the av table entries.
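The "convert lpids by swapping world_idx" step can be pictured with the sketch below. It assumes the receiver has already built a translation table from the exchanged world information, mapping each of the sender's world indices to the local world index of the same world. The function name, the table, and the assumed lpid layout (world index in the upper 32 bits) are all hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: rewrite lpids received from a remote process so that
 * they use local world indices. remote_to_local[i] is the local world index
 * of the world that the sender calls world i. The world rank is unchanged. */
void sketch_translate_lpids(int64_t *lpids, int n, const int *remote_to_local)
{
    for (int i = 0; i < n; i++) {
        int remote_idx = (int) (lpids[i] >> 32);
        int64_t world_rank = lpids[i] & INT64_C(0xffffffff);
        lpids[i] = ((int64_t) remote_to_local[remote_idx] << 32) | world_rank;
    }
}
```

After translation, the lpids can be used like any locally generated lpid, for example as indices into per-world address tables.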
In MPID_Comm_connect/accept, simply establish remote_lpid and call MPIR_Intercomm_create_impl.
The local_group and remote_group fully capture the mapper functions.
We have switched to using MPIR_Lpid for addressing in the ch4 av table manager. Both map and local_map in the ch4 MPIDI_Devcomm_t are no longer needed.
Rename it to MPIDIU_get_grank, remove the dependency on MPIDIU_comm_rank_to_lpid (to be removed next) and use MPIR_comm_rank_to_lpid instead.
This is fully replaced by MPIR_comm_rank_to_lpid.
Track MPIR_Lpid lpid rather than a pair of (avtid, lpid).
Now that we use MPIR_Lpid, we no longer need the netmod api to convert upids to lpids. The function is replaced by the netmod api insert_upid.
We no longer expose avtid. Replace MPIDIU_get_av with MPIDIU_lpid_to_av. Also remove unused GPID macros.
When the ch4 layer allocates an av table, all entries are initialized to 0. However, 0 can be a valid entry for fi_addr_t. We could initialize all entries to FI_ADDR_NOTAVAIL, but that requires the additional complexity of a netmod API. Instead, because the entry 0 is always the first entry to be inserted by fi_av_insert, we can simply remember the entry (MPIDI_OFI_global.lpid0) and use it to tell which entries are empty (in MPIDI_OFI_insert_upid).
We don't really track av tables' ref_count. We simply free all av tables at finalize. Rename MPIDIU_avt_destroy to MPIDIU_avt_finalize to better reflect its role.
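The emptiness test described here can be sketched as follows. The types and function names are hypothetical, `sketch_addr_t` merely stands in for fi_addr_t, and -1 as the "not yet known" value of lpid0 is an assumption; no libfabric calls are involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t sketch_addr_t;     /* stand-in for fi_addr_t */

typedef struct {
    int64_t lpid0;                  /* lpid whose stored address is legitimately 0; -1 until known */
} sketch_global_t;

/* Remember which lpid received address 0 when it was inserted. */
void sketch_record_insert(sketch_global_t *g, int64_t lpid, sketch_addr_t addr)
{
    if (addr == 0) {
        g->lpid0 = lpid;
    }
}

/* A zero-initialized slot is empty unless it belongs to the remembered lpid0. */
bool sketch_entry_is_empty(const sketch_global_t *g, int64_t lpid, sketch_addr_t stored)
{
    return stored == 0 && lpid != g->lpid0;
}
```

This keeps plain zero-initialization of the tables and costs one extra comparison only when a slot holds 0.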
hzhou force-pushed the 2412_dynamic_am branch from 303ae94 to 54d2544 (December 26, 2024 14:23)
test:mpich/ch3/most All ✔️
yfguo reviewed on Jan 23, 2025
src/mpi/group/grouputil.c
Outdated
@@ -81,7 +80,6 @@ int MPIR_Group_init(void)
pmap->use_map = false;
pmap->u.stride.offset = MPIR_Process.rank;
pmap->u.stride.stride = 1;
pmap->u.stride.blocksize = 1;
Yes, strided block is pretty rare. Only HACC and SNAP have them (through cartesian communicators).
Pull Request Description
Dependency PR: #7235 , #7237, #7242
Now that both ch3 and ch4 can directly use comm->local_group and comm->remote_group (intercomm) to set up communicators and look up av addresses, we can remove the redundant code:
- ch4_spawn: use a temporary dynamic av to exchange info between group leaders and establish the intercomm
- MPIDI_rank_map_t
[skip warnings]
Author Checklist
Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
Commits are self-contained and do not do two things at once.
Commit message is of the form:
module: short description
Commit message explains what's in the commit.
Whitespace checker. Warnings test. Additional tests via comments.
For non-Argonne authors, check contribution agreement.
If necessary, request an explicit comment from your companies PR approval manager.