comm: re-implement dynamic processes using mpir-layer lpid #7240

Open
wants to merge 59 commits into
base: main

Conversation

hzhou
Contributor

@hzhou hzhou commented Dec 18, 2024

Pull Request Description

Dependency PR: #7235 , #7237, #7242

Now that both ch3 and ch4 can directly use comm->local_group and comm->remote_group (for intercomms) to set up communicators and look up av addresses, we can remove the redundant code:

  • revamp ch4_spawn to use a temporary dynamic av to exchange info between group leaders and establish intercomm
  • remove ch4 MPIDI_rank_map_t
  • remove MPIR comm mapper

[skip warnings]

Author Checklist

  • Provide Description
    Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
  • Commits Follow Good Practice
    Commits are self-contained and do not do two things at once.
    Commit message is of the form: module: short description
    Commit message explains what's in the commit.
  • Passes All Tests
    Whitespace checker. Warnings test. Additional tests via comments.
  • Contribution Agreement
    For non-Argonne authors, check contribution agreement.
    If necessary, request an explicit comment from your company's PR approval manager.

hzhou added 29 commits December 22, 2024 15:43
Miscellaneous typo fixes to appease the spellchecker.
This test requires access to MPICH internals and thus won't be used with
the current design.
We no longer use this file.
Hide the internal fields of MPIR_Group from unnecessary access.

Outside group_util.c and group_impl.c, code only needs to assume the MPIR_Lpid
integer type, creation routines based on an lpid map or lpid stride
description, and an access routine to look up the lpid from a group rank.
For most external usages, we only need MPIR_Group_rank_to_lpid.
Avoid accessing group internal fields.
Group similar functions together to facilitate refactoring.
There are no changes in this commit other than moving functions around.

The 4 incl/excl functions are very similar.

The 3 difference/intersection/union functions are very similar.
Use MPIR_Group_{rank_to_lpid,lpid_to_rank} to avoid directly accessing
MPIR_Group internal fields.

For most group creation routines, just populate an lpid lookup map and
call MPIR_Group_create_map to create the group.
* add option to use stride to describe group composition
* remove the linked list design
This is the same as MPID_Comm_get_lpid.

NOTE: we will remove MPID_Comm_get_lpid as well once we move the
ownership of lpid to the MPIR-layer.
There is no real difference between lpid and gpid. Thus rename gpid in
the device layer to lpid for clarification.

Replace the usage of uint64_t as the type of lpid with MPIR_Lpid. This
improves consistency.
We need a device-independent way of identifying processes. One way is to
use the combination of (world_idx, world_rank). Thus, we need to maintain a
list of worlds so that the world_idx points to the world record.

This may not fit the concept of an MPI group, but since a group needs a
way to identify processes, it seems the most closely related place.

The first world, world_idx 0, is always initialized at init.

Due to session re-init, we need to make sure to reset num_worlds to 0 at
finalize.

New worlds will be added upon spawning or connecting dynamic processes
(to-be-implemented).
We need to reset num_worlds so that Session re-init will work.
Add builtin MPIR_GROUP_WORLD and MPIR_GROUP_SELF, so we can create
builtin communicators from builtin groups.
Internally the only reason to duplicate a group is to copy from NULL session
to a new session.  Otherwise, we can just use the same group and increment the
reference count.
Since builtin groups can be returned to users, users should be allowed
to free them. They are reference counted anyway.
To make MPI group a first-class citizen, we will always have a group
before creating communicators, so that when the device layer activates
communicators, e.g. in MPID_Comm_commit_pre_hook, it can rely on the
group to look up the involved processes. It also removes the need
to maintain any other process addressing schemes.
In many places we just return MPIR_Group_empty without incrementing the
ref_count. This is fixable. But for now, let's avoid freeing it.
The init_comm does the release manually.
Add assertions to make sure the local_group and remote_group (for
inter communicators) are always set before MPID_Comm_commit_pre_hook.
Otherwise, the MPI_T functions may not be able to convert
builtin datatypes.
When we run tests as functions, stray output in MPI_Finalize, such
as the debug messages in debug builds, was not captured previously. This
patch makes sure we report such stray output as failures.
Now that we always have group inside a communicator, we can simply
return the lpid from the group.

Because this will be used in the hot path, make it inline.
Add the following macros:
    MPIR_LPID_WORLD_INDEX
    MPIR_LPID_WORLD_RANK
    MPIR_LPID_FROM
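
For reviewers, here is a minimal sketch of how an lpid could pack (world_idx, world_rank) into one signed 64-bit integer. The 32/32 bit split and all `sketch_`/`SKETCH_` names are assumptions for illustration; the actual layout behind MPIR_LPID_WORLD_INDEX, MPIR_LPID_WORLD_RANK and MPIR_LPID_FROM may differ.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: upper 32 bits hold world_idx, lower 32 bits hold
 * world_rank. This split is an assumption, not the MPICH definition. */
typedef int64_t sketch_lpid_t;

#define SKETCH_LPID_WORLD_SHIFT 32
#define SKETCH_LPID_RANK_MASK   ((INT64_C(1) << SKETCH_LPID_WORLD_SHIFT) - 1)

/* Compose an lpid from its two parts (both must be non-negative). */
#define SKETCH_LPID_FROM(world_idx, world_rank) \
    (((sketch_lpid_t) (world_idx) << SKETCH_LPID_WORLD_SHIFT) | (sketch_lpid_t) (world_rank))

/* Extract the two parts again. */
#define SKETCH_LPID_WORLD_INDEX(lpid) ((int) ((lpid) >> SKETCH_LPID_WORLD_SHIFT))
#define SKETCH_LPID_WORLD_RANK(lpid)  ((int) ((lpid) & SKETCH_LPID_RANK_MASK))

/* Checked wrapper around the composing macro. */
static inline sketch_lpid_t sketch_lpid_make(int world_idx, int world_rank)
{
    assert(world_idx >= 0 && world_rank >= 0);
    return SKETCH_LPID_FROM(world_idx, world_rank);
}
```

With this layout, processes in world 0 have an lpid equal to their world rank, which keeps the common single-world case trivial.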
Fix a typo in setting the size of MPIR_GROUP_SELF.

Add ref_count if we return MPIR_GROUP_EMPTY to prevent freeing the
builtin when it is released internally. Unfortunately, since users can
directly use MPI_GROUP_EMPTY, we can't keep ref_count accurate. But at
least we can keep it positive to prevent an actual free.
The builtin groups are in session NULL. We need to duplicate the groups in
MPIR_Group_from_session_pset_impl to return a group in the correct
session.
Groups are a natural place to host the vcrt (virtual connection reference
table). When communicators are duplicated, groups are simply inherited
and reference counted. Thus we won't end up with duplication of vcrt.
hzhou added 25 commits December 26, 2024 08:21
Add a macro that tracks local memory allocation from other routines.
Because we need to access MPIR_Lpid definitions in mpidpre headers, we need
to move the worlds and lpid definitions to device-independent headers.

Add macro MPIR_LPID_INVALID.

Make MPIR_Lpid signed. Since we are going to perform arithmetic on
MPIR_Lpid, e.g. in using strided pmap, make MPIR_Lpid int64_t instead of
uint64_t to avoid accidental conversion errors.
* Add check_map_is_strided to detect strided pattern and convert a map into a
strided pmap.

* In MPIR_Group_check_subset, use MPIR_Group_lpid_to_rank rather than a
manual linear search.

* Move internal static routines to the bottom of grouputil.c.
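
The strided-pattern detection mentioned in the first bullet above can be sketched roughly as follows. This is not the code of check_map_is_strided; the function name, signature, and the choice to accept only positive strides are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef int64_t sketch_lpid_t;  /* signed, so differences are safe to compute */

/* Returns true when map[i] == offset + i * stride for every i, and writes
 * the detected offset and stride. Only positive strides are accepted in
 * this sketch (an assumption). A single-entry map counts as stride 1. */
static bool sketch_map_is_strided(const sketch_lpid_t *map, int size,
                                  sketch_lpid_t *offset_out,
                                  sketch_lpid_t *stride_out)
{
    assert(map != NULL && size > 0);

    sketch_lpid_t stride = (size > 1) ? map[1] - map[0] : 1;
    if (stride <= 0) {
        return false;
    }
    /* every consecutive pair must differ by the same stride */
    for (int i = 2; i < size; i++) {
        if (map[i] - map[i - 1] != stride) {
            return false;
        }
    }
    *offset_out = map[0];
    *stride_out = stride;
    return true;
}
```

A caller that gets `true` can drop the explicit map and keep only the (offset, stride) pair.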
A strided group with nontrivial blocksize is rare. By removing the
blocksize parameter (i.e. blocksize is always 1), we greatly simplify
the code and also improve the performance of lpid lookup in a more
common strided group (such as a typical comm_world group or node group).
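
To illustrate why dropping blocksize simplifies lookup, here is a hedged sketch of both directions of the conversion for a strided group with blocksize fixed at 1. The struct and function names are hypothetical and a positive stride is assumed; the real MPIR_Group layout is not shown in this PR text.

```c
#include <assert.h>
#include <stdint.h>

typedef int64_t sketch_lpid_t;

/* Hypothetical strided description: rank r maps to offset + r * stride. */
typedef struct {
    sketch_lpid_t offset;
    sketch_lpid_t stride;   /* assumed > 0 in this sketch */
    int size;
} sketch_stride_group_t;

/* rank -> lpid is one multiply-add. */
static sketch_lpid_t sketch_rank_to_lpid(const sketch_stride_group_t *g, int rank)
{
    assert(rank >= 0 && rank < g->size);
    return g->offset + (sketch_lpid_t) rank * g->stride;
}

/* lpid -> rank is one subtraction, one modulo, one division.
 * Returns -1 when the lpid is not a member of the group. */
static int sketch_lpid_to_rank(const sketch_stride_group_t *g, sketch_lpid_t lpid)
{
    assert(g->stride > 0);
    sketch_lpid_t diff = lpid - g->offset;
    if (diff < 0 || diff % g->stride != 0) {
        return -1;
    }
    sketch_lpid_t rank = diff / g->stride;
    return (rank < g->size) ? (int) rank : -1;
}
```

The signed lpid type matters here: `lpid - g->offset` can be negative for non-members, which an unsigned type would silently wrap.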
The pmap is always used inside MPIR_Group, and its size is always the
same as group->size. Having a duplicated field creates more opportunities
for bugs from inconsistency.
Replace MPIR_Assert with a better error message.
We'll create av tables in ch4 according to world_idx and world_rank.
MPIDIU_lpid_to_av can look up the av entry from an lpid in the
communication path. MPIDIU_lpid_to_av_slow, used in communicator
creation paths, will check and allocate the corresponding av table as
needed.
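
The fast/slow split described above might look roughly like the sketch below. The table layout, the fixed world limit, the 32/32 lpid split, and every `sketch_` name are assumptions for illustration only; they are not the ch4 data structures.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_MAX_WORLDS 8           /* hypothetical limit */

typedef int64_t sketch_lpid_t;        /* assumed: world_idx << 32 | world_rank */
typedef struct { uint64_t addr; } sketch_av_t;
typedef struct { int size; sketch_av_t *entries; } sketch_av_table_t;

static sketch_av_table_t *sketch_tables[SKETCH_MAX_WORLDS];
static int sketch_world_sizes[SKETCH_MAX_WORLDS];   /* filled when a world is added */

/* Fast path for communication: assumes the table already exists. */
static sketch_av_t *sketch_lpid_to_av(sketch_lpid_t lpid)
{
    int world_idx = (int) (lpid >> 32);
    int world_rank = (int) (lpid & INT64_C(0xffffffff));
    return &sketch_tables[world_idx]->entries[world_rank];
}

/* Slow path for communicator creation: allocate the table on first use. */
static sketch_av_t *sketch_lpid_to_av_slow(sketch_lpid_t lpid)
{
    int world_idx = (int) (lpid >> 32);
    assert(world_idx >= 0 && world_idx < SKETCH_MAX_WORLDS);
    if (sketch_tables[world_idx] == NULL) {
        int size = sketch_world_sizes[world_idx];
        assert(size > 0);
        sketch_av_table_t *t = malloc(sizeof(*t));
        assert(t != NULL);
        t->size = size;
        t->entries = calloc((size_t) size, sizeof(sketch_av_t));  /* zero-initialized */
        assert(t->entries != NULL);
        sketch_tables[world_idx] = t;
    }
    assert((lpid & INT64_C(0xffffffff)) < sketch_tables[world_idx]->size);
    return sketch_lpid_to_av(lpid);
}

/* Free every table at finalize; no per-table reference counting. */
static void sketch_avt_finalize(void)
{
    for (int i = 0; i < SKETCH_MAX_WORLDS; i++) {
        if (sketch_tables[i] != NULL) {
            free(sketch_tables[i]->entries);
            free(sketch_tables[i]);
            sketch_tables[i] = NULL;
        }
    }
}
```

Keeping the allocation check out of the fast path is the point of the split: the communication path does two shifts and two loads, nothing else.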
A dynamic av will be used to support MPID_Comm_connect/accept when we need
to create the leader av before we know the correct lpid entries. Dynamic avs
are expected to be freed at the end of inter communicator creation.
Add -

* MPIDI_NM_insert_upid - insert an av entry so the lpid is ready for
communication. The lpid can be allocated from a dynamic av table, thus
supports temporary communications between intercomm leaders. When later
the upid is inserted again into the regular av tables, the dynamic
entries are checked and copied over if they already exist.

* MPIDI_NM_dynamic_sendrecv - used by local group leaders to exchange
data over dynamic_av. The dynamic handshakes are susceptible to
concurrent interference. Thus the upper layer is assumed to hold the
vci-0 critical section.
We can easily exchange the context_id along with the rest of the remote
info rather than do it in a separate step.

We can determine is_low_group by comparing world namespace and
world_rank entirely in the MPIR layer, so it is no longer needed in
MPID_Intercomm_exchange.

Rename MPID_Intercomm_exchange_map to MPID_Intercomm_exchange to better
reflect that it is not just exchanging maps.
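
As a rough illustration of the is_low_group determination described in this commit, the sketch below orders the two leaders first by world namespace string and then by world rank. The function name, its parameters, and the exact tie-breaking order are assumptions; the commit message only says that namespace and world_rank are compared.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: both sides evaluate the same comparison with the
 * arguments swapped, so exactly one side ends up as the "low" group. */
static bool sketch_is_low_group(const char *local_ns, int local_rank,
                                const char *remote_ns, int remote_rank)
{
    int cmp = strcmp(local_ns, remote_ns);
    if (cmp != 0) {
        return cmp < 0;
    }
    /* same world: the two leaders must be distinct processes */
    assert(local_rank != remote_rank);
    return local_rank < remote_rank;
}
```

Because the inputs are available in the MPIR layer, no device-level exchange is needed to agree on the answer.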
This is fully replaced with MPIR_comm_rank_to_lpid or
MPIR_Group_rank_to_lpid.
Refactor MPID_Intercomm_exchange to maximize common parts for
MPI_Intercomm_create, MPI_Comm_connect/accept, and
MPI_Intercomm_create_from_group. They differ in the first step
in how to establish a leader-to-leader communication. In ch4,
this is to establish an av for remote leader. Once the av is
established, the intercomm exchange parts are common.

We no longer generate lpid from ch4-layer. Rather, we exchange world
information and convert lpids by swapping world_idx. The lpids will be
used directly as index to ch4 av tables and upids (address names) are
inserted into the av table entries.
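
The "convert lpids by swapping world_idx" step can be sketched as below. It assumes a 32/32 bit lpid layout and a translation array built from the exchanged world information, mapping each remote world index to the index of the same world in the local world list. All names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef int64_t sketch_lpid_t;   /* assumed: world_idx << 32 | world_rank */

/* world_idx_map[i] is the local index of the world that the remote side
 * calls world i. The world_rank part is unchanged by the translation. */
static sketch_lpid_t sketch_translate_lpid(sketch_lpid_t remote_lpid,
                                           const int *world_idx_map,
                                           int num_remote_worlds)
{
    int remote_idx = (int) (remote_lpid >> 32);
    assert(remote_idx >= 0 && remote_idx < num_remote_worlds);
    sketch_lpid_t world_rank = remote_lpid & INT64_C(0xffffffff);
    return ((sketch_lpid_t) world_idx_map[remote_idx] << 32) | world_rank;
}
```

For two separately launched jobs, each side sees itself as world 0 and the peer as world 1, so the map is simply {1, 0}.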
In MPID_Comm_connect/accept, simply establish remote_lpid and call
MPIR_Intercomm_create_impl.
The local_group and remote_group fully capture the mapper functions.
We have switched to using MPIR_Lpid for addressing in the ch4 av table manager.
Both map and local_map in ch4 MPIDI_Devcomm_t are no longer needed.
Rename it to MPIDIU_get_grank, remove the dependency on
MPIDIU_comm_rank_to_lpid (to be removed next) and use
MPIR_comm_rank_to_lpid instead.
This is fully replaced by MPIR_comm_rank_to_lpid.
Track MPIR_Lpid lpid rather than a pair of (avtid, lpid).
Now that we use MPIR_Lpid, we no longer need a netmod API to convert upids to
lpids. The function is replaced by the netmod API insert_upid.
We no longer expose avtid. Replace MPIDIU_get_av with MPIDIU_lpid_to_av.

Also remove unused GPID macros.
When ch4-layer allocates an av table, all entries are initialized to 0.
However, 0 can be a valid entry for fi_addr_t. We could initialize all
entries to FI_ADDR_NOTAVAIL, but that requires the additional complexity
of a netmod API. Instead, because the entry 0 is always the first entry
to be inserted by fi_av_insert, we can simply remember the entry
(MPIDI_OFI_global.lpid0) and be able to tell which entries are empty (in
MPIDI_OFI_insert_upid).
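
A minimal sketch of the lpid0 idea: an av entry with address value 0 is treated as empty unless its lpid is the one remembered as the legitimate owner of address 0. The types and function names are stand-ins (`sketch_addr_t` plays the role of fi_addr_t), and using -1 as the "not yet recorded" value is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t sketch_addr_t;   /* stand-in for fi_addr_t */
typedef int64_t sketch_lpid_t;

typedef struct {
    sketch_lpid_t lpid0;   /* lpid whose address is legitimately 0, or -1 if none yet */
} sketch_global_t;

/* Zero-initialized entries are empty, except the one recorded as lpid0. */
static bool sketch_entry_is_empty(const sketch_global_t *g,
                                  sketch_lpid_t lpid, sketch_addr_t addr)
{
    return addr == 0 && lpid != g->lpid0;
}

/* Called after an address is inserted: remember who received address 0. */
static void sketch_record_insert(sketch_global_t *g,
                                 sketch_lpid_t lpid, sketch_addr_t addr)
{
    if (addr == 0) {
        assert(g->lpid0 == -1 || g->lpid0 == lpid);
        g->lpid0 = lpid;
    }
}
```

This avoids a separate sentinel-fill pass over every new table at the cost of one extra comparison on the insert path.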
We don't really track av tables' ref_count. We simply free all av
tables at finalize.

Rename MPIDIU_avt_destroy to MPIDIU_avt_finalize to better reflect its
role.
@hzhou
Contributor Author

hzhou commented Dec 26, 2024

test:mpich/ch3/most
test:mpich/ch4/most

All ✔️

@hzhou hzhou requested a review from yfguo January 2, 2025 20:01
@yfguo yfguo mentioned this pull request Jan 15, 2025
4 tasks
@@ -81,7 +80,6 @@ int MPIR_Group_init(void)
pmap->use_map = false;
pmap->u.stride.offset = MPIR_Process.rank;
pmap->u.stride.stride = 1;
pmap->u.stride.blocksize = 1;
Contributor

Yes, strided block is pretty rare. Only HACC and SNAP have them (through cartesian communicators).

2 participants