rdma: extended range of communicator IDs
Increase the number of bits used for the communicator ID from 12 to 18 to reduce
the chance of running out of comm IDs.

To fit the larger comm ID in the immediate data, the message sequence number
is reduced from 16 bits to 10. That is still more than enough, since the number
of in-flight messages is usually well below the 1024 values that 10 bits provide.
The msg_seq_num space can be reduced further if needed.

Signed-off-by: Amedeo Sapio <[email protected]>
AmedeoSapio authored and bwbarrett committed Mar 31, 2024
1 parent e8e99b6 commit 2fbfecf
Showing 1 changed file with 3 additions and 3 deletions.
6 changes: 3 additions & 3 deletions src/nccl_ofi_rdma.c
@@ -37,7 +37,7 @@ static pthread_mutex_t topo_file_lock = PTHREAD_MUTEX_INITIALIZER;
/*
* @brief Number of bits used for the communicator ID
*/
-#define NUM_COMM_ID_BITS ((uint64_t)12)
+#define NUM_COMM_ID_BITS ((uint64_t)18)

/* Maximum number of comms open simultaneously. Eventually this will be
runtime-expandable */
@@ -51,13 +51,13 @@ static pthread_mutex_t topo_file_lock = PTHREAD_MUTEX_INITIALIZER;
* communicator ID, and the message sequence number (msg_seq_num).
* The data is encoded as follows:
*
- * | 4-bit segment count | 12-bit comm ID | 16-bit msg_seq_num |
+ * | 4-bit segment count | 18-bit comm ID | 10-bit msg_seq_num |
*
* - Segment count: number of RDMA writes that will be delivered as part of this message
* - Comm ID: the ID for this communicator
* - Message sequence number: message identifier
*/
-#define NUM_MSG_SEQ_NUM_BITS ((uint64_t) 16)
+#define NUM_MSG_SEQ_NUM_BITS ((uint64_t)10)

/*
* @brief Number of bits used for number of segments value
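
For readers who want to see how the new field widths fit into 32 bits of immediate data, here is a minimal sketch of packing and unpacking the three fields. It is not the code from src/nccl_ofi_rdma.c: the `sketch_*` helpers, `SKETCH_SEG_COUNT_BITS`, and `SKETCH_MASK` are hypothetical names introduced for illustration, and the sketch assumes the layout in the comment reads from most significant bits (segment count) to least significant bits (msg_seq_num). Only `NUM_COMM_ID_BITS` and `NUM_MSG_SEQ_NUM_BITS` and their values come from the diff.

```c
#include <assert.h>
#include <stdint.h>

/* Widths from the diff (new values). */
#define NUM_COMM_ID_BITS      ((uint64_t)18)
#define NUM_MSG_SEQ_NUM_BITS  ((uint64_t)10)
/* Hypothetical name; the 4-bit width comes from the layout comment. */
#define SKETCH_SEG_COUNT_BITS ((uint64_t)4)

/* Hypothetical helper: mask with the low `bits` bits set. */
#define SKETCH_MASK(bits) ((uint32_t)((((uint64_t)1) << (bits)) - 1))

/*
 * Pack the fields, assuming this bit order (MSB to LSB):
 * | 4-bit segment count | 18-bit comm ID | 10-bit msg_seq_num |
 */
static inline uint32_t sketch_pack_imm(uint32_t seg_count, uint32_t comm_id,
                                       uint32_t msg_seq_num)
{
    /* Each value must fit in its field, otherwise it would corrupt a neighbor. */
    assert(seg_count <= SKETCH_MASK(SKETCH_SEG_COUNT_BITS));
    assert(comm_id <= SKETCH_MASK(NUM_COMM_ID_BITS));
    assert(msg_seq_num <= SKETCH_MASK(NUM_MSG_SEQ_NUM_BITS));

    return (seg_count << (NUM_COMM_ID_BITS + NUM_MSG_SEQ_NUM_BITS)) |
           (comm_id << NUM_MSG_SEQ_NUM_BITS) |
           msg_seq_num;
}

/* Extract the message sequence number from the low 10 bits. */
static inline uint32_t sketch_get_msg_seq_num(uint32_t imm)
{
    return imm & SKETCH_MASK(NUM_MSG_SEQ_NUM_BITS);
}

/* Extract the communicator ID from the middle 18 bits. */
static inline uint32_t sketch_get_comm_id(uint32_t imm)
{
    return (imm >> NUM_MSG_SEQ_NUM_BITS) & SKETCH_MASK(NUM_COMM_ID_BITS);
}

/* Extract the segment count from the top 4 bits. */
static inline uint32_t sketch_get_seg_count(uint32_t imm)
{
    return (imm >> (NUM_COMM_ID_BITS + NUM_MSG_SEQ_NUM_BITS)) &
           SKETCH_MASK(SKETCH_SEG_COUNT_BITS);
}
```

Under these assumptions the three widths still sum to 32 bits (4 + 18 + 10), so the change trades sequence-number range (65,536 down to 1,024 values) for communicator ID range (4,096 up to 262,144 values) without growing the immediate data.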
