prov/efa: Sender switch to long CTS protocol if runt read fails with ENOMR

The runting read protocol can fail with ENOMR if the EFA provider is unable
to register the buffer with the NIC. In that case, we should fall back
to the long CTS protocol instead.

This commit covers the case where the sender fails to register the
source buffer. The sender then switches to the long CTS protocol.

Signed-off-by: Sai Sunku <[email protected]>
sunkuamzn authored and shijin-aws committed Nov 7, 2023
1 parent fbdaff1 commit 1f33488
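
The fallback described in the commit message can be sketched in isolation as a mapping from the failed packet type to the one to retry with. This is a minimal sketch, not the provider's code: the `sketch_` enum, the `SKETCH_ENOMR` value, and the function name are hypothetical stand-ins for `EFA_RDM_*_PKT`, `FI_ENOMR`, and `efa_rdm_ope_post_send_fallback`, and the real function posts or queues the packet rather than returning a type.

```c
/* Hypothetical stand-ins for the EFA provider's packet type constants. */
enum sketch_pkt_type {
	SKETCH_LONGCTS_MSGRTM,
	SKETCH_LONGCTS_TAGRTM,
	SKETCH_LONGREAD_MSGRTM,
	SKETCH_LONGREAD_TAGRTM,
	SKETCH_RUNTREAD_MSGRTM,
	SKETCH_RUNTREAD_TAGRTM,
};

/* Placeholder error code; this is NOT the value of libfabric's FI_ENOMR. */
#define SKETCH_ENOMR 1000

/*
 * Return the packet type to retry with after a failed send,
 * or -1 when no fallback applies (assumed convention for this sketch).
 */
static int sketch_fallback_pkt_type(int pkt_type, int err)
{
	/* Only a memory registration failure triggers the fallback. */
	if (err != -SKETCH_ENOMR)
		return -1;

	switch (pkt_type) {
	case SKETCH_LONGREAD_MSGRTM:
	case SKETCH_RUNTREAD_MSGRTM:
		/* Untagged read-based protocols fall back to untagged long CTS. */
		return SKETCH_LONGCTS_MSGRTM;
	case SKETCH_LONGREAD_TAGRTM:
	case SKETCH_RUNTREAD_TAGRTM:
		/* Tagged read-based protocols fall back to tagged long CTS. */
		return SKETCH_LONGCTS_TAGRTM;
	default:
		return -1;
	}
}
```

The point of the sketch is that the runting read cases share the existing long read branches, so the tagged or untagged nature of the message is preserved across the fallback.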
Showing 2 changed files with 14 additions and 2 deletions.
6 changes: 6 additions & 0 deletions prov/efa/src/rdm/efa_rdm_ope.c
@@ -1775,15 +1775,21 @@ ssize_t efa_rdm_ope_post_send_fallback(struct efa_rdm_ope *ope,
 					   int pkt_type, ssize_t err)
 {
 	if (err == -FI_ENOMR) {
+		/* Long read and runting read protocols could fail because of a
+		 * lack of memory registrations. In that case, we retry with
+		 * long CTS protocol
+		 */
 		switch (pkt_type) {
 		case EFA_RDM_LONGREAD_MSGRTM_PKT:
+		case EFA_RDM_RUNTREAD_MSGRTM_PKT:
 			EFA_WARN(FI_LOG_EP_CTRL,
 				 "Sender fallback to long CTS untagged "
 				 "protocol because memory registration limit "
 				 "was reached on the sender\n");
 			return efa_rdm_ope_post_send_or_queue(
 				ope, EFA_RDM_LONGCTS_MSGRTM_PKT);
 		case EFA_RDM_LONGREAD_TAGRTM_PKT:
+		case EFA_RDM_RUNTREAD_TAGRTM_PKT:
 			EFA_WARN(FI_LOG_EP_CTRL,
 				 "Sender fallback to long CTS tagged protocol "
 				 "because memory registration limit was "
10 changes: 8 additions & 2 deletions prov/efa/src/rdm/efa_rdm_pke_cmd.c
@@ -131,11 +131,17 @@ int efa_rdm_pke_fill_data(struct efa_rdm_pke *pkt_entry,
 		ret = efa_rdm_pke_init_medium_tagrtm(pkt_entry, ope, data_offset, data_size);
 		break;
 	case EFA_RDM_LONGCTS_MSGRTM_PKT:
-		assert(data_offset == 0 && data_size == -1);
+		/* The data_offset will be non-zero when the long CTS RTM packet
+		 * is sent to continue a runting read transfer after the
+		 * receiver has run out of memory registrations */
+		assert((data_offset == 0 || ope->internal_flags & EFA_RDM_OPE_READ_NACK) && data_size == -1);
 		ret = efa_rdm_pke_init_longcts_msgrtm(pkt_entry, ope);
 		break;
 	case EFA_RDM_LONGCTS_TAGRTM_PKT:
-		assert(data_offset == 0 && data_size == -1);
+		/* The data_offset will be non-zero when the long CTS RTM packet
+		 * is sent to continue a runting read transfer after the
+		 * receiver has run out of memory registrations */
+		assert((data_offset == 0 || ope->internal_flags & EFA_RDM_OPE_READ_NACK) && data_size == -1);
 		ret = efa_rdm_pke_init_longcts_tagrtm(pkt_entry, ope);
 		break;
 	case EFA_RDM_LONGREAD_MSGRTM_PKT:
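
The relaxed assertion in the hunk above can be read as a small predicate: a long CTS RTM packet must still have `data_size == -1`, but a non-zero `data_offset` is accepted once the operation carries the read NACK flag. The following is a minimal sketch under stated assumptions: the function name, the flag bit value, and the argument types are hypothetical, and only the boolean structure is taken from the diff.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit; stands in for EFA_RDM_OPE_READ_NACK, whose value is not shown in the diff. */
#define SKETCH_OPE_READ_NACK (1u << 0)

/*
 * Mirror of the new assert condition for long CTS RTM packets.
 * Argument types are assumptions made for this sketch.
 */
static bool sketch_longcts_rtm_args_ok(size_t data_offset, int64_t data_size,
				       uint32_t internal_flags)
{
	/* A non-zero offset is only valid when continuing after a read NACK. */
	bool offset_ok = (data_offset == 0) ||
			 ((internal_flags & SKETCH_OPE_READ_NACK) != 0);

	/* The data size must be -1 in both the old and the new assertion. */
	return offset_ok && data_size == -1;
}
```

Writing the flag test with explicit parentheses and `!= 0` is a readability choice for the sketch; the expression in the diff relies on `&` binding tighter than `||` and evaluates the same way.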