Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

prov/efa: Map EFA errnos to Libfabric codes #9974

Merged
merged 2 commits into from
Apr 17, 2024

Conversation

darrylabbate
Copy link
Member

This adds a rudimentary function to map proprietary EFA status codes to common Libfabric status codes. This is useful when reporting errors to the application for operations that rely solely on ibverbs or RDMA Core, such as CQ polling.

@darrylabbate darrylabbate requested a review from a team April 3, 2024 00:29
prov/efa/src/efa_errno.h Outdated Show resolved Hide resolved
@darrylabbate darrylabbate requested a review from rajachan April 3, 2024 00:34
case EFA_IO_COMP_STATUS_LOCAL_ERROR_INVALID_LKEY:
case EFA_IO_COMP_STATUS_LOCAL_ERROR_UNRESP_REMOTE:
case FI_EFA_ERR_ESTABLISHED_RECV_UNRESP:
return FI_EOPBADSTATE;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

*-FI_EOPBADSTATE*
: The endpoint's state does not permit the requested operation.

I think it doesn't apply to the error here

case EFA_IO_COMP_STATUS_LOCAL_ERROR_INVALID_AH:
case EFA_IO_COMP_STATUS_LOCAL_ERROR_INVALID_LKEY:
case EFA_IO_COMP_STATUS_LOCAL_ERROR_UNRESP_REMOTE:
case FI_EFA_ERR_ESTABLISHED_RECV_UNRESP:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FI_EFA_ERR_ESTABLISHED_RECV_UNRESP should be ECONNABOUTED

case EFA_IO_COMP_STATUS_OK:
return FI_SUCCESS;
case EFA_IO_COMP_STATUS_FLUSHED:
case EFA_IO_COMP_STATUS_LOCAL_ERROR_QP_INTERNAL_ERROR:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The local invalid error seems more close to FI_EINVAL

case EFA_IO_COMP_STATUS_LOCAL_ERROR_QP_INTERNAL_ERROR:
case EFA_IO_COMP_STATUS_LOCAL_ERROR_INVALID_AH:
case EFA_IO_COMP_STATUS_LOCAL_ERROR_INVALID_LKEY:
case EFA_IO_COMP_STATUS_LOCAL_ERROR_UNRESP_REMOTE:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This unresponsive remote (no handshake made) should be EHOSTUNREACH

case FI_EFA_ERR_ESTABLISHED_RECV_UNRESP:
return FI_EOPBADSTATE;
case EFA_IO_COMP_STATUS_LOCAL_ERROR_INVALID_OP_TYPE:
return FI_EOPNOTSUPP;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be EINVAL as well

switch (err) {
case EFA_IO_COMP_STATUS_OK:
return FI_SUCCESS;
case EFA_IO_COMP_STATUS_FLUSHED:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The flushed error should be EHOSTDOWN IMO because it usually means the remote destroyed the QP because of crash

case EFA_IO_COMP_STATUS_LOCAL_ERROR_BAD_LENGTH:
case EFA_IO_COMP_STATUS_REMOTE_ERROR_BAD_LENGTH:
return FI_EMSGSIZE;
case EFA_IO_COMP_STATUS_REMOTE_ERROR_BAD_ADDRESS:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to the newest interpretation of this error code, this means invalid MR address, I think we should map it to FI_EINVAL

@darrylabbate darrylabbate force-pushed the efa/cq-err-entry branch 2 times, most recently from b33414b to 25f61a4 Compare April 3, 2024 21:53
@shijin-aws
Copy link
Contributor

It seems windows build failed because the errno is not available

@darrylabbate
Copy link
Member Author

Need to re-run against new CI since old CI logs are gone

@darrylabbate
Copy link
Member Author

bot:aws:new:retest

@darrylabbate
Copy link
Member Author

bot:aws:retest

1 similar comment
@a-szegel
Copy link
Contributor

a-szegel commented Apr 6, 2024

bot:aws:retest

@darrylabbate
Copy link
Member Author

Windows error from AWS CI:

         C:\9974_1712364969\libfabric\prov\efa\src\efa_errno.h(170,22): error C2065: 'EREMOTEIO': undeclared identifier [C:\9974_1712364969\libfabric\libfabric.vcxproj]
         C:\9974_1712364969\libfabric\prov\efa\src\efa_errno.h(150,22): error C2065: 'EHOSTDOWN': undeclared identifier [C:\9974_1712364969\libfabric\libfabric.vcxproj]

@darrylabbate
Copy link
Member Author

darrylabbate commented Apr 15, 2024

The issue here is that EREMOTEIO and EHOSTDOWN (among others) are implementation-defined; the fi_errno codes aren't fully portable. This has gone unnoticed since many of the fi_errno codes are completely unused and/or many providers only really care about Linux.

This adds a rudimentary function to map proprietary EFA status codes to
common Libfabric status codes. This is useful when reporting errors to
the application for operations that rely solely on ibverbs or RDMA Core,
such as CQ polling.

Signed-off-by: Darryl Abbate <[email protected]>
@shijin-aws
Copy link
Contributor

@j-xiong is 6d6f4be good to you?

@shijin-aws shijin-aws merged commit acc217b into ofiwg:main Apr 17, 2024
13 checks passed
@darrylabbate darrylabbate deleted the efa/cq-err-entry branch April 17, 2024 17:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants