prov/efa: Map EFA errnos to Libfabric codes #9974
prov/efa/src/efa_errno.h
Outdated
case EFA_IO_COMP_STATUS_LOCAL_ERROR_INVALID_LKEY:
case EFA_IO_COMP_STATUS_LOCAL_ERROR_UNRESP_REMOTE:
case FI_EFA_ERR_ESTABLISHED_RECV_UNRESP:
	return FI_EOPBADSTATE;
*-FI_EOPBADSTATE*: The endpoint's state does not permit the requested operation.
I don't think that applies to the error here.
prov/efa/src/efa_errno.h
Outdated
case EFA_IO_COMP_STATUS_LOCAL_ERROR_INVALID_AH:
case EFA_IO_COMP_STATUS_LOCAL_ERROR_INVALID_LKEY:
case EFA_IO_COMP_STATUS_LOCAL_ERROR_UNRESP_REMOTE:
case FI_EFA_ERR_ESTABLISHED_RECV_UNRESP:
FI_EFA_ERR_ESTABLISHED_RECV_UNRESP should be ECONNABORTED
case EFA_IO_COMP_STATUS_OK:
	return FI_SUCCESS;
case EFA_IO_COMP_STATUS_FLUSHED:
case EFA_IO_COMP_STATUS_LOCAL_ERROR_QP_INTERNAL_ERROR:
The local invalid error seems closer to FI_EINVAL
prov/efa/src/efa_errno.h
Outdated
case EFA_IO_COMP_STATUS_LOCAL_ERROR_QP_INTERNAL_ERROR:
case EFA_IO_COMP_STATUS_LOCAL_ERROR_INVALID_AH:
case EFA_IO_COMP_STATUS_LOCAL_ERROR_INVALID_LKEY:
case EFA_IO_COMP_STATUS_LOCAL_ERROR_UNRESP_REMOTE:
This unresponsive remote (no handshake made) should be EHOSTUNREACH
prov/efa/src/efa_errno.h
Outdated
case FI_EFA_ERR_ESTABLISHED_RECV_UNRESP:
	return FI_EOPBADSTATE;
case EFA_IO_COMP_STATUS_LOCAL_ERROR_INVALID_OP_TYPE:
	return FI_EOPNOTSUPP;
This should be EINVAL as well
switch (err) {
case EFA_IO_COMP_STATUS_OK:
	return FI_SUCCESS;
case EFA_IO_COMP_STATUS_FLUSHED:
The flushed error should be EHOSTDOWN IMO, because it usually means the remote destroyed the QP after a crash
prov/efa/src/efa_errno.h
Outdated
case EFA_IO_COMP_STATUS_LOCAL_ERROR_BAD_LENGTH:
case EFA_IO_COMP_STATUS_REMOTE_ERROR_BAD_LENGTH:
	return FI_EMSGSIZE;
case EFA_IO_COMP_STATUS_REMOTE_ERROR_BAD_ADDRESS:
According to the newest interpretation of this error code, it means an invalid MR address, so I think we should map it to FI_EINVAL
b33414b to 25f61a4
It seems the Windows build failed because the errno is not available
Need to re-run against the new CI since the old CI logs are gone
bot:aws:new:retest
bot:aws:retest
25f61a4 to c86b5af
Windows error from AWS CI:
|
The issue here is that |
Signed-off-by: Darryl Abbate <[email protected]>
This adds a rudimentary function to map proprietary EFA status codes to common Libfabric status codes. This is useful when reporting errors to the application for operations that rely solely on ibverbs or RDMA Core, such as CQ polling.
Signed-off-by: Darryl Abbate <[email protected]>
c86b5af to c1cfb26
This adds a rudimentary function to map proprietary EFA status codes to common Libfabric status codes. This is useful when reporting errors to the application for operations that rely solely on ibverbs or RDMA Core, such as CQ polling.
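For illustration only, here is a minimal sketch of what such a mapping could look like if the reviewers' suggestions in this thread were applied. It is not the code that was merged. The function name `sketch_efa_status_to_errno`, the enum, and all numeric values are hypothetical stand-ins; plain `<errno.h>` constants are used in place of the `FI_E*` names on the assumption that Libfabric's codes mirror them; the `EIO` fallback for unlisted codes and the reading of "local invalid error" as the INVALID_AH and INVALID_LKEY cases are also assumptions.

```c
#include <errno.h>

/* Some platforms lack EHOSTDOWN in <errno.h> (compare the Windows build
 * failure mentioned in this thread). The value below is a placeholder
 * for this sketch only. */
#ifndef EHOSTDOWN
#define EHOSTDOWN 112
#endif

/* Stand-in status codes. The names follow the diff excerpts in this
 * thread; the numeric values are placeholders, not the real definitions. */
enum sketch_efa_status {
	EFA_IO_COMP_STATUS_OK = 0,
	EFA_IO_COMP_STATUS_FLUSHED,
	EFA_IO_COMP_STATUS_LOCAL_ERROR_INVALID_OP_TYPE,
	EFA_IO_COMP_STATUS_LOCAL_ERROR_INVALID_AH,
	EFA_IO_COMP_STATUS_LOCAL_ERROR_INVALID_LKEY,
	EFA_IO_COMP_STATUS_LOCAL_ERROR_BAD_LENGTH,
	EFA_IO_COMP_STATUS_REMOTE_ERROR_BAD_ADDRESS,
	EFA_IO_COMP_STATUS_REMOTE_ERROR_BAD_LENGTH,
	EFA_IO_COMP_STATUS_LOCAL_ERROR_UNRESP_REMOTE,
	FI_EFA_ERR_ESTABLISHED_RECV_UNRESP,
};

/* Hypothetical helper: translate a status code into a positive errno
 * value, following the mappings proposed in the review comments. */
static inline int sketch_efa_status_to_errno(int status)
{
	switch (status) {
	case EFA_IO_COMP_STATUS_OK:
		return 0;
	case EFA_IO_COMP_STATUS_FLUSHED:
		/* Reviewer: usually means the remote destroyed the QP. */
		return EHOSTDOWN;
	case EFA_IO_COMP_STATUS_LOCAL_ERROR_INVALID_OP_TYPE:
	case EFA_IO_COMP_STATUS_LOCAL_ERROR_INVALID_AH:
	case EFA_IO_COMP_STATUS_LOCAL_ERROR_INVALID_LKEY:
	case EFA_IO_COMP_STATUS_REMOTE_ERROR_BAD_ADDRESS:
		return EINVAL;
	case EFA_IO_COMP_STATUS_LOCAL_ERROR_BAD_LENGTH:
	case EFA_IO_COMP_STATUS_REMOTE_ERROR_BAD_LENGTH:
		return EMSGSIZE;
	case EFA_IO_COMP_STATUS_LOCAL_ERROR_UNRESP_REMOTE:
		/* Reviewer: no handshake was made with the remote. */
		return EHOSTUNREACH;
	case FI_EFA_ERR_ESTABLISHED_RECV_UNRESP:
		return ECONNABORTED;
	default:
		/* Assumption: report unlisted codes as a generic I/O error. */
		return EIO;
	}
}
```

Grouping the cases by target code keeps each reviewer suggestion visible as one `return`, which makes later remapping a one-line change.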