Pull main changes for CXI provider into v1.21.x #9932

Closed
wants to merge 79 commits into from
Commits
5dab4b8
prov/cxi: Set execute bits on tests
jswaro Mar 20, 2024
9eeb495
prov/cxi: Set EP protocol for endpoint object
swelch Jan 25, 2024
0992d84
prov/cxi: Refactor EP ctrl info into separate object
swelch Jan 26, 2024
aaf2f95
prov/cxi: Refactor ep_obj to txc/rxc to use pointer
swelch Jan 26, 2024
b028d9a
prov/cxi: Modify context logging to handle derived contexts
swelch Jan 31, 2024
5289bcb
prov/cxi: Remove unused ep_list from RXC/TXC
swelch Jan 31, 2024
fd662e3
prov/cxi: Refactor RXC into base and derived objects
swelch Jan 31, 2024
1d54e85
prov/cxi: Refactor TXC into base and derived objects
swelch Feb 1, 2024
a0e4867
prov/cxi: Refactor EP object allocation
swelch Feb 1, 2024
62d9dc8
prov/cxi: Simplify initialization of context objects
swelch Feb 1, 2024
7eae464
prov/cxi: Refactor RXC/TXC message init and cleanup
swelch Feb 2, 2024
9973369
prov/cxi: Add RXC/TXC cleanup operations
swelch Feb 3, 2024
c895105
prov/cxi: Allow RXC/TXC specific progress functions
swelch Feb 3, 2024
6861adb
prov/cxi: A RXC/TXC derived EP cancel support
swelch Feb 3, 2024
525dc43
prov/cxi: Enable derived object specific side-band messaging
swelch Feb 3, 2024
85754d7
prov/cxi: Refactor msg/tagged API to be context aware
swelch Feb 3, 2024
09386be
prov/cxi: Separate messaging into common and hpc files
swelch Feb 4, 2024
f16b656
prov/cxi: Add placeholder skeleton for client/server contexts
swelch Feb 4, 2024
e2977a8
prov/cxi: Add fi_info support for FI_PROTO_CXI_CS
swelch Feb 5, 2024
ce031d6
prov/cxi: Initial FI_PROTO_CXI_CS context definition
swelch Feb 6, 2024
c6e5135
prov/cxi: Implement client/server receive PTE
swelch Feb 6, 2024
c8ee538
prov/cxi: Move completion of direct put to common code
swelch Feb 6, 2024
946c38d
prov/cxi: Move setting of RX match ID to common code
swelch Feb 6, 2024
9d6087d
prov/cxi: Make removal of FI_MULTI_RECV flag tagged specific
swelch Feb 8, 2024
1e58d78
prov/cxi: Add client/server match bit definition
swelch Feb 10, 2024
b5e5804
prov/cxi: Add RXC specific updated of recv_req_tgt_event
swelch Feb 11, 2024
d0241d9
prov/cxi: Allow passing MD into receive request allocation
swelch Feb 10, 2024
259a90a
prov/cxi: Initial client/server receive logic
swelch Feb 11, 2024
1ef173d
prov/cxi: Initial client/server transmit logic
swelch Feb 12, 2024
0a03148
prov/cxi: Allow returning cancelled status for send msg
swelch Feb 12, 2024
14fa872
prov/cxi: Add testing of basic FI_PROTO_CXI_CS msg/tagged
swelch Feb 12, 2024
9ee49d5
prov/cxi: Add client/server IDC support
swelch Feb 13, 2024
af0509a
prov/cxi: Progress retries without forcing user to make progress
swelch Feb 15, 2024
3b09313
prov/cxi: Introduce hybrid MR operation for client/server
swelch Feb 15, 2024
5278a1a
prov/cxi: Add messaging hybrid descriptor test cases
swelch Feb 16, 2024
a283f02
prov/cxi: Add RNR stats for client/server protocol
swelch Feb 16, 2024
360ece5
prov/cxi: Allow maximum RNR retry to be modified
swelch Feb 16, 2024
0671a57
prov/cxi: Support no success events for CS TX
swelch Feb 17, 2024
8016b99
prov/cxi: Offload client/server receive counters
swelch Feb 18, 2024
10744d1
prov/cxi: Support no success events for CS RX
swelch Feb 18, 2024
6ba03d6
prov/cxi: Unit tests for selective completion with counters
swelch Feb 18, 2024
01e37ce
prov/cxi: Add first append unit test
swelch Feb 19, 2024
b20e2fb
prov/cxi: Add CXI external enums mapping to non-backported enums
swelch Feb 19, 2024
2f883e5
prov/cxi: Add ability to initialize FI_CNTR_EVENTS_BYTES counters
swelch Feb 19, 2024
ce35787
prov/cxi: Add client/server hardware support for byte counters
swelch Feb 19, 2024
80cc93b
prov/cxi: Add unit tests for hardware byte counting counters
swelch Feb 19, 2024
d488bff
prov/cxi: Add experimental truncation as a success completions
swelch Feb 19, 2024
a9ccdc9
prov/cxi: FI_PROTO_CXI_RNR is preferred over FI_PROTO_CXI_CS
swelch Feb 21, 2024
458e331
prov/cxi: Rename "cs" to "rnr" no functional change
swelch Feb 21, 2024
52cd505
prov/cxi: Renamed cxip_msg_cs.c -> cxip_msg_rnr.c
swelch Feb 21, 2024
f809653
prov/cxi: Move retry bit from cookie to header.
JosephNemeth Feb 27, 2024
65262e1
prov/cxi: Correct and clarify multicast pid_idx.
JosephNemeth Feb 27, 2024
706c5c2
prov/cxi: Change hwroot to LOW_LATENCY
JosephNemeth Feb 27, 2024
ac2d986
prov/cxi: Fix fabric mgr url string.
JosephNemeth Feb 27, 2024
5710811
prov/cxi: Add VNI to multicast creation command.
JosephNemeth Feb 27, 2024
6102fa5
prov/cxi: change CURL authorizaiton to bearer token
JosephNemeth Feb 27, 2024
1557d8a
prov/cxi: set CURLOPT_STDERR to stderr.
JosephNemeth Feb 27, 2024
be97f0d
prov/cxi: Add CURLOPT_SSL_VERIFYPEER handling.
JosephNemeth Feb 27, 2024
deb26cf
prov/cxi: Add is_multicast flag to cxip_coll_mc.
JosephNemeth Feb 27, 2024
abfad6d
prov/cxi: Add retry_disable test flag.
JosephNemeth Feb 27, 2024
918dca9
prov/cxi: Add COMM_KEY_NONE case to DFA generation
JosephNemeth Feb 27, 2024
6a9edd4
prov/cxi: Add TRACE_CURL() to cxip_coll.c.
JosephNemeth Feb 27, 2024
f2f636c
prov/cxi: Clean up _gen_tx_dfa() for readability.
JosephNemeth Feb 27, 2024
2293e49
prov/cxi: Parse full six-octet NIC address.
JosephNemeth Feb 27, 2024
5b2b3e3
prov/cxi: Suppress busy retry logging.
JosephNemeth Feb 27, 2024
98299dc
prov/cxi: Remove unused cxip_join_state.pid_idx.
JosephNemeth Feb 27, 2024
9e9f37f
prov/cxi: Minor changes to test_zbcoll.
JosephNemeth Feb 27, 2024
cba2059
prov/cxi: Updated multinode_frmwk.
JosephNemeth Feb 27, 2024
02ecf9d
prov/cxip: Fix cxip_trace problems.
JosephNemeth Feb 27, 2024
0b41bf4
prov/cxi: Work-in-progress, checkpoint.
JosephNemeth Feb 27, 2024
89b3096
prov/cxi: Review change
JosephNemeth Feb 28, 2024
7ef7a06
prov/cxi: Modify test scripts to use cxi-sbl
Feb 23, 2024
d790ffa
prov/cxi: Drive TRACE() by env variables only.
JosephNemeth Mar 13, 2024
4ac5ad3
prov/cxi: Register device memory with dmabuf as default
chuckfossen Oct 6, 2023
b18c2fb
prov/cxi: Add FI_CXI_DISABLE_DMABUF* env variable to disable dmabuf
chuckfossen Feb 20, 2024
497c362
prov/cxi: Fix fi_cntr_wait() to return on error count increment
swelch Mar 14, 2024
035ceee
prov/cxi: Unit tests for fi_cntr_wait() with error increment
swelch Mar 14, 2024
d7a81af
prov/cxi: Update CXI provider for 1.21
jswaro Mar 20, 2024
8e2cc9e
prov/cxi: Remove FI_CXI_COMPAT test changes
jswaro Mar 20, 2024
27 changes: 27 additions & 0 deletions man/fi_cxi.7.md
@@ -229,6 +229,18 @@ CXI integrated launcher and CXI authorization key aware libfabric user:
7. Application processes select from the list of available service IDs and VNIs
to form an authorization key to use for Endpoint allocation.

## Endpoint Protocols

The provider supports multiple endpoint protocols. The default protocol is
FI_PROTO_CXI and fully supports the messaging requirements of parallel
applications.

The FI_PROTO_CXI_RNR endpoint protocol is an optional protocol that targets
client/server environments where send-after-send ordering is not required and
messaging is generally to pre-posted buffers; FI_MULTI_RECV is recommended.
It utilizes a receiver-not-ready implementation where
*FI_CXI_RNR_MAX_TIMEOUT_US* can be tuned to control the maximum retry duration.

## Address Vectors

The CXI provider supports both *FI_AV_TABLE* and *FI_AV_MAP* with the same
@@ -433,6 +445,15 @@ faults but requires all buffers to be backed by physical memory. Copy-on-write
semantics are broken when using pinned memory. See the Fork section for more
information.

The CXI provider supports DMABUF for device memory registration. If the ROCR
and CUDA libraries support it, the CXI provider defaults to using DMA-buf.
In some situations with CUDA, this may double the BAR consumption.
Until this is fixed in the CUDA stack, the environment variable
*FI_CXI_DISABLE_DMABUF_CUDA* can be used to fall back to the nvidia
peer-memory interface.
Similarly, *FI_CXI_DISABLE_DMABUF_ROCR* can be used to fall back to the amdgpu
peer-memory interface.
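
As a rough illustration of the fallback described above, an application could set the variables itself before the provider is initialized. This is only a sketch: `disable_cxi_dmabuf` is a hypothetical helper, and both the value `"1"` and the assumption that the variables are read at provider initialization are guesses not confirmed by this diff. Only the variable names come from the text.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L /* for setenv() */
#endif
#include <stdlib.h>

/*
 * Hypothetical helper (not part of libfabric): request the peer-memory
 * fallback for CUDA and/or ROCR device memory registration.
 * Assumption: a boolean-style value of "1" enables the setting.
 * Returns 0 on success, -1 if setenv() fails.
 */
int disable_cxi_dmabuf(int cuda, int rocr)
{
    if (cuda && setenv("FI_CXI_DISABLE_DMABUF_CUDA", "1", 1) != 0)
        return -1;
    if (rocr && setenv("FI_CXI_DISABLE_DMABUF_ROCR", "1", 1) != 0)
        return -1;
    return 0;
}
```

Under these assumptions, a call such as `disable_cxi_dmabuf(1, 0)` would be made early in startup, before any libfabric call that loads the CXI provider. Exporting the variable in the job launch environment achieves the same effect without code changes.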

## Translation Cache

Mapping a buffer for use by the NIC is an expensive operation. To avoid this
@@ -1077,6 +1098,12 @@ The CXI provider checks for the following environment variables:
*FI_CXI_DEFAULT_VNI*
: Default VNI value used only for service IDs where the VNI is not restricted.

*FI_CXI_RNR_MAX_TIMEOUT_US*
: When using the FI_PROTO_CXI_RNR endpoint protocol, this setting controls the
maximum time, measured from the original posting of the message, during which
the message is retried. A value of 0 returns an error completion
on the first RNR ack status.
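
The timeout semantics described for this variable can be modelled with a small predicate. This is a hypothetical sketch, not the provider's implementation: the function name, the microsecond timestamps, and the strict `<` comparison at the boundary are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical model of the retry decision taken when an RNR ack arrives.
 * post_us:        time the message was originally posted
 * now_us:         time the RNR ack status was observed
 * max_timeout_us: value of FI_CXI_RNR_MAX_TIMEOUT_US
 * Returns true if the message would be retried, false if an error
 * completion would be reported instead.
 */
bool rnr_should_retry(uint64_t post_us, uint64_t now_us,
                      uint64_t max_timeout_us)
{
    assert(now_us >= post_us); /* time since posting cannot be negative */
    return (now_us - post_us) < max_timeout_us;
}
```

With `max_timeout_us` set to 0 the predicate is never true, which matches the documented behavior of returning an error completion on the first RNR ack status.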

*FI_CXI_EQ_ACK_BATCH_SIZE*
: Number of EQ events to process before writing an acknowledgement to HW.
Batching ACKs amortizes the cost of event acknowledgement over multiple
4 changes: 3 additions & 1 deletion prov/cxi/Makefile.include
@@ -26,14 +26,16 @@ _cxi_files = \
prov/cxi/src/cxip_rma.c \
prov/cxi/src/cxip_mr.c \
prov/cxi/src/cxip_msg.c \
prov/cxi/src/cxip_msg_rnr.c \
prov/cxi/src/cxip_msg_hpc.c \
prov/cxi/src/cxip_atomic.c \
prov/cxi/src/cxip_iomm.c \
prov/cxi/src/cxip_faults.c \
prov/cxi/src/cxip_info.c \
prov/cxi/src/cxip_ctrl.c \
prov/cxi/src/cxip_req_buf.c \
prov/cxi/src/cxip_rdzv_pte.c \
prov/cxi/src/cxip_trace.c \
prov/cxi/src/cxip_coll_trace.c \
prov/cxi/src/cxip_telemetry.c \
prov/cxi/src/cxip_ptelist_buf.c \
prov/cxi/src/cxip_evtq.c \