[v1.21.x] Cherry-picked commits for 1.21.0rc2 #9926
Merged
Signed-off-by: Darryl Abbate <[email protected]> (cherry picked from commit 96172b7)
fi_cq_readerr is no longer called on uninitialized err_data and err_data_size in fi_setup.7.md. Signed-off-by: Rémi Dehenne <[email protected]> (cherry picked from commit 358422a)
Signed-off-by: Shi Jin <[email protected]> (cherry picked from commit e24f1c8)
Updates:
- Full support for Intel oneAPI DPC++/C++ compiler
- Improved default tuning for Intel GPUs
Signed-off-by: Scott Breyer <[email protected]> (cherry picked from commit acde37d)
Signed-off-by: Darryl Abbate <[email protected]> (cherry picked from commit 4e4deae)
Signed-off-by: Darryl Abbate <[email protected]> (cherry picked from commit 6e8765f)
Signed-off-by: Darryl Abbate <[email protected]> (cherry picked from commit bd891fc)
Signed-off-by: Darryl Abbate <[email protected]> (cherry picked from commit 87a1006)
This is a best-effort attempt at propagating core Libfabric error codes upwards wherever possible. Signed-off-by: Darryl Abbate <[email protected]> (cherry picked from commit b266f14)
shijin-aws
approved these changes
Mar 21, 2024
Signed-off-by: James Swaro <[email protected]> (cherry picked from commit c494d00)
EP objects will be able to support different EP protocols. Currently only the existing Portals SAS implementation is supported: FI_PROTO_CXI. NETCASSINI-5652 Signed-off-by: Steve Welch <[email protected]> (cherry picked from commit b991fd4) (cherry picked from commit 7d9a79f)
This refactors EP object ctrl elements related to side-band messaging and MR into its own structure. While this information is exclusively accessed for standard EP, it will be owned by the SEP (where MR are bound to the SEP) and shared among TX/RX contexts. NETCASSINI-5652 Signed-off-by: Steve Welch <[email protected]> (cherry picked from commit 28c0faa) (cherry picked from commit 734741a)
No functional changes; refactors code to have ep_obj reference the txc and rxc via a pointer. This will allow an ep_obj to support multiple context specializations that implement different endpoint protocols. NETCASSINI-5652 Signed-off-by: Steve Welch <[email protected]> (cherry picked from commit 87c50c8) (cherry picked from commit cd7f818)
NETCASSINI-5652 Signed-off-by: Steve Welch <[email protected]> (cherry picked from commit 80f01f7) (cherry picked from commit 6cbc037)
NETCASSINI-5652 Signed-off-by: Steve Welch <[email protected]> (cherry picked from commit 4f1dcf9) (cherry picked from commit 17e03da)
This commit does not alter functionality; it refactors the existing default RXC context into a common base and a protocol-specific derived object. The default protocol is FI_PROTO_CXI, which is implemented by the rxc_hpc derived object. It implements an HPC-capable SAS protocol with unexpected messages buffered at the target, and requires a Portals flow control implementation. NETCASSINI-5652 Signed-off-by: Steve Welch <[email protected]> (cherry picked from commit 1552c80) (cherry picked from commit a8cb4ac)
This commit does not alter functionality; it refactors the existing default TXC context into a common base and a protocol-specific derived object. The default protocol is FI_PROTO_CXI, which is implemented by the txc_hpc derived object. It implements an HPC-capable SAS protocol with unexpected messages buffered at the target and includes rendezvous messaging. It requires a Portals flow control implementation. NETCASSINI-5652 Signed-off-by: Steve Welch <[email protected]> (cherry picked from commit fb15795) (cherry picked from commit 0ab28b2)
Refactor so that context allocation is not entangled with EP object initialization. This will allow for contexts to do specialized initialization of structure at calloc. No functional difference. NETCASSINI-5662 Signed-off-by: Steve Welch <[email protected]> (cherry picked from commit f368168) (cherry picked from commit 6314a5b)
Allocation of a TXC/RXC will allocate and initialize the appropriate derived context object. Context initialization is no longer entangled with EP object initialization. Introduces the concept of TXC/RXC ops functions that execute derived-object-specific code. NETCASSINI-5652 Signed-off-by: Steve Welch <[email protected]> (cherry picked from commit 5a5661e) (cherry picked from commit 76546e9)
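The base/derived context pattern with per-protocol ops functions described in these commits can be illustrated roughly as follows. This is only a sketch under stated assumptions: every identifier below (example_rxc_base, example_rxc_ops, example_rxc_hpc, EXAMPLE_PROTO_HPC, and the helper functions) is hypothetical and is not taken from the CXI provider source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* All names here are hypothetical, for illustration only. */
enum example_proto { EXAMPLE_PROTO_HPC = 1 };

struct example_rxc_base;

/* Ops table: the common code calls through these pointers so that
 * each derived context runs its own protocol-specific logic. */
struct example_rxc_ops {
	int (*progress)(struct example_rxc_base *rxc);
	void (*cleanup)(struct example_rxc_base *rxc); /* optional, may be NULL */
};

/* Common base shared by every derived receive context. */
struct example_rxc_base {
	int protocol;
	const struct example_rxc_ops *ops;
};

/* Derived context: embeds the base and adds only what it needs. */
struct example_rxc_hpc {
	struct example_rxc_base base;
	int overflow_bufs; /* stand-in for HPC-only resources */
};

#define example_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

static int example_hpc_progress(struct example_rxc_base *rxc)
{
	struct example_rxc_hpc *hpc =
		example_container_of(rxc, struct example_rxc_hpc, base);

	return hpc->overflow_bufs; /* placeholder for real progress work */
}

static void example_hpc_cleanup(struct example_rxc_base *rxc)
{
	struct example_rxc_hpc *hpc =
		example_container_of(rxc, struct example_rxc_hpc, base);

	hpc->overflow_bufs = 0; /* release HPC-only resources */
}

static const struct example_rxc_ops example_hpc_ops = {
	.progress = example_hpc_progress,
	.cleanup = example_hpc_cleanup,
};

/* Allocation picks the derived object for the requested protocol,
 * independent of any endpoint object initialization. */
struct example_rxc_base *example_rxc_alloc(int protocol)
{
	struct example_rxc_hpc *hpc;

	if (protocol != EXAMPLE_PROTO_HPC)
		return NULL; /* unknown protocol in this sketch */

	hpc = calloc(1, sizeof(*hpc));
	if (hpc == NULL)
		return NULL;

	hpc->base.protocol = protocol;
	hpc->base.ops = &example_hpc_ops;
	hpc->overflow_bufs = 4;
	return &hpc->base;
}

/* Disable calls into the derived object only if it supports cleanup. */
void example_rxc_disable(struct example_rxc_base *rxc)
{
	assert(rxc != NULL && rxc->ops != NULL);
	if (rxc->ops->cleanup != NULL)
		rxc->ops->cleanup(rxc);
}

void example_rxc_free(struct example_rxc_base *rxc)
{
	free(example_container_of(rxc, struct example_rxc_hpc, base));
}
```

A caller would allocate with example_rxc_alloc(EXAMPLE_PROTO_HPC), drive it through rxc->ops->progress(rxc), and tear it down with example_rxc_disable() followed by example_rxc_free(). A second protocol would add another derived struct and ops table without touching the common code.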
Refactor context initialization to make derived object initialize only what it needs. For example overflow and request buffers are only required for HPC derived object. NETCASSINI-5652 Signed-off-by: Steve Welch <[email protected]> (cherry picked from commit f52389e) (cherry picked from commit ccee0ec)
Refactor context disable to call into derived object for cleanup if operation is supported. No new functionality is added; HPC messaging specific cleanup is moved to helper operation. NETCASSINI-5652 Signed-off-by: Steve Welch <[email protected]> (cherry picked from commit 5482af1) (cherry picked from commit 193d0f8)
Refactors code to allow a derived context to implement protocol-specific progress. This will allow future protocols with different progress demands to not impact existing protocols. NETCASSINI-5652 Signed-off-by: Steve Welch <[email protected]> (cherry picked from commit 24ea37c) (cherry picked from commit 712fce3)
Allow RXC/TXC specific cancel functions. This will allow the client/server object to support TX cancel when implemented. NETCASSINI-5652 Signed-off-by: Steve Welch <[email protected]> (cherry picked from commit 581a68f) (cherry picked from commit e1061df)
Add RXC op to implement a control messaging callback which can override processing of control messaging events. This allows a context protocol to implement a specific side-band messaging implementation. NETCASSINI-5652 Signed-off-by: Steve Welch <[email protected]> (cherry picked from commit 95aa467) (cherry picked from commit 36054b1)
Refactor code to allow derived RXC/TXC to have unique respective recv_common and send_common functionality. Future protocol will integrate seamlessly into API flow. NETCASSINI-5652 Signed-off-by: Steve Welch <[email protected]> (cherry picked from commit 7cd4f60) (cherry picked from commit 396c49c)
Move HPC specific protocol code to new file cxip_msg_hpc.c while leaving common protocol code in cxip_msg.c. This only refactors the code. NETCASSINI-5652 Signed-off-by: Steve Welch <[email protected]> (cherry picked from commit d292b5b) (cherry picked from commit 232d280)
Adds the file cxi/src/cxip_msg_cs.c NETCASSINI-5652 Signed-off-by: Steve Welch <[email protected]> (cherry picked from commit fe0d962) (cherry picked from commit 901e053)
Return fi_info for the new protocol; the protocol must be explicitly requested if hints are passed. Note that if FI_CXI_COMPAT=2, only old constants are used and the new protocol is not present. Update/add unit tests to validate fi_info and selection. NETCASSINI-5652 Signed-off-by: Steve Welch <[email protected]> (cherry picked from commit 550516f) (cherry picked from commit 1afb008)
Add initial FI_PROTO_CXI_CS derived rxc/txc structure initialization and man page update. NETCASSINI-5652 Signed-off-by: Steve Welch <[email protected]> (cherry picked from commit 1d10c13) (cherry picked from commit 1e1642c)
The FM URL provided by the WLM is now the full path to the multicast creation target endpoint, not just the base of the FM REST. Signed-off-by: Joe Nemeth <[email protected]> (cherry picked from commit d20e3ae) (cherry picked from commit 447f741)
VNI is provided by the WLM, and must be provided in the multicast creation command. Replace json_fmt static const string with inline string. Signed-off-by: Joe Nemeth <[email protected]> (cherry picked from commit db3e832) (cherry picked from commit dfa4722)
FM REST API now uses a Bearer token, not x-xenon-auth-token. Signed-off-by: Joe Nemeth <[email protected]> (cherry picked from commit d09cfb6) (cherry picked from commit 451ee08)
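As a rough illustration of the header change, a Bearer token is carried in a standard HTTP Authorization header. The helper below is a hypothetical sketch (example_format_bearer_header is not a provider function) that only formats the header string.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write "Authorization: Bearer <token>" into buf.
 * Returns 0 on success, -1 if the token is empty or the buffer is too
 * small to hold the whole header. */
int example_format_bearer_header(char *buf, size_t len, const char *token)
{
	int n;

	assert(buf != NULL && token != NULL);
	if (len == 0 || strlen(token) == 0)
		return -1;

	n = snprintf(buf, len, "Authorization: Bearer %s", token);
	if (n < 0 || (size_t)n >= len)
		return -1; /* output was truncated */
	return 0;
}
```

With libcurl, a string like this would typically be added to a header list with curl_slist_append() and attached to the request via CURLOPT_HTTPHEADER.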
CURL errors should be logged to stderr. Signed-off-by: Joe Nemeth <[email protected]> (cherry picked from commit 9abaeef) (cherry picked from commit ac3d632)
In production, we want to optionally support peer verification. In testing, we generally do not. This can now be specified by setting the environment variable CURLOPT_SSL_VERIFYPEER to 0 (do not verify) or 1 (verify). The default is 0. Signed-off-by: Joe Nemeth <[email protected]> (cherry picked from commit ef30f59) (cherry picked from commit c774a81)
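One way such an environment-driven switch could be read is sketched below. The helper name example_verify_peer_from_env is hypothetical, and treating any value other than "1" as 0 is an assumption of this sketch, not something the commit message states.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not the provider's code: map an environment
 * variable to a 0/1 setting, defaulting to 0 (do not verify). */
long example_verify_peer_from_env(const char *name)
{
	const char *val;

	assert(name != NULL);
	val = getenv(name);

	/* Only an explicit "1" enables verification; an unset variable
	 * or any other value keeps the default of 0 (assumption). */
	if (val != NULL && strcmp(val, "1") == 0)
		return 1L;
	return 0L;
}
```

The result is a long so it could be handed directly to curl_easy_setopt(curl, CURLOPT_SSL_VERIFYPEER, value), which expects a long argument for this option.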
Evaluate the simulation mode once, and set mc_obj->is_multicast appropriately. Signed-off-by: Joe Nemeth <[email protected]> (cherry picked from commit 1131419) (cherry picked from commit 336d178)
Allow retries to be disabled for test cases. Signed-off-by: Joe Nemeth <[email protected]> (cherry picked from commit 9e05ee0) (cherry picked from commit bae1e73)
Add COMM_KEY_NONE to _gen_tx_dfa() function. Signed-off-by: Joe Nemeth <[email protected]> (cherry picked from commit 5c0bd49) (cherry picked from commit 7a76318)
Allow CURL operations to be traced independently of JOIN. Signed-off-by: Joe Nemeth <[email protected]> (cherry picked from commit 9457dd0) (cherry picked from commit a48975b)
Cleanup. Signed-off-by: Joe Nemeth <[email protected]> (cherry picked from commit d383b51) (cherry picked from commit 2727ece)
FM now generates a full 6-octet NIC address. Signed-off-by: Joe Nemeth <[email protected]> (cherry picked from commit 75d0879) (cherry picked from commit c3cf64f)
Add flag to suppress repeated logging during CURL polling. Signed-off-by: Joe Nemeth <[email protected]> (cherry picked from commit c2ea999) (cherry picked from commit c1e5b1a)
Remove unused pid_idx value in cxip_join_state structure. Signed-off-by: Joe Nemeth <[email protected]> (cherry picked from commit 0ab258f) (cherry picked from commit 7046f4e)
Change minimum test size to 2 (endpoints), from 4. Add "/op" to performance output to clearly indicate that the performance value is per-operation, not a total runtime. Signed-off-by: Joe Nemeth <[email protected]> (cherry picked from commit f215798) (cherry picked from commit e9bbeec)
Added SLURM and FI_CXI environment variable capture. Changed error output to stderr (not stdout). Removed placeholder defaults for environment variables. Signed-off-by: Joe Nemeth <[email protected]> (cherry picked from commit 80e756f) (cherry picked from commit 31d9579)
Change cxip_trace_filename to cxip_trace_pathname and allow tracing to occur in alternate directories, which is useful when the current path is not writable by the user. Initialization fails early if no masks are selected, preventing creation of empty files. The early model of initializing only once at test login was flawed; tracing can now be initialized, disabled, and re-initialized. Signed-off-by: Joe Nemeth <[email protected]> (cherry picked from commit 7a75110) (cherry picked from commit 427ff4b)
Checkpoint commit. This code is in development. Signed-off-by: Joe Nemeth <[email protected]> (cherry picked from commit e4354f2) (cherry picked from commit e455af9)
Comment is incorrect and misleading for PID_IDX value for mcast address. Signed-off-by: Joe Nemeth <[email protected]> (cherry picked from commit 5c533ce) (cherry picked from commit 136bafc)
Signed-off-by: Kalyan Kodamagula <[email protected]> (cherry picked from commit 9d1afe8) (cherry picked from commit 47fcdf8)
Note: all CXIP_TRACE* references changed to CXIP_COLL_TRACE*
Note: all cxip_trace* references changed to cxip_coll_trace*

The TRACE() macros produce debugging traces to files that can be on a shared file system, or local to a physical node (and could be memory storage) for debugging collectives, which perform coordinated actions across multiple nodes. This not only prevents implicit synchronization of operations through shared file system waits, but also prevents mangling of the output when using normal character buffering from multiple sources, which is usually faster than line buffering.

This was originally put together for use with bench tests that are part of the libfabric suite, and required initialization through function calls within the bench tests, which makes this feature unavailable to external applications.

This commit refactors the TRACE() system to allow it to be entirely configured through environment variables, so that it can be used with production applications. If the ENABLE_DEBUG flag is zero, all of the TRACE features are removed entirely: embedded TRACE() calls are a syntactically robust NOOP that does not emit code during compilation. Otherwise, individual trace features must be activated through environment variables, allowing different areas of code to be traced selectively. If no trace features are selected, the trace files are not created.

The original design also used function pointer indirection to allow all of the trace functions to be entirely replaced. This was confusing to maintain, and offered no real benefit.

The former cxip_coll_trace_enable() function was overloaded with multiple purposes. This has been simplified into cxip_coll_trace_init() and cxip_coll_trace_close(), which are automatically called during coll module initialization, and a global cxip_coll_trace_muted flag that can be used to temporarily mute tracing. This allows repeated reductions (for instance) to be traced during setup, but then muted during a fast loop.
Signed-off-by: Joe Nemeth <[email protected]> (cherry picked from commit 3cd63c5) (cherry picked from commit c983a78)
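The compile-out and mute behaviour described in the commit message above can be sketched as follows. This is a simplified illustration, not the provider's implementation: EXAMPLE_TRACE, example_trace_init(), example_trace_close(), example_trace_muted, and the EXAMPLE_TRACE_MASK variable are all hypothetical names, and the sketch writes to stderr rather than to per-node trace files.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

#ifndef ENABLE_DEBUG
#define ENABLE_DEBUG 1 /* assumed default for this sketch */
#endif

bool example_trace_muted;      /* set true to silence tracing temporarily */
int example_trace_count;       /* number of trace records emitted */
static FILE *example_trace_fp; /* NULL until tracing is initialized */

#if ENABLE_DEBUG
/* Emit a record only when tracing is initialized and not muted. */
#define EXAMPLE_TRACE(...) \
	do { \
		if (example_trace_fp && !example_trace_muted) { \
			fprintf(example_trace_fp, __VA_ARGS__); \
			example_trace_count++; \
		} \
	} while (0)
#else
/* Compiled out: still a complete statement, but generates no code. */
#define EXAMPLE_TRACE(...) do { } while (0)
#endif

/* Enable tracing only if the (hypothetical) environment variable
 * selects something; otherwise fail without opening any output. */
int example_trace_init(void)
{
	const char *mask = getenv("EXAMPLE_TRACE_MASK");

	if (mask == NULL || mask[0] == '\0' || mask[0] == '0')
		return -1;

	example_trace_fp = stderr; /* a real tracer would open a file here */
	example_trace_muted = false;
	return 0;
}

void example_trace_close(void)
{
	example_trace_fp = NULL; /* tracing can be re-initialized later */
}
```

In use, an application would trace during setup, set example_trace_muted to true before a fast loop, and clear it afterwards. The do/while(0) wrapper keeps the macro safe inside unbraced if/else bodies in both build modes.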
Use ofi_hmem_* instead of ze_* specific calls NETCASSINI-4994 Signed-off-by: Chuck Fossen <[email protected]> (cherry picked from commit 782cf95) (cherry picked from commit 6da6530)
NETCASSINI-4994 Signed-off-by: Chuck Fossen <[email protected]> (cherry picked from commit 25f5ddc) (cherry picked from commit e543faf)
Libfabric semantics indicate that fi_cntr_wait() returns if an error count increment occurs before the threshold is reached. NETCASSINI-5909 Signed-off-by: Steve Welch <[email protected]> (cherry picked from commit 9180ae2) (cherry picked from commit 2a0a1a6)
Adds unit tests for verification of fi_cntr_wait() semantic operation with error count increment. NETCASSINI-5909 Signed-off-by: Steve Welch <[email protected]> (cherry picked from commit fa5d0db) (cherry picked from commit 87fd2f6)
Signed-off-by: James Swaro <[email protected]> (cherry picked from commit 135e31a)
Signed-off-by: James Swaro <[email protected]> (cherry picked from commit 459edef)
@jswaro I cherry-picked the cxi changes here so you don't need to do that.
No description provided.