[pull] main from NVIDIA:main #26

Merged: 63 commits merged into phu0ngng:main from NVIDIA:main on Oct 25, 2024

Conversation

@pull pull bot commented Sep 18, 2024

See Commits and Changes for more details.


Created by pull[bot]


mgoldfarb-nvidia and others added 9 commits September 16, 2024 10:06
…sequences (#1179)

Modify unit tests to work around cuDNN 9.4 regression.

Signed-off-by: Michael Goldfarb <[email protected]>
add dtensor support for te optimizers

Signed-off-by: jasonwan <[email protected]>
Co-authored-by: Kirthi Shankar Sivamani <[email protected]>
Update list of CI users

Signed-off-by: Tim Moon <[email protected]>
Implementation of context parallel fused attention using all-gather.

Signed-off-by: Michael Goldfarb <[email protected]>
…g cuDNN and NVRTC (#1183)

Defaulted CUDA_HOME/CUDA_PATH to /usr/local/cuda when attempting to dynamically load cuDNN and NVRTC

Signed-off-by: Alp Dener <[email protected]>
Signed-off-by: Przemyslaw Tredak <[email protected]>
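
As a rough illustration of the fallback described in the commit above (the helper name and lookup order here are assumptions for illustration, not the actual TE code):

```python
import os

def default_cuda_home() -> str:
    """Hypothetical helper: directory used when dynamically loading cuDNN/NVRTC,
    falling back to /usr/local/cuda if neither CUDA_HOME nor CUDA_PATH is set."""
    return os.environ.get("CUDA_HOME") or os.environ.get("CUDA_PATH") or "/usr/local/cuda"
```
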
Allow specifying cmake directory

Signed-off-by: Ryan Li <[email protected]>
Co-authored-by: Ryan Li <[email protected]>
* Add PyPI install instructions

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

* Review from @timmoon10

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

---------

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>
Port optimizer tests to pytest

Signed-off-by: Tim Moon <[email protected]>
@pull pull bot added the ⤵️ pull label Sep 18, 2024
denera and others added 20 commits September 18, 2024 13:09
…1175)

* Check if network interface name is valid and show useful warning message when initializing Userbuffers

Signed-off-by: Alp Dener <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix formatting issue in warning message.

Co-authored-by: Tim Moon <[email protected]>
Signed-off-by: Alp Dener <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Alp Dener <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tim Moon <[email protected]>
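
A minimal sketch of the interface-name validation idea from the Userbuffers commit above; the helper name, the environment variable, and the use of socket.if_nameindex() are assumptions for illustration:

```python
import socket
import warnings

def check_interface_name(ifname: str) -> None:
    """Warn if a user-specified network interface (e.g. from a variable such as
    NVTE_UB_SOCKET_IFNAME) does not exist on this host."""
    available = [name for _, name in socket.if_nameindex()]
    if ifname not in available:
        warnings.warn(
            f"Network interface '{ifname}' not found; available interfaces: {available}"
        )
```
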
* make rotary_base an argument

Signed-off-by: Sudhakar Singh <[email protected]>

* rotary base can be a float

Co-authored-by: Tim Moon <[email protected]>
Signed-off-by: Sudhakar Singh <[email protected]>

---------

Signed-off-by: Sudhakar Singh <[email protected]>
Co-authored-by: Tim Moon <[email protected]>
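
For context on why rotary_base being a float matters, here is the standard RoPE frequency computation it parameterizes (a generic sketch, not TE's implementation):

```python
import torch

def rope_inverse_frequencies(dim: int, rotary_base: float = 10000.0) -> torch.Tensor:
    """Inverse frequencies of the RoPE geometric progression; rotary_base sets
    the base of the progression and may be any positive float."""
    exponent = torch.arange(0, dim, 2, dtype=torch.float32) / dim
    return 1.0 / (rotary_base ** exponent)
```
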
* relax contiguous check for flash attention

Signed-off-by: Xin Yao <[email protected]>

* force contiguous for cp

Signed-off-by: Xin Yao <[email protected]>

---------

Signed-off-by: Xin Yao <[email protected]>
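
A hedged sketch of the relaxed contiguity check described above: FlashAttention only needs the last (head) dimension to be contiguous, while the context-parallel (cp) path still forces a fully contiguous tensor. The helper name is hypothetical:

```python
import torch

def maybe_contiguous(t: torch.Tensor, using_cp: bool) -> torch.Tensor:
    if using_cp:
        # Context parallelism still requires a fully contiguous tensor.
        return t.contiguous()
    # Relaxed check: only the last dimension has to be contiguous for FlashAttention.
    return t if t.stride(-1) == 1 else t.contiguous()
```
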
* allow tutorial to download the model weights automatically

Signed-off-by: Sudhakar Singh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* allow users to provide weight cache directory

Signed-off-by: Sudhakar Singh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Sudhakar Singh <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Restore compatibility with Python 3.8

Signed-off-by: Przemyslaw Tredak <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Przemyslaw Tredak <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Add @pggPL to list of CI users

Signed-off-by: Tim Moon <[email protected]>
* Allow passing architectures like 90a without them being overridden

Signed-off-by: aurianer <[email protected]>

* Review suggestion from @timmoon10

Signed-off-by: Tim Moon <[email protected]>

---------

Signed-off-by: aurianer <[email protected]>
Signed-off-by: Tim Moon <[email protected]>
Co-authored-by: Tim Moon <[email protected]>
Add new users to CI

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>
* fix NVTE_UB_WITH_MPI read

Signed-off-by: Sangkug Lym <[email protected]>

* Add default value

Signed-off-by: Sangkug Lym <[email protected]>

---------

Signed-off-by: Sangkug Lym <[email protected]>
Co-authored-by: Kirthi Shankar Sivamani <[email protected]>
* Docs fixes

Signed-off-by: Pawel Gadzinski <[email protected]>

* docs fix

Signed-off-by: Pawel Gadzinski <[email protected]>

* docs fix

Signed-off-by: Pawel Gadzinski <[email protected]>

---------

Signed-off-by: Pawel Gadzinski <[email protected]>
Co-authored-by: Pawel Gadzinski <[email protected]>
* fix detection of 3 in 3hd/h3d layouts

Signed-off-by: Charlene Yang <[email protected]>

* error out when invalid layout group is provided

Signed-off-by: Charlene Yang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Charlene Yang <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
skip FP8 CP tests if hardware does not support FP8

Signed-off-by: Xiaowei Ren <[email protected]>
Add pool argument to make_graphed_callable

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>
Fix

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>
…with offsets (#1220)

* Removing the unused options from GroupedLinear docs and fixing the bug
with offsets

Signed-off-by: Przemyslaw Tredak <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* offsets -> fp8_meta_offsets

Signed-off-by: Przemyslaw Tredak <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Przemyslaw Tredak <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
move block_table arg to varlen_func section

Signed-off-by: Charlene Yang <[email protected]>
* CPU perf optimization in linear autograd function

Avoid enable_grad context when possible in cast function. Cache distributed group properties.

Signed-off-by: Tim Moon <[email protected]>

* CPU perf optimization in prepare_forward function

Avoid torch.nn.Module impl of __setattr__.

Signed-off-by: Tim Moon <[email protected]>

* Avoid module import in TE module forwards

Signed-off-by: Tim Moon <[email protected]>

* Use fast getter for params

Signed-off-by: Tim Moon <[email protected]>

* Reuse tensor dims in linear autograd func

Signed-off-by: Tim Moon <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Apply optimizations to grouped linear

Signed-off-by: Tim Moon <[email protected]>

* Debug test failures

Signed-off-by: Tim Moon <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Debug test failures

Signed-off-by: Tim Moon <[email protected]>

* Fix linter warnings

Signed-off-by: Tim Moon <[email protected]>

* Avoid deepcopy in tests

Signed-off-by: Tim Moon <[email protected]>

* Move _fast_setattr logic to __setattr__ method

Signed-off-by: Tim Moon <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Tim Moon <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
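
Regarding the "avoid torch.nn.Module impl of __setattr__" item above: nn.Module.__setattr__ scans the parameter, buffer, and submodule dicts on every assignment, so routing known plain attributes through object.__setattr__ removes that CPU overhead. A minimal sketch with made-up attribute names:

```python
import torch

class FastAttrModule(torch.nn.Module):
    # Illustrative attribute names only; not the actual Transformer Engine set.
    _fast_attrs = {"activation_dtype", "fp8_enabled"}

    def __setattr__(self, name, value):
        if name in self._fast_attrs:
            # Skip nn.Module's bookkeeping for attributes known to be plain Python values.
            object.__setattr__(self, name, value)
        else:
            super().__setattr__(name, value)
```
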
Signed-off-by: Emmanuel Ferdman <[email protected]>
Co-authored-by: Kirthi Shankar Sivamani <[email protected]>
* Tests for distributed

Signed-off-by: Pawel Gadzinski <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* added the test to the qa script

Signed-off-by: Pawel Gadzinski <[email protected]>

* Changed qa

Signed-off-by: Pawel Gadzinski <[email protected]>

* fix to test_numerics file

Signed-off-by: Pawel Gadzinski <[email protected]>

* pr fixes

Signed-off-by: Pawel Gadzinski <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fixes

Signed-off-by: Pawel Gadzinski <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update tests/pytorch/distributed/run_numerics.py

Co-authored-by: Tim Moon <[email protected]>
Signed-off-by: Paweł Gadziński <[email protected]>

---------

Signed-off-by: Pawel Gadzinski <[email protected]>
Signed-off-by: Pawel Gadzinski <[email protected]>
Signed-off-by: Paweł Gadziński <[email protected]>
Co-authored-by: Pawel Gadzinski <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Pawel Gadzinski <[email protected]>
Co-authored-by: Tim Moon <[email protected]>
* change API for hierarchical CP

Signed-off-by: Xiaowei Ren <[email protected]>

* move fp8 code before qkv reshape

Signed-off-by: Xiaowei Ren <[email protected]>

* try to insert A2A for hierarchical CP

Signed-off-by: Xiaowei Ren <[email protected]>

* make fwd work

Signed-off-by: Xiaowei Ren <[email protected]>

* remove a redundant sync

Signed-off-by: Xiaowei Ren <[email protected]>

* make bwd of hierarchical CP work

Signed-off-by: Xiaowei Ren <[email protected]>

* fix dout a2a in bwd

Signed-off-by: Xiaowei Ren <[email protected]>

* fix q_f16 with fp8

Signed-off-by: Xiaowei Ren <[email protected]>

* assert hierarchical CP implementation does not support THD format

Signed-off-by: Xiaowei Ren <[email protected]>

* bug fix

Signed-off-by: Xiaowei Ren <[email protected]>

* assert hierarchical CP does not support attn bias

Signed-off-by: Xiaowei Ren <[email protected]>

* add unit test for hierarchical CP

Signed-off-by: Xiaowei Ren <[email protected]>

* fix cp_comm_type in unit test

Signed-off-by: Xiaowei Ren <[email protected]>

* bug fix and code cleaning

Signed-off-by: Xiaowei Ren <[email protected]>

* minor change

Signed-off-by: Xiaowei Ren <[email protected]>

* update an assertion message

Signed-off-by: Xiaowei Ren <[email protected]>

* dout shape fix

Signed-off-by: Xiaowei Ren <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* move function definitions ahead of their first call

Signed-off-by: Xiaowei Ren <[email protected]>

* fix tensor view comments

Signed-off-by: Xiaowei Ren <[email protected]>

* refine CP unit test

Signed-off-by: Xiaowei Ren <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* typo fix

Signed-off-by: Xiaowei Ren <[email protected]>

* typo fix

Signed-off-by: Xiaowei Ren <[email protected]>

* save cp_size_a2a and rank_a2a in fwd

Signed-off-by: Xiaowei Ren <[email protected]>

* add more explanation of cp_group in the docstring

Signed-off-by: Xiaowei Ren <[email protected]>

---------

Signed-off-by: Xiaowei Ren <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
huanghua1994 and others added 29 commits October 10, 2024 09:37
* Expose JAX sliding window attn API

Signed-off-by: Hua Huang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* No SWA in context parallel; fix RNG seed in test

Signed-off-by: Hua Huang <[email protected]>

* Handle SWA API discrepancy between cuDNN and Python

Signed-off-by: Hua Huang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add SWA API for flax, all tests passed

Will update tests/jax/test_praxis_layers.py next

Signed-off-by: Hua Huang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update test_praxis_layers.py for SWA, test passed

Signed-off-by: Hua Huang <[email protected]>

* Use tuple window_size; update for PR #1212

Signed-off-by: Hua Huang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add and adjust some pytest.skip

Signed-off-by: Hua Huang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Revised following Reese Wang's comments

Still need further debugging:
FAILED test_fused_attn.py::TestFusedAttn::test_backward[NO_SWA-DROP_0.0-4-128-256-16-16-64-BF16-CROSS-KV_PACKED-NO_MASK-NO_BIAS] - AssertionError:
FAILED test_fused_attn.py::TestFusedAttn::test_backward[NO_SWA-DROP_0.0-4-128-256-16-16-64-BF16-CROSS-KV_PACKED-NO_MASK-POST_SCALE_BIAS-1HSS] - AssertionError:
FAILED test_fused_attn.py::TestFusedAttn::test_backward[NO_SWA-DROP_0.0-4-128-256-16-16-64-BF16-CROSS-SEPARATE-NO_MASK-NO_BIAS] - AssertionError:
FAILED test_fused_attn.py::TestFusedAttn::test_backward[NO_SWA-DROP_0.0-4-128-256-16-16-64-BF16-CROSS-SEPARATE-NO_MASK-POST_SCALE_BIAS-1HSS] - AssertionError:

These errors do not exist in the previous commit

Signed-off-by: Hua Huang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix no-SWA test case errors in previous commit

Signed-off-by: Hua Huang <[email protected]>

* Add Padding mask w/ sliding windows sanity tests

Signed-off-by: Reese Wang <[email protected]>

* Use float32 for the reference code softmax calculation

Signed-off-by: Reese Wang <[email protected]>

---------

Signed-off-by: Hua Huang <[email protected]>
Signed-off-by: Reese Wang <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Reese Wang <[email protected]>
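
On the "use float32 for the reference code softmax calculation" item above: upcasting the reference keeps the comparison baseline from being limited by bf16/fp16 precision. A generic NumPy sketch of such a reference:

```python
import numpy as np

def reference_softmax(logits: np.ndarray) -> np.ndarray:
    x = logits.astype(np.float32)          # upcast before the reduction
    x = x - x.max(axis=-1, keepdims=True)  # subtract the row max for stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)
```
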
* Fixes to Float8Tensor

Signed-off-by: Przemyslaw Tredak <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Przemyslaw Tredak <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Fix bug in torch compile when seqdim is an integer

Signed-off-by: 李金梁 <[email protected]>

* Update attention.py

change the jit_fuser to torch.compile on flash_attn_fwd_out_correction

Signed-off-by: 李金梁 <[email protected]>

* Annotate fused functions

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

---------

Signed-off-by: 李金梁 <[email protected]>
Signed-off-by: Kirthi Shankar Sivamani <[email protected]>
Co-authored-by: Kirthi Shankar Sivamani <[email protected]>
* fa2 function import renaming

Signed-off-by: Xiaowei Ren <[email protected]>

* refine fa_fwd_kwargs and fa_bwd_kwargs

Signed-off-by: Xiaowei Ren <[email protected]>

* import FA3 functions for CP

Signed-off-by: Xiaowei Ren <[email protected]>

* fix output of FA3 fwd

Signed-off-by: Xiaowei Ren <[email protected]>

* fix rng_state in a2a implementation with FA3

Signed-off-by: Xiaowei Ren <[email protected]>

* hack lse correction for packed lse format

Signed-off-by: Xiaowei Ren <[email protected]>

* make CP thd out correction work with packed lse

Signed-off-by: Xiaowei Ren <[email protected]>

* fix for packed softmax_lse

Signed-off-by: Xiaowei Ren <[email protected]>

* fix softmax_lse shape

Signed-off-by: Xiaowei Ren <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* change lse_packed to constexpr

Signed-off-by: Xiaowei Ren <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Xiaowei Ren <[email protected]>
Signed-off-by: Charlene Yang <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Charlene Yang <[email protected]>
* Let Fused RoPE support THD with CP

Signed-off-by: Xin Yao <[email protected]>

* add comment

Signed-off-by: Xin Yao <[email protected]>

---------

Signed-off-by: Xin Yao <[email protected]>
Co-authored-by: Xiaowei Ren <[email protected]>
#1227)

Update test to check support for context parallel attention.

Signed-off-by: Michael Goldfarb <[email protected]>
* Create README.md

added all PyT examples

Signed-off-by: Santosh Bhavani <[email protected]>

* Update README.md

- Added JAX, PaddlePaddle, and third-party examples
- Fixed DL framework links
- Removed issue request for new PRs

Signed-off-by: Santosh Bhavani <[email protected]>

---------

Signed-off-by: Santosh Bhavani <[email protected]>
Signed-off-by: Santosh Bhavani <[email protected]>
Co-authored-by: Kirthi Shankar Sivamani <[email protected]>
* Build custom ORT ops before running ONNX tests

Signed-off-by: Tim Moon <[email protected]>

* Remove ONNX from context parallelism tests

Signed-off-by: Tim Moon <[email protected]>

* Export ONNX ops that do compute in FP32

Matches internal impl of TE kernels.

Signed-off-by: Tim Moon <[email protected]>

* Add build script for custom ORT ops

Signed-off-by: Tim Moon <[email protected]>

---------

Signed-off-by: Tim Moon <[email protected]>
fixed assertion bug for SWA

Signed-off-by: Md Fahim Faysal Khan <[email protected]>
Co-authored-by: Kirthi Shankar Sivamani <[email protected]>
Co-authored-by: Phuong Nguyen <[email protected]>
* WIP: make FA2 optional

Signed-off-by: Charlene Yang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* WIP: fix logic

Signed-off-by: Charlene Yang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix lint

Signed-off-by: Charlene Yang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* minor fixes

Signed-off-by: Charlene Yang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* minor tweak

Signed-off-by: Charlene Yang <[email protected]>

* add L1 test to test all supported FA versions

Signed-off-by: Charlene Yang <[email protected]>

* update version to 2.1.1 and trim L1 tests

Signed-off-by: Charlene Yang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update onnxruntime version

Signed-off-by: Charlene Yang <[email protected]>

* remove onnxruntime from L1 FA versions tests

Signed-off-by: Charlene Yang <[email protected]>

---------

Signed-off-by: Charlene Yang <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Upgrade pylint and first round formatting

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

* round 2

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

* round 3

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

* Format and fixes

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

* Paddle lint

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

* Reviews

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

* Fixes

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

* More linting

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

* Run formatter

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

* Paddle lint

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

* Fixes

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

---------

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>
Fix FP8 activation recompute

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>
Signed-off-by: Przemyslaw Tredak <[email protected]>
#1258)

Fix wgrad for GroupedLinear when weights don't require grad

Signed-off-by: Xin Yao <[email protected]>
Co-authored-by: Kirthi Shankar Sivamani <[email protected]>
* fix bias for 0-dim tensor

Signed-off-by: Xin Yao <[email protected]>

* add check

Signed-off-by: Xin Yao <[email protected]>

* use numel() instead of nullptr

Signed-off-by: Xin Yao <[email protected]>

---------

Signed-off-by: Xin Yao <[email protected]>
* register CmdBufferCompatible traits via C++ API

* renamed FFI_Traits

* use register_ffi_target()

---------

Signed-off-by: Phuong Nguyen <[email protected]>
fix seq_dim in CP implementation

Signed-off-by: Xiaowei Ren <[email protected]>
* Reorganize PyTorch L1 tests

Signed-off-by: Tim Moon <[email protected]>

* Move ONNX tests to L1

Signed-off-by: Tim Moon <[email protected]>

* Move FA version test to L3

Signed-off-by: Tim Moon <[email protected]>

* Limit parallel build jobs in FA version test

Signed-off-by: Tim Moon <[email protected]>

---------

Signed-off-by: Tim Moon <[email protected]>
* Debug wheel test for PaddlePaddle

Signed-off-by: Tim Moon <[email protected]>

* Fix typo

Signed-off-by: Tim Moon <[email protected]>

---------

Signed-off-by: Tim Moon <[email protected]>
Remove PyTorch L0 distributed test

Forgot to remove in #1255.

Signed-off-by: Tim Moon <[email protected]>
remove one FA version in the L3 test

Signed-off-by: Charlene Yang <[email protected]>
…1230)

* Use 64-bit offsets for cuDNN 9.5+
* Align workspace tensors to 16B.
* Fix bug where std::accumulate overflowed on large tensor shapes.
* Only support 64-bit offsets on arbitrary sequence length fp16 backend.

Signed-off-by: Michael Goldfarb <[email protected]>
* Skip encoder tests on V100

* Fix multiprocessing jax.distributed.init

* Remove XLA xla_gpu_deterministic_ops which causes segfault

---------

Signed-off-by: Reese Wang <[email protected]>
Add THD + GQA supports for cuDNN >= 9.6

Signed-off-by: Reese Wang <[email protected]>
…rics check in unit tests (#1282)

Fix correctness of JAX fused attention with CP.

Signed-off-by: Michael Goldfarb <[email protected]>
…, ActLuFP8, LayerNormForwardFP8FFI, and LayerNormBackwardFFI (#1263)

* Add TransposeFFI, test passed

Signed-off-by: Hua Huang <[email protected]>

* Add ActLuFP8FFI; fix TransposeFFI

Signed-off-by: Hua Huang <[email protected]>

* Add QuantizeFFI

Signed-off-by: Hua Huang <[email protected]>

* Add FusedAttnForwardFFI and some unit tests

Signed-off-by: Hua Huang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Minor fix

Signed-off-by: Hua Huang <[email protected]>

* Add LayerNormForwardFP8FFI & LayerNormBackwardFFI

Signed-off-by: Hua Huang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Revise FusedAttnForwardFFI()

Signed-off-by: Hua Huang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add FFI_CudaGraph_Traits

All tests passed, ready for merge

Signed-off-by: Hua Huang <[email protected]>

* Bug fix for FFI data type mismatch

Also add a safeguard at the entry point of the FFI function

Signed-off-by: Hua Huang <[email protected]>

---------

Signed-off-by: Hua Huang <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Update class names for Paddle 3.0

Signed-off-by: Tim Moon <[email protected]>
* update test numerics

Signed-off-by: Pawel Gadzinski <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update test numerics

Signed-off-by: Pawel Gadzinski <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update test numerics

Signed-off-by: Pawel Gadzinski <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update tests/pytorch/test_numerics.py

Co-authored-by: Tim Moon <[email protected]>
Signed-off-by: Paweł Gadziński <[email protected]>

* tests fix

Signed-off-by: Pawel Gadzinski <[email protected]>

* Fixes for failing CI

Signed-off-by: Pawel Gadzinski <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fixes for failing CI

Signed-off-by: Pawel Gadzinski <[email protected]>

* Fix key

Co-authored-by: Tim Moon <[email protected]>
Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

* fixes

Signed-off-by: Pawel Gadzinski <[email protected]>

---------

Signed-off-by: Pawel Gadzinski <[email protected]>
Signed-off-by: Paweł Gadziński <[email protected]>
Signed-off-by: Kirthi Shankar Sivamani <[email protected]>
Co-authored-by: Pawel Gadzinski <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tim Moon <[email protected]>
Co-authored-by: Kirthi Shankar Sivamani <[email protected]>
@phu0ngng phu0ngng merged commit 7b284fe into phu0ngng:main Oct 25, 2024