[pull] main from NVIDIA:main #26

Merged: 63 commits merged into phu0ngng:main from NVIDIA:main on Oct 25, 2024

Conversation

@pull pull bot commented Sep 18, 2024

See Commits and Changes for more details.


Created by pull[bot]


mgoldfarb-nvidia and others added 9 commits September 16, 2024 10:06
…sequences (#1179)

Modify unit tests to work around cuDNN 9.4 regression.

Signed-off-by: Michael Goldfarb <[email protected]>
add dtensor support for te optimizers

Signed-off-by: jasonwan <[email protected]>
Co-authored-by: Kirthi Shankar Sivamani <[email protected]>
Update list of CI users

Signed-off-by: Tim Moon <[email protected]>
Implementation of context parallel fused attention using all-gather.

Signed-off-by: Michael Goldfarb <[email protected]>
…g cuDNN and NVRTC (#1183)

Defaulted CUDA_HOME/CUDA_PATH to /usr/local/cuda when attempting to dynamically load cuDNN and NVRTC

Signed-off-by: Alp Dener <[email protected]>
Signed-off-by: Przemyslaw Tredak <[email protected]>
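
As a rough illustration of the fallback described in the commit above (the helper name and lookup order here are assumptions for illustration, not the actual TE code):

```python
import os

def default_cuda_home() -> str:
    """Hypothetical helper: directory used when dynamically loading cuDNN/NVRTC,
    falling back to /usr/local/cuda if neither CUDA_HOME nor CUDA_PATH is set."""
    return os.environ.get("CUDA_HOME") or os.environ.get("CUDA_PATH") or "/usr/local/cuda"
```
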
Allow specifying cmake directory

Signed-off-by: Ryan Li <[email protected]>
Co-authored-by: Ryan Li <[email protected]>
* Add PyPI install instructions

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

* Review from @timmoon10

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

---------

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>
Port optimizer tests to pytest

Signed-off-by: Tim Moon <[email protected]>
@pull pull bot added the ⤵️ pull label Sep 18, 2024
denera and others added 20 commits September 18, 2024 13:09
…1175)

* Check if network interface name is valid and show useful warning message when initializing Userbuffers

Signed-off-by: Alp Dener <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix formatting issue in warning message.

Co-authored-by: Tim Moon <[email protected]>
Signed-off-by: Alp Dener <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Alp Dener <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tim Moon <[email protected]>
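
A minimal sketch of the interface-name validation idea from the Userbuffers commit above; the helper name, the environment variable, and the use of socket.if_nameindex() are assumptions for illustration:

```python
import socket
import warnings

def check_interface_name(ifname: str) -> None:
    """Warn if a user-specified network interface (e.g. from a variable such as
    NVTE_UB_SOCKET_IFNAME) does not exist on this host."""
    available = [name for _, name in socket.if_nameindex()]
    if ifname not in available:
        warnings.warn(
            f"Network interface '{ifname}' not found; available interfaces: {available}"
        )
```
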
* make rotary_base an argument

Signed-off-by: Sudhakar Singh <[email protected]>

* rotary base can be a float

Co-authored-by: Tim Moon <[email protected]>
Signed-off-by: Sudhakar Singh <[email protected]>

---------

Signed-off-by: Sudhakar Singh <[email protected]>
Co-authored-by: Tim Moon <[email protected]>
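
For context on why rotary_base being a float matters, here is the standard RoPE frequency computation it parameterizes (a generic sketch, not TE's implementation):

```python
import torch

def rope_inverse_frequencies(dim: int, rotary_base: float = 10000.0) -> torch.Tensor:
    """Inverse frequencies of the RoPE geometric progression; rotary_base sets
    the base of the progression and may be any positive float."""
    exponent = torch.arange(0, dim, 2, dtype=torch.float32) / dim
    return 1.0 / (rotary_base ** exponent)
```
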
* relax contiguous check for flash attention

Signed-off-by: Xin Yao <[email protected]>

* force contiguous for cp

Signed-off-by: Xin Yao <[email protected]>

---------

Signed-off-by: Xin Yao <[email protected]>
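
A hedged sketch of the relaxed contiguity check described above: FlashAttention only needs the last (head) dimension to be contiguous, while the context-parallel (cp) path still forces a fully contiguous tensor. The helper name is hypothetical:

```python
import torch

def maybe_contiguous(t: torch.Tensor, using_cp: bool) -> torch.Tensor:
    if using_cp:
        # Context parallelism still requires a fully contiguous tensor.
        return t.contiguous()
    # Relaxed check: only the last dimension has to be contiguous for FlashAttention.
    return t if t.stride(-1) == 1 else t.contiguous()
```
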
* allow tutorial to download the model weights automatically

Signed-off-by: Sudhakar Singh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* allow users to provide weight cache directory

Signed-off-by: Sudhakar Singh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Sudhakar Singh <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Restore compatibility with Python 3.8

Signed-off-by: Przemyslaw Tredak <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Przemyslaw Tredak <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Add @pggPL to list of CI users

Signed-off-by: Tim Moon <[email protected]>
* Allow passing architectures like 90a without them being overridden

Signed-off-by: aurianer <[email protected]>

* Review suggestion from @timmoon10

Signed-off-by: Tim Moon <[email protected]>

---------

Signed-off-by: aurianer <[email protected]>
Signed-off-by: Tim Moon <[email protected]>
Co-authored-by: Tim Moon <[email protected]>
Add new users to CI

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>
* fix NVTE_UB_WITH_MPI read

Signed-off-by: Sangkug Lym <[email protected]>

* Add default value

Signed-off-by: Sangkug Lym <[email protected]>

---------

Signed-off-by: Sangkug Lym <[email protected]>
Co-authored-by: Kirthi Shankar Sivamani <[email protected]>
* Docs fixes

Signed-off-by: Pawel Gadzinski <[email protected]>

* docs fix

Signed-off-by: Pawel Gadzinski <[email protected]>

* docs fix

Signed-off-by: Pawel Gadzinski <[email protected]>

---------

Signed-off-by: Pawel Gadzinski <[email protected]>
Co-authored-by: Pawel Gadzinski <[email protected]>
* fix detection of 3 in 3hd/h3d layouts

Signed-off-by: Charlene Yang <[email protected]>

* error out when invalid layout group is provided

Signed-off-by: Charlene Yang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Charlene Yang <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
skip FP8 CP tests if hardware does not support FP8

Signed-off-by: Xiaowei Ren <[email protected]>
Add pool argument to make_graphed_callable

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>
Fix

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>
…with offsets (#1220)

* Removing the unused options from GroupedLinear docs and fixing the bug
with offsets

Signed-off-by: Przemyslaw Tredak <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* offsets -> fp8_meta_offsets

Signed-off-by: Przemyslaw Tredak <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Przemyslaw Tredak <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
move block_table arg to varlen_func section

Signed-off-by: Charlene Yang <[email protected]>
* CPU perf optimization in linear autograd function

Avoid enable_grad context when possible in cast function. Cache distributed group properties.

Signed-off-by: Tim Moon <[email protected]>

* CPU perf optimization in prepare_forward function

Avoid torch.nn.Module impl of __setattr__.

Signed-off-by: Tim Moon <[email protected]>

* Avoid module import in TE module forwards

Signed-off-by: Tim Moon <[email protected]>

* Use fast getter for params

Signed-off-by: Tim Moon <[email protected]>

* Reuse tensor dims in linear autograd func

Signed-off-by: Tim Moon <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Apply optimizations to grouped linear

Signed-off-by: Tim Moon <[email protected]>

* Debug test failures

Signed-off-by: Tim Moon <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Debug test failures

Signed-off-by: Tim Moon <[email protected]>

* Fix linter warnings

Signed-off-by: Tim Moon <[email protected]>

* Avoid deepcopy in tests

Signed-off-by: Tim Moon <[email protected]>

* Move _fast_setattr logic to __setattr__ method

Signed-off-by: Tim Moon <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Tim Moon <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
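
Regarding the "avoid torch.nn.Module impl of __setattr__" item above: nn.Module.__setattr__ scans the parameter, buffer, and submodule dicts on every assignment, so routing known plain attributes through object.__setattr__ removes that CPU overhead. A minimal sketch with made-up attribute names:

```python
import torch

class FastAttrModule(torch.nn.Module):
    # Illustrative attribute names only; not the actual Transformer Engine set.
    _fast_attrs = {"activation_dtype", "fp8_enabled"}

    def __setattr__(self, name, value):
        if name in self._fast_attrs:
            # Skip nn.Module's bookkeeping for attributes known to be plain Python values.
            object.__setattr__(self, name, value)
        else:
            super().__setattr__(name, value)
```
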
Signed-off-by: Emmanuel Ferdman <[email protected]>
Co-authored-by: Kirthi Shankar Sivamani <[email protected]>
* Tests for distributed

Signed-off-by: Pawel Gadzinski <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* added the test to the qa script

Signed-off-by: Pawel Gadzinski <[email protected]>

* Changed qa

Signed-off-by: Pawel Gadzinski <[email protected]>

* fix to test_numerics file

Signed-off-by: Pawel Gadzinski <[email protected]>

* pr fixes

Signed-off-by: Pawel Gadzinski <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fixes

Signed-off-by: Pawel Gadzinski <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update tests/pytorch/distributed/run_numerics.py

Co-authored-by: Tim Moon <[email protected]>
Signed-off-by: Paweł Gadziński <[email protected]>

---------

Signed-off-by: Pawel Gadzinski <[email protected]>
Signed-off-by: Pawel Gadzinski <[email protected]>
Signed-off-by: Paweł Gadziński <[email protected]>
Co-authored-by: Pawel Gadzinski <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Pawel Gadzinski <[email protected]>
Co-authored-by: Tim Moon <[email protected]>
* change API for hierarchical CP

Signed-off-by: Xiaowei Ren <[email protected]>

* move fp8 code before qkv reshape

Signed-off-by: Xiaowei Ren <[email protected]>

* try to insert A2A for hierarchical CP

Signed-off-by: Xiaowei Ren <[email protected]>

* make fwd work

Signed-off-by: Xiaowei Ren <[email protected]>

* remove a redundant sync

Signed-off-by: Xiaowei Ren <[email protected]>

* make bwd of hierarchical CP work

Signed-off-by: Xiaowei Ren <[email protected]>

* fix dout a2a in bwd

Signed-off-by: Xiaowei Ren <[email protected]>

* fix q_f16 with fp8

Signed-off-by: Xiaowei Ren <[email protected]>

* assert hierarchical CP implementation does not support THD format

Signed-off-by: Xiaowei Ren <[email protected]>

* bug fix

Signed-off-by: Xiaowei Ren <[email protected]>

* assert hierarchical CP does not support attn bias

Signed-off-by: Xiaowei Ren <[email protected]>

* add unit test for hierarchical CP

Signed-off-by: Xiaowei Ren <[email protected]>

* fix cp_comm_type in unit test

Signed-off-by: Xiaowei Ren <[email protected]>

* bug fix and code cleaning

Signed-off-by: Xiaowei Ren <[email protected]>

* minor change

Signed-off-by: Xiaowei Ren <[email protected]>

* update an assertion message

Signed-off-by: Xiaowei Ren <[email protected]>

* dout shape fix

Signed-off-by: Xiaowei Ren <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* move function definitions ahead of their first call

Signed-off-by: Xiaowei Ren <[email protected]>

* fix tensor view comments

Signed-off-by: Xiaowei Ren <[email protected]>

* refine CP unit test

Signed-off-by: Xiaowei Ren <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* typo fix

Signed-off-by: Xiaowei Ren <[email protected]>

* typo fix

Signed-off-by: Xiaowei Ren <[email protected]>

* save cp_size_a2a and rank_a2a in fwd

Signed-off-by: Xiaowei Ren <[email protected]>

* add more explanation of cp_group in the docstring

Signed-off-by: Xiaowei Ren <[email protected]>

---------

Signed-off-by: Xiaowei Ren <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
huanghua1994 and others added 29 commits October 10, 2024 09:37
* Expose JAX sliding window attn API

Signed-off-by: Hua Huang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* No SWA in context parallel; fix RNG seed in test

Signed-off-by: Hua Huang <[email protected]>

* Handle SWA API discrepancy between cuDNN and Python

Signed-off-by: Hua Huang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add SWA API for flax, all tests passed

Will update tests/jax/test_praxis_layers.py next

Signed-off-by: Hua Huang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update test_praxis_layers.py for SWA, test passed

Signed-off-by: Hua Huang <[email protected]>

* Use tuple window_size; update for PR #1212

Signed-off-by: Hua Huang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add and adjust some pytest.skip

Signed-off-by: Hua Huang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Revised following Reese Wang's comments

Still need further debugging:
FAILED test_fused_attn.py::TestFusedAttn::test_backward[NO_SWA-DROP_0.0-4-128-256-16-16-64-BF16-CROSS-KV_PACKED-NO_MASK-NO_BIAS] - AssertionError:
FAILED test_fused_attn.py::TestFusedAttn::test_backward[NO_SWA-DROP_0.0-4-128-256-16-16-64-BF16-CROSS-KV_PACKED-NO_MASK-POST_SCALE_BIAS-1HSS] - AssertionError:
FAILED test_fused_attn.py::TestFusedAttn::test_backward[NO_SWA-DROP_0.0-4-128-256-16-16-64-BF16-CROSS-SEPARATE-NO_MASK-NO_BIAS] - AssertionError:
FAILED test_fused_attn.py::TestFusedAttn::test_backward[NO_SWA-DROP_0.0-4-128-256-16-16-64-BF16-CROSS-SEPARATE-NO_MASK-POST_SCALE_BIAS-1HSS] - AssertionError:

These errors do not exist in the previous commit

Signed-off-by: Hua Huang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix no-SWA test case errors in previous commit

Signed-off-by: Hua Huang <[email protected]>

* Add Padding mask w/ sliding windows sanity tests

Signed-off-by: Reese Wang <[email protected]>

* Use float32 for the reference code softmax calculation

Signed-off-by: Reese Wang <[email protected]>

---------

Signed-off-by: Hua Huang <[email protected]>
Signed-off-by: Reese Wang <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Reese Wang <[email protected]>
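
On the "use float32 for the reference code softmax calculation" item above: upcasting the reference keeps the comparison baseline from being limited by bf16/fp16 precision. A generic NumPy sketch of such a reference:

```python
import numpy as np

def reference_softmax(logits: np.ndarray) -> np.ndarray:
    x = logits.astype(np.float32)          # upcast before the reduction
    x = x - x.max(axis=-1, keepdims=True)  # subtract the row max for stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)
```
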
* Fixes to Float8Tensor

Signed-off-by: Przemyslaw Tredak <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Przemyslaw Tredak <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Fix bug in torch compile when seqdim is an integer

Signed-off-by: 李金梁 <[email protected]>

* Update attention.py

change the jit_fuser to torch.compile on flash_attn_fwd_out_correction

Signed-off-by: 李金梁 <[email protected]>

* Annotate fused functions

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

---------

Signed-off-by: 李金梁 <[email protected]>
Signed-off-by: Kirthi Shankar Sivamani <[email protected]>
Co-authored-by: Kirthi Shankar Sivamani <[email protected]>
* fa2 function import renaming

Signed-off-by: Xiaowei Ren <[email protected]>

* refine fa_fwd_kwargs and fa_bwd_kwargs

Signed-off-by: Xiaowei Ren <[email protected]>

* import FA3 functions for CP

Signed-off-by: Xiaowei Ren <[email protected]>

* fix output of FA3 fwd

Signed-off-by: Xiaowei Ren <[email protected]>

* fix rng_state in a2a implementation with FA3

Signed-off-by: Xiaowei Ren <[email protected]>

* hack lse correction for packed lse format

Signed-off-by: Xiaowei Ren <[email protected]>

* make CP thd out correction work with packed lse

Signed-off-by: Xiaowei Ren <[email protected]>

* fix for packed softmax_lse

Signed-off-by: Xiaowei Ren <[email protected]>

* fix softmax_lse shape

Signed-off-by: Xiaowei Ren <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* change lse_packed to constexpr

Signed-off-by: Xiaowei Ren <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Xiaowei Ren <[email protected]>
Signed-off-by: Charlene Yang <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Charlene Yang <[email protected]>
* Let Fused RoPE support THD with CP

Signed-off-by: Xin Yao <[email protected]>

* add comment

Signed-off-by: Xin Yao <[email protected]>

---------

Signed-off-by: Xin Yao <[email protected]>
Co-authored-by: Xiaowei Ren <[email protected]>
#1227)

Update test to check support for context parallel attention.

Signed-off-by: Michael Goldfarb <[email protected]>
* Create README.md

added all PyT examples

Signed-off-by: Santosh Bhavani <[email protected]>

* Update README.md

- Added JAX, PaddlePaddle, and third-party examples
- Fixed DL framework links
- Removed issue request for new PRs

Signed-off-by: Santosh Bhavani <[email protected]>

---------

Signed-off-by: Santosh Bhavani <[email protected]>
Signed-off-by: Santosh Bhavani <[email protected]>
Co-authored-by: Kirthi Shankar Sivamani <[email protected]>
* Build custom ORT ops before running ONNX tests

Signed-off-by: Tim Moon <[email protected]>

* Remove ONNX from context parallelism tests

Signed-off-by: Tim Moon <[email protected]>

* Export ONNX ops that do compute in FP32

Matches internal impl of TE kernels.

Signed-off-by: Tim Moon <[email protected]>

* Add build script for custom ORT ops

Signed-off-by: Tim Moon <[email protected]>

---------

Signed-off-by: Tim Moon <[email protected]>
fixed assertion bug for SWA

Signed-off-by: Md Fahim Faysal Khan <[email protected]>
Co-authored-by: Kirthi Shankar Sivamani <[email protected]>
Co-authored-by: Phuong Nguyen <[email protected]>
* WIP: make FA2 optional

Signed-off-by: Charlene Yang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* WIP: fix logic

Signed-off-by: Charlene Yang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix lint

Signed-off-by: Charlene Yang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* minor fixes

Signed-off-by: Charlene Yang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* minor tweak

Signed-off-by: Charlene Yang <[email protected]>

* add L1 test to test all supported FA versions

Signed-off-by: Charlene Yang <[email protected]>

* update version to 2.1.1 and trim L1 tests

Signed-off-by: Charlene Yang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update onnxruntime version

Signed-off-by: Charlene Yang <[email protected]>

* remove onnxruntime from L1 FA versions tests

Signed-off-by: Charlene Yang <[email protected]>

---------

Signed-off-by: Charlene Yang <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Upgrade pylint and first round formatting

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

* round 2

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

* round 3

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

* Format and fixes

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

* Paddle lint

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

* Reviews

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

* Fixes

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

* More linting

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

* Run formatter

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

* Paddle lint

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

* Fixes

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

---------

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>
Fix FP8 activation recompute

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>
Signed-off-by: Przemyslaw Tredak <[email protected]>
#1258)

Fix wgrad for GroupedLinear when weights don't require grad

Signed-off-by: Xin Yao <[email protected]>
Co-authored-by: Kirthi Shankar Sivamani <[email protected]>
* fix bias for 0-dim tensor

Signed-off-by: Xin Yao <[email protected]>

* add check

Signed-off-by: Xin Yao <[email protected]>

* use numel() instead of nullptr

Signed-off-by: Xin Yao <[email protected]>

---------

Signed-off-by: Xin Yao <[email protected]>
* register CmdBufferCompatible traits via C++ API

* renamed FFI_Traits

* use register_ffi_target()

---------

Signed-off-by: Phuong Nguyen <[email protected]>
fix seq_dim in CP implementation

Signed-off-by: Xiaowei Ren <[email protected]>
* Reorganize PyTorch L1 tests

Signed-off-by: Tim Moon <[email protected]>

* Move ONNX tests to L1

Signed-off-by: Tim Moon <[email protected]>

* Move FA version test to L3

Signed-off-by: Tim Moon <[email protected]>

* Limit parallel build jobs in FA version test

Signed-off-by: Tim Moon <[email protected]>

---------

Signed-off-by: Tim Moon <[email protected]>
* Debug wheel test for PaddlePaddle

Signed-off-by: Tim Moon <[email protected]>

* Fix typo

Signed-off-by: Tim Moon <[email protected]>

---------

Signed-off-by: Tim Moon <[email protected]>
Remove PyTorch L0 distributed test

Forgot to remove in #1255.

Signed-off-by: Tim Moon <[email protected]>
remove one FA version in the L3 test

Signed-off-by: Charlene Yang <[email protected]>
…1230)

* Use 64-bit offsets for cuDNN 9.5+
* Align workspace tensors to 16B.
* Fix bug where std::accumulate overflowed on large tensor shapes.
* Only support 64-bit offsets on arbitrary sequence length fp16 backend.

Signed-off-by: Michael Goldfarb <[email protected]>
* Skip encoder tests on V100

* Fix multiprocessing jax.distributed.init

* Remove XLA xla_gpu_deterministic_ops which causes segfault

---------

Signed-off-by: Reese Wang <[email protected]>
Add THD + GQA supports for cuDNN >= 9.6

Signed-off-by: Reese Wang <[email protected]>
…rics check in unit tests (#1282)

Fix correctness of JAX fused attention with CP.

Signed-off-by: Michael Goldfarb <[email protected]>
…, ActLuFP8, LayerNormForwardFP8FFI, and LayerNormBackwardFFI (#1263)

* Add TransposeFFI, test passed

Signed-off-by: Hua Huang <[email protected]>

* Add ActLuFP8FFI; fix TransposeFFI

Signed-off-by: Hua Huang <[email protected]>

* Add QuantizeFFI

Signed-off-by: Hua Huang <[email protected]>

* Add FusedAttnForwardFFI and some unit tests

Signed-off-by: Hua Huang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Minor fix

Signed-off-by: Hua Huang <[email protected]>

* Add LayerNormForwardFP8FFI & LayerNormBackwardFFI

Signed-off-by: Hua Huang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Revise FusedAttnForwardFFI()

Signed-off-by: Hua Huang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add FFI_CudaGraph_Traits

All tests passed, ready for merge

Signed-off-by: Hua Huang <[email protected]>

* Bug fix for FFI data type mismatch

Also add a safeguard at the entry point of the FFI function

Signed-off-by: Hua Huang <[email protected]>

---------

Signed-off-by: Hua Huang <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Update class names for Paddle 3.0

Signed-off-by: Tim Moon <[email protected]>
* update test numerics

Signed-off-by: Pawel Gadzinski <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update test numerics

Signed-off-by: Pawel Gadzinski <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update test numerics

Signed-off-by: Pawel Gadzinski <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update tests/pytorch/test_numerics.py

Co-authored-by: Tim Moon <[email protected]>
Signed-off-by: Paweł Gadziński <[email protected]>

* tests fix

Signed-off-by: Pawel Gadzinski <[email protected]>

* Fixes for failing CI

Signed-off-by: Pawel Gadzinski <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fixes for failing CI

Signed-off-by: Pawel Gadzinski <[email protected]>

* Fix key

Co-authored-by: Tim Moon <[email protected]>
Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

* fixes

Signed-off-by: Pawel Gadzinski <[email protected]>

---------

Signed-off-by: Pawel Gadzinski <[email protected]>
Signed-off-by: Paweł Gadziński <[email protected]>
Signed-off-by: Kirthi Shankar Sivamani <[email protected]>
Co-authored-by: Pawel Gadzinski <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tim Moon <[email protected]>
Co-authored-by: Kirthi Shankar Sivamani <[email protected]>
@phu0ngng phu0ngng merged commit 7b284fe into phu0ngng:main Oct 25, 2024