Don't use elapsed_time implicitly in autotuner #3036
Conversation
Signed-off-by: Anatoly Myachev <[email protected]>
@whitneywhtsang we can try the changes from #2484 on the DLE runner, but we need to cherry-pick 2a4b818 into Pavel's branch.
Resolved review threads (outdated):
benchmarks/triton_kernels_benchmark/gemm_postop_addmatrix_benchmark.py
benchmarks/triton_kernels_benchmark/gemm_postop_gelu_benchmark.py
benchmarks/triton_kernels_benchmark/gemm_preop_exp_benchmark.py
Let's cherry-pick this PR to …
OK, but let's use 2a4b818 (the last commit in #2484), which is compatible with the changes on Pavel's branch.
Signed-off-by: Anatoly Myachev <[email protected]>
This reverts commit 2a4b818.
Please rebase this PR.
Signed-off-by: Anatoly Myachev <[email protected]>
Signed-off-by: Anatoly Myachev <[email protected]>
Signed-off-by: Anatoly Myachev <[email protected]>
Co-authored-by: Whitney Tsang <[email protected]>
Signed-off-by: Anatoly Myachev <[email protected]>
Force-pushed from c21c92a to 5710fd1
Done.
This performance difference may be due to a different number of warm-up runs. I use our functions' interface, which warms up a fixed number of times (10), instead of warming up for only 10 milliseconds, as is the default in Triton:
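For illustration, a minimal sketch of the fixed-count approach (the helper name and loop counts are hypothetical, and it assumes a PyTorch build with XPU support where torch.xpu.synchronize is available):

```python
import time

import torch

def bench_fixed_warmup(fn, n_warmup=10, n_repeat=100):
    # Warm up a fixed number of times (10 here), instead of warming up
    # for a fixed time budget as upstream Triton's do_bench does.
    for _ in range(n_warmup):
        fn()
    torch.xpu.synchronize()

    # Time each run with simple wall timing, no per-run events.
    times = []
    for _ in range(n_repeat):
        t0 = time.perf_counter()
        fn()
        torch.xpu.synchronize()
        times.append((time.perf_counter() - t0) * 1000)  # ms
    return sum(times) / len(times)
```

Usage would be, e.g., `bench_fixed_warmup(lambda: torch.mm(a, b))`.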
We could do a run with upstream do_bench changed to use an exact number of runs (without this PR) and see whether there are any performance differences, to isolate the reason. (We could do that after the Agama update.)
…-triton into amyachev/autotuner
@whitneywhtsang it seems that performance is improving from this change. Could you double-check, the way you looked at the dashboards before?
This change should not have a performance impact on XeTLA. Since all three dots are slightly higher, and one of them is XeTLA, I would think the machine is in a good state, rather than this being due to the change. There is also potentially a performance drop for GEMM on the advanced path.
Signed-off-by: Anatoly Myachev <[email protected]>
…-triton into amyachev/autotuner
Signed-off-by: Anatoly Myachev <[email protected]>
Signed-off-by: Anatoly Myachev <[email protected]>
…-triton into amyachev/autotuner
…bench Signed-off-by: Anatoly Myachev <[email protected]>
Signed-off-by: Anatoly Myachev <[email protected]>
Signed-off-by: Anatoly Myachev <[email protected]>
@whitneywhtsang it should be better now. Please take a look. I can run it a couple more times to be sure.
n_warmup = max(1, int(warmup_time / estimate_ms))
n_repeat = max(1, int(rep_time / estimate_ms))
The iteration-determination procedure is as similar as possible to the one used before. I believe the changes in the results can be fully attributed to the transition from implicit elapsed_time timing to simple wall timing.
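As a rough sketch of that procedure (the wrapper name and the warmup_time / rep_time budgets in milliseconds are illustrative, assuming a PyTorch XPU build), the loop counts are sized from a single-run estimate, and the measurement itself uses wall timing rather than event-based elapsed_time:

```python
import time

import torch

def run_and_time(fn, warmup_time=25.0, rep_time=100.0):
    # Estimate a single run (ms) to size the warm-up and repeat loops,
    # mirroring the n_warmup / n_repeat computation above.
    torch.xpu.synchronize()
    t0 = time.perf_counter()
    fn()
    torch.xpu.synchronize()
    estimate_ms = max((time.perf_counter() - t0) * 1000, 1e-6)  # avoid div by 0

    n_warmup = max(1, int(warmup_time / estimate_ms))
    n_repeat = max(1, int(rep_time / estimate_ms))

    for _ in range(n_warmup):
        fn()
    torch.xpu.synchronize()

    # Simple wall timing over the whole batch: no per-run elapsed_time
    # events, and hence no profiling-enabled queue required.
    t0 = time.perf_counter()
    for _ in range(n_repeat):
        fn()
    torch.xpu.synchronize()
    return (time.perf_counter() - t0) * 1000 / n_repeat
```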
It looks like I have time; I'll try.
This change itself looks good to me, but from the current performance results there is no evidence of a performance improvement. Do you have the same observation?
More or less, yes; however, for softmax the improvement is noticeable.
@whitneywhtsang OK, it seems the benefit of this change is smaller than its maintenance cost. I suggest closing both the pull request and the issue. WDYT?
I agree. Thanks for performing all the experiments.
The main idea of this pull request is to avoid using elapsed_time, which enables profiling mode for SYCL queues; this is not needed for profiling with PyTorch and PTI. CI runs:
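To illustrate the distinction, a minimal sketch (assuming a PyTorch build with XPU support; the exact queue requirements of torch.xpu.Event may vary by backend version):

```python
import time

import torch

x = torch.randn(4096, 4096, device="xpu")

# Event-based timing: elapsed_time implicitly requires the underlying
# SYCL queue to run in profiling mode.
start = torch.xpu.Event(enable_timing=True)
end = torch.xpu.Event(enable_timing=True)
start.record()
y = x @ x
end.record()
torch.xpu.synchronize()
print("event-based, ms:", start.elapsed_time(end))

# Wall timing: no profiling mode needed, so it does not interfere with
# external profilers such as the PyTorch profiler or PTI.
torch.xpu.synchronize()
t0 = time.perf_counter()
y = x @ x
torch.xpu.synchronize()
print("wall time, ms:", (time.perf_counter() - t0) * 1000)
```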