Don't implicitly use elapsed_time in autotuner #3036

Closed
anmyachev wants to merge 18 commits

Conversation

@anmyachev (Contributor) commented Dec 17, 2024

The main idea of this pull request is to stop using elapsed_time, which enables profiling mode for SYCL queues; that mode is not needed when profiling with PyTorch and PTI.
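
For context, here is a minimal sketch of the two timing approaches (hypothetical code, not from this PR; the torch.xpu event/synchronize calls mirror the torch.cuda API and are assumptions here):

```python
import time

import torch

def wall_time_ms(fn, n_repeat=10):
    # Host-side wall-clock timing: no profiling-enabled SYCL queue required.
    torch.xpu.synchronize()                    # drain pending work first
    start = time.perf_counter()
    for _ in range(n_repeat):
        fn()
    torch.xpu.synchronize()                    # wait for all launched kernels
    return (time.perf_counter() - start) * 1000 / n_repeat

def event_elapsed_time_ms(fn):
    # Event-based timing: elapsed_time needs device timestamps, which on XPU
    # forces the SYCL queue into profiling mode -- what this PR avoids.
    start = torch.xpu.Event(enable_timing=True)
    end = torch.xpu.Event(enable_timing=True)
    start.record()
    fn()
    end.record()
    torch.xpu.synchronize()
    return start.elapsed_time(end)
```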

CI runs:

@anmyachev marked this pull request as ready for review December 18, 2024 13:51
anmyachev added a commit that referenced this pull request Dec 18, 2024
Signed-off-by: Anatoly Myachev <[email protected]>
@anmyachev (Contributor Author) commented

@whitneywhtsang we can try the changes in #2484 on the DLE runner, but we need to cherry-pick 2a4b818 into Pavel's branch.

@whitneywhtsang (Contributor) commented

> @whitneywhtsang we can try the changes in #2484 on the DLE runner, but we need to cherry-pick 2a4b818 into Pavel's branch.

Let's cherry-pick this PR to ptdb-dle-runner.

@anmyachev (Contributor Author) commented Dec 18, 2024

> @whitneywhtsang we can try the changes in #2484 on the DLE runner, but we need to cherry-pick 2a4b818 into Pavel's branch.
>
> Let's cherry-pick this PR to ptdb-dle-runner.

ok, but let's use 2a4b818 (the last commit in #2484), which is compatible with the changes on Pavel's branch

whitneywhtsang pushed a commit that referenced this pull request Dec 18, 2024
Signed-off-by: Anatoly Myachev <[email protected]>
anmyachev added a commit that referenced this pull request Dec 18, 2024
@whitneywhtsang (Contributor) commented

Please rebase this PR.

@anmyachev (Contributor Author) commented

> Please rebase this PR.

done

@whitneywhtsang (Contributor) commented Dec 30, 2024

I like the idea of this PR, but it looks like performance is not better:
[screenshot: performance dashboard, 2024-12-30]
Let's rerun after the Agama update to see if there will be any performance difference.

@anmyachev (Contributor Author) commented

> I like the idea of this PR, but it looks like performance is not better

This performance difference may be due to the different number of warm-up runs of the function. I use the interface of our functions, which warms up a fixed number of times (10), instead of running for only 10 milliseconds, as is the default in Triton:
return do_bench(*args, n_warmup=10, n_repeat=10, **kwargs)
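
A hypothetical sketch of the fixed-count loop described above, as opposed to upstream Triton's time-budget-based do_bench (the function name and the torch.xpu.synchronize calls are assumptions, not the PR's actual code):

```python
import time

import torch

def do_bench_fixed(fn, n_warmup=10, n_repeat=10):
    # Warm up an exact number of times instead of for a fixed time budget.
    for _ in range(n_warmup):
        fn()
    torch.xpu.synchronize()
    # Measure an exact number of runs with wall-clock timing.
    times_ms = []
    for _ in range(n_repeat):
        start = time.perf_counter()
        fn()
        torch.xpu.synchronize()                # include the kernel's full runtime
        times_ms.append((time.perf_counter() - start) * 1000)
    return sum(times_ms) / len(times_ms)       # mean time in ms
```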

@whitneywhtsang (Contributor) commented

> This performance difference may be due to the different number of warm-up runs of the function.

We could do a run with upstream do_bench changed to use an exact number of runs (without this PR) and see if there are any performance differences, to isolate the reason. (We could do that after the Agama update.)

@anmyachev (Contributor Author) commented

@whitneywhtsang it seems that the performance is improving from this change. Could you double-check, just as you looked at the dashboards before?
[screenshot: performance dashboard]

@whitneywhtsang (Contributor) commented

> @whitneywhtsang it seems that the performance is improving from this change. Could you double-check, just as you looked at the dashboards before?

This change should not have any performance impact on XeTLA. Since all 3 dots are slightly higher, and one of them is XeTLA, I would think the machine is in a good state, rather than this being due to the change. There is also potentially a performance drop for GEMM with the advanced path.

[screenshot: performance dashboard, 2025-01-13]

@anmyachev (Contributor Author) commented Jan 15, 2025

@whitneywhtsang should be better now. Please take a look. I can run it a couple more times to be sure.

Comment on lines +191 to +192
n_warmup = max(1, int(warmup_time / estimate_ms))
n_repeat = max(1, int(rep_time / estimate_ms))
@anmyachev (Contributor Author) commented


The iteration-determination procedure is kept as similar as possible to the one used before. I believe the changes in the results can be fully attributed to the transition from implicit elapsed_time timing to simple wall-clock timing.
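
For illustration, a minimal sketch of how such counts could be derived from a single timed run (the function name and the warmup_time/rep_time millisecond budgets are assumptions, modeled on upstream do_bench's warmup=25, rep=100 defaults):

```python
import time

import torch

def pick_iteration_counts(fn, warmup_time=25.0, rep_time=100.0):
    fn()                                       # throwaway run to trigger compilation
    torch.xpu.synchronize()
    start = time.perf_counter()
    fn()
    torch.xpu.synchronize()
    estimate_ms = (time.perf_counter() - start) * 1000
    # Size the loops to fill the same time budgets upstream do_bench targets;
    # these two lines mirror the snippet quoted above from the PR.
    n_warmup = max(1, int(warmup_time / estimate_ms))
    n_repeat = max(1, int(rep_time / estimate_ms))
    return n_warmup, n_repeat
```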

@anmyachev (Contributor Author) commented

> This performance difference may be due to the different number of warm-up runs of the function.
>
> We could do a run with upstream do_bench changed to use an exact number of runs (without this PR) and see if there are any performance differences, to isolate the reason. (We could do that after the Agama update.)

Looks like I have time; I'll try.

@whitneywhtsang (Contributor) commented

> @whitneywhtsang should be better now. Please take a look. I can run it a couple more times to be sure.

This change itself looks good to me, but from the current performance results, there is no evidence of a performance improvement. Do you have the same observation?

@anmyachev (Contributor Author) commented

> @whitneywhtsang should be better now. Please take a look. I can run it a couple more times to be sure.
>
> This change itself looks good to me, but from the current performance results, there is no evidence of a performance improvement. Do you have the same observation?

More or less, yes; however, for softmax the improvement is noticeable.

@anmyachev (Contributor Author) commented

@whitneywhtsang ok, it seems the benefit of this change is less than the support costs. I suggest closing both the pull request and the issue. WDYT?

@whitneywhtsang (Contributor) commented

> @whitneywhtsang ok, it seems the benefit of this change is less than the support costs. I suggest closing both the pull request and the issue. WDYT?

I agree. Thanks for performing all the experiments.

Successfully merging this pull request may close these issues.

Don't implicitly use elapsed_time in autotuner when profiling with PyTorch and PTI