Don't implicitly use elapsed_time in autotuner #3036

Closed
anmyachev wants to merge 18 commits

Conversation

@anmyachev (Contributor) commented Dec 17, 2024

The main idea of this pull request is to stop using elapsed_time, which enables profiling mode for SYCL queues; that mode is not needed when profiling with PyTorch and PTI.
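
For context, here is a minimal sketch of the two timing approaches (hypothetical code, not from this PR; the torch.xpu event/synchronize calls mirror the torch.cuda API and are assumptions here):

```python
import time

import torch

def wall_time_ms(fn, n_repeat=10):
    # Host-side wall-clock timing: no profiling-enabled SYCL queue required.
    torch.xpu.synchronize()                    # drain pending work first
    start = time.perf_counter()
    for _ in range(n_repeat):
        fn()
    torch.xpu.synchronize()                    # wait for all launched kernels
    return (time.perf_counter() - start) * 1000 / n_repeat

def event_elapsed_time_ms(fn):
    # Event-based timing: elapsed_time needs device timestamps, which on XPU
    # forces the SYCL queue into profiling mode -- what this PR avoids.
    start = torch.xpu.Event(enable_timing=True)
    end = torch.xpu.Event(enable_timing=True)
    start.record()
    fn()
    end.record()
    torch.xpu.synchronize()
    return start.elapsed_time(end)
```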

CI runs:

@anmyachev marked this pull request as ready for review December 18, 2024 13:51
anmyachev added a commit that referenced this pull request Dec 18, 2024
Signed-off-by: Anatoly Myachev <[email protected]>
@anmyachev (Contributor Author) commented

@whitneywhtsang we can try the changes in #2484 on the DLE runner, but we need to cherry-pick 2a4b818 into Pavel's branch.

@whitneywhtsang (Contributor) commented

> @whitneywhtsang we can try the changes in #2484 on the DLE runner, but we need to cherry-pick 2a4b818 into Pavel's branch.

Let's cherry-pick this PR to ptdb-dle-runner.

@anmyachev (Contributor Author) commented Dec 18, 2024

> @whitneywhtsang we can try the changes in #2484 on the DLE runner, but we need to cherry-pick 2a4b818 into Pavel's branch.
>
> Let's cherry-pick this PR to ptdb-dle-runner.

ok, but let's use 2a4b818 (the last commit in #2484), which is compatible with the changes on Pavel's branch

whitneywhtsang pushed a commit that referenced this pull request Dec 18, 2024
Signed-off-by: Anatoly Myachev <[email protected]>
anmyachev added a commit that referenced this pull request Dec 18, 2024
@whitneywhtsang (Contributor) commented

Please rebase this PR.

@anmyachev (Contributor Author) commented

> Please rebase this PR.

done

@whitneywhtsang (Contributor) commented Dec 30, 2024

I like the idea of this PR, but it looks like performance is not better:
[screenshot: performance dashboard, 2024-12-30]
Let's rerun after the Agama update to see if there will be any performance difference.

@anmyachev (Contributor Author) commented

> I like the idea of this PR, but it looks like performance is not better

This performance difference may be due to the different number of warm-up runs of the function. I use the interface of our functions, which warms up a fixed number of times (10), instead of running for only 10 milliseconds, as is the default in Triton:
return do_bench(*args, n_warmup=10, n_repeat=10, **kwargs)
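
A hypothetical sketch of the fixed-count loop described above, as opposed to upstream Triton's time-budget-based do_bench (the function name and the torch.xpu.synchronize calls are assumptions, not the PR's actual code):

```python
import time

import torch

def do_bench_fixed(fn, n_warmup=10, n_repeat=10):
    # Warm up an exact number of times instead of for a fixed time budget.
    for _ in range(n_warmup):
        fn()
    torch.xpu.synchronize()
    # Measure an exact number of runs with wall-clock timing.
    times_ms = []
    for _ in range(n_repeat):
        start = time.perf_counter()
        fn()
        torch.xpu.synchronize()                # include the kernel's full runtime
        times_ms.append((time.perf_counter() - start) * 1000)
    return sum(times_ms) / len(times_ms)       # mean time in ms
```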

@whitneywhtsang (Contributor) commented

> This performance difference may be due to the different number of warm-up runs of the function.

We could do a run with upstream do_bench changed to use an exact number of runs (without this PR) and see if there are any performance differences, to isolate the reason. (We could do that after the Agama update.)

@anmyachev (Contributor Author) commented

@whitneywhtsang it seems that the performance is improving from this change. Could you double-check, just as you looked at the dashboards before?
[screenshot: performance dashboard]

@whitneywhtsang (Contributor) commented

> @whitneywhtsang it seems that the performance is improving from this change. Could you double-check, just as you looked at the dashboards before?

This change should not have any performance impact on XeTLA. Since all 3 dots are slightly higher, and one of them is XeTLA, I would think the machine is in a good state, rather than this being due to the change. There is also potentially a performance drop for GEMM with the advanced path.

[screenshot: performance dashboard, 2025-01-13]

@anmyachev (Contributor Author) commented Jan 15, 2025

@whitneywhtsang should be better now. Please take a look. I can run it a couple more times to be sure.

Comment on lines +191 to +192
n_warmup = max(1, int(warmup_time / estimate_ms))
n_repeat = max(1, int(rep_time / estimate_ms))
@anmyachev (Contributor Author) commented


The iteration-determination procedure is kept as similar as possible to the one used before. I believe the changes in the results can be fully attributed to the transition from implicit elapsed_time timing to simple wall-clock timing.
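
For illustration, a minimal sketch of how such counts could be derived from a single timed run (the function name and the warmup_time/rep_time millisecond budgets are assumptions, modeled on upstream do_bench's warmup=25, rep=100 defaults):

```python
import time

import torch

def pick_iteration_counts(fn, warmup_time=25.0, rep_time=100.0):
    fn()                                       # throwaway run to trigger compilation
    torch.xpu.synchronize()
    start = time.perf_counter()
    fn()
    torch.xpu.synchronize()
    estimate_ms = (time.perf_counter() - start) * 1000
    # Size the loops to fill the same time budgets upstream do_bench targets;
    # these two lines mirror the snippet quoted above from the PR.
    n_warmup = max(1, int(warmup_time / estimate_ms))
    n_repeat = max(1, int(rep_time / estimate_ms))
    return n_warmup, n_repeat
```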

@anmyachev (Contributor Author) commented

> This performance difference may be due to the different number of warm-up runs of the function.
>
> We could do a run with upstream do_bench changed to use an exact number of runs (without this PR) and see if there are any performance differences, to isolate the reason. (We could do that after the Agama update.)

Looks like I have time; I'll try.

@whitneywhtsang (Contributor) commented

> @whitneywhtsang should be better now. Please take a look. I can run it a couple more times to be sure.

This change itself looks good to me, but from the current performance results, there is no evidence of a performance improvement. Do you have the same observation?

@anmyachev (Contributor Author) commented

> @whitneywhtsang should be better now. Please take a look. I can run it a couple more times to be sure.
>
> This change itself looks good to me, but from the current performance results, there is no evidence of a performance improvement. Do you have the same observation?

More or less, yes; however, for softmax the improvement is noticeable.

@anmyachev (Contributor Author) commented

@whitneywhtsang ok, it seems the benefit of this change is less than the support costs. I suggest closing both the pull request and the issue. WDYT?

@whitneywhtsang (Contributor) commented

> @whitneywhtsang ok, it seems the benefit of this change is less than the support costs. I suggest closing both the pull request and the issue. WDYT?

I agree. Thanks for performing all the experiments.

Successfully merging this pull request may close these issues.

Don't implicitly use elapsed_time in autotuner when profiling with PyTorch and PTI