cuda.parallel: Minor perf improvements #3718
base: main
Conversation
```python
try:
    return np.dtype(arr.dtype)  # type: ignore
except Exception:
    typestr = arr.__cuda_array_interface__["typestr"]

    if typestr.startswith("|V"):
        # it's a structured dtype, use the descr field:
        return np.dtype(arr.__cuda_array_interface__["descr"])
    else:
        # a simple dtype, use the typestr field:
        return np.dtype(typestr)
```
The call to `arr.__cuda_array_interface__` was expensive, so it pays to save the interface into a temporary and reuse it within the function scope. With this approach, we still make repeated calls to `arr.__cuda_array_interface__` (in every one of these functions). We should instead create a class that encapsulates `arr` and provides a cached `__cuda_array_interface__` property. This is likely what the GpuMemoryView class is meant to be. @leofang for comments
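As a rough sketch of that idea (the class and attribute names here are hypothetical, not an existing API), such a wrapper could look like:

```python
from functools import cached_property

import numpy as np


class CachedArrayView:
    """Hypothetical wrapper that queries __cuda_array_interface__ once."""

    def __init__(self, arr):
        self._arr = arr

    @cached_property
    def cai(self) -> dict:
        # The expensive attribute access happens exactly once per array;
        # every later lookup is served from the cached dict.
        return self._arr.__cuda_array_interface__

    @property
    def data_ptr(self) -> int:
        return self.cai["data"][0]

    @property
    def shape(self) -> tuple:
        return self.cai["shape"]

    @property
    def dtype(self) -> np.dtype:
        typestr = self.cai["typestr"]
        if typestr.startswith("|V"):
            return np.dtype(self.cai["descr"])  # structured dtype
        return np.dtype(typestr)
```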
Yes, this PR simply adds a fast path for cupy arrays (and numba device arrays as well). If something else is passed (e.g., a torch tensor), we fall back to using `__cuda_array_interface__`.

I found that `StridedMemoryView` is expensive enough to construct that it doesn't give us sufficient speedup.
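The same fast-path-plus-fallback shape applies to the other attribute helpers; for example, a hedged sketch of the data-pointer lookup (illustrative, not necessarily the exact code in this PR):

```python
def get_data_pointer(arr) -> int:
    try:
        # Fast path: cupy arrays expose the device pointer via .data.ptr.
        return arr.data.ptr  # type: ignore
    except AttributeError:
        # Fallback for anything else: the "data" field of the CUDA Array
        # Interface is a (pointer, read_only) pair.
        return arr.__cuda_array_interface__["data"][0]
```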
> I found that StridedMemoryView is expensive enough to construct that it doesn't give us sufficient speedup.

How was the timing done? I would like to understand this better.
I really think the current implementation of `cuda/parallel/experimental/_utils/protocols.py` should be replaced by `StridedMemoryView`, because what happens now is that the call to `__cuda_array_interface__` is not cached: we repeatedly call it just to get one attribute at a time, instead of caching the result and reusing it for all attribute queries. But once we cache it, it is essentially a re-implementation (of sorts) of `StridedMemoryView`. I don't understand what is going on and would like to get to the bottom of it. `StridedMemoryView` is designed to support all single-CPU/GPU array libraries, beyond just CuPy, a capability that we will eventually need across all CUDA projects.
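For reference, direct use looks roughly like the following (a sketch based on cuda.core's documented experimental API; the import path may shift while the API is experimental):

```python
import cupy as cp
from cuda.core.experimental.utils import StridedMemoryView

arr = cp.arange(1024, dtype=cp.float32)

# One view construction gives access to all the attributes that
# protocols.py currently re-derives from __cuda_array_interface__.
view = StridedMemoryView(arr, stream_ptr=-1)  # -1: skip stream ordering

print(view.ptr)                   # device pointer as a Python int
print(view.shape, view.dtype)     # (1024,) float32
print(view.device_id)             # ordinal of the owning device
print(view.is_device_accessible)  # True for device memory
```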
> How was the timing done? I would like to understand this better.

Consider 9cc2902, where I convert the user-provided arrays to strided memory views and grab the data pointer from those. I even deleted all the contiguity checks and other validation code. The benchmark numbers are then:
```
Benchmark Results (input size, average time with first run, average time without first run):
Input size: 10        | Avg time with first run: 0.00102104 seconds | Avg time without first run: 0.00002196 seconds
Input size: 100       | Avg time with first run: 0.00002170 seconds | Avg time without first run: 0.00002169 seconds
Input size: 1000      | Avg time with first run: 0.00002289 seconds | Avg time without first run: 0.00002302 seconds
Input size: 10000     | Avg time with first run: 0.00002413 seconds | Avg time without first run: 0.00002405 seconds
Input size: 100000    | Avg time with first run: 0.00004833 seconds | Avg time without first run: 0.00004833 seconds
Input size: 1000000   | Avg time with first run: 0.00019740 seconds | Avg time without first run: 0.00019125 seconds
Input size: 10000000  | Avg time with first run: 0.00104487 seconds | Avg time without first run: 0.00104453 seconds
Input size: 100000000 | Avg time with first run: 0.01038157 seconds | Avg time without first run: 0.01038252 seconds
```
This is still a significant improvement over the existing benchmark numbers, but not quite good enough. The overhead here is all in the construction of the `StridedMemoryView`s from `d_in` and `d_out`.
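For anyone trying to reproduce this without the scripts (they are linked in the next comment), the reported output format suggests a harness along these lines; this is a hypothetical stand-in, not the actual bench.py:

```python
import time

import cupy as cp


def bench(run_reduce, sizes, iters=100):
    """Time run_reduce(d_in, d_out) per input size, averaged over iters.

    The first call typically includes JIT compilation, hence the two averages.
    """
    for n in sizes:
        d_in = cp.arange(n, dtype=cp.int32)
        d_out = cp.empty(1, dtype=cp.int32)
        times = []
        for _ in range(iters):
            t0 = time.perf_counter()
            run_reduce(d_in, d_out)
            cp.cuda.runtime.deviceSynchronize()  # include kernel completion
            times.append(time.perf_counter() - t0)
        avg_all = sum(times) / len(times)
        avg_warm = sum(times[1:]) / (len(times) - 1)
        print(
            f"Input size: {n} | Avg time with first run: {avg_all:.8f} seconds"
            f" | Avg time without first run: {avg_warm:.8f} seconds"
        )
```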
Thanks, Ashwin, this helps. Would you mind passing `stream_ptr=-1` to `StridedMemoryView` to bypass the stream ordering, and then pointing me to where `bench.py` is so that I can also try this out myself?
The scripts `bench.py` and `device_reduce.py` can be found here: #3213 (comment)

> Would you mind passing stream_ptr=-1 to StridedMemoryView

This improved things very slightly:
```
Benchmark Results (input size, average time with first run, average time without first run):
Input size: 10        | Avg time with first run: 0.00103634 seconds | Avg time without first run: 0.00002310 seconds
Input size: 100       | Avg time with first run: 0.00002255 seconds | Avg time without first run: 0.00002250 seconds
Input size: 1000      | Avg time with first run: 0.00002412 seconds | Avg time without first run: 0.00002272 seconds
Input size: 10000     | Avg time with first run: 0.00002496 seconds | Avg time without first run: 0.00002491 seconds
Input size: 100000    | Avg time with first run: 0.00005463 seconds | Avg time without first run: 0.00005463 seconds
Input size: 1000000   | Avg time with first run: 0.00018084 seconds | Avg time without first run: 0.00016899 seconds
Input size: 10000000  | Avg time with first run: 0.00106087 seconds | Avg time without first run: 0.00104471 seconds
Input size: 100000000 | Avg time with first run: 0.01038340 seconds | Avg time without first run: 0.01038483 seconds
```
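For readers following along, the tweak above is just the view construction with stream ordering bypassed (same caveats about the experimental cuda.core API as earlier):

```python
from cuda.core.experimental.utils import StridedMemoryView


def make_views(d_in, d_out):
    # stream_ptr=-1 asks the view not to establish stream order between
    # producer and consumer, trimming per-call overhead at the cost of
    # ordering guarantees (fine when the caller handles synchronization).
    return (
        StridedMemoryView(d_in, stream_ptr=-1),
        StridedMemoryView(d_out, stream_ptr=-1),
    )
```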
🟥 CI finished in 5m 58s: Pass: 0%/1 | Total: 5m 58s | Avg: 5m 58s | Max: 5m 58s
Modified projects: +/- python
🏃 Runner counts (total jobs: 1): linux-amd64-gpu-rtx2080-latest-1
Force-pushed from bf5b043 to 4dcfc6f
How do we compare to CuPy now?
🟥 CI finished in 5m 55s: Pass: 0%/1 | Total: 5m 55s | Avg: 5m 55s | Max: 5m 55s
Modified projects: +/- python
🏃 Runner counts (total jobs: 1): linux-amd64-gpu-rtx2080-latest-1
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.
Closer, but not quite there yet. We have ~15us of constant overhead versus CuPy's ~10us. I'll iterate on this PR until we reach parity.
btw I think you meant us (microseconds), not ms (milliseconds). I feel we are pushing to the limit where Python overhead could be something to worry about.
With the latest changes, which rip out all the validation checks we do between the call to […]

We are absolutely there already - this PR is trying to minimize the number of Python operations we're doing in the […]
/ok to test
🟥 CI finished in 6m 06s: Pass: 0%/1 | Total: 6m 06s | Avg: 6m 06s | Max: 6m 06s
Modified projects: +/- python
🏃 Runner counts (total jobs: 1): linux-amd64-gpu-rtx2080-latest-1
/ok to test
🟥 CI finished in 6m 05s: Pass: 0%/1 | Total: 6m 05s | Avg: 6m 05s | Max: 6m 05s
Modified projects: +/- python
🏃 Runner counts (total jobs: 1): linux-amd64-gpu-rtx2080-latest-1
/ok to test
🟩 CI finished in 33m 23s: Pass: 100%/1 | Total: 33m 23s | Avg: 33m 23s | Max: 33m 23s
Modified projects: +/- python
🏃 Runner counts (total jobs: 1): linux-amd64-gpu-rtx2080-latest-1
This reverts commit 9cc2902.
Force-pushed from 0f404e7 to 41f652d
/ok to test
🟩 CI finished in 28m 40s: Pass: 100%/1 | Total: 28m 40s | Avg: 28m 40s | Max: 28m 40s
Modified projects: +/- python
🏃 Runner counts (total jobs: 1): linux-amd64-gpu-rtx2080-latest-1
Description
This PR addresses some of the performance issues found by @oleksandr-pavlyk in #3213. Still WIP - there are broader questions regarding API/behaviour we need to answer before considering the changes in this PR.
Benchmark results before this PR, after this PR, and for CuPy were attached as charts.
Checklist