Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

+50% slow down #33

Open
keryell opened this issue Feb 5, 2021 · 9 comments
Open

+50% slow down #33

keryell opened this issue Feb 5, 2021 · 9 comments
Assignees
Labels
bug Something isn't working

Comments

@keryell
Copy link
Member

keryell commented Feb 5, 2021

I have a slowdown on the recent version of the path_tracer on my CPU laptop.
If you notice the same, can you bisect the change?
Perhaps some objects we copy instead of using by ref (in lambda?).
Check also with the various sanitizers that there are no weird bugs.

@keryell keryell added the bug Something isn't working label Feb 5, 2021
@lforg37
Copy link
Contributor

lforg37 commented Feb 5, 2021

I had also an important slowdown related to copy inside lambda.
On my side #32 fixes the problem.
Can you check if it works for you ?

@keryell
Copy link
Member Author

keryell commented Feb 5, 2021

No. but I think I have found why...

@lforg37
Copy link
Contributor

lforg37 commented Feb 5, 2021

I have a slowdown on the recent version of the path_tracer on my CPU laptop.
If you notice the same, can you bisect the change?

On my experimentation machine (8 virtual cores) I go from 8'50" total execution time (~2' wallclock time) to 14'30" between commits df12334 and fca7dcf .

Interestingly, I tried to remove as many pointer as possible in lforg37:avoid_pointers and the total execution time increases even more (16'31").

@lforg37
Copy link
Contributor

lforg37 commented Feb 5, 2021

It seems that having one render_function using internal lambdas leads to better performance.
With the same settings on 3c3c152 I get 10'15" total execution time.
I have no clue yet where the extra 1'30" can be saved..

@lforg37
Copy link
Contributor

lforg37 commented Feb 5, 2021

It seems there is an overhead when using sycl::access instead of raw pointers : with 7640afd I'm reaching 9'10" total execution time.

I'm not convince the resulting code is way cleaner than using object functions...

@keryell
Copy link
Member Author

keryell commented Feb 9, 2021

So do you mean that just using a triSYCL accessor in a kernel is the slow-down?
For example using a[i] is slower than a.get_pointer()[i]?
Perhaps a bug in accessor implementation...
For now I am using Boost.Multi_Array. At some point we need to switch to md_span and md_array.

@keryell
Copy link
Member Author

keryell commented Feb 9, 2021

On my laptop it runs in 1 min normally, 1min30 with the performance bug. Curious these 10 min.

@lforg37
Copy link
Contributor

lforg37 commented Feb 9, 2021

On my laptop it runs in 1 min normally, 1min30 with the performance bug. Curious these 10 min.

Maybe a quiproquo on the reported time : I'm reporting the total execution time. In wallclock time I'm also having something which is around two minutes on a 8 core machine (Xeon CPU E3-1230 v6 @ 3.50GHz).

So do you mean that just using a triSYCL accessor in a kernel is the slow-down?
For example using a[i] is slower than a.get_pointer()[i]?

It's only part of the slow down. The important part of the slow down was having lambdas captured inside lambdas : hit_world() and get_color() were defined outside of render_pixel(). Moving there definition inside render_pixel() body was already an improvement (3c3c152).

However, replacing trisycl accessor with raw pointer leads also to an improvement (7640afd).

I don't see a conceptual difference between the render_pixel() lambda of 7640afd and the initial function object, but there remain a slight slow down (around 15" total execution time on my setting, so probably something like 2" wall clock time, persistent across multiple run) that I cannot explain.

@keryell
Copy link
Member Author

keryell commented Feb 9, 2021

Have you tried with triSYCL Clang++ and G++?
It would be interesting to compare with DPC++ and hipSYCL in CPU mode.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants