Alternative work distribution #36
Conversation
It is unreadable with the TAB changes.
Hopefully there is an option in GitHub for not showing these differences...
include/render.hpp (outdated):
const auto th_max_y = start_y + pg_height;
const auto max_y = (th_max_y > height) ? height : th_max_y;
for (auto y = start_y; y < max_y; ++y) {
  for (auto x = start_x; x < max_x; ++x) {
Can you explain somewhere in the comments what optimization problem you are trying to solve? Why these loop nests in the case of a `parallel_for`?
> Why these loop nests in the case of a `parallel_for`?
On my machine, using one thread per pixel led to an under-utilization of the CPU (each core sat at around 70-80% activity) because of scheduling issues.
On the GPU, if we have more cores than pixels, it should "degenerate" into using one core per work item (but I think that case is not handled well: the find-immediate-divider helper should return 1 instead of 0 when it reaches the end of the function).
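For reference, here is a minimal sketch of the tiling idea described above, written in SYCL 2020 style and simplified to one dimension; it is not the project's actual code, and the `find_immediate_divider` helper below is a hypothetical reconstruction that already includes the suggested fallback of 1 instead of 0.

```cpp
#include <sycl/sycl.hpp>

// Largest divisor of n not exceeding upper_bound; falls back to 1 (one row
// per work item) rather than 0 when nothing is found.
constexpr int find_immediate_divider(int n, int upper_bound) {
  for (int d = upper_bound; d > 1; --d)
    if (n % d == 0)
      return d;
  return 1;
}

void render_tiled(sycl::queue &q, int width, int height, int n_threads) {
  // Tile height chosen so that roughly one tile lands on each hardware thread.
  const int pg_height = find_immediate_divider(height, height / n_threads);

  q.parallel_for(sycl::range<1>(height / pg_height), [=](sycl::id<1> tile) {
    const int start_y = static_cast<int>(tile[0]) * pg_height;
    const int th_max_y = start_y + pg_height;
    const int max_y = (th_max_y > height) ? height : th_max_y;
    for (int y = start_y; y < max_y; ++y)
      for (int x = 0; x < width; ++x) {
        /* shade pixel (x, y) */
      }
  });
}
```

With far more cores than rows (the GPU case), this sketch degrades to a tile height of 1, i.e. one row per work item.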
I'm waiting for your comment on this strategy before documenting it in the code comments.
On FPGA I think the problem is more complicated, because you will want to "tune" the replication of the loop body to find a good trade-off between area (to be able to pack other kernels onto the same part without requiring reconfiguration) and execution time.
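To make the area/time knob concrete, here is a rough sketch assuming an Intel oneAPI style FPGA flow, where `#pragma unroll` with a factor controls how many copies of the loop body are instantiated in hardware; the factor of 4 is a hypothetical tuning value, not something taken from this project.

```cpp
#include <sycl/sycl.hpp>

void render_fpga(sycl::queue &q, int width, int height) {
  q.single_task([=] {
    for (int y = 0; y < height; ++y) {
      // Replicate the inner loop body 4 times: more copies cost more area
      // but shorten execution time; 1 would keep a single, smaller pipeline.
      #pragma unroll 4
      for (int x = 0; x < width; ++x) {
        /* shade pixel (x, y) */
      }
    }
  });
}
```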
There should not be one thread per pixel.
By looking at the code again, the problem is perhaps that there is an explicit `nd_range` distribution of the parallelism. On CPU this is painful because you need to fight against possible barriers. This is why there is a macro to swear there are no barriers in use...
What about just using a simple `range` `parallel_for` and just trusting the runtime to distribute the work?
For triSYCL you could try the TBB runtime too.
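A minimal sketch of that suggestion in SYCL 2020 style (the buffer and accessor names are illustrative, not the project's): a plain two-dimensional range with one logical work item per pixel, no work-group size and no barriers, so the runtime (or TBB under triSYCL) decides how to pack work items onto threads.

```cpp
#include <sycl/sycl.hpp>

void render(sycl::queue &q, sycl::buffer<float, 2> &frame) {
  q.submit([&](sycl::handler &cgh) {
    auto pixels = frame.get_access<sycl::access::mode::write>(cgh);
    cgh.parallel_for(frame.get_range(), [=](sycl::id<2> idx) {
      // How work items are grouped onto threads is left to the runtime.
      pixels[idx] = 0.f; // placeholder for shading pixel (idx[1], idx[0])
    });
  });
}
```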
> By looking at the code again, the problem is perhaps that there is an explicit `nd_range` distribution of the parallelism.
Indeed, there was an explicit `nd_range` creating 64 work items per group, organised in an 8x8 grid. I wasn't aware of the alternative `parallel_for` API.
> What about just using a simple `range` `parallel_for` and just trusting the runtime to distribute the work?
That seems even better indeed. Done in 32baf38
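For readers following along, the change roughly amounts to the following (illustrative names only, not the actual diff of 32baf38):

```cpp
#include <sycl/sycl.hpp>

// Before: explicit nd_range with 8x8 work-groups (64 work items per group).
void render_nd_range(sycl::queue &q, size_t width, size_t height) {
  q.parallel_for(sycl::nd_range<2>{{height, width}, {8, 8}},
                 [=](sycl::nd_item<2> it) {
                   /* shade pixel at it.get_global_id() */
                 });
}

// After: a plain range, with no work-group geometry imposed on the runtime.
void render_range(sycl::queue &q, size_t width, size_t height) {
  q.parallel_for(sycl::range<2>{height, width},
                 [=](sycl::id<2> idx) {
                   /* shade pixel (idx[1], idx[0]) */
                 });
}
```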
However, given this alternative API I don't see why there is a need for a "one_single_task" version for FPGA: the semantics of `parallel_for` should be sufficient, and the backend compiler would then be responsible for choosing whether to replicate the data flow many times or to use loops.
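As a sketch of that argument (standard SYCL 2020 selectors; the FPGA selector is implementation specific, so it is only hinted at as a placeholder), the same plain `parallel_for` kernel can be handed to whichever device queue is wanted, and the backend compiler decides how to lower it:

```cpp
#include <sycl/sycl.hpp>

template <typename Selector>
void render_on(const Selector &select_device, int width, int height) {
  sycl::queue q{select_device};
  // Same kernel source for every device; the FPGA backend would choose
  // between replicating the data path and pipelining the loop.
  q.parallel_for(sycl::range<2>(height, width), [=](sycl::id<2> idx) {
    /* shade pixel (idx[1], idx[0]) */
  });
  q.wait();
}

// Usage, e.g.: render_on(sycl::cpu_selector_v, w, h);
// For an FPGA run, pass the vendor-provided FPGA selector instead.
```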
Sure. But the `parallel_for` might be less efficient on FPGA.
Rewriting with nested lambdas to solve #33, plus alternative scheduling to allow better cache locality, plus avoiding too many threads on the CPU.