Alternative work distribution #36

Merged
12 commits merged into triSYCL:main on Feb 11, 2021

Conversation

lforg37 (Contributor) commented on Feb 6, 2021:

Rewriting with a nested lambda to solve #33, plus an alternative scheduling to get better cache locality and to avoid spawning too many threads on the CPU.

keryell (Member) left a comment:

It is unreadable with the TAB changes.
Hopefully there is an option in GitHub for not showing these differences...

Resolved review threads on outdated diffs: CMakeLists.txt, include/build_parameters.hpp, include/material.hpp, include/render.hpp, src/main.cpp
lforg37 requested a review from keryell on February 8, 2021 at 09:58.
Resolved review thread on include/render.hpp (outdated diff):
const auto th_max_y = start_y + pg_height;
// Clamp the tile to the image so the last tile does not run past the border
const auto max_y = (th_max_y > height) ? height : th_max_y;
for (auto y = start_y; y < max_y; ++y) {
  for (auto x = start_x; x < max_x; ++x) {
keryell (Member):

Can you explain somewhere in the comments what optimization problem you are trying to solve?
Why these loop nests in the case of a parallel_for?

lforg37 (Contributor, Author):

> Why these loop nests in the case of a parallel_for?

On my machine, using one thread per pixel led to under-utilization of the CPU (each core sat at around 70-80% activity) because of scheduling issues.

On the GPU, if we have more cores than pixels, it should "degenerate" into using one core per work-item (but I think that case is not handled well: the find-immediate-divider function should return 1 when it reaches the end of the search, instead of zero).
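
As a side note, a minimal sketch of the fallback being described here, with a hypothetical helper name (the name and the exact search are illustrative, not the code from this PR): look for the largest divisor of the image dimension that fits under a work-group size limit, and return 1 rather than 0 when nothing else divides evenly.

```cpp
#include <cstddef>

// Hypothetical helper, for illustration only: largest divisor of n that does
// not exceed max_group. Returning 1 (instead of 0) when the search falls
// through keeps the resulting work-group size valid, at worst degenerating
// into one work-item per group.
constexpr std::size_t find_immediate_divider(std::size_t n,
                                             std::size_t max_group) {
  for (std::size_t d = max_group; d > 1; --d)
    if (n % d == 0)
      return d;
  return 1; // fall back to one work-item per group
}
```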

lforg37 (Contributor, Author):

I'm waiting for your opinion on this strategy before writing those comments.

On FPGA I think the problem is more complicated, because you will want to tune the replication of the loop body to find a good trade-off between area (so that other kernels can be packed on the same part without requiring reconfiguration) and execution time.

keryell (Member):

There should not be one thread per pixel.
Looking at the code again, the problem is perhaps that there is an explicit nd_range distribution of the parallelism. On CPU this is painful because you need to fight against possible barriers; this is why there is a macro to swear that no barrier is in use...
What about just using a simple range parallel_for and trusting the runtime to distribute the work?
For triSYCL you could also try the TBB runtime.
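
For context, a minimal sketch of the two forms being contrasted here, in SYCL 2020 style; the function, buffer, and accessor names are illustrative and not taken from this repository:

```cpp
#include <cstddef>
#include <sycl/sycl.hpp>

// Explicit nd_range form: the work-group shape (8x8, i.e. 64 work-items per
// group) is chosen by hand, and the global range must be a multiple of it.
void render_nd_range(sycl::queue& q, sycl::buffer<sycl::float4, 2>& fb,
                     std::size_t width, std::size_t height) {
  q.submit([&](sycl::handler& cgh) {
    sycl::accessor out{fb, cgh, sycl::write_only};
    cgh.parallel_for(sycl::nd_range<2>{{height, width}, {8, 8}},
                     [=](sycl::nd_item<2> it) {
                       auto y = it.get_global_id(0);
                       auto x = it.get_global_id(1);
                       out[y][x] = sycl::float4{}; // shade pixel (x, y) here
                     });
  });
}

// Simple range form: the runtime decides how to group and schedule the work.
void render_simple_range(sycl::queue& q, sycl::buffer<sycl::float4, 2>& fb,
                         std::size_t width, std::size_t height) {
  q.submit([&](sycl::handler& cgh) {
    sycl::accessor out{fb, cgh, sycl::write_only};
    cgh.parallel_for(sycl::range<2>{height, width}, [=](sycl::id<2> idx) {
      out[idx] = sycl::float4{}; // shade pixel at idx here
    });
  });
}
```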

lforg37 (Contributor, Author) commented on Feb 10, 2021:

> By looking at the code again, the problem is perhaps there is an explicit nd_range distribution of the parallelism.

Indeed, there was an explicit nd_range creating 64 work-items per group, organised in an 8x8 grid. I wasn't aware of the alternative parallel_for API.

> What about just using a simple range parallel_for and just trusting the runtime for distributing the work?

That indeed seems even better. Done in 32baf38.

lforg37 (Contributor, Author):

However, given this alternative API, I don't see why there is a need for a "one_single_task" version for FPGA: the semantics of parallel_for should be sufficient, and the backend compiler would then be responsible for choosing whether to replicate the data flow many times or to use loops.
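
For reference, a minimal sketch (again SYCL 2020 style) of what a single_task formulation typically looks like: one work-item iterates over the whole image and the FPGA backend is left to pipeline, and possibly replicate, the loop body. The function and variable names are illustrative, not taken from this repository.

```cpp
#include <cstddef>
#include <sycl/sycl.hpp>

// Single-task form: a single work-item walks every pixel; on FPGA the
// compiler is expected to pipeline (and possibly replicate) the loop body.
void render_single_task(sycl::queue& q, sycl::buffer<sycl::float4, 2>& fb,
                        std::size_t width, std::size_t height) {
  q.submit([&](sycl::handler& cgh) {
    sycl::accessor out{fb, cgh, sycl::write_only};
    cgh.single_task([=] {
      for (std::size_t y = 0; y < height; ++y)
        for (std::size_t x = 0; x < width; ++x)
          out[y][x] = sycl::float4{}; // shade pixel (x, y) here
    });
  });
}
```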

keryell (Member):

Sure. But the parallel_for might be less efficient on FPGA.

Resolved review thread on src/main.cpp.
keryell merged commit c6d2303 into triSYCL:main on Feb 11, 2021.