Alternative work distribution #36
Conversation
It is unreadable with the TAB changes.
Hopefully there is an option in GitHub for not showing these differences...
include/render.hpp (outdated):
const auto th_max_y = start_y + pg_height;
const auto max_y = (th_max_y > height) ? height : th_max_y;
for (auto y = start_y; y < max_y; ++y) {
  for (auto x = start_x; x < max_x; ++x) {
Can you explain somewhere in the comments what optimization problem you are trying to solve? Why these loop nests in the case of a `parallel_for`?
> Why these loop nests in the case of a `parallel_for`?
On my machine, using one thread per pixel led to an under-utilization of the CPU (each core sat at around 70-80% activity) because of scheduling issues.
On the GPU, if we have more cores than pixels, it should "degenerate" into using one core per work item (but I think that case is not handled well: the find-immediate-divider helper should return 1 instead of 0 when it reaches the end of the function).
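For reference, here is a minimal sketch of the tiling idea described above, written in SYCL 2020 style and simplified to one dimension; it is not the project's actual code, and the `find_immediate_divider` helper below is a hypothetical reconstruction that already includes the suggested fallback of 1 instead of 0.

```cpp
#include <sycl/sycl.hpp>

// Largest divisor of n not exceeding upper_bound; falls back to 1 (one row
// per work item) rather than 0 when nothing is found.
constexpr int find_immediate_divider(int n, int upper_bound) {
  for (int d = upper_bound; d > 1; --d)
    if (n % d == 0)
      return d;
  return 1;
}

void render_tiled(sycl::queue &q, int width, int height, int n_threads) {
  // Tile height chosen so that roughly one tile lands on each hardware thread.
  const int pg_height = find_immediate_divider(height, height / n_threads);

  q.parallel_for(sycl::range<1>(height / pg_height), [=](sycl::id<1> tile) {
    const int start_y = static_cast<int>(tile[0]) * pg_height;
    const int th_max_y = start_y + pg_height;
    const int max_y = (th_max_y > height) ? height : th_max_y;
    for (int y = start_y; y < max_y; ++y)
      for (int x = 0; x < width; ++x) {
        /* shade pixel (x, y) */
      }
  });
}
```

With far more cores than rows (the GPU case), this sketch degrades to a tile height of 1, i.e. one row per work item.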
I'm waiting for your comment on this strategy before documenting it in the code comments.
On FPGA I think the problem is more complicated, because you will want to "tune" the replication of the loop body to find a good trade-off between area (to be able to pack other kernels onto the same part without requiring reconfiguration) and execution time.
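To make the area/time knob concrete, here is a rough sketch assuming an Intel oneAPI style FPGA flow, where `#pragma unroll` with a factor controls how many copies of the loop body are instantiated in hardware; the factor of 4 is a hypothetical tuning value, not something taken from this project.

```cpp
#include <sycl/sycl.hpp>

void render_fpga(sycl::queue &q, int width, int height) {
  q.single_task([=] {
    for (int y = 0; y < height; ++y) {
      // Replicate the inner loop body 4 times: more copies cost more area
      // but shorten execution time; 1 would keep a single, smaller pipeline.
      #pragma unroll 4
      for (int x = 0; x < width; ++x) {
        /* shade pixel (x, y) */
      }
    }
  });
}
```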
There should not be one thread per pixel.
By looking at the code again, the problem is perhaps that there is an explicit `nd_range` distribution of the parallelism. On CPU this is painful because you need to fight against possible barriers. This is why there is a macro to swear there are no barriers in use...
What about just using a simple `range` `parallel_for` and just trusting the runtime to distribute the work?
For triSYCL you could try the TBB runtime too.
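A minimal sketch of that suggestion in SYCL 2020 style (the buffer and accessor names are illustrative, not the project's): a plain two-dimensional range with one logical work item per pixel, no work-group size and no barriers, so the runtime (or TBB under triSYCL) decides how to pack work items onto threads.

```cpp
#include <sycl/sycl.hpp>

void render(sycl::queue &q, sycl::buffer<float, 2> &frame) {
  q.submit([&](sycl::handler &cgh) {
    auto pixels = frame.get_access<sycl::access::mode::write>(cgh);
    cgh.parallel_for(frame.get_range(), [=](sycl::id<2> idx) {
      // How work items are grouped onto threads is left to the runtime.
      pixels[idx] = 0.f; // placeholder for shading pixel (idx[1], idx[0])
    });
  });
}
```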
> By looking at the code again, the problem is perhaps that there is an explicit `nd_range` distribution of the parallelism.
Indeed, there was an explicit `nd_range` creating 64 work items per group, organised in an 8x8 grid. I wasn't aware of the alternative `parallel_for` API.
> What about just using a simple `range` `parallel_for` and just trusting the runtime to distribute the work?
That seems even better indeed. Done in 32baf38
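For readers following along, the change roughly amounts to the following (illustrative names only, not the actual diff of 32baf38):

```cpp
#include <sycl/sycl.hpp>

// Before: explicit nd_range with 8x8 work-groups (64 work items per group).
void render_nd_range(sycl::queue &q, size_t width, size_t height) {
  q.parallel_for(sycl::nd_range<2>{{height, width}, {8, 8}},
                 [=](sycl::nd_item<2> it) {
                   /* shade pixel at it.get_global_id() */
                 });
}

// After: a plain range, with no work-group geometry imposed on the runtime.
void render_range(sycl::queue &q, size_t width, size_t height) {
  q.parallel_for(sycl::range<2>{height, width},
                 [=](sycl::id<2> idx) {
                   /* shade pixel (idx[1], idx[0]) */
                 });
}
```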
However, given this alternative API I don't see why there is a need for a "one_single_task" version for FPGA: the semantics of `parallel_for` should be sufficient, and the backend compiler would then be responsible for choosing whether to replicate the data flow many times or to use loops.
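As a sketch of that argument (standard SYCL 2020 selectors; the FPGA selector is implementation specific, so it is only hinted at as a placeholder), the same plain `parallel_for` kernel can be handed to whichever device queue is wanted, and the backend compiler decides how to lower it:

```cpp
#include <sycl/sycl.hpp>

template <typename Selector>
void render_on(const Selector &select_device, int width, int height) {
  sycl::queue q{select_device};
  // Same kernel source for every device; the FPGA backend would choose
  // between replicating the data path and pipelining the loop.
  q.parallel_for(sycl::range<2>(height, width), [=](sycl::id<2> idx) {
    /* shade pixel (idx[1], idx[0]) */
  });
  q.wait();
}

// Usage, e.g.: render_on(sycl::cpu_selector_v, w, h);
// For an FPGA run, pass the vendor-provided FPGA selector instead.
```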
Sure. But the `parallel_for` might be less efficient on FPGA.
Rewriting with nested lambdas to solve #33, plus alternative scheduling to allow better cache locality, plus avoiding too many threads on the CPU.