Alternative work distribution #36
Can you explain somewhere in the comments what is the optimization problem you are trying to solve?
Why these loop nests in the case of a `parallel_for`?
On my machine, using one thread per pixel led to under-utilization of the CPU (each core sat at only 70–80% activity) because of scheduling issues.
On the GPU, if we have more cores than pixels, it should "degenerate" into using one core per work-item (but I think that case is not handled well: the immediate-divider search should return 1 if it reaches the end of the function, instead of 0).
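For illustration, here is a minimal sketch of such a divisor search with the fix described above; the function name, signature, and chunking policy are assumptions, not necessarily what this PR implements:

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical helper: find the largest divisor of n not exceeding max_chunk,
// so that the work can be split into equally sized chunks.
std::size_t find_immediate_divider(std::size_t n, std::size_t max_chunk) {
  for (std::size_t d = std::min(n, max_chunk); d > 1; --d)
    if (n % d == 0)
      return d;
  // Reaching the end of the function means no divisor > 1 fits (e.g. n is
  // prime). Returning 1 (one work-item per chunk) keeps the distribution
  // valid, whereas returning 0 would produce a zero-sized range.
  return 1;
}
```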
I'm waiting for your feedback on this strategy before writing those comments.
On FPGA I think the problem is more complicated, because you will want to tune the replication of the loop body to find a good trade-off between area (so that other kernels can be packed onto the same device without requiring reconfiguration) and execution time.
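As a hypothetical illustration of that trade-off (not this PR's code), with the unroll pragma understood by Intel's FPGA toolchain, the replication factor is the tuning knob:

```cpp
// Illustrative only: unrolling replicates the loop body in hardware,
// trading FPGA area for execution time. The pragma spelling is the one
// accepted by Intel's HLS/oneAPI flow; other toolchains differ.
void scale(float *data, int n) {
  // Factor 4: roughly four copies of the datapath, up to 4x the throughput.
#pragma unroll 4
  for (int i = 0; i < n; ++i)
    data[i] *= 2.0f;
}
```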
There should not be one thread per pixel.
Looking at the code again, perhaps the problem is that there is an explicit `nd_range` distribution of the parallelism. On CPU this is painful because you need to fight against possible barriers; this is why there is a macro to swear that no barriers are in use... What about just using a simple `range` `parallel_for` and trusting the runtime to distribute the work? For triSYCL you could also try the TBB runtime.
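For illustration, here is a minimal sketch of the two submission styles under discussion, using a made-up image-scaling kernel (the buffer, sizes, and kernel names are assumptions, not this PR's code):

```cpp
#include <CL/sycl.hpp>
using namespace cl::sycl;

void run(queue &q, buffer<float, 2> &img) {
  // Explicit nd_range: the programmer fixes the work-group shape, and on CPU
  // the runtime must stand ready to honour possible barriers.
  q.submit([&](handler &cgh) {
    auto a = img.get_access<access::mode::read_write>(cgh);
    cgh.parallel_for<class scale_nd>(
        nd_range<2>{range<2>{512, 512}, range<2>{8, 8}},
        [=](nd_item<2> it) { a[it.get_global_id()] *= 2.0f; });
  });

  // Simple range: no work-group shape and no barriers possible, so the
  // runtime is free to distribute the work however it likes.
  q.submit([&](handler &cgh) {
    auto a = img.get_access<access::mode::read_write>(cgh);
    cgh.parallel_for<class scale_range>(
        range<2>{512, 512}, [=](id<2> i) { a[i] *= 2.0f; });
  });
}
```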
Indeed, there was an explicit `nd_range` creating 64 work-items per group, organized in an 8×8 grid. I wasn't aware of the alternative `parallel_for` API.
That does seem better. Done in 32baf38.
However, given this alternative API, I don't see why there is a need for a "one_single_task" version for FPGA: the semantics of `parallel_for` should be sufficient, and the backend compiler would then be responsible for choosing whether to replicate the dataflow many times or to use loops.
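A sketch of the two lowerings being contrasted, on a made-up 1-D kernel (names and sizes are assumptions; under the proposal above, choosing between them would be the backend compiler's job rather than the programmer's):

```cpp
#include <CL/sycl.hpp>
#include <cstddef>
using namespace cl::sycl;

void run_fpga(queue &q, buffer<float, 1> &data, std::size_t n) {
  // Single-task form: one work-item with an in-kernel loop that an FPGA
  // toolchain can pipeline and/or unroll.
  q.submit([&](handler &cgh) {
    auto a = data.get_access<access::mode::read_write>(cgh);
    cgh.single_task<class loop_form>([=] {
      for (std::size_t i = 0; i < n; ++i)
        a[i] *= 2.0f;
    });
  });

  // parallel_for form: the same semantics expressed as a flat index space,
  // leaving the replicate-vs-loop decision to the backend.
  q.submit([&](handler &cgh) {
    auto a = data.get_access<access::mode::read_write>(cgh);
    cgh.parallel_for<class flat_form>(range<1>{n},
                                      [=](id<1> i) { a[i] *= 2.0f; });
  });
}
```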
Sure. But the `parallel_for` might be less efficient on FPGA.