[Core] Parallel Loop with Generic Reduction #12195
base: master
Not sure how I feel abt this TBH
You replace the reductor by an extra lambda function if I understand right?
You could easily do the example you mention with the existing interface
Do you have a use case where the existing one doesn't work?
More or less, yes. The key is that the lambda function is an instance the user creates at the location of the loop, while existing reductions are instantiated in
The current interface is general enough to do almost anything, but it's ill-suited for a couple of applications. One gripe I have with it is that you need to define a class for every kind of reduction. If your reduction is highly specific to your problem, you'd basically have to define a class that gets used only once.
However, the major issue is what I mentioned earlier: reductions cannot store local state (reference local variables), nor do they support thread-local storage. So the only way of getting local state into the reduction is via the return value of the parallel function. This is extremely hackish and I would immediately block any PR that tried doing this.
For example, take the inverse of the problem in the example: you have an existing set of integers between 0 and 99, and you want to remove all round numbers that appear in the range you're looping over. With the current interface, you'd have to capture the existing set in the functor and forward the pointer in its return value ...
// --- Core Includes ---
#include "utilities/parallel_utilities.h" // block_for_each

// --- STL Includes ---
#include <unordered_set> // unordered_set
#include <numeric> // iota
#include <vector> // vector
#include <iostream> // cout
#include <optional> // optional
#include <utility> // pair, make_pair

using Container = std::vector<int>;
using IntSet = std::unordered_set<Container::value_type>;

namespace Kratos { // required by KRATOS_CRITICAL_SECTION

struct RemoveRoundsReduction
{
    using value_type = std::pair<
        Container::value_type, // <== value of the item in the container we're looping over
        IntSet* // <== local state
    >;

    using return_type = void;

    /// @details We're performing the reduction on an external object, so there's nothing to return.
    /// @note Btw it's extremely confusing that "GetValue" returns "return_type" instead of "value_type".
    return_type GetValue() const {}

    /// @brief Accumulate items to remove from the global set.
    void LocalReduce(value_type Value) {
        const auto item = Value.first;
        if (!(item % 10)) this->local_set.insert(item);

        // Store the global state in the reduction
        IntSet* p_global_set = Value.second;
        this->p_maybe_global_set.emplace(p_global_set);
    }

    /// @brief Remove items in the local set from the global one.
    /// @note The global reducer (this instance) is unused because we're performing
    /// the reduction on an external object, and the data used for that reduction
    /// is stored in the local reductions.
    void ThreadSafeReduce(const RemoveRoundsReduction& rLocalReduction) {
        if (rLocalReduction.p_maybe_global_set.has_value()) {
            KRATOS_CRITICAL_SECTION
            for (auto item : rLocalReduction.local_set) rLocalReduction.p_maybe_global_set.value()->erase(item);
        }
    }

    /// @brief Accumulate round numbers in this set during the local loop.
    IntSet local_set;

    /// @details Pointer to the global set, protected by an optional in case
    /// the current thread was assigned an empty chunk to work on.
    std::optional<IntSet*> p_maybe_global_set;
}; // struct RemoveRoundsReduction

} // namespace Kratos

int main(){
    Container container(1e2);
    std::iota(container.begin(), container.end(), 0);
    IntSet not_round_numbers(container.begin(), container.end());

    Kratos::block_for_each<Kratos::RemoveRoundsReduction>(
        container,
        [&not_round_numbers](Container::value_type Value) -> std::pair<Container::value_type,IntSet*> {
            return std::make_pair(Value, &not_round_numbers);
        }
    );

    for (auto item : not_round_numbers) std::cout << item << " ";
    std::cout << "\n";
}
Possible output:
In comparison, the reduction with a lambda would look like so:
// --- Core Includes ---
#include "utilities/parallel_utilities.h" // block_for_each

// --- STL Includes ---
#include <unordered_set> // unordered_set
#include <numeric> // iota
#include <vector> // vector
#include <iostream> // cout

int main(){
    using Container = std::vector<int>;
    Container container(1e2);
    std::iota(container.begin(), container.end(), 0);

    using TLS = std::unordered_set<Container::value_type>;
    TLS not_round_numbers(container.begin(), container.end());

    Kratos::block_for_each(
        container,
        TLS(),
        [](Container::value_type Value, TLS& rTls) -> void {
            if (!(Value % 10)) rTls.insert(Value);
        },
        [&not_round_numbers](TLS& rTls) mutable -> void {
            for (auto item : rTls) not_round_numbers.erase(item);
        }
    );

    for (auto item : not_round_numbers) {
        std::cout << item << "\n";
    }
}
Changes
Add a parallel for loop with thread-local storage and an extra functor that performs reduction on each storage in a single thread after the owning thread has finished its chunk of the loop.
This allows defining reductions on the fly, so more complex logic can be implemented without overpopulating reduction_utilities.
Example
An example for usage is building a set "in parallel". Each thread computes part of the set, which then gets unified into the final one during reduction. The following example collects every round number between 0 and 99:
Possible output:
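For readers without a Kratos build at hand, the loop described under Changes can be approximated with the standard library alone. The sketch below is not the Kratos implementation: `BlockForEachWithTls` and `CollectRoundNumbers` are hypothetical names, the default of 4 threads is arbitrary, and a plain `std::mutex` stands in for whatever locking the real loop uses. It only illustrates the idea of giving each thread its own copy of a storage object and then running a user-supplied reduction lambda on that storage, one thread at a time.

```cpp
#include <algorithm> // min
#include <cassert>   // assert
#include <cstddef>   // size_t
#include <mutex>     // mutex, lock_guard
#include <numeric>   // iota
#include <set>       // set
#include <thread>    // thread
#include <vector>    // vector

// Hypothetical stand-in for the proposed loop (assumption, not Kratos code).
// Every thread copies rPrototype into its own storage, applies loopBody to
// each item of its chunk, then runs reduction on that storage under a lock.
template <class TContainer, class TStorage, class TLoopBody, class TReduction>
void BlockForEachWithTls(const TContainer& rContainer,
                         const TStorage& rPrototype,
                         TLoopBody loopBody,
                         TReduction reduction,
                         std::size_t threadCount = 4)
{
    assert(threadCount > 0 && "at least one thread is required");
    const std::size_t size = rContainer.size();
    const std::size_t chunkSize = (size + threadCount - 1) / threadCount;
    std::mutex mutex;
    std::vector<std::thread> threads;

    for (std::size_t iThread = 0; iThread < threadCount; ++iThread) {
        const std::size_t begin = std::min(size, iThread * chunkSize);
        const std::size_t end = std::min(size, begin + chunkSize);
        if (begin == end) continue; // nothing to do for an empty chunk

        threads.emplace_back([&, begin, end]() {
            TStorage storage = rPrototype; // thread-local copy
            for (std::size_t i = begin; i < end; ++i) loopBody(rContainer[i], storage);

            // Reductions are serialized, so they may touch shared state.
            const std::lock_guard<std::mutex> lock(mutex);
            reduction(storage);
        });
    }

    for (auto& rThread : threads) rThread.join();
}

// Collect every multiple of 10 in [0, count) into one set.
std::set<int> CollectRoundNumbers(std::size_t count)
{
    std::vector<int> container(count);
    std::iota(container.begin(), container.end(), 0);

    using Tls = std::vector<int>;
    std::set<int> roundNumbers; // local state captured by the reduction lambda

    BlockForEachWithTls(
        container,
        Tls(),
        [](int value, Tls& rTls) { if (value % 10 == 0) rTls.push_back(value); },
        [&roundNumbers](Tls& rTls) { roundNumbers.insert(rTls.begin(), rTls.end()); }
    );
    return roundNumbers;
}
```

Calling `CollectRoundNumbers(100)` yields the ten multiples of 10 below 100. The point of the design is visible in the last lambda: because the reduction is an object created at the call site, it can capture `roundNumbers` directly instead of smuggling a pointer through the loop body's return value.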