[Core] Parallel Loop with Generic Reduction #12195

matekelemen · 2024-03-16T22:59:52Z

Changes

Add a parallel for loop with thread-local storage (TLS) and an extra functor that performs a reduction on each thread's storage. The reduction runs on a thread's storage after that thread has finished its chunk of the loop, and is executed by a single thread at a time.

This makes it possible to define reductions on the fly and to implement more complex logic without overpopulating reduction_utilities.
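To make the intended semantics concrete, here is a minimal sketch of such a loop built on plain std::thread. It is not the Kratos implementation: the names sketch_block_for_each and sum_of_squares, the ThreadCount parameter, and the use of a mutex to serialize the reduction are all assumptions made for illustration.

```cpp
#include <algorithm>  // std::max, std::min
#include <cstddef>    // std::size_t
#include <mutex>      // std::mutex, std::lock_guard
#include <thread>     // std::thread
#include <vector>     // std::vector

// Hypothetical stand-in for the proposed loop; NOT the Kratos implementation.
template <class TContainer, class TLS, class TLoopBody, class TReduction>
void sketch_block_for_each(const TContainer& rContainer,
                           const TLS& rTlsPrototype,
                           TLoopBody LoopBody,
                           TReduction Reduction,
                           std::size_t ThreadCount = 4)
{
    ThreadCount = std::max<std::size_t>(ThreadCount, 1);
    const std::size_t size = rContainer.size();
    const std::size_t chunk_size = (size + ThreadCount - 1) / ThreadCount;

    std::mutex reduction_mutex;  // assumption: reductions are serialized by a lock
    std::vector<std::thread> threads;

    for (std::size_t i_thread = 0; i_thread < ThreadCount; ++i_thread) {
        const std::size_t begin = std::min(size, i_thread * chunk_size);
        const std::size_t end = std::min(size, begin + chunk_size);
        threads.emplace_back([&, begin, end]() {
            TLS tls = rTlsPrototype;  // each thread works on its own copy
            for (std::size_t i = begin; i < end; ++i) LoopBody(rContainer[i], tls);

            // Reduce this thread's storage once its chunk is done.
            std::lock_guard<std::mutex> lock(reduction_mutex);
            Reduction(tls);
        });
    }

    for (auto& r_thread : threads) r_thread.join();
}

// Hypothetical usage: the reduction lambda refers to a local variable.
long sum_of_squares(const std::vector<int>& rValues)
{
    long total = 0;
    sketch_block_for_each(
        rValues,
        0L,  // TLS prototype, copied once per thread
        [](int Value, long& rTls) { rTls += static_cast<long>(Value) * Value; },
        [&total](long& rTls) { total += rTls; });
    return total;
}
```

Because the reduction functor is created at the call site, it can accumulate directly into `total` without a dedicated reducer class. Calling `sum_of_squares` on a vector holding 1, 2, 3 and 4 returns 30.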

Example

An example use case is building a set "in parallel". Each thread computes part of the set, and the parts are merged into the final set during the reduction. The following example collects every round number (multiple of 10) between 0 and 99:

// --- Core Includes ---
#include "utilities/parallel_utilities.h" // block_for_each

// --- STL Includes ---
#include <unordered_set> // unordered_set
#include <numeric> // iota
#include <vector> // vector
#include <iostream> // cout


int main(){
    using Container = std::vector<int>;

    // Construct a range of integers between 0 and 99
    Container container(1e2);
    std::iota(container.begin(), container.end(), 0);

    // Define the thread-local storage
    using TLS = std::unordered_set<Container::value_type>;
    TLS round_numbers;
    
    // Perform the loop and reduction
    Kratos::block_for_each(
        container,
        round_numbers,
        [](Container::value_type Value, TLS& rTls) -> void {
            if (!(Value % 10)) rTls.insert(Value);
        },
        [&round_numbers](TLS& rTls) mutable -> void {
            for (auto item : rTls) round_numbers.insert(item);
        }
    );

    // Print the unified results
    for (auto item : round_numbers) {
        std::cout << item << "\n";
    }
}

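Possible output (the element order is unspecified because the set is unordered): the ten multiples of 10 from 0 to 90, one per line.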

philbucher

Not sure how I feel about this, TBH.

You replace the reductor by an extra lambda function, if I understand right?

You could easily do the example you mention with the existing interface.

Do you have a use case where the existing one doesn't work?

matekelemen · 2024-03-17T13:55:49Z

> You replace the reductor by an extra lambda function if I understand right?

More or less, yes. The key is that the lambda function is an instance the user creates at the location of the loop, while existing reductions are instantiated inside for_each and thus cannot refer to local state.
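The difference can be sketched with two deliberately serial toy loops. All names below (SumReducer, reduce_with_type, reduce_with_instance, sum_above_threshold) are hypothetical and only illustrate where the reduction object is created; they are not part of Kratos.

```cpp
#include <vector>  // std::vector

// Style A (hypothetical): the loop receives only the reducer *type* and
// constructs it internally, so the reducer cannot see the caller's locals.
struct SumReducer {
    int value = 0;
    void Reduce(int Item) { value += Item; }
};

template <class TReducer>
int reduce_with_type(const std::vector<int>& rItems)
{
    TReducer reducer;  // default-constructed inside the loop function
    for (int item : rItems) reducer.Reduce(item);
    return reducer.value;
}

// Style B (hypothetical): the loop receives a functor *instance* that the
// caller built, so it may capture whatever is in scope at the call site.
template <class TFunctor>
void reduce_with_instance(const std::vector<int>& rItems, TFunctor Functor)
{
    for (int item : rItems) Functor(item);
}

int sum_above_threshold(const std::vector<int>& rItems, int Threshold)
{
    int total = 0;  // local state the reduction refers to
    reduce_with_instance(rItems, [&total, Threshold](int Item) {
        if (Item > Threshold) total += Item;
    });
    return total;
}
```

In style A, a threshold would have to be baked into a new reducer class or passed through the loop's return values. In style B, it is just another capture.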

> You could easily do the example you mention with the existing interface
>
> Do you have a usecase where the existing one doesn't work?

The current interface is general enough to allow doing almost anything, but it's ill-suited for a couple of applications.

One gripe I have with it is that you need to define a class for every kind of reduction. If your reduction is highly specific to your problem, you'd basically have to define a class that gets used only once.

However, the major issue is what I mentioned earlier: reductions cannot store local state (reference local variables), nor do they support thread-local storage. So the only way of getting local state into the reduction is via the return value of the parallel function. This is extremely hackish and I would immediately block any PR that tried doing this.

For example, take the inverse of the problem in the example: you have an existing set of integers between 0 and 99, and you want to remove all round numbers that appear in the range you're looping over. With the current interface, you'd have to capture the existing set in the functor and forward a pointer to it in the functor's return value:

// --- Core Includes ---
#include "utilities/parallel_utilities.h" // block_for_each

// --- STL Includes ---
#include <unordered_set> // unordered_set
#include <numeric> // iota
#include <vector> // vector
#include <iostream> // cout
#include <optional> // optional
#include <utility> // pair, make_pair

using Container = std::vector<int>;

using IntSet = std::unordered_set<Container::value_type>;

namespace Kratos { // required by KRATOS_CRITICAL_SECTION

struct RemoveRoundsReduction
{
    using value_type = std::pair<
                            Container::value_type, // <== value of the item in the container we're looping over
                            IntSet*                // <== local state
                       >;

    using return_type = void;


    /// @details We're performing the reduction on an external object, so there's nothing to return.
    /// @note Btw it's extremely confusing that "GetValue" returns "return_type" instead of "value_type".
    return_type GetValue() const {}

    /// @brief Accumulate items to remove from the global set.
    void LocalReduce(value_type Value) {
        const auto item = Value.first;
        if (!(item % 10)) this->local_set.insert(item);

        // Store the global state in the reduction
        IntSet* p_global_set = Value.second;
        this->p_maybe_global_set.emplace(p_global_set);
    }

    /// @brief Remove items in the local set from the global one.
    /// @note The global reducer (this instance) is unused because we're performing
    ///       the reduction on an external object, and the data used for that reduction
    ///       is stored in the local reductions.
    void ThreadSafeReduce(const RemoveRoundsReduction& rLocalReduction) {
        if (rLocalReduction.p_maybe_global_set.has_value()) {
            KRATOS_CRITICAL_SECTION
            for (auto item : rLocalReduction.local_set) rLocalReduction.p_maybe_global_set.value()->erase(item);
        }
    }

    /// @brief Accumulate round numbers in this set during the local loop.
    IntSet local_set;

    /// @details Pointer to the global set, protected by an optional in case
    ///          the current thread was assigned an empty chunk to work on.
    std::optional<IntSet*> p_maybe_global_set;
}; // struct RemoveRoundsReduction

} // namespace Kratos


int main(){
    Container container(1e2);
    std::iota(container.begin(), container.end(), 0);

    IntSet not_round_numbers(container.begin(), container.end());

    Kratos::block_for_each<Kratos::RemoveRoundsReduction>(
        container,
        [&not_round_numbers](Container::value_type Value) -> std::pair<Container::value_type,IntSet*> {
            return std::make_pair(Value, &not_round_numbers);
        }
    );

    for (auto item : not_round_numbers) std::cout << item << " ";
    std::cout << "\n";
}

Possible output:

99 98 97 96 95 94 93 92 91 89 88 87 86 85 84 83 82 81 79 78 77 76 75 74 73 72 71 69 68 67 66 65 64 63 62 61 59 28 27 26 25 24 23 22 21 19 18 17 16 15 14 13 1 2 3 4 5 6 7 8 9 11 12 29 31 32 33 34 35 36 37 38 39 41 42 43 44 45 46 47 48 49 51 52 53 54 55 56 57 58

In comparison, the reduction with a lambda would look like so:

// --- Core Includes ---
#include "utilities/parallel_utilities.h" // block_for_each

// --- STL Includes ---
#include <unordered_set> // unordered_set
#include <numeric> // iota
#include <vector> // vector
#include <iostream> // cout


int main(){
    using Container = std::vector<int>;

    Container container(1e2);
    std::iota(container.begin(), container.end(), 0);

    using TLS = std::unordered_set<Container::value_type>;
    TLS not_round_numbers(container.begin(), container.end());

    Kratos::block_for_each(
        container,
        TLS(),
        [](Container::value_type Value, TLS& rTls) -> void {
            if (!(Value % 10)) rTls.insert(Value);
        },
        [&not_round_numbers](TLS& rTls) mutable -> void {
            for (auto item : rTls) not_round_numbers.erase(item);
        }
    );

    for (auto item : not_round_numbers) {
        std::cout << item << "\n";
    }
}

add parallel for loop with generic reduction

682369c

matekelemen added the Kratos Core, C++, and Parallel-SMP (shared memory parallelism with OpenMP or C++ threads) labels Mar 16, 2024

matekelemen requested a review from philbucher March 16, 2024 22:59

matekelemen self-assigned this Mar 16, 2024

matekelemen requested a review from a team as a code owner March 16, 2024 22:59

matekelemen requested a review from sunethwarna March 16, 2024 23:28

philbucher reviewed Mar 17, 2024

jcotela mentioned this pull request May 7, 2024

[Core] Add Voxel Mesh Generation Modeler #12297

Merged
