summing data across threads #294
-
Hi everyone, I'm working with Warp and need some advice on correctly accumulating values in a kernel. I pass an array to my kernel that is used to sum calculations performed by each thread. However, I'm struggling to ensure the summation is done correctly, since multiple threads may try to update the same array element simultaneously. Below is a simplified example of what I'm trying to do:

```python
import warp as wp
import numpy as np
import matplotlib.pylab as plt


@wp.kernel
def my_kernel(
    theta: wp.array(dtype=wp.float32),  # type: ignore
    phi: wp.array(dtype=wp.float32),  # type: ignore
    A: wp.array(dtype=wp.float32),  # type: ignore
):
    i, j = wp.tid()
    A[i] += wp.cos(theta[i]) + wp.sin(phi[j])


def sample_code():
    theta = wp.array(np.linspace(0, np.pi, 100).astype(np.float32))
    phi = wp.array(np.linspace(np.pi, 2 * np.pi, 100).astype(np.float32))
    _A = wp.zeros((100,))
    with wp.ScopedDevice('cuda'):
        wp.launch(
            kernel=my_kernel,
            dim=(100, 100),
            inputs=[
                theta,
                phi,
                _A,
            ],
        )
    A = _A.numpy()
    return A


a1 = sample_code()
a2 = sample_code()
plt.plot(a1)
plt.plot(a2)
plt.show()
```

The issue is that when multiple threads update the same element of `A`, the results are incorrect. I suspect this is because one thread may read and write an array element while another thread is updating it, i.e. a race condition. I initially thought `atomic_add` might be the solution, but it doesn't accept `A[i]` as an argument. I'm aware that I could add an extra dimension to `A`, run the kernel so each thread addresses only its own element, and then sum up the results after the kernel finishes. However, I'd prefer to avoid this because it significantly increases memory usage, which is a problem in my actual code (though not in the example above). Does anyone have suggestions on how to handle this accumulation correctly without the memory overhead?
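As a concrete check on what the kernel is supposed to compute, the extra-dimension workaround described above is equivalent to building the full 100×100 table and reducing over one axis. Here is a plain NumPy sketch of that equivalence (no Warp needed; the variable names are just for illustration):

```python
import numpy as np

theta = np.linspace(0, np.pi, 100).astype(np.float32)
phi = np.linspace(np.pi, 2 * np.pi, 100).astype(np.float32)

# The "extra dimension" version: one entry per (i, j) thread pair,
# summed over j afterwards -- race-free, but O(n*m) memory.
table = np.cos(theta)[:, None] + np.sin(phi)[None, :]  # shape (100, 100)
A = table.sum(axis=1)

# The same result without the 100x100 table: the j-sum of sin(phi[j])
# is a single constant added to every cos term.
A_cheap = 100.0 * np.cos(theta) + np.sin(phi).sum()
```

This also gives a known-good reference to compare the kernel's output against.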
Replies: 1 comment
-
Well this is gonna be funny/sad...

I had ChatGPT-4o clean up the wording of my post before submitting it. Afterwards I thought, what the hell, let's see if ChatGPT can help with the actual problem. It told me to use `atomic_add`. It turns out I didn't read the `atomic_add` documentation correctly 🤦. So for my sample code you would just do:

```python
wp.atomic_add(A, i, wp.cos(theta[i]) + wp.sin(phi[j]))
```

I'm going to leave the post up just in case there's other dummies like me out there.