cuet.TensorProduct flattens, squeezes, and splits the descriptor so that it can use TensorProductUniform4x1d. However, GPUs have Tensor Cores. Does processing the data as 1D vectors actually reduce computational complexity, or have I misunderstood? Please advise.
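As a toy illustration of the point in question (this is plain NumPy, not the cuEquivariance internals, and the contraction path is made up for the example): flattening segmented operands into 1D vectors is a layout transformation, not a change in the amount of arithmetic. The same multiply-adds are performed either way.

```python
import numpy as np

# Toy tensor-product path "w[u,v] * x[u,i] * y[v,j] -> z[i,j]", evaluated
# once on 2D segments and once on flattened 1D views of the same buffers.
rng = np.random.default_rng(0)
u, v, i, j = 3, 4, 2, 5
w = rng.standard_normal((u, v))
x = rng.standard_normal((u, i))
y = rng.standard_normal((v, j))

# Segmented (2D) evaluation.
z_2d = np.einsum("uv,ui,vj->ij", w, x, y)

# Flattened (1D) evaluation: same data, indices computed by hand.
x1 = x.reshape(-1)            # length u*i, row-major
y1 = y.reshape(-1)            # length v*j, row-major
z1 = np.zeros(i * j)
for uu in range(u):
    for vv in range(v):
        for ii in range(i):
            for jj in range(j):
                z1[ii * j + jj] += w[uu, vv] * x1[uu * i + ii] * y1[vv * j + jj]

# Identical result; the flattening changed memory layout, not FLOP count.
assert np.allclose(z_2d.reshape(-1), z1)
```

The benefit of the 1D form is that many differently-shaped descriptors can be mapped onto one uniform kernel with contiguous memory access, which is what lets the backend keep only a few kernels.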
For now we have three kernels, and the frontend, as you saw, tries to reduce the given STP to one of these kernels. We are currently working on two things related to that:
improving the reduction of STPs to the available kernels
adding more kernels.
Can you provide details about the specific STP you are executing?
We are mainly focusing on the TensorProductUniform4x1d interface. Can you provide more information?
In addition, the frontend of cuEquivariance exposes four interfaces: TensorProductUniform3x1d, TensorProductUniform4x1d, FusedTensorProductOp3, and FusedTensorProductOp4, but you mentioned there are only three kernels. Do the backend kernels correspond to these four interfaces?