Vectorisation #589
Conversation
I'm quite out of sync with the changes in PyOP2 and Firedrake since about 18 months ago, so I'm probably not qualified to review this. However, I did take a look and added some comments.
I remember this (or a similar) branch did pass all tests at one point (I can't quite recall whether that was locally on my Linux machine or on Jenkins), but it is very likely that changes and/or new examples have introduced new issues.
One major feature that I didn't touch at all back then is detecting the hardware/compiler environment and applying vectorisation accordingly. This is important, especially if you want to switch on vectorisation by default. I guess we could have users opt in rather than turning this on by default as a first step to get things going, but in that case we should at least have some Jenkins jobs that test vectorisation.
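A minimal sketch of the kind of detection meant here, using py-cpuinfo (which this PR later adds as a dependency); the flag-to-width mapping is an illustrative assumption, not code from the branch:

```python
import cpuinfo  # py-cpuinfo, the dependency mentioned later in the thread

def detect_vector_width():
    # Pick a double-precision SIMD width from the reported CPU flags.
    # The flag-to-width mapping below is an assumption for illustration.
    flags = cpuinfo.get_cpu_info().get("flags", [])
    if "avx512f" in flags:
        return 8  # 512-bit registers hold 8 doubles
    if "avx" in flags:
        return 4  # 256-bit registers hold 4 doubles
    if "sse2" in flags:
        return 2  # 128-bit registers hold 2 doubles
    return 1      # fall back to scalar code
```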
The firedrake test test_real_space.py::test_real_interpolate is failing with the following generated code, and the answer is 4 times what we expect. It seems like there is another bug in the splitting. I think the iname should not be split at all; it is only of extent 1.
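A minimal sketch of the guard this suggests, assuming a loopy kernel: check the iname's constant extent and skip the split when it is 1. The helper name is made up; only `lp.split_iname` and `get_constant_iname_length` are real loopy API.

```python
import loopy as lp

def split_if_worthwhile(knl, iname, factor=4):
    # Skip the split for inames of extent 1 (like the Real-space loop here);
    # splitting them only creates bogus inner loops.
    try:
        extent = knl.get_constant_iname_length(iname)
    except Exception:
        extent = None  # extent not a known constant; assume splittable
    if extent is not None and extent <= 1:
        return knl
    return lp.split_iname(knl, iname, factor)
```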
Can you dump the wrapper kernel before this splitting transform so we can have a look at what went wrong? Do …
The loop splitting code doesn't seem to privatise the reduction variables properly for global reductions. We start with (I've assumed start = 0, end mod 4 = 0 and end > start to remove some conditionals):

```c
void loopy_kernel(int32_t const start, int32_t const end,
                  double *__restrict__ glob0,
                  double const *__restrict__ glob1) {
  for (int32_t n = 0; n <= -1 + end; ++n) {
    glob0[0] = 0.0;
    {
      int32_t const expr_p0 = 0;
      glob0[0] = glob0[0] + glob1[0];
    }
  }
}
```

So far, so good. If I split the `n` iname, I get:

```c
void loopy_kernel(int32_t const start, int32_t const end,
                  double *__restrict__ glob0,
                  double const *__restrict__ glob1) {
  for (int32_t n_outer = 0; n_outer <= (-4 + end) / 4; ++n_outer) {
    for (int32_t n_inner = 0; n_inner <= 3; ++n_inner) {
      glob0[0] = 0.0;
    }
    {
      int32_t const expr_p0 = 0;
      for (int32_t n_inner = 0; n_inner <= 3; ++n_inner)
        glob0[0] = glob0[0] + glob1[0];
    }
  }
}
```

Notice how this is wrong: the zeroing of `glob0[0]` has been hoisted into its own inner loop, after which we accumulate into it four times, so the result is four times too large. I wonder what one should have done.
I don't know. Which instruction belongs to which loop is usually defined with within_inames. Maybe we forgot to specify that somewhere. But since the instruction is in the loop body, I assume this is not it. Maybe it's the fact that it's wrapped in a …
The problem here is that the …
A workaround is to add yet another conditional for the case in which we switch vectorisation off again; see https://github.com/OP2/PyOP2/pull/589/files#diff-c87324f585685c592b11a1e78337d775R183
Do you expect this only to occur when writing into globals? I added the following conditional and was hoping the other failing test would pass as well, but no.
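(The conditional itself did not survive extraction; a hedged reconstruction of its likely shape, with illustrative attribute names rather than the real PyOP2 ones:)

```python
# Illustrative guard: disable vectorisation when any argument is a global
# that is accumulated into, since the split mis-privatises the reduction.
# `is_global` and `access` stand in for whatever the real Arg exposes.
def safe_to_vectorise(args):
    return not any(arg.is_global and arg.access == "INC" for arg in args)
```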
No clue, but I'm not sure that it should only trigger when the arg is a global.
None of these workarounds is a very satisfactory solution: it feels like the implementations of the cvec and omp_simd backends in loopy are incomplete. But I guess one should get as much code merged as possible and then fix that.
Yes to both comments. The line smoother test is still failing; it's only the one for the periodic hierarchy. The solver diverges in the end. I have no idea how to debug this.
Here's a proposed better fix. The loopy folks suggest privatising the reduction via a temporary variable (see the discussion here: https://gitlab.tiker.net/inducer/loopy/-/issues/217). We can do this by producing code with temporaries at codegen time. I pushed 0677a24, which does this, and hopefully fixes the problem (and doesn't introduce other ones).
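A hedged sketch of what such privatised codegen could produce for the split kernel above (not the actual output of 0677a24): the reduction goes through a per-iteration temporary, so the zeroing and the accumulation can no longer be pulled into separate inner loops.

```c
void loopy_kernel(int32_t const start, int32_t const end,
                  double *__restrict__ glob0,
                  double const *__restrict__ glob1) {
  for (int32_t n_outer = 0; n_outer <= (-4 + end) / 4; ++n_outer) {
    for (int32_t n_inner = 0; n_inner <= 3; ++n_inner) {
      double acc = 0.0;   /* privatised reduction variable */
      acc = acc + glob1[0];
      glob0[0] = acc;     /* write-back, once per iteration */
    }
  }
}
```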
OK, that was a bad fix, will revert for now. |
What went wrong with your fix, @wence-?
Oh, it revealed firedrakeproject/firedrake#1768. |
Force-pushed from 3d47231 to fdf3fad.
The kernel that causes the failure in tests/regression/test_line_smoother_periodic.py::test_line_smoother_periodic is attached in vectorised and unvectorised form. The bug is in the value of …
Can you show what values the … take? Translating that loop structure to Python I get:
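(The original snippet was lost in extraction; a hedged reconstruction, assuming the 4-way split with a bulk slab and a remainder slab as in the kernels above, with illustrative bounds:)

```python
# Count the iterations the split loop structure performs: a bulk slab in
# steps of 4 plus an upper (remainder) slab. The bound expressions mirror
# the generated C above; they are an assumption about the attached kernel.
def iterations(end):  # start assumed 0, as earlier in the thread
    visited = []
    for n_outer in range(0, (end - 4) // 4 + 1):    # bulk slab
        for n_inner in range(0, 4):
            visited.append(4 * n_outer + n_inner)
    for n in range(4 * ((end - 4) // 4 + 1), end):  # upper slab
        visited.append(n)
    return visited

print(len(iterations(10)))  # 10: every cell visited exactly once
```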
It's not obvious that this gives wrong iteration counts.
I think the reason one cannot see the problem in the iteration count in your script is that the slab …
So I think in the C kernel we want …
I'm pretty sure Python and C ints have the same rounding behaviour. I did a bit more digging, and it turns out this is not a bug in the loopy codegen, but rather in the values that we're sending into the kernel. You can reproduce it by comparing the output, with vectorisation off and on, of this code:
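(The reproducer itself was lost; a hedged sketch of the shape it likely had, interpolating a DG field into a CG space so that values at shared vertices depend on the cell visit order. The mesh, degrees, and the way vectorisation is toggled are assumptions.)

```python
from firedrake import *

mesh = UnitIntervalMesh(8)
V = FunctionSpace(mesh, "DG", 0)  # cellwise-constant source
W = FunctionSpace(mesh, "CG", 1)  # continuous target
x, = SpatialCoordinate(mesh)
f = interpolate(x, V)
# At each interior vertex, two cells write different DG values to the same
# CG dof, so the result depends on the order in which cells are visited.
g = interpolate(f, W)
print(g.dat.data)  # run once with vectorisation off, once with it on, and compare
```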
What's happening here is that we're interpolating a DG field into a CG space, and the values we get depend on the order in which we execute things. Now, PyOP2 has never guaranteed that it won't change the iteration order, so arguably this code has been relying on unspecified behaviour. Actually fixing this properly needs a big rethink of the way the multigrid grid transfer is done; however, I can at least get the test to pass by applying this patch in loopy:

```diff
diff --git a/loopy/codegen/loop.py b/loopy/codegen/loop.py
index c71554a4..43bff2c3 100644
--- a/loopy/codegen/loop.py
+++ b/loopy/codegen/loop.py
@@ -105,9 +105,9 @@ def get_slab_decomposition(kernel, iname):
     if upper_bulk_bound is not None:
         bulk_slab = bulk_slab.add_constraint(upper_bulk_bound)

-    slabs.append(("bulk", bulk_slab))
     if lower_slab:
         slabs.append(lower_slab)
+    slabs.append(("bulk", bulk_slab))
     if upper_slab:
         slabs.append(upper_slab)
```

(This emits the lower slab before the bulk slab, so cells are visited in their original index order.)
If I execute that test I get a different output.
Force-pushed from 9d6e625 to 6cda3c1.
The last failing firedrake test is now tests/extrusion/test_real_tensorproduct.py.
#590 in PyOP2 (+ https://github.com/firedrakeproject/firedrake/tree/wence/fix/interpolate-max in firedrake) should be merged first, so I can drop those commits from this branch.
Done. |
Force-pushed from 6420e9e to 856b6aa.
I have py-cpuinfo as a required dependency. I added it in Firedrake (https://github.com/firedrakeproject/firedrake/compare/trigger-tests-for-vec) and also in PyOP2 in commit 1cf7698. Do I need both, or is PyOP2 alone enough?
Only works when kernel is a Loopy kernel.
…he tree vectorisation flag for our vectorisation anyways.
Don't vectorise if there are complex arguments. Check if a vectorisation strategy is specified, otherwise don't vectorise.
Force-pushed from 87bb49a to 6bf9de3.
Superseded by #654
I would like to merge TJ's changes to enable vectorisation of TSFC kernels.
This branch depends on https://github.com/firedrakeproject/loopy/tree/cvec.
#590 in PyOP2 (+ https://github.com/firedrakeproject/firedrake/tree/wence/fix/interpolate-max in firedrake) should be merged first, so I can drop those commits in this branch.