Vectorisation sprint #654

Closed
wants to merge 107 commits into from
107 commits
2a8d17c
codegen: Implement SIMD vectorisation
tj-sun Apr 11, 2019
fbc6e4a
add omp simd vectorization mode
tj-sun Aug 1, 2019
5ae780d
add openmp flag and by pass workaround flag
tj-sun Aug 4, 2019
ba693dc
DROP BEFORE MERGE: test with correct loopy branch
wence- Apr 11, 2019
4ec0769
Turn of tree vectorize for certain gcc compilers. We might not need t…
sv2518 Jul 1, 2020
f9e60fd
Add simd compiler flags.
sv2518 Jul 1, 2020
00e073d
Remove time configuration.
sv2518 Jul 1, 2020
1cf7698
Default SIMD width.
sv2518 Jul 1, 2020
3e66946
Generate CVec Target with batch size infomation and move typedef into…
sv2518 Jul 3, 2020
1238ce8
Move zero declaration to loopy code base to be more robust in naming …
sv2518 Jul 3, 2020
1d54777
Added conditionals when to vectorise:
sv2518 Jul 15, 2020
b369213
Drop omp vectorisation.
sv2518 Jul 15, 2020
1c6346e
Add -march=native everywhere.
sv2518 Jul 16, 2020
856b6aa
Silence warnings.
sv2518 Jul 22, 2020
5e52ce1
Change vector tag.
sv2518 Aug 24, 2020
537c14c
Give more control over vectorisation to PyOP2.
sv2518 Sep 1, 2020
9317654
Naming adaption.
sv2518 Sep 1, 2020
6723b6a
Realize ilp first.
sv2518 Sep 1, 2020
38ebc8a
Jenkins.
sv2518 Sep 1, 2020
32b2910
Merge branch 'master' into vectorisation-restructure-checks
sv2518 Feb 28, 2022
944c6cf
DBM: run against new loopy branch
sv2518 Mar 1, 2022
3a1eb24
Lint
sv2518 Mar 1, 2022
681e315
More adapations to new PyOP2
sv2518 Mar 1, 2022
48d6142
More adapations to new PyOP2
sv2518 Mar 1, 2022
792c8f0
DBM take the correct branch
sv2518 Mar 3, 2022
2469870
Adapt to new PyOP2 and vectorisation
sv2518 Mar 3, 2022
4bbcde5
Adapt to new PyOP2 and vectorisation
sv2518 Mar 3, 2022
a5c0455
Fix return wrapper with kernel not kernel
sv2518 Mar 3, 2022
c374031
We do need to inline bc Implementing transforms that apply cleanly ac…
sv2518 Mar 3, 2022
e7d31eb
First split then tag because loopy does not support retaggin of iname…
sv2518 Mar 3, 2022
56a8dde
tag_array_axes requires us to specify the tags for each dimension of …
sv2518 Mar 3, 2022
d1171b3
Fix
sv2518 Mar 3, 2022
0641c75
fix
sv2518 Mar 3, 2022
644842e
improve comments
sv2518 Mar 3, 2022
9e58b22
tag only non-constant arrays with vec axes
kaushikcfd Mar 3, 2022
3f133fd
Only vectorise when local kernel is a loopy thing.
sv2518 Mar 4, 2022
dcd0b69
shift iel-loop to have lbound of 0
kaushikcfd Mar 4, 2022
907fe58
Fix import
sv2518 Mar 6, 2022
ca2aaaf
Debug: try with newer python version
sv2518 Mar 6, 2022
0440f66
Debug: try with newer python version
sv2518 Mar 6, 2022
4bcb592
change target before inlining
kaushikcfd Mar 7, 2022
d42e7e8
ignore loopy vectorization fallback warnings
kaushikcfd Mar 7, 2022
7e37e02
Revert "Debug: try with newer python version"
sv2518 Mar 6, 2022
b541dbd
Make complex check tighter
sv2518 Mar 11, 2022
caa567a
extend the set of variables that cannot be vecotrized
kaushikcfd Mar 11, 2022
c3a96fa
Attempt to fix Slate by inlining of all subkernels
sv2518 Mar 14, 2022
dc996de
Add comment
sv2518 Mar 14, 2022
fa343e1
placate flake8
kaushikcfd Mar 15, 2022
aa7bc0c
blas callables: do not accept vectorized dtypes
kaushikcfd Apr 1, 2022
8302d52
allow inverse.c::inverse() to take in vector dtypes
kaushikcfd May 5, 2022
a767fe2
Merge remote-tracking branch 'origin/master' into vectorisation-sprint
kaushikcfd May 5, 2022
85de156
do not invoke the vectorization pass if one of the arguments is a Mix…
kaushikcfd May 5, 2022
30f8ecb
makes freeing logic accurate
kaushikcfd May 5, 2022
0d5023d
rewrite solve to accept strided inputs
kaushikcfd May 6, 2022
d25545b
blas-helpers: corrects the freeing logic
kaushikcfd May 6, 2022
0ade829
Don't vectorise the kernel which generates the coordinates for the ex…
sv2518 May 6, 2022
a4bab8e
PyOP2 compilation: add a pathway to compile with gcc on Mac.
sv2518 May 6, 2022
175eb14
do not vectorize the entire kernel if some instruction are surrounded…
kaushikcfd May 8, 2022
8256bd2
loop being split starts from '0' => do not peel at the head
kaushikcfd May 8, 2022
6585dbb
Merge branch 'vectorisation-sprint' of github.com:OP2/PyOP2 into vect…
sv2518 May 9, 2022
4c0ca6e
Add comment
sv2518 May 9, 2022
e744092
Fix complex check?
sv2518 May 10, 2022
5fc4264
Fix complex check?
sv2518 May 10, 2022
31f0c39
Fix complex check?
sv2518 May 10, 2022
7e8a86a
Fix complex check?
sv2518 May 10, 2022
63f1e52
clarifies vectorization strategy
kaushikcfd May 11, 2022
8b19370
Updates to transform startegy
kaushikcfd May 11, 2022
7a2cbd6
Time configuration is not used anywhere and add doc
sv2518 May 19, 2022
69d4921
Move conditional
sv2518 May 19, 2022
43960e6
sun2020study -> cross-element
sv2518 May 19, 2022
b4c9926
Make default_simd_width more readable
sv2518 May 19, 2022
c603f3f
cleanup
sv2518 May 19, 2022
1cee3d7
Lint
sv2518 May 19, 2022
a671b6c
corrects the condition to not vectorize temps passed to BLAS calls
kaushikcfd May 20, 2022
4aa86e1
Add vectorisation config to cache keys
sv2518 May 24, 2022
60b4b3e
Tests: add a vectorisation test
sv2518 May 24, 2022
1b3c29e
Cleanup
sv2518 May 24, 2022
0a54a34
Cleanup
sv2518 May 24, 2022
9b23200
Use reconfigure not init for changing the vectorisation strategy in t…
sv2518 May 24, 2022
acb9c89
Cleanup
sv2518 May 24, 2022
49e2779
Test: improve the vectorisation test.
sv2518 May 24, 2022
e5fe4d2
Put vectorisation strategy only in cache key of the global kernel.
sv2518 May 24, 2022
0eff9d6
lint
sv2518 May 25, 2022
22ce06e
Fix docs
sv2518 May 25, 2022
bdefbfa
Fix config error
sv2518 May 25, 2022
2a459e5
Fix config error
sv2518 May 25, 2022
56c65da
Don't add py-cpuinfo
May 27, 2022
ca5c51b
Add nbytes property
connorjward Jun 22, 2022
dc5f3bc
Drop unused args
sv2518 Jun 22, 2022
ac36708
Time->extra_info
sv2518 Jun 22, 2022
89c9dec
Merge branch 'vectorisation-sprint' into connorjward/add-nbytes
sv2518 Jun 22, 2022
e2af4c7
Merge pull request #666 from OP2/connorjward/add-nbytes
sv2518 Jun 22, 2022
4de6f06
Merge branch 'vectorisation-sprint' into JDBetteridge/vectorisation-s…
sv2518 Jun 22, 2022
2840f28
Merge pull request #665 from OP2/JDBetteridge/vectorisation-sprint
sv2518 Jun 22, 2022
89feb72
Fix bandwidth calculation
Jun 24, 2022
0857145
Add simd compiler flag also to LinuxGNU compiler
Jun 24, 2022
662241e
Add vectorisation flag to linux clang compiler too
Jun 27, 2022
203223c
account for changed in loopy's vectorization syntax
kaushikcfd Jul 6, 2022
fae323f
run CI with py3.8
kaushikcfd Jul 6, 2022
030cae5
Fallback for stopping criterium
sv2518 Jul 7, 2022
ece0e62
Fallback for stopping criterium
sv2518 Jul 7, 2022
934e147
Reduce inames to untag
sv2518 Jul 7, 2022
bd95ba3
Reduce inames to untag
sv2518 Jul 7, 2022
fd6650d
Fallback for stopping criterium
sv2518 Jul 7, 2022
f69755d
unroll (not vectorize) loops surrounding CInstructions
kaushikcfd Jul 11, 2022
e72f316
get rid of noop insns
kaushikcfd Jul 11, 2022
09bf629
Fix merge leftovers for vectorisation in chapter 3
sv2518 Oct 4, 2022
2 changes: 1 addition & 1 deletion .github/workflows/ci.yml
@@ -30,7 +30,7 @@ jobs:
- name: Set correct Python version
uses: actions/setup-python@v2
with:
python-version: '3.6'
python-version: '3.8'

- name: Clone PETSc
uses: actions/checkout@v2
5 changes: 5 additions & 0 deletions pyop2/__init__.py
@@ -7,3 +7,8 @@
from pyop2._version import get_versions
__version__ = get_versions()['version']
del get_versions

from pyop2.configuration import configuration
from pyop2.compilation import max_simd_width
if configuration["vectorization_strategy"]:
    configuration["simd_width"] = max_simd_width()
42 changes: 35 additions & 7 deletions pyop2/codegen/c/inverse.c
@@ -6,8 +6,10 @@
#define BUF_SIZE 30
static PetscBLASInt ipiv_buffer[BUF_SIZE];
static PetscScalar work_buffer[BUF_SIZE*BUF_SIZE];
static PetscScalar Aout_proxy_buffer[BUF_SIZE*BUF_SIZE];
#endif


#ifndef PYOP2_INV_LOG_EVENTS
#define PYOP2_INV_LOG_EVENTS
PetscLogEvent ID_inv_memcpy = -1;
@@ -16,32 +18,58 @@ PetscLogEvent ID_inv_getri = -1;
static PetscBool log_active_inv = 0;
#endif

void inverse(PetscScalar* __restrict__ Aout, const PetscScalar* __restrict__ A, PetscBLASInt N)
static void inverse(PetscScalar* __restrict__ Aout, const PetscScalar* __restrict__ A, PetscBLASInt N,
PetscBLASInt incA, PetscBLASInt incAout)
{
PetscLogIsActive(&log_active_inv);
if (log_active_inv){PetscLogEventBegin(ID_inv_memcpy,0,0,0,0);}
PetscBLASInt info;
PetscBLASInt *ipiv = N <= BUF_SIZE ? ipiv_buffer : malloc(N*sizeof(*ipiv));
PetscScalar *Awork = N <= BUF_SIZE ? work_buffer : malloc(N*N*sizeof(*Awork));
memcpy(Aout, A, N*N*sizeof(PetscScalar));

// BLAScopy_ takes PetscBLASInt pointers; PetscInt can be wider (64-bit indices)
PetscBLASInt N_sq = N * N;
PetscBLASInt one = 1;

// Aout_proxy: 'Aout', but stored contiguously
PetscScalar *Aout_proxy;
if (incAout == 1)
Aout_proxy = Aout;
else
{
// TODO: Must see if allocating has a significant performance impact
// Aout_proxy_buffer holds BUF_SIZE*BUF_SIZE entries, so compare N (not N*N) with BUF_SIZE
Aout_proxy = N <= BUF_SIZE ? Aout_proxy_buffer : malloc(N*N*sizeof(*Aout));
}

if (log_active_inv){PetscLogEventBegin(ID_inv_memcpy,0,0,0,0);}
BLAScopy_(&N_sq, A, &incA, Aout_proxy, &one);
if (log_active_inv){PetscLogEventEnd(ID_inv_memcpy,0,0,0,0);}

if (log_active_inv){PetscLogEventBegin(ID_inv_getrf,0,0,0,0);}
LAPACKgetrf_(&N, &N, Aout, &N, ipiv, &info);
LAPACKgetrf_(&N, &N, Aout_proxy, &N, ipiv, &info);
if (log_active_inv){PetscLogEventEnd(ID_inv_getrf,0,0,0,0);}

if(info == 0){
if (log_active_inv){PetscLogEventBegin(ID_inv_getri,0,0,0,0);}
LAPACKgetri_(&N, Aout, &N, ipiv, Awork, &N, &info);
LAPACKgetri_(&N, Aout_proxy, &N, ipiv, Awork, &N, &info);
if (log_active_inv){PetscLogEventEnd(ID_inv_getri,0,0,0,0);}

// Copy Aout_proxy back to Aout
if (Aout != Aout_proxy)
{
if (log_active_inv){PetscLogEventBegin(ID_inv_memcpy,0,0,0,0);}
BLAScopy_(&N_sq, Aout_proxy, &one, Aout, &incAout);
if (log_active_inv){PetscLogEventEnd(ID_inv_memcpy,0,0,0,0);}
}
}

if(info != 0){
fprintf(stderr, "Getri throws nonzero info.");
abort();
}
if ( N > BUF_SIZE ) {

if (Awork != work_buffer)
free(Awork);
if (ipiv != ipiv_buffer)
free(ipiv);
}
if ((Aout_proxy != Aout) && (Aout_proxy != Aout_proxy_buffer))
free(Aout_proxy);
}
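The copy-in / copy-out pattern in inverse.c can be modelled roughly in Python. This is a sketch under assumptions, not the generated code: it assumes a vectorised caller interleaves W matrices so that flat entry k of lane l sits at index k*W + l (the diff does not spell out the layout). `strided_copy` mimics BLAS xCOPY for positive increments only; `invert_dense`, `inverse_strided` and the offset parameters are hypothetical stand-ins for the LAPACK getrf/getri pair and for C pointer arithmetic.

```python
def strided_copy(n, x, incx, y, incy, x_off=0, y_off=0):
    """Copy n items, y[y_off + i*incy] = x[x_off + i*incx] (positive strides only)."""
    for i in range(n):
        y[y_off + i * incy] = x[x_off + i * incx]


def invert_dense(a, n):
    """Invert a contiguous row-major n*n list in place (Gauss-Jordan, partial pivoting)."""
    inv = [1.0 if i == j else 0.0 for i in range(n) for j in range(n)]
    for col in range(n):
        # Pick the row with the largest entry in this column as the pivot row.
        pivot = max(range(col, n), key=lambda r: abs(a[r * n + col]))
        if a[pivot * n + col] == 0.0:
            raise ZeroDivisionError("singular matrix")
        if pivot != col:
            for j in range(n):
                a[col * n + j], a[pivot * n + j] = a[pivot * n + j], a[col * n + j]
                inv[col * n + j], inv[pivot * n + j] = inv[pivot * n + j], inv[col * n + j]
        p = a[col * n + col]
        for j in range(n):
            a[col * n + j] /= p
            inv[col * n + j] /= p
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                f = a[r * n + col]
                for j in range(n):
                    a[r * n + j] -= f * a[col * n + j]
                    inv[r * n + j] -= f * inv[col * n + j]
    a[:] = inv


def inverse_strided(a_out, a, n, inc_a, inc_a_out, a_off=0, out_off=0):
    """Gather a strided matrix into a contiguous proxy, invert it, scatter it back."""
    proxy = [0.0] * (n * n)                      # plays the role of Aout_proxy
    strided_copy(n * n, a, inc_a, proxy, 1, x_off=a_off)
    invert_dense(proxy, n)                       # the dense routine needs unit stride
    strided_copy(n * n, proxy, 1, a_out, inc_a_out, y_off=out_off)


# Two 2x2 matrices interleaved with W = 2 (assumed layout):
# lane 0 is [[2, 0], [0, 4]], lane 1 is [[1, 2], [3, 4]].
W, n = 2, 2
batch = [2.0, 1.0, 0.0, 2.0, 0.0, 3.0, 4.0, 4.0]
out = [0.0] * len(batch)
for lane in range(W):
    inverse_strided(out, batch, n, W, W, a_off=lane, out_off=lane)
```

With unit strides the proxy copy is redundant, which is why the C routine aliases `Aout_proxy` to `Aout` when `incAout == 1`.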
45 changes: 36 additions & 9 deletions pyop2/codegen/c/solve.c
@@ -8,6 +8,8 @@ static PetscBLASInt ipiv_buffer[BUF_SIZE];
static PetscScalar work_buffer[BUF_SIZE*BUF_SIZE];
#endif

static PetscScalar out_proxy_buffer[BUF_SIZE];

#ifndef PYOP2_SOLVE_LOG_EVENTS
#define PYOP2_SOLVE_LOG_EVENTS
PetscLogEvent ID_solve_memcpy = -1;
@@ -16,15 +18,32 @@ PetscLogEvent ID_solve_getrs = -1;
static PetscBool log_active_solve = 0;
#endif

void solve(PetscScalar* __restrict__ out, const PetscScalar* __restrict__ A, const PetscScalar* __restrict__ B, PetscBLASInt N)

/*
* @param[incA]: Stride value while accessing elements of 'A'.
* @param[incB]: Stride value while accessing elements of 'B'.
* @param[incOut]: Stride value while accessing elements of 'out'.
*/
void solve(PetscScalar* __restrict__ out, const PetscScalar* __restrict__ A, const PetscScalar* __restrict__ B, PetscBLASInt N,
PetscBLASInt incA, PetscBLASInt incB, PetscBLASInt incOut)
{
PetscScalar* out_proxy; /// output laid-out with unit stride, expected by LAPACK
// BLAScopy_ takes PetscBLASInt pointers; PetscInt can be wider (64-bit indices)
PetscBLASInt N_sq = N*N;
PetscBLASInt one = 1;
PetscLogIsActive(&log_active_solve);
if (log_active_solve){PetscLogEventBegin(ID_solve_memcpy,0,0,0,0);}
PetscBLASInt info;
PetscBLASInt *ipiv = N <= BUF_SIZE ? ipiv_buffer : malloc(N*sizeof(*ipiv));
memcpy(out,B,N*sizeof(PetscScalar));
PetscScalar *Awork = N <= BUF_SIZE ? work_buffer : malloc(N*N*sizeof(*Awork));
memcpy(Awork,A,N*N*sizeof(PetscScalar));

if (incOut == 1)
out_proxy = out;
else
out_proxy = (N <= BUF_SIZE) ? out_proxy_buffer : malloc(N*sizeof(*out));

BLAScopy_(&N, B, &incB, out_proxy, &one);

PetscScalar *Awork = N <= BUF_SIZE ? work_buffer : malloc(N_sq*sizeof(*Awork));
BLAScopy_(&N_sq, A, &incA, Awork, &one);
if (log_active_solve){PetscLogEventEnd(ID_solve_memcpy,0,0,0,0);}

PetscBLASInt NRHS = 1;
Expand All @@ -35,7 +54,11 @@ void solve(PetscScalar* __restrict__ out, const PetscScalar* __restrict__ A, con

if(info == 0){
if (log_active_solve){PetscLogEventBegin(ID_solve_getrs,0,0,0,0);}
LAPACKgetrs_(&T, &N, &NRHS, Awork, &N, ipiv, out, &N, &info);
LAPACKgetrs_(&T, &N, &NRHS, Awork, &N, ipiv, out_proxy, &N, &info);

if (out != out_proxy)
BLAScopy_(&N, out_proxy, &one, out, &incOut);

if (log_active_solve){PetscLogEventEnd(ID_solve_getrs,0,0,0,0);}
}

@@ -44,8 +67,12 @@ void solve(PetscScalar* __restrict__ out, const PetscScalar* __restrict__ A, con
abort();
}

if ( N > BUF_SIZE ) {
free(ipiv);
free(Awork);
}
if (ipiv != ipiv_buffer)
free(ipiv);

if (Awork != work_buffer)
free(Awork);

if ((out_proxy != out) && (out_proxy != out_proxy_buffer))
free(out_proxy);
}
100 changes: 96 additions & 4 deletions pyop2/codegen/rep2loopy.py
@@ -143,6 +143,7 @@ def with_types(self, arg_id_to_dtype, callables_table):
callables_table)

def emit_call_insn(self, insn, target, expression_to_code_mapper):
from loopy.codegen import UnvectorizableError
assert self.is_ready_for_codegen()
assert isinstance(insn, loopy.CallInstruction)

@@ -151,6 +152,9 @@ def emit_call_insn(self, insn, target, expression_to_code_mapper):
parameters = list(parameters)
par_dtypes = [self.arg_id_to_dtype[i] for i, _ in enumerate(parameters)]

if expression_to_code_mapper.codegen_state.vectorization_info:
raise UnvectorizableError("LACallable: cannot take in vector arrays")

parameters.append(insn.assignees[-1])
par_dtypes.append(self.arg_id_to_dtype[0])

@@ -177,6 +181,46 @@ class INVCallable(LACallable):
"""
name = "inverse"

def with_descrs(self, arg_id_to_descr, callables_table):
a_descr = arg_id_to_descr.get(0)
a_inv_descr = arg_id_to_descr.get(-1)

if a_descr is None or a_inv_descr is None:
# shapes aren't specialized enough to be resolved
return self, callables_table

assert len(a_descr.shape) == 2
assert a_descr.shape == a_inv_descr.shape
assert a_descr.shape[1] == a_descr.shape[0]

return self.copy(arg_id_to_descr=arg_id_to_descr), callables_table

def emit_call_insn(self, insn, target, expression_to_code_mapper):
from loopy.codegen import UnvectorizableError

# Override codegen to emit stride info. to the blas calls.
in_descr = self.arg_id_to_descr[0]
out_descr = self.arg_id_to_descr[-1]
ecm = expression_to_code_mapper

# see pyop2/codegen/c/inverse.c for the func. signature
inc_a = in_descr.dim_tags[1].stride
inc_a_out = out_descr.dim_tags[1].stride
n = in_descr.shape[0]

a, = insn.expression.parameters
a_out, = insn.assignees

if ecm.codegen_state.vectorization_info is not None:
raise UnvectorizableError("cannot vectorize 'inverse'.")

c_parameters = [ecm(a_out).expr,
ecm(a).expr,
n,
inc_a,
inc_a_out]
return var(self.name_in_target)(*c_parameters), False

def generate_preambles(self, target):
assert isinstance(target, type(target))
yield ("inverse", inverse_preamble)
@@ -189,19 +233,65 @@ class SolveCallable(LACallable):
"""
name = "solve"

def with_descrs(self, arg_id_to_descr, callables_table):
a_descr = arg_id_to_descr.get(0)
b_descr = arg_id_to_descr.get(1)
x_descr = arg_id_to_descr.get(-1)

if a_descr is None or b_descr is None or x_descr is None:
# shapes aren't specialized enough to be resolved
return self, callables_table

assert len(a_descr.shape) == 2
assert len(x_descr.shape) == 1
assert b_descr.shape == x_descr.shape

return self.copy(arg_id_to_descr=arg_id_to_descr), callables_table

def emit_call_insn(self, insn, target, expression_to_code_mapper):
from loopy.codegen import UnvectorizableError

# Override codegen to emit stride info. to the blas calls.
a_descr = self.arg_id_to_descr[0]
b_descr = self.arg_id_to_descr[1]
out_descr = self.arg_id_to_descr[-1]
ecm = expression_to_code_mapper

# see pyop2/codegen/c/solve.c for the func. signature
inc_a = a_descr.dim_tags[1].stride
inc_b = b_descr.dim_tags[0].stride
inc_out = out_descr.dim_tags[0].stride
n = a_descr.shape[0]

a, b = insn.expression.parameters
out, = insn.assignees

if ecm.codegen_state.vectorization_info is not None:
raise UnvectorizableError("cannot vectorize 'solve'.")

c_parameters = [ecm(out).expr,
ecm(a).expr,
ecm(b).expr,
n,
inc_a,
inc_b,
inc_out]
return var(self.name_in_target)(*c_parameters), False

def generate_preambles(self, target):
assert isinstance(target, type(target))
yield ("solve", solve_preamble)


class _PreambleGen(ImmutableRecord):
fields = set(("preamble", ))
fields = {"preamble", "idx"}

def __init__(self, preamble):
def __init__(self, preamble, idx="0"):
self.preamble = preamble
self.idx = idx

def __call__(self, preamble_info):
yield ("0", self.preamble)
yield (self.idx, self.preamble)
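A minimal standalone sketch of why a configurable `idx` can matter, assuming (this is an assumption about the consumer, not something the diff confirms) that preambles are deduplicated by their tag and emitted in tag order. `PreambleGen` mirrors the class above without the loopy base class, and `collect_preambles` is a hypothetical consumer written only for illustration.

```python
class PreambleGen:
    """Standalone stand-in for _PreambleGen: yields one (tag, code) pair."""

    def __init__(self, preamble, idx="0"):
        self.preamble = preamble
        self.idx = idx

    def __call__(self, preamble_info):
        yield (self.idx, self.preamble)


def collect_preambles(generators):
    """Hypothetical consumer: keep the first preamble per tag, emit in tag order."""
    seen = {}
    for gen in generators:
        for tag, code in gen(None):
            seen.setdefault(tag, code)  # a later preamble with the same tag is dropped
    return [seen[tag] for tag in sorted(seen)]


# Both generators use the default tag "0", so only the first preamble survives.
shared_tag = collect_preambles([PreambleGen("#include <a.h>"),
                                PreambleGen("#include <b.h>")])

# Distinct tags keep both preambles and fix their relative order.
distinct_tags = collect_preambles([PreambleGen("#include <b.h>", idx="1"),
                                   PreambleGen("#include <a.h>", idx="0")])
```

Under that assumption, the hard-coded `"0"` tag would let only one such preamble through, whereas distinct `idx` values keep them all.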


class PyOP2KernelCallable(loopy.ScalarCallable):
@@ -537,7 +627,9 @@ def renamer(expr):
options=options,
assumptions=assumptions,
lang_version=(2018, 2),
name=wrapper_name)
name=wrapper_name,
# TODO, should these really be silenced?
silenced_warnings=["write_race*", "data_dep*"])

# prioritize loops
for indices in context.index_ordering: