VP4DPWSSD
VP4DPWSSD zmm1{k1}{z}, zmm2+3, m128
[AVX512_4VNNIW] Multiply signed words from source register block indicated by zmm2 by signed words from m128 and accumulate resulting signed dwords in zmm1.
This instruction computes 4 sequential register source-block dot-products of two signed word operands with doubleword accumulation. The memory operand is sequentially selected in each of the four steps. In the signature above, the notation “+3” denotes that the instruction accesses 4 source registers based on that operand; the sources are consecutive, start on a multiple-of-4 boundary, and include the encoded register operand. This instruction supports memory fault suppression. The entire memory operand is loaded if any of the lowest 16 bits of the mask is set to 1 or if a “no masking” encoding is used. The tuple type T1_4X implies that four 32-bit elements (16 bytes) are referenced by the memory operation portion of this instruction.
src_reg_id is the 5-bit index of the vector register specified in the instruction as the src1 register.
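As a small illustration of the “+3” rule, the sketch below lists the registers that would be read for a given encoded register index. The function name source_block is hypothetical; the masking expression is the same one the pseudocode uses for src_base.

```python
def source_block(src_reg_id, n=4):
    """Return the indices of the n consecutive registers read for `zmm2+3`."""
    base = src_reg_id & ~(n - 1)   # round down to a multiple of n
    return [base + m for m in range(n)]


print(source_block(4))    # [4, 5, 6, 7]
print(source_block(6))    # [4, 5, 6, 7]
print(source_block(31))   # [28, 29, 30, 31]
```

Encoding zmm6 as the source therefore reads the same block (zmm4 to zmm7) as encoding zmm4.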
VP4DPWSSD dest, src1, src2
(KL,VL) = (16,512)
N ← 4
ORIGDEST ← DEST
src_base ← src_reg_id & ~ (N-1) // for src1 operand
FOR i ← 0 to KL-1:
    IF k1[i] or *no writemask*:
        FOR m ← 0 to N-1:
            t ← SRC2.dword[m]
            p1dword ← reg[src_base+m].word[2*i] * t.word[0]
            p2dword ← reg[src_base+m].word[2*i+1] * t.word[1]
            DEST.dword[i] ← DEST.dword[i] + p1dword + p2dword
    ELSE IF *zeroing*:
        DEST.dword[i] ← 0
    ELSE
        DEST.dword[i] ← ORIGDEST.dword[i]
DEST[MAX_VL-1:VL] ← 0
For an unmasked instruction, that boils down to 128 signed 16-bit word multiplications (16 dword lanes × 4 steps × 2 products) and 128 signed 32-bit dword additions.
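The pseudocode above can be modelled in scalar code to check results by hand. The following is a minimal sketch, not a definitive implementation: the names vp4dpwssd and to_i32 and the list-based register file are illustrative choices, and it assumes the accumulation wraps around at 32 bits, as the plain (non-saturating) addition in the pseudocode suggests.

```python
def to_i32(x):
    """Wrap a Python int to a signed 32-bit value (two's complement)."""
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x & 0x80000000 else x


def vp4dpwssd(dest, regs, src_reg_id, mem, k1=None, zeroing=False):
    """Scalar model of VP4DPWSSD (illustrative, list-based).

    dest       : 16 signed dwords (the accumulator, zmm1)
    regs       : 32 registers, each a list of 32 signed words
    src_reg_id : index of the encoded source register (zmm2)
    mem        : 8 signed words (the m128 operand)
    k1         : 16-bit write mask as an int, or None for "no masking"
    zeroing    : True for zeroing-masking, False for merging-masking
    """
    n = 4
    src_base = src_reg_id & ~(n - 1)       # first register of the aligned block
    out = list(dest)                        # merging-masking keeps old values
    for i in range(16):
        if k1 is None or (k1 >> i) & 1:
            acc = dest[i]
            for m in range(n):
                src = regs[src_base + m]
                p1 = src[2 * i] * mem[2 * m]          # t.word[0] of dword m
                p2 = src[2 * i + 1] * mem[2 * m + 1]  # t.word[1] of dword m
                acc = to_i32(acc + p1 + p2)
            out[i] = acc
        elif zeroing:
            out[i] = 0
    return out


# Example: zmm4..zmm7 hold the constants 1..4, the memory operand is all ones.
regs = [[0] * 32 for _ in range(32)]
for m in range(4):
    regs[4 + m] = [m + 1] * 32
mem = [1] * 8

result = vp4dpwssd([0] * 16, regs, 6, mem)   # zmm6 selects block zmm4..zmm7
print(result)   # every lane is 2*1 + 2*2 + 2*3 + 2*4 = 20
```

Passing k1=0b1 updates only lane 0; the remaining lanes keep their old value, or become 0 when zeroing=True.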
How to use this instruction to implement a neural network
reg[src_base+0] = [ a0, a1, ..., a31]
reg[src_base+1] = [ b0, b1, ..., b31]
reg[src_base+2] = [ c0, c1, ..., c31]
reg[src_base+3] = [ d0, d1, ..., d31]
SRC2 = [s0, s1, ..., s7]
DEST.i32[0] += (a0 *s0) + (a1 *s1) + (b0 *s2) + (b1 *s3) + (c0 *s4) + (c1 *s5) + (d0 *s6) + (d1 *s7)
DEST.i32[1] += (a2 *s0) + (a3 *s1) + (b2 *s2) + (b3 *s3) + (c2 *s4) + (c3 *s5) + (d2 *s6) + (d3 *s7)
DEST.i32[2] += (a4 *s0) + (a5 *s1) + (b4 *s2) + (b5 *s3) + (c4 *s4) + (c5 *s5) + (d4 *s6) + (d5 *s7)
DEST.i32[3] += (a6 *s0) + (a7 *s1) + (b6 *s2) + (b7 *s3) + (c6 *s4) + (c7 *s5) + (d6 *s6) + (d7 *s7)
DEST.i32[4] += (a8 *s0) + (a9 *s1) + (b8 *s2) + (b9 *s3) + (c8 *s4) + (c9 *s5) + (d8 *s6) + (d9 *s7)
DEST.i32[5] += (a10*s0) + (a11*s1) + (b10*s2) + (b11*s3) + (c10*s4) + (c11*s5) + (d10*s6) + (d11*s7)
DEST.i32[6] += (a12*s0) + (a13*s1) + (b12*s2) + (b13*s3) + (c12*s4) + (c13*s5) + (d12*s6) + (d13*s7)
DEST.i32[7] += (a14*s0) + (a15*s1) + (b14*s2) + (b15*s3) + (c14*s4) + (c15*s5) + (d14*s6) + (d15*s7)
DEST.i32[8] += (a16*s0) + (a17*s1) + (b16*s2) + (b17*s3) + (c16*s4) + (c17*s5) + (d16*s6) + (d17*s7)
DEST.i32[9] += (a18*s0) + (a19*s1) + (b18*s2) + (b19*s3) + (c18*s4) + (c19*s5) + (d18*s6) + (d19*s7)
DEST.i32[10] += (a20*s0) + (a21*s1) + (b20*s2) + (b21*s3) + (c20*s4) + (c21*s5) + (d20*s6) + (d21*s7)
DEST.i32[11] += (a22*s0) + (a23*s1) + (b22*s2) + (b23*s3) + (c22*s4) + (c23*s5) + (d22*s6) + (d23*s7)
DEST.i32[12] += (a24*s0) + (a25*s1) + (b24*s2) + (b25*s3) + (c24*s4) + (c25*s5) + (d24*s6) + (d25*s7)
DEST.i32[13] += (a26*s0) + (a27*s1) + (b26*s2) + (b27*s3) + (c26*s4) + (c27*s5) + (d26*s6) + (d27*s7)
DEST.i32[14] += (a28*s0) + (a29*s1) + (b28*s2) + (b29*s3) + (c28*s4) + (c29*s5) + (d28*s6) + (d29*s7)
DEST.i32[15] += (a30*s0) + (a31*s1) + (b30*s2) + (b31*s3) + (c30*s4) + (c31*s5) + (d30*s6) + (d31*s7)
If we take the total incoming signal of a neuron j to be the sum, over all afferent (incoming) neurons i, of the state of i times the efficacy (weight) of the pathway from i to j, we can use a single VP4DPWSSD to calculate 8 incoming pathways for 16 neurons.
Interpret s0 to s7 as the states of 8 incoming neurons, and a0, a1, b0, b1, c0, c1, d0, d1 as the weights of the pathways into the first neuron (DEST.i32[0]): a0 is the weight of the pathway from s0 to that neuron, a1 the weight from s1, b0 the weight from s2, and so on up to d1, the weight from s7.
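The mapping from a weight matrix to the four source registers can be sketched as follows. This is an illustration under stated assumptions: pack_weights, accumulate, and the example data are hypothetical, and the model ignores masking and 32-bit overflow, so it only mirrors the unmasked expansion listed above.

```python
def pack_weights(weights):
    """Lay out a 16x8 weight matrix as the four 32-word source registers.

    weights[j][i] is the weight of the pathway from input neuron i to
    output neuron j. Register m holds inputs 2*m and 2*m+1 for every neuron.
    """
    regs = [[0] * 32 for _ in range(4)]
    for j in range(16):          # output neuron -> dword lane j of DEST
        for i in range(8):       # input neuron  -> word i of the memory operand
            regs[i // 2][2 * j + i % 2] = weights[j][i]
    return regs


def accumulate(dest, regs, states):
    """Unmasked VP4DPWSSD-style accumulate over the four packed registers."""
    return [
        dest[j] + sum(
            regs[m][2 * j] * states[2 * m] + regs[m][2 * j + 1] * states[2 * m + 1]
            for m in range(4)
        )
        for j in range(16)
    ]


# Hypothetical example data: the weight of pathway i -> j is j - i.
weights = [[j - i for i in range(8)] for j in range(16)]
states = [1, -2, 3, -4, 5, -6, 7, -8]      # s0..s7

packed = pack_weights(weights)
totals = accumulate([0] * 16, packed, states)

# The packed result matches a plain per-neuron dot product.
expected = [sum(w * s for w, s in zip(row, states)) for row in weights]
assert totals == expected
```

A layer with more than 8 inputs per neuron could reuse the same layout by repeating the accumulate step with the next 8 states and the next four packed registers, feeding the previous totals back in as dest.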