NEON complex multiply
In the last post I mentioned writing a complex multiply for NEON.
It's actually a good demonstration of a NEON feature, data manipulation on loads, and since it's quite trivial I'll post it here.
Complex Multiply
As one might recall, a complex multiply:
C = A * B
Is implemented as the expansion:
C = A * B = (A.re + A.im j) * (B.re + B.im j) = (A.re * B.re - A.im * B.im) + (A.re * B.im + A.im * B.re) j
Where of course j*j = -1.
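For reference, the expansion can be sketched in plain C for data stored as interleaved (real, imag) float pairs. The function name cmult_scalar and its argument order are assumptions made for illustration; it is only meant as a baseline to check a SIMD version against.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference routine: c[i] = a[i] * b[i] for n complex
   numbers, each stored as an interleaved (real, imag) pair of floats. */
void cmult_scalar(float *c, const float *a, const float *b, size_t n)
{
    assert(c != NULL && a != NULL && b != NULL);
    for (size_t i = 0; i < n; i++) {
        float ar = a[2 * i], ai = a[2 * i + 1];
        float br = b[2 * i], bi = b[2 * i + 1];
        c[2 * i]     = ar * br - ai * bi;   /* real part */
        c[2 * i + 1] = ar * bi + ai * br;   /* imaginary part */
    }
}
```

Calling it with a = 1+2j and b = 3+4j gives -5+10j, which matches the expansion by hand.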
If the real and imaginary parts are stored in separate planes, this translates trivially to a set of SIMD instructions, but normally they are stored as (real, imag) pairs.
VLD2
Here is where VLD2 comes to the aid of the weary programmer. It will automatically unpack 2-element fields into separate registers and simply allow you to write the code as if the data was stored as planes to start with.
It wasn't quite clear from the documentation how it handled more than 4x2 elements but with an experiment I worked it out and it does the thing you'd expect, allowing you to use quad-word ops.
    Memory:
    $00000000: a.real a.imag b.real b.imag
    $00000010: c.real c.imag d.real d.imag

    LDR r0,=0
    VLD2 { d0-d3 }, [r0]

    Registers (as float2)
    d0 a.real b.real
    d1 c.real d.real
    d2 a.imag b.imag
    d3 c.imag d.imag

    Registers (as float4)
    q0 a.real b.real c.real d.real
    q1 a.imag b.imag c.imag d.imag
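The register layout above can be modelled in scalar C, which is handy for checking expectations on a machine without NEON. This is only a sketch of the de-interleaving behaviour for four 32-bit pairs; deinterleave4 is a made-up name, and the arrays re and im stand in for q0 and q1.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar model of a 2-element structure load of four float pairs:
   even-indexed floats land in one "register", odd-indexed in the other. */
void deinterleave4(const float src[8], float re[4], float im[4])
{
    assert(src != NULL && re != NULL && im != NULL);
    for (size_t i = 0; i < 4; i++) {
        re[i] = src[2 * i];       /* lane i of q0: real parts */
        im[i] = src[2 * i + 1];   /* lane i of q1: imaginary parts */
    }
}
```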
Code
By unrolling the loop 4x in SIMD and 2x in instructions one can perform 8 complex multiplies per loop:
    @ r0 is address of C
    @ r1 is address of A
    @ r2 is address of B
    cmult8:
        @ q8, q10 = A[0-7].real
        @ q9, q11 = A[0-7].imag
        @ q12, q14 = B[0-7].real
        @ q13, q15 = B[0-7].imag
        vld2.32 { d16-d19 },[r1]!
        vld2.32 { d24-d27 },[r2]!
        vld2.32 { d20-d23 },[r1]!
        vld2.32 { d28-d31 },[r2]!

        vmul.f32 q0,q8,q12      @ a.r * b.r [ 0-3 ]
        vmul.f32 q1,q9,q12      @ a.i * b.r
        vmul.f32 q2,q10,q14     @ a.r * b.r [ 4-7 ]
        vmul.f32 q3,q11,q14     @ a.i * b.r

        vmls.f32 q0,q9,q13      @ - a.i * b.i [ 0-3 ]
        vmla.f32 q1,q8,q13      @ + a.r * b.i
        vmls.f32 q2,q11,q15     @ - a.i * b.i [ 4-7 ]
        vmla.f32 q3,q10,q15     @ + a.r * b.i

        vst2.32 { d0-d3 },[r0]!
        vst2.32 { d4-d7 },[r0]!

        mov pc,lr
q4-q7 are the callee-saved registers, so I simply avoid having to save them by using the others.
There is a stall of a few cycles for the stores at the end, but in a loop one can load the next 8 complex values before the store to avoid it.
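The same multiply can also be sketched in C with the NEON intrinsics from arm_neon.h, four complex values per iteration. This is not a translation of the assembly above: the name cmult_f32, the requirement that n be a multiple of 4, and the scalar fallback for non-NEON builds are all assumptions added for illustration, and the compiler is left to schedule the loads and stores.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical wrapper: c[i] = a[i] * b[i] for n interleaved complex
   floats. n must be a multiple of 4 (assumption of this sketch). */
void cmult_f32(float *c, const float *a, const float *b, size_t n)
{
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
#if defined(__ARM_NEON)
        /* vld2q_f32 de-interleaves: val[0] = real parts, val[1] = imaginary */
        float32x4x2_t va = vld2q_f32(a + 2 * i);
        float32x4x2_t vb = vld2q_f32(b + 2 * i);
        float32x4x2_t vc;
        vc.val[0] = vmulq_f32(va.val[0], vb.val[0]);            /* a.r * b.r   */
        vc.val[1] = vmulq_f32(va.val[1], vb.val[0]);            /* a.i * b.r   */
        vc.val[0] = vmlsq_f32(vc.val[0], va.val[1], vb.val[1]); /* - a.i * b.i */
        vc.val[1] = vmlaq_f32(vc.val[1], va.val[0], vb.val[1]); /* + a.r * b.i */
        vst2q_f32(c + 2 * i, vc);                               /* re-interleave */
#else
        /* Scalar fallback so the sketch still builds without NEON */
        for (size_t k = i; k < i + 4; k++) {
            float ar = a[2 * k], ai = a[2 * k + 1];
            float br = b[2 * k], bi = b[2 * k + 1];
            c[2 * k]     = ar * br - ai * bi;
            c[2 * k + 1] = ar * bi + ai * br;
        }
#endif
    }
}
```

Whether this gets close to the hand-written version depends on the compiler and its flags, so it is worth measuring rather than assuming.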
C, NEON
I started pulling some of my experiments together into a prototype today and started to hit some annoying issues: pretty much anything to do with large arrays of floats in C is 3-4x slower than doing it in NEON.
I can feel a lot of NEON coming on ...