Optimising
I spent a good chunk of yesterday trying to optimise the cexpi implementation to get rid of as many register access penalties as possible and dual-issue as many instructions as possible. There still seems to be a bit of magic there but i'm getting there.
what              clock   ialu    fpu    dual   e1 st   ra st   loc fe   ex ld
e_build_wtable    29637   21347   9548   8866      24    7563       21      36
e_build_wtable2   20149   17656   8184   6138      24     402       28      36
build_wtable2 is based on the 5+4 term sincos from the previous post so it requires fewer flops to start with, but the main point of interest is the greatly reduced register stalls.
Primarily I just wasn't putting enough code between the flop instructions, but that can be easier said than done.
I used a technique I've talked about previously to be able to add more ialu instructions within the flops. Basically the loop is unrolled "in-place", whereby some results required for the next loop are calculated in the current one and otherwise the loop conditions remain the same.
A fairly detailed summary of the steps follows. This calculates two tables of sin/cos pairs together. The polynomial approximation is only valid over (0..pi/4) and so to calculate sin(pi/4+i) it calculates cos(pi/4-i) and so on.
build_wtable(complex float *wtable, int logN, int logStride, int logStep, int w0)
    load constants
    calculate loop count
loop:
    all integer ops:
        // 1: calculate frequency of this coefficient
        f1 = w0 << shift;
        f0 = f1 * 2;
        // 2: calculate octant
        j0 = f0 >> shift2;
        j1 = f1 >> shift2;
        // 3: calculate fractional part of frequency and mirror if required
        x0 = f0 & mask;
        x0 = j0 & 1 ? (scale - x0) : x0;
        x1 = f1 & mask;
        x1 = j1 & 1 ? (scale - x1) : x1;
        // 4: calculate sign of result
        negcos0 = ((j0+2) >> 2) << 31;
        negsin0 = ((j0) >> 2) << 31;
        negcos1 = ((j1+2) >> 2) << 31;
        negsin1 = ((j1) >> 2) << 31;
        // 5: determine if cos/sine needs swapping
        swap0 = (j0+1) & 2;
        swap1 = (j1+1) & 2;
    all float ops:
        // 6: scale input
        a0 = to_float(x0) * 1.0 / scale;
        a1 = to_float(x1) * 1.0 / scale;
        // 7: calculate sin/cos over pi/4
        b0 = a0 * a0;
        b1 = a1 * a1;
        ... calculate sin'(a0/a1) and cos'(a0/a1) using fmadd+fmul
        // 8: swap if necessary
        cos0 = swap0 ? sin'0 : cos'0;
        sin0 = swap0 ? cos'0 : sin'0;
        cos1 = swap1 ? sin'1 : cos'1;
        sin1 = swap1 ? cos'1 : sin'1;
    all integer ops:
        // 9: fix sign - twiddles bit-31 of ieee float
        as_int(cos0) ^= negcos0;
        as_int(sin0) ^= negsin0;
        as_int(cos1) ^= negcos1;
        as_int(sin1) ^= negsin1;
        // a: output
        *wtable++ = cos0 + I * sin0;
        *wtable++ = cos1 + I * sin1;
        // b: next step
        w0 += wstep;
endloop
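The as_int() business in step 9 works by flipping the IEEE 754 sign bit (bit 31) with a plain integer xor, so no fpu latency is involved. A minimal C sketch of just that part follows; sign_flip is only an illustrative name, not the actual implementation:

#include <stdint.h>
#include <string.h>

/* Illustrative only: negate an IEEE-754 single (or leave it alone) by
   xor-ing the sign bit.  'neg' is either 0 or 0x80000000, as produced
   by step 4 of the loop above. */
static inline float sign_flip(float x, uint32_t neg)
{
        uint32_t bits;
        memcpy(&bits, &x, sizeof bits);   /* as_int(x)                 */
        bits ^= neg;                      /* integer op, no fpu latency */
        memcpy(&x, &bits, sizeof x);
        return x;
}

/* e.g. cos0 = sign_flip(cos0, negcos0); sin0 = sign_flip(sin0, negsin0); */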
I won't show the detail of the calculation itself but I tried using Estrin's method for the polynomial expansion to try to reduce the latency involved. Actually it ended up a bit interesting because it shifted where the constants needed to be loaded for the 3-register fmadd (since it modifies its first operand, the constants need loading there) - and meant there were fewer ways to dual-issue instructions; more below. I might end up trying the other way too - but this stuff just sucks so much time it's crazy.
So basically there are a bunch of integer ops followed by float ops followed by a few int ops to finish off. Now comes the tricky bit where 'optimisation' comes into play. Firstly, float operations all have a latency, which means that if you just use them in the order shown you won't get good performance.
Even just listing it the way I have is a simple optimisation.
For example:
a0 = to_float(x0) * 1.0 / scale;
b0 = a0 * a0;
a1 = to_float(x1) * 1.0 / scale;
b1 = a1 * a1;
Would take approximately 2x longer than:
a0 = to_float(x0) * 1.0 / scale;
a1 = to_float(x1) * 1.0 / scale;
b0 = a0 * a0;
b1 = a1 * a1;
If one looks at the basic instructions annotated with the delay slots it's pretty obvious which is better (I might be a bit off with the delay slots):
inline              interleaved
float r8,r0         float r8,r0
-                   float r10,r1
-                   -
-                   -
-                   -
fmul r8,r8,r32      fmul r8,r8,r32
-                   fmul r10,r10,r32
-                   -
-                   -
-                   -
fmul r9,r8,r8       fmul r9,r8,r8
float r10,r1        fmul r11,r10,r10
-
-
-
-
fmul r10,r10,r32
-
-
-
-
fmul r11,r10,r10
But you can only do this optimisation if you have enough registers to fully calculate each value separately (this is an important point, and why the cpu has so many registers).
The other factor is that the epiphany cpu can dual-issue certain sets of instructions in certain circumstances. Fortunately the circumstances are fairly simple (for the most part, some details still elude me) and it's basically a float + ialu or load op. The precise behaviour seems to depend on instruction order/alignment but i'm not really sure on all the details yet.
So this means, for example, that any of the integer operations can be interleaved with the float operations and effectively become 'zero cost' - assuming there is some 'fixed cost' based on the number of float ops and their scheduling requirements. Cool, but it complicates things: dual-issuing an instruction means it doesn't fill any of those latency slots, so you need to find even more instructions!
Examining the algorithm there are some hard dependencies which force a particular calculation order but a bunch of the integer calculations aren't needed till the end.
- Steps 1-3
- These are required before any float ops can be performed. Obviously these cannot be dual-issued (there are no float ops yet to pair them with).
- Steps 4-5
- These can be performed anywhere and they are not needed until the end. These are prime candidates for dual issue.
- Steps 6-7
- These are hard orderings that add float latencies which cannot be avoided. They also require some register moves to set up the 3-register fmadd instructions, which can make use of some of those slots and dual-issue.
- Steps 8-a
- These are pretty much hard-dependencies on the output. Fortunately there are no ra penalties for integer operations but it does mean there is an integer/fpu op mismatch; as odd as it sounds it may actually help to convert these operations into floating point ops.
- Step b
- Can be put almost anywhere.
Actually ... I lied. All of those steps can be interleaved with others ... by calculating the value you will need for the next loop (or next next) instead. This requires a ton of registers - I used every 'free' register up and had to save a few to the stack besides. I think the epiphany ABI could use some tuning for size/performance but I don't have enough data to back up any changes; and it isn't simple because any change has side effects.
The algorithm then becomes changed to:
setup first loop
loop:
    calculate result, interleaved with setup for next loop
endloop
The tricky bit comes in tracking the state of all the registers across the algorithm and ensuring the right value is calculated and in the correct register when it is needed. The simplest way to do this is to use a new register for every calculation that might be needed later; then bits of code can be re-arranged 'at will' without becoming too much of a complicated mess to manage in your head. And hope you don't run out of registers.
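As a shape-only sketch (the names and the trivial loop body here are made up, this isn't the real build_wtable code), the 'unrolled in-place' structure ends up looking something like this in C, with the next iteration's integer set-up hoisted into the current one so it has float work to dual-issue against:

#include <complex.h>

/* Hypothetical sketch: 'w' is the only state carried across iterations
   here; in the real routine it is a dozen or more registers. */
static void pipelined_loop(complex float *out, int count, int w0, int wstep)
{
        int w = w0;                  /* set-up for the first iteration      */
        for (int i = 0; i < count; i++) {
                int wnext = w + wstep;   /* next iteration's set-up, done early */
                float a = (float)w;      /* ... this iteration's float work ... */
                float b = a * a;
                out[i] = b + I * a;      /* finish and store this iteration     */
                w = wnext;               /* roll the pipelined state forward    */
        }
}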
Still, it's a little hard on the head and eyes so I'm thinking of writing a tool which can help with some of the work. I started with a tool that parsed the assembly and dumped a table of register liveness; but then I realised that to go any further with that I'd basically need to write a whole assembler, so I started work on an instruction decoder that can pass over the compiled binary instead, which should be easier. Epiphany's instruction set is fairly tiny so at least that shouldn't be too much work. Well, I'll see on that one, ... it's still a lot of work.
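For what it's worth, the liveness dump itself isn't much code once something has produced use/def masks per instruction; a rough sketch (hypothetical structures and a deliberately crude liveness rule, nothing to do with any real assembler or decoder) might be:

#include <stdio.h>

#define NREGS 64

/* Hypothetical decoded instruction: which registers it reads and writes,
   as 64-bit masks.  A real tool would fill these from the parsed
   assembly or the decoded binary. */
struct insn {
        unsigned long long use;   /* registers read    */
        unsigned long long def;   /* registers written */
};

/* Crude liveness table: mark a register live at instruction i if it is
   written at or before i and read at or after i (ignores re-definitions
   killing a value, which is good enough for eyeballing hand-written asm). */
static void dump_liveness(const struct insn *code, int n)
{
        for (int i = 0; i < n; i++) {
                unsigned long long later_use = 0, earlier_def = 0;
                for (int j = i; j < n; j++)
                        later_use |= code[j].use;
                for (int j = 0; j <= i; j++)
                        earlier_def |= code[j].def;
                unsigned long long live = later_use & earlier_def;

                printf("%3d: ", i);
                for (int r = 0; r < NREGS; r++)
                        putchar((live >> r) & 1 ? '#' : '.');
                putchar('\n');
        }
}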
Estrin's Method and 3-register FMA
So whilst doing a pass over this post I noticed the anomaly of the greatly reduced ialu count in the new implementation. I did notice when I wrote the initial implementation that it seemed rather small compared to the previous one - I put it down to the reduced term count but on reflection it is more than that.
So the basic polynomial expansion using Horner's Rule for sin+cos() in terms of 3-register fmadd instructions becomes:
; sin
; a (A + a^2 (B + a^2 (C + a^2 D)))
; r0 = a
; r1 = a^2
  mov   r2,rC      ; t0 = C
1 fmadd r2,r1,rD   ; t0 += a^2 D
  mov   r3,rB      ; t1 = B
2 fmadd r3,r2,r1   ; t1 += a^2 (C + a^2 D)
  mov   r2,rA      ; t0 = A
3 fmadd r2,r3,r1   ; t0 += a^2 (B + a^2 (C + a^2 D))
4 fmul  r2,r2,r0   ; t0 *= a

; cos
; A + a^2 (B + a^2 (C + a^2 (D + a^2 E)))
; r0 = a
; r1 = a^2
  mov   r2,rD      ; t0 = D
1 fmadd r2,r1,rE   ; t0 += a^2 E
  mov   r3,rC      ; t1 = C
2 fmadd r3,r2,r1   ; t1 += a^2 (D + a^2 E)
  mov   r2,rB      ; t0 = B
3 fmadd r2,r3,r1   ; t0 += a^2 (C + a^2 (D + a^2 E))
  mov   r3,rA      ; t1 = A
4 fmadd r3,r2,r1   ; t1 += a^2 (B + a^2 (C + a^2 (D + a^2 E)))
Note that the cos 'A' is not the same as the sin 'A'. The digit in front of the fpu ops is the 'stage' of the calculation in terms of register dependencies.
The same using Estrin's method:
; sin
; a ((A + a^2 B) + a^4 (C + a^2 D))
; r0 = a
; r1 = a^2
1 fmul  r2,r1,r1   ; a^4 = a^2 . a^2
  mov   r3,rA      ; t0 = A
  mov   r4,rC      ; t1 = C
1 fmadd r3,r1,rB   ; t0 = A + a^2 B
1 fmadd r4,r1,rD   ; t1 = C + a^2 D
2 fmadd r3,r2,r4   ; sin'= (A + a^2 B) + a^4 (C + a^2 D)
3 fmul  r3,r3,r0   ; sin = a sin'

; cos
; A + a^2 ((B + a^2 C) + a^4 (D + a^2 E))
; r0 = a
; r1 = a^2
1 fmul  r2,r1,r1   ; a^4 = a^2 . a^2
  mov   r3,rB      ; t0 = B
  mov   r4,rD      ; t1 = D
1 fmadd r3,r1,rC   ; t0 = B + a^2 C
1 fmadd r4,r1,rE   ; t1 = D + a^2 E
2 fmadd r3,r2,r4   ; cos'= (B + a^2 C) + a^4 (D + a^2 E)
  mov   r4,rA      ; t1 = A
3 fmadd r4,r3,r1   ; cos = A + a^2 ((B + a^2 C) + a^4 (D + a^2 E))
So 15 instructions vs 15 instructions, but 7 constant loads vs 5. Well, that's interesting. With dual-issue the cost of the extra loads is hidden in a snippet like this, but if a routine has more ialu ops than flops then the version with fewer loads leaves more opportunities for scheduling the rest.
Even if the instruction set had a 4-register fma instruction the second might be better due to needing one less stage of calculation.
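For reference, the same two evaluation orders written in C using fmaf(); A..D here are placeholder coefficients, not the actual minimax constants from the earlier post:

#include <math.h>

/* Horner: each fmaf depends on the previous one, so the chain is 4 deep. */
static inline float sin_horner(float a, float a2,
                               float A, float B, float C, float D)
{
        float t = fmaf(a2, D, C);        /* C + a^2 D                     */
        t = fmaf(a2, t, B);              /* B + a^2 (C + a^2 D)           */
        t = fmaf(a2, t, A);              /* A + a^2 (...)                 */
        return a * t;
}

/* Estrin: the two inner terms are independent, so the chain is only
   3 deep, at the cost of computing a^4. */
static inline float sin_estrin(float a, float a2,
                               float A, float B, float C, float D)
{
        float a4 = a2 * a2;
        float lo = fmaf(a2, B, A);       /* A + a^2 B                     */
        float hi = fmaf(a2, D, C);       /* C + a^2 D                     */
        return a * fmaf(a4, hi, lo);     /* (A + a^2 B) + a^4 (C + a^2 D) */
}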
ABI?
I wish the abi was a tad bit more asm-hacker friendly. A compiler doesn't care where the spare registers are, but keeping track of the sparse ranges by hand is a pain. I think it would benefit from having all 8 'zero page' registers available as scratch anyway.
I wouldn't mind something closer to this:
r0-r3      argument / saved by caller
r4-r7      gp / saved by caller
r8-r23     saved by callee / hardware (lr only) / constants
r24-r63    gp / saved by caller
Rather than this:
r0-r3      argument / saved by caller
r8-r11     saved by callee
r12        scratch / saved by caller
r13-r15    saved by callee / hardware
r16-r27    gp / saved by caller
r28-r31    constants
r32-r43    saved by callee
r44-r63    gp / saved by caller
I don't really see the need for any of the arm-legacy register assignments such as r12 == ip. This would leave 48 registers free for leaf functions to use (rather than the current 37) without having to resort to saving registers, and importantly it leaves r0-r7 free, which can have a huge impact on code-size since only the low registers are reachable from the 16-bit instruction encodings. And it would still leave enough registers for outer loops and so on. The abi is designed to also work with a cut-down cpu which i think has 16 registers, but the proposed one would work there as well.
But yeah, who knows; any change isn't cost-free, and if your code is calling lots of small functions rather than big processing leaf functions then it might not work so well (but yeah, it won't anyway). I previously tried patching gcc to modify the abi but I didn't fully complete it.