asm v c
For a bit of 'fun' I thought I'd see how translating some of the fft code to assembly would fare.
I started with this:
void build_wtable(complex float *wtable, int logN, int logStride, int logStep, int w0) {
        int wcount = 1<<(logN - logStep - 2);
        int wshift = logStride;

        for (int i=0;i<wcount;i++) {
                int j0 = w0 + (i<<wshift);
                int f2 = j0 << (CEXPI_LOG2_PI - logN + logStep - logStride + 1);
                int f0 = f2 * 2;

                wtable[i*2+0] = ez_cexpii(f0);
                wtable[i*2+1] = ez_cexpii(f2);
        }
}
It generates the twiddle factors required for a given fft size relative to the size of the data. ez_cexpii(x) calculates e^(I * x * PI / (2^20)), as previously discussed.
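For reference, here is a naive floating-point sketch of what ez_cexpii() is documented to compute - not the optimised routine actually used, just its mathematical equivalent via the standard library:

#include <complex.h>
#include <math.h>

/* Reference only: e^(I * x * PI / 2^20) using cexpf(). */
static complex float ref_cexpii(int x) {
        return cexpf(I * (float)x * (float)M_PI / (float)(1 << 20));
}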
This is called for each radix-4 pass - 5 times per 1024 elements, and generates 1, 4, 16, 64, or 256 complex value pairs per call. I'm timing all of these together below so this is the total time required for twiddle factor calculation for a 1024-element FFT in terms of the timer counter register.
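To make the calling pattern concrete, here is a sketch of the kind of driver loop implied above, using build_wtable() as defined earlier. The logStride/w0 values and the loop direction are assumptions for illustration only; the point is that with logN = 10, stepping logStep through 8, 6, 4, 2, 0 makes wcount = 1<<(logN - logStep - 2) come out as 1, 4, 16, 64 and 256 pairs.

#include <complex.h>

#define LOG_N 10        /* 1024 elements */

static complex float wtable[2 * 256];   /* large enough for the biggest pass */

/* Hypothetical driver: generates the twiddle factors for all 5 radix-4 passes. */
static void build_all_wtables(void) {
        for (int logStep = LOG_N - 2; logStep >= 0; logStep -= 2)
                build_wtable(wtable, LOG_N, 0, logStep, 0);
}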
The results:
what                 clock   ialu   fpu  dual  e1 st  ra st  loc fe  ex ld  size
noop                    23      4     0     0      9     16       8      6
build_wtable         55109  36990  9548  6478      9   7858       1      6   372
build_wtable_il      62057  25332  9548  4092      9  27977       8      6   618
e_build_wtable       31252  21999  9548  8866      9   7518      13      6   464
e_build_wtable_hw    29627  21337  9548  8866     24   7563      16     36   472
1 * e_build_wtable     148     99    28    26      9     38       1      6

$ e-gcc --version
e-gcc (Epiphany toolchain (built 20130910)) 4.8.2 20130729 (prerelease)
(30 000 cycles is 50µs for a clock of 600MHz)
- noop
- This just calls ez_ctimer0_start(timerid, ~0); ez_ctimer0_stop(); and then tallies up the counters, giving a baseline for a single timing operation (a sketch of the pattern appears after this list).
- build_wtable
- This is the code as in the fft sample.
- build_wtable_il
- This is the same as above but the ez_cexpii() call has been inlined by the compiler.
- e_build_wtable
- The assembly version. This is after a few hours of poking at the code trying to optimise it further.
- e_build_wtable_hw
- The same, but utilising the hardware loop feature of the Epiphany.
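For context, the measurement behind each row boils down to something like the sketch below. The ez_ctimer0_start()/ez_ctimer0_stop() names appear in the listing further down; the timerid argument and the ez_ctimer0_get() read-back helper are assumptions made here purely for illustration.

/* Hypothetical harness: count one event (clock, ialu, ...) around a call. */
static unsigned int time_event(int timerid, void (*fn)(void)) {
        ez_ctimer0_start(timerid, ~0);  /* ctimer0 counts the event down from 0xffffffff */
        if (fn)
                fn();                   /* code under test; NULL gives the noop baseline */
        ez_ctimer0_stop();
        return ~ez_ctimer0_get();       /* elapsed = 0xffffffff - remaining count */
}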
Discussion
Well, what can I say - the compiler loses again.
I'm mostly surprised that inlining the cexpii() function doesn't help at all - it actively hinders. I confirmed from the assembler listing that the code is being inlined, but the compiler seems to more or less paste the body in twice rather than interleaving the two calls to create additional scheduling opportunities. As can be seen, the ialu op count drops significantly because the constant loads are hoisted outside the loop, but that is offset by a decrease in dual-issue and a big jump in register stalls.
Whilst the compiler generated a branch in the cexpii function, I managed to go branchless in assembly by using movCC a couple of times. I load all the constants from a memory table using ldr, which reduces the ialu op count a good bit, although that happens outside the inner loop.
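To make the movCC idea concrete: the hand-written version computes both candidate values and uses a conditional move to pick one. In C pseudo-form the pattern looks roughly like the sketch below - illustrative only, not the actual ez_cexpii() quadrant handling.

/* Pattern only: compute both results, then select with a conditional
 * move (movlt and friends on the Epiphany) instead of branching. */
static inline int branchless_select(int x, int if_negative, int otherwise) {
        int r = otherwise;
        if (x < 0)              /* maps onto cmp + movlt in the assembly version */
                r = if_negative;
        return r;
}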
The hardware loop feature knocks a little bit off the time, but the loop is fairly large so it isn't really much. I had to tweak the code a little to ensure it lined up correctly in memory, otherwise the loop got slower - i.e. so that the .balignw 8,0x01a2 directive ended up inserting nothing, although it is still present to guarantee correct operation.
Try as I might, I was unable to get every flop to dual-issue with an ialu op, although I got 26 out of 28 to pair, which seems ok(?). I don't fully understand how dual-issue interacts with things like instruction order and alignment, so it was mostly trial and error. Sometimes moving an instruction which had no other dependency would change the timing by a bigger jump than it should have, or adding one ialu instruction would make the total clock time 3 cycles longer.
I'm not sure I understand the register stall column completely either. For example, calling the function such that a single result pair is calculated gives the last row of the table above: 38 stalls - where from? The noop case is easier to analyse in detail: where do those 16 stalls come from? AFAICT the instruction sequence executed is literally:
  32:   3feb 0ff2    mov r1,0xffff
  36:   3feb 1ff2    movt r1,0xffff
  3a:   10e2         mov r0,r4
  3c:   1d5f 0402    jalr r15

    -> calls 00000908 <_ez_ctimer0_start>:

 908:   6112         movfs r3,config
 90a:   41eb 0ff2    mov r2,0xff0f
 90e:   5feb 1ff2    movt r2,0xffff
 912:   4d5a         and r2,r3,r2
 914:   4102         movts config,r2
 916:   390f 0402    movts ctimer0,r1
 91a:   0096         lsl r0,r0,0x4
 91c:   487a         orr r2,r2,r0
 91e:   4102         movts config,r2

    -> clock starts counting NOW!

 920:   194f 0402    rts

    -> returns to timing loop:

  40:   0112         movfs r0,config
  42:   02da         and r0,r0,r5
  44:   0102         movts config,r0

    -> clock stops counting NOW!

  46:   391f 0402    movfs r1,ctimer0
  4a:   1113         add r0,r4,2
  4c:   270a         eor r1,r1,r6
  4e:   0056         lsl r0,r0,0x2
    etc.
Attempting to analyse this:
- 4 ialu - integer alu operations
- Yep, 4 instructions are executed.
- 6 ex ld - external load stalls
- This is from reading the config register.
- 8 loc fe - local fetch stalls
- From the rts switching control flow back to the caller?
- 16 ra st - register dependency stalls
- Considering no registers are being directly accessed in this section it's a little weird. rts will access lr (the link register) but that shouldn't stall. The instruction fetch will access the PC, but I'm not sure why that would stall.
Still puzzled. 23 clock cycles for { rts, movfs, and, movts }, apparently.