That scheduling ...
Had some other stuff I had to poke at the last couple of nights, and I needed a bit of a rest anyway. Pretty stuffed tbh, but I want to drop this off to get it out of my head.
Tonight I finally got my re-scheduled inner loop to work. Because I'm crap at keeping it all in my head I basically made a good guess and then ran it on the hardware using the profiling counters, tweaking until it stopped improving (actually until I'd removed all RA stalls and had every FLOP a dual-issue). Although now that it's running for realz it looks like one of the dual-issues has dropped out - it depends on things like alignment and memory contention.
But the results so far ...
Previous "best" scheduling              New "improved" scheduling
CLK              = 518683470 (1.3x)     CLK              = 403422245
IALU_INST        = 319357570            IALU_INST        = 312638579
FPU_INST         = 118591312            FPU_INST         = 118591312
DUAL_INST        =  74766734 (63% rate) DUAL_INST        = 108870170 (92% rate)
E1_STALLS        =  11835823            E1_STALLS        =  12446143
RA_STALLS        = 122796060 (2.6x)     RA_STALLS        =  47086269
EXT_FETCH_STALLS =         0            EXT_FETCH_STALLS =         0
EXT_LOAD_STALLS  =   1692412            EXT_LOAD_STALLS  =   1819284
The 2-region loop is 33 instructions including the branch, so even a single cycle improvement is measurable.
I haven't re-scheduled the '3-region' calculation yet, so it can gain a bit more. But as can be seen from the instruction counts the gain is purely from rescheduling. The IALU instruction count is different because I improved the loop mechanics too (all of one instruction?).
As a quick comparison this is what the C compiler comes up with (-O2). I'm getting some different results to this at the moment so the comparisons here are only provisional ...
CLK              = 1189866322 (2.9x vs improved)
IALU_INST        =  693566877
FPU_INST         =  131085992
DUAL_INST        =   93602858 (71% rate)
E1_STALLS        =   31768387 (2.5x vs improved)
RA_STALLS        =  322216105 (6.8x vs improved)
EXT_FETCH_STALLS =          0
EXT_LOAD_STALLS  =   14099244
The number of flops is pretty close though, so it can't be far off. I'm doing a couple of things the C compiler isn't, so the number should be a bit lower. Still not sure where all those external load stalls are coming from.
Well the compiler can only improve ...
In total elapsed time terms these are something like 1.8s, 0.88s, and 0.60s from slowest to fastest on a single core. I only have a multi-core driver for the assembly versions. On 1 column of cores best is 201ms vs improved at 157ms. With all 16 cores ... identical at 87ms. But I should really get those bugs fixed and a realistic test case running before getting too carried away with the numbers.
Update: I later posted in more detail about the scheduling. I tracked down some bugs so the numbers changed around a bit but nothing that changes the overall relationships.