EPU elf loader, reloc, etc.
I've come to the point where I need to start looking at placing different code on different EPUs and having them talk to each other via on-chip writes ...
But the current SDK is a bit clunky here. One basically has to write custom linker scripts, custom loader calls, and then manually link various bits together either with custom compile-time steps or manual linking (or even hardcoded absolute addresses).
So ... I've been looking into writing my own loader which will take care of some of the issues:
- Allow symbolic lookup from host code, a-la OpenGLES;
- Allow standard C resolution of symbols across cores;
- Allow multi-core code to be loaded into different topologies/different hardware setups automatically.
Symbolic lookup
This is relatively straightforward: it just involves a bit of poking around the ELF file, and since ELF is designed for this kind of thing it takes very little code in the simple case.
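In the simple case that lookup is just a walk over .symtab with names taken from its string table. A minimal sketch, assuming the symbol table and string table have already been located in the image (a real loader also has to walk the section headers to find them, check bindings, and so on):

```c
#include <elf.h>
#include <string.h>

/* Return the value (address) of the named symbol, or -1 if it isn't
   present.  symtab/strtab point at the already-located .symtab
   section and its associated string table. */
long lookup_symbol(const Elf32_Sym *symtab, int nsyms,
                   const char *strtab, const char *name)
{
    for (int i = 0; i < nsyms; i++)
        if (strcmp(strtab + symtab[i].st_name, name) == 0)
            return (long)symtab[i].st_value;
    return -1;
}
```

The host-side API then just wraps this, much like eglGetProcAddress-style lookup on the GL side.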
Cross-core symbols
Fortunately the linker can do most of this; I just need a linker script, but one that can be shared across multiple implementations.
My idea is to have the linker work against "virtual" cores which are simply 1MB apart in the address space. Section attributes can place code or data blocks into individual cores or shared memory or tls blocks.
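To make that concrete, here's roughly how the virtual addressing might look from the C side. The base address, stride and section names here are made-up placeholders for illustration, not the real Epiphany memory map:

```c
#include <stdint.h>

#define VCORE_BASE   0x80000000u   /* hypothetical link-time base      */
#define VCORE_STRIDE 0x00100000u   /* "virtual" cores sit 1MB apart    */

/* Link-time address of an offset within virtual core n. */
static inline uint32_t vcore_addr(unsigned n, uint32_t offset)
{
    return VCORE_BASE + n * VCORE_STRIDE + offset;
}

/* Placement via section attributes, resolved by the shared linker
   script (shown as comments, since the script isn't part of this
   sketch):
     int workbuf[256] __attribute__((section(".vcore1.data")));
     void stage2(void) __attribute__((section(".vcore2.text")));
*/
```

Because the stride is fixed, cross-core references resolve to ordinary absolute addresses at link time, which is what lets standard C symbol resolution work across cores.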
Relocating loader
Because the cores are "virtual" the loader can then re-arrange them to suit the target topology and/or work-group. I'm going to rely on the linker creating relocatable code so I'm able to do this - basically it retains the reloc hunks in the final binary.
I'm not relying on position independent code for this - and actually that would just make life harder.
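The core of such a relocating loader is small: walk the retained reloc hunks and rebase each absolute word by the delta between where the linker put the virtual core and where the loader actually placed it. A sketch with an invented record layout (real ELF relocations also carry a type and a symbol index):

```c
#include <stdint.h>
#include <string.h>

/* Illustrative reloc record: "patch the 32-bit word at this offset". */
struct reloc { uint32_t offset; };

static void relocate_section(uint8_t *image, const struct reloc *rels,
                             int nrels, uint32_t link_base, uint32_t load_base)
{
    uint32_t delta = load_base - link_base;
    for (int i = 0; i < nrels; i++) {
        uint32_t v;
        memcpy(&v, image + rels[i].offset, 4);  /* read the stored address */
        v += delta;                             /* rebase it */
        memcpy(image + rels[i].offset, &v, 4);
    }
}
```

Since the code isn't position independent, every absolute reference has to be visited this way, which is exactly why the reloc hunks need to survive into the final binary.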
Linker too?
The problem is that the linker is going to spew if I try to put the same code into local blocks on different cores ... you know, simple things like maths routines that really need to be local to the core. The alternative is to build a different complete binary for each core ... but then you're stuck with no way to automatically resolve addresses across cores and you're back where you started.
So it's going to have to get a lot more involved than just a simple load and reloc.
I'm just hoping I can somehow leverage/trick the linker into creating a single executable that has most of the linking work done, and then I'm able to finish it off at runtime without having to do everything. Perhaps just duplicate all the sections common to all cores and then relocate and link in the per-core blocks.
Hmm, I think I need to think about this a bit more first.
Hmm, Valve, AMD, Nvidia?
Hmm, so have Valve and the Gabester finally managed to do what common sense and economics couldn't?
That is, get AMD and perhaps even Nvidia to start working on proper GPU drivers for Linux?
Nvidia just announced that they're going to start helping the GPL driver effort all of a sudden. AMD are teasing about a GNU/Linux and game related announcement in under 12 hours. And Valve's "SteamStation" is being announced one way or another in under 12 hours too.
I guess we'll know soon enough ... it'll give me something to read in the morning unless I wake up at 4am again ...
I'm most interested in what AMD have to announce. The best we can hope for is a properly-free reference implementation of GPU + HSA for AMD APU machines - this is probably in the realm of dreaming but you never know because it makes a hell of a lot of sense economically. And fits some of their HSA related rumblings. Add in a range of "desktop" parts from low to high powered to match and it could be an interesting day. HSA has the potential to be the biggest leap in IBM compatible PC architecture in history - even if it is just all the way back to 1995 (Amiga).
SteamOS is interesting to me beyond the GameOS potential. Having an 'under tv' option which isn't Sony, or XBMC, or Google has to be a good thing. Android is a pretty sucky 'spin' of GNU/Linux. The optimistic part of me also looks forward to the announcement of some sort of OpenGL based display mechanism that would finally fuck The X Window System and its other shitty replacements right off into the dustbin of history where they belong. Actually I take back what I said about being most interested in what AMD have to say, a replacement for X that isn't just X-again wayland or ubuntu-i-can't-believe-it's-not-linux's mir would be very, very welcome.
One hopes that Nvidia's announcement is also genuine (and also involved in Valve's announcement) and not just a cynical response to something AMD/Valve are expected to say. Because of Nvidia's shithouse opencl support and performance on their mainstream parts, they are still "dead to me" - but that isn't a universal opinion.
Post-press
Well ... that was unexpected. A proprietary game api? Oh-kay.
So I guess AMD want to play the market power card? After wrapping up all the consoles?
I thought the whole point of the HSA design was to improve the efficiency of existing apis ...
Still, at this point there aren't enough details to really make much of a judgement call. No real info about Linux either, apart from a mention of the "importance" of cross-platform support (but that could mean anything).
I guess this was an announcement of game cards, and game cards are bought by game players and game players buy game cards based on game benchmarks ... and a smaller API could definitely make a big difference there.
So I guess we'll just have to continue to wait and see on the APU and HSA fronts, and the same goes for the steam-machine. Poo to that.
Balanced, but fair?
So I've been following the xbone trainwreck over the last few months. It's been pretty entertaining, I like to see a company like m$ get their comeuppance. It's been a complete PR disaster, from the meaningless "180" to the technical specifications.
The FUD is stinking up the discourse a little though so here's a bit of fudbusting of my own.
Balance
Balance is an interesting term - but only if you've fucked up somewhere. Every system aims to be balanced within its externally determined constraints - that's pretty much the whole point of systems engineering. It relates to the efficiency of a given design but says NOTHING WHATSOEVER about its performance.
One of the main constraints is always cost and clearly that was one of the major factors in the xbone cpu design. Within the constraints of making a cheap machine it may well be balanced but it's certainly not as fast as the current competition.
m$ are trying to use the chief ps4 engineer's words against him in that he stated that they have more CUs than is strictly necessary for graphics - but the design is clearly intended to use the CUs for compute from the start. And in that scenario the xbone's gpu becomes unbalanced as it has inadequate ALU.
For the sort of developer that works on games I imagine GPU coding is really pretty easy. And with the capabilities of the new HSA-capable devices it should be efficient too - as soon as one has any sort of parallel job just chuck that routine on a GPU core instead of the cpu. Not catering for this seems short-sighted at best.
"Move engines"
These are just plain old DMA engines. Every decent personal computer has had them since the Amiga 1000. They have them because they are useful but there's nothing particularly special or unique about them today and the AMD SOC in both consoles will have these - in fact they will have several.
Even the beagleboard has a few of them (I can't remember if it's 2 or 4), and they can do rectangle copies, colour fill and even chroma-key. The CELL BE in the PS3 has a 16-deep DMA queue on each SPU - allowing up to 16 in-flight DMA operations PER SPU (i.e. 112 per CELL BE, not including other DMA engines). The epiphany core has 2 2-D DMA channels per EPU - or 32 independent channels for a 16-core chip.
They don't take too much hardware to implement either, just a couple of address registers, adders and a memory interface/arbiter (the biggest bit).
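A software model of a 2-D engine makes the point: it really is just a pair of address registers stepped by strides. This is an illustrative model, not any particular chip's register interface:

```c
#include <stdint.h>
#include <string.h>

/* A software model of a 2-D DMA "move engine": source and destination
   addresses, a stride (row pitch) each, and a rectangle size. */
struct dma2d {
    const uint8_t *src;
    uint8_t *dst;
    int src_stride, dst_stride;   /* bytes per row */
    int width, height;            /* rectangle; width in bytes */
};

static void dma2d_run(const struct dma2d *d)
{
    const uint8_t *s = d->src;
    uint8_t *t = d->dst;
    for (int y = 0; y < d->height; y++) {
        memcpy(t, s, d->width);   /* one row per "burst" */
        s += d->src_stride;       /* the adders stepping the registers */
        t += d->dst_stride;
    }
}
```

The hardware version does the same thing while the CPU gets on with something else, which is the whole point; the arbiter that shares the memory port is the expensive bit.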
Hardware Scaler & "Display Planes"
i.e. overlays. Video hardware has had this sort of functionality for a couple of decades. IIRC even the lowly beagleboard has 3 "display planes", one of which has an alpha channel, two of which can be scaled independently using high quality multi-tap filters, and two of which support YUV input. Basically they're used for a mouse pointer and a video window, but they could be used for more.
Overlays are really useful if you have a low-bandwidth/low-performance system because of the "free" scaling and yuv conversion, but aren't going to make much of a difference on a machine like the xbone. For example, even at 68GB/s one can read or write over 8000 32-bit 1080P frames per second, so you're looking at only a few percent maximum render time on a 50fps display for blending and scaling several separate display planes.
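The arithmetic behind that claim, worked through (a 32-bit 1080P frame is 1920x1080x4, about 8.3MB):

```c
/* Bytes in one 32-bit 1080P frame. */
static double frame_bytes(void)
{
    return 1920.0 * 1080.0 * 4.0;
}

/* Full frames readable (or writable) per second at bw bytes/second. */
static double frames_per_second(double bw)
{
    return bw / frame_bytes();
}

/* Percentage of a 50fps frame period one full-screen pass costs. */
static double pass_cost_percent(double bw)
{
    return 100.0 * 50.0 / frames_per_second(bw);
}
```

At 68e9 bytes/second that comes out around 8200 frames per second, so a single full-screen pass is well under one percent of a 50fps frame period; even half a dozen blended planes stay in the few-percent range.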
Good to have - sure, but no game-changer and certainly no unique 'value add'.
DRM & the 180
Personally I don't think anything much changed with their "180" on the DRM thing. DRM is still there, and even without a nightly parole check there are plenty of ways to have effectively the same thing. e.g. make a game pretty shit without being constantly on-line, tie a given disk to an account the first time you use it, and so on. And whatever they were planning could always be turned on at the flick of a switch at any future point in time (it needn't have to work with any game previously published, just with ones published after that point).
BTW Sony are really no better here despite all the free PR they wallowed in. Sure they never tried the really dumb idea of banning second hand sales of physical discs (it was an absurd idea anyway as much of the money you might make back from it would be swallowed in administration costs and given it would kill the used game market it would probably just end up being revenue negative). But they're making download-only attractive enough that people are foregoing their rights for convenience and the end result is pretty much the same.
All consoles have always been heavily laden with DRM - it was always one of their selling points to developers to negate the wide-spread sharing that everyone does on personal computers.
I can't see the difference...
This is just straight-up PR speak for "we don't expect the average (i.e. uneducated) `consumer' to notice the difference".
Would you like some condescension with that?
It's all FUD and Games
The great thing about FUD is you don't even have to do much. Say a couple of things in the right places and you get ill-informed but well-intentioned people doing all your work for you. They don't even realise they've been manipulated.
We'd all let the games speak for themselves if we could actually see them ... but developers have to sign NDAs that won't let them talk about the differences, and rumours suggest they're not even allowed to show the games side-by-side at trade shows. So telling people to "see the games" is being very dishonest at best. It's just a FUD technique to try to get people locked in to buying their product. Once they get it home and see they've been sold a lemon few will be motivated to do anything about it, and if they get enough early adopters the network effects take over (this is why they're scrambling so much even though they were clearly aiming for a 2014 launch - it has to be now or never, at least in their grand plan).
From what we can see the xbone was basically created as the end-game for their trojan horse in the home idea - a $700 hand-wavey remote control that you have to pay a subscription to use, and which monitors demographics and viewer reactions and serves advertisements appropriately. Playing games is only a secondary function - as can clearly be seen by the technical specifications.
If playing games was the primary function of the design then they simply "done fucked up". A company this big doesn't waste this much money over the course of a decade to fuck up at the end of it.
Scheduling in detail
Just some notes on optimising the assembly language version of the viola-jones cascade walker I've been working on for the Epiphany chip.
I'm still working toward a tech demo but I got bogged down with the details of the resampling code - I'm using the opportunity to finally grok how upfirdn works.
Excuse the typos, I did this in a bit of a rush and don't feel like fully proof-reading it.
The algorithm
First the data structure. This allows the cascade to be encoded using dword (64-bit) alignment. It's broken into 64-bit elements for C compatibility.
union drecord {
unsigned long long v;
struct {
unsigned int flags;
float sthreshold;
} head0;
struct {
unsigned int count2;
unsigned int count3;
} head1;
struct {
unsigned short a,b,c,d;
} rect;
struct {
float weight1;
float fthreshold;
} f0;
struct {
float fsucc,ffail;
} f1;
};
And then a C implementation using this data structure. The summed-area-table (sat) sum calculates the average of all pixels within that rectangle. The sat table size is hard-coded to a specific width and encoded into the compiled cascade. Because it is only ever processed as part of a window of a known size this doesn't limit its generality.
It performs a feature test on a 2-region feature which equates to either a "this half is brighter than the other half" test in all 4 directions, or a "the middle half is brighter than the two quarter sides" in both directions and senses.
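For reference, this is the standard 4-corner summed-area-table lookup the cascade code relies on, sketched with a small fixed-width table. The compiled cascade pre-computes the a,b,c,d corner indices instead of calculating them per test:

```c
#define W 8   /* hard-coded table width, as in the compiled cascade */

/* Build an exclusive summed area table: sat has (W+1) columns and
   (h+1) rows, and sat[y*(W+1)+x] holds the sum of img over [0,x)x[0,y). */
static void sat_build(const float *img, float *sat, int h)
{
    for (int x = 0; x <= W; x++)
        sat[x] = 0;
    for (int y = 0; y < h; y++) {
        sat[(y + 1) * (W + 1)] = 0;
        for (int x = 0; x < W; x++)
            sat[(y + 1) * (W + 1) + x + 1] = img[y * W + x]
                + sat[y * (W + 1) + x + 1]
                + sat[(y + 1) * (W + 1) + x]
                - sat[y * (W + 1) + x];
    }
}

/* Sum over the rectangle [x0,x1) x [y0,y1): a + d - b - c, exactly
   the 4-corner expression used in the cascade inner loop. */
static float sat_sum(const float *sat, int x0, int y0, int x1, int y1)
{
    return sat[y0 * (W + 1) + x0] + sat[y1 * (W + 1) + x1]
         - sat[y0 * (W + 1) + x1] - sat[y1 * (W + 1) + x0];
}
```

Dividing by the rectangle area (or folding it into the feature weights, as the cascade does) turns the sum into the average.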
// Copyright (c) 2013 Michael Zucchi
// Licensed under GNU GPLv3
int test_cascade(float *sat, float var, const union drecord *p, float *ssump) {
union drecord h0;
union drecord h1;
float ssum = *ssump;
do {
h0 = (*p++);
h1 = (*p++);
while (h1.head1.count2) {
union drecord r0, r1, f0, f1;
float rsum;
r0 = (*p++);
r1 = (*p++);
f0 = (*p++);
f1 = (*p++);
rsum = (sat[r0.rect.a] + sat[r0.rect.d]
- sat[r0.rect.b] - sat[r0.rect.c]) * -0.0025f;
rsum += (sat[r1.rect.a] + sat[r1.rect.d]
- sat[r1.rect.b] - sat[r1.rect.c]) * f0.f0.weight1;
ssum += rsum < f0.f0.fthreshold * var ? f1.f1.fsucc : f1.f1.ffail;
h1.head1.count2--;
}
/* ... the 3-feature test is much the same ... */
if (h0.head0.flags & 1) {
if (ssum < h0.head0.sthreshold) {
return 0;
}
ssum = 0;
}
} while ((h0.head0.flags & 2) == 0);
*ssump = ssum;
// keep on going
return 1;
}
As one can see the actual algorithm is really very simple. The problem with making it run fast is dealing with the amount of data that it can chew through as I've mentioned and detailed in previous posts.
I don't have any timings but this should be a particularly fast implementation on a desktop CPU too - most of the heavy lifting fits in the L1 cache for example, and it's pre-compiling as much as possible.
Hardware specific optimisations
This covers a couple of optimisations made to take advantage of the instruction set.
First issue is that there is no comparison operation - all one can do is subtract and compare flags. Furthermore there are only limited comparison operators available - equal, less-than and less-than-or-equal. So in general a compare is at least 2 instructions (and more if you want to be ieee compliant but that isn't needed here).
On the other hand there are fmad and fmsub instructions - AND these set the flags. So it is possible to perform all three operations in one instruction given that we don't need to know the precise value.
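In C terms the trick looks like this: we only ever need the sign of threshold*var - rsum, never its value, so the fmsub that does the arithmetic can also decide the select. This is a hypothetical helper mirroring the C routine above, not SDK code:

```c
/* One feature vote, written so the compare is just the sign of an
   fmsub result. */
static float feature_vote(float rsum, float threshold, float var,
                          float fsucc, float ffail)
{
    /* fmsub computes this AND sets the fpu flags in one instruction */
    float t = threshold * var - rsum;
    /* movblte then selects on those flags - no separate compare */
    return t > 0.0f ? fsucc : ffail;
}
```

This matches the C reference: rsum < fthreshold * var picks fsucc, otherwise ffail, with the compare folded into the arithmetic.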
Another feature of the epu is that the floating point and integer flags are separate so this can be utilised to fill instruction slots and also perform control flow without affecting the flags.
The epu is most efficient when performing dword loads. It's the same speed as a word load, and faster than a short or byte load. So the format is designed to support all dword loads.
Another general optimisation is in pre-compiling the cascade for the problem. So far i'm only using it to pre-calculate the array offsets but it could also be used to alter the sign of calculations to suit the available fpu flags.
Update: Because the Epiphany LDS is so small another optimisation was to make the cascade streamable. Although the single biggest stage with the test cascade fits in 8k it is pretty tight and limits the code flexibility and tuning options (e.g. trade-off space and time). It also limits generality - other cascades may not have the same topology. So the cascade format is designed so it can be broken at completely arbitrary boundary points with very little overhead - this is probably the single most important bit of engineering in the whole exercise and determines everything else. The difficulty isn't so much in designing the format as in recognising the need for it and its requirements in the first place. Having a streamable cascade adds a great deal of flexibility for dealing with large structures - they can be cached easily and implementing read-ahead is trivial.
There were some other basic optimisation techniques which became available after studying the actual data:
- 2-region features use only two variations of weights, therefore it can be encoded in 1 bit or in a single float (the first one is always the same).
- 3-region features all use the same weights, therefore all 3 floats can be thrown away.
- The original cascade format had 2 or 3 region features scattered amongst the cascade randomly which means any inner loop has to deal with the different number of elements (and branch!). Once one realises the only result is the sum then they can be processed in any order (summation algebra ftw ... again), meaning I could group them and optimise each loop separately.
Some of these seem to lose the generality of the routine - but actually the weights are always the same relationship they are just scaled to the size of the native cascade window. So making the algorithm general would not take much effort.
These are things I missed when I worked on my OpenCL version so I think I could improve that further too. But trying to utilise the concurrency and dealing with the cascade size is what kills the GPU performance so it might not help much as it isn't ALU constrained at all. If I ever get a GCN APU I will definitely revisit it though.
Unscheduled ASM
After a (good) few days' worth of hacking blind and lots of swearing I finally came up with the basic code below. I was dreaming in register loads ...
Actually this was de-scheduled in order to try to follow it and re-schedule it more efficiently. This is the top part of the C code and the entire 2-region loop.
// Copyright (c) 2013 Michael Zucchi
// Licensed under GNU GPLv3
0: ldrd r18,[r7,#1] ; count2, count3
ldr r16,[r7],#4 ; flags
and r0,r18,r5 ; check zero count
beq 1f
2: ldrd r0,[r7],#4 ; 0: load a,b,c,d
ldrd r2,[r7,#-3] ; 1: load a,b,c,d
lsr r4,r0,#16 ; 0:b index
ldr r21,[r6,r4] ; 0:load b
and r0,r0,r5 ; 0:a index
ldr r20,[r6,r0] ; 0:load a
lsr r4,r1,#16 ; 0: d index
ldr r23,[r6,r4] ; 0: load d
and r1,r1,r5 ; 0: c index
ldr r22,[r6,r1] ; 0: load c
lsr r4,r2,#16 ; 1: b index
ldr r25,[r6,r4] ; 1: load b
and r2,r2,r5 ; 1: a index
ldr r24,[r6,r2] ; 1: load a
lsr r4,r3,#16 ; 1: d index
ldr r27,[r6,r4] ; 1: load d
and r3,r3,r5 ; 1: c index
ldr r26,[r6,r3] ; 1: load c
ldrd r50,[r7,#-2] ; load w1, rthreshold
fsub r44,r20,r21 ; 0: a-b
fsub r45,r23,r22 ; 0: d-c
fsub r46,r24,r25 ; 1: a-b
fsub r47,r27,r26 ; 1: d-c
fmul r48,r51,r60 ; rthreshold *= var
fadd r44,r44,r45 ; 0[-1]: a+d-b-c
fadd r45,r46,r47 ; 1[-1]: a+d-b-c
fmsub r48,r44,r63 ; [-1]: var * thr -= (a+d-b-c) * w0
ldrd r52,[r7,#-1] ; [-1] load fsucc, ffail
fmsub r48,r45,r50 ; [-1] var * thr -= (a+d-b-c) * w1
movblte r52,r53
fsub r17,r17,r52 ; [-2]: ssum -= var * thr > (rsum) ? fsucc: ffail
sub r18,r18,#1
bne 2b
1:
Apart from the trick with the implicit 'free' comparison operations it pretty much ended up as a direct translation of the C code (much of the effort was in the format design and getting the code to run). But even in this state it will execute much faster than what the compiler generates for the very simple loop above. Things the C compiler is missing:
- It doesn't use dword loads - more instructions are needed
- It does use hword loads - causes fixed stalls
- It is using an ieee comparison function (compiler flags may change this)
- It doesn't use fmsub as much, certainly not for comparison
- It needs to multiply the array references by 4
Because there are no datatypes in asm, this can take advantage of the fact that the array lookups are by the byte and pre-calculate the shift (multiply by sizeof(float)) in the cascade. In the C version I do not as it adds a shift for a float array reference - I do have a way to remove that in C but it's a bit ugly.
Otherwise - it's all very straightforward in the inner loop.
First it loads all the rect definitions and then looks them up in the sat table (r6).
Then it starts the calculations, first calculating the average and then using fmsub to perform the multiply by the weight and comparison operation in one.
At the very end of the loop the last flop is to perform a subtraction on the ssum - this sets the status flags to the final comparison (if (ssum < h0.head0.sthreshold) in c). This actually requires some negation in code that uses it which could be improved - the threshold could be negated in the cascade for example.
If one looks closely one will see that the registers keep going up even though many are out of scope and can be re-used. This is done on purpose and allows for the next trick ...
I don't have the full profiling info for this version, but I have a note that it includes 15 RA stalls, and I think from memory only dual-issues 2 of the 10 flops.
Scheduling
A typical optimisation technique is to unroll a loop, either manually or by letting the compiler do it. Apart from reducing the relative overhead of any loop support constructs it provides modern processors with more flexibility to schedule instructions.
The code already has some loop unrolling anyway - the two regions are tested using in-line code rather than in a loop.
But unrolling gets messy when you don't know the loop bounds or don't have some other hard detail such as that there is always an even number of loops. I didn't really want to try to look at pages of code and try to schedule by hand either ...
So instead I interleaved the same loop - as one progresses through the loop calculating the addresses needed for "this" result, the fpu is performing the calculations for the "last" result. You still need a prologue which sets up the first loop for whatever the result+1 code is expecting, and also an epilogue for the final result - and if only 1 value is processed the guts is completely bypassed. I'll only show the guts here ...
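The same idea expressed in C is classic software pipelining. A toy sketch with a stand-in computation (not the real cascade step), showing the prologue, the interleaved guts, and the epilogue:

```c
#include <stddef.h>

/* out[i] = in[i] * scale, pipelined one iteration deep: the load for
   "this" iteration overlaps the multiply for the "last" one. */
static void pipelined_scale(const float *in, float *out,
                            size_t n, float scale)
{
    if (n == 0)
        return;
    float loaded = in[0];              /* prologue: prime the pipeline */
    for (size_t i = 1; i < n; i++) {
        float next = in[i];            /* [ 0] this result: load       */
        out[i - 1] = loaded * scale;   /* [-1] last result: flops      */
        loaded = next;
    }
    out[n - 1] = loaded * scale;       /* epilogue: drain the pipeline */
}
```

On a dual-issue machine like the epu the point of this shape is that the ialu/load work of result i and the fpu work of result i-1 have no dependencies on each other, so they can issue together.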
// Copyright (c) 2013 Michael Zucchi
// Licensed under GNU GPLv3
.balign 8
2:
[ 0] fsub r46,r24,r25 ; [-1] 1: a-b
[ 0] ldrd r0,[r7],#4 ; [ 0] 0: load a,b,c,d
[ 1] fsub r47,r27,r26 ; [-1] 1: d-c
[ 1] ldrd r2,[r7,#-3] ; [ 0] 1: load a,b,c,d
[ 2] fmul r48,r51,r60 ; [-1] rthreshold *= var
[ 2] lsr r4,r0,#16 ; [ 0] 0:b index
[ 3] fadd r44,r44,r45 ; [-1] 0: a+d-b-c
[ 3] ldr r21,[r6,r4] ; [ 0] 0:load b
[ 4] and r0,r0,r5 ; [ 0] 0:a index
[ 5] ldr r20,[r6,r0] ; [ 0] 0:load a
[ 6] lsr r4,r1,#16 ; [ 0] 0: d index
[ 6] fadd r45,r46,r47 ; [-1] 1: a+d-b-c
[ 7] ldr r23,[r6,r4] ; [ 0] 0: load d
[ 8] and r1,r1,r5 ; [ 0] 0: c index
[ 8] fmsub r48,r44,r63 ; [-1] var * thr -= (a+d-b-c) * w0
[ 9] ldr r22,[r6,r1] ; [ 0] 0: load c
[ 10] lsr r4,r2,#16 ; [ 0] 1: b index
[ 11] ldr r25,[r6,r4] ; [ 0] 1: load b
[ 12] and r2,r2,r5 ; [ 0] 1: a index
[ 13] ldr r24,[r6,r2] ; [ 0] 1: load a
[ 13] fmsub r48,r45,r50 ; [-1] var * thr -= (a+d-b-c) * w1
[ 14] ldrd r52,[r7,#-5] ; [-1] load fsucc, ffail
[ 15] lsr r4,r3,#16 ; [ 0] 1: d index
[ 16] and r3,r3,r5 ; [ 0] 1: c index
[ 17] ldr r27,[r6,r4] ; [ 0] 1: load d
[ 18] movblte r52,r53 ; [-1] val = var * thr < rsum ? fsucc : ffail
[ 19] fsub r44,r20,r21 ; [ 0] 0: a-b
[ 19] ldr r26,[r6,r3] ; [ 0] 1: load c
[ 20] fsub r45,r23,r22 ; [ 0] 0: d-c
[ 20] sub r18,r18,#1
[ 21] ldrd r50,[r7,#-2] ; [-1] load w1, rthreshold
[ 21] fsub r17,r17,r52 ; [-1] ssum -= var * thr > (rsum) ? fsucc: ffail
[ 22] bne 2b
[ 26] ; if back to the start of the loop
Update: I tried to improve and fix the annotations in the comments. The [xx] value is the index of the result this instruction is working on, the next x: value is the index of the region being worked on (where it is needed).
I've attempted to show the clock cycles the instructions start on (+ 4 for the branch), but it's only rough. I know from the hardware profiling that every flop dual-issues and there are no register stalls. The loop start alignment is also critical to the lack of stalls. And it took a lot of guess-work to remove the final stall which lingered in the last 5 instructions (someone'll probably tell me now that the sdk has a cycle timer, but it wouldn't matter if it does).
It fell out almost completely symmetrically - that is, having all ialu ops in loop 0 and all flops in loop 1, but by rotating the flops around a bit I managed to get the final flop being the ssum "subtraction + comparison" operation and with no stalls ...
The movblte instruction which performs the ternary is the one that uses the implicit comparison result from the fmsub earlier. Not only does this save one instruction, it also saves the 5 clock cycle latency it would add - and this loop has no cycles to spare that I could find.
There is some more timing info for this one on the previous post. The version that this is 30% faster than is not the unscheduled one above but an earlier scheduling attempt.
Oh I should probably mention that i found the bugs and the timings in the previous post did change a bit for the worse, but not significantly.
That scheduling ...
Had some other stuff I had to poke at the last couple of nights, and I needed a bit of a rest anyway. Pretty stuffed tbh, but I want to drop this off to get it out of my head.
Tonight I finally got my re-scheduled inner loop to work. Because I'm crap at keeping it all in my head I basically made a good guess and then ran it on the hardware using the profiling counters and tweaked until it stopped improving (actually until I removed all RA stalls and had every FLOP a dual-issue). Although it looks like now it's running for realz one of the dual-issue's dropped out - depends on things like alignment and memory contention.
But the results so far ...
Previous "best" scheduling New "improved" scheduling
CLK = 518683470 (1.3x) CLK = 403422245
IALU_INST = 319357570 IALU_INST = 312638579
FPU_INST = 118591312 FPU_INST = 118591312
DUAL_INST = 74766734 (63% rate) DUAL_INST = 108870170 (92% rate)
E1_STALLS = 11835823 E1_STALLS = 12446143
RA_STALLS = 122796060 (2.6x) RA_STALLS = 47086269
EXT_FETCH_STALLS = 0 EXT_FETCH_STALLS = 0
EXT_LOAD_STALLS = 1692412 EXT_LOAD_STALLS = 1819284
The 2-region loop is 33 instructions including the branch, so even a single cycle improvement is measurable.
I haven't re-scheduled the '3-region' calculation yet so it can gain a bit more. But as can be seen from the instruction counts the gain is purely from rescheduling. The IALU instruction count is different as I improved the loop mechanics too (all of one instruction?).
As a quick comparison this is what the C compiler comes up with (-O2). I'm getting some different results to this at the moment so the comparisons here are only provisional ...
CLK = 1189866322 (2.9x vs improved)
IALU_INST = 693566877
FPU_INST = 131085992
DUAL_INST = 93602858 (71% rate)
E1_STALLS = 31768387 (2.5x vs improved)
RA_STALLS = 322216105 (6.8x vs improved)
EXT_FETCH_STALLS = 0
EXT_LOAD_STALLS = 14099244
The number of flops is pretty close though so it can't be far off. I'm doing a couple of things the C compiler isn't so the number should be a bit lower. Still not sure where all those ext stalls are coming from.
Well the compiler can only improve ...
In total elapsed time terms these are something like 1.8s, 0.88s, and 0.60s from slowest to fastest on a single core. I only have a multi-core driver for the assembly versions. On 1 column of cores best is 201ms vs improved at 157ms. With all 16 cores ... identical at 87ms. But I should really get those bugs fixed and a realistic test case running before getting too carried away with the numbers.
Update: I later posted in more detail about the scheduling. I tracked down some bugs so the numbers changed around a bit but nothing that changes the overall relationships.