Hmm, Valve, AMD, Nvidia?

Hmm, so have Valve and the Gabester finally managed to do what common sense and economics couldn't?

That is, get AMD and perhaps even Nvidia to start working on proper GPU drivers for Linux?

Nvidia just announced that they're going to start helping the GPL driver effort all of a sudden. AMD are teasing about a GNU/Linux and game related announcement in under 12 hours. And Valve's "SteamStation" is being announced one way or another in under 12 hours too.

I guess we'll know soon enough ... it'll give me something to read in the morning unless I wake up at 4am again ...

I'm most interested in what AMD have to announce. The best we can hope for is a properly-free reference implementation of GPU + HSA for AMD APU machines - this is probably in the realm of dreaming but you never know because it makes a hell of a lot of sense economically. And fits some of their HSA related rumblings. Add in a range of "desktop" parts from low to high powered to match and it could be an interesting day. HSA has the potential to be the biggest leap in IBM compatible PC architecture in history - even if it is just all the way back to 1995 (Amiga).

SteamOS is interesting to me beyond the GameOS potential. Having an 'under tv' option which isn't Sony, or XBMC, or Google has to be a good thing. Android is a pretty sucky 'spin' of GNU/Linux. The optimistic part of me also looks forward to the announcement of some sort of OpenGL based display mechanism that would finally fuck The X Window System and its other shitty replacements right off into the dustbin of history where they belong. Actually I take back what I said about being most interested in what AMD have to say, a replacement for X that isn't just X-again wayland or ubuntu-i-can't-believe-it's-not-linux's mir would be very, very welcome.

One hopes that Nvidia's announcement is also genuine (and also involved in Valve's announcement) and not just a cynical response to something AMD/Valve are expected to say. Because of nvidia's shithouse opencl support and performance on their mainstream parts, they are still "dead to me" - but that isn't a universal opinion.

Post-press

Well ... that was unexpected. A proprietary game api? Oh-kay.

So I guess AMD want to play the market power card? After wrapping up all the consoles?

I thought the whole point of the HSA design was to improve the efficiency of existing apis ...

Still, at this point there aren't enough details to really make much of a judgement call. No real info about Linux either, apart from an "importance" of cross-platform support (but that could mean anything).

I guess this was an announcement of game cards, and game cards are bought by game players and game players buy game cards based on game benchmarks ... and a smaller API could definitely make a big difference there.

So I guess we'll just have to continue to wait and see on the APU and HSA fronts, and the same goes for the steam-machine. Poo to that.

Balanced, but fair?

So i've been following the xbone trainwreck over the last few months. It's been pretty entertaining, I like to see a company like m$ get their comeuppance. It's been a complete PR disaster, from the meaningless "180" to the technical specifications.

The FUD is stinking up the discourse a little though so here's a bit of fudbusting of my own.

Balance

Balance is an interesting term - but only if you've fucked up somewhere. Every system aims to be balanced within its externally determined constraints - that's pretty much the whole point of systems engineering. It relates to the efficiency of a given design but says NOTHING WHATSOEVER about its performance.

One of the main constraints is always cost and clearly that was one of the major factors in the xbone cpu design. Within the constraints of making a cheap machine it may well be balanced but it's certainly not as fast as the current competition.

m$ are trying to use the chief ps4 engineer's words against him in that he stated that they have more CU's than is strictly necessary for graphics - but the design is clearly intended to use the CU's for compute from the start. And in that scenario the xbone's gpu becomes unbalanced as it has inadequate ALU.

For the sort of developer that works on games I imagine GPU coding is really pretty easy. And with the capabilities of the new HSA-capable devices it should be efficient too - as soon as one has any sort of parallel job just chuck that routine on a GPU core instead of the cpu. Not catering for this seems short-sighted at best.

"Move engines"

These are just plain old DMA engines. Every decent personal computer has had them since the Amiga 1000. They have them because they are useful but there's nothing particularly special or unique about them today and the AMD SOC in both consoles will have these - in fact they will have several.

Even the beagleboard has a few of them (i can't remember if it's 2 or 4), and they can do rectangle copies, colour fill and even chroma-key. The CELL BE in the PS3 has a 16-deep DMA queue on each SPU - allowing up to 16 in-flight DMA operations PER SPU (i.e. 112 per CELL BE, not including other DMA engines). The epiphany core has 2 2-D DMA channels per EPU - or 32 independent channels for a 16-core chip.

They don't take too much hardware to implement either, just a couple of address registers, adders and a memory interface/arbiter (the biggest bit).

Hardware Scaler & "Display Planes"

i.e. overlays. Video hardware has had this sort of functionality for a couple of decades. IIRC even the lowly beagleboard has 3 "display planes" one of which has an alpha channel, and two of which can be scaled independently using high quality multi-tap filters and two of which support YUV input. Basically they're used for a mouse pointer and a video window, but they could be used for more.

Overlays are really useful if you have a low-bandwidth/low-performance system because of the "free" scaling and yuv conversion, but aren't going to make much of a difference on a machine like the xbone. For example even at 68GB/s one can read or write over 8000 1080p 32-bit frames per second, so you're looking at only a few percent maximum render time on a 50fps display for blending and scaling several separate display planes.
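The arithmetic behind that estimate can be checked with a few lines of C. This is only a rough sketch: it takes 68GB/s as 68e9 bytes per second, a 1080p frame as 1920x1080 pixels at 4 bytes each, and the number of full-frame transfers per displayed frame is an assumed figure, not something from the console specifications. The function names are made up for the illustration.

```c
#include <assert.h>

/* Bytes in one frame of the given size and pixel depth. */
unsigned long long frame_bytes(unsigned width, unsigned height,
                               unsigned bytes_per_pixel)
{
    return (unsigned long long)width * height * bytes_per_pixel;
}

/* Whole frames that can be read or written per second at a given bandwidth. */
unsigned long long frames_per_second(unsigned long long bytes_per_second,
                                     unsigned long long bytes_per_frame)
{
    assert(bytes_per_frame > 0);
    return bytes_per_second / bytes_per_frame;
}

/*
 * Percentage of the available bandwidth used when every displayed frame
 * needs `transfers` full-frame reads or writes at `fps` frames per second.
 */
double bandwidth_percent(unsigned transfers, unsigned fps,
                         unsigned long long max_frames_per_second)
{
    assert(max_frames_per_second > 0);
    return 100.0 * transfers * fps / (double)max_frames_per_second;
}
```

With these assumptions frames_per_second(68000000000ULL, frame_bytes(1920, 1080, 4)) comes to 8198, and four full-frame transfers per displayed frame at 50fps (say three planes read and one result written) is a bit under 2.5% of the bandwidth.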

Good to have - sure, but no game-changer and certainly no unique 'value add'.

DRM & the 180

Personally I don't think anything much changed with their "180" on the DRM thing. DRM is still there, and even without a nightly parole check there are plenty of ways to have effectively the same thing. e.g. make a game pretty shit without being constantly on-line, tie a given disk to an account the first time you use it, and so on. And whatever they were planning could always be turned on at the flick of a switch at any future point in time (it needn't have to work with any game previously published, just with ones published after that point).

BTW Sony are really no better here despite all the free PR they wallowed in. Sure they never tried the really dumb idea of banning second hand sales of physical discs (it was an absurd idea anyway as much of the money you might make back from it would be swallowed in administration costs and given it would kill the used game market it would probably just end up being revenue negative). But they're making download-only attractive enough that people are foregoing their rights for convenience and the end result is pretty much the same.

All consoles have always been heavily laden with DRM - it was always one of their selling points to developers to negate the wide-spread sharing that everyone does on personal computers.

I can't see the difference...

This is just straight-up PR speak for "we don't expect the average (i.e. uneducated) 'consumer' to notice the difference".

Would you like some condescension with that?

It's all FUD and Games

The great thing about FUD is you don't even have to do much. Say a couple of things in the right places and you get ill-informed but well-intentioned people doing all your work for you. They don't even realise they've been manipulated.

We'd all let the games speak for themselves if we could actually see them ... but developers have to sign NDAs that won't let them talk about the differences, and rumours suggest they're not even allowed to show the games side-by-side at trade shows. So telling people to "see the games" is being very dishonest at best. It's just a FUD technique to try to get people locked in to buying their product. Once they get it home and see they've been sold a lemon few will be motivated to do anything about it, and if they get enough early adopters the network effects take over (this is why they're scrambling so much even though they were clearly aiming for a 2014 launch - it has to be now or never, at least in their grand plan).

From what we can see the xbone was basically created as the end-game for their trojan horse in the home idea - a $700 hand-wavey remote control that you have to pay a subscription to use, and which monitors demographics and viewer reactions and serves advertisements appropriately. Playing games is only a secondary function - as can clearly be seen by the technical specifications.

If playing games was the primary function of the design then they simply "done fucked up". A company this big doesn't waste this much money over the course of a decade to fuck up at the end of it.

Anyone for dessert?

Made this yesterday and thought it looked nice enough to post ...

An unbaked cheesecake is not the sort of thing I normally make but it doesn't hurt to know how. I don't have a very sweet tooth but I don't mind sweet things that are supposed to be sweet (in limited amounts). Sister-in-law had bought the strawberries so I used them too.

I'd made a nicely tart lime cheesecake last week and at first I was just going to have this one vanilla but to the same basic recipe. But on the spoon it tasted too much like sweetened condensed milk (yuck) so i added some lime and so it ended up a sort of vanilla + tart yoghurt sort of flavour.

So that DuskZ thing ...

After doing nothing with it for months I finally checked in all the DuskZ code I had sitting on my HDD.

Unfortunately it's very much work in progress and I didn't do any cleaning up (other than check the licensing), so it's all just "as it was" right now. I think the last thing I added was animated tiles, and before that multiple-map support.

At least some of the code there is of decent quality, although not much use on its own.

Is it dead or just pining?

I'm not sure when I will get time to work on it again - i'm either too busy or hungover lately and it's hard enough to get the time just to fit in social interaction or the garden with all other hacking i'm doing lately. Actually it's not so much the physical time as being able to fit it in mentally as one needs to devote quite a bit of mind-share to do a good job. Like most I'm usually more active during summer so maybe i'll have time to fit it in ...

Just going through the code checking the licenses did pique my interest a little bit but also made me realise I would need a good few days of switched-on thinking to be able to do the next bit of work, that is after I even work out where I was at.

An embedded database backend would definitely be high on the list for example.

up/down sampling

One part of a window-based object detector is scaling the input to different resolutions before running the window across it (one can also scale the window, but this is not efficient on modern cpus).

So i've been looking at up/down resampling in little bits here and there over the last few days.

To cut a long story short, after coming up with a fairly complex but pretty good (i think) implementation of a one-pass 1/N and 2/N 2-D 'up-sample/down-sample' scaling filter ... I found that using a simple 1/N box filter is more than good enough for the application - and about 2x faster. Should NEONise pretty easily too.
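For reference, a 1/N box filter can be sketched in plain C as below. This is not the code from the project: it assumes an 8-bit greyscale image stored row by row, an integer factor n, and that partial blocks at the right and bottom edges are simply dropped. The name box_downsample is made up for the illustration.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Down-sample an 8-bit greyscale image by an integer factor n using a box
 * filter: each output pixel is the rounded mean of an n x n block of input.
 * Partial blocks at the right and bottom edges are dropped.
 * dst must have room for (sw / n) * (sh / n) pixels.
 */
void box_downsample(const unsigned char *src, size_t sw, size_t sh,
                    unsigned char *dst, size_t n)
{
    assert(n > 0);
    size_t dw = sw / n, dh = sh / n;
    unsigned area = (unsigned)(n * n);

    for (size_t y = 0; y < dh; y++) {
        for (size_t x = 0; x < dw; x++) {
            unsigned sum = 0;

            /* accumulate the n x n block that maps onto this output pixel */
            for (size_t j = 0; j < n; j++)
                for (size_t i = 0; i < n; i++)
                    sum += src[(y * n + j) * sw + (x * n + i)];

            /* add half the divisor first so the division rounds to nearest */
            dst[y * dw + x] = (unsigned char)((sum + area / 2) / area);
        }
    }
}
```

The inner accumulation is just a run of adds over contiguous bytes, which is the sort of thing that usually maps well onto SIMD.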

The up/down filter may well be useful for other purposes though and I did learn more about up/down filters in general which is something i've been meaning to do. Maybe at some point i'll write about both.

I was only looking at implementing this for ARM but the algorithm I came up with should fit epiphany quite well - so at some point I will look deeper into it. Epiphany does offer other more exotic options mind you.

Scheduling in detail

Just some notes on optimising the assembly language version of the viola-jones cascade walker I've been working on for the epiphany chip.

I'm still working toward a tech demo but I got bogged down with the details of the resampling code - i'm using the opportunity to finally grok how upfirdn works.

Excuse the typos, i did this in a bit of a rush and don't feel like fully proof-reading it.

The algorithm

First the data structure. This allows the cascade to be encoded using dword (64-bit) alignment. It's broken into 64-bit elements for C compatibility.

union drecord {
        unsigned long long v;
        struct {
                unsigned int flags;
                float sthreshold;
        } head0;
        struct {
                unsigned int count2;
                unsigned int count3;
        } head1;
        struct {
                unsigned short a,b,c,d;
        } rect;
        struct {
                float weight1;
                float fthreshold;
        } f0;
        struct {
                float fsucc,ffail;
        } f1;
};

And then a C implementation using this data structure. The summed-area-table (sat) lookup calculates the sum of all pixels within that rectangle. The sat table size is hard-coded to a specific width and encoded into the compiled cascade. Because it is only ever processed as part of a window of a known size this doesn't limit its generality.

It performs a feature test on a 2-region feature which equates to either a "this half is brighter than the other half" test in all 4 directions, or a "the middle half is brighter than the two quarter sides" in both directions and senses.

// Copyright (c) 2013 Michael Zucchi
// Licensed under GNU GPLv3
int test_cascade(float *sat, float var, const union drecord *p, float *ssump) {
        union drecord h0;
        union drecord h1;
        float ssum = *ssump;


        do {
                h0 = (*p++);
                h1 = (*p++);

                while (h1.head1.count2) {
                        union drecord r0, r1, f0, f1;
                        float rsum;

                        r0 = (*p++);
                        r1 = (*p++);
                        f0 = (*p++);
                        f1 = (*p++);

                        rsum = (sat[r0.rect.a] + sat[r0.rect.d]
                                - sat[r0.rect.b] - sat[r0.rect.c]) * -0.0025f;
                        rsum += (sat[r1.rect.a] + sat[r1.rect.d]
                                 - sat[r1.rect.b] - sat[r1.rect.c]) * f0.f0.weight1;

                        ssum += rsum < f0.f0.fthreshold * var ? f1.f1.fsucc : f1.f1.ffail;
                        h1.head1.count2--;
                }

                /* ... 3-feature test is much the same ... */

                if (h0.head0.flags & 1) {
                        if (ssum < h0.head0.sthreshold) {
                                return 0;
                        }
                        ssum = 0;
                }
        } while ((h0.head0.flags & 2) == 0);

        *ssump = ssum;

        // keep on going
        return 1;
}
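The four-corner lookup used for rsum above can be shown in isolation. The sketch below builds a summed-area table and reads a rectangle sum back as a + d - b - c. The corner naming (a top-left, b top-right, c bottom-left, d bottom-right), the extra row and column of zeros, and the function names are assumptions for the illustration. In the cascade the four indices are pre-computed rather than derived from coordinates at run time.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Build a summed-area table with one extra row and column of zeros, so
 * sat[y * (w + 1) + x] holds the sum of all pixels above and left of (x, y).
 * sat must have room for (w + 1) * (h + 1) floats.
 */
void sat_build(const unsigned char *img, size_t w, size_t h, float *sat)
{
    size_t stride = w + 1;

    for (size_t x = 0; x <= w; x++)
        sat[x] = 0.0f;
    for (size_t y = 1; y <= h; y++) {
        float row = 0.0f;          /* running sum of the current image row */

        sat[y * stride] = 0.0f;
        for (size_t x = 1; x <= w; x++) {
            row += img[(y - 1) * w + (x - 1)];
            sat[y * stride + x] = sat[(y - 1) * stride + x] + row;
        }
    }
}

/* Sum of the pixels in [x0,x1) x [y0,y1) from four corner reads. */
float sat_rect_sum(const float *sat, size_t w,
                   size_t x0, size_t y0, size_t x1, size_t y1)
{
    size_t stride = w + 1;

    assert(x0 <= x1 && x1 <= w && y0 <= y1);

    size_t a = y0 * stride + x0;   /* top-left */
    size_t b = y0 * stride + x1;   /* top-right */
    size_t c = y1 * stride + x0;   /* bottom-left */
    size_t d = y1 * stride + x1;   /* bottom-right */

    return sat[a] + sat[d] - sat[b] - sat[c];
}
```

Whatever the rectangle size, the cost is four loads and three adds, which is why the data access pattern rather than the arithmetic ends up dominating.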

As one can see the actual algorithm is really very simple. The problem with making it run fast is dealing with the amount of data that it can chew through as i've mentioned and detailed in previous posts.

I don't have any timings but this should be a particularly fast implementation on a desktop cpu too - most of the heavy lifting fits in the L1 cache for example, and it's pre-compiling as much as possible.

Hardware specific optimisations

This covers a couple of optimisations made to take advantage of the instruction set.

First issue is that there is no comparison operation - all one can do is subtract and compare flags. Furthermore there are only limited comparison operators available - equal, less-than and less-than-or-equal. So in general a compare is at least 2 instructions (and more if you want to be ieee compliant but that isn't needed here).

On the other hand there are fmadd and fmsub instructions - AND these set the flags. So it is possible to perform all three operations (multiply, subtract and compare) in one instruction given that we don't need to know the precise value.
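In C terms the idea looks something like the sketch below: rather than computing a weighted sum and then comparing it against a threshold, start from the scaled threshold, subtract the weighted terms from it, and only look at the sign of what is left. This is an illustration, not the project code. The function and parameter names are made up, and plain C cannot express "use the flags left by the last multiply-subtract", so the sign test is written out explicitly. Whether the compiler fuses each multiply and subtract into one instruction is up to the compiler.

```c
#include <assert.h>

/*
 * Evaluate "sum0 * w0 + sum1 * w1 < threshold * var" without forming the
 * left-hand side: accumulate threshold * var - sum0 * w0 - sum1 * w1 and
 * test its sign. Each "acc -= x * y" step corresponds to one
 * multiply-subtract.
 */
int feature_passes(float sum0, float w0, float sum1, float w1,
                   float threshold, float var)
{
    float acc = threshold * var;

    acc -= sum0 * w0;
    acc -= sum1 * w1;

    /* positive remainder: the weighted sum was below the scaled threshold
     * (rounding at the exact boundary is ignored here) */
    return acc > 0.0f;
}
```

On hardware where the final multiply-subtract also sets the condition flags, the explicit comparison at the end costs nothing extra.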

Another feature of the epu is that the floating point and integer flags are separate so this can be utilised to fill instruction slots and also perform control flow without affecting the flags.

The epu is most efficient when performing dword loads. It's the same speed as a word load, and faster than a short or byte load. So the format is designed to support all dword loads.

Another general optimisation is in pre-compiling the cascade for the problem. So far i'm only using it to pre-calculate the array offsets but it could also be used to alter the sign of calculations to suit the available fpu flags.

Update: Because the epiphany LDS is so small another optimisation was to make the cascade streamable. Although the single biggest stage with the test cascade fits in 8k it is pretty tight and limits the code flexibility and tuning options (e.g. trade-off space and time). It also limits generality - other cascades may not have the same topology. So the cascade format is designed so it can be broken at completely arbitrary boundary points with very little overhead - this is probably the single most important bit of engineering in the whole exercise and determines everything else. The difficulty isn't so much in designing the format as in recognising the need for it and its requirements in the first place. Having a streamable cascade adds a great deal of flexibility for dealing with large structures - they can be cached easily and implementing read-ahead is trivial.

There were some other basic optimisation techniques which became available after studying the actual data:

2-region features use only two variations of weights, therefore it can be encoded in 1 bit or in a single float (the first one is always the same).
3-region features all use the same weights, therefore all 3 floats can be thrown away.
The original cascade format had 2 or 3 region features scattered amongst the cascade randomly which means any inner loop has to deal with the different number of elements (and branch!). Once one realises the only result is the sum then they can be processed in any order (summation algebra ftw ... again), meaning i could group them and optimise each loop separately.

Some of these seem to lose the generality of the routine - but actually the weights are always the same relationship they are just scaled to the size of the native cascade window. So making the algorithm general would not take much effort.

These are things I missed when I worked on my OpenCL version so I think I could improve that further too. But trying to utilise the concurrency and dealing with the cascade size is what kills the GPU performance so it might not help much as it isn't ALU constrained at all. If I ever get a GCN APU I will definitely revisit it though.

Unscheduled ASM

After a (good) few days' worth of hacking blind and lots of swearing I finally came up with the basic code below. I was dreaming in register loads ...

Actually this was de-scheduled in order to try to follow it and re-schedule it more efficiently. This is the top part of the C code and the entire 2-region loop.

// Copyright (c) 2013 Michael Zucchi
// Licensed under GNU GPLv3

0:      ldrd    r18,[r7,#1]     ; count2, count3
        ldr     r16,[r7],#4     ; flags

        and     r0,r18,r5       ; check zero count
        beq     1f

2:      ldrd    r0,[r7],#4      ; 0: load a,b,c,d
        ldrd    r2,[r7,#-3]     ; 1: load a,b,c,d

        lsr     r4,r0,#16       ; 0:b index
        ldr     r21,[r6,r4]     ; 0:load b
        and     r0,r0,r5        ; 0:a index
        ldr     r20,[r6,r0]     ; 0:load a

        lsr     r4,r1,#16       ; 0: d index    
        ldr     r23,[r6,r4]     ; 0: load d
        and     r1,r1,r5        ; 0: c index
        ldr     r22,[r6,r1]     ; 0: load c

        lsr     r4,r2,#16       ; 1: b index
        ldr     r25,[r6,r4]     ; 1: load b
        and     r2,r2,r5        ; 1: a index
        ldr     r24,[r6,r2]     ; 1: load a
        
        lsr     r4,r3,#16       ; 1: d index
        ldr     r27,[r6,r4]     ; 1: load d
        and     r3,r3,r5        ; 1: c index
        ldr     r26,[r6,r3]     ; 1: load c

        ldrd    r50,[r7,#-2]    ; load w1, rthreshold
        
        fsub    r44,r20,r21     ; 0: a-b
        fsub    r45,r23,r22     ; 0: d-c
        fsub    r46,r24,r25     ; 1: a-b 
        fsub    r47,r27,r26     ; 1: d-c
        
        fmul    r48,r51,r60     ; rthreshold *= var
        
        fadd    r44,r44,r45     ; 0[-1]: a+d-b-c
        fadd    r45,r46,r47     ; 1[-1]: a+d-b-c
        
        fmsub   r48,r44,r63     ; [-1]: var * thr -= (a+d-b-c) * w0
        ldrd    r52,[r7,#-1]    ; [-1] load fsucc, ffail
        fmsub   r48,r45,r50     ; [-1] var * thr -= (a+d-b-c) * w1
        movblte r52,r53
        fsub    r17,r17,r52     ; [-2]: ssum -= var * thr > (rsum) ? fsucc: ffail

        sub     r18,r18,#1
        bne     2b
1:

Apart from the trick with the implicit 'free' comparison operations, it pretty much ended up as a direct translation of the C code (much of the effort was in the format design and getting the code to run). But even in this state it will execute much faster than what the compiler generates for the very simple loop above. Things the C compiler is missing:

It doesn't use dword loads - more instructions are needed
It does use hword loads - causes fixed stalls
It is using an ieee comparison function (compiler flags may change this)
It doesn't use fmsub as much, certainly not for comparison
It needs to multiply the array references by 4

Because there are no datatypes in asm, this can take advantage of the fact that the array lookups are by the byte and pre-calculate the shift (multiply by sizeof(float)) in the cascade. In the C version I do not as it adds a shift for a float array reference - I do have a way to remove that in C but it's a bit ugly.
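One way to get the same effect in C is sketched below. It is not necessarily the approach alluded to above, just one possibility. The names are invented, the 0xffff mask is assumed to be what the assembly keeps in r5, and the indices are assumed to be packed low half first as a little-endian pair of unsigned shorts. The cascade compiler would call to_byte_offset() once per index, and the inner loop would then only need load_at().

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Split one 32-bit word into the two 16-bit values packed inside it
 * (low half first, as for a little-endian pair of unsigned shorts). */
void unpack_pair(uint32_t word, uint32_t *lo, uint32_t *hi)
{
    *lo = word & 0xffffu;
    *hi = word >> 16;
}

/* Done once when the cascade is compiled: element index to byte offset. */
uint32_t to_byte_offset(uint32_t element_index)
{
    uint32_t offset = element_index * (uint32_t)sizeof(float);

    assert(offset <= 0xffffu);     /* must still fit in a 16-bit field */
    return offset;
}

/* Done in the inner loop: read the float at a pre-scaled byte offset, so
 * no multiply or shift by sizeof(float) is needed at run time. */
float load_at(const float *sat, uint32_t byte_offset)
{
    float v;

    memcpy(&v, (const unsigned char *)sat + byte_offset, sizeof v);
    return v;
}
```

The memcpy keeps the access well defined in C and compilers normally turn it into a single load. The price of pre-scaling is that a 16-bit field now only reaches 64K bytes rather than 64K elements, which is not a problem when the whole local store is 32K.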

Otherwise - it's all very straightforward in the inner loop.

First it loads all the rect definitions and then looks them up in the sat table (r6).

Then it starts the calculations, first calculating the region sums and then using fmsub to perform the multiply by the weight and comparison operation in one.

At the very end of the loop the last flop is to perform a subtraction on the ssum - this sets the status flags to the final comparison (if (ssum < h0.head0.sthreshold) in c). This actually requires some negation in code that uses it which could be improved - the threshold could be negated in the cascade for example.

If one looks closely one will see that the registers keep going up even though many are out of scope and can be re-used. This is done on purpose and allows for the next trick ...

I don't have the full profiling info for this version, but I have a note that it includes 15 RA stalls, and I think from memory only dual-issues 2 of the 10 flops.

Scheduling

A typical optimisation technique is to unroll a loop, either manually or by letting the compiler do it. Apart from reducing the relative overhead of any loop support constructs it provides modern processors with more flexibility to schedule instructions.

The code already has some loop unrolling anyway - the two regions are tested using in-line code rather than in a loop.

But unrolling gets messy when you don't know the loop bounds or don't have some other hard detail such as that there is always an even number of loops. I didn't really want to try to look at pages of code and try to schedule by hand either ...

So instead I interleaved the same loop - as one progresses through the loop calculating the addresses needed for "this" result, the fpu is performing the calculations for the "last" result. You still need a prologue which sets up the first loop for whatever the result+1 code is expecting, and also an epilogue for the final result - and if only 1 value is processed the guts is completely bypassed. I'll only show the guts here ...

// Copyright (c) 2013 Michael Zucchi
// Licensed under GNU GPLv3

        .balign 8
2:
[  0]   fsub    r46,r24,r25     ; [-1] 1: a-b 
[  0]   ldrd    r0,[r7],#4      ; [ 0] 0: load a,b,c,d
[  1]   fsub    r47,r27,r26     ; [-1] 1: d-c
[  1]   ldrd    r2,[r7,#-3]     ; [ 0] 1: load a,b,c,d
[  2]   fmul    r48,r51,r60     ; [-1] rthreshold *= var
        
[  2]   lsr     r4,r0,#16       ; [ 0] 0:b index
[  3]   fadd    r44,r44,r45     ; [-1] 0: a+d-b-c
[  3]   ldr     r21,[r6,r4]     ; [ 0] 0:load b
[  4]   and     r0,r0,r5        ; [ 0] 0:a index
[  5]   ldr     r20,[r6,r0]     ; [ 0] 0:load a

[  6]   lsr     r4,r1,#16       ; [ 0] 0: d index    
[  6]   fadd    r45,r46,r47     ; [-1] 1: a+d-b-c
[  7]   ldr     r23,[r6,r4]     ; [ 0] 0: load d
[  8]   and     r1,r1,r5        ; [ 0] 0: c index
[  8]   fmsub   r48,r44,r63     ; [-1] var * thr -= (a+d-b-c) * w0
[  9]   ldr     r22,[r6,r1]     ; [ 0] 0: load c
        
[ 10]   lsr     r4,r2,#16       ; [ 0] 1: b index
[ 11]   ldr     r25,[r6,r4]     ; [ 0] 1: load b
[ 12]   and     r2,r2,r5        ; [ 0] 1: a index
[ 13]   ldr     r24,[r6,r2]     ; [ 0] 1: load a

[ 13]   fmsub   r48,r45,r50     ; [-1] var * thr -= (a+d-b-c) * w1

[ 14]   ldrd    r52,[r7,#-5]    ; [-1] load fsucc, ffail
[ 15]   lsr     r4,r3,#16       ; [ 0] 1: d index
[ 16]   and     r3,r3,r5        ; [ 0] 1: c index
[ 17]   ldr     r27,[r6,r4]     ; [ 0] 1: load d
[ 18]   movblte r52,r53         ; [-1] val = var * thr < rsum ? fsucc : ffail
[ 19]   fsub    r44,r20,r21     ; [ 0] 0: a-b
[ 19]   ldr     r26,[r6,r3]     ; [ 0] 1: load c
[ 20]   fsub    r45,r23,r22     ; [ 0] 0: d-c

[ 20]   sub     r18,r18,#1
[ 21]   ldrd    r50,[r7,#-2]    ; [-1] load w1, rthreshold

[ 21]   fsub    r17,r17,r52     ; [-1] ssum -= var * thr > (rsum) ? fsucc: ffail

[ 22]   bne     2b
[ 26] ; if back to the start of the loop

Update: I tried to improve and fix the annotations in the comments. The [xx] value is the index of the result this instruction is working on, the next x: value is the index of the region being worked on (where it is needed).

I've attempted to show the clock cycles the instructions start on (+ 4 for the branch), but it's only rough. I know from the hardware profiling that every flop dual-issues and there are no register stalls. The loop start alignment is also critical to the lack of stalls. And it took a lot of guess-work to remove the final stall which lingered in the last 5 instructions (someone'll probably tell me now that the sdk has a cycle timer, but that would be no matter if they did).

It fell out almost completely symmetrically - that is having all ialu ops in loop 0 and having all flops in loop 1, but by rotating the flops around a bit I managed to get the final flop being the ssum "subtraction + comparison" operation and with no stalls ...

The movblte instruction which performs the ternary is the one that uses the implicit comparison result from the fmsub earlier. Not only does this save one instruction, it also saves the 5 clock cycle latency it would add - and this loop has no cycles to spare that I could find.
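The shape of the interleaved loop (prologue, guts, epilogue) is easier to see in C than in scheduled assembly. The sketch below uses a much simpler job, a weighted table lookup, and the function names are made up. It only shows the structure: each trip round the loop issues the loads for item i while doing the arithmetic for item i-1. A C compiler is free to reorder this anyway, so it is an illustration of the idea rather than a way to get the same schedule.

```c
#include <assert.h>
#include <stddef.h>

/* Straightforward version: look up, then compute, one item at a time. */
float sum_plain(const float *table, const unsigned *index,
                const float *weight, size_t n)
{
    float sum = 0.0f;

    for (size_t i = 0; i < n; i++)
        sum += table[index[i]] * weight[i];
    return sum;
}

/* Interleaved version: loads for "this" item overlap flops for the "last". */
float sum_interleaved(const float *table, const unsigned *index,
                      const float *weight, size_t n)
{
    float sum = 0.0f;

    assert(table != NULL && index != NULL && weight != NULL);
    if (n == 0)
        return sum;

    /* prologue: set up the first item for the loop body to consume */
    float pending = table[index[0]];
    float pending_w = weight[0];

    /* guts: bypassed completely when there is only one item */
    for (size_t i = 1; i < n; i++) {
        float next = table[index[i]];      /* "this" result: the loads */
        float next_w = weight[i];

        sum += pending * pending_w;        /* "last" result: the flops */
        pending = next;
        pending_w = next_w;
    }

    /* epilogue: finish off the final item */
    sum += pending * pending_w;
    return sum;
}
```

Both versions add the terms in the same order so they give identical results. The interleaved one just makes sure a load and an unrelated flop are always available to issue together.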

There is some more timing info for this one in the previous post. The version that this one is 30% faster than is not the unscheduled one above but an earlier scheduling attempt.

Oh I should probably mention that i found the bugs and the timings in the previous post did change a bit for the worse, but not significantly.

That scheduling ...

Had some other stuff I had to poke at the last couple of nights, and I needed a bit of a rest anyway. Pretty stuffed tbh, but i want to drop this off to get it out of my head.

Tonight I finally got my re-scheduled inner loop to work. Because i'm crap at keeping it all in my head I basically made a good guess and then ran it on the hardware using the profiling counters and tweaked until it stopped improving (actually until i removed all RA stalls and had every FLOP a dual-issue). Although it looks like now it's running for realz one of the dual-issues dropped out - depends on things like alignment and memory contention.

But the results so far ...

       Previous "best" scheduling       New "improved" scheduling

                   CLK = 518683470 (1.3x)           CLK = 403422245
             IALU_INST = 319357570            IALU_INST = 312638579
              FPU_INST = 118591312             FPU_INST = 118591312
             DUAL_INST = 74766734  (63% rate) DUAL_INST = 108870170    (92% rate)
             E1_STALLS = 11835823             E1_STALLS = 12446143
             RA_STALLS = 122796060 (2.6x)     RA_STALLS = 47086269
      EXT_FETCH_STALLS = 0             EXT_FETCH_STALLS = 0
       EXT_LOAD_STALLS = 1692412        EXT_LOAD_STALLS = 1819284

The 2-region loop is 33 instructions including the branch, so even a single cycle improvement is measurable.

I haven't re-scheduled the '3-region' calculation yet so it can gain a bit more. But as can be seen from the instruction counts the gain is purely from just rescheduling. The IALU instruction count is different as i improved the loop mechanics too (all of one instruction?).
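The bracketed figures in the table can be recomputed from the raw counters. The sketch below assumes the "rate" is DUAL_INST as a percentage of FPU_INST and that the "x" figures are simply old count divided by new count. Both readings reproduce the numbers shown, but they are inferred from the table rather than stated anywhere. The function names are made up.

```c
#include <assert.h>

/* Dual-issued instructions as a percentage of all FPU instructions. */
double dual_issue_rate(unsigned long long dual_inst,
                       unsigned long long fpu_inst)
{
    assert(fpu_inst > 0);
    return 100.0 * (double)dual_inst / (double)fpu_inst;
}

/* How many times larger the old counter value is than the new one. */
double counter_ratio(unsigned long long old_count,
                     unsigned long long new_count)
{
    assert(new_count > 0);
    return (double)old_count / (double)new_count;
}
```

For example dual_issue_rate(108870170, 118591312) is about 91.8 and counter_ratio(518683470, 403422245) is about 1.29, matching the 92% and 1.3x in the table.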

As a quick comparison this is what the C compiler comes up with (-O2). I'm getting some different results to this at the moment so the comparisons here are only provisional ...

                   CLK = 1189866322 (2.9x vs improved)
             IALU_INST = 693566877
              FPU_INST = 131085992
             DUAL_INST = 93602858   (71% rate)
             E1_STALLS = 31768387   (2.5x vs improved)
             RA_STALLS = 322216105  (6.8x vs improved)
      EXT_FETCH_STALLS = 0
       EXT_LOAD_STALLS = 14099244

The number of flops is pretty close though so it can't be far off. I'm doing a couple of things the C compiler isn't so the number should be a bit lower. Still not sure where all those ext stalls are coming from.

Well the compiler can only improve ...

In total elapsed time terms these are something like 1.8s, 0.88s, and 0.60s from slowest to fastest on a single core. I only have a multi-core driver for the assembly versions. On 1 column of cores best is 201ms vs improved at 157ms. With all 16 cores ... identical at 87ms. But I should really get those bugs fixed and a realistic test case running before getting too carried away with the numbers.

Update: I later posted in more detail about the scheduling. I tracked down some bugs so the numbers changed around a bit but nothing that changes the overall relationships.

Well, this is going to be challenging ...

After finally licking the algorithm and memory requirements I tried going multi-core on the epiphany ...

Results are ... well affected by the grid network.

I need to do some memory profiling to work out how much i'm hitting memory but even the internal mesh network is being swamped.

If I use a row of the chip (4 cores) as you move closer to core 0 (further away from external ram?) the dma wait overhead grows progressively. Using 4 cores is about 3x faster.

If instead I use a column of the chip, they all seem about equal in dma wait and using 4 cores is closer to 4x faster.

Using all 16 is about the same as only using 4 in a column. Actually using 8 in a 2x4 arrangement is best, but it's only a bit faster than 4. I haven't tried columns other than 0.

But it's not all epiphany's fault ...

VJ is just such a harsh algorithm for cpus in general, and concurrency in particular - no wonder Intel chose it for opencv. I haven't tried these new versions on multi-core ARM yet, but I think from memory it generally doesn't scale very well either. Even on a GPU it was incredibly difficult to get any efficiency out of it - whereas other algorithms typically hit 100+x in performance, I could only manage around 10x after considerable effort.

The cascade description is just so bulky and needs to be accessed so often it just keeps blowing out any caches or LDS. Or if you do have a pile of LDS like on CELL, it doesn't work well with SIMD either.

Where to now?

Anyway I'm not sure what to do at this point. First thought was to distribute the cascade amongst multiple cores on chip, but one will still have high contention on the earlier cascade stages. Which means you need to distribute copies as well, and then it can't all fit. Then again any change could have a large effect on external bandwidth requirements so might be useful (I would need to run the numbers to see if this would help). Perhaps a better idea is to look at trying to increase the work done at each stage by increasing the size of the window - 32x32=144 sub-windows, 64x32=528 - which might give the DMA enough time to finish without thrashing the network. But then I wouldn't have sufficient LDS to double buffer. Although as I scan horizontally in each core I can also greatly reduce the amount of data loaded at each horizontal step and maybe double-buffering isn't so important.
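The sub-window counts quoted above can be reproduced with a few lines of C. This is only a sketch: the 20x20 detector window is an assumption inferred from the figures (12*12 = 144 and 44*12 = 528), and `DETECT_SIZE` and `subwindow_count` are made-up names.

```c
/* Assumed detector window size in pixels; inferred from the counts in the
 * text rather than stated there. */
#define DETECT_SIZE 20

/* Number of sub-window positions probed inside a tile of tile_w x tile_h
 * pixels, using the same counting convention as the figures in the text. */
static int subwindow_count(int tile_w, int tile_h)
{
    if (tile_w < DETECT_SIZE || tile_h < DETECT_SIZE)
        return 0;
    return (tile_w - DETECT_SIZE) * (tile_h - DETECT_SIZE);
}
```

Going from a 32x32 to a 64x32 tile doubles the pixels loaded but gives roughly 3.7 times as many sub-windows to evaluate per load, which is where the extra time for the DMA to finish would come from.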

32K ... boy it's just so small for both code and data.

PS as a plus though - just getting multi-core code running was pretty easy.

Update: I ran some simulated numbers with my test case and came up with some pretty good results. Just straight off the bat one can halve bandwidth requirements of both the cascade and window loads and further tuning is possible. A summary is below (sizes are written with the unit in place of the decimal point, so 23M6 is 23.6M and 0K5 is 0.5K):

Window Size     Block size      Cascade             Window DMA
                Local   Load    DMA Count   Bytes   DMA Count  Bytes
32x32              4K     4K         5770    23M6        1600    6M5
64x32              4K     4K         2827    11M6         440    3M6 
64x32              8K     4K         2253     9M2         440    3M6
64x32              8K     2K         4251     8M2         440    3M6
64x32              8K     1K         7971     8M2         440    3M6
64x32              2K    0K5        21356    10M9         440    3M6
64x32             10K     1K         6999     7M2         440    3M6
64x32             14K     1K         5353     5M5         440    3M6
64x32              1K     1K        11169     11M         440    3M6

Performance varied a bit but it wasn't by a great amount - which allows one to trade performance for memory requirements. I still haven't tried taking advantage of the common case of window overlap during the window dma.

Interestingly using 3K of memory vs 12K memory as a cache actually improves the bandwidth needs ...? My guess is that it is partly because buffer 0/1 thrash 80%/52% of the time for almost any local block size, and partly because 1K allows less waste on average during early termination (on average 1K5 vs 6K).

When I came to putting this onto the epiphany I didn't have enough room to double buffer the window loads ... but then as I parameterised it and tried with and without I discovered that double-buffering reduced the overall performance anyway. I think it just adds extra congestion to the mesh network and also to the LDS. Because memory is so tight I'm letting the compiler assign all buffers to the LDS right now - so it could potentially be improved.

Whilst parameterising the routine I made a mistake and started reading cascade data from main memory - and spent a couple of hung-over hours hunting down why it was suddenly running 15x slower. Just used the 'src' pointer rather than the 'dst' pointer ... but while chasing it down I added some hardware profiling stuff.

                   CLK = 511522228
             IALU_INST = 320509200
              FPU_INST = 118591312
             DUAL_INST = 73790192
             E1_STALLS = 9737422
             RA_STALLS = 111237528
      EXT_FETCH_STALLS = 0
       EXT_LOAD_STALLS = 1704676

The numbers show the code still could be improved a bit (a good bit?) - this includes the C routine that calls the ASM and loads in the cascade buffers. Because (64-20) doesn't go very well into (512-64) this is for a routine that misses the edges and does a bit less work (I'm also getting funny answers so there is some other bug too).

I actually don't know where those external load stalls are coming from, there shouldn't be any external memory accesses within the code block in question as far as I know (although could be from e-lib). RA seems a bit high, and dual issue count seems a bit low. Need to look closer at the scheduling I guess.
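To put rough numbers on "a bit high" and "a bit low", the counters can be turned into ratios. This is a back-of-envelope sketch only: it assumes each DUAL_INST count corresponds to one dual-issued FPU instruction, which is my reading of the counter rather than something verified, and the function names are made up.

```c
/* Counter values copied from the profile listing above. */
static const double CLK       = 511522228.0;
static const double FPU_INST  = 118591312.0;
static const double DUAL_INST = 73790192.0;
static const double RA_STALLS = 111237528.0;

/* Fraction of FPU instructions that were dual-issued (assumes every
 * DUAL_INST count pairs exactly one FPU instruction with another one). */
static double dual_issue_fraction(void)
{
    return DUAL_INST / FPU_INST;
}

/* Fraction of all clock cycles spent in RA stalls. */
static double ra_stall_fraction(void)
{
    return RA_STALLS / CLK;
}
```

On these figures roughly 62% of the flops dual-issue and about 22% of all cycles are RA stalls, so there is visible room left in the scheduling.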

BTW with these changes the multi-core performance is looking much much better but I need to track that bug down first to make sure it's still doing the same amount of work.

Update 2: Using the hardware profiling I tweaked and re-arranged and I think I have the inner loop down to 27 cycles. The previous version was about 42 by a similar measurement. Assuming no mistakes (it's just an untested inner fragment so far) that's a pretty big boost and just shows how much ignorance can mislead.

Although I tried to leave enough time between flops I forgot to account for the dual-issue capability. This makes a very big difference (i.e. up to 2x = 8) to how many instructions you need to fit in to avoid stalls.

The optimisation technique was to overlay/interleave/pipeline two loops which gives enough scheduling delay to avoid all extra stalls and enough instruction mix to dual-issue every possible dual-issue instruction (flops). Actually the very final stage of the calculation is currently part of the 3rd loop (although I hope to change that). So while it is calculating addresses and loads for [t=0], it is also calculating arithmetic at [t=-1] and final result for [t=-2]. Fortunately there are plenty of registers to track the interleaved state. This just then needs to be bracketed by the start of the first and the end of the last, and any stalls there are pretty much insignificant. This technique gives you the scheduling opportunities of unrolling loops without the code-size overhead.
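The shape of that interleaving can be shown in C, although the real thing is hand-scheduled assembly and a C compiler is free to reorder this however it likes. The sketch below uses a toy three-stage calculation (load, multiply, add) standing in for the real feature evaluation; the function names and the calculation itself are illustrative assumptions only.

```c
#include <stddef.h>

/* Reference: one element at a time, each stage waits on the previous one. */
void scale_bias_simple(const float *in, float *out, size_t n,
                       float scale, float bias)
{
    for (size_t i = 0; i < n; i++) {
        float x = in[i];        /* stage 0: load */
        float y = x * scale;    /* stage 1: arithmetic */
        out[i] = y + bias;      /* stage 2: final result */
    }
}

/* Pipelined: iteration t loads element t, multiplies element t-1 and
 * finishes element t-2, so no statement uses a value produced just
 * before it in the same iteration. */
void scale_bias_pipelined(const float *in, float *out, size_t n,
                          float scale, float bias)
{
    if (n < 2) {                /* too short to fill the pipeline */
        scale_bias_simple(in, out, n, scale, bias);
        return;
    }

    /* Prologue: the start of the first iterations. */
    float x = in[0];            /* load       [0] */
    float y = x * scale;        /* arithmetic [0] */
    x = in[1];                  /* load       [1] */

    for (size_t t = 2; t < n; t++) {
        float r = y + bias;     /* final result for [t-2] */
        y = x * scale;          /* arithmetic for   [t-1] */
        x = in[t];              /* load for         [t]   */
        out[t - 2] = r;
    }

    /* Epilogue: drain the last two elements. */
    out[n - 2] = y + bias;
    out[n - 1] = x * scale + bias;
}
```

The loop body is the same size as the simple version; only the prologue and epilogue are extra, which is the code-size advantage over unrolling.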

(PS but doing it by hand is a royal pita as you have to keep track of multiple concurrent streams in your head. Quite a task even for such a simple loop as this one).

I also looked into a more compact cascade format. If I use bytes and perform the 2d address calculation in-code I can drop the cascade size by over 33%. The address calculation goes up from 4+1 (+ 1 for the ldrd) instructions to 10+0.5 but the smaller data size may make it worth it. Because the window data is fixed in size the address calculation is a simple shift and mask.

This byte format would require only 24 bytes for both 2-region features and 3-region features. The current short-based format requires 32 bytes for 2-region features and 40 bytes for 3-region features (more data is implicit in the 3-region features so needn't be included). Either format is dword aligned so every data load uses dwords.

For assembly left/top/bottom/right is pre-multiplied by 4 for the float data size. The region bounds are then stored thusly:

     uint   (left << 26) | (top << 18) | (bottom << 10) | (right << 2)

left and right are placed in the first/last 8 bits on purpose as both can be extracted with only 1 instruction: either a shift or an and. The middle 2 bytes always need both a shift and a mask and so can implicitly include the required stride multiply for free.

Since the window is fixed in size, the stride multiply can be implicitly calculated as part of the byte-extraction shift - which makes it basically cost-free. So the only extra overhead is summing the partial products.

    left = (val >> 24);                      // nb: no mask or multiply needed
    top = (val >> (16-6)) & (0xff << 6);     // shift also applies the *64 stride
    bottom = (val >> (8-6)) & (0xff << 6);
    right = val & 0xff;                      // nb: no multiply needed
    topleft = image[top+left];
    topright = image[top+right];
    bottomleft = image[bottom+left];
    bottomright = image[bottom+right];

This equates to 14 instructions plus a word load.
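For reference, here is a hedged, compilable C version of the byte format, packing included. It assumes a 64-float-wide window (so a 256-byte row stride) and coordinates below 64; `pack_region`, `load_at` and `region_corners` are names invented for this sketch. The byte-offset load goes through a `char` pointer so the compiler does not multiply the already-scaled offset by 4 again.

```c
#include <stdint.h>

/* Pack region bounds (each 0..63) into one word, pre-multiplied by 4. */
static uint32_t pack_region(uint32_t left, uint32_t top,
                            uint32_t bottom, uint32_t right)
{
    return (left << 26) | (top << 18) | (bottom << 10) | (right << 2);
}

/* Load a float at a byte offset, avoiding the implicit *4 of image[i]. */
static float load_at(const float *image, uint32_t byte_offset)
{
    return *(const float *)((const char *)image + byte_offset);
}

/* Fetch the four corner samples of one region: top-left, top-right,
 * bottom-left, bottom-right. */
static void region_corners(const float *image, uint32_t val, float corner[4])
{
    uint32_t left   = val >> 24;                         /* left * 4   */
    uint32_t top    = (val >> (16 - 6)) & (0xffu << 6);  /* top * 256  */
    uint32_t bottom = (val >> (8 - 6)) & (0xffu << 6);   /* bottom*256 */
    uint32_t right  = val & 0xffu;                       /* right * 4  */

    corner[0] = load_at(image, top + left);
    corner[1] = load_at(image, top + right);
    corner[2] = load_at(image, bottom + left);
    corner[3] = load_at(image, bottom + right);
}
```

With a 64x32 window filled so that each element holds its own index, `region_corners(image, pack_region(3, 5, 17, 40), c)` should pick up elements 5*64+3, 5*64+40, 17*64+3 and 17*64+40.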

Because shorts have enough bits the stride multiply can be pre-calculated at compile time and then loaded into the cascade stream. This is what i'm currently using.

     uint (left + top * 64) << 16 | (right + top * 64)
     uint (left + bottom * 64) << 16 | (right + bottom * 64)

And the code to extract becomes

    topleftoff = val0 >> 16;
    toprightoff = val0 & 0xffff;
    bottomleftoff = val1 >> 16;
    bottomrightoff = val1 & 0xffff;
    topleft = image[topleftoff];
    topright = image[toprightoff];
    bottomleft = image[bottomleftoff];
    bottomright = image[bottomrightoff];

This equates to 8 cpu instructions + a dword load.
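The short-based format is simpler to sketch in C because the offsets are plain element indices. Again this assumes a 64-wide window, and `WINDOW_STRIDE`, `pack_row` and `region_corners_short` are names made up for illustration.

```c
#include <stdint.h>

#define WINDOW_STRIDE 64u   /* assumed window width in floats */

/* Pre-compute both offsets for one row of a region: the left-hand offset
 * goes in the high 16 bits and the right-hand offset in the low 16 bits. */
static uint32_t pack_row(uint32_t left, uint32_t right, uint32_t row)
{
    return ((left + row * WINDOW_STRIDE) << 16) | (right + row * WINDOW_STRIDE);
}

/* val0 holds the top row offsets, val1 the bottom row offsets. */
static void region_corners_short(const float *image, uint32_t val0,
                                 uint32_t val1, float corner[4])
{
    corner[0] = image[val0 >> 16];       /* top-left     */
    corner[1] = image[val0 & 0xffffu];   /* top-right    */
    corner[2] = image[val1 >> 16];       /* bottom-left  */
    corner[3] = image[val1 & 0xffffu];   /* bottom-right */
}
```

All the stride arithmetic happens once when the cascade is built, so the run-time work is just the two shifts, two masks and four loads.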

Unfortunately if one is to write this as C code, the compiler will force a multiply-by-4 onto any index into the float array and unless one tries to trick it (using casts) the code will necessarily be 4 instructions longer. Actually it might just perform short loads instead and make a bit of a pig's breakfast of it.
