Well, this is going to be challenging ...
After finally licking the algorithm and memory requirements I tried going multi-core on the epiphany ...
Results are ... well, affected by the grid network.
I need to do some memory profiling to work out how hard I'm hitting memory, but even the internal mesh network is being swamped.
If I use a row of the chip (4 cores), the DMA wait overhead grows progressively as you move closer to core 0 (further away from external RAM?). Using 4 cores is about 3x faster.
If instead I use a column of the chip they all seem about equal in DMA wait, and using 4 cores is closer to 4x faster.
Using all 16 is about the same as only using 4 in a column. Actually using 8 in a 2x4 arrangement is best, but it's only a bit faster than 4. I haven't tried columns other than 0.
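(For reference, picking a column rather than a row is just a question of how the workgroup is opened on the host side. Something along these lines - a sketch only, with the program name and the rest of the harness as placeholders rather than my actual test code - gives a 4x1 group down column 0:)

#include <e-hal.h>

/* Open a 4x1 workgroup down column 0 (rows 0-3) and start the same
 * detector program on each core.  Sketch only: "e_detect.srec" and the
 * surrounding setup are placeholders, not the actual test harness. */
int start_column_group(void)
{
    e_epiphany_t dev;

    e_init(NULL);
    e_reset_system();

    /* (start row, start col, rows, cols) = (0, 0, 4, 1) */
    if (e_open(&dev, 0, 0, 4, 1) != E_OK)
        return -1;

    e_load_group("e_detect.srec", &dev, 0, 0, 4, 1, E_TRUE);

    /* ... wait for the cores to finish, read back results ... */

    e_close(&dev);
    e_finalize();
    return 0;
}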
But it's not all epiphany's fault ...
VJ is just such a harsh algorithm for cpus in general, and concurrency in particular - no wonder intel chose it for opencv. I haven't tried these new versions on multi-core ARM yet, but from memory it generally doesn't scale very well either. Even on a GPU it was incredibly difficult to get any efficiency out of it - whereas other algorithms typically hit 100+X in performance, I could only manage around 10x after considerable effort.
The cascade description is just so bulky and needs to be accessed so often that it keeps blowing out any caches or LDS. And even if you do have a pile of LDS, like on CELL, it doesn't work well with SIMD either.
Where to now?
Anyway I'm not sure what to do at this point. First thought was to distribute the cascade amongst multiple cores on chip, but one will still have high contention on the earlier cascade stages. Which means you need to distribute copies as well, and then it can't all fit. Then again any change could have a large effect on external bandwidth requirements so it might be useful (I would need to run the numbers to see if this would help). Perhaps a better idea is to look at trying to increase the work done at each stage by increasing the size of the window - 32x32=144 sub-windows, 64x32=528 - which might give the DMA enough time to finish without thrashing the network. But then I wouldn't have sufficient LDS to double buffer. Although as I scan horizontally in each core I can also greatly reduce the amount of data loaded at each horizontal step, so maybe double-buffering isn't so important.
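(For the record, those sub-window counts fall straight out of the 20x20 probe window - the same 20 that turns up in the (64-20) comment further down - assuming the trailing positions at the tile edge are dropped:)

/* Number of 20x20 probe positions within a WxH tile, stepping one pixel
 * at a time; matches the 144 (32x32) and 528 (64x32) figures above,
 * assuming the last row/column of positions at the tile edge is dropped. */
static int subwindows(int w, int h)
{
    return (w - 20) * (h - 20);
}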
32K ... boy it's just so small for both code and data.
PS as a plus though - just getting multi-core code running was pretty easy.
Update: I ran some simulated numbers with my test case and came up with some pretty good results. Just straight off the bat one can halve bandwidth requirements of both the cascade and window loads and further tuning is possible. A summary is below:
Window Size  Block size  Local Load     Cascade DMA        Window DMA
                                        Count    Bytes     Count   Bytes
32x32        4K          4K              5770    23M6      1600    6M5
64x32        4K          4K              2827    11M6       440    3M6
64x32        8K          4K              2253     9M2       440    3M6
64x32        8K          2K              4251     8M2       440    3M6
64x32        8K          1K              7971     8M2       440    3M6
64x32        2K          0K5            21356    10M9       440    3M6
64x32        10K         1K              6999     7M2       440    3M6
64x32        14K         1K              5353     5M5       440    3M6
64x32        1K          1K             11169    11M        440    3M6
Performance varied a bit but not by a great amount - which allows one to trade performance for memory requirements. I still haven't tried taking advantage of the common case of window overlap during the window DMA.
Interestingly, using 3K of memory as a cache rather than 12K actually reduces the bandwidth needed ...? My guess is that it's partly because buffers 0/1 thrash 80%/52% of the time for almost any local block size, and partly because 1K allows less waste on average during early termination (on average 1K5 vs 6K).
When I came to putting this onto the epiphany I didn't have enough room to double buffer the window loads ... but then as I parameterised it and tried with and without, I discovered that double-buffering reduced the overall performance anyway. I think it just adds extra congestion to the mesh network and to the LDS. Because memory is so tight I'm letting the compiler assign all buffers to the LDS right now - so it could potentially be improved.
Whilst parameterising the routine I made a mistake and started reading cascade data from main memory - and spent a couple of hung-over hours hunting down why it was suddenly running 15x slower. I'd just used the 'src' pointer rather than the 'dst' pointer ... but while chasing it down I added some hardware profiling stuff.
CLK              = 511522228
IALU_INST        = 320509200
FPU_INST         = 118591312
DUAL_INST        = 73790192
E1_STALLS        = 9737422
RA_STALLS        = 111237528
EXT_FETCH_STALLS = 0
EXT_LOAD_STALLS  = 1704676
The numbers show the code could still be improved a bit (a good bit?) - this includes the C routine that calls the ASM and loads in the cascade buffers. Because (64-20) doesn't go very well into (512-64), this is for a routine that misses the edges and does a bit less work (I'm also getting funny answers, so there is some other bug too).
I actually don't know where those external load stalls are coming from; there shouldn't be any external memory accesses within the code block in question as far as I know (although they could be from e-lib). The RA stalls seem a bit high, and the dual issue count seems a bit low. Need to look closer at the scheduling I guess.
BTW with these changes the multi-core performance is looking much much better but I need to track that bug down first to make sure it's still doing the same amount of work.
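For reference, the counters above come from the on-core ctimers; from memory the e-lib calls to gather a pair of them per run look roughly like this (there are only two hardware timers, so covering all eight events takes a few runs). detect_strip is just a placeholder name for the code being measured:

#include <e-lib.h>

/* detect_strip() is a placeholder for the code being profiled. */
extern void detect_strip(void);

void profile_pass(unsigned int *clk, unsigned int *ra_stalls)
{
    /* the ctimers count down from whatever they are set to */
    e_ctimer_set(E_CTIMER_0, E_CTIMER_MAX);
    e_ctimer_set(E_CTIMER_1, E_CTIMER_MAX);
    e_ctimer_start(E_CTIMER_0, E_CTIMER_CLK);
    e_ctimer_start(E_CTIMER_1, E_CTIMER_RA_STALLS);

    detect_strip();

    *clk       = E_CTIMER_MAX - e_ctimer_get(E_CTIMER_0);
    *ra_stalls = E_CTIMER_MAX - e_ctimer_get(E_CTIMER_1);

    e_ctimer_stop(E_CTIMER_0);
    e_ctimer_stop(E_CTIMER_1);
}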
Update 2: Using the hardware profiling I tweaked and re-arranged and I think I have the inner loop down to 27 cycles. The previous version was about 42 by a similar measurement. Assuming no mistakes (it's just an untested inner fragment so far) that's a pretty big boost and just shows how much ignorance can mislead.
Although I tried to leave enough time between flops I forgot to account for the dual-issue capability. This makes a very big difference (up to 2x, i.e. 8 instructions) to how many instructions you need to fit in to avoid stalls.
The optimisation technique was to overlay/interleave/pipeline two loops, which gives enough scheduling delay to avoid all extra stalls and enough instruction mix to dual-issue every possible dual-issue instruction (the flops). Actually the very final stage of the calculation is currently part of a 3rd loop (although I hope to change that). So while it is calculating addresses and loads for [t=0], it is also calculating arithmetic at [t=-1] and the final result for [t=-2]. Fortunately there are plenty of registers available to track the interleaved state. This then just needs to be bracketed by the start of the first loop and the end of the last, and any stalls there are pretty much insignificant. This technique gives you the scheduling opportunities of unrolling loops without the code-size overhead.
(PS but doing it by hand is a royal pita as you have to keep track of multiple concurrent streams in your head. Quite a task even for such a simple loop as this one).
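In C-ish form the shape is something like the following - purely illustrative, since the real thing is hand-scheduled assembly; stage_load/stage_sum/stage_final are stand-ins for the address+load, arithmetic and final accumulate parts of one feature evaluation:

#include <stdio.h>

typedef struct { int t; } stage_t;

/* Stand-ins for the three phases of one feature evaluation; they just
 * print the iteration number so the interleaving is visible. */
static stage_t stage_load(int t)      { printf("load  %d\n", t); stage_t s = { t }; return s; }
static stage_t stage_sum(stage_t s)   { printf("sum   %d\n", s.t); return s; }
static void    stage_final(stage_t s) { printf("final %d\n", s.t); }

/* Three iterations in flight at once: the loop body issues the loads for
 * iteration t while doing the arithmetic for t-1 and the final result for
 * t-2.  Needs count >= 2. */
static void run_features(int count)
{
    stage_t s1 = stage_load(0);     /* prologue: fill the pipeline */
    stage_t s2 = stage_sum(s1);
    s1 = stage_load(1);

    for (int t = 2; t < count; t++) {
        stage_t s0 = stage_load(t);     /* addresses + loads for t */
        stage_final(s2);                /* finish off t-2          */
        s2 = stage_sum(s1);             /* arithmetic for t-1      */
        s1 = s0;
    }

    stage_final(s2);                    /* epilogue: drain the pipeline */
    stage_final(stage_sum(s1));
}

int main(void) { run_features(5); return 0; }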
I also looked into a more compact cascade format. If I use bytes and perform the 2d address calculation in-code I can drop the cascade size by over 33%. The address calculation goes up from 4+1 (+ 1 for the ldrd) instructions to 10+0.5 but the smaller data size may make it worth it. Because the window data is fixed in size the address calculation is a simple shift and mask.
This byte format would require only 24 bytes for both 2-region features and 3-region features. The current short-based format requires 32 bytes for 2-region features and 40 bytes for 3-region features (more data is implicit in the 3-region features so needn't be included). Either format is dword aligned so every data load uses dwords.
For assembly left/top/bottom/right is pre-multiplied by 4 for the float data size. The region bounds are then stored thusly:
uint (left << 26) | (top << 18) | (bottom << 10) | (right << 2)
left and right are placed in the first/last 8 bits on purpose as both can be extracted with only 1 instruction: either a shift or an and. The middle 2 bytes always need both a shift and a mask, and so can implicitly include the required stride multiply for free.
Since the window is fixed in size, the stride multiply can be implicitly calculated as part of the byte-extraction shift - which makes it basically cost-free. So the only extra overhead is summing the partial products.
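The packing side is just the layout above wrapped up (a sketch - pack_region is a hypothetical helper for illustration, not part of the actual converter):

/* Pack one region rectangle into the uint layout above.  Coordinates are
 * element positions within the window (all < 64); the 2 extra bits in
 * each shift amount implement the pre-multiply by 4 mentioned above. */
static unsigned int pack_region(unsigned int left, unsigned int top,
                                unsigned int bottom, unsigned int right)
{
    return (left << 26) | (top << 18) | (bottom << 10) | (right << 2);
}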
left   = (val >> 24);                 // nb: no mask or multiply needed
top    = (val >> 16-6) & (0xff << 6);
bottom = (val >> 8-6) & (0xff << 6);
right  = val & 0xff;                  // nb: no multiply needed

topleft     = image[top+left];
topright    = image[top+right];
bottomleft  = image[bottom+left];
bottomright = image[bottom+right];
This equates to 14 instructions plus a word load.
Because shorts have enough bits the stride multiply can be pre-calculated at compile time and then loaded into the cascade stream. This is what I'm currently using.
uint (left + top * 64) << 16 | (right + top * 64)
uint (left + bottom * 64) << 16 | (right + bottom * 64)
And the code to extract becomes
topleftoff     = val0 >> 16;
toprightoff    = val0 & 0xffff;
bottomleftoff  = val1 >> 16;
bottomrightoff = val1 & 0xffff;

topleft     = image[topleftoff];
topright    = image[toprightoff];
bottomleft  = image[bottomleftoff];
bottomright = image[bottomrightoff];
This equates to 8 cpu instructions + a dword load.
Unfortunately if one is to write this as C code, the compiler will force a multiply-by-4 onto any index into the float array, and unless one tries to trick it (using casts) the code will necessarily be 4 instructions longer. Actually it might just perform short loads instead and make a bit of a pig's breakfast of it.
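For what it's worth, the kind of trick I mean is something like this - purely hypothetical, and it assumes the packed shorts were stored as pre-multiplied byte offsets (which still fit in 16 bits for a 64x32 window), not the element indices the current format holds:

/* If the packed shorts held byte offsets (element index * sizeof(float)),
 * going through char* stops the compiler adding its own multiply-by-4 on
 * every access.  Hypothetical helper for illustration only. */
static inline float load_at(const float *image, unsigned int byteoff)
{
    return *(const float *)((const char *)image + byteoff);
}

Usage would then be along the lines of topleft = load_at(image, val0 >> 16) and so on for the other three corners.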