About Me
Michael Zucchi
B.E. (Comp. Sys. Eng.)
also known as Zed
to his mates & enemies!
< notzed at gmail >
< fosstodon.org/@notzed >
That scheduling ...
Had some other stuff I had to poke at the last couple of nights, and I needed a bit of a rest anyway. Pretty stuffed tbh, but i want to drop this off to get it out of my head.
Tonight I finally got my re-scheduled inner loop to work. Because i'm crap at keeping it all in my head I basically made a good guess and then ran it on the hardware using the profiling counters and tweaked until it stopped improving (actually until i removed all RA stalls and had every FLOP a dual-issue). Although it looks like now it's running for realz one of the dual-issue's dropped out - depends on things like alignment and memory contention.
But the results so far ...
Previous "best" scheduling New "improved" scheduling
CLK = 518683470 (1.3x) CLK = 403422245
IALU_INST = 319357570 IALU_INST = 312638579
FPU_INST = 118591312 FPU_INST = 118591312
DUAL_INST = 74766734 (63% rate) DUAL_INST = 108870170 (92% rate)
E1_STALLS = 11835823 E1_STALLS = 12446143
RA_STALLS = 122796060 (2.6x) RA_STALLS = 47086269
EXT_FETCH_STALLS = 0 EXT_FETCH_STALLS = 0
EXT_LOAD_STALLS = 1692412 EXT_LOAD_STALLS = 1819284
The 2-region loop is 33 instructions including the branch, so even a single cycle improvement is measurable.
I haven't yet re-scheduled the '3-region' calculation yet so it can gain a bit more. But as can be seen from the instruction counts the gain is purely from just rescheduling. The IALU instruction count is different as i improved the loop mechanics too (all of one instruction?).
As a quick comparison this is what the C compiler comes up with (-O2). I'm getting some different results to this at the moment so the comparison here are only provisonal ...
CLK = 1189866322 (2.9x vs improved)
IALU_INST = 693566877
FPU_INST = 131085992
DUAL_INST = 93602858 (71% rate)
E1_STALLS = 31768387 (2.5x vs improved)
RA_STALLS = 322216105 (6.8x vs improved)
EXT_FETCH_STALLS = 0
EXT_LOAD_STALLS = 14099244
The number of flops are pretty close though so it can't be far off. I'm doing a couple of things the C compiler isn't so the number should be a bit lower. Still not sure where all those ext stalls are coming from.
Well the compiler can only improve ...
In total elapsed time terms these are something like 1.8s, 0.88s, and 0.60s from slowest to fastest on a single core. I only have a multi-core driver for the assembly versions. On 1 column of cores best is 201ms vs improved at 157ms. With all 16 cores ... identical at 87ms. But I should really get those bugs fixed and a realistic test case running before getting too carried away with the numbers.
Update: I later posted in more detail about the scheduling. I tracked down some bugs so the numbers changed around a bit but nothing that changes the overall relationships.
Well, this is going to be challenging ...
After finally licking the algorithm and memory requirements I tried going multi-core on the epiphany ...
Results are ... well affected by the grid network.
I need to do some memory profiling to work out how much i'm hitting memory but even the internal mesh network is being swamped.
If I use a row of the chip (4 cores) as you move closer to core 0 (further away from external ram?) the dma wait overhead grows progressively. Using 4 cores is about 3x faster.
If instead I use a column of the chip, they all seem about equal in dma wait and using 4 cores is closer to 4x faster.
Using all 16 is about the same as only using 4 in a column. Actually using 8 in a 2x4 arrangement is best, but it's only a bit faster than 4. I haven't tried columns other than 0.
But it's not all epiphany's fault ...
VJ is just such a harsh algorithm for cpus in general, and concurrency in particular- no wonder intel choose it for opencv. I haven't tried these new versions on multi-core ARM yet, but I think from memory it generally doesn't scale very well either. Even on a GPU it was incredibly difficult to get any efficiency out of it - whereas other algorithms typically hit 100+X in performance, I could only manage around 10x after considerable effort.
The cascade description is just so bulky and needs to be accessed so often it just keeps blowing out any caches or LDS. Or if you do have a pile of LDS like on CELL, it doesn't work well with SIMD either.
Where to now?
Anyway i'm not sure what to do at this point. First thought was to move the cascade distributed amongst multiple cores on chip, but one will still have high contention on the earlier cascade stages. Which means you need to distribute copies as well, and then it can't all fit. Then again any change could have a large effect on external bandwidth requirements so might be useful (I would need to run the numbers to see if this would help). Perhaps a better idea is to look at trying to increase the work done at each stage by increasing the size of the window - 32x32=144 sub-windows, 64x32=528 which might give the DMA enough time to finish without thrashing the network. But then I wouldn't have sufficient LDS to double buffer. Although as I scan horizontaly in each core I can also greatly reduce the amount of data loaded at each horizonal step and maybe double-buffering isn't so important.
32K ... boy it's just so small for both code and data.
PS as a plus though - just getting multi-core code running was pretty easy.
Update: I ran some simulated numbers with my test case and came up with some pretty good results. Just straight off the bat one can halve bandwidth requirements of both the cascade and window loads and further tuning is possible. A summary is below:
Window Size Block size Cascade Window DMA
Local Load DMA Count Bytes DMA Count Bytes
32x32 4K 4K 5770 23M6 1600 6M5
64x32 4K 4K 2827 11M6 440 3M6
64x32 8K 4K 2253 9M2 440 3M6
64x32 8K 2K 4251 8M2 440 3M6
64x32 8K 1K 7971 8M2 440 3M6
64x32 2K 0K5 21356 10M9 440 3M6
64x32 10K 1K 6999 7M2 440 3M6
64x32 14K 1K 5353 5M5 440 3M6
64x32 1K 1K 11169 11M 440 3M6
Performance varied a bit but it wasn't by a great amount - which allows one to trade performance for memory requirements. I still haven't tried taking advantage of the common case of window overlap during the window dma.
Interestingly using 3K of memory vs 12K memory as a cache actually improves the bandwidth needs ...? My guess is that it is partly because buffer 0/1 thrash 80%/52% of the time for almost any local block size, and partly because 1K allows less waste on average during early termination (on average 1K5 vs 6K).
When I came to putting this onto the epiphany I didn't have enough room to double buffer the window loads ... but then as i parameterised it and tried with and without i discovered that double-buffering reduced the overall performance anyway. I think it just adds extra congestion to the mesh network and also to the LDS. Because memory is so tight I'm letting the compiler assign all buffers to the LDS right now - so it could potentially be improved.
Whilst parameterising the routine I made a mistake and started reading cascade data from main memory - and spent a couple of hung-over hours hunting down why it was suddenly running 15x slower. Just used the 'src' pointer rather than the 'dst' pointer ... but while chasing it down I added some hardware profiling stuff.
CLK = 511522228
IALU_INST = 320509200
FPU_INST = 118591312
DUAL_INST = 73790192
E1_STALLS = 9737422
RA_STALLS = 111237528
EXT_FETCH_STALLS = 0
EXT_LOAD_STALLS = 1704676
The numbers show the code stil could be improved a bit (a good bit?) - this includes the C routine that calls the ASM and loads in the cascade buffers. Because (64-20) doesn't go very well into (512-64) this is for a routine that misses the edges and does a bit less work (i'm also getting funny answers so there is some other bug too).
I actually don't know where those external load stalls are coming from, there shouldn't be any external memory accesses within the code block in question as far as I know (although could be from e-lib). RA seems a bit high, and dual issue count seems a bit low. Need to look closer at the scheduling I guess.
BTW with these changes the multi-core performance is looking much much better but I need to track that bug down first to make sure it's still doing the same amount of work.
Update 2: Using the hardware profiling I tweaked and re-arranged and I think I have the inner loop down to 27 cycles. The previous version was about 42 by a similar measurement. Assuming no mistakes (its just an untested inner fragmment so far) that's a pretty big boost and just shows how much ignorance can mislead.
Although I tried to leave enough time between flops I forgot to account for the dual-issue capability. This makes a very big difference (i.e. up to 2x = 8) to how many instructions you need to fit in to avoid stalls.
The optimisation technique was to overlay/interleave/pipeline two loops which gives enough scheduling delay to avoid all extra stalls and enough instruction mix to dual-issue every possible dual-issue instruction (flops). Actually the very final stage of the calculation is currently part of the 3rd loop (although i hope to change that). So while it is calculating addresses and loads for [t=0], it is also calculating arithmetic at [t=-1] and final result for [t=-2]. Fortunately there are the plenty of registers required to track the interleaved state. This just then needs to be bracketed by the start of the first and the end of the last, and any stalls there are pretty much insignificant. This technique gives you the scheduling oportunities of unrolling loops without the code-size overhead.
(PS but doing it by hand is a royal pita as you have to keep track of multiple concurrent streams in your head. Quite a task even for such a simple loop as this one).
I also looked into a more compact cascade format. If I use bytes and perform the 2d address calculation in-code I can drop the cascade size by over 33%. The address calculation goes up from 4+1 (+ 1 for the ldrd) instructions to 10+0.5 but the smaller data size may make it worth it. Because the window data is fixed in size the address calculation is a simple shift and mask.
This byte format would require only 24 bytes for both 2-region features and 3-region features. The current short-based format requires 32 bytes for 2-region features and 40 bytes for 3-region features (more data is implicit in the 3-region features so needn't be included). Either format is dword aligned so every data load uses dwords.
For assembly left/top/bottom/right is pre-multiplied by 4 for the float data size. The region bounds are then stored thusly:
uint (left << 26) | (top << 18) | (bottom << 10) | (right << 2)
left and right are placed in the first/last 8 bits on purpose as both can be extracted with only 1 instruction: either a shift or an add. The middle 2 bytes always need both a shift and a mask and so can implicitly include the required stride multiply for free.
Since the window is fixed in size, the stride multiply can be implicitly calculated as part of the byte-extraction shift - which makes it basically cost-free. So the only extra overhead is summing the partial products.
left = (val >> 24); // nb: no mask or multiply needed
top = (val >> 16-6) & (0xff << 6);
bottom = (val >> 8-6) & (0xff << 6);
right = val & 0xff; // nb: no multiply needed
topleft = image[top+left];
topright = image[top+right];
bottomleft = image[bottom+left];
bottomright = image[bottom+right];
This equates to 14 instructions plus a word load.
Because shorts have enough bits the stride multiply can be pre-calculated at compile time and then loaded into the cascade stream. This is what i'm currently using.
uint (left + top * 64) << 16 | (right + top * 64)
uint (left + bottom * 64) << 16 | (right + bottom * 64)
And the code to extract becomes
topleftoff = val0 >> 16;
toprightoff = val0 & 0xffff;
bottomleftoff = val1 >> 16;
bottomrightoff = val1 & 0xffff;
topleft = image[topleftoff];
topright = image[toprightoff];
bottomleft = image[bottomleftoff];
bottomright = image[bottomrightoff];
This equates to 8 cpu instructions + a dword load.
Unfortunately if one is to write this as C code, the compiler will force a multiply-by-4 to any index to the float array and unless one tries to trick it (using casts) the code will necessarily be 4 instructions longer. Actually it might just perform short loads instead and make a bit of a pigs breakfast of it.
Epiphany bit test
Whilst working on the assembly version of the object detector code I came across the need for a bit test. Unfortunately paralella doesn't have one built in and also unfortunately there is no literal form of the AND instruction which is normally used for such things.
Fortunately however there is still a way to encode a single-bit test it into a single instruction.
CPUs always track the sign bit in the status register and its easy toput any bit there. The only question was how to extract that since there is no Bcc that maps directly to it.
Starting with the LSL instruction and it's flag updates:
LSL <RD>, <RN>, #IMM5
RD = RN << <OP2>
AN = RD[31]
AV = 0
AC = 0 (pity AC doesn't follow the last out-shifted bit)
if (RD[31:0]==0) { AZ=1 } else { AZ=0}
And then looking at the Bcc operation that utilise the negative flag:
0110 Greater Than (Signed) BGT ~AZ & (AV ==AN)
0111 Greater Than or Equal (Signed) BGTE AV == AN
1000 Less Than (Signed) BLT AV !=AN
1001 Less Than or Equal (Signed) BLTE AZ | (AV != AN)
Since AV is always 0 after a shift, this leads to a fairly straightforward pair of tests:
;; Test bit X is set
lsl r0,r1,#(31-X)
blt .bit_is_set
bgte .bit_is_not_set
And the same for the MOVcc instruction.
Having this just as efficient as if one had a bit-test instruction is rather more handy than if it wasn't. Bits are such a compact way to represent information it's a way to save memory and anything that saves memory is a big plus on epiphany.
The C compiler just follows the C one uses to implement bit tests:
int main(int argc, char **argv) {
return (argc & (1<<5)) ? 6 : 9;
}
00000000 <_main>:
0: 2403 mov r1,0x20
2: 00da and r0,r0,r1
4: 00c3 mov r0,0x6
6: 2123 mov r1,0x9
8: 0402 moveq r0,r1
a: 194f 0402 rts
But it can be done literally "one better":
_main: lsl r0,r0,#(31-5)
mov r0,#6
mov r1,#9
movgte r0,r1
rts
Update: Actually a better example would be a higher bit. C falls further behind ...
int main(int argc, char **argv) {
return (argc & (1<<21)) ? 6 : 9;
}
00000000 <_main>:
0: 200b 0002 mov r1,0x0
4: 240b 1002 movt r1,0x20
8: 00da and r0,r0,r1
a: 00c3 mov r0,0x6
c: 2123 mov r1,0x9
e: 0402 moveq r0,r1
10: 194f 0402 rts
Also note the code-size expansion from 2 bytes to 10 for the low-register source operand. Although i don't know why the compiler is using the 32-bit form of mov r1,0x0 since the 8 bit form would suffice and it would only need 8 bytes.
For high-registers this would be 4-bytes vs 12.
Now this looks better ...
So after "update *" the other night (on a previous post) and saying i was "done" with trying to improve the code; I ended up waking up thinking about it, and spent a good few hours last night trying to implement a new approach.
I didn't think it would improve performance but I was trying to address the problem of generality and memory overhead.
The previous version grouped stages in blocks, but each stage with a given block needed to be complete. So the double-buffer buffers need to accomodate at least the single largest stage - which is over 7k even with the relatively simple cascade and "highly optimised" enncoding format I'm testing with. This together with keeping a few stages in LDS permanently meant it was using upwared of 20k just for the cascade data and wouldn't be able to handle bigger cascade stages without exhausting local memory. It was already extremely tight.
So I thought about how to break the cascade into fixed-size streamable blocks. The only state needed to be kept between blocks is the "stage sum", which needs to accumulate across the entire stage - so it is cheap to implement. Most of the time I spent writing the cascade compiler trying to fit it into a usable format, and then trying to debug the implementation (when it was the compiler which had the bug).
The new version uses tuneable block sizes, where the first block is stored permanently in LDS (to allow time to dma in the next block) and the rest DMAd from shared memory as with the previous implementation. Using 4k blocks (=12K LDS) even attains marginally better performance in the C implementation. Using 2k blocks (only 6k lds) is about equal.
Once I got that working I converted the assembly routine I had over - and it even worked the first time I ran it! Performance is identical to the previous version in this case although the increased generality and reduced memory requirements are a big plus.
I also found a way to avoid the need for dma abort (which as stated, doesn't work reliably and can crash the hardware) - by delaying it's request till part-way through processing the local cascade. Although the asm version is still too fast for it and some cycles are wasted waiting for the DMA channel to be clear.
And as a bonus I incidentally verified the results against another independent implementation which i know works well enough so i'm fairly confident the numbers are now valid too. Because it performs so many calculations even a small error should show up with wildly different results.
Best times on single epu:
c - 1.79s, dmawait=15MC
asm - 0.87s, dmawait=31MC
ARM c - 1.00s
I could potentially reduce the image data dma external bandwidth needs by nearly 2/3 ... but i haven't looked at that yet - however it may be important for multi-core.
I'm fairly happy with these numbers, it will be interesting to see how it scales to multiple cores.
PS the assembly language leaf routine 149 instructions in 454 bytes. I used some interesting optimisations of the algorithm and took advantage of the separate float and integer status flags. However there may be room for micro-optimisation by improving the scheduling as todate it was only "by eye and intuition" (i.e educated guesses).
A bit more vj progress, thoughts on elf loader
I had a chance to fit in a bit more hacking on the parallella over the weekend, although not as much progress as i'd hoped.
I first created an ARM version of the viola-jones (VJ) code to compare; it's still getting different results to a previous implementation but at least it's consistent with the EPU (epiphany core) version. I tried to work out why it isn't the same but I still haven't fathomed it - it can wait.
But at least with a consistent implementation I can do some comparisions. It's only running on a single core in each case, and it's timing just the processing of every 20x20 window on a single-scale 512x512 image (about 230K probes).
CPU Time
Intel core Duo 2Ghz 0.3s
Parallella ARM 1.5s
Parallella EPU 2.8s
Parallella ARM format 2 1.1s (approx)
Parallella EPU format 2 2.3s
Parallella EPU ASM 1.4s
Parallella ARM C* 1.0s
Parallella EPU C* 1.8s
Parallella EPU ASM* 0.9s
Update 3: I finally got the assembly version working and added 3 more timings. These include C versions that have hand-unrolled inner loops aided by the new format. The ARM version doesn't report very reliable timings (paging on first-run i guess), but it's definitely improved. I've tried to optimise the scheduling on the assembly version although there are still enough stalls to add up - but removing those will require more than just reordering of instructions. These new times also includes some tweaks to the arrangement of the cascade chunks, although i can't recall if it made much difference. Still yet to double-buffer the image DMA.
Update *: Tried a bunch more things like fully double buffering, a more C-friendly format for the C versions, and some hacks which probably aren't safe/aren't really working. But as they still "appear" to work i'm including the timings here listed with the *s. But the C implementation is so tight it only just fits into the on-core memory to such an extent it might be difficult to add the scheduling code for multi-core.
It's also pretty much exhausted my interest at this point so i will probably move onto multi-core as well as a more complete and realistic example (although, tomorrow is another day so i might change my mind if inspiration hits).
(BTW this is not representative of the time it takes to detect a face on a 512x512 image on this hardware, as one doesn't normally need to look for 20x20 pixel faces at that resolution and it will typically be scaled down somewhat first, 1x1 pixel centres are usually not needed either).
To be honest I was hoping the EPU would be a bit closer to the ARM because of the similar clock speed, but I think in part it's suffering from a compiler which doesn't optimise quite as well. And the format I'm using benefits from the more capable instruction set on the ARM (bitfield stuff). However, there's still more that can be done so it isn't over yet - i'm not double buffering the image window loads yet for one.
First I started looking at writing an assembly version of the inner-most routine. But I ran into a problem on that - debugging is a royal pain in the arse. Obviously I had bugs, but the only way to "debug" it is to stare at the code looking for the error. And i'm not yet proficient enough at the dialect to be very good at this yet. The only indication of failure is that the code never gets to the mailbox return message. I don't know how to run gdb on the epu cores, and the computer i'm using as a console doesn't have an amd64 GNU/Linux installed (required for the pre-compiled sdk). A few documentation errors were a bit misleading too (post-increment is really post-increment despite the copy-paste job from the indexed LDR page). After paring things down enough and working out the ABI I at least had something which walked the cascade. As I started filling out the guts I thought I may as well tweak the format for better run-time efficiency too.
The EPU is most efficient when performing 64-bit data loads (aligned to 64-bit); but the format I had was all based on 32-bits for space-efficiency. With a little bit of re-arranging I managed to create a 64-bit format which requires less manipulation whilst only adding 4 bytes to each feature - this adds up but I think it will be worth it. It also lets me unroll the inner loop a bit more easily. One thing I noticed is that the 'weights' of the features in the cascade i'm using only fall into 3 categories; 2 combinations for 2-region features: -0.25 + 0.50, -0.25 + 0.75, and 3-region features are always a fixed -0.25 + 0.50 + 0.50. So I don't have to encode the weights individually for each region and can just either hard-code the weights (for the 3-region features), or store a single float value for each feature instead.
Anyway then my sister dropped around, which means a slight hangover this morning ... so seeing if all this tweaking makes any difference will have to wait.
Update: So i did some evening hacking, and finally got the code working on the epu - kept overwriting the stack and other fun stuff which is a pain to track down. And oh boy, make sure you avoid double and that any shorts in a structure are unsigned - othterwise it really hits performance.
So yeah the new cascade format does increase the performance ... but it just widened the gap! Although I gained a pretty decent 30-50% improvement on ARM execution time (actually its hard to tell, timing is very inconsistent on the arm code), it was pretty much insignificant on epiphany, under 2%. Blah.
The total DMA wait time of loading the cascades went up by 30% which suggests it's being limited by external memory bandwidth and not cpu processing. Some of this is good - indicating the cpu code is better and so there's more waiting around - but as the format is a bit fatter that can't help either and it's not like there's any way to fit any other processing in the dead time.
Will have to re-check the dma code to make sure it's double buffering properly, but i'm probably going to have to try another approach. a) Go for size of the data instead, i can get each feature down to 20/24 bytes (2 region features) + 24 bytes (3 region features) vs 32/40 - if i do the address arithmetic in the inner loop (lucky i can do the multplies with shifts), and/or b) see if fiddling with the sizes of the groups helps, and/or c) Distribute the extra bits of the cascade in chunks across multiple epiphany cores so the cascade remains completely on-chip, and/or d) Try a completely different approach such as a pipeline architecture where different cores process different stages.
Update 2: Actually on second thoughts, i don't think it's the dma, it's just not enough cycles to make a big difference (assuming my timing code is right). And even if i limit it to the local stages and don't dma in any cascade chunks it's still taking about 75% of the total time. So although the C compiler seems to be doing an okish job I'm probably better off spending my time in assembler now.
EPU Linking
Based on thoughts on a post on the forums I also had a quick look at how one might write an elf loader for the epu - in a way which makes it more useful than just memcpy. I think it should be possible to create an executable for the epu which allows each core to have custom code, and for each core to be able to resolve locations in other cores. And if the linking is done at run-time it can also re-arrange the code in arbitrary ways - such as to customise a specific work-group topology or even handle differences between 16 or 64-core chips. And I think I can do this without changing any of the tools.
I poked around ld and noticed the -q and -r options which retain the reloc hunks in the linked output. For a test I created a linker script which defines a number of sections each with an origin a megabyte apart - i.e. the spacing of epu cores. This represents the absolute location of each 'virtual' core. Code can be placed on any core using section directives (attributes), and using the -q option the linker outputs both the code and the reloc hunks and so should allow a loader to re-link each core's code to suit any topology ... well in theory.
Before I get anywhere with that I need some doco on the epiphany-specific reloc hunks from the assembler - looks quite straightforward - and finding enough time to look into it.
This might seem a bit weird in terms of GNU/Linux, but i'm familair (although no doubt rather rusty!) with non-pic but relocatable code from my Amiga hacking days. My code back then was assembled to a fixed address, and the OS loader would re-link it to potentially non-contiguous chunks of memory in a global system address space. If it worked then, it should work ok with the parallella too. My only concern is that the GNU linker outputs enough information - using the -r switch (for relocatable code) causes the final link to fail (probably libc/startup related), but the -q switch (which simply retains the reloc hunks) may not be sufficient. It also relies on the fact that any inter-core reference will have to be absolute and thus require a load and a reloc entry of a 32-bit address, but I think that will be the case.
Having a runtime elf loader/relocator allows for some other nifty shit too like thread local storage and so on. Now, if only the epiphany ABI were a bit more sane too! (wrt register conventions)
Well that wasn't very fun
I've been trying to work with the android open accessory protocol over the last few days and i finally got android to android communications working via usb. Most of the examples to be found are either on an embedded platform or via GNU/Linux, and i couldn't find anything about doing the host-side from android.
Bit of a mess though. They somehow managed to make a fairly simple (but still a bit weird) library - libusb - into something less than the sum of it's parts (discordant vs syngergistic?). And then you add to that the complex life-cycle of an android app and things get pretty nasty.
Some pretty weird shit to deal with:
- If you don't set the app as the default handler for the device, then you get permission popups which mean you have to put the connect logic in the resume method otherwise nothing works.
- But if it is defaulted, ... then you don't get permission popups, and therefore resume is not called at the right time.
- If you tell a connected device to go into accessory mode, you get a device disconnected message, which sort of makes sense ...
- ... but you don't get a connected message for the accessory device.
- And if you subsequently list the devices connected to the system, a device is still listed with the same vendor id + device id (this is wrong, the device should have changed at this point) ...
- ... but that "device" isn't the same as the "device" you started with.
- And there seems to be no "reliable" way to determine if you have the accessory interface, or the MTP interface (on the machine i'm testing with). (well maybe testing using a no side-effect valid request which fails could be used here).
- If the device rotates it destroys the application ... leaving everything in an unknown state.
- Sometimes nothing works anyway, and the apps on both end need killing.
- The documentation is a bit shit, and amounts to "the [outdated] source is the documentation". And that relies on a USB library which doesn't fudge the device id.
But with all that I did manage to send strings both ways so now the challenge is to sort through the cruft i've ended up with and turn it into something robust and hopefully simple.
In other android news i finally restarted on the ffts-related work and will scale down the initial goals to get something out the door.
DMA n stuff
So it looks like it's going to take a bit longer to get the object recognition code going on parallella.
First i was a bit dismayed that there is no way to signal the ARM with an interrupt - but subsequently satisfied that it is just work in progress and can be added to the FPGA glue logic later. Without this there is no mechanism for efficient epiphany to arm communications as the only option is polling a shared memory location (ugh, how x86).
Then I had to investigate just how to communicate using memory buffers. I had some trouble with the linker script (more on that later) so I hardcoded some addresses and managed to get it to work. I created a simple synchronous mailbox system that isn't too inefficient to poll from the ARM.
Once I had something going I tried a few variations to judge performance: simple memory accesses, and different types of DMA.
The test loop just squares the elements of one array into another, and uses a single e-core. The arrays are 512x512 floats.
seq How Total DMA verified
1: Direct array access 216985626 0 ok
3: Synchronous DMA 11076425 0 ok
5: Simultanous DMA byte 90242983 88205438 ok
7: Simultanous DMA short 46104601 44066927 ok
9: Simultanous DMA word 24047251 22009451 ok
11: Simultanous DMA long 11179747 9131659 ok
13: Async Double DMA byte 88744259 88021739 ok
15: Async Double DMA short 44542240 43819736 ok
17: Async Double DMA word 22448029 21725573 ok
19: Async Double DMA long 9479016 8756500 ok
All numbers are in clock cycles. The DMA routines used one or two 8K buffers (but not aligned to banks). The DMA column is the "wasted cycles" waiting for asynchronous DMA to complete (where that number is available).
As expected the simple memory access is pretty slow - an order of magnitude slower than a simple synchronous DMA. The synchronous DMA is good for it's simplicity - just use e_dma_copy() to copy in/out each block.
Simultanous DMA uses two buffers and enqueue two separate DMA operations concurrently - one reading and the other writing. They both still need to wait for completion before moving on, and it appears they are bandwwidth limited.
Async double-buffered DMA uses the two buffers, but uses a chained DMA operation to write out the previous result and read in the next result - and the DMA operation runs asynchronous to the processing loop.
Some notes:
- Avoid reading buffers using direct access!
- Slow slow slow. Writing shouldn't be too bad as write transactions are fire and forget. I presume as i used core 0,0 this is actually the best-case scenario at that ...
- Avoid anything smaller than 64-bit transfers.
- Every DMA element transfer takes up a transaction slot, so it becomes very wasteful for any smaller size than the maximum.
- Concurrent external DMA doesn't help
- Presumably it's bandwidth limited.
- Try to use async DMA.
- Well, obviously.
- The e-core performance far outstrips the external memory bandwidth
- So yeah, next time i should use something a bit more complex than a square op. This is both good - yeah heaps of grunt - and not so good - memory scheduling is critical for maximising performance.
- Multicore?
- I have yet to experiment with bigger work-groups to see how it scales up (or doesn't).
Build Environment
So one reason I couldn't get the linker script to work properly (assigning blocks to sections) was due to my build setup. I was initially going to have a separate directory for epiphany code so that a makefile could just change CC, etc. But that just seemed too clumsy so I decided to use some implicit make rules which use new extensions to automate some of the work - .ec, .eo, .elf, .srec, etc. The only problem is the linker script takes the name of the extension into account, so all my section attributes were being ignored ...
I copied it and added the extensions to a couple of places and that fixed it - but i haven't gone back to adjust the code trying to take advantage of it.
Object recognition
So anyway i tried to fit this knowledge into the OR code, but i haven't yet got it working. The hack of hard-coding the address offsets doesn't work now since i'm getting the linker to drop some data into the same shared address space, and i'm not sure yet how i can resolve the linker-assigned addresses from the ARM side so i can properly map the memory blocks. So until I work this out there's no way to pass the job data to the e-cores.
I could just move all the data to the ARM side and have that initialise the tables, but then I have to manually 'link' the addresses in. So that is throwing away the facilities of the linker which are designed for this kind of thing. I could create a custom linker script which hard-codes the addresses in another way but that seems hacky and non-portable.
I might have to check the forums and see what others have come up with, and read that memory map a bit more closely.
Progress on object detection
I spent more hours than I really wanted to trying to crack how to fit a haarcascade onto the epiphany in some sort of efficient way, and I've just managed to get my first statistics. I think they look ok but I guess i will only know once I load it onto the device.
First the mechanism i'm using.
Data structures
Anyone who has ever used the haarcascades from opencv will notice something immediately - their size. This is mostly due to the XML format used, but even the relatively compact representation used in socles is bulky in terms of epiphany LDS.
struct stage {
int stageid;
int firstfeature;
int featurecount;
float threshold;
int nextSuccess;
int nextFail;
};
struct feature {
int featureid;
int firstregion;
int regioncount;
float threshold;
float valueSuccess;
float valueFail;
};
struct region {
int regionid;
int left;
int top;
int width;
int height;
int weight;
};
For the simple face detector i'm experimenting with there are 22 stages, 2135 features and 4630 regions, totalling ~150K in text form.
So the first problem was compressing the data - whilst still retaining ease/efficiency of use. After a few iterations I ended up with a format which needs only 8 bytes per region whilst still including a pre-calculated offset. I took advantage of the fact that each epiphany core will be processing fixed-and-limited-sized local window, so the offsets can be compile-time calculated as well as fit within less than 16 bits. I also did away with the addressing/indexing and took advantage of some characterstics of the data to do away with the region counts. And I assumed things like a linear cascade with no branching/etc.
I ended up with something like the following - also note that it is accessed as a stream with a single pointer, so i've shown it in assembly (although I used an unsigned int array in C).
cascade:
.int 22 ; stages
;; stage 0
.short 2,3 ; 3 features each with 2 regions
;; one feature with 2 regions
.short weightid << 12 | a, b, c, d ; region 0, pre-calculated region offsets with weight
.short weightid << 12 | a, b, c, d ; region 1, pre-calculated region offsets with weight
.float threshold, succ, fail ; feature threshold / accumulants
;; ... plus 2 more for all 3 features
.short 0,0 ; no more features
.float threshold ; if it fails, finish
;; stage 1
.short 2,15 ; 15 features, each with 2 regions
This allows the whole cascade to fit into under 64K in a readonly memory in a mostly directly usable form - only some minor bit manipulation is required.
Given enough time I will probably try a bigger version that uses 32-bit values and 64-bit loads throughout, to see if the smaller code required for the inner loop outweights the higher memory requirements.
Overlays ...
Well I guess that's what they are really. This data structure also lets me break the 22-stage cascade into complete blocks of some given size, or arbitrary move some to permanent local memory.
Although 8k is an obvious size, as i want to double-buffer and also keep some in permanent local memory and still have room to move, I decided on the size of the largest single stage - about 6.5K, and moved 1.2K to permanent LDS. But this is just a first cut and a tunable.
The LDS requirements are thus modest:
+-------------+
| |
|local stages |
| 1k2 |
+-------------+
| |
| buffer 0 |
| 6k5 |
+-------------+
| |
| buffer 1 |
| 6k5 |
+-------------+
With these constraints I came up with 13 'groups' of stages, 3 stages in the local group, and the other 19 spread across the remaining 12 groups.
This allows for a very simple double-buffering logic, and hopefully leaves enough processing at each buffer to load the next one. All the other stages are grouped into under 6k5 blocks which are referenced by DMA descriptors built by the compiler.
This also provides a tunable for experimentation.
Windowing
So the other really big bandwith drain is that these 6000 features are tested at each origin location at each scale of image ...
If you have a 20x20 probe window and a 512x512 image, this equates to 242064 window locations, which for each you will need to process at least stage 0 - which is 6 regions at 4 memory accesses per region, which is 5 million 32-bit accesses or 23 megabytes. If you just access this directly into image data it doesn't cache well on a more sophisticated chip like an ARM (512 floats is only 4 lines at 16K cache), and obviously direct access is out of the question for epiphany.
So one is pretty much forced to copy the memory to the LDS, and fortunately we have enough room to copy a few window positions, thus reducing global memory bandwidth about 100 fold.
For simplicity my first cut will access 32x32 windows of data, and with a cascade window size of 20x20 this allows 12x12 = 144 sub-window probes per loaded block to be executed with no external memory accesses. 2D DMA can be used to load these blocks, and what's more at only 4K there's still just enough room to double buffer these loads to hide any memory access latency.
LDS for image data buffers:
+-------------+
| SAT 0 |
| 32x32xfloat |
| 4k |
+-------------+
| SAT 1 |
| 32x32xfloat |
| 4k |
+-------------+
A variance value is also required for each window, but that is only 144 floats and is calculated on the ARM.
Algorithm
So the basic algorithm is:
Load in 32x32 window of summed-area-table data (the image data)
prepare load group 0 into buffer 0 if not resident
for each location x=0..11, y=0..11
process all local stages
if passed add (x,y) to hits
end
gid = 0
while gid < group count AND hits not empty
wait dma idle
gid++;
if gid < group count
prepare load group gid into dma buffer gid & 1
if not resident
end
for each location (x,y) in hits
process all stages in current buffer
if passed add (x,y) to newhits
end
hits = newhits
end
return hits
The window will be loaded double-buffered on the other dma channel.
Efficiency
So i have this working on my laptop stand-alone to the point that I could run some efficiency checks on bandwidth usage. I'm using the basic lenna image - which means it only finds false positives - so it gives a general lowerish bound one might expect for a typical image.
Anyway these are the results (I may have some boundary conditions out):
Total window load : 1600
Total window bytes : 6553600
Total cascade DMA required: 4551
Total cascade DMA bytes : 24694984
Total cascade access bytes: 371975940
Total SAT access bytes : 424684912
So first, the actual global memory requirements. Remember that the image is 512x512 pixels.
- window load, window bytes
-
How many windows were loaded, and how many bytes of global memory this equates to. Each window is 32x32xfloats.
This can be possibly be reduced as there is nearly 1/3 overlap between locations at the expense of more complex addressing logic. Another alternative is to eschew the double-buffering and trade lower bandwidth for a small amount of idle time while it re-arranges the window data (even a 64x32 window is nearly 5 times more global-bandwidth efficient, but i can't fit 2 in for double buffering).
- cacade DMA count, cascade DMA bytes
-
How many individual DMA requests were required, and their total bytes. One will immediately notice that the cascade is indeed the majority of the bandwidth requirement ...
If one is lucky, tuning the size of each of the cascade chunks loaded into the double buffers may make a measureable difference - one can only assume the bigger the better as DMA can be avoided entirely if the cascade aborts within 11 stages with this design.
Next we look at the "effective" memory requirements - how much data the algorithm is actually accessing. Even for this relatively small image without any true positives, it's quite significant.
It requires something approaching a gigabyte of memory bandwidth just for this single scale image!
- Total SAT bytes
-
The total image (SAT) data accessed is about 420MB, but since this code "only" needs to load ~6.5MB it represents a 65x reduction in bandwidth requirements. I am curious now as to whether a simple technique like this would help with the cache hit ratio on a Cortex-A8 now ...
- Total cascade bytes
-
The cascade data that must be 'executed' for this algorithm is nearly as bad - a good 370MB or so. Here the software cache isn't quite as effective - it still needs to load 24MB or so, but a bandwidth reduction of 15x isn't to be sneezed at either. Some of that will be the in-core stages. Because of the double buffering there's a possibility the latency of accessing this memory it can be completely hidden too - assuming the external bus and mesh fabric isn't saturated, and/or the cascade processing takes longer than the transfer.
LDS vs cache can be a bit of a pain to use but it can lead to very fast algorithms and no-latency memory access - with no cache thrashing or other hard-to-predict behaviour.
Of course, the above is assuming i didn't fuck-up somewhere, but i'll find that out once I try to get it working on-device. If all goes well I will then have a baseline from which to experiment with all the tunables and see what the real-world performance is like.
Copyright (C) 2019 Michael Zucchi, All Rights Reserved.
Powered by gcc & me!