Progress on object detection
I spent more hours than I really wanted to trying to work out how to fit a Haar cascade onto the epiphany in some sort of efficient way, and I've just managed to get my first statistics. I think they look OK, but I guess I will only know once I load it onto the device.
First, the mechanism I'm using.
Data structures
Anyone who has ever used the Haar cascades from OpenCV will notice something immediately: their size. This is mostly due to the XML format used, but even the relatively compact representation used in socles is bulky in terms of epiphany LDS.
struct stage {
        int stageid;
        int firstfeature;
        int featurecount;
        float threshold;
        int nextSuccess;
        int nextFail;
};

struct feature {
        int featureid;
        int firstregion;
        int regioncount;
        float threshold;
        float valueSuccess;
        float valueFail;
};

struct region {
        int regionid;
        int left;
        int top;
        int width;
        int height;
        int weight;
};
For the simple face detector I'm experimenting with there are 22 stages, 2135 features and 4630 regions, totalling ~150K in text form.
So the first problem was compressing the data whilst still retaining ease and efficiency of use. After a few iterations I ended up with a format which needs only 8 bytes per region whilst still including a pre-calculated offset. I took advantage of the fact that each epiphany core will be processing a fixed-and-limited-size local window, so the offsets can be calculated at compile time as well as fit within 16 bits. I also did away with the addressing/indexing, and took advantage of some characteristics of the data to do away with the region counts. And I assumed things like a linear cascade with no branching.
I ended up with something like the following. Note that it is accessed as a stream with a single pointer, so I've shown it in assembly (although I used an unsigned int array in C).
cascade:
        .int    22                              ; stages

        ;; stage 0
        .short  2,3                             ; 3 features each with 2 regions

        ;; one feature with 2 regions
        .short  weightid << 12 | a, b, c, d     ; region 0, pre-calculated region offsets with weight
        .short  weightid << 12 | a, b, c, d     ; region 1, pre-calculated region offsets with weight
        .float  threshold, succ, fail           ; feature threshold / accumulants

        ;; ... plus 2 more for all 3 features

        .short  0,0                             ; no more features
        .float  threshold                       ; if it fails, finish

        ;; stage 1
        .short  2,15                            ; 15 features, each with 2 regions
This allows the whole cascade to fit into under 64K in a read-only memory in a mostly directly usable form - only some minor bit manipulation is required. (4630 regions x 8 bytes is ~37K, plus 2135 features x 12 bytes of threshold/accumulant data is ~25K, plus the small per-stage headers.)
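To show what that bit manipulation amounts to, here's a minimal sketch of evaluating one packed region against a local SAT tile - the names and the weight table are made up, but the layout follows the format above (a 4-bit weight id in the top of the first half-word leaves 12 bits for the first offset, plenty for a 32x32 tile of floats):

/* hedged sketch: evaluate one 8-byte packed region against a local
   SAT tile; the function name and weight table are assumptions */
static const float weights[3] = { -1.0f, 1.0f, 2.0f };

static inline float region_sum(const float *sat, const unsigned short *r)
{
        unsigned int a   = r[0] & 0x0fff;  /* pre-calculated corner offset */
        unsigned int wid = r[0] >> 12;     /* weight id in the top 4 bits */

        /* standard 4-point summed-area-table rectangle sum, weighted */
        return weights[wid] * (sat[a] - sat[r[1]] - sat[r[2]] + sat[r[3]]);
}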
Given enough time I will probably try a bigger version that uses 32-bit values and 64-bit loads throughout, to see if the smaller code required for the inner loop outweighs the higher memory requirements.
Overlays ...
Well, I guess that's what they are, really. This data structure also lets me break the 22-stage cascade into complete blocks of some given size, or arbitrarily move some to permanent local memory.
Although 8K is an obvious size, as I want to double-buffer and also keep some in permanent local memory and still have room to move, I decided on the size of the largest single stage - about 6.5K - and moved 1.2K to permanent LDS. But this is just a first cut and a tunable.
The LDS requirements are thus modest:
+-------------+
|             |
| local stages|
|     1k2     |
+-------------+
|             |
|  buffer 0   |
|     6k5     |
+-------------+
|             |
|  buffer 1   |
|     6k5     |
+-------------+
With these constraints I came up with 13 'groups' of stages, 3 stages in the local group, and the other 19 spread across the remaining 12 groups.
This allows for a very simple double-buffering logic, and hopefully leaves enough processing at each buffer to load the next one. All the other stages are grouped into blocks of under 6k5 which are referenced by DMA descriptors built by the compiler.
This also provides a tunable for experimentation.
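For illustration, the table the compiler spits out might be as simple as the following - the names are hypothetical, and on-device the entries would map onto epiphany DMA descriptors:

/* hypothetical compile-time group table: each entry describes one
   contiguous chunk of cascade data in external memory, sized to fit
   the 6k5 double buffers */
struct group {
        const void *addr;       /* external address of this chunk */
        unsigned int size;      /* bytes, <= ~6.5K */
};

/* 12 DMA-loaded groups; the other 3 stages live permanently in LDS */
extern const struct group groups[12];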
Windowing
So the other really big bandwidth drain is that these few thousand features are tested at each origin location at each scale of image ...
If you have a 20x20 probe window and a 512x512 image, this equates to 242064 window locations, and for each you will need to process at least stage 0 - which is 6 regions at 4 memory accesses per region, or nearly 6 million 32-bit accesses: some 23 megabytes. If you access the image data directly it doesn't cache well on a more sophisticated chip like an ARM (512 floats is only 4 lines at 16K cache), and obviously direct access is out of the question for the epiphany.
So one is pretty much forced to copy the memory to the LDS, and fortunately we have enough room to copy a few window positions at a time, thus reducing global memory bandwidth about 100-fold.
For simplicity my first cut will access 32x32 windows of data, and with a cascade window size of 20x20 this allows 12x12 = 144 sub-window probes per loaded block to be executed with no external memory accesses. 2D DMA can be used to load these blocks and, what's more, at only 4K there's still just enough room to double-buffer these loads to hide any memory access latency.
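In rough C, the probe loop over one loaded block comes out something like this - a minimal sketch, where run_stages() and the hit list are made-up stand-ins for the real cascade code:

/* hedged sketch: probe all 12x12 sub-window origins in a 32x32 SAT
   tile; run_stages() and struct hit are hypothetical stand-ins */
struct hit { short x, y; };

extern int run_stages(const float *win);

int probe_tile(const float *sat, struct hit *hits)
{
        int n = 0;

        for (int y = 0; y < 12; y++)
                for (int x = 0; x < 12; x++)
                        /* region offsets are relative to the probe origin */
                        if (run_stages(sat + y * 32 + x))
                                hits[n++] = (struct hit){ x, y };
        return n;
}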
LDS for image data buffers:
+-------------+
|    SAT 0    |
| 32x32xfloat |
|     4k      |
+-------------+
|    SAT 1    |
| 32x32xfloat |
|     4k      |
+-------------+
A variance value is also required for each window, but that is only 144 floats and is calculated on the ARM.
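I haven't detailed that calculation, but for reference the textbook Viola-Jones version needs each window's pixel sum and squared-pixel sum - the squared sums imply a second SAT on the host side, which is an assumption here, not something described above - and the feature thresholds are then typically scaled by the square root of this value:

/* hedged sketch: per-window variance for a 20x20 probe window, from
   the window's pixel sum and squared-pixel sum (both assumed to be
   computed on the ARM from summed-area tables) */
float window_variance(float sum, float sumsq)
{
        float n    = 20.0f * 20.0f;     /* 20x20 probe window */
        float mean = sum / n;

        return sumsq / n - mean * mean;
}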
Algorithm
So the basic algorithm is:
load in 32x32 window of summed-area-table data (the image data)
prepare load group 0 into buffer 0 if not resident
for each location x=0..11, y=0..11
    process all local stages
    if passed
        add (x,y) to hits
end
gid = 0
while gid < group count AND hits not empty
    wait dma idle
    gid++;
    if gid < group count
        prepare load group gid into dma buffer gid & 1 if not resident
    end
    for each location (x,y) in hits
        process all stages in current buffer
        if passed
            add (x,y) to newhits
    end
    hits = newhits
end
return hits
The window will be loaded double-buffered on the other DMA channel.
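In C the group loop works out something like this - a minimal sketch, where dma_start()/dma_wait(), process_group() and the group table are made-up stand-ins for the real DMA and cascade code:

/* hedged sketch of the double-buffered group loop; all names here
   are hypothetical (see the grouping sketch earlier) */
struct group { const void *addr; unsigned int size; };
extern const struct group groups[];

extern void dma_start(void *dst, const void *src, unsigned int size);
extern void dma_wait(void);
extern int  process_group(const void *stages, int nhits);

extern unsigned char buffers[2][6656];  /* two 6k5 LDS buffers */

int run_groups(int ngroups, int nhits)
{
        int gid = 0;

        /* the load of group 0 was started before the local stages ran */
        while (gid < ngroups && nhits > 0) {
                dma_wait();             /* buffer gid&1 is now resident */
                gid++;
                if (gid < ngroups)      /* overlap the next group's load */
                        dma_start(buffers[gid & 1],
                                  groups[gid].addr, groups[gid].size);
                nhits = process_group(buffers[(gid - 1) & 1], nhits);
        }
        return nhits;
}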
Efficiency
So I have this working on my laptop stand-alone, to the point that I could run some efficiency checks on bandwidth usage. I'm using the basic Lenna image - which means it only finds false positives - so it gives a general lowerish bound on what one might expect for a typical image.
Anyway these are the results (I may have some boundary conditions out):
Total window load         :        1600
Total window bytes        :     6553600
Total cascade DMA required:        4551
Total cascade DMA bytes   :    24694984
Total cascade access bytes:   371975940
Total SAT access bytes    :   424684912
So first, the actual global memory requirements. Remember that the image is 512x512 pixels.
- window load, window bytes: How many windows were loaded, and how many bytes of global memory this equates to. Each window is 32x32 floats.
This can possibly be reduced, as there is nearly 1/3 overlap between locations, at the expense of more complex addressing logic. Another alternative is to eschew the double-buffering and trade lower bandwidth for a small amount of idle time while it re-arranges the window data (even a 64x32 window is nearly 5 times more global-bandwidth efficient, but I can't fit 2 in for double buffering).
- cascade DMA count, cascade DMA bytes: How many individual DMA requests were required, and their total bytes. One will immediately notice that the cascade is indeed the majority of the bandwidth requirement ...
If one is lucky, tuning the size of each of the cascade chunks loaded into the double buffers may make a measurable difference - one can only assume the bigger the better, as DMA can be avoided entirely if the cascade aborts within 11 stages with this design.
Next we look at the "effective" memory requirements - how much data the algorithm is actually accessing. Even for this relatively small image without any true positives, it's quite significant.
It requires something approaching a gigabyte of memory bandwidth just for this single scale image!
- Total SAT bytes: The total image (SAT) data accessed is about 420MB, but since this code "only" needs to load ~6.5MB it represents a 65x reduction in bandwidth requirements. I am now curious as to whether a simple technique like this would help with the cache hit ratio on a Cortex-A8 ...
- Total cascade bytes: The cascade data that must be 'executed' for this algorithm is nearly as bad - a good 370MB or so. Here the software cache isn't quite as effective - it still needs to load 24MB or so, but a bandwidth reduction of 15x isn't to be sneezed at either. Some of that will be the in-core stages. Because of the double buffering there's a possibility the latency of accessing this memory can be completely hidden too - assuming the external bus and mesh fabric aren't saturated, and/or the cascade processing takes longer than the transfer.
LDS vs cache can be a bit of a pain to use but it can lead to very fast algorithms and no-latency memory access - with no cache thrashing or other hard-to-predict behaviour.
Of course, the above is assuming I didn't fuck up somewhere, but I'll find that out once I try to get it working on-device. If all goes well I will then have a baseline from which to experiment with all the tunables, and see what the real-world performance is like.