SIMD linear re-sampling

It's usually more efficient to do bi-linear re-sampling in two passes as it then only requires 2 multiplies per output pixel rather than 4. So today I had a look at the X sampling problem in SIMD - as re-sampling in Y is a trivial SIMD exercise.

I've done this before in SPU, but I couldn't find my code again, and in any event it wasn't so much the algorithm as the NEON translation that was the tricky bit.

A simple and efficient way to implement this re-sampling in C is to use fixed point 16.16 bit values for the calculation (this is more than enough accuracy for screen-resolution images), so a straightforward implementation is something like the following:

float scale = (float)srcwidth / dstwidth;
uint xinc = (int)(65536 * scale);
uint xoff = 0;
int x;

for (x=0;x<dstwidth;x++) {
   // pixel address
   uint sx = xoff >> 16;
   ubyte a = src[sx];
   ubyte b = src[sx+1];
   // pixel fraction
   uint rx = xoff & 0xffff;

   dst[x] = (ubyte)(a + ((b-a) * rx) >> 16);

   xoff += xinc;
}

This assumes the scale factor is >= 0.5, and bi-linear re-sampling breaks down outside of such scales anyway.

For RGBA data, converting this to SIMD is straightforward, all the calculations are simply extended to 4 wide vectors (if that is the machine size). But for greyscale this doesn't work because each position needs to be loaded independently.

The approach I came up with is to use a table lookup instruction to extract N consecutive pixel values into A and B vectors and then use per-lane scale factors. I also saw this guy did it very differently - using a transpose so that the problem is always the easily-SIMD y case.

Pseudo-code is something like this (using some opencl constructs):

float scale = (float)srcwidth / dstwidth;
uint xinc = (int)(65536 * scale);
uint4 xoff =  { xinc * 0, xinc * 1, xinc * 2, xinc * 3 };
int x;

for (x=0;x<dstwidth;x+=4) {
   // pixel address
   uint4 sx = xoff >> 16;
   ubyte8 d = vload8(src, sx.s0);
   // pixel fraction
   uint4 rx = xoff & 0xffff;
   // form lookup table
   uint4 ox = sx - sx.s0;

   ubyte4 a = lookup(d, ox);
   ubyte4 b = lookup(d, ox+1);

   vstore4(dst, x, (ubyte4)(a + ((b-a) * rx) >> 16));

   xoff += xinc * 4;
}

Where lookup takes the first argument as an array of elements, and the second a list of indices, and it returns a lookup of each index from the array. The limited lookup-table suffices, since the limited scale range prevents out-of-range accesses.

The astute reader will notice that this requires unaligned memory accesses for loading d, but with some small changes it could cope with forced-aligned memory accesses which is probably worth it. Actually the only change required is to use a aligned (masked) sx.s0 for the load and ox calculation.

In the NEON code I mixed it up a bit - the sx.s0 calculation I performed in ARM code (as it's needed for indexing lookups) as well as sx being calculated in NEON. But the scaling itself (and rx) uses 8.8 fixed-point so that 16-bit multiplies suffice. And 8 bits is enough precision for the pixel interpolation.

I tested scaling half a 512x512 image (i.e. 256x512) up by 2x in X to 512x512. Machine is a Beagleboard XM with default CPU clocking. I'm just using a C routine to call the assembly which processes one row at a time.

Routine         Time (ms)
C                9
NEON             2.9
NEON 2x          1.7
NEON 4x          1.5

The NEON code is processing 8 pixels at a time (minimum transfer size of NEON load).

One problem with the straight conversion of the above code to NEON is that there are many dependent instructions/stalls/no dual issue opportunities. So the 2x and 4x versions process 2 and 4 rows of data concurrently and interleaved. Apart from the better scheduling it also allows the addressing and rx calculations to be shared.

I also tried the masked version and aligning the reads and writes, I thought that gave better results - but it was only an error (i'd changed the scale factor). Although adding an appropriate alignment specifier to writes did make a stable measurable (if small) difference.

Y Scaling

I haven't looked at Y scaling but that is fairly straightforward. Actually rather than perform a complete X scale pass and a complete Y scale pass it's betterto scale the (Y, Y+1) rows into a double-row buffer, and then form the output one row at a time. Actually this can be extended to a pipeline of processing if you need to implement further passes and don't need all the data for a given pass - I used this in ImageZ to perform floating point compositing of integer data very fast in Java. This requires much less temporary memory, and should be faster because of the cpu cache (and see below about how slow memory is and how much the cache is needed). Actually since you're normally scaling up by some amount, Y(n+1) will probably be equal to Y(n)+1, so you can save some the X scaling work as well.

For this reason the 2x version is probably more useful even if it isn't the fastest. In fact it can just perform the Y scaling directly in-register and avoid the temporary buffer entirely. The overheads of calculating the same row more than once might be offset by not needing to pass intermediate results through memory.

I noticed that everything seems to take about 1ms + a bit. I wonder what the memcpy time is ... ok, just using memcpy on each row (512x512) is over 1.2ms. Wow, that's slow memory. I will have to try this on the Mele (which I still haven't had time to play with much) and the Tegra 3 tablet when I get it back. I gotta say the machine is running like a pig - emacs (terminal mode) crawls on it over the network with constant pauses and delays - it feels worse than running it via a 56k modem on an Amiga 1200. Maybe something is up with the network (looks like it, console seems a lot better) ... probably the old ADSL router which is causing other issues besides (can't turn off the dhcp server, which is affecting a couple of machines).

mip-maps

So I mentioned earlier that the above scaling routines only handle scaling by >= 0.5. So how does one go smaller? Either "do it properly" by implementing an "upfirdn" function (up-sample, filter, down-sample) - which is really just a 'sparse' convolution, or you approximate it with a mip-map.

i.e. you just pick the best-closest mipmap, and then scale that down/up to fit the output. Although building a mipmap is pretty fast it is still an overhead, so it helps if you are going to use it more than once - which is the case with sliding window object detectors which must run at multiple scales.

Although there are better ways to build the mipmaps (i.e. using the upfirdn stuff), using successive summation/a box filter works well enough for what I need. It can also be implemented rather neatly in NEON - I can perform a 1/2 and 1/4 scale scaling in the same routine (and there are enough registers to do 1/8th - but that requires 64-pixel-horizontal blocks).

I didn't bother trying to implement it in C, but the NEON code I wrote can re-sample a 512x512 image into two scales - 256x256 and 128x128 in under 1.3ms by processing 32x32 tiles. One then calls this successive times on the smallest result to build the rest, at least down to 8x8 (which is more than I need).

I've ignored the edge cases for now ... and usually it's better just to arrange ones data so that there aren't any. i.e. by aligning the stride and allocating extra invalid rows. The only problem with this is with getting data from/to external sources/destinations that may have their own alignment restrictions/policies. Obviously one wants to avoid a dumb memcpy but usually there's some work that can be done, and then the edge cases only need to be handled at these interfaces.

Update: I didn't have much to do today whilst waiting for a meeting, so I mucked about with the hardware today - tried to use a narcissus image - but that wouldn't boot, so just changed to a faster SD card I bought on the way back from buying milk & beer sugars (gee, sd cards are cheep now), ethernet cable, and different switch - and after that it's still slow in emacs ... But I also mucked about with the mele and got my test harness running there. Most of the stuff is 1.5x to 2x faster than the Beagleboard XM - which is pretty nice. The mipmap code is about 4x faster!

Update 2: Late last night I poked again - I thought I had enough to put it all together but then I realised I hadn't written the XY scaling. I tried adding Y interpolation to the Xx2 row scaler, but that just took twice as long to run as it ends up doing twice as many rows (stating the bleeding obvious). It still might be useful as it requires no temporary memory but I will work on another approach using temporary row buffers which I know works well.

Update 3: Well there's no particular reason I didn't include the NEON code ... so here it is. This is just the 1x version which isn't particularly quick due to data stalls and conflicts - but it can be extended obviously by unrolling the loop and interleaving the operations.

        @
        @ Scale row in X.
        @

        @ r0 = src address
        @ r1 = dst address
        @ r2 = width of output
        @ r3 = scale factor, 16.16 fixed point

        .global scalex_neon
scalex_neon:
        stmdb sp!, {r4-r7, lr}
        
        @ copy over scale
        vdup.32         q2,r3

        @ xinc = { scale * 0, scale * 1, scale * 2, ... scale * 7 }
        vldr    d0,index
        vmovl.u8        q3,d0
        vmovl.u16       q0,d6
        vmovl.u16       q1,d7

        @ xoff [0 .. 3]
        vmul.u32        q0,q2,q0
        @ xoff [4 .. 7]
        vmul.u32        q1,q2,q1
        @ xinc = 8 * inc
        vshl.u32        q2,#3

        vmov.u8         d28,#1
        
        @ xoff(s) = 0
        mov     r4,#0

        @ Load this, next
1:      add     r5,r0,r4,lsr #16
        vld1.u8 { d16,d17 }, [r5]

        # sx = xoff >> 16
        vshrn.u32       d18,q0,#16
        vshrn.u32       d19,q1,#16
        vdup.u16        d20,d18[0]

        @ mx = sx - sx[0]
        vsub.u16        d18,d20
        vsub.u16        d19,d20

        @ convert to 8 bits - this is lookup table for all 8 entries
        vmovn.u16       d18,q9

        @ a = src[sx]
        vtbl.8          d20,{d16,d17},d18
        @ b = src[sx+1]
        vadd.u8         d18,d28
        vtbl.8          d22,{d16,d17},d18

        @ rx = (xoff & 0xffff) >> 8
        @ (only need 8 bits for lerp)
        vmovn.u32       d24,q0
        vmovn.u32       d25,q1
        vshr.u16        q12,#8
        
        @ convert to short for arithmetic
        vmovl.u8        q10,d20
        vmovl.u8        q11,d22

        @ (b - a) * rx
        vsub.u16        q13,q11,q10
        vmul.u16        q13,q12

        @ (((b-x)*rx) >> 8) + a
        vshr.s16        q13,#8
        vadd.u16        q13,q10
        
        @ convert back to bytes
        vmovn.u16       d26,q13

        @ done
        vst1.u8 d26,[r1]!

        @ xoff += xinc*8
        add     r4,r3,lsl #3
        @ xoff += xinc
        vadd.u32        q0,q2
        vadd.u32        q1,q2

        @ for whole row
        subs    r2,#8
        bhi     1b
        
        ldmfd sp!, {r4-r7, pc}

        @ per-byte pixel offsets/multiplicands
index:  .byte   0,1,2,3,4,5,6,7

ARM, NEON, etc

After a good few hours of experimentation I had some NEON code running. With all the OpenCL i've been doing for a while I kind of forgot some of the ways one writes SIMD code, and for a "RISC" cpu, a modern ARM has a large set of instructions to learn.

For example, whilst trying to build the 8 bits of the LBP code my first approach was to try to perform the 8 comparisons to form the bits in the code using a single instruction. The problem with this is that it takes a lot of fuffing about just to get the bytes in the right spot (vtbl helps a lot here), and even more fuffing about trying to turn those 8 results into 1x8 bit number. And even when you've done that you need to get 8 results before you write something out and it all just gets messy. Thinking about that jogged my memory on how one can stream data through a SIMD processor and instead produce 'n' independent results concurrently.

Which makes the whole thing a lot simpler - and allows one to re-use memory accesses as well without having to write custom extraction code for every position in the vector. For 6x64-bit memory accesses I can produce 8 output codes, and the code itself is fairly simple ...

Unfortunately, this only produces the 8 bit LBP codes, and I really want the u2 variant ... which requires a 256-byte lookup table. (you can't really do the lookup using NEON, and have to resort to ARM, and it's too slow moving the registers back, so it has to go through memory).

The obvious approach is to just run the whole lot twice, but it turns out that's pretty slow because of the slow memory on the beagleboard. e.g. my best LBP algorithm (which does 8 columns and 4 rows per inner loop) takes about 2.6ms to process a 512x512 image (simple C is about 12ms). And my best LBP to LBPu2 conversion routine takes about 2.5ms ... and together they end up 5.1ms.

So I took a slightly simpler LBP routine (does 8x1 results per loop, about 2.9ms) and tacked on the ARM code which processes the previous row's result after en-queuing all the NEON instructions (i.e. it does 8 lookups for the row above, whilst the current row calculates 8 results). This showed some promise, so I then took the ARM instructions and sprinkled them throughout the NEON ones so it spreads out the calculation a bit more/improves parallelism. Not bad - best effort is down to 3.9ms total. I only have a rough idea on the scheduling rules, I wish I had the static analysis tool for the SPU which showed dual-issue and delay slots.

I guess it's worth it - the C code that does the U2 lookup is about 13ms - but it didn't take a good few days to write and it's only 10 lines of code.

I also looked at a 'SIMD' like ARM version - it turns out you can do 4x 8-bit unsigned comparisons with a single instruction (uhsub8) - I got stuck working on that till late into the night one night (instead of watching tv), but then I figured it wasn't going to beat the NEON, so I couldn't be bothered debugging it. OTOH it would allow one to hook in the U2 lookup more easily, so it might actually be a win and worth looking into. The ARM code is very tight on registers though so i'm not sure it will fit.

I've done some testing with the classifier code as well - at first I thought I'd just wasted 10 hours on the NEON code because it was taking the same time as some simple C (the classifier is literally a 5 line double-loop). But then I realised I was running it 8x as many times as i should have. At least an 8x speedup is definitely not something to sniff at. And I still have the outer loop in C and with all the redundant function invocation overheads so there might be a bit more left to get. I haven't debugged it yet so i'm not sure it's working, but a 17x17 template on a 512x512 image is about 0.6s - this seems slow but this is much smaller than one would normally look for objects in such a source image, and at a more-realistic 4x this size it takes about 30ms.

There's still a lot of scaffolding before I can find out how it will actually run in practice, but it's looking ok so far. I kind of forgot how long it takes to write assembly code (at least, when one is learning the instruction set), but now I remember why I wasted so much of my youth on it - it is rather fun.

Mele A2000

So I finally got a Mele A2000 I ordered from deal-extreme. Took about 3 weeks from ordering. The cardboard boxes were a bit crushed but nothing was broken.

Actually seems like quite a nice little unit - much better than that crap digital tv tuner/recorder I got from Kogan (which even if it didn't have issues like rebooting and not responding to the remote if the weather isn't just so, runs totally shithouse software anyway and is just a real fucking pain to use). In short, at least it isn't total junk ...

Android seems to run fine on it, and the video player seems ok although it doesn't really do 'pal' properly. The HDMI output didn't seem too compatible with my old DVI monitor (output was purplish) but I didn't try the different video modes on it either. Had a US plug on the plug-pack, but it was robust enough that a pair of pliers fixed that without too much work (i've had other plug-packs break apart when trying to bend the pins). There's something going funky with the network though - videos off my lan play fine, but although the google shop worked once doesn't seem to want to connect any more, neither does the browser. I guess it's lost the default route somehow (maybe playing with both) Hah, funny, it was a DHCP server running on the ADSL router i used as a switch .... Looks like i can't turn the damn thing off either so I guess i'll have to find another box to do that.

Oh yeah, when i first plugged it in I used my pc mouse. And here's a fucking idea - the mouse wasn't over-accelerated to the point of total unusability like every X11 based desktop always is these days ...

OTOH, a dark blue and black mouse pointer on a black background is hardly a fucking sterling idea though.

Performance seems ok - not quite as snappy as the transformer prime (not surprising), but visually fine (better than the benchmarks would suggest). The browser is a truck load faster than the ps3 that's for sure, even running firefox.

I also got one of the wand remote things - it isn't fantastic but works ok (better than the original wii imo) without requiring any adjustment (one re-centres by just pushing off the edge and when you move back it starts from there). What is odd about it is design of the case. It has some strange sharp edges and a clip-together type look. It also has holes for a microphone and ear-piece, but it doesn't appear to be hooked into anything (could be neat if it was). It's actually kind of cool and works as a keyboard mouse on any machine you plug the small dongle into. Rather missing is a tab key however ...

Not sure what I'll do with it yet, I might just use it on a tv - one reason I went with the mele rather than the other smaller things is the CVBS output, and the SATA (and to be honest - it has a friggan case). But work is using the tablet for a demo so it's also the only android device i have handy at the moment too. I'm not sure whether I want to muck about with linux on it although that's originally got me interested in it.

I was originally going to get 2, one to hack, one to use as an appliance/android thing, and ordered them from Tom's shop on aliexpress. Unfortunately they have some fucked up credit card verification thing which wouldn't work with mine (even rang the bank), and although Tom mentioned pay pal never gave me his details (if it was supposed to be on the shop front, it definitely wasn't). I did get 2 this time but one was for a friend to look at, although if it doesn't suit his purposes I guess I may end up with it too.

There goes the weekend anyway, as if the NEON hacking wasn't enough ...

Update: Apart from 1/2 of Saturday i didn't get much time to play, but since then ...

I downloaded the new firmware which just came out and installed that. Needs a microsoft windows peecee to create the SD card :-( I did mine via a USB stick first, and then dd'd that over the sd-card, as my windows peecee has no card reader. It still worked even though the USB stick was 8GB and the SD 4GB.
The update made some strange changes, such as changing the device id in the shop, removing their custom ugly launcher, and replacing the default five-pane one with a single-pane version. But on the positive side the mouse pointer is a lot easier to see (although it just wasn't well drawn, and additionally goes beyond the hardware sprite bounds when playing video), and videos play full-screen properly. I didn't use it enough to know if its fixed any other issues.
I found Skifta - it seems a simple no-nonsense upnp client for browsing mediatomb & mythtv shares, without adverts and other crap. I was using avia (or something like that) on the tablet but then I noticed it wants a DNA sample and urine sample to install ...
I tried the alpha XBMC port from the nightly builds (on xda forums): in short, just don't bother yet on this machine. It uses software decoding, and even without that it runs like a total pig and is a huge install (~50MB). To be honest I find it a pretty confusing and difficult to use bit of software anyway, and it would have to have something else to make me put up with it's weird interface.
On the dev side, I found ADB Konnect as a simple tool to enable adb over the network. It's about 2x faster than a beagleboard XM in raw cpu power + memory speed. And JJPlayer is totally shit on it ...!

Stop #*@^$&@ "innovating"!

So in my morning 'paper read' of BN I came across this opinion piece about desktop 'domination'.

What I don't get is, why is it so important to "innovate" in such a basic and to be honest - downright fucking boring - area of a `computer desktop'?

Last I looked, my computer workbench is pretty much the same it was on AmigaDOS 1.2 (apart from a proper top-menu, but it's half there) ... I have a first-class CLI, overlapping graphical application windows, removable media is mounted automatically, and a file browser from which I can launch applications or delete files. The big things are the same, only the details have changed. Even Apple's Macintosh (apparently 'the' example of innovation) is basically the same design as it's first iteration; top-menu, full-screen "windowed" applications, and so on.

Some things just don't need innovating after the basic functionality is in place.

Take doors for instance. I'm no door expert but the last innovation in doors I see was the sliding glass door from the 80s or so and even then they only work in very limited situations (but there they do work well). Other than that, the basic design of a large flat rectilinear with a turning latch on one side hasn't changed - in hundreds of years. I really don't want anyone to 'innovate' in such a basic user interface since it simply works.

One can take any number of other household or day-to-day items, from car steering wheels to cups and saucers, frying pans to mirrors. One can innovate in the details as much as one likes - using new materials, finishes, colours, or designs - but a frying pan is most definitely identifiable as a frying pan whether it's made of steel, aluminium, glass, ceramic, coloured red, black, or blue.

And like doors, in some cases there is scope for radical differences in very different situations. An appliance (like a tv, phone, etc) is a very different computer from one used as a general re-usable `workbench', trying to make a universal GUI solution here is about as stupid (and it is mind-numbingly stupid) as making a universal door.

All a computer desktop has to do is get out of the fucking way whilst I get real work done - and that problem was long solved ago. Once it becomes the end in itself we end up with disasters like GNOME 3 or "Metro", and really they deserve all they get. I'm sure wood-workers and other craftsmen are very proud of their workbench (especially if they made it themselves), but I doubt they consider the workbench itself the pinnacle of their craft or anything other than a dependable rock-solid tool that shouldn't be trying to get in their way at every opportunity.

On ARM, NEON, et al.

So for a bit of a diversion I finally found a decent use for my beagleboard-xm ... i can use it to write and debug ARM assembly and NEON ...

To that end I 'upgraded' the sd-card to the latest available image of angstrom - for no particular reason I went with the GNOME build. Which is course was a big mistake since it uses that systemd shit (which sent me out on a google-search-of-rage looking for like-minded individuals of which there are plenty), notworkmanager and all sorts of crap. And X is still un-accelerated so it's pretty much pointless anyway.

So I went back to the other angstrom build, which at least starts the network up from boot and has gcc and gdb. But no opkg for emacs? Ugh ... so I just had to compile that myself. A bit over an hour, but I just watching tv by then and by now I was just happy I had a system which worked, and I already knew it wasn't a super-computer.

I dragged out the 2nd hand monitor I had for playing with the beagle, remembered where I had put my old ADSL router which I set up as a switch (seriously, i'd 'lost' it for about a year), and Bob's James' dad, I have a development environment up. I should really go out and get a faster SD card though, since the one it came with is slow ... (i presume that's the issue).

It's been a while since I looked much at ARM. I remember VFP being a bit weird, but NEON is nice and more like SPU. It's really missing pack-bits instruction though (take MSB of each element and pack to a scalar bitmap), at least for what I was looking at.

To start with I was looking at using it to build LBP codes - if i had a pack bits it would only be 4 instructions instead of 7. Pity I still need to use a lookup table to get the LBP u2 code though (there is a definite pattern to the table, so there might be some bit-manipulation tricks to do it too). It's been really crappy and cold, and I've been too exhausted from work, so I haven' felt like doing much yet.

Before I set the beagle up I also poked around just cross compiling stuff and seeing how it went. I came up with a builder in ARM code which was at least all in registers if not particularly compact - and what should have been an equivalent implementation in C didn't compile very well. But I guess we know that.

I spent another hour or so last night just fiddling with another version of the ARM code - it was actually quite enjoyable to play with something just for fun. It doesn't matter if i finish it, or if i ever get it to work but trying to hand-optimise one screen-full worth of code to solve an interesting problem is quite a puzzling challenge. It's nice to work on something fairly simple for a change ...

Efficient local binary pattern multi-class classifier

So given that the feature tests with the classifier i've been working with are independent ... it offers a fairly interesting possibility: that of using sub-regions of a single trained classifier to detect corresponding sub-regions of the image with a higher resolving power.

This isn't a very good picture, but it demonstrates what's going on (it's from the same source image as the previous posts).

Here I am using a single trained classifier for faces, but from that classifier detecting 4 classes of subregions - the full face, each eye separately (9x9 corners), and the eye pair (17x9 top region). In the picture I am showing green for the face, red for left eye, and blue for right eye (as the RGB intensities as well as the detection boxes). This is the raw detections at a single scale with no grouping, and I haven't determined if the threshold can be tweaked to improve the false positives. The RGB image shows the RGB components overlayed with respect to the top-left of the detection box, not the centre of each feature (so a white dot means everything aligns). I'm drawing a 21x21 green box just to make it easier to see.

It's not perfect, but it is detecting localised eye positions within the faces.

So even using the same trained classifier it allows a finer resolution of the eye locations, and a further test to remove false positives. The fact there are a lot of false positives for the eyes is not very important as only those in the green boxes (and in the correct quadrant) are of interest. The main downside is such a small feature test means low accuracy for larger images (as they are downscaled more to start with).

The kicker: in opencl this executes at the same speed as the single classifier as most of the work is shared. A fixed set of additional separately trained classifiers can also be added very cheaply as the classifier runs in LDS to reduce the bandwidth needs.

Update: So i've been trying to utilise this, and although it kind of works it doesn't kind of work well enough to be that useful, at least with the classifier I have.

However, on the base LBP detector, with a bit of tweaking to the training data I have been able to get fairly decent performance. For some sequences it is clearly outperforming the haar cascades from OpenCV, whilst retaining the braind-dead fast training algorithm. The fact that it gives a score allows for more post-processing options as well.

Actually anything I try to do to 'improve it', e.g. by weighting some regions, just breaks it's resolving power. I seem to have much more luck by adjusting the training images. For example I revisited the eye detector based on the data used for the haar data. I noticed there was a bit too much variation in scale and variety of the eye images being used. So I tried instead creating a subset of images which were closer to a hand-picked candidate - this made a much better eye detector, even for eye shapes which weren't included in the training set. Although left eyes (right from the front) are always a problem - they have a less consistent and distinctive 'eye shape' than right eyes which gives classifiers trouble.

I find this machine learning stuff pretty frustrating as it just seems to involve a bit too much magic, and my mathematics skills just aren't up to the task of delving further.

More on the LBP detector

I've been a bit busy with work doing another prototype but I had a small play with my object detector last night. I wanted to see how it would work with online training for parts of a scene.

In a word: excellent.

Training on as little as a single instance of an image even works. It yields a detector with an obviously very specific discriminative selection power (scale/orientation), but it still works for that case.

With a modest number (16, 256, or 1024) of randomly synthesized training images (scale/rotation), the discriminative power still remains to pick out the object of interest, but it works over a larger range of scales and orientations. When I tried translation the classifier started generating much less distinct results but I think that might have been a bug in my synthesising code.

The only problem is that the scale of the detector output changes depending on the range of orientations (not so much the number of training images). Although even with a poorly specified classifier (very many high values in the output), the global peak value is still a reliable indicator of a single instance of the object (which is the most important thing, so long as it is present).

I have only been trying it with still images so far, the next thing to try will be with video. And with other objects than a face (faces are particularly 'interesting' in the LBP space, so it might not work on general objects).

Simple LBP Object Detector

After mucking about getting nowhere with a simple local binary pattern (LBP) object detector algorithm I finally had a bit of a breakthrough. Rather than getting dozens of false positives, i'm finally getting a concise enough answer to be useful for further processing.

Here is an example based on a photo I found on the net with a few faces in it (it's from here - i only found this using an image search, that page is just where i found it but I haven't otherwise read it). The yellow box is their result, the white boxes are mine (i'm not quite centring/scaling it properly yet).

The only tuning is a threshold and the grouping limit.

I'm actually quite surprised some of those faces are even found because this is one of the first tests i've done on more than a couple of images - and it has trouble with Lenna for example (but i suspect that is some aliasing issues with downsampling the larger countenance). I originally started with the eye set (from the OpenCV eye cascade), but was having trouble with false positives - e.g. eyebrows, mouth, dark spots. It seems to work much better with the face data. But most eye detectors on their own seem a bit noisy anyway so perhaps I was expecting too much.

I gave up on the cascade idea and this simply tests every position using the LBP u2 8,1 code (only 59 values) against a binary lookup table, and then each position votes on the outcome - once I get enough positive votes, it's considered a hit. This is similar to the LBP feature test in the OpenCV LBP detector, except that one uses the full 8 bits of LBP code, and of course the code is calculated on regions, not pixels.

I am only using the face and non-face images from the CBCL face dataset available here, which isn't a particularly good quality set of images. The only pre-processing i'm doing is mirroring the faces to double the training set. Training is very fast - after the images are loaded and converted to LBP, it's only taking 0.038s on my machine to `train' the 4858 positive and 4548 negative images (very plain single-threaded Java).

On the CPU the lookup isn't particularly fast (0.4s for the test image above) but I will look at porting it to OpenCL - it should be a very good fit for a GPU. If weighting isn't required, the feature description itself can be made very compact as it only requires 2 integers (64 bits) for each x,y location in the pattern - i.e. under 2K5b for a 17x17 test pattern (e.g. 19x19 training set, as the the LBP requires a 1 pixel border) which can easily fit in the constant cache.

There are still some tuning issues such as that a given threshold doesn't work equally well on all images, but it is still a promising result and there are still plenty of ideas to try.

Update (ok not really an update i hadn't published this yet ...) ... I coded something up in OpenCL and the performance is really very good - kernel time for the scaling (using a mip-map like thing), lbp building, and running the detector is around 2ms (same scales at the cpu example above). But this time doesn't include peak detection, thresholding and grouping. Still, this is pretty favourable compared to the VJ cascade as that does far fewer probes in it's 10ms runtime (and takes a week to train - if you can get that to work). Here i'm doing over 200 000 17x17 probes through 5 scales ...

I also played with a more statistically valid accumulation mechanism (as each test is independent): multiplication rather than addition (statistics isn't my strength by any stretch ... sigh). This leads to much more specific peaks as can be seen by the following picture, although i'm not sure if it leads to a more consistent threshold value (I think it does, and if that's true, I probably don't even need to do peak detection ...).

Both images are normalised.

Update 2: Had a bit more of a play this morning, tried a couple of different kernel topologies and using LDS to reduce memory bandwidth requirements. Got the kernel time of my test case down to 1.3ms, vs the 2.1ms yesterday (on a Radeon HD 7970). I also found the new combination metric is working well - I can use a specific value as the threshold and remove the peak detection stage entirely. It doesn't work too well with the eye data (way too many false positives), but it's pretty good with faces, so far.

About Me

Tags