Its beaten me. For now.
I should've stayed outside in the sun today gardening - but curiosity got the better of me. I hope the (absolutely stunning) weather continues tomorrow, otherwise i've blown it on nothing ...
I tried working on the AMD performance of the Viola & Jones detector in socles: I tried a whole bunch of stuff, from copying the image tiles pre-scaled (as summed area table) to local memory, to completely re-arranging the data structures so they are workgroup aligned, to even trying the cpu single-thread-per-location version.
I got some minor improvement, the most being the copying the tile to local store and removing some of the calculations (since it doesn't need to scale the rects): but that only took a simple test case from about 25ms to 20ms. Barely really noticeable in my webcam test harness.
I think the problem is with the fact it has to read so much data for each single test. It requires 3-4 uint4's just to describe the test, and 8-12 uint texture lookups for the summed area table lookups. The cascade I have has ~6 400 regions to test grouped in ~3 000 features, and although most aren't tested it's just a lot of data. It's too much for constant memory for example.
With a fix to use the atomic counters AMD hardware provides at least it's now in the same order of magnitude as the nvidia hardware, but still 2-4x slower.
Maybe ... if the stages were broken up into smaller parts it could work more efficiently, but it does seem a pretty long shot to me as the problem remains with the sheer amount of stuff that needs to be loaded for each test.
Time probably better spent on something else.