Cache only works when there's enough
Update: Seems I was (mostly) wrong - see the follow-up post.
I've been playing with object classifiers in OpenCL - I have one that works, not terribly well but relatively quickly. It's a random tree classifier and just uses pixel intensity comparisons for feature tests. Although I could get some results out of it, they just weren't reliable enough.
So I decided to use 2-bit-binary patterns instead, a Haar-like feature measure which uses an integral image to accelerate the intensity comparisons (maybe more on the integral image creation another time, that was a bit of a journey in OpenCL as well). Unfortunately the simple modifications required to change the feature detector suddenly blew out the computation time - from under 1 ms to over 20 ms, making it far too slow. This is even though it only requires twice the memory accesses for the same number of tests.
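For readers unfamiliar with integral images, here is a minimal sketch in plain C (not the OpenCL kernel from this post) of how a summed-area table turns any rectangle sum into four reads, and how a Haar-like test might compare two such sums. The names `rect_t`, `integral_image`, `rect_sum` and `feature_test` are hypothetical, and the two-rectangle comparison is only an assumed stand-in for the actual 2-bit-binary-pattern test.

```c
#include <stdint.h>

/* Hypothetical rectangle type: covers [x0,x1) x [y0,y1). */
typedef struct { int x0, y0, x1, y1; } rect_t;

/* Build a summed-area table with an extra leading row and column of zeros.
   ii must hold (w+1)*(h+1) entries; ii[(y+1)*(w+1) + (x+1)] is the sum of
   all source pixels in [0..x] x [0..y]. */
void integral_image(const uint8_t *src, int w, int h, uint32_t *ii)
{
    int stride = w + 1;
    for (int x = 0; x <= w; x++)
        ii[x] = 0;
    for (int y = 0; y < h; y++) {
        uint32_t row = 0;                 /* running sum of the current row */
        ii[(y + 1) * stride] = 0;
        for (int x = 0; x < w; x++) {
            row += src[y * w + x];
            ii[(y + 1) * stride + (x + 1)] = ii[y * stride + (x + 1)] + row;
        }
    }
}

/* Sum of the pixels inside r, using exactly four table reads. */
uint32_t rect_sum(const uint32_t *ii, int w, rect_t r)
{
    int stride = w + 1;
    return ii[r.y1 * stride + r.x1] - ii[r.y0 * stride + r.x1]
         - ii[r.y1 * stride + r.x0] + ii[r.y0 * stride + r.x0];
}

/* Assumed Haar-like test: is region a brighter in total than region b? */
int feature_test(const uint32_t *ii, int w, rect_t a, rect_t b)
{
    return rect_sum(ii, w, a) > rect_sum(ii, w, b);
}
```

Each test reads from several scattered table locations rather than a couple of nearby pixels, so the working set a cache has to cover grows with the size and spread of the rectangles as well as with the number of reads.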
After much experimentation I found the cause - the texture cache was being exhausted, dramatically reducing the apparent memory throughput. I discovered that a single-tree, 3-feature tester is about the limit of the texture cache. That will execute in 150 µs. 4 features take 250 µs, and if I change the 4-feature system to 2 trees, or 8 features - just twice the amount of work - it blows out to 1500 µs, six times as long.
Well, at least that gives me options for splitting the work into multiple passes, and maybe I'll end up with something fast enough overall. Otherwise I might have to find something else. As it is, I'm losing confidence that it is going to be good enough anyway (maybe after running it so many times I've forgotten how poor the previous one was).