Special-Case Code and Multi-Pass Algorithms
Ok, so without going into too much detail I have a function which needs to resample 3 float2 planes of data to another resolution, and then perform very simple arithmetic on it (a few mult, add). The scale factors are powers of two up and down. One complication is that the numbers have to be pre-sampled first at pixel corners before being interpolated.
I implemented it initially using bilinear interpolation for simplicity, and yesterday looked at implementing bicubic filtering.
It wasn't really that bad - the given routine was about 1.5x the original speed which is ok, and overall this was only a 3% impact.
But I thought I would try a few ideas to speed it up ...
A) I separated the routine into separate implementations, one for each scale. I still used the same sampling routine, but just passed it a fixed-value for the scale. In previous micro-benchmarks on the bilinear code I noticed this lead to a pretty decent improvement.
But in this case it didn't. It slowed down some scales by a factor of 1-2, and moreover, made other routines in the same source file execute slower(!). I can only assume the growth in code-size was a significant factor here. I also noticed the register usage hit 63 again - which probably means all i've done is hit a bug in the compiler again (I should really upgrade the driver: we're moving to AMD hardware RSN anyway).
B) Using two passes. A separate scale pass followed by a calculation pass. Intuitively this should be somewhat slower: the calculation after the scaling is simple and can be done in registers.
But of course it turned out faster. Not a huge amount, about 20% for the routine in question.
I did have to do some work to make it happen though: using local memory and 2d workgroup sizes, and separate code for the scaling down functions (e.g. it just sums 2x2 block to go down by 2). In this case using separate functions for each size worked quite well (more evidence of compiler bugs). I was also able to batch the 3 planes separately to get added parallelism - the problem size is quite small so this should hep.
... and after writing (C) below I re-arranged the upscaler to use hard-coded sizes as well, and re-did the bicubic interpolator to accept integer and offset values separately: the compiler can remove some of the calculations here since i'm always using the same pixel offsets.
... and i also experimented with changing the output type to float8 rather than float2 and writing 4 pixels at once for the 4x upscale. This was 2x faster again for this routine (and uses fewer registers?), although I can't trust this number as the results are now broken (and i really have had about enough of it and don't want to debug it).
C) Doing more at once. e.g. doing 1/2, 1, and 2x at the same time. Actually because the 2x scale uses hard-coded interpolation numbers the bicubic interpolation can be simplified greatly (that just gave me an idea to improve B) above).
I didn't get this incorporated because it required a bit of re-arrangement of the host code, but this could shave off a bit more. I usually need a few scales of the same data in each pass so this would be useful.
Conclusions
Although all these could also be applied to the bilinear code, I now (with the changes in B above) have bicubic interpolation for this routine running much the same speed as the original bilinear did.
But it shows that you sometimes don't want to do too much in a given routine - compiler bugs, register spillage, or just more registers end up being used, which adversely affect parallelism and performance. Although a trip to memory is quite costly, these other factors can greatly outweigh it.
After all this, and a few more changes in this particular routine i'm working on, I only managed about a 9% improvement. TBH i'm not sure it's really worth it ... and I probably only went so far as I had a bit of time between getting this to a working state and heading back to reading papers.