Simple padding trick
For a few algorithms I have, I end up with data which is 32 elements wide operated on by 16x16 threads. This ends up with 100% local memory contention if mapped directly to the local memory since every 16th thread aliases to the same bank.
Although this can be addressed with a 16 word padding this is wasteful of precious local memory which might mean the code can't run in parallel to the extent it might otherwise, or simply cannot fit.
A simple trick which still keeps the addressing quite simple is to shift every odd line to the second half of the data which is then offset by 16 words. In effect bit 0 of the y address is shifted to the top of the addressing index, with an offset.
For example, if a kernel is working on a 16x16 region of memory but requires some data either side of the target tile, it might do something like:
local float ldata[32*16]; int lx = get_local_id(0); int ly = get_local_id(1); int i = lx + ly * 32; // load data ldata[i] = ..read data block 8 to the left ...; ldata[i+16] = ..read data 8 to the right...; barrier(CLK_LOCAL_MEM_FENCE); // work using i+8 as the centre pixel for this thread
By only changing the calculation of i
and padding the storage with only 16 words, the contention is easily removed without changing any other code:
local float ldata[32*16+16]; ... int i = lx + ( ly >> 1 ) * 32 + (32*8+16)*(y & 1); ...
Assuming one is only working in the X direction, for Y the addressing is slightly more complex of course. But this could come at no extra run-time cost once the loops are unwound.