Shared Bank Conflicts
So yesterday I was working on some code for a wavelet transform - essentially a bunch of convolutions performed in a specific way.
One routine I have does 4 interleaved convolutions simultaneously, and had a large number of shared bank conflicts - 200% or so. So I spent quite a while trying to remove them. I got rid of most of them - but for some weird reason I still get about 14% and I ran out of day (and patience) to work out what was going on there.
I find this kind of problem quite tricky - trying to juggle dozens of sets of numbers in your head and still come up with something that works. I've developed a few rules of thumb but I still haven't done it often enough to make it easy.
But for all that effort I got a very minor performance improvement: barely 2%. From 70uS, to 67uS kernel time. Hardly seemed worth it.