Transpose Is Your Friend
With graphics programming a lot of algorithms can be split into separate X and Y passes. This generally works particularly well in the X case but in Y you can hit issues with memory (or processor) locality which can have a big impact on the algorithm.
But the GPU texture cache is block oriented rather than line oriented so both X and Y oriented algorithms can be implemented equally (in)efficiently if you store data in images. However, once in local memory you're effectively back to line-oriented access ... (if you want to preserve your sanity whilst working out the memory addressing to efficiently access the banked memory).
The trick is just to transpose the data on read and write, and always work in the X direction locally. It also means the X and Y working code is often identical. This can be done just within the local work-group, but for 2D workgroups one has the added complication that work units are allocated in row-major order, i.e. in X first.
The simple solution is just to transpose the global X and Y work-size as well, and simply swap the result of get_global_id(0) and get_global_id(1) when reading or writing the images.