Images vs Arrays 4

Update 7/10/11: I uploaded the array convolution generator to socles.

And so it goes ...

I've got a fairly convoluted convolution algorithm for performing a complex wavelet transform and I was looking to re-do it. Part of that re-doing is to move to using arrays rather than image types.

I got a bit side-tracked whilst revisiting convolutions again ... I started with the generator from socles for separable convolution and modified it to work with arrays too. Then I tried a couple of ideas and timed a whole bunch of runs.

One idea I wanted to try was using a rolling buffer to reduce the memory load for the Y convolution. I also wanted to see if using more work-items in a local work-group to simplify the local memory load would help or hinder. Otherwise it was pretty much just getting an array implementation working. As is often the case, I haven't fully tested that these actually work, but I'm reasonably confident they do, as I fixed a few bugs along the way.

The candidates

convolvex_a: This is a simple implementation which uses local memory and a work-group size of 64x4. 128x4 words of data are loaded into the local memory, and then 64x4 results are generated in parallel purely from the local memory.
convolvey_a: This uses no local memory, and just steps through the addresses vertically, producing 64x4 results concurrently. As all memory loads are coalesced it runs quite well.
convolvex_b: This version tries to use extra work-items just to load the memory, afterwards only using 64x4 threads. In some testing I did on small jobs this seemed to be a win, but for larger jobs it is a big hit to concurrency.
convolvey_b: This version uses a 64x4 'rolling buffer' to cache image values for all items in the work-group. For each row of the convolution, the data is loaded once rather than 4x.
imagex, imagey: These are from the socles implementation in ConvolveXYGenerator which uses local memory to cache input data.
simplex, simpley: These are from the socles implementation in ConvolveXYGenerator which relies on the texture cache only.
convolvex_a (limit): A version of convolvex_a which attempts to load only the amount of memory it needs, rather than doing a full work-group width each time.
convolvex_a (vec): A version of convolvex_a which uses simple vector types for the local cache, rather than flattening all access to 32-bit words to avoid bank conflicts. It is particularly poor with 4-channel input.
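
The two-phase shape described for convolvex_a (fill a 128-word local tile, synchronise, then compute 64 results from the tile) can be modelled on the CPU for a single row. This is only a sketch: the function name, the even 32/32 split of the spare 64 words into left and right halo, and the single-channel data are assumptions for illustration, not the socles kernel.

```python
GROUP_W = 64                      # work-items per work-group row (from the post)
TILE_W = 2 * GROUP_W              # 128 words cached per row (from the post)
HALO = (TILE_W - GROUP_W) // 2    # assumption: 32 spare words on each side


def clamp(i, lo, hi):
    return max(lo, min(i, hi))


def convolve_row_tiled(row, taps):
    """Model one row of a tiled X convolution with edge-clamped reads."""
    radius = len(taps) // 2
    if len(taps) % 2 != 1 or radius > HALO:
        raise ValueError("taps must be odd-length and fit inside the halo")
    width = len(row)
    out = [0.0] * width
    for gx in range(0, width, GROUP_W):          # one work-group per 64 outputs
        # Phase 1: each of the 64 work-items stores two words into the tile.
        local = [0.0] * TILE_W
        for lid in range(GROUP_W):
            for k in (lid, lid + GROUP_W):
                local[k] = row[clamp(gx - HALO + k, 0, width - 1)]
        # (a local-memory barrier would sit here in a real kernel)
        # Phase 2: every result is computed purely from the tile.
        for lid in range(GROUP_W):
            x = gx + lid
            if x < width:
                acc = 0.0
                for t, c in enumerate(taps):
                    acc += c * local[lid + HALO - radius + t]
                out[x] = acc
    return out
```

With this layout the tile covers kernels of up to 65 taps; anything wider would need a bigger tile or a second load pass.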

The array code implements CLAMP_TO_EDGE for source reads. The image code uses a 16x16 work-group size, the array code 64x4. The image data is FLOAT format, and 1, 2, or 4 channels wide. The array data is float, float2, or float4. Images and arrays represent a 512x512 image. The GPU is an Nvidia GTX 480.
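
For arrays, CLAMP_TO_EDGE has to be done by hand by clamping the source index, since there is no sampler to do it. A minimal CPU reference for a separable convolution with clamped reads might look like the following. The function names are made up for this sketch, and the taps are applied without flipping, which makes no difference for symmetric kernels.

```python
def clamp(i, lo, hi):
    """Clamp index i into [lo, hi]: the array equivalent of CLAMP_TO_EDGE."""
    return max(lo, min(i, hi))


def convolve_x(img, taps):
    """Horizontal pass over a list-of-rows image, edge-clamped."""
    height, width = len(img), len(img[0])
    radius = len(taps) // 2
    return [[sum(c * img[y][clamp(x + t - radius, 0, width - 1)]
                 for t, c in enumerate(taps))
             for x in range(width)]
            for y in range(height)]


def convolve_y(img, taps):
    """Vertical pass over a list-of-rows image, edge-clamped."""
    height, width = len(img), len(img[0])
    radius = len(taps) // 2
    return [[sum(c * img[clamp(y + t - radius, 0, height - 1)][x]
                 for t, c in enumerate(taps))
             for x in range(width)]
            for y in range(height)]


def convolve_xy(img, taps_x, taps_y):
    """Separable 2D convolution: X pass followed by Y pass."""
    return convolve_y(convolve_x(img, taps_x), taps_y)
```

A reference like this is handy for checking kernel output: a flat image must stay flat (scaled by the sum of the taps) right up to the borders, which is a quick test that the clamping is correct.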

Results

All timings are in microseconds, as taken from computeprof. Most were invoked for 1, 2, or 4 channels and a batch size of 1 or 4. Image batches are implemented by multiple invocations.

                        batch=1                 batch=4
channels                1       2       4       1       2       4

convolvex_a             42      58      103     151     219     398
convolvey_a             59      70      110     227     270     429

convolvex_b             48      70      121     182     271     475
convolvey_b             85      118     188     327     460     738

imagex                  61      77      110     239     303     433
imagey                  60      75      102     240     301     407

simplex                 87      88      169
simpley                 87      87      169

convolvex_a (limit)     44      60      95      160     220     366
convolvex_a (vec)               58      141
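
One way to read the table is to convert each timing into values processed per microsecond, where a value is one channel of one pixel of the 512x512 image. The small script below does that for a few rows copied from the table. The helper names are invented for this sketch, and the batch comparison only covers the array kernels, since image batches are just repeated invocations.

```python
PIXELS = 512 * 512  # image size stated in the post

# Timings in microseconds, keyed by channel count, copied from the table.
BATCH1_US = {
    "convolvex_a": {1: 42, 2: 58, 4: 103},
    "convolvey_a": {1: 59, 2: 70, 4: 110},
    "imagex": {1: 61, 2: 77, 4: 110},
    "imagey": {1: 60, 2: 75, 4: 102},
}
BATCH4_US = {
    "convolvex_a": {1: 151, 2: 219, 4: 398},
    "convolvey_a": {1: 227, 2: 270, 4: 429},
}


def values_per_us(time_us, channels, batch=1):
    """Channel values processed per microsecond for one timing entry."""
    return PIXELS * channels * batch / time_us


def batch_gain(name, channels):
    """Four separate batch=1 runs versus one batch=4 run (>1 means batching helps)."""
    return 4 * BATCH1_US[name][channels] / BATCH4_US[name][channels]


if __name__ == "__main__":
    for name, row in BATCH1_US.items():
        rates = ", ".join(f"{ch}ch: {values_per_us(t, ch):.0f}" for ch, t in row.items())
        print(f"{name:12s} {rates}")
    for name in BATCH4_US:
        gains = ", ".join(f"{ch}ch: {batch_gain(name, ch):.2f}" for ch in (1, 2, 4))
        print(f"{name:12s} batch gain {gains}")
```

On these figures convolvex_a moves roughly 6,200 values per microsecond with one channel and roughly 10,200 with four, and batching four single-channel arrays is about 11% faster than four separate runs.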

Thoughts

The rolling cache for the y convolution is a big loss. The address arithmetic and need for synchronisation seem to kill performance. So much for that idea. I guess there just isn't enough work to do each loop to make it worth it (it only requires a single mad per thread).
Using more threads for loading, then dropping back when doing arithmetic, is also a loss for larger problems since it limits how many work-groups can execute concurrently on an SM.
Trying to reduce the memory accesses to only those required slows things down until you hit 4-element vectors. I guess for float and float2 the cached reads are effectively free, whereas the divergent branch is not.
Even with the texture cache, images benefit significantly from using a local cache.
Even with the local cache, images trail the array implementation - until one processes 4-element vectors, in which case they are even stevens for single images.
Arrays can also be batched - processing 'n' separate images concurrently. This adds a slight extra benefit as it can more fully utilise the SM cores, and reduces the need for extra host interaction. For smaller problems this could be important although this problem size is already giving the GPU a good sized workout so the differences are minimal.
Using single-channel data is under-utilising the GPU by quite a bit.
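
To make the rolling-buffer trade-off concrete, here is a rough CPU model of one plausible reading of convolvey_b: each work-group produces 4 output rows, keeps the 4 most recently loaded source rows in a ring, and on every tap loads one new row, synchronises, and does a single multiply-add per output. The structure and names are assumptions for illustration; the actual kernel is not shown in the post.

```python
ROWS = 4  # output rows per work-group (the post's 64x4 group)


def clamp(i, lo, hi):
    return max(lo, min(i, hi))


def convolve_y_rolling(img, taps, rows=ROWS):
    """Return (output, row_loads) for a Y convolution using a ring of cached rows."""
    height, width = len(img), len(img[0])
    radius = len(taps) // 2
    out = [[0.0] * width for _ in range(height)]
    row_loads = 0

    def load(ring, src):
        # Copy one edge-clamped source row into its ring slot (the local store).
        ring[src % rows] = list(img[clamp(src, 0, height - 1)])

    for y0 in range(0, height, rows):            # one work-group per 4 output rows
        ring = [None] * rows
        # Prime the ring with the first rows-1 source rows.
        for src in range(y0 - radius, y0 - radius + rows - 1):
            load(ring, src)
            row_loads += 1
        for t, c in enumerate(taps):
            # Each step overwrites the oldest slot with one new row ...
            load(ring, y0 - radius + t + rows - 1)
            row_loads += 1
            # ... then needs a barrier before the reads (and another before
            # the next overwrite), all for one multiply-add per output.
            for j in range(min(rows, height - y0)):
                cached = ring[(y0 + j + t - radius) % rows]
                for x in range(width):
                    out[y0 + j][x] += c * cached[x]
    return out, row_loads
```

For a 5-tap kernel this model loads 8 rows per work-group instead of 20, but pays for it with modulo addressing and synchronisation on every tap, which fits the observation that the saved loads do not cover the extra overhead.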

When I get time and work out how I want to do it, I'll drop the array code into socles.
