QWT
Since I haven't done any work in OpenCL for a while, I thought on my hacking-day today I'd poke around socles for something different.
I managed to mostly get the quaternion wavelet transform code I started pre-xmas working, at least for the forward transform. It is based on the code I have in ImageZ, which I've mentioned before. This can then be used to form a dual-tree complex-wavelet-transform 'on the fly' as I do in ImageZ, or by copying to another structure (these both have some fairly interesting properties; so far I've only done work with denoise/sharpen, but there are also other applications such as registration).
It was actually a bit of a ramp-up to remember where the code was at, and then to decipher what it was supposed to do. For what is a fairly simple algorithm (dual-convolution - the difficulty in wavelets is the filter design) there is a lot of very fiddly addressing and mucking about. Implementing upfirdn, which is essentially what it uses, is a pain even with such simple ratios as 1:2 and 2:1.
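For anyone who hasn't met upfirdn before, here is a rough sketch of the idea in plain Python: zero-stuff the input by the 'up' factor, run an FIR filter over it, then keep every 'down'-th sample. This is not the socles or ImageZ code, just a naive reference. It assumes zero-padding at the edges (a wavelet transform would more likely wrap or mirror), and the function name and argument order are borrowed from the usual signal-processing convention rather than from anything in my code.

```python
def upfirdn(h, x, up=1, down=1):
    """Naive upsample -> FIR filter -> downsample (zero-padded edges)."""
    if not x:
        return []
    # Upsample: insert (up - 1) zeros between consecutive input samples.
    stuffed = [0.0] * ((len(x) - 1) * up + 1)
    stuffed[::up] = x
    # Full convolution of the zero-stuffed signal with the filter taps.
    y = [0.0] * (len(stuffed) + len(h) - 1)
    for i, xv in enumerate(stuffed):
        for j, hv in enumerate(h):
            y[i + j] += xv * hv
    # Downsample: keep every 'down'-th output sample.
    return y[::down]


# 2:1, as in a forward (analysis) step: filter, then drop every other sample.
analysis = upfirdn([0.5, 0.5], [1, 2, 3, 4], up=1, down=2)

# 1:2, as in an inverse (synthesis) step: zero-stuff, then filter.
synthesis = upfirdn([1, 1], [1, 2, 3], up=2, down=1)
```

The fiddly part in a real implementation is that you never want to actually build the zero-stuffed array or compute samples you are about to throw away, so the indexing gets folded into the convolution loop, and that is where all the mucking about with addressing comes from.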
Although I seem to be getting an ok result, sometimes when I execute it on my GPU I end up having to hit the reset button ... so it's stuck on the CPU for now. Probably just some bounds checking errors. I'm not really on the ball today, pretty tired and grumpy, and I spent the better part of the day working on it so far; bit shit considering I already had a working Java implementation and most of the OpenCL scaffolding done. I guess not every day can be a super-productive one.
As above, I've only worked on the forward transform so far. Most of the tricky stuff is in the DWTGenerator, which started as a copy of the convolution generator class, although QWT.forward() is a lot hairier than it looks too. I also managed to improve on the ImageZ code a tad: the redundant copies and vertical transforms aren't needed as I in-lined the sub-routines and have simpler data management.
Update: Well I just kept poking at it, and eventually ended up getting the inverse transform working: sometimes it just takes perseverance. Unfortunately the forward transform still crashes my GPU quite regularly and I haven't traced that down yet.
Between the resets I did manage to get a couple of runs through sprofile: it's about 350µs per transform for a 512x512 depth=1 transform, on a Radeon HD 6950. There are some LDS bank conflicts in the X convolutions, although the main bottleneck is the Y convolution since it cannot benefit from LDS and relies on coalesced reads and the global cache. I'm reasonably happy with that, and I'm not sure I can get much more out of it.
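To put that timing in more familiar units, here is the back-of-the-envelope arithmetic as a few lines of Python. It only counts input pixels per second from the 350 microsecond figure above; it says nothing about memory bandwidth or how many passes the transform makes, and the variable names are just for illustration.

```python
width = height = 512          # image size from the profiling run
seconds_per_transform = 350e-6  # roughly 350 microseconds per transform

pixels = width * height
mpixels_per_second = pixels / seconds_per_transform / 1e6
transforms_per_second = 1.0 / seconds_per_transform

print(f"{pixels} pixels per transform")
print(f"about {mpixels_per_second:.0f} Mpixel/s")
print(f"about {transforms_per_second:.0f} transforms/s")
```

So that works out to somewhere around 750 million input pixels a second, or a bit under 3000 transforms a second at this size.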