FFT convolution
So i looked into FFT convolution a bit more and nutted out a couple of useful things.
Two for the price of one
The real and imaginary parts of a complex DFT are basically independent if one performs linear operations on them in the complex space.
i.e. if you take two separate (real) images A, and B, and interleave Ai and Bi into a complex image C with Ci = Ai + Bi j, then you can do operations like a convolution in the fourier domain, and after the inverse, reversing the combination trick gives you the two separate images processed with the same operation. Nice.
I'd read this before but the explanations always got hairy - good news is it just works if you don't need to know anything about the signal in the fourier domain, and are just interested in processing each element independently using linear functions.
Cache friendlish 2D operations
Typically when using an FFT operator for 2D signals one does a couple of operations:
- forward transform of one or more signals.
- process each element by element.
- inverse transform.
But internally a 2D FFT is implemented as a two separate passes, on the rows, then on the columns (or visa-versa), and typically might be implemented with two passes:
- forward/inverse transform rows
- transpose result
- forward/inverse transform rows
- transpose result
ffts only has a single dimensional complex FFT available, so I had to implement the 2D myself. But this provides further opportunities - since for this application I don't particularly care where the various coefficients are, I can just treat each as a separate calculation.
It lets me avoid 2x transposes and also improve the cache coherence for the filter step.
- forward transform rows
- transpose result
- for each row
- forward transform row
- apply convolution/filter on row
- inverse transform row
- transpose result
- inverse transform rows
On my test example this version ran in 73% of the time compared to a fully separate 2D convolution.
NEON cmult
I also wrote a NEON complex array multiply. With LD2 this turns out to be quite simple code although I also interleaved a loop to avoid some stalls. 35 cycles to do 8 complex multiplies.
This ran at 4x the gcc performance of a simple C implementation.
Example
Filtered with a simple low-pass pedestal filter. This takes under 200ms on a beagleboard-xm, on which the break-even point for a 2D time domain convolution is around 15x15 or so (using custom NEON code). Obviously I still have some transposition issues - this is one of the things that always gives me the willies with using FFT for signal processing. (Actually I think it's a bug in ffts, it doesn't seem to like multiple plans being created at the same time, this is the same result as if the inverse fft plan was the same as the forward one).
Update: Just a bug - ffts doesn't implement inverse properly on NEON yet, so i'm just getting 2x forwards which mirrors both axes.
Update 2: ffts is now simple to build and inverse and a few other things have been fixed as well. By doing the two at once trick above, and using some NEON for type conversion and clamping, I have the beagleboard-xm (@600Mhz) doing a full byte image to byte image round trip for a Wiener deconvolution using a non-separable point spread function in about 80ms per 512x512 image.