radix-4 stuff
Well I did some more mucking about with the FFT code. Here are some quick results for the radix-4 code.
First, the runtime per transform, in microseconds. I ran approximately one second's worth of a single transform in a tight loop and took the last of 3 runs. All algorithms were executed sequentially in the same run.
               16     64    256    1024     4096    16384      65536      262144
jtransforms 0.156  1.146  6.058  27.832  135.844  511.098  3 037.140  14 802.328
dif4        0.160  0.980  5.632  27.503  138.077  681.006  3 759.005  20 044.719
dif4b       0.136  0.797  4.713  22.994  120.915  615.623  3 175.115  17 875.563
dif4b_col   0.143  0.797  4.454  21.835  117.659  593.314  3 020.144  22 341.453
dif4c       0.087  0.675  4.255  21.720  115.550  576.798  2 775.360  15 248.578
dif4bc      0.083  0.616  3.760  19.596  108.028  547.334  2 810.118  16 308.047
dif4bc_col  0.137  0.622  3.699  19.629  107.954  550.483  2 820.234  16 323.797
And the same information presented as a percentage of jtransforms' execution time with my best implementation highlighted.
                16      64     256    1024    4096   16384   65536  262144
jtransforms  100.0   100.0   100.0   100.0   100.0   100.0   100.0   100.0
dif4         102.4    85.5    93.0    98.8   101.6   133.2   123.8   135.4
dif4b         86.7    69.6    77.8    82.6    89.0   120.5   104.5   120.8
dif4b_col     91.3    69.6    73.5    78.5    86.6   116.1    99.4   150.9
dif4c         55.9    58.9    70.2    78.0    85.1   112.9    91.4   103.0
dif4bc        53.3    53.8    62.1    70.4    79.5   107.1    92.5   110.2
dif4bc_col    87.7    54.3    61.1    70.5    79.5   107.7    92.9   110.3
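For reference, the timing loop was roughly of this shape. This is only a sketch of the approach described above, not the actual harness; the Transform interface, the use of float data and the exact constants are my assumptions for illustration.

// Rough sketch of the timing approach: run one transform size in a tight loop
// for about one second, repeat three times, and keep the last result.
// The Transform interface and data type (float) are illustrative assumptions.
public class Bench {
    interface Transform {
        void forward(float[] data);   // in-place complex transform, interleaved re/im
    }

    static double usecPerTransform(Transform t, float[] data) {
        double usec = 0;
        for (int run = 0; run < 3; run++) {
            long start = System.nanoTime();
            long elapsed = 0;
            long count = 0;
            do {
                t.forward(data);
                count++;
                elapsed = System.nanoTime() - start;
            } while (elapsed < 1_000_000_000L);   // ~1 s worth of transforms
            usec = (elapsed / 1000.0) / count;    // keep the last of the 3 runs
        }
        return usec;
    }
}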
Executed with the default java options.
$ java -version
java version "1.8.0_92"
Java(TM) SE Runtime Environment (build 1.8.0_92-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.92-b14, mixed mode)
The CPU is a Kaveri clocked at its minimum (1.7 GHz) with only a single DIMM, so it is quite slow as you can see.
A summary of the algorithms follows.
- jtransforms: jtransforms 2.2, complexForward() routine.
- dif4: A radix-4 implemented as two nested radix-2 operations. 42 flops. (See the butterfly sketch below.)
- dif4b: A radix-4 implemented directly, with the common subexpressions explicitly encoded. 36 flops.
- dif4b_col: The first (roughly) half of the passes are the same as dif4b, but the second half swaps the loop order in order to exploit locality of reference, a longer inner loop, and re-use of twiddle factors.
- dif4c: The first N-2 passes attempt to re-use twiddle factors across pairs of results situated at (i, N-i). The N-1 pass is hand-coded for the 4 cases, which only require 3 constants for the twiddle factors.
- dif4bc: The first N-2 passes are the same as dif4b. The N-1 pass is hand-coded (== dif4c).
- dif4bc_col: A combination of dif4bc and dif4b_col.
The final (trivial) pass is hand-coded in all cases. dif4 requires N/2 complex twiddle factors and the rest require N/2+N/4 due to the factorisation used.
In all cases only a forward complex transform which leaves the results unordered (bit-reversed indexed) is implemented.
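For the curious, a single radix-4 decimation-in-frequency butterfly, written with the first radix-2 step factored out into the shared subexpressions (roughly the dif4b approach), looks something like the sketch below. The interleaved re/im layout, the names and the sign/ordering conventions are my assumptions here, not the actual code.

class Radix4Sketch {
    /**
     * One radix-4 DIF butterfly over the complex elements at float indices
     * i0..i3 of an interleaved (re,im) array.  t0..t3 are the shared radix-2
     * subexpressions; w1..w3 are the twiddle factors for this column.
     */
    static void butterfly(float[] d, int i0, int i1, int i2, int i3,
                          float w1r, float w1i, float w2r, float w2i,
                          float w3r, float w3i) {
        // first radix-2 step over the pairs (a0,a2) and (a1,a3)
        float t0r = d[i0] + d[i2], t0i = d[i0 + 1] + d[i2 + 1];
        float t2r = d[i0] - d[i2], t2i = d[i0 + 1] - d[i2 + 1];
        float t1r = d[i1] + d[i3], t1i = d[i1 + 1] + d[i3 + 1];
        float t3r = d[i1] - d[i3], t3i = d[i1 + 1] - d[i3 + 1];

        // second radix-2 step, with the -/+j rotation folded into the odd pair
        float b0r = t0r + t1r, b0i = t0i + t1i;
        float b2r = t0r - t1r, b2i = t0i - t1i;
        float b1r = t2r + t3i, b1i = t2i - t3r;   // (a0-a2) - j(a1-a3)
        float b3r = t2r - t3i, b3i = t2i + t3r;   // (a0-a2) + j(a1-a3)

        // apply the twiddle factors; W^0 is always 1 so b0 is stored directly
        d[i0] = b0r;                    d[i0 + 1] = b0i;
        d[i1] = b1r * w1r - b1i * w1i;  d[i1 + 1] = b1r * w1i + b1i * w1r;
        d[i2] = b2r * w2r - b2i * w2i;  d[i2 + 1] = b2r * w2i + b2i * w2r;
        d[i3] = b3r * w3r - b3i * w3i;  d[i3 + 1] = b3r * w3i + b3i * w3r;
    }
}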
Thoughts
- All the implementations do full scans at each pass and so start to slow down above 2^12 elements due to cache thrashing (together with the twiddle table). Depth-first ordering should help in most cases; there is a sketch of the idea after these notes.
- Despite dif4c requiring fewer twiddle lookups, the extra complexity of the code leads to register spills and slower execution. That it takes the lead at very large sizes is also consistent with this and with the point above.
- Twiddle factor lookups are costly in general but become relatively costlier at larger sizes. This needs to be addressed separately from the general cache problem.
- Whilst the re-ordering for the "col" variants was a cheap and very simple trick to gain some performance, it can't keep up with the hand-coded variants. Further tuning may help.
- For addressing calculations it is sometimes better to spell the calculation out in full each time and let the compiler optimise it, but sometimes it isn't.
- It is almost always a good idea to keep loops as simple as possible. I think this is one reason such a trivial / direct implementation of the FFT beats out a more sophisticated one.
- There's a pretty clear winner here!
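On the depth-first point: the idea would be to recurse into the quarter-size sub-transforms as soon as a pass over a span has completed, so each sub-transform is finished while it is still cache-resident, rather than sweeping the whole array on every pass. The outline below is only an illustration of the idea, nothing I have benchmarked; the helper names and the cut-over size are made up.

// Outline of depth-first ordering: do one radix-4 DIF pass over the whole span,
// then finish each quarter in turn so it stays in cache.  radix4Pass/smallFFT
// stand in for the existing kernels; n counts complex elements, off is a float
// index into the interleaved array, and 4096 is a guessed cut-over size.
abstract class DepthFirst {
    abstract void radix4Pass(float[] d, int off, int n);  // one breadth-first pass
    abstract void smallFFT(float[] d, int off, int n);    // finish a cache-sized block

    void fft(float[] d, int off, int n) {
        if (n <= 4096) {
            smallFFT(d, off, n);
            return;
        }
        radix4Pass(d, off, n);
        int quarter = n / 4;
        for (int q = 0; q < 4; q++)
            fft(d, off + 2 * quarter * q, quarter);   // 2* because of interleaved re/im
    }
}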
In most cases the calculations are expanded in full inside the inner loops, apart from the W^0_N case. Everywhere else the compiler generated slower execution if the code was modularised, sometimes significantly slower (10%+). I suspect that even the radix-4 kernel is starting to exhaust the available registers and scheduling slots, together with the constraints of the Java memory model. The code also includes numerous possibly questionable and sometimes slightly baffling micro-optimisations. I haven't validated the results yet apart from a test case which "looks wrong" when things go awry, but I have reasonable confidence it is functioning correctly.
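To illustrate the W^0_N point: the first butterfly column of each pass has twiddles of exactly 1+0i, so it can be peeled off and done with additions and subtractions only, and only the remaining columns pay for the lookups and multiplies. A companion to the butterfly sketch above, with the same assumed layout and naming (again, an illustration rather than the real code):

    // The W^0 column: all twiddles are 1+0i, so the complex multiplies
    // disappear and only additions/subtractions remain.
    static void butterflyW0(float[] d, int i0, int i1, int i2, int i3) {
        float t0r = d[i0] + d[i2], t0i = d[i0 + 1] + d[i2 + 1];
        float t2r = d[i0] - d[i2], t2i = d[i0 + 1] - d[i2 + 1];
        float t1r = d[i1] + d[i3], t1i = d[i1 + 1] + d[i3 + 1];
        float t3r = d[i1] - d[i3], t3i = d[i1 + 1] - d[i3 + 1];

        d[i0] = t0r + t1r;   d[i0 + 1] = t0i + t1i;
        d[i2] = t0r - t1r;   d[i2 + 1] = t0i - t1i;
        d[i1] = t2r + t3i;   d[i1 + 1] = t2i - t3r;   // -j rotation, no multiply
        d[i3] = t2r - t3i;   d[i3 + 1] = t2i + t3r;   // +j rotation, no multiply
    }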
After I found a decent reference I did start on a radix-8 kernel, but I realised why there is so little written about doing it in software: it just doesn't fit on the CPU in a high-level language. Even the radix-4 kernel seems to be a very tight fit for Java on this CPU.
Still playing ...