Friday, 13 December 2013, 09:32

FFT jiggery pokery

I've been meaning to poke at FFT calculation for a while and I finally got around to having a proper look.

First I just needed to calculate a single value, so I did it by hand. I got some strange profiling results from Java vs jtransforms, but it turned out that due to its complexity jtransforms takes many, many cycles before the JIT fully optimises it.
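
For reference, computing a single bin "by hand" just means evaluating the DFT definition X[k] = sum_n x[n]*e^(-2*pi*i*n*k/N) directly. A minimal sketch - the method name and split re/im layout are illustrative assumptions, not the code that was profiled:

 // Compute one DFT output bin directly from the definition.
 // Accumulate in double to limit rounding error; O(N) per bin.
 static float[] dftBin(float[] re, float[] im, int k) {
     int n = re.length;
     double sr = 0, si = 0;
     for (int j = 0; j < n; j++) {
         double a = -2.0 * Math.PI * j * k / n;
         double wr = Math.cos(a), wi = Math.sin(a);
         // complex multiply-accumulate: (re + i*im) * (wr + i*wi)
         sr += re[j] * wr - im[j] * wi;
         si += re[j] * wi + im[j] * wr;
     }
     return new float[] { (float) sr, (float) si };
 }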

Then I started from a trivial Cooley-Tukey implementation that allocated all its output arrays ... and I was surprised it was only about 1/2 the speed of jtransforms given its simplicity. I tried a few variations and delved a bit into understanding the memory access patterns - although after an initial bit-reversed permute the memory patterns are trivial. It was still stuck at about 1/2 the speed of jtransforms.
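
The trivial implementation in question would look something like the following recursive radix-2 sketch, where every level allocates fresh even/odd arrays and complex values are stored as interleaved (re,im) pairs. This is a reconstruction of the approach, not the original code:

 // Recursive Cooley-Tukey radix-2 FFT, allocating all sub-arrays.
 // x holds n complex values as 2n interleaved floats; n a power of 2.
 static float[] fft(float[] x) {
     int n = x.length / 2;
     if (n == 1)
         return x.clone();
     // split into even/odd-indexed complex elements
     float[] even = new float[n], odd = new float[n];
     for (int i = 0; i < n / 2; i++) {
         even[2*i] = x[4*i];     even[2*i+1] = x[4*i+1];
         odd[2*i]  = x[4*i+2];   odd[2*i+1]  = x[4*i+3];
     }
     float[] e = fft(even), o = fft(odd);
     // combine with twiddles w^k = e^(-2*pi*i*k/n)
     float[] out = new float[2*n];
     for (int k = 0; k < n / 2; k++) {
         double a = -2.0 * Math.PI * k / n;
         float wr = (float) Math.cos(a), wi = (float) Math.sin(a);
         float tr = wr * o[2*k] - wi * o[2*k+1];
         float ti = wr * o[2*k+1] + wi * o[2*k];
         out[2*k]         = e[2*k] + tr;
         out[2*k+1]       = e[2*k+1] + ti;
         out[2*(k+n/2)]   = e[2*k] - tr;
         out[2*(k+n/2)+1] = e[2*k+1] - ti;
     }
     return out;
 }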

I had a bit more of a think about the bit-reversing memory permute stage and figured it wouldn't work terribly well with cpu caches. First I just worked on blocking the output by 8 elements (= 1 cache line), then on re-arranging the reads so they form 8 sequential streams. Strangely enough the second case is a bit slower on small workloads that fit in the cache, although it is a bit faster on larger ones. It's not really much of a difference either way.

 D:d                               |S:s                               |   0
 D: d                              |S:                s               |  16
 D:  d                             |S:        s                       |   8
 D:   d                            |S:                        s       |  24
 D:    d                           |S:    s                           |   4
 D:     d                          |S:                    s           |  20
 D:      d                         |S:            s                   |  12
 D:       d                        |S:                            s   |  28
 D:        d                       |S:  s                             |   2
 D:         d                      |S:                  s             |  18
 D:          d                     |S:          s                     |  10
 D:           d                    |S:                          s     |  26
 D:            d                   |S:      s                         |   6
 D:             d                  |S:                      s         |  22
 D:              d                 |S:              s                 |  14
 D:               d                |S:                              s |  30
 D:                d               |S: s                              |   1
 D:                 d              |S:                 s              |  17
 D:                  d             |S:         s                      |   9
 D:                   d            |S:                         s      |  25
 D:                    d           |S:     s                          |   5
 D:                     d          |S:                     s          |  21
 D:                      d         |S:             s                  |  13
 D:                       d        |S:                             s  |  29
 D:                        d       |S:   s                            |   3
 D:                         d      |S:                   s            |  19
 D:                          d     |S:           s                    |  11
 D:                           d    |S:                           s    |  27
 D:                            d   |S:       s                        |   7
 D:                             d  |S:                       s        |  23
 D:                              d |S:               s                |  15
 D:                               d|S:                               s|  31

This shows the source-destination read/write pattern for a bit-reversal permute. Obviously with all those sparse reads the cache utilisation isn't going to be great. FWIW moving from s to d (sequential writes) proved to have better performance than from d to s (sequential reads).
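
In code, the permute diagrammed above is just sequential writes with bit-reversed reads - a sketch assuming interleaved (re,im) complex data. Integer.reverse() reverses all 32 bits, so the result is shifted down to the log2(N) bits that matter:

 // Bit-reversal permute: write dst sequentially, read src at the
 // bit-reversed index (the "s to d" direction described above).
 static void bitReverseCopy(float[] src, float[] dst, int logN) {
     int n = 1 << logN;
     for (int d = 0; d < n; d++) {
         int s = Integer.reverse(d) >>> (32 - logN);
         dst[2*d]   = src[2*s];
         dst[2*d+1] = src[2*s+1];
     }
 }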

 D:dddddddd                        |S:0   4   2   6   1   5   3   7   |
 D:        dddddddd                |S:  0   4   2   6   1   5   3   7 |
 D:                dddddddd        |S: 0   4   2   6   1   5   3   7  |
 D:                        dddddddd|S:   0   4   2   6   1   5   3   7|

This is the same, but blocked to 8 elements in the output. Unfortunately there is still a large spread of reads.

 D:dddddddd                        |S:0   4   2   6   1   5   3   7   |
 D:                dddddddd        |S: 0   4   2   6   1   5   3   7  |
 D:        dddddddd                |S:  0   4   2   6   1   5   3   7 |
 D:                        dddddddd|S:   0   4   2   6   1   5   3   7|

This is an attempt at improving the previous one - although the reads still cover a broad spread, they are now confined to 8 sequential streams. It does improve performance once the problem is large enough, but on an x86 machine N needs to be over about 8M elements before it shows a minor improvement. I've yet to try it on ARM.
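
One plausible way to structure the 8-stream variant: visit the destination blocks in bit-reversed block order so that each of the 8 fixed source streams advances by exactly one element per block, as in the diagram above. The helper below is an assumed reconstruction (requiring logN >= 3), not the original code:

 // 8-blocked bit-reversal permute with 8 sequential read streams.
 static void bitReverseCopy8(float[] src, float[] dst, int logN) {
     int n = 1 << logN;
     int shift = logN - 3;               // bits handled by the block index
     // the 8 fixed stream origins: 3-bit reversal scaled to the top bits
     int[] stream = new int[8];
     for (int j = 0; j < 8; j++)
         stream[j] = (Integer.reverse(j) >>> 29) << shift;
     for (int b = 0; b < (n >> 3); b++) {
         // destination block taken in bit-reversed order ...
         int d = (Integer.reverse(b) >>> (32 - shift)) << 3;
         for (int j = 0; j < 8; j++) {
             // ... so that each source stream reads sequentially
             int s = stream[j] + b;
             dst[2*(d+j)]   = src[2*s];
             dst[2*(d+j)+1] = src[2*s+1];
         }
     }
 }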

Then I realised that Integer.reverse() wasn't mapping to a single instruction on x86 - and consequently over half of the processing time was spent just calculating the source index ... so I added a bit of bit-fiddling to the 8-blocked implementations which precalculates the 8 fixed element offsets relative to the base offset. I also tried a lookup table for the non-blocked implementation, which made a big difference. Strangely, using a lookup table to get the base offset actually slowed it down - suggesting that the JVM hoists the reverse(i) calculation into the outer loop if it can.
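
The lookup-table variant for the non-blocked permute might look like the following - the table is built once per transform size so the copy loop does no bit-twiddling at all (names are mine):

 // Precomputed bit-reversal table; amortised across repeated transforms.
 static int[] makeReverseTable(int logN) {
     int n = 1 << logN;
     int[] tab = new int[n];
     for (int i = 0; i < n; i++)
         tab[i] = Integer.reverse(i) >>> (32 - logN);
     return tab;
 }

 static void bitReverseCopyLUT(float[] src, float[] dst, int[] tab) {
     for (int d = 0; d < tab.length; d++) {
         int s = tab[d];
         dst[2*d]   = src[2*s];
         dst[2*d+1] = src[2*s+1];
     }
 }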

Once this permute is done the memory access pattern is just adjacent and sequential blocks - either depth or breadth first. The 8-blocked permute takes about 130%-150% of the time of a straight System.arraycopy() for 4-1K elements.
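
To illustrate why the post-permute passes are cache-friendly, here is a textbook breadth-first set of in-place radix-2 stages - each stage walks the array in adjacent, sequential blocks. A sketch of the standard algorithm, not the implementation being benchmarked:

 // In-place radix-2 DIT stages over bit-reverse-permuted data,
 // interleaved (re,im). Each stage reads/writes sequential blocks.
 static void fftStages(float[] x, int logN) {
     int n = 1 << logN;
     for (int size = 2; size <= n; size <<= 1) {
         int half = size >> 1;
         double step = -2.0 * Math.PI / size;
         for (int base = 0; base < n; base += size) {   // sequential blocks
             for (int k = 0; k < half; k++) {
                 double a = step * k;
                 float wr = (float) Math.cos(a), wi = (float) Math.sin(a);
                 int i = 2 * (base + k), j = 2 * (base + k + half);
                 float tr = wr * x[j] - wi * x[j+1];
                 float ti = wr * x[j+1] + wi * x[j];
                 x[j]   = x[i]   - tr;
                 x[j+1] = x[i+1] - ti;
                 x[i]  += tr;
                 x[i+1] += ti;
             }
         }
     }
 }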

Actually another reason I worked toward an 8-blocked implementation was to perform the first 8-way fft stage at the same time as the permute - with that in place I got to within shouting distance (under 200% of the execution time) of jtransforms for some small fft sizes. I was curious what it would take to do better, so I tried hard-coding the last stage of a size-16 fft and managed to beat it - not that this has any real practical purpose, mind you.
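
To give a flavour of what hard-coding the last stage buys: for a size-16 fft the final radix-2 pass is only 8 butterflies with compile-time-constant twiddles, so it can be fully unrolled. The first two butterflies of such an unroll might look like this (illustrative only, interleaved (re,im) data):

 // k = 0: w = 1, no multiply needed
 float tr = x[16], ti = x[17];
 x[16] = x[0] - tr; x[17] = x[1] - ti;
 x[0] += tr;        x[1] += ti;
 // k = 1: w = e^(-i*pi/8) as float constants
 final float wr = 0.92387953f, wi = -0.38268343f;
 tr = wr * x[18] - wi * x[19];
 ti = wr * x[19] + wi * x[18];
 x[18] = x[2] - tr; x[19] = x[3] - ti;
 x[2] += tr;        x[3] += ti;
 // ... k = 2..7 continue with fixed w = e^(-i*pi*k/8)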

This gives me a starting point to tackle the problem on the parallella, which has its own challenges.

PS Yes, I found a book or two and many papers about both fft implementation and the permute step, but I had time to experiment on my own.

Tagged hacking, java, parallella.