About Me

Michael Zucchi

 B.E. (Comp. Sys. Eng.)

  also known as Zed
  to his mates & enemies!

notzed at gmail >
fosstodon.org/@notzed >

Tags

android (44)
beagle (63)
biographical (104)
blogz (9)
business (1)
code (77)
compilerz (1)
cooking (31)
dez (7)
dusk (31)
esp32 (4)
extensionz (1)
ffts (3)
forth (3)
free software (4)
games (32)
gloat (2)
globalisation (1)
gnu (4)
graphics (16)
gsoc (4)
hacking (459)
haiku (2)
horticulture (10)
house (23)
hsa (6)
humour (7)
imagez (28)
java (231)
java ee (3)
javafx (49)
jjmpeg (81)
junk (3)
kobo (15)
libeze (7)
linux (5)
mediaz (27)
ml (15)
nativez (10)
opencl (120)
os (17)
panamaz (5)
parallella (97)
pdfz (8)
philosophy (26)
picfx (2)
players (1)
playerz (2)
politics (7)
ps3 (12)
puppybits (17)
rants (137)
readerz (8)
rez (1)
socles (36)
termz (3)
videoz (6)
vulkan (3)
wanki (3)
workshop (3)
zcl (4)
zedzone (26)
Sunday, 24 March 2013, 08:11

clAmdFft

Just a quick note, I created a small JNI wrapper using JOCL to access the clAmdFft library from AMD, to see how it compared to the Apple FFT I ported to Java.

Bummer, the Apple FFT is still faster for my problem size - image processing. Kernel time on a 1024x1024 sp fft is ~0.5ms vs ~0.38ms for the Apple one (HD7970). And the apple one only requires 3 passes. Smaller and/or batches seems to scale linearly too.

All that (minor) frustration with the clMiXeDAbbrCamlCase API for nought.

Update: So I actually tried writing a 64-element parallel FFT to see how I could do ... well, not good. Apart from wasting a lot of time wondering why "no matter what I did it took 15ms" because I hadn't allocated a properly sized buffer (out of range access I guess), when I finally got it going it was about 10x slower than the Apple one. Blah. Actually the apple one was about the same speed (if not faster) than a simple memcpy (using float2).

I wasn't being very smart about it, I just broke a 64-element FFT into one-calculation steps and used shared memory to communicate the results. A big overhead was the index twiddling for which I just used lookup-tables, but even the shared memory communication was expensive.

I may have gotten something wrong anyway, the profiler said the apple fft code had many many fewer global read/write operations, and I couldn't work out why. Although I was trying to do it all in LDS in a single kernel.

Since I didn't verify the results - I was just seeing how the memory access patterns would work - I have no real idea whether it worked or not anyway. But I may have another play another day, I wrote some code which outputs the expressions required between two arbitrary "layers" of the calculation so I can use larger kernels than the radix-2 ones I was testing with.

Tagged hacking, opencl.
#$#@$ Android | Nvidia and OpenCL
Copyright (C) 2019 Michael Zucchi, All Rights Reserved. Powered by gcc & me!