quick poke @ aparapi / hsa + fft = bummer
I have a very basic radix-2 FFT algorithm that implements each pass as a single loop rather than multiple loops - this allows the algorithm to be parallelised trivially because each item is calculated in isolation. I converted it to Java so I could experiment and compare - although given Java has no complex type it's easier experimenting in C! The single-loop version is slower than the two-loop one, which is a bit of a shame, but given that the radix-2 algorithm does so little work inside the loop it isn't really surprising.
(The algorithmic efficiency / performance isn't really important right now, it's just something to experiment with)
I tried running it using lambda expressions but the overhead of the thread communications swamped it - it's about 3x slower that way. This was no surprise.
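For illustration, here is a minimal sketch of what a single-loop radix-2 pass can look like in Java, with an optional parallel variant using a lambda over an IntStream. This is not the actual code from the experiment: the class name SingleLoopFft, the split re/im arrays (standing in for the missing complex type), the decimation-in-time layout with a bit-reversal step up front, and the use of parallel streams are all assumptions made for the example. The length is assumed to be a power of two.

```java
import java.util.stream.IntStream;

// Hypothetical sketch: radix-2 decimation-in-time FFT where every pass is
// one flat loop over n/2 independent butterflies.
class SingleLoopFft {

    // One pass. m is the half-size of the groups being combined (1, 2, 4, ...).
    // Each index i touches its own pair (a, b), so iterations never overlap.
    static void pass(double[] re, double[] im, int m, boolean parallel) {
        IntStream range = IntStream.range(0, re.length / 2);
        if (parallel) {
            range = range.parallel();
        }
        range.forEach(i -> {
            int j = i & (m - 1);          // position inside the group
            int a = ((i - j) << 1) + j;   // upper butterfly input
            int b = a + m;                // lower butterfly input
            double angle = -Math.PI * j / m;
            double wr = Math.cos(angle);
            double wi = Math.sin(angle);
            // t = w * x[b], written out by hand since Java has no complex type
            double tr = wr * re[b] - wi * im[b];
            double ti = wr * im[b] + wi * re[b];
            re[b] = re[a] - tr;
            im[b] = im[a] - ti;
            re[a] += tr;
            im[a] += ti;
        });
    }

    // In-place forward FFT; the array length must be a power of two.
    static void fft(double[] re, double[] im, boolean parallel) {
        int n = re.length;
        // Bit-reversal permutation so the in-place passes produce natural order.
        for (int i = 1, j = 0; i < n; i++) {
            int bit = n >> 1;
            for (; (j & bit) != 0; bit >>= 1) {
                j ^= bit;
            }
            j ^= bit;
            if (i < j) {
                double t = re[i]; re[i] = re[j]; re[j] = t;
                t = im[i]; im[i] = im[j]; im[j] = t;
            }
        }
        // log2(n) passes, each a single loop of n/2 butterflies.
        for (int m = 1; m < n; m <<= 1) {
            pass(re, im, m, parallel);
        }
    }
}
```

Calling SingleLoopFft.fft(re, im, false) runs every pass as a plain sequential loop, and passing true hands each pass to the common fork/join pool. With so little arithmetic per butterfly, the scheduling cost of the parallel version can easily outweigh the work itself, which fits the slowdown described above.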
So I thought I'd try it using HSA instead, hoping it would demonstrate where HSA could come into its own. About the only bit of that I have handy right now is aparapi-lambda - it's the only compiler I could think of that I have - so I hooked it up using that. Unfortunately there is a bit of an impedance mismatch between the way Aparapi and the JVM work and the way HSA does. Aparapi just translates the Java bytecode from javac directly into HSAIL assembly language; no problem there. But javac intentionally does essentially no optimisation - it leaves all that to the JVM, which has more knowledge and can do a better job. HSA, however, moves the optimisation to the compiler so that the HSA finaliser can be simpler - which makes it easier to port, smaller, and more robust and reliable.
To cut that long explanation short: it runs like shit because it's generating shit code and I can't really use it as any indication of performance. Bummer. It's about 3x slower than using lambdas.
So much for trying a short-cut - looks like I have to get my hands a bit dirtier on this one.
Oh then I remembered I had the graal stuff, but I forgot how to run it. So I tried updating and after a bit of frobbing about got it to run, ... I think. This generates better code but still has overly complex array indexing arithmetic ... and it's running much much slower too (coincidentally, around another 3x slower again).
So I updated gcc from the hsa branch and got that built but trying to do something will require a bit more work. I don't want to use libOkra for this so I started poking at the ioctls required to talk to the kfd device (not sure what kfd stands for but it's the kernel module which handles the HSA interface). I managed to get some info out of it so at least it's on the right track. It's a tiny interface and most of the work is done in userland and should be straightforward but there are some details which are important to do with cache coherency that I need to find out about.
I tried getting the HSA documents which would aid this work ... but they're all over the place; one is on some shitty website called sl1deshare which has an abysmal eye-hurting in-browser viewer and won't let me download the PDFs without a third-party account, which I don't have.
Oh I see, if you send the HSA foundation a message the site sends you an email with a download link anyway. How ... annoying. I wonder what spam service I just inadvertently signed up for.
Hmm, I think that might be enough for today. And that reminds me that I haven't had breakfast yet, and together with another night of poor sleep I'm just not in the mood.
Update: Ok so I had a break and got back to it. But it seems like I misunderstood the abstraction a little bit and the finalising is done at the API level before it hits the device. Well that makes complete sense of course. Duh.
Anyway I tried getting libOkra to load a BRIG generated by gcc but it just aborts, probably due to some ELF issues alluded to in the hsa branch of gcc.
So I guess it's just not ready for that kind of poking yet.