Aparapi
So, I kept watching some more of the AMD fusion summit videos yesterday evening - and now I remember one reason I haven't watched them before: they take a long time to get through!
The GCN one was interesting (from 'Mike and Mike'), although the first Mike was a bit nervous and together with his Texan accent and rushing a bit, made him a bit hard to follow.
But the Aparapi one gave me some new respect for the project.
Although one thing I didn't think was correct was he kept going on about how Java programmers are completely braindead and don't know how to code for performance or deal with other issues (it's not like they're Python coders FFS):
- Pointers?
- Java coders have never seen a pointer eh? Object references behave the same because they are the same, and although you can't increment/index them, most coders stick to direct pointer dereferencing and array indexing anyway: i.e. exactly the same way they're used in Java. And they both have null pointer dereferences, it's just that Java's are non-fatal. So one could say Java programmers have never seen a fatal pointer access, but just because they're called object references or arrays, doesn't mean they're not pointers ...
- Other languages?
- Java programmers seem to love XML for some reason: they can cope with other languages. Look at ant: it's an XML form of bourne-shell! Not to mention Scala and so on.
- Performance
- Actually any of them already using Java for engineering/science know how to get performance: use objects of (single-dimensional) arrays, not arrays of objects. Write really ugly maths to cope with it. There's just no other practical choice without taking unacceptable performance hits.
And this goes for more fundamental understanding of the underlying architecture too. I know everyone likes to make out that abstraction means you don't have to know about registers and cache and how language constructs are executed by the hardware (and I know some schools of computer science try to hide such details), but that's bunkum. It affects every language because at the end of the day they're all executed as machine instructions.
- Threading, barriers, etc.
- Again for performance you can't avoid these. Java barriers are also the same semantically as OpenCL barriers and are a very simple concept if you're able to cope with the concept of concurrency at all.
- Dynamic memory
- Absolutely agree here, this is one thing Aparapi should hide.
So if a coder can't cope with these concepts already they are not going to be a potential user of Aparapi, so it seems strange to only target them as a potential adopter of the technology.
As with these concepts in Java, these 'braindead' coders will just use third party libraries and (ugh) frameworks to do this fiddly stuff for them, because even the simplest concurrency model that Aparapi presents (which is more like an OpenGL shader than a CL kernel) will be beyond them.
It's the actual concurrency which is the hard part: and for the most part OpenCL's concepts make the concurrency easier to deal with (or even, possible to deal with). So hiding the tools to deal with the concurrency whilst exposing the concurrency does seem a little counter-productive.
So I don't think hiding all the details of OpenCL is necessarily a good idea: sometimes one does need to know about data flow, local memory, 3-d workgroups, and barriers. 'system' and 'framework' programmers already need to know about threads and concurrency and so these related concepts will not be foreign to them. Although I noticed that Aparapi now has support for local memory blocks and so there's hope yet.
I'm not sure why Aparapi uses such an explicit memory transfer model either, when it could have managed the buffer memory in a simpler way. e.g. Java has accessors, why not use them to determine when buffer memory needs to be returned to the host? The data flow should be quite explicit from the kernel invocation order: no need to analyse the host code for this (and this is problematic: I don't see how the host code can invoke multiple kernels in a sequence from the example, since each kernel has its own single 'host' method - imho 'host' should be an attribute on any method rather than a single one, but maybe the api has moved on since then).
However, taking into account the future plans of the AMD/ARM platforms ... Aparapi has the potential to be much much more useful as the programming model could map quite well to the future target design. i.e. once one has zero-copy unified memory and a low-overhead job queue mechanism the cost of an Aparapi kernel call will become very low.
Although reviewing the plans, I suspect they're being a bit optimistic: i only recently discovered for example that async memory copies between CPU and GPU is only an 'preview' feature in the current sdk ... which was quite a 'WTF' moment - how seriously IBM-PC-XT is that ... OTOH moving the ringbuffer processing to the hardware removes most of the OS-specific code required to talk to the hardware and so will reduce the resource requirements for driver development (i.e. we wont have to wait for the driver devs, it'll already be done by the hardware guys).
I think the biggest problem for Java however is in the language itself, and in particular it's lack of mathematical expressivity[sic]. It doesn't support vector types for starters - which are handy in this case more for their ability to concisely realise mathematical expressions than the ability to map efficiently to SIMD processors. And even if one does wrap these in objects, using them is clumsy, error prone, and even messier if you're aiming for efficiency. e.g. it's simply cleaner writing C or OpenCL C code to perform most maths than it is doing the same in Java (in it's most purest object form, let alone optimised for practicality).
So although a lot of problems will probably map fine enough, occasionally one will be better off with lower-level access. If one could override the kernel code with your own string, but let Aparai handle the buffer management and let them all interoperate it would be very cool. Even if for example, it compiled the OpenCL code in to Java bytecode as the fallback (ok, this is a rather much bigger problem to solve ...).