Sorry about the formatting.
This code is present in zcl on the foreign-abi branch.
git clone https://code.zedzone.au/git/zcl
git checkout foreign-abi
----------------------------------------------------------------------
 JNI to Java with project panama
 Michael Zucchi
 notzed at gmail
----------------------------------------------------------------------

This is a summary of the work of converting the zcl binding for OpenCL from JNI to jdk.incubator.foreign, aka 'project panama'.

Background.
===========

zcl is a library for calling OpenCL from Java that maps the OpenCL 2.1 api to a set of Java classes. The OpenCL api is highly object oriented so it maps consistently and logically this way. In addition it supports garbage collection for reclaiming all allocated resources, or manual reclamation should it be desired.

It was partially an exploratory work testing ideas on API design and comparing auto-generated with hand-rolled bindings that leveraged the C compilation environment for code reuse.

The garbage collection allows for some interesting use of lambda functions, for example functions which retain references to working buffers to avoid the overhead of dynamic allocation.

Issues.
=======

Problems encountered.

-- memory segments --

Java allocated MemorySegments have some restrictions on use (but only from java). The most severe of these is that such segments must be closed on the same thread that created them. This means they can not be garbage collected using a reference queue via a cleaner thread.

OpenCL has a function which allows an application-allocated memory block to be the backing for a buffer or image - using CL_MEM_USE_HOST_PTR. This block must remain allocated beyond the lifetime of the creation function - for the lifetime of the buffer or image. zcl/jni simply retains a reference to the ByteBuffer used at buffer creation time and this is automatically reclaimed by the system.

One possible workaround for this is to use malloc() and free() to create and manage the memory explicitly, which would allow it to be freed anywhere.
But there's no public way to tell who allocated the segment, so in the context of a function which accepts a MemorySegment this is either a potential crash or requires additional types to manage. The latter is not ideal since it is just recreating MemorySegment functionality.

zcl/panama punts this whole problem to the application with a new API requirement (and not just a recommendation) that CLMemory types created in this way be explicitly closed. This is doable because, although zcl supports GC for all objects, they can also be explicitly released at any time. Forgetting to call release() (or calling it on the wrong thread) will at worst result in a memory leak.

OpenCL accepts some other application-created data whose lifetime matches an object: clSetEventCallback(), clSetMemObjectDestructorCallback(), and the pfn_notify parameter to clCreateContext(). A native up-call stub must be created to be passed to these functions. Fortunately these up-call stubs have special rules which allow them to be freed from any thread; thus they can be easily cleaned up by GC or safely changed. This is important for clSetEventCallback() as a typical use-case is a hidden event - possibly created in a lambda function - that has no direct way of being undone.

Static constants are another area where the MemorySegment API is currently fiddly. To be able to use them from other threads one must use acquire() every time in every accessor and every time they are passed to a MethodHandle. This is special case code that is only required due to the current design of MemorySegment and only affects access from Java.

This occurs in zcl/panama with CLImageFormat. The solution is the same one used in zcl/jni - instead of being structures they are simple POJOs and they are marshalled every time they are used or retrieved. This is acceptable for OpenCL as it only has two small structures in the API.

-- direct byte buffers --

zcl/jni uses direct byte buffers for some apis.
One particular function of note is enqueueMapBuffer. jni can trivially wrap a native memory block in a ByteBuffer and trivially get the same address back later when it needs it.

To retain the same interface in zcl/panama requires auxiliary data structures to track the relationship between a ByteBuffer and a MemoryAddress. Ordinary hash tables cannot be used because ByteBuffer.hashCode() depends on the buffer's state (its remaining contents) rather than on its identity.

Also of note is that to even implement the function requires the use of the "scary" ForeignUnsafe.

Really the zcl API should probably be changed, enqueueMapImage is quite shit.

-- direct array access --

zcl/jni uses two approaches for accessing Java arrays of primitives. For short arrays it just copies the data using the ArrayRegion functions. For potentially large ones it uses GetPrimitiveArrayCritical to allow direct access to the memory. These functions are only called in non-wait contexts so seem to work OK in practice. As a big bonus the function works for any primitive type so the C code can share almost all of the implementation.

zcl/panama approaches this also in two ways. Short arrays are just marshalled between java types and stack-allocated memory. For large arrays there is no equivalent function to GetPrimitiveArrayCritical so all transfers must also be marshalled to holding buffers.

A further complication is that all array access in Java is type specific, and to handle that one either has to cut and paste or try to use MethodHandles (which are completely unchecked until run time) and/or lambda functions to reduce code duplication.

This last issue already causes a lot of bulk in zcl for the sake of convenience, but with the added performance hit it's hard to say if it's even worth providing such interfaces. In zcl/jni it was worth it because it was faster and easier to use than going via a ByteBuffer when synchronous transfers were adequate.
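
To illustrate the duplication described above, here is a minimal sketch of type-specific marshalling into a holding buffer. It uses plain java.nio direct buffers rather than the panama MemorySegment API, and the class and method names (ArrayMarshal, toNative, fromNative) are hypothetical and not part of zcl. Each primitive type needs its own near-identical pair of methods, where the C side could share one implementation.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

// Hypothetical helper, not zcl code: copies Java arrays to and from
// native-order direct buffers, one overload per primitive type.
final class ArrayMarshal {
    // float[] -> holding buffer
    static ByteBuffer toNative(float[] src) {
        ByteBuffer bb = ByteBuffer.allocateDirect(src.length * Float.BYTES)
                                  .order(ByteOrder.nativeOrder());
        bb.asFloatBuffer().put(src); // the view has its own position
        return bb;
    }

    // int[] -> holding buffer (same logic, different view type)
    static ByteBuffer toNative(int[] src) {
        ByteBuffer bb = ByteBuffer.allocateDirect(src.length * Integer.BYTES)
                                  .order(ByteOrder.nativeOrder());
        bb.asIntBuffer().put(src);
        return bb;
    }

    // holding buffer -> float[]
    static void fromNative(ByteBuffer bb, float[] dst) {
        bb.asFloatBuffer().get(dst);
    }

    // holding buffer -> int[]
    static void fromNative(ByteBuffer bb, int[] dst) {
        bb.asIntBuffer().get(dst);
    }

    public static void main(String[] args) {
        float[] in = {1f, 2f, 3f};
        ByteBuffer hold = toNative(in);   // first copy: array to holding buffer
        float[] out = new float[in.length];
        fromNative(hold, out);            // copy back out again
        System.out.println(Arrays.toString(out));
    }
}
```

Covering long, short, double and so on means repeating the same two methods again for each type, which is the bulk being complained about.
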

-- bulk --

C is just a more concise and flexible language than Java, and the macro processor allows for all sorts of things that can't be done without one. It's easier to make polymorphic functions that work with primitive types, which saves a lot of repetitive typing. For example the JNI interface presents some interfaces that work regardless of the java type - GetPrimitiveArrayCritical for example - that require bulky type-specific repetition to recreate in Java.

The java code is thus bigger and less elegant. It's just not as fun to write. Fighting with generics suxors.

In a partial state of completion the binding is about 50% larger in terms of total class size compared to the jni classes and compiled libzcl.so.

Notes.
======

Interesting details.

-- gc --

Garbage collection is supported using the same mechanism as in zcl/jni. Namely the java objects (CLObject etc) are registered to a reference queue which is tracked based on the pointer (MemoryAddress). When an object is no longer reachable a class-specific static method release(MemoryAddress) is invoked to free the resource.

In addition, all objects are 'uniqified', so that the original object is always retrieved from its pointer for any of the query functions, which allows shared state to operate.

Any of the functions which set callbacks retain a reference to the callback so it doesn't go away, but it is also automatically reclaimed.

In addition to automatic gc, applications can directly release any resource - but must take responsibility for ensuring it is not accessed afterwards.

-- down-calls --

The code generator creates code to instantiate the method handles for the relevant native calls, and creates stub functions which will invoke the method handle with the correct argument types.

Currently the static methods created take the raw foreign API types such as MemoryAddress. This means each high level entry point must resolve these addresses and ensure the types are correct.
Perhaps this should change so that at least the main library types are resolved directly, as it's akin to every non-primitive argument being void *. I did have a Pointer<> abstraction to allow some compile-time type checking, but this just made a lot of the code messier and somewhat slower, so I removed it all and when I did that I moved all functions to use the raw types.

I was hoping the function calls could just be exposed as is, but there is too much call-specific scaffolding required for it to be automated.

As MethodHandle.invoke throws Throwable, this is a pita to deal with at every invocation because you have to bounce any internal exceptions in ways that make sense. But I left it exposed from the static bindings because an exception frame is almost always needed anyway.

-- up-calls --

Native callbacks (up-calls) are resolved by raw-type arguments to a common interface, for example:

  void (*func)(cl_mem *mem, void *data);
  void (*func)(cl_program *prog, void *data);

are both mapped to:

  public interface Call_pLpv_v {
    public void fn(MemoryAddress arg_0, MemoryAddress arg_1) throws Throwable;
  }

The convention isn't great but it's based on JNI types:

  pL = pointer to object
  pv = pointer to void
  _v = returning void

Each callback has a stub() method which will convert an instance of the matching type into an up-call stub.

In addition to the low-level interface there is a high level one which is application visible. For example for the one above it is:

  public interface CLNotify<T> {
    public void notify(T source);
  }

To simplify application use, this java-friendly version of the interface provides a static call() method which will create a Callback<> object which will bounce from the raw native interface to the java friendly one. The Callback is registered with the reference queue and can be automatically (or manually) reclaimed.
As a convenience they will also take a null argument and return a Callback which will resolve to a MemoryAddress.NULL but otherwise do nothing, avoiding the need for special case code where they are used, as they are always optional.

Thus an application can simply call:

  CLMemory.setMemObjectDestructorCallback((CLMemory m)->do_stuff(n));

setMemObjectDestructorCallback() will simply call:

  Callback cb = CLNotify.call(notify, CLMemory::create);

And then pass it to C using:

  clSetMemObjectDestructorCallback( .., cb.addr(), MemoryAddress.NULL);

And it is linked from native pointer to application lambda using a single mapping interface, and retaining the reference keeps it alive as long as necessary.

An alternative I explored was to create a higher level interface that contained the types. It was just too ugly to expose to the application, and required more complication to make use of if mapping to application interfaces. For example CLNotify would need a special case for each possible type.

-- 'stack' allocator --

Most calls involve creating temporary work memory in which to marshal arguments or store results - in C this would just be a stack variable or a call to alloca().

Allocating many MemorySegment values and tracking them for accurate disposal is somewhat inefficient and quickly becomes intractable when you start nesting multiple dependent allocations.

To simplify the common case of marshalling call arguments there is a stack allocator in API.Memory which can cheaply allocate blocks and free them all at once.

There are many functions in api.Native and CLObject which take an allocator, and this stack allocator is just one implementation.

-- code generator --

The code generator uses a gcc plugin to access all the C type and function information by hooking into the precompiled header mechanism. This is dumped to a perl-syntax file ("perl, son") which can be easily processed.
Unfortunately this doesn't have access to #defines, which is how cl.h defines all constants, so another simple script extracts those.

The code generator is an attempt to create a reusable component across multiple projects, but it really isn't going to be able to do that. It's probably better writing an API-specific bit of perl to dump the gcc-generated metadata out to something usable. It also got messed up converting from jextract-style annotation interfaces to foreign-abi low-level interfaces and experimenting with different approaches.

So for now it's a huge disgusting mess, but zcl already consisted of hand-written classes and OpenCL is almost entirely a handle driven API, so it is needed primarily to generate the library methods and callback templates. api.Native defines all the complex code which these use.

Library method handles are resolved using a signature that is compatible with the jextract version I used previously. All the metadata required, such as memory layouts, is only created on the fly and discarded immediately - including for up-calls.

OpenCL only requires two structures: cl_image_desc and cl_image_format. So these were hard-coded rather than using the code generator.

The code generator rules for StudlyCaps generate shitty type names, so they too were all overridden.

API changes.
============

Some things relevant to the panama version, perhaps due to its limitations. Some are bug fixes.

1. createBuffer() can take a MemorySegment as well as ByteBuffer.

2. CL_MEM_USE_HOST_PTR now uses a MemorySegment to track the application data. As a result:
   - release() must be called explicitly to avoid leaks.
   - release() must be called on the original creator thread.
   - getHostPtr() now creates a ByteBuffer using segment.asByteBuffer().order(nativeOrder())

3. All functions that use arrays require internal copying. As the OpenCL functions always copy as well this is a double copy.

4. CLImageDesc uses longs rather than ints for sizes (bugfix).

5. The exceptions are still a bloody mess.

7. Things that take ByteBuffer will behave differently. They will cover [position .. limit) rather than [0 .. capacity).

8. CL_MEM_USE_HOST_PTR state is now handled in CLMemory. Thus CLImage now handles it.

9. createImage and the various image copy functions don't perform data range validation.

A. None of the SVM stuff is implemented. I've never had a working implementation to test it with anyway.

B. None of the extension stuff is implemented.

C. enqueueMap* retains the same api but it has to track all mapped memory blocks so the unmap operation can determine the MemorySegment that underlies the ByteBuffer that is exposed by the api. The api can probably be changed, particularly enqueueMapImage is quite shit.
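
As a rough illustration of the tracking needed for item C, here is one way a mapped ByteBuffer could be associated with its native address. This is only a sketch and is not taken from zcl: the MappedTracker class, its method names and the use of a plain long in place of a MemorySegment or MemoryAddress are all assumptions. It uses IdentityHashMap because ByteBuffer.hashCode() and equals() depend on the buffer's remaining contents, so an ordinary HashMap key stops matching once the application writes to the mapped memory.

```java
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.IdentityHashMap;
import java.util.Map;

// Hypothetical tracker, not zcl code: remembers which native address
// (here just a long) backs each ByteBuffer handed to the application.
final class MappedTracker {
    // Keyed on object identity, not on buffer contents.
    private final Map<ByteBuffer, Long> mapped = new IdentityHashMap<>();

    // Called when a map operation hands a buffer to the application.
    synchronized void track(ByteBuffer view, long address) {
        mapped.put(view, address);
    }

    // Called at unmap time; returns null if the buffer was never tracked.
    synchronized Long untrack(ByteBuffer view) {
        return mapped.remove(view);
    }

    public static void main(String[] args) {
        ByteBuffer view = ByteBuffer.allocateDirect(16);

        // A content-hashed map, for comparison only.
        Map<ByteBuffer, Long> hashed = new HashMap<>();
        hashed.put(view, 0x1000L);

        MappedTracker tracker = new MappedTracker();
        tracker.track(view, 0x1000L);

        // The application writes to the mapped memory, changing the
        // buffer's content-based hash code.
        view.put(0, (byte) 1);

        System.out.println("HashMap lookup:  " + hashed.get(view));
        System.out.println("identity lookup: " + tracker.untrack(view));
    }
}
```

A real implementation would also need to decide what happens to duplicates or slices of the mapped buffer, since those are distinct objects and would not be found by an identity lookup.
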
notzed on various mail servers, primarily gmail.com.