OpenVX
Here's something I missed being out of the loop for the last good while: OpenVX.
Although it appears so did the rest of the industry? Another standard sent to rot?
I'll have to have a closer look though; if it is well thought out it will be worth borrowing a few ideas at the least, as it's a problem i've come up with multiple solutions to myself and it's always useful to get a fresh perspective.
Or maybe it'll give me something to play with along with ZCL, either as a binding or a pure Java (re)implementation for prototyping.
I wonder how far away vulkan is for the rest of us.
Update: Some movement has occurred on the HSA front recently. I suppose just some more announcements or partnerships or something. There's a lot of technology and software that needs to come together to get anywhere and it's taking its time. I wonder if it'll ever get traction though. I guess it solves one of the problems with OpenCL - you actually have to program parallel software, which is a skill few have the time to learn and fewer companies have the budget to fund. At least it has the potential to stay in the compiler layer, hidden from your typical code-monkey.
OpenCL notes
As a tangentially related observation, I've been working on some OpenCL stuff of late and I came across a problem I just couldn't get to run very fast on the dev GPU. I ended up resorting to using an OpenCL/CPU kernel and another queue, but along the way i tried a couple of other interesting things.
One was using a native kernel written in Java - but this still requires a CPU driver / queue since the GPU driver i'm using doesn't support native kernels. ZCL has a fairly simple interface to this:
public interface CLNativeKernel {
/**
* Native kernel entry point.
*
* @param args arguments, with any CLMemory objects replaced with
* ByteBuffer.
*/
public void invoke(Object[] args);
}
public class CLCommandQueue {
...
public native void enqueueNativeKernel(
CLNativeKernel kernel,
CLEventList waiters,
CLEventList events,
Object... args) throws CLException;
}
The JNI hides the details but it behaves the same way as the C code whereby any memory objects in the argument list are replaced by pointers (ByteBuffer here). Not sure if i'll keep the varargs prototype because it is inconsistent with every other enqueue function and only saves a little bit of typing. I'll review it when i look at the functional stuff i mentioned in the last post.
Which can be used efficiently and conveniently in Java 8:
CLBuffer points = cl.createBuffer(0, 1024 * 4);
q.enqueueNativeKernel((Object[] args)-> {
// args[0] = ByteBuffer = points contents
// args[1] = Integer = 12
// args[2] = Integer = 24
}, null, null,
points, 12, 24);
Since my prototype harness didn't have a CPU queue until I subsequently added it, my first approach was to emulate a native kernel using event callbacks and user events. It actually worked pretty well and ran in about the same time, although it took a bit more (fairly straightforward) code to set up.
One approach I took was to have two queues - the primary 'gpu' queue where the work is done, and another one used for the memory transfer and rendezvous point.
// setup CLEventLists and the 'cpudone' user event
// gpu part of work
gpu.enqueueXXKernel(..., gpudone);
// read back the data for the cpu component once the gpu is done
work.enqueueReadBuffer(..., gpudone, readdone);
// write the cpu result back once the user event fires
work.enqueueWriteBuffer(..., cpudone, writedone);
// make sure q is ready
gpu.enqueueMarkerWithWaitList(writedone, null);
// perform the cpu work from the read-complete callback
readdone.setEventCallback(CL_COMPLETE, (CLEvent e, int status) -> {
    // do the work
    cpudone.setUserEventStatus(CL_COMPLETE);
});
In this case all the enqueue operations are performed at once and events are used to synchronise. This simplifies the callback code a little bit. Now that i'm looking at it there's probably no need for the separate queue if the gpu queue is synchronised with it anyway. (Like with most of these examples it is a summary of what i came up with, but not the full sequence of how i got there, which explains some of the decisions.)
This is a trivial approach to ensuring the 'gpu' queue behaves as the caller expects: that is, as if the work was performed in sequence on the queue and without having to pass explicit events. I'm using the read/write interfaces rather than map/unmap or otherwise mostly out of habit, but the data in question is quite small so it shouldn't make much difference either way.
And FWIW for this problem ... this approach or the java NativeKernel one actually runs a tiny bit quicker than using the OpenCL/CPU device let alone the GPU (all wall-clock time on the opencl q).
I had to make some small tweaks to the CLEventList code to make this all work and to tie properly into the garbage collection system. Mostly this was adding a CLEvent array rather than just using the pointer array and fixing the implementation. I kept the pointer array to simplify the jni lookup. I also had to have construction go through the same mechanism as the other CLObjects so they retain global reference uniqueness. This greatly simplifies (i.e. completely removes) reference tracking, which is always nice with async callbacks. I think it should all "just work"; it does from the Java side - but i need to check from the OpenCL side of things whether actions like setUserEvent() add an explicit reference bump.
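Roughly the shape it ends up with is something like this - a sketch only, the field names and layout here are illustrative rather than the actual ZCL source:
// Sketch: an event list that keeps the raw pointer array for the JNI side
// and a parallel array of unique CLEvent wrappers for the Java side.
public class CLEventList {
    final long[] pointers;   // passed straight through to the OpenCL calls
    final CLEvent[] events;  // unique Java wrappers, safe for callbacks and gc
    int index;               // number of events currently in the list

    public CLEventList(int capacity) {
        pointers = new long[capacity];
        events = new CLEvent[capacity];
    }
}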
This is a prime example of what HSA should solve well, but for now this is what i've got to work with.
I've been so busy with details i haven't had a chance to even look at any OpenCL 2.0, let alone OpenVX, HSA, or much else. And frankly the spirit is just not that willing of late. Spring is just the latest of a litany of excuses for that.
OpenCL lambda enqueue
Just had a thought on an alternative api for CLCommandQueue in zcl. No this has nothing to do with lambda calculus in OpenCL.
An inconvenience in the current api is that all the enqueue functions take a lot of arguments, many of which typically have default values. This can be addressed using function overloading but that just adds its own inconvenience as there are also simply a lot of functions to overload. A related issue is that things like extensions can add additional entry points which object-orientedly belong on the queue object, but placing them there doesn't necessarily fit.
And finally new compound operations need to be placed elsewhere but also fit a similar semantic model of enqueuing a task to a specific queue.
So the thought is instead to use java's lambda expressions to create queueable objects which know how to run themselves, and then at least the waiters/events parameter overloads can be handled in one place.
So rather than:
// some compound task
public void runop(CLCommandQueue q, CLImage src, CLImage dst,
CLEventList waiters, CLEventList events) {
... enqueue one or more jobs ...
}
public void runop(CLCommandQueue q, CLImage src, CLImage dst) {
runop(q, src, dst, null, null);
}
public void runop(CLCommandQueue q, CLImage src, CLImage dst,
CLEventList event) {
runop(q, src, dst, null, event);
}
// usages
runop(q, src, dst, waiters, events);
runop(q, src, dst, events);
runop(q, src, dst);
I can do:
// the interface
interface CLTask {
public void enqueue(CLCommandQueue q, CLEventList w, CLEventList e);
}
// the creation (only one required)
public CLTask of(CLImage src, CLImage dst) {
return (q, w, e) -> {
... enqueue one or more jobs ...
};
}
// usages
q.run(op.of(src, dst));
q.run(op.of(src, dst), events);
q.run(op.of(src, dst), waiters, events);
This could extend throughout the rest of the api so that for example a CLBuffer would provide its own read task factories:
public class CLBuffer {
public CLTask ofRead(byte[] target) {
return (q, w, e) -> {
q.enqueueReadBuffer(this, true, 0, target.length, target, 0, w, e);
};
}
}
// usages
q.run(buffer.ofRead(target));
q.run(buffer.ofRead(target), events);
q.run(buffer.ofRead(target), waiters, events);
vs
// typical usage (without overloading)
q.enqueueReadBuffer(this, true, 0, target.length, target, 0, null, null);
q.enqueueReadBuffer(this, true, 0, target.length, target, 0, null, events);
q.enqueueReadBuffer(this, true, 0, target.length, target, 0, waiters, events);
I think this would provide a way to add the convenience of overloading without a method count explosion. But the real question is whether it would actually improve the api in any meaningful way or merely make it different. Probably at this point it's a tentative yes on that one for many of the same reasons lambdas are convenient such as encapsulation and reuse.
There are some issues of resolving state at point-of-execution and threads but these are already an issue with OpenCL code to some extent and definitely with lambdas in general.
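For example (illustrative only - the class, field, and buffer names here are made up), a task lambda reads any captured object state when it is finally enqueued, not when the task was created:
// Sketch: 'offset' is read when the task actually runs, not when of() was
// called, so mutating it between creation and q.run() changes what is read.
class ReadOp {
    final byte[] target = new byte[1024];
    int offset;

    CLTask of(CLBuffer src) {
        return (q, w, e) ->
            q.enqueueReadBuffer(src, true, offset, target.length, target, 0, w, e);
    }
}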
One could keep going:
// the interface
interface CLTask {
public void enqueue(CLCommandQueue q, CLEventList w, CLEventList e);
public default void on(CLCommandQueue q) {
enqueue(q, null, null);
}
public default void on(CLCommandQueue q, CLEventList w, CLEventList e) {
enqueue(q, w, e);
}
}
// usage
buffer.ofRead(target).on(q);
Despite this having the benefit of layering in isolation above the base api I think it starts to get a little absurd and turns into "much of a muchness" deckchair shuffling.
Although this addition is probably useful:
// the interface
interface CLTask {
public void enqueue(CLCommandQueue q, CLEventList w, CLEventList e);
public default CLTask andThen(CLTask after) {
return (q, w, e) -> {
enqueue(q, w, e);
after.enqueue(q, w, e);
};
}
}
// usage
q.run( buffer1.ofRead(target).andThen(buffer2.ofWrite(target)) );
Actually I didn't really intend it as an outcome but this also becomes a lot more usable if the resources in question are automatically reclaimable via gc as per my last post. Whole state and work spaces can be retained and reused through nothing more than a CLTask reference.
I think i've convinced myself of the utility now but either way it takes very little code to try it.
OpenCL garbage
I was working on some higher level containers for managing OpenCL stuff and came to the conclusion that I wanted to add automatic resource reclamation to zcl - it was either that or fill a whole hierarchy of objects with reference counting. But reference counting is slow, error-prone, and a big mess to write in so it isn't at all attractive when there is an alternative.
I'd already done it in jjmpeg but i wasn't really keen on the way i implemented it there and wanted to see if i could come up with a more streamlined solution. Like when I did it for jjmpeg I started with this article about JavaSE finalisation and using weak reference queues.
I think the solution I came up with will work ... and it turned out to be rather simple in the end.
Previously all CLObjects were a simple lightweight pointer handle with all the details passed to the C functions. They all have an init(pointer) constructor which was called directly from the JNI layer. Duplicate objects referencing the same resource were not an issue so I just let it happen. Well it's easy to break but if you treat objects like the C pointers they are and know that dangling references are possible then it's not unsolvable.
But for GC to work the references need to be unique. This is fairly easy to guarantee as the resources are just memory pointers - which are guaranteed to be unique and unchanging. So rather than the JNI layer invoke the constructors directly I just call a factory method with a type index which lets me move some of the code into Java - it isn't significantly simpler but it is more flexible.
For the reference queue to work properly I need to store them in a container anyway so this conveniently meshes with using a hashtable to uniqify the objects.
static CLObject toObject(int ctype, long p) {
    CLObjectHandle h = referenceMap.get(p);
    CLObject o;

    // reuse the existing wrapper if its referent hasn't been collected yet
    if (h != null && (o = h.get()) != null)
        return o;
    // otherwise create (and register) a new wrapper for this pointer
    return classTable[ctype].newInstance(ctype, p);
}
My first attempt passed the Class through (this is how i did it in JNI) but I changed it to an integer. It makes the JNI a bit easier and having the type as an integer simplifies the release call (OpenCL api isn't OO and has per-type release functions). Being able to identify the object fully using primitive types also lets me freely use them without polluting the reference tree; which is critically important when dealing with gc.
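The release side then reduces to a single dispatch keyed on that index; roughly like the following sketch (the type constants and native wrapper names are illustrative only, not the actual ZCL entry points):
// Sketch: per-type release selected by the integer type index, each case
// forwarding to the matching clRelease* call in the JNI layer.
class CLRelease {
    static final int TYPE_CONTEXT = 0, TYPE_QUEUE = 1, TYPE_MEM = 2, TYPE_EVENT = 3;

    static void release(int ctype, long p) {
        switch (ctype) {
        case TYPE_CONTEXT: releaseContext(p); break;     // clReleaseContext
        case TYPE_QUEUE: releaseCommandQueue(p); break;  // clReleaseCommandQueue
        case TYPE_MEM: releaseMemObject(p); break;       // clReleaseMemObject
        case TYPE_EVENT: releaseEvent(p); break;         // clReleaseEvent
        default: throw new IllegalArgumentException("unknown type: " + ctype);
        }
    }

    // implemented in the JNI layer
    static native void releaseContext(long p);
    static native void releaseCommandQueue(long p);
    static native void releaseMemObject(long p);
    static native void releaseEvent(long p);
}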
Now comes the bit which i fucked up in jjmpeg (well the biggest bit). Each object is represented by 4(!) classes. An autogenerated native abstract class which includes the static native method prototypes and a hand-written native concrete class which implements any type-specific dispose or construction semantics. Then there is an autogenerated abstract public class which includes all the autogenerated methods again - this time invoking all the methods on the native class after looking up the object pointer. And finally a hand-written public concrete class which includes constructors, helpers, and any other special cases where the details are better hidden.
This is just a lot of code - every public method on the "java" class ends up calling a native method on the "native" class, so every method needs at least two implementations. This was the main driver for ZCL simply using a single JNI implementation and foregoing this redundant juggling of the call stack just to insert the resource pointer into the call. In most cases in ZCL the public api is just the native method and it needs no redundant wrapper.
This time I just added a single general-purpose CLObjectHandle weak reference type which is used by all instances to track the native resource. It just holds the pointer (and the ctype) and implements the release. I just add one of these to each CLObject in one place.
public abstract class CLNative {
final long p;
protected CLNative(long p) {
this.p = p;
}
...
}
public abstract class CLObject extends CLNative {
final CLObjectHandle h;
protected CLObject(int ctype, long p) {
super(p);
h = new CLObjectHandle(this, ctype, p);
}
...
static class CLObjectHandle extends WeakReference<CLObject> {
long p;
int ctype;
CLObjectHandle(CLObject referent, int ctype, long p) {
super(referent, referenceQueue);
this.p = p;
this.ctype = ctype;
referenceMap.put(p, this);
}
void release() {
    if (p != 0) {
        referenceMap.remove(p);
        CLObject.release(ctype, p);
        p = 0;
    }
}
}
...
}
This and a bit of house-keeping is all that is required.
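The house-keeping is basically just draining the reference queue and releasing handles as their referents get collected; something along these lines (a sketch only - whether it runs on a dedicated daemon thread like this or is polled opportunistically is an implementation detail):
// Sketch: a daemon thread that blocks on the reference queue and releases
// each handle once its CLObject referent has been garbage collected.
static void startCleaner() {
    Thread cleaner = new Thread(() -> {
        try {
            while (true) {
                CLObjectHandle h = (CLObjectHandle) referenceQueue.remove();
                h.release();
            }
        } catch (InterruptedException x) {
            // time to stop
        }
    }, "zcl-cleaner");
    cleaner.setDaemon(true);
    cleaner.start();
}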
Having release be idempotent allows explicit release mechanisms to remain - for those cases where you can't afford to let the native resource management be at the whim of the garbage collector. For this reason i may also have to move the native pointer resolution in the JNI from a CLNative.p field lookup to resolving it via the handle. I need to investigate the cost of doing this first, and also whether explicit release like this will actually work in practice (e.g. if you release an object with more than one reference, does it fuck up?). Doing this would also let me use the correct integral type if I felt the need by just creating two different CLObjectHandle classes (32/64) and resolving sizes in the JNI code.
There are some potential problems where you resolve an object for the first time via a non-referencing api (for example clGetProgramInfo(CL_PROGRAM_CONTEXT) and the like) and then let the reference expire. But this shouldn't normally be a problem since you would have to get the context before creating the program and are going to be keeping it around for the lifetime of the program, and thus only one xxRelease is ever invoked. And this should normally hold for everything else too. If it turns out to be an issue I have mechanisms I can use to address it, from adding an explicit object reference to the given objects (e.g. a CLContext to each CLProgram created), to adding phantom reference bumps on specific apis.
It's actually a devilishly difficult thing to test and verify: even once you know the exact reference counting semantics of every OpenCL api the interaction with the JVM will hide faults.
I haven't explored further but having unique objects and gc lets me freely cache local copies of resource handles for convenience or efficiency and so on. It really simplifies using the library enough as it is.
The next zcl release will include this as well as a couple of bug fixes and some other things which make it easier to use. Dunno when that might be though.
BOOPSOGL time waster
The long story: I finally replaced my AVR (hi-fi amplifier) a couple of weeks ago after blowing it up 1-2 years ago and the new one has some network features. There's a web page to control it and a phone app - but both are pretty shitful. Actually the app isn't all bad but it's a pain for the things I use it for most: volume and mute, because the volume knob is clumsy and the way the app handles screen blanking means mute isn't as easily accessed as it should be. I played a bit with the web app and worked out some of its terrible 'xml-ish' remote control protocol and wrote a little application to perform both - but javafx is way too fat for this. I was recently looking into some opengl stuff and came across a trivial example which uses GLX to setup the screen - it also had some simple X11 Display code so I thought I could just write a super-lightweight Xlib tool for this. But then you need at least a little bit of 'toolkit' to make this doable ...
I'd had a blast of Resogun and DRIVECLUB earlier but TV was dull so I started poking around with a trivial C struct-based object system, but then realised how much i'd forgotten since GObject and CamelObject. And then realised all the boilerplate that would be needed to even use such a one, so I went back to my RKRM: Libraries and looked into cloning BOOPSI instead. The only boilerplate it needs is setting a dispatch method, although the dispatch method itself ends up being fat as it fulfills the role a vtable would.
BOOPSI (basic object oriented programming system for intuition) was the AmigaOS 2 solution to general 'objects in C', which was apparently based on Smalltalk (Amiga libraries and devices are also object oriented but are not as general). Everything is implemented using a programmed dispatch call stack rather than vtables. It's not particularly fast but it is very small and flexible, and it does have one rather interesting benefit not found in C or C++ - the ability to change any object in the hierarchy without a full recompile whilst still retaining single-instance memory blocks.
The short story: I got a couple of hundred lines into the code which is enough to instantiate objects and define classes together with some core support utilities.
Will I keep poking? I'm slightly curious perhaps but not quite curious enough for that as it gets involved very quickly. Maybe if I use GLX instead of the raw X I was thinking of (BOOPSOGL?). OpenVG? Text rendering is the biggest hassle either way. And layouts, although i've looked at that before.
I guess at least one observation is that back then this stuff looked so fat and cumbersome (albeit a large improvement over base intuition or gadtools), but then yeah, i've seen what else has come since and it really really wasn't.
Nice curves!
Bezier Curves.
Wow what a page.
empirically corrected-approximate integer division
Although i haven't been posting about it i've been continuing to poke around in bits and pieces of code. Well, a little bit.
I did a bit of OpenCL last week and that was pretty fun. I had enough time to really dig into optimising a particular routine and was down to inspecting the ISA output from the driver. Good stuff. The GCN isa is pretty foreign to me so I had to use small snippets to isolate operations of interest.
For example one construct that comes up repeatedly when parallelising code is using a divide and modulus operator when splitting up a non-work-sized job into work-group sized blocks.
int block_size = info.block_size;
for (int id=get_local_id(0); id < limit; id+=64) {
int block_no = id / block_size;
int block_index = id % block_size;
// do work
}
Where possible one just chooses a power of 2 so this is a simple shift and mask, or integer divide by a constant isn't too bad as it can usually be optimised by the compiler. But this problem required a dynamic block size that wasn't a power of 2.
The solution? Use a floating point multiply by the reciprocal, which can be calculated efficiently or, as here, off-line. The problem is that this introduces enough rounding error to be worthless without some more work.
I must admit I just found the solution empirically here: i had a limited range of possible values so I just exhaustively tested them all against a couple of guesses. Hey it works, i'm no scientist.
int block_size = info.block_size;
float block_size_1 = info.block_size_1;
for (int id=get_local_id(0); id < limit; id+=64) {
int block_no = (int)(id * block_size_1 + 1.0f / 16384);
int block_index = id - (block_no * block_size);
// do work
}
This replaces the many instruction integer division decomposition with a convert+mad+convert.
On some work-loads this was a 25% improvement to the entire routine and these 2 lines are in an inner loop of about 50 lines of code.
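For what it's worth, the correction constant can also be sanity-checked off-line by brute force over the range of interest; something like the following (java here just for convenience - the 1/16384 bias and the ranges are only the ones relevant to this problem, and the GPU's mad may of course round a little differently to plain java arithmetic):
// Brute-force check of the corrected reciprocal-multiply division against
// exact integer division; any mismatch in the tested range is printed.
public class RecipDivCheck {
    public static void main(String[] args) {
        for (int blockSize = 1; blockSize <= 64; blockSize++) {
            float recip = 1.0f / blockSize;
            for (int id = 0; id < 4096; id++) {
                int got = (int) (id * recip + 1.0f / 16384);
                int want = id / blockSize;
                if (got != want)
                    System.out.printf("id=%d block_size=%d got=%d want=%d%n",
                            id, blockSize, got, want);
            }
        }
    }
}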
Well it's been fun to play at this level again - it's ... mostly ... pointless going to this level but it just adds to the toolkit and I enjoy poking. Maybe one day i'll have a job where it's useful.
I gave zcl a go on this as originally I was thinking of trying some OpenCL 2 stuff but I may not bother now. Given the lack of use/testing it was pretty much bug free but I started filling out the API with some more convenient entry points. I also decided to add some more java-array interfaces here and there: they're just too convenient and it hides the mess in the C even if they might not be the most efficient in all cases.
This is the sort of thing i'm talking about:
float[] data = new float[] { 1, 2, 3, 4, 5 };
CLBuffer buffer = cl.createBuffer(CL_MEM_COPY_HOST_PTR, data);
vs
float[] data = new float[] { 1, 2, 3, 4, 5 };
ByteBuffer bb = ByteBuffer.allocateDirect(data.length * 4).order(ByteOrder.nativeOrder());
bb.asFloatBuffer().put(data);
CLBuffer buffer = cl.createBuffer(CL_MEM_COPY_HOST_PTR, data.length * 4, bb);
It's only two fewer lines of code ... but yeah that shit gets old fast. The first is more efficient too because this is a native method and it avoids the copy. In the first case CL_MEM_USE_HOST_PTR throws an exception though, and in the second it works (library call permitting).
The main downside is adding these convenience calls blows out the method count very quickly if you support all the primitive types - which detracts from the ease of use they're supposed to increase.
Another release? Who knows when.
And this week i've been poking at some OpenGL. My it's grown. I'm experimenting using JOGL for this although i'm not a fan of some of its binding choices. It's crossed my mind but i'm pretty sure i don't want to create yet another binding as in a 'ZGL'. Hmm, I wonder if vulkan will clean up the cross platform junk from opengl.
Unfortunately my home workstation seems to have developed a faulty DIMM or something (unrelated note of note).
Losing a chunk.
So i've mentioned it a few times on here - i've been steadily losing weight since February. It's settled now at about 74kg - in part because it just seemed to stop on its own and it's enough off that i'm no longer trying to actively lose weight. It is surprising to me how little food it takes to maintain this so far but i haven't been exercising a lot either.
I'm not sure if the initial trigger was all mental or all physical but in part the shock of getting gout and a full freezer and pantry bereft of 'gout friendly' foods kicked off a period of simply being able to lose weight on whim. I guess I somehow shrank my stomach enough to change how my body detects hunger and once it started I just ran with it - i can't say I tried too hard but I wanted to lose a bit for years. I still get hungry, it just doesn't really bother me like it used to and turn into an overwhelming need to eat. I did a few experiments in the midst to see how little I could get away with and it wasn't hunger that drove me to eat but a sore stomach.
At 92kg I was sadly bang on average for a bloke my age; but that is a lot more than my body can take. Although the belt was the first indication of making progress it wasn't till i broke under about 84kg that I looked (to myself) like I was making any progress; that's probably about when my waist to hip ratio broke under 1.0. After that each kg seemed noticeable one way or another.
I don't feel great or anything, but I would be lying if i didn't say I felt `less shit'.
Looking more like a Clark every year (matriarchal ancestry).
A couple of friends (but thankfully not all!) are already saying I "need to eat" like i'm underweight or something. I'm quite some way from that even if i'm below the average these days. I was skinny until I got a job in the city and started having rich lunches and regular Friday drinks together with a shorter commute cycle and it's slowly accumulated since.
The All Fat Diet
SBS has had a series of diet/health related shows on recently and out of curiosity I looked up some of the stuff i've been typically eating.
Apart from a very low total number of joules, around 2/3 of the energy was from fats and oils. Protein was low. I think "diet", "light", "lite" and "low fat" foods are complete nonsense so I certainly wasn't having any of that.
Just bread and butter ended up being a big part of what I was eating (and i don't hold back on the butter). A few spuds and some rice at times. Lots of nuts, mostly almonds. Matured cheddar cheese. Lemons and limes as they are in season. Almost no meat, some but not much grog. Lots of coffee (usually black+none) and tea (green, or white+1/4). But I also had what herbs i could find in the garden mostly in tom-yum-like soup, and plenty of chillies and other random stuff along the way (the chillies were important in one critical way beyond making the bland more palatable). Lots and lots of water.
This isn't what I had all of the time but it was in the majority. It's also not something i'll be continuing but it was certainly effective at losing some weight this time.
One thing I did notice is that big dinner means big morning hunger. I'd already noticed this before but i'm now more convinced of it. If you ignore it it just goes away but it's also easy to eat less at night.
A wretched hive of scum and villainy
This episode of gout also reinforced just how useless the internet has become as a source of general information. Almost all the gout 'advice' is useless, being generous. Even the stuff from established medical sources wasn't terribly applicable to me due to being neither 70 nor obese.
For example low-fat is always recommended: but it's never specified whether that has anything (at all) to do with gout or merely just being overweight.
Anyway i'm glad I lost the weight, I wonder if it will stay off this time?
Update: Well 6 weeks later and it's still dropping - 72-73kg now. I'm eating properly now too - probably better than i have for years and certainly not going hungry.
Update: 14 weeks from the first post and it kept dropping slowly - bang on 70kg has been the base for a couple of weeks now. Damn I haven't been this skinny in a long long time and i like returning to it. Except the fucking arthritis keeps coming back. So despite the good, actually the present is very dark and the future looks even more grim.
all gout, all the time
So my sore foot just kept getting worse despite all efforts of rest so I returned to a doctor. One quick look and he just said 'gout' and prescribed some drugs.
I was a little sceptical and the list of side-effects of the colchicine was a little off-putting but I gave it a go. Well within a few hours all the pain was gone and within a day so was the redness and swelling of both feet.
I guess what I had the last couple of winters was also gout - even though it didn't really appear that way.
Drugs were the last thing I wanted but lifestyle wasn't doing it so that's that I guess. It's probably still going to take a while to get the dosages and medications correct but at least this rules out everything else and has me mobile enough to get back to living.
Despite the weather last weekend I hit the road for a ride intending to just visit friends but nobody was home so ended up doing the 65km round-trip city to coast triangle. It was cold and windy and I took it pretty easy (i'd only just taken the drugs a couple of days before) so it took me over 3 hours and fortunately I missed the rain. Despite freezing my knees and toes (the rest was rugged up adequately) it was more enjoyable than I expected.
Now, if only winter would end ... it's been bitterly cold this year.
Update: Through the last 3 weeks i had some symptoms return a couple of times. Taking some colchicine cleared it up and it seems to be reducing in frequency and intensity ... but yeah it's still around and that colchicine is not good stuff. I'm not really sure the allopurinol is helping or hurting just yet, or if diet is still an issue or not, or really resolved anything; something for the next dr visit. But apart from one day a week ago i've been mobile enough to live normally; although it's been cold, wet, and pretty bloody dull for the most part so it hasn't made much difference. At least the wet has cut the edge from the bitter cold so it feels like winter is finally on its receding edge.
Update 2: I went back to a doc and he took me off the allopurinol. That seems to have been keeping the gout going. So after a week or so it's cleared up and i've not had an attack since. It's still a bit sore and not fully vanished but it's the best it's been for months and now i'm doing enough just to get sore from doing too much. I'm pretty much eating normally but i haven't tried grog yet.