About Me
Michael Zucchi
B.E. (Comp. Sys. Eng.)
also known as Zed
to his mates & enemies!
< notzed at gmail >
< fosstodon.org/@notzed >
Aparapi on HSA on Slackware on Kavaeri on ASROCK
Although i've been waiting with bated breath for HSA to arrive ... the last I heard about a month ago via the aparapi mailing list was that the drivers weren't quite ready yet. So I was content to wait patiently a bit longer. Then somehow the first I heard that the alpha became available from one of the few comments on this blog and apparently it's been out for a few weeks. I couldn't find any announcement about it?
So yesterday before I went out and this morning I followed Linux-HSA-Drivers-And-Images-AMD and SettingUpLinuxHSAMachineForAparapi trying to get something working. As i'm using a different motherboard and OS it was a little more involved although I made it more involved than it should've been by making a complete pigs breakfast out of every step along the way due to being a bit out of practice.
But after getting a working kernel built and X sorted I just ran the test example a few seconds ago:
$ ./runSquares.sh
using source from Squares.hsail
0->0, 1->1, 2->4, 3->9, 4->16, 5->25, 6->36,
;7->49, 8->64, 9->81, 10->100, 11->121,
;12->144, 13->169, 14->196, 15->225, 16->256,
;17->289, 18->324, 19->361, 20->400, 21->441,
;22->484, 23->529, 24->576, 25->625, 26->676,
;27->729, 28->784, 29->841, 30->900, 31->961,
;32->1024, 33->1089, 34->1156, 35->1225, 36->1296,
;37->1369, 38->1444, 39->1521,
PASSED
$
I'm presuming 'PASSED' means it worked.
I'm not sure how much i'll do today but i'll next look at the hsa branch of aparapi, sumatra?, and then I want to look a bit closer. I haven't been able find much detailed technical documentation yet but there is the kernel driver at least now and hopefully it's coming soon.
On Slackware
I'm using the ASROCK FM2A88X-ITX+ motherboard with Slackware64 14.1 and using the DVI and HDMI outputs in a dual-head configuration. Just getting Slackware 14.1 working on it reliably required a BIOS upgrade but i'm not sure what version it is right now.
To compile a fresh checkout of the correct kernel I tried the supplied kernel config file 3.13.0-config
at first but that didn't work it just hung on the loading kernel line from elilo. After a couple of aborted attempts I managed to get a working kernel by starting with /boot/config-generic-3.10.17
as the .config
file, running make oldconfig
and holding down return until it finished to accept all the defaults, then using make xconfig
to make sure my filesystem driver wasn't a module (which i of course forgot the first time).
Getting dual-screen X was a bit confusing - searches for xorg.conf configuration is pretty much a waste of time I think mostly because every config file is filled with non-important junk. But I finally managed to get it going even if for whatever reason it comes up in cloned mode but I can fix it manually running xrandr after i login. Because I'm not ready to make this permanent is good enough for me. As I was previously using the fglrx driver I had initially forgotten to de-blacklist the radeon kernel module but that was an easy fix.
This is how I set up the screen config.
$ xrandr --output HDMI-0 --right-of DVI-0
I'm not ready to make this my system yet because afaik OpenCL isn't available for this driver interface yet. Although the Okra stuff includes libamdhsacl64.so
so presumably it isn't too far away.
Aparapi
I got aparapi going quite easily.
But beware, don't run '. ./env.sh' directly to start with - any error and it just closes your shell window! So test with 'sh ./env.sh' until it passes it's checks.
I used the ant that comes with netbeans and I already had AMD APP SDK 2.9 and Java 8 installed.
Not sure if it's needed but I noticed a couple of variables were blank so I set them in env.sh.
export APARAPI_JNI_HOME=${APARAPI_HOME}/com.amd.aparapi.jni
export APARAPI_JAR_HOME=${APARAPI_HOME}/com.amd.aparapi
Once env.sh was sorted it built in a few seconds and the mandelbrot demo ran in suitably impressive fashion.
Well this should all keep me busy for a while ...
JNI, memory, etc.
So a never-ending hobby has been to investigate micro-optimisations for dealing with JNI memory transfer. I think this is at least the 4th post dedicated soley to the topic.
I spent most of the day just experimenting and kinda realised it wasn't much point but I do have some nice plots to look at.
This is testing 10M calls to a JNI function which takes an array - either byte[] or a ByteBuffer. In the first case these are pre-allocated outside of the loop.
The following tests are performed:
- Elements
-
Uses Get/SetArrayElements, which on hotspot always copies the memory to a newly allocated block.
- Range alloc
-
Uses Get/SetArrayRegion, and inside the JNI code always allocates a new block to store the transferred data and frees it on exit.
- Critical
-
Uses Get/ReleasePrimitiveArrayCritical to access the JVM memory directly.
- ByteBuffer
-
Uses the JNIEnv entry points to retrieve the memory base location and size.
- Range
-
Uses Get/SetArrayRegion but uses a pre-allocated (bss) buffer.
- ByteBuffer field
-
Uses GetLongField and GetIntField to retrieve the package/private address and size values directly from the Buffer object. This makes it non portable.
I'm running it on a Kaveri APU with JDK 1.8.0-b129 with default options. All plots are generated using gnuplot.
Update: I came across this more descriptive summary of the problem at the time, and think it's worth a read if you're ended up here somehow.
Small arrays
The first plot shows a 'no operation' JNI call - the pointer to the memory and the size is retrieved but it is not accessed. For the Range cases only the length is retrieved.
What can be seen is that the "ByteBuffer field" implementation has the least overhead - by quite a bit compared to using the JNIEnv entry points. From the hotspot source it looks like they perform type checks which are adding to the cost.
Also of interest is the "Range alloc" plot which only differs from the "Range" operation by a malloc()/free() pair. i.e. the JNI call invocation overhead is pretty much insignificant compared to how willy-nilly C programmers throw these around. This is also timing the Java loop as well of course. The "Range" call only retrieves the array size in this case although interestingly that is slower than retrieving the two fields.
The next series of plots are for implementing a dummy 'load'. The read load is to add up every byte in the array, and the write load is to write the array index to the array. It's not particularly important just that it accesses the memory.
Well, they're all pretty close and follow the overhead plot as you would expect them to. The only real difference is between the implementations that need to allocate the memory first - but small arrays can be stored on the stack 'for free'.
The only real conclusion is: don't use GetArrayElements() or malloc space for short arrays!
Larger arrays
This is the upper area of the same plots above.
Here we see that by 8K the overhead of the malloc() is so insignificant to the small amount of work being performed that it vanishes from the time - although GetArrayElements() is still a bit slower. The Critical and field-peeking ByteBuffer edge out the rest.
And now some strange things start to happen which don't seem to have an obvious reason. Writing the data to bss and then copying it using SetArrayRegion() has become the slowest ... yet if the memory is allocated first it is nearly the fastest?
And even though the only difference between the ByteBuffer variants is how it resolves Buffer.address and Buffer.capacity ... there is a wildly different performance profile.
And now even more weirdness. Performing a read and then a write ... results in by far the worst performance from accessing a ByteBuffer using direct field access, yet just about the best when going through the JNIEnv methods. BTW the implementation rules out most cache effects - this is exactly the same memory block at exactly the same location in each case, and the linearity of the plot shows it isn't size related either.
And now GetArrayElements() beats GetArrayRetion() ...
I have no idea on this one. I re-ran it a couple of times and checked the code but perhaps I missed something.
Dynamic memory
Perhaps it's just not a very good benchmark. I also tried an extreme case of allocating the Java memory inside the loop - which is another extreme case. At least these should give some bracket.
Here we see Critical running away with it, except for the very small sizes which will be due to cache effects. The ByteBuffer results show "common knowledge" these things are expensive to allocate (much more so than malloc) so are only suitable for long-lived buffers.
Again with the SetArrayRegion + malloc stealing the show. Who knows.
It only gets worse for the ByteBuffer the more work that gets done.
The zoomed plots look a bit noisy so i'm not sure they're particularly valid. They are similar to the pre-allocated version except the ByteBuffer versions are well off the scale at that size.
After all this i'm not sure what conclusions to draw. Well for one OpenCL has so many other overheads I don't think any of these won't even be a rounding error ...
Invocation
I also did some playing around with native method invocation. The goal is just to get a 'pointer' to a native resource in the JNI and just to compare the relative overheads. The calls just return it so it isn't optimised out. Each case is executed for 100M times and this is the result of a fourth run.
- call
-
This is what I used in zcl. An object method is invoked and the instance method retrieves the pointer from 'this.p'.
- calle
-
The same but the call is wrapped in a try { } catch { } with in the loop and the method declares it throws an exception.
- callp
-
An instance method where an anonymous pointer is passed to the JNI.
- calls
-
A static method which takes the object as a parameter. The JNI retrieves 'this.p'.
- callsp
-
This is the commonly used approach whereby an anonymous pointer is passed as a parameter to a static method.
The three types are the type of pointer. I was going to test this on a 32-bit platform but ran out of steam so the integers don't make much difference here. int and long are just a simple type and buffer stores a 'struct' as a ByteBuffer. This latter is how I originally implemented jjmpeg but clearly that was a mistake.
Results
type call calle callp calls callsp
int 1.062 1.124 0.883 1.100 0.935
long 1.105 1.124 0.883 1.101 0.936
buffer 5.410 5.401 2.639 5.365 2.631
The results seemed pretty sensitive to compilation - each function is so small so there may be some margin of error.
Anyway the upshot is that there's no practical performance difference across all implementations and so the decision on which to use can be based on other factors. e.g. just pass objects to the JNI rather than the mess that passing opaque pointers create.
And ... I think that it might be time for me to leave this stuff behind for good.
javafx on parallella via remote X, with added opencl
After the signoff on the last post I thought maybe I had spoken too soon about being able to write front-end code for the parallella in Java. I couldn't get JavaFX doesn't to work via remote X on ARM but writes directly to the framebuffer. While that's a nice think to have it doesn't help me. Always one step forward two steps back eh (I mean I could always just use swing but where's the fun in that).
The gl screensavers and other X11 stuff worked ok. I couldn't find any useful mention of it on the internets until I had a look in jira (openjdk bug system). Came across Monocle which I thought was worth a look. It didn't look all that promising until I saw it was already built-in to the ARM builds.
And ... it worked. Well I guess that's something. I had to add the following to the command line to get javafx working over remote X from an arm box:
-Djavafx.platform=monocle -Djavafx.order=sw
I 'fixed' some coprthr peculiarities in the opencl code and got it running (but only on the arm cpu so far):
(Not sure what blogger is doing, it doesn't look that bad even with my shitty palette).
At first the UI was very unresponsive then I discovered that clEnqueueNDRangeKernel or clEnqueueReadBuffer runs synchronously - which is obviously not something to be doing from an animation handler. I chucked that through an Executor and whilst it wont win any land-speed records it is now basically usable.
zcl 0.3 released
I just dumped another snapshot to the zcl home page.
Notes are there but a little more on the array stuff.
Basically it just lets you pass a java array instead of a Buffer to the read/write buffer/image commands. It supports non-blocking operations.
I was looking through the aparapi source and noticed it was using GetPrimitiveArrayCritical and ReleasePrimitiveArrayCritical in places where I didn't think you should, so I tried using that for pinning the array rather than GetArrayElements which I tried last time (and decided not to include). This should be more efficient because the *Elements calls all seem to allocate memory and create a copy.
A really dumb test case bears this out and i'm getting a 3-4x performance improvement. Yay! I think that's enough to include into the api now so I did - even though it bloats out the api considerably with 7 overloaded entry points for each such method (Buffer, byte, short, int, long, float, double).
But there's something of a problem in that because of the asynchronous nature of the non-blocking commands the pinning required must be for an indeterminate time. Which is explained as undesirable in the JNI docs but without a deeper knowledge is hard to know if it matters on real jvms.
I'm using an event listener to find out when a non-blocking job is complete and then releasing the array as soon as possible but this adds measurable overhead and other potential complications. I may try to implement a more explicit management mechanism and see if that makes a lot of difference but it would have to be significant to be worth the extra hassles involved in using such an api.
But until I have a use for zcl, ... it might all just be on hold for the moment, because ...
elf-loader
Something I mentioned earlier was revisiting the elf-loader code with a different set of goals. And so i'm thinking of pausing this OpenCL stuff and moving in that direction. Right now apart from OpenCL there's no way to access the epiphany chips from Java and I for one have no interest in doing any frontend stuff in C (there is some work on the sumatra thing but that could be a while and serves a different purpose anyway).
The more I think about it the more it seems like the elf-loader code should provide a pretty good basis for a decent epiphany runtime that still lets you write the low-level code on the device but creates an easier way to manage their code compared to a fugly linker script and an object format that was designed for a system with processes running in virtual memory.
Quite a bit of work though so i've gotta be peachy keen to get it started.
Update: I found out that I was completely wrong on the way aparapi uses Get/Release PrimitiveArrayCritical() and as I originally thought holding a critical array across jni calls and/or threads is probably not a very good idea at all: even if it works. It basically just locks GC completely (and not much else). So I will probably have to resort to using a temporary malloc() buffer to honour the api for non-blocking transfers. That's if i don't just revert it all out of existence. I may not even be able to do use critical access even for the blocking transfers due to supporting java native kernels and various notification callbacks.
Return of the king
Spent the afternoon and evening drinking in town.
One interesting part was when Jay Weatherall (the current Premier of the state) and a small entourage sat next to me at the bar @ The Exeter. As he was on the clock he and his wife stuck to a single soda water each which was a bit weak but understandable (state election is next Saturday). He was talking to some Scottish bloke ... that turned out to be Billy Bragg according to overheard gossip from the barman after they left. It's been probably 20 years since I saw him in a video clip so I didn't recognise him at all and just thought he was another politician (the mention of appearing on Q and A and Rockwiz in the next week or so had me confused mind you). I didn't really listen in too much but what I heard sounded interesting enough if a little dire (underclass oppression and that kinda stuff). Jay is Labor which is not pinko enough for me these days but he seems like a nice enough bloke from what little i've seen (being in a minor state with mostly centralised networked news services across the country we barely get any local coverage these days - even if i did watch it. Which I don't.). I kept to myself but an after-the-fact good luck to him anyway.
Bloody nice weather for autumn mind you. 35 and overcast which takes the sting out of the full sun - although one has to be careful not to get burnt severely, barely a breeze, a bit humid which isn't an every-day occurrence in the driest state in the world. Mowed the lawn after I woke up from my birds-are-singing all nighter hacking the parallella the `night' before and then had some lunch before heading out on my bike. By 10pm it just started to cool down a little but was still very pleasant and hasn't really changed since - if i were 10 years younger and my headlight was fully charged I might have headed out for an evening ride instead of just slinking home to some junk food and writing a pointless blog post. I was surprised how feral The Austral got by the time I left though, I dropped by to have a traveller before I rode home after spending most of the evening at The Exeter and it felt like it was 2 in the morning or something. A little bit too 'rough' for my sensibilities, but maybe The Adelaide Cup had something to do with that because it is pretty much a carnival of bogan and some of the revellers end up in town afterwards (horse race == public-holiday around here, how fucked is that). Tis ok during the day though.
womadelaide was on this weekend so there were plenty of people about (like politicians and singers). Given I was a hermit most of the summer for various personal reasons and the fact that i'm starting work soon, I thought I should make the effort and head out whilst the weather's holding. Twas good overall and hopefully i'm not hungover for a fifth day straight tomorrow.
Update: So a few days later ... something a bit strange. I had to go to the city to the dentist before 9am and as I passed one of the busiest intersections in the whole city I saw some middle aged smurfs wildly waving placards from the median strip. Between them was the local Liberal candidate waving madly at passing cars. Bit stupid and quite dangerous. But then again all I know of her is from the ABC election coverage a few years ago: apparently she used to run a hair salon (but the way the said it made it sound like she ran a brothel!). Obviously that qualifies you around here.
But today it's polling day and a thunderstorm woke me up a bit earlier than i'd hoped so I might get voting out the way straight up.
parallella opencl some success
So far only with the ARM driver but it's something.
After the previous previous post I spotted a new rootfs from Adapteva so I dropped that onto my sdcard. Unfortunately ... I dropped it onto the wrong sdcard and overwrote my original one. It's been a few days of stupid mistakes. Somehow I ended up with a running-enough system.
I found a pile of pointer size related bugs in zcl (obvious in hindsight) and made a bit of a mess trying to track down a bug that i'm pretty sure is just in the OpenCL driver (although it was a function that should've just returned not-supported so it isn't expected to work). There are some other missing bits and unexplained problems and it looks like only OpenCL 1.0 is partially implemented to start with (despite what cl.h claims) so I had to add some more version filtering. I think 'needs work' might be an apt description.
I just tried the epiphany driver but i just got unending reams of stuff about registers so i can't tell if anything worked (actually it's pretty clear it didn't run beyond compilation).
While I was doing the rootfs thing I also managed to compile the sdk on my main machine (more specifically just the GNU toolchain). I tried building it in a separate build directory the GNU-way but that doesn't seem to be supported and I just had to use the silent build script instead.
Unfortunately none of this trouble-shooting and repeated stupid mistakes really got me anywhere or was particularly fun. I think it was just the morbid curiosity that kept me poking until beyond 6am (15 minutes ago).
Another go at a smoothly loading list of icons
Due to the lack of streaming support in MediaPlayer I've given up on the idea of a javafx internet radio player for the moment but I thought perhaps I could make a remote interface for the android one I have. This is something I could actually use since I don't have my speakers plugged into my workstation.
First thing as ever is that scrolly list of stations with pictures. I think I finally have a tidy solution using JavaFX implemented using basic Java classes. It only requires a little bit more code than the custom ListCell required anyway.
First, the data item. I load this using an executor asynchronously.
class Station {
String title;
String thumb;
}
Then a cache based on a LinkedHashMap - which simply requires implementing removeEldestEntry. I think an important detail I missed last time I tried this was cancelling any still-loading image here directly. Without that it can end up with a queue of images to load that will never be seen and this adds an unnecessary delay in loading those that are currently being shown.
class ImageCache extends LinkedHashMap<String, Image> {
final int limit;
ImageCache(int limit) {
this.limit = limit;
}
protected boolean removeEldestEntry(Map.Entry<String, Image> eldest) {
if (size() > limit) {
eldest.getValue().cancel();
return true;
}
return false;
}
}
And finally the custom cell itself. It fades the image in once it's loaded although to be honest i'm not sure if that feels right on a desktop machine. After playing with it a bit it feels like it could cause eye fatigue by drawing the eye to multiple parts of the screen for much longer than necessary. But it does appear 'smoother' this way.
// Copyright 2014 Michael Zucchi
// This code is covered by the GNU General Public License
// version 3, or later.
class StationListCell extends ListCell<Station> {
ImageCache cache = new ImageCache(32);
ImageView iv;
FadeTransition fadein;
StationListCell() {
iv = new ImageView();
iv.setFitWidth(128);
iv.setFitHeight(64);
iv.setPreserveRatio(true);
setGraphic(iv);
}
protected void updateItem(Station t, boolean empty) {
super.updateItem(t, empty);
if (t != null) {
setText(t.title);
if (t.thumb != null) {
Image icon = cache.get(t.thumb);
if (icon == null) {
if (fadein != null) {
fadein.stop();
fadein = null;
}
icon = new Image(t.thumb, 128, 64, true, true, true);
icon.progressProperty().addListener((ObservableValue<? extends Number> ov,
Number t1, Number t2) -> {
if (t2.doubleValue() == 1.0) {
fadein = new FadeTransition(Duration.millis(250), iv);
fadein.setFromValue(0);
fadein.setToValue(1);
fadein.play();
}
});
}
cache.put(t.thumb, icon); // update lru
iv.setImage(icon);
}
} else {
iv.setImage(null);
}
}
}
}
I think last time I tried it I forgot to write to the cache every time an entry is accessed - i.e. to update the LRU order. It's still a bit more code than i'd really like but given that it's in one language in one place it's probably about as concise as can be expected.
Actually there is still some weird JavaFX bug in that the label text jumps to the left once the image is loaded - which makes no sense given the properties on the ImageView and the constructor arguments to the background loaded Image. But this can be fixed by placing the ImageView inside an appropriately configured Region like an AnchorView.
I also played with a busy animation while it was loading but that just looked naff.
This appears very nice and smooth as you scroll through the list. The only obvious time it drops some frames is when the list is initially populated with setItems(ObservableList) (which is still a bit unfortunate).
The Java Heap vs native heap
I was curious about whether the size-limited cache was worth putting in at all rather than simply using the mechanism to lazily load every image (or ... just use the Image background loading feature directly). So I tried some profiling The images are fixed at 128x64 pixels so they're not particularly big, and I dunno there's a few dozen stations.
Using lazy loading the JVM maxes out at 10MB of active heap. A 32-element cache required about 7MB. So that seems a sizable benefit.
However ...
The process itself requires about 150MB total (top - RES) so it's pretty insignificant in the grand scheme of things. Using just the interpreter drops this down to 100MB but that's not much use. Classes, compilers, compiled code, X, GL and other native resources really chew it up.
A typical C++ application probably needs comparable total memory to exist, sans the compiler, but a lot of that is just in shared libraries so it can ameliorated by sharing across applications. Although in reality with so many libraries needed to do anything in C or C++ and such an ill-defined "platform" as GNU/Linux there is much much less sharing going on than there could be (it's certainly no AmigaOS). One of the trade-offs of using a jit compiler and dynamic runtime is that compiled code can't really be shared unless multiple applications run in the same jvm (i've seen mention of such a feature but it isn't here yet - it's certainly technologically possible).
But yeah, memory is cheap and the main memory factor in most applications is the data itself which is much the same regardless of language (unless it doesn't support primitive-type arrays).
Strange surprise for the day.
linaro-ubuntu-desktop:~/src/zcl> PATH=/usr<tab>
PATH=/ not found
Hmmm?
linaro-ubuntu-desktop:~/src/zcl> echo $SHELL
/bin/tcsh
linaro-ubuntu-desktop:~/src/zcl>
tcsh? Oh wow. Weird. People still use that? Actually I know a sysadmin who still uses it but he still thinks SunOS 4 was the epitome of operating systems and he wont touch GNU/Linux with a barge pole (SomeBSD all the way for him). And yeah i'm also a little glad he doesn't admin anything i use ;-) Just teasing, although i'm a little baffled how he can do his job effectively without Bourne Shell scripts.
It's been so long it slipped my mind that '> ' implies a user-login csh when I saw the different prompt on the 131224 build of the parallella os. I think the last time I used tcsh was when I was at uni (20 years ago) until GNU/Linux pretty much forced me to use bash. Which i'm glad of because not only is it easier to learn and use, it's also more consistent and powerful.
Each to their own I guess/stranger things have happened/easily fixed, but still a surprise.
I should be out enjoying another unseasonally hot autumn day but following insufficient sleep after crashing rather late following a visit to friends yesterday arvo it's all a bit too hard. So i'm just having a quick poke at OpenCL on parallella again although some more sleep isn't very far off.
...
So I just now discovered something that means I messed up a bit. The 131224 OS looks to be the same linaro version with the same sdk as the previous image I was using. It's just using a different uboot, linux, and configured to start more desktop stuff. Or in other words all the stuff I changed or removed yesterday ... But more importantly it means the whole point of the exercise was missed: the binary dist of coprthr still isn't going to work.
Oops, well now don't i feel like a bit of a drop-kick. That was a good waste of a couple of afternoons so it might be a good idea to leave that topic for a bit and try a different approach next time.
Maybe revisit the elf-loader code I was experimenting with but work on creating a way of accessing it from Java. I have no trouble with C or even ASM hacking but writing a GUI in C or even touching C++ isn't something I'm going to do for fun and processing pam files from a command line isn't very interesting. Having an esdk analogue for Java could be quite useful actually now I think about it.
Otherwise I'm probably going to have to bite the bullet and work on setting up a dev env on my workstation.
Copyright (C) 2019 Michael Zucchi, All Rights Reserved.
Powered by gcc & me!