Other JNI bits/ NativeZ, jjmpeg.

Yesterday I spent a good deal of time continuing to experiment and tune NativeZ. I also ported the latest version of jjmpeg to a modularised build and to use NativeZ objects.

Hashing C Pointers

C pointers obtained by malloc are aligned to 16-byte boundaries on 64-bit GNU systems. Thus the lower 4 bits are always zero. Standard malloc also allocates a contiguous virtual address range which is extended using sbrk(2) which means the upper bits rarely change. Thus it is sufficient to generate a hashcode which only takes into account the lower bits (excluding the first 4).

I did some experimenting with hashing the C pointer values using various algorithms, from Knuth's Magic Number to various integer hashing algorithms (e.g. hash-prospector), to Long.hashCode(), to a simple shift (both 64-bit and 32-bit). The performance analysis was based on Chi-squared distance between the hash chain lengths and the ideal, using pointers generated from malloc(N) for different fixed values of N for multiple runs.

Although it wasn't the best statistically, the best performing algorithm was a simple 32-bit, 4 bit shift due to its significantly lower cost. And typically it compared quite well statistically regardless.

static int hashCode(long p) {
    return (int)p >>> 4;
}

In the nonsensical event that 28 bits are not sufficient for the hash bucket index it can be extended to 32 bits by shifting before the cast:

static int hashCode(long p) {
    return (int)(p >>> 4);
}

And despite all the JNI and reflection overheads, using the two-round function from the hash-prospector project increased raw execution time by approximately 30% over the trivial hashCode() above.

Whilst it might not be ideal for 8-byte aligned allocations it's probably not that bad either in practice. One thing I can say for certain though is NEVER use Long.hashCode() to hash C pointers!
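To make the difference concrete, here is a minimal sketch that counts how many buckets of a power-of-two table get used by simulated 16-byte aligned pointers. The class name, the base address and the direct masking of the hash are all assumptions for illustration; a real HashMap applies a further spreading step on top of hashCode(), which this ignores.

```java
import java.util.BitSet;

final class PointerHashDemo {
    // The shift hash from the text: drop the 4 always-zero alignment bits.
    static int shiftHash(long p) {
        return (int) p >>> 4;
    }

    // Count how many of `buckets` slots (a power of two) receive at least one key.
    static int usedBuckets(long base, int count, int buckets, boolean useShift) {
        BitSet used = new BitSet(buckets);
        for (int i = 0; i < count; i++) {
            long p = base + 16L * i; // simulated 16-byte aligned malloc results
            int h = useShift ? shiftHash(p) : Long.hashCode(p);
            used.set(h & (buckets - 1));
        }
        return used.cardinality();
    }

    public static void main(String[] args) {
        long base = 0x5581_2340_0000L; // hypothetical heap start address
        System.out.println("shift:         " + usedBuckets(base, 1024, 1024, true));
        System.out.println("Long.hashCode: " + usedBuckets(base, 1024, 1024, false));
    }
}
```

With contiguous aligned addresses like these the shift fills every bucket, while Long.hashCode() can only ever reach one bucket in sixteen because the low 4 bits of the index never change.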

Concurrency

I also tuned the use of synchronisation blocks very slightly to make critical sections as short as possible whilst maintaining correct behaviour. This made enough of a difference to be worth it.

I also tried more complex synchronisation mechanisms - read-write locks, hash bucket row-locks and so on, but they were at best a bit slower than using plain synchronized blocks.

The benchmark I was using wasn't particularly fantastic - just one thread creating 10^7 `garbage' objects in a tight loop whilst the cleaner thread freed them. No resolution of existing objects, no multiple threads, and so on. But apart from the allocation rate it isn't an entirely unrealistic scenario either and I was just trying to identify raw overheads.

Reflection

I've only started looking at the reflection used for allocating and releasing objects on the Java side, and in isolation these are the highest costs of the implementation.

There are ways to reduce these costs but at the expense of extra boilerplate (for instantiation) or memory requirements (for release).

Still ongoing. And whilst the relative cost over C is very high, the absolute cost is still only a few hundred nanoseconds per object.

From a few small tests it looks like the maximum I could achieve is a 30% reduction in object instantiation/finalisation costs, but I don't think it's worth the effort or overheads.
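As a rough illustration of the instantiation trade-off, the sketch below contrasts a reflective constructor call with a per-class factory. It is not the NativeZ code: Frame, viaReflection and viaFactory are hypothetical names, and the only assumption carried over is that a wrapper object is built from a C pointer held in a long.

```java
import java.lang.reflect.Constructor;
import java.util.function.LongFunction;

final class InstantiateDemo {
    // Hypothetical wrapper type holding a C pointer.
    static final class Frame {
        final long p;
        Frame(long p) { this.p = p; }
    }

    // Reflective path: nothing to write per class, but every call pays for
    // the constructor lookup and the reflective invocation.
    static <T> T viaReflection(Class<T> type, long p) {
        try {
            Constructor<T> c = type.getDeclaredConstructor(long.class);
            c.setAccessible(true);
            return c.newInstance(p);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Factory path: cheaper per call, but each class must supply (and
    // register somewhere) its own factory, which is the extra boilerplate.
    static <T> T viaFactory(LongFunction<T> factory, long p) {
        return factory.apply(p);
    }
}
```

A caller would use `viaReflection(Frame.class, p)` in the first style and `viaFactory(Frame::new, p)` in the second.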

Makefile foo

I'm still experimenting with this. I used some macros and implicit rules to get most things building ok, but i'm not sure if it couldn't be better. The basic makefile is working ok for multi-module stuff so I think i'm getting there. Most of the work is just done by the jdk tools as they handle modules and so on quite well and mostly dictate the disk layout.

I've broken jjmpeg into 3 modules - the core, the javafx related classes and the awt related classes.

GC JNI, HashTables, Memory

I had a very busy week with work working on porting libraries and applications to Java modules - that wasn't really the busy part, I also looked into making various implementations pluggable using services and then creating various pluggable implementations, often utilising native code. Just having some (much faster) implementation of parts also opened other opportunities and it sort of cascaded from there.

Anyway along the way I revisited my implementation of Garbage Collection with JNI and started working on a modular version that can be shared between libraries without having to copy core objects, and then along the way found bugs and things to improve.

Here are some of the more interesting pieces I found along the way.

JNI call overheads

The way i'm writing jni these days is typically to just write the method signature as if it were a Java method and mark it native, and let the jni handle the Java to C mappings directly. This is different to how I first started doing it and flies in the face of the convention i've typically seen amongst JNI implementations, where the Java just passes the pointers as a long and has a wrapper function which resolves these longs as appropriate.

The primary reason is to reduce boilerplate and significantly simplify the Java class writing without having a major impact on performance. I have done some performance testing before but I re-ran some tests and they confirm the design decisions used in zcl for example.

Array Access

First, I tested some mechanisms for accessing arrays. I passed two arrays to a native function and had it perform various tests:

0: No op;
1: GetPrimitiveArrayCritical on both arrays;
2: GetArrayElements for read-only arrays (call Release(Abort));
3: GetArrayElements for read-only on one array and read-write on the other (call Release(Abort, Commit));
4: GetArrayRegion for read-only, to memory allocated using alloca;
5: GetArrayRegion and SetArrayRegion for one array, to memory using alloca;
6: GetArrayRegion for read-only, to memory allocated using malloc;
7: GetArrayRegion and SetArrayRegion for one array, to memory using malloc.

I then ran these tests for different sized float[] arrays, for 1 000 000 iterations, and the results in seconds are below. It's some intel laptop.

  len  NOOP         Critical     Elements                  Region/alloca             Region/malloc
       0            1            2            3            4            5            6            7
    1  0.014585537  0.116005779  0.199563981  0.207630731  0.104293268  0.127865782  0.185149189  0.217530639
    2  0.013524620  0.118654092  0.201340322  0.209417471  0.104695330  0.129843794  0.193392346  0.216096210
    4  0.012828157  0.113974453  0.206195102  0.214937432  0.107255090  0.127068808  0.190165219  0.215024016
    8  0.013321001  0.116550424  0.209304277  0.205794572  0.102955338  0.130785133  0.192472825  0.217064583
   16  0.013228272  0.116148320  0.207285227  0.211022409  0.106344162  0.139751496  0.196179709  0.222189471
   32  0.012778452  0.119130446  0.229446026  0.239275912  0.111609011  0.140076428  0.213169077  0.252453033
   64  0.012838540  0.115225274  0.250278658  0.259230054  0.124799171  0.161163577  0.230502836  0.260111468
  128  0.014115022  0.120103332  0.264680542  0.282062633  0.139830967  0.182051151  0.250609001  0.297405818
  256  0.013412645  0.114502078  0.315914219  0.344503396  0.180337154  0.241485525  0.297850562  0.366212494
  512  0.012669807  0.117750316  0.383725378  0.468324904  0.261062826  0.358558946  0.366857041  0.466997977
 1024  0.013393850  0.120466096  0.550091063  0.707360155  0.413604094  0.576254053  0.518436072  0.711689270
 2048  0.013493996  0.118718871  0.990865614  1.292385065  0.830819392  1.147347700  0.973258653  1.284913436
 4096  0.012639675  0.116153318  1.808592969  2.558903773  1.628814486  2.400586604  1.778098089  2.514406096

Some points of note:

Raw method invocation is around 14 nanoseconds, pretty much irrelevant once you do any work.
Get/SetArrayElements is pretty much the same as using Get/SetArrayRegion with malloc but with less flexibility.
For small arrays 2 calls to malloc/free is nearly 50% of the processing time. Given the gay abandon with which most C programmers throw these around like they cost nothing, the extra JNI overhead is modest.
For larger arrays memcpy time dominates.
For one way transfers shorter than 64 floats, using Get/SetArrayRegion to the stack or pre-allocated memory is the fastest.
For all other cases including any-sized two-way transfers, GetPrimitiveArrayCritical is the fastest. But it has other overheads and isn't always applicable.

I didn't look at ByteBuffer because it doesn't really fit what i'm doing with these functions.

Anyway - the overheads are unavoidable with JNI but are quite modest. The function in question does nothing with the data and so any meaningful operation will quickly dominate the processing time.

Object Pointer resolution

The next test I did was to compare various mechanisms for transferring the native C pointer from Java to C.

I created a Native object with two long fields, final long p and (non-final) long q.

No operation;
C invokes getP() method which returns p;
C invokes getQ() method which returns q;
C access to .p field;
C access to .q field;
The native signature takes a pointer directly, and the Java caller resolves the .p field at the call site;
The native signature takes a pointer directly, and a Java wrapper function resolves the .p field before calling it.

Again invoking it 1 000 000 times.

NOOP         getP()       getQ()       (C).p        (C).q        (J).p        J wrapper
     0            1            2            3            4            5            6
0.016606942  0.293797182  0.294253973  0.020146810  0.020154508  0.015827028  0.016979563

final makes no difference.
method invocation is 15x slower than a field lookup!
Field lookups are much slower in C than Java, but the absolute cost is insignificant at ~2.5nS per lookup.

In short, just passing Java objects directly and having the C resolve the pointer via a field lookup is slightly slower but requires much less boilerplate and so is the preferred solution.
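For reference, the two calling conventions being compared look roughly like this on the Java side. This is only a sketch of the declarations: the class and method names are made up, no native library is loaded, and the C implementations (which would use GetLongField on the object, or take the jlong directly) are not shown.

```java
final class NativeStyles {
    // Hypothetical base object carrying the C pointer.
    static class Native {
        final long p;
        Native(long p) { this.p = p; }
    }

    // Preferred style in this post: pass the Java object as-is and let the
    // C side look up the .p field itself. No Java-side boilerplate.
    static native void processDirect(Native obj);

    // Conventional style: the native method only ever sees a raw long ...
    private static native void process0(long p);

    // ... and a Java wrapper resolves the pointer for every such method.
    static void processWrapped(Native obj) {
        process0(obj.p);
    }
}
```

The wrapper variant needs two Java declarations per native function, which is where the extra boilerplate comes from.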

Logger

After I sorted out the basic JNI mechanisms I started looking at the reference tracking implementation (i'll call this NativeZ from here on).

For debugging and trying to be a more re-usable library I had added logging to various places in the C code using Logger.getLogger(tag).fine(String.format());

It turns out this was really not a wise idea and the logging calls alone were taking approximately 50% of the total execution time - versus java to C to java, hashtable lookups and synchronisation blocks.

Simply changing to use the Supplier versions of the logging functions approximately doubled the performance.

  Logger.getLogger(tag).fine(String.format());
->
  Logger.getLogger(tag).fine(() -> String.format());

But I also decided to just make including any of the code optional by bracketing each call to a test against a final static boolean compile-time constant.
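A small sketch of the three variants, with a counter standing in for the formatting cost. The class name, the logger tag and the describe() helper are assumptions for illustration; the Logger calls themselves are the standard java.util.logging ones.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.logging.Level;
import java.util.logging.Logger;

final class LazyLogDemo {
    // Compile-time constant: javac drops the guarded call entirely when false.
    static final boolean DEBUG = false;

    // Counts how often the (relatively costly) message formatting actually runs.
    static final AtomicInteger formats = new AtomicInteger();

    static String describe(long p) {
        formats.incrementAndGet();
        return String.format("release %016x", p);
    }

    // Eager: the message is formatted even though FINE is disabled.
    static void eager(Logger log, long p) {
        log.fine(describe(p));
    }

    // Lazy: the Supplier is only invoked if FINE is enabled.
    static void lazy(Logger log, long p) {
        log.fine(() -> describe(p));
    }

    // Guarded: no logging call is made at all unless DEBUG is true.
    static void guarded(Logger log, long p) {
        if (DEBUG)
            log.fine(describe(p));
    }

    public static void main(String[] args) {
        Logger log = Logger.getLogger("nativez.demo"); // hypothetical tag
        log.setLevel(Level.INFO);                      // FINE is below INFO, so disabled
        eager(log, 0x1000);
        lazy(log, 0x1000);
        guarded(log, 0x1000);
        System.out.println("format calls: " + formats.get());
    }
}
```

With FINE disabled only the eager call should bump the counter, which is the whole saving.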

This checking indirectly confirmed that the reflection invocations aren't particularly onerous assuming they're doing any work.

HashMap<Long,WeakReference>

Now the other major component of the NativeZ object tracking is using a hash-table to map C pointers to Java objects. This serves two important purposes:

Allows the Java to resolve separate C pointers to the same object;
Maintains a hard reference to the WeakReference, without which they just don't work.

For simplicity I just used a HashMap for this purpose. I knew it wasn't ideal but I did the work to quantify it.

Using jol and perusing the source I got some numbers for a jvm using compressed oops and an 8-byte object alignment.

Object            Size (bytes)  Notes
HashMap.Node      32            Used for short hash chains.
HashMap.TreeNode  56            Used for long hash chains.
Long              24            The node key.
CReference        48            The node value. Subclass of WeakReference.

Thus the incremental overhead for a single C object is either 104 bytes (32 + 24 + 48) when a linear hash chain is used, or 128 bytes (56 + 24 + 48) when a tree is used.

Actually it's a bit more than that because the hashtable (by default) uses a 75% load factor so also allocates 1.5 pointers for each object but that's neither here nor there and also a feature of the algorithm regardless of implementation.

But there are other bigger problems: the Long.hashCode() method just mixes the low and high words together using xor. If all C pointers are 8 (or worse, 16) byte aligned you essentially only get every 8th (or 16th) bucket ever in use. So apart from the wasted buckets the HashMap is very likely to end up using Trees to store each chain.

So I wrote another hashtable implementation which addresses this by using the primitive long stored in the CReference directly as the key, and using the CReference itself as the bucket nodes. I also used a much better hash function. This reduced the memory overhead to just the 48 bytes for the CReference plus a (tunable) overhead for the root table - anywhere from 1/4 to 1 entry per node works quite well with the improved hash function.

This uses less memory and runs a bit faster - mostly because the gc is run less often.
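The general shape of such a table is sketched below. It is not the NativeZ implementation: PointerTable and CRef are stand-in names, the table is a fixed power-of-two size with no resizing, and the hash is a plain 4-bit shift rather than whatever function was actually used. The point is only that the WeakReference subclass carries both the key and the chain link, so no Long or separate node object is allocated.

```java
import java.lang.ref.WeakReference;

final class PointerTable {
    // Stand-in for CReference: the weak reference is also the bucket node.
    static final class CRef extends WeakReference<Object> {
        final long p;  // C pointer, used directly as the key
        CRef next;     // hash chain link

        CRef(Object referent, long p) {
            super(referent);
            this.p = p;
        }
    }

    private final CRef[] table;
    private final int mask;

    // sizePow2 must be a power of two so the index can be taken with a mask.
    PointerTable(int sizePow2) {
        table = new CRef[sizePow2];
        mask = sizePow2 - 1;
    }

    // Assumed hash: drop the alignment bits of the pointer.
    static int hash(long p) {
        return (int) p >>> 4;
    }

    synchronized void put(CRef ref) {
        int i = hash(ref.p) & mask;
        ref.next = table[i];
        table[i] = ref;
    }

    // Returns the live Java object for p, or null if unknown or collected.
    synchronized Object get(long p) {
        for (CRef r = table[hash(p) & mask]; r != null; r = r.next)
            if (r.p == p)
                return r.get();
        return null;
    }

    synchronized boolean remove(long p) {
        int i = hash(p) & mask;
        CRef prev = null;
        for (CRef r = table[i]; r != null; prev = r, r = r.next) {
            if (r.p == p) {
                if (prev == null) table[i] = r.next; else prev.next = r.next;
                r.next = null;
                return true;
            }
        }
        return false;
    }
}
```

Because the table holds the CRef strongly, the weak reference itself stays alive until it is explicitly removed, which covers the second purpose listed above.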

notzed.nativez

So i'm still working on wrapping this all up in a module notzed.nativez which will include the Java base class and a shared library for other JNI libraries to link to which includes the (trivial) interface to the NativeZ object and some helpers to help write small and robust JNI libraries.

And then of course eventually port jjmpeg and zcl to use it.

Bye Bye Jaxby

So one of the biggest changes affecting my projects with Java 11 is the removal of java.xml.bind from the openjdk. This is a bit of a pain because the main reason I used it was the convenience, which is a double pain because not only do i have to undo all that convenience, all that time using and learning it in the first place has just been confirmed as wasted.

I tried using the last release as modules but they are incompatible with the module system because one or two of the packages are split. I tried just making a module out of them but couldn't get it to work either. And either i'm really shit at google-fu or it's just shit but I couldn't for the life of me find any other reasonable approach so after wasting too much time on it I bit the bullet and just wrote some SAXParser and XMLStreamWriter code mandraulically.

Fortunately the xml trees I had made parsing quite simple. First, none of the element names overlapped so even parsing embedded structures works without having to keep track of the element state. Secondly almost all the simple fields were encoded as attributes rather than elements. So this means almost all objects can be parsed from the startElement callback, and a single stack is used to track encapsulated fields. Because I use arrays in a few places a couple of ancillary lists are used to build them (or I could just change them to Lists).
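A minimal sketch of that style of parser follows. The element names (shape, point), their attributes and the Shape/Point classes are invented for illustration and have nothing to do with my actual file format; the structure is the bit that matters: objects are built entirely in startElement from attributes, and one stack tracks the enclosing object.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class SaxDemo {
    // Hypothetical data classes.
    static final class Point { double x, y; }
    static final class Shape { String name; final List<Point> points = new ArrayList<>(); }

    static List<Shape> parse(String xml) {
        List<Shape> shapes = new ArrayList<>();
        Deque<Object> stack = new ArrayDeque<>(); // enclosing objects, innermost first

        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes a) {
                switch (qName) {
                    case "shape": {
                        Shape s = new Shape();
                        s.name = a.getValue("name");
                        shapes.add(s);
                        stack.push(s);
                        break;
                    }
                    case "point": {
                        Point p = new Point();
                        p.x = Double.parseDouble(a.getValue("x"));
                        p.y = Double.parseDouble(a.getValue("y"));
                        ((Shape) stack.peek()).points.add(p); // parent is the enclosing shape
                        stack.push(p);
                        break;
                    }
                    default:
                        stack.push(qName); // placeholder so endElement can always pop
                }
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                stack.pop();
            }
        };

        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException(e);
        }
        return shapes;
    }
}
```

Since the element names never overlap, the switch on the name alone is enough to know what is being parsed.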

It's still tedious and error-prone and a pretty shit indictment on the state of Java SE in 2018 vs other languages but once it's done it's done and not having a dependency on half a dozen badly over-engineered packages means it's only done once and i'm not wasting my time learning another fucking "framework".

I didn't investigate where javaee is headed - it'll no doubt eventually solve this problem but removing the dependency from desktop and command-line tools isn't such a bad thing - there have to be good reasons it was dropped from JavaSE in the first place.

One might point to json but that's just as bad to use as a DOM based mechanism which is also just as tedious and error prone. json only really works with fully dynamic languages where you don't have to write any of the field bindings, although there are still plenty of issues with no canonicalised encoding of things like empty arrays or null strings. In any event I need file format compatibility so the fact that I also think it's an unacceptably shit solution is entirely moot.

Modules

By the end of the week i'd modularised my main library and ported one of the applications that uses it to the new structure. The application itself also needs quite a bit of modularisation but that's a job for next week, as is testing and debugging - it runs but there's a bunch of broken shit.

So using the modules it's actually quite nice - IF you're using modules all the way down. I didn't have time to look further to find out if it's just a problem with netbeans but adding jars to the classpath generally fucks up and it starts adding strange dependencies to the build. So in a couple of cases I took existing jars and added a module-info myself. When it works it's actually really nice - it just works. When it doesn't, well i'm getting resource path issues in one case.

I also like the fact the tools are the ones dictating the source and class file structures - not left to 3rd party tools to mess up.

Unfortunately I suspect modularisation will be a pretty slow-burn and it will be a while before it benefits the average developer.

Netbeans / CVS

As an update on netbeans I joined the user mailing list and asked about CVS - apparently it's in the netbeans plugin portal. Except it isn't, and after providing screenshots of why I would think that it doesn't exist I simply got ignored.

Yeah ok.

Command line will have to do for me until it decides to show up in my copy.

Java After Next

So with Oracle loosening the reins a bit (?) on parts of the java platform like JavaFX i'm a little concerned about where things will end up.

Outside of the relatively tight core of SE, the java platform has some pretty shitty "industry standard" pieces. ant - it's just a horrible tool to use. So horrible it looks like they've added javascript to address some of its issues (oh yay). maven has a lot of issues beyond just being slow as fuck. The ease with which it allows one to bloat out dependencies is not a positive feature.

So yeah, if the "industry" starts dictating things a bit more, hopefully they won't have a negative impact.

Java Modules

So I might not be giving a shit and doing it for fun but I'm still looking into it at work.

After a couple of days of experiments and quite a bit of hacking i've taken most of the libraries I have and re-combined them into a set of modules. Ostensibly the modules are grouped by functionality but I also moved a few bits and pieces around for dependency reasons.

One handy thing is the module-info (along with netbeans) lets you quickly determine dependencies between modules, so for example when I wanted to remove java.desktop and javafx from a library I could easily find the usages. It has made the library slightly more difficult to use because i've moved some methods to static functions (and these functions are used a lot in my prototype code so there's a lot of no-benefit fixing to be done to port it) but it seems like a reasonable compromise for the first cut. There may be other approaches using interfaces or subclasses too, although I tend to think that falls into over-engineering.

Spi

One of the biggest benefits is the service provider mechanism that enables pluggability by just including modules on the path. It's something I should've looked into earlier rather than the messy ad-hoc stuff i've been doing but I guess things get done eventually.
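The lookup side of this is only a few lines with java.util.ServiceLoader. In the sketch below the Resampler interface, its fallback and the module names in the comments are all hypothetical; only ServiceLoader and the uses / provides ... with directives are the real mechanism.

```java
import java.util.ServiceLoader;

final class SpiDemo {
    // Hypothetical service interface that several modules might implement.
    public interface Resampler {
        String name();
    }

    // Pure-Java fallback used when no provider module is present.
    static final class JavaResampler implements Resampler {
        @Override
        public String name() { return "java-fallback"; }
    }

    // In a modular build the consuming module declares
    //     uses SpiDemo.Resampler;
    // and a provider module (say a native-backed one) declares
    //     provides SpiDemo.Resampler with some.pkg.NativeResampler;
    // Putting the provider module on the module path is then enough
    // for it to be picked up here.
    static Resampler find() {
        for (Resampler r : ServiceLoader.load(Resampler.class))
            return r; // first provider found wins
        return new JavaResampler();
    }

    public static void main(String[] args) {
        System.out.println("using: " + find().name());
    }
}
```

Taking the first provider found is the simplest possible policy; anything smarter (priorities, capability checks) is left out of the sketch.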

I've probably not done a good job with it yet either but it's a start and easy to modify. There should be a couple of other places I can take advantage of it as well.

Redesign

I'm also mid-way through cleaning out a lot of stuff - cut and paste, newer-better implementations, or just experiments that take too much code and are rarely used.

I had a lot of stream processing experiments which just ended up being over-engineered. For example I tried experimenting with using streams and a Collector to calculate more than just min/sum/max, instead calculating multi-dimensional statistics (i.e. all at once) on multi-dimensional data (e.g. image channels). So I came up with a set of classes (1 to 4 dimensions), collector factories, and so on - it's hundreds of lines of code (and a lot of bytecode) and I think I only use it in one or two places in non-performance critical code. So it's going in the bin and if i do decide to replace it I think I can get by with at most a single class and a few factory methods.
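Something along these lines is what I mean by a single class and a factory method. It is a sketch rather than the code being binned: ChannelStats and its fields are hypothetical names, each stream element is assumed to be one double[] sample with a value per channel, and only count/min/max/sum are tracked.

```java
import java.util.Arrays;
import java.util.stream.Collector;

final class ChannelStats {
    long count;
    final double[] min, max, sum; // one slot per channel

    ChannelStats(int channels) {
        min = new double[channels];
        max = new double[channels];
        sum = new double[channels];
        Arrays.fill(min, Double.POSITIVE_INFINITY);
        Arrays.fill(max, Double.NEGATIVE_INFINITY);
    }

    // Accumulate one multi-channel sample.
    void accept(double[] v) {
        count++;
        for (int c = 0; c < sum.length; c++) {
            min[c] = Math.min(min[c], v[c]);
            max[c] = Math.max(max[c], v[c]);
            sum[c] += v[c];
        }
    }

    // Merge partial results, needed for parallel streams.
    ChannelStats combine(ChannelStats o) {
        count += o.count;
        for (int c = 0; c < sum.length; c++) {
            min[c] = Math.min(min[c], o.min[c]);
            max[c] = Math.max(max[c], o.max[c]);
            sum[c] += o.sum[c];
        }
        return this;
    }

    double mean(int c) {
        return sum[c] / count;
    }

    // The one factory method: a Collector over double[] samples.
    static Collector<double[], ChannelStats, ChannelStats> collector(int channels) {
        return Collector.of(() -> new ChannelStats(channels),
                            ChannelStats::accept,
                            ChannelStats::combine);
    }
}
```

Usage would be something like `pixels.stream().collect(ChannelStats.collector(3))` for a list of RGB samples.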

The NotAnywhereZone

Whilst looking for some info on netbeans+cvs I tried finding my own posts, and it seems this whole site has effectively vanished from the internet. Well with a specific search you can still find blog posts on google, but not using the date-ranges (maybe the date headers are wrong here). All you can find on duckduckgo is the site root page.

So if you're reading this, congratulations on not being a spider bot!

Not Netbeans 9.0, Java 11

Well that effort was short-lived, no CVS plugin anymore.

It's not that hard to live without, just use the command line and/or emacs, but today i've already wasted enough time trying to find out if it will ever return (on which question I found no answer, clear or otherwise).

It was also going to be a bit of a pain translating an existing project into a 'modular' one, even though from the perspective of a makefile it's only a couple of small changes.

Netbeans 9, Java 11

So after months of rewriting license headers netbeans 9 is finally out. So I finally had a bit more of a serious look at migrating to openjdk + openjfx 11 for work.

Eh, it's going to be a bit of a pain but I think overall it should be an improvement worth the effort. When i'm less hungover and better-slept i'll start looking into jmodularising my projects too.

One unfortunate bit is that netbeans doesn't seem to support native libraries in modules so i'll need to use makefiles for those. This is one of the more interesting features of jmods so i'm also looking into utilising that a bit more.

At the moment i'm looking into some deep learning stuff so i've got a lot of time between drinks - pretty much every stage of it is an obnoxiously slow process.

Lots of other little things to look into as well, and now the next yearly contract has finally been done I don't have an easy excuse for putting in fuck-all hours!

Dead Cats Dont Bounce

My cat died sometime in the last few weeks.

He was acting a bit strange and sore for a while. I later found out my nephew had stepped on him - looking at his fucking phone no doubt. But following that he did seem to recover - he wasn't quite himself but he didn't seem to be bothered by anything as such. Then one day he stopped turning up for food and I haven't seen him since. However I started to smell dead-animal from a spot he used to sleep in. Unfortunately (or perhaps fortunately) I can't get to where it is - deep under a low deck - so I just have to wait for the flies and bugs to do their work. It's also right next to the loungeroom so it's a bit unpleasant depending on the weather.

Rest In Peace gentle killer.

Life goes on ...

In other news.

Work has been slow for a few months - partly my mood, partly the work, and significantly the contract hasn't been renewed. There's still some money remaining but it's running out. The org is just being funny about contractors, they'd rather deal with multinationals who fleece them and the country than with local businesses.

Not that I really care, barely been doing 2 days/week of work and i'm still earning enough to blow way way too much at the pub. At least those 2 days have been solid lately, customers are super-happy with everything. The weather is slowly improving but still has a ways to go to be beer drinking weather, not that it's stopped me so far. The weird bloke in a kilt that hangs around a certain pub. Hmmm.

Nephew is pissing me off by just being here (i'm still not sure how annoyed i should be with him over the whole stepping on the cat thing). He was only supposed to be living with me for a couple of months while he prepared the paperwork to go into the army after emigrating from The Philippines but he decided to do an apprenticeship instead and he's been here over a year now. Mostly nothing major but it's the little things like having to clean the stove up to use it every time he's used it before, stack the dishwasher properly, remind him to do things, and well just having someone else banging around the house. His old man is visiting this week so hopefully I can find out when he's moving out.

I've been lazy lazy for months but I finally did a few things on the garden and around the house - pruning, mulching, weeding, clearing up piles of wood. Built a short staircase from the deck to the ground (one day to be paving). A few days here and there of warmer weather helped.

My mood has continued to be pretty flat for the most part, and worse than that every now and then. Often but not always sleep related - i'm always tired but sometimes more tired than others. I have to stop reading the news and forums - the world is just so fucked up and so are too many of the people in it. Every time I read about big corps or politicians I just want to go drink (or just sleep).

Odds n Sods

Not playing games much but when I do it's usually No Man's Sky. I think the game design is getting a little unfocused but it's still an ok way to blow a few hours now and then. I really dislike the new sentinel timeout mechanics though.

Looking at building a mini-ITX Ryzen machine but just can't decide on bits and pieces. APU or 2700x+small GPU? I started building a small case by hacking away at a bigger one (approx 200x400x400mm -> 200x200x200mm) but that's still work in progress.

As an ongoing thing i've been poking around a start-up that's looking at doing some machine vision/learning/artistic stuff, but i'm just not certain I can commit to it and I just haven't been coding much outside of work. It has the potential to be a whole heap of fun but it just hasn't grabbed me so far.

Another shitty `technology' company

Oh FFS.

I have my previous workstation for work sitting idle so I thought i'd drop in an xubuntu install and try building openjdk & openjfx on it. It's got a 6x core I7-980 and plenty of RAM so it should be ok right?

Well all went well until I tried to build webkit, just for completeness. Result - consistent ICE inside g++. Blast. Well I thought it was consistent until I tried it with a fresh build of gcc 7.3, this also crashed but in a different place and when I went back to the system gcc I noticed the crash whilst repeatable wasn't in a consistent place. Actually it started crashing everywhere, even inside various jvm based tasks.

This is typically a symptom of system problems, specifically RAM. I looked in the BIOS in case it's been overclocked but it is so ancient there's no settings for RAM, I ran a few memory testers, I tried various numbers of threads for the build.

Then I remembered Intel and their notorious bugs this year causing system stability problems in some cases. I tried to find the options to turn off the bug mitigations but (in part due to isp maintenance at just that moment) I gave up and just booted with the 4.10.x kernel.

Oh look, works fine now (well, it compiles cleanly, webkit tests still fail!)

Perhaps this is a failure of Canonical, or the Linux developers? No, ultimately it's because Intel cut too many corners and have shit hardware. Then again any company that could design something as poor as HPET in this day and age is obviously fucking incompetent.

On a related note i've been eyeing off a Ryzen system every few months. I price one up and think about it but ultimately leave it for the time being. I'm just not doing enough computing beyond 'read internet' to justify it. Another thing I can't decide on is between some 'low-end' APU system or a beastly 2700X machine. The RAM is still so $$$ here and you need good ram for either. At least the last time I specced one up I noticed from some benchmarks that a 2700X would pretty much cream that old I7-980 at 1/4 of the price (or less, not that I paid for it).
