About Me
Michael Zucchi
B.E. (Comp. Sys. Eng.)
also known as Zed
to his mates & enemies!
< notzed at gmail >
< fosstodon.org/@notzed >
Other JNI bits/ NativeZ, jjmpeg.
Yesterday I spent a good deal of time continuing to experiment and
tune NativeZ. I also ported the latest version of jjmpeg to a
modularised build and to use NativeZ objects.
Hashing C Pointers
C pointers obtained by malloc are aligned to 16-byte boundaries on
64-bit GNU systems. Thus the lower 4 bits are always zero. Standard
malloc also allocates a contiguous virtual address range which is
extended using sbrk(2) which means the upper bits rarely change. Thus
it is sufficient to generate a hashcode which only takes into account
the lower bits (excluding the first 4).
I did some experimenting with hashing the C pointer values using
various algorithms,
from Knuth's
Magic Number to various integer hashing algorithms
(e.g. hash-prospector),
to Long.hashCode(), to a simple shift (both 64-bit and 32-bit).
The performance analysis was based on Chi-squared distance between
the hash chain lengths and the ideal, using pointers generated
from malloc(N) for different fixed values of N for multiple runs.
Although it wasn't the best statistically, the best performing
algorithm was a simple 32-bit, 4-bit shift due to its significantly
lower cost. And typically it compared quite well statistically
regardless.
static int hashCode(long p) {
    return (int)p >>> 4;
}
In the nonsensical event that 28 bits are not sufficient for the hash
bucket index, it can be extended to 32 bits:
static int hashCode(long p) {
    return (int)(p >>> 4);
}
And despite all the JNI and reflection overheads, using the two-round
function from the hash-prospector project increased raw execution time
by approximately 30% over the trivial hashCode() above.
Whilst it might not be ideal for 8-byte aligned allocations it's
probably not that bad either in practice. One thing I can say for
certain though is NEVER use Long.hashCode() to hash C pointers!
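To make the Long.hashCode() warning concrete, here is a minimal sketch that counts how many buckets of a power-of-two table get used by 16-byte aligned pointers. The base address and the usedBuckets() helper are made up for the illustration; this is not the benchmark described above.

```java
import java.util.function.LongToIntFunction;

class PointerHashDemo {
    // The 32-bit, 4-bit shift hash from the text.
    static int shiftHash(long p) {
        return (int) p >>> 4;
    }

    // Count how many of `buckets` slots (a power of two) receive at least one
    // of `count` simulated pointers spaced `stride` bytes apart from `base`.
    static int usedBuckets(LongToIntFunction hash, long base, int stride, int count, int buckets) {
        boolean[] used = new boolean[buckets];
        int n = 0;
        for (int i = 0; i < count; i++) {
            int index = hash.applyAsInt(base + (long) i * stride) & (buckets - 1);
            if (!used[index]) {
                used[index] = true;
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        long base = 0x00007f3a10000000L; // made-up, 16-byte aligned heap address
        int buckets = 1024;
        System.out.println("shift:           "
                + usedBuckets(PointerHashDemo::shiftHash, base, 16, buckets, buckets));
        System.out.println("Long.hashCode(): "
                + usedBuckets(p -> Long.hashCode(p), base, 16, buckets, buckets));
    }
}
```

With these simulated addresses the shift hash touches every bucket, while Long.hashCode() leaves the low 4 bits of every index identical and so only reaches one bucket in sixteen.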
Concurrency
I also tuned the use of synchronisation blocks very slightly to
make critical sections as short as possible whilst maintaining
correct behaviour. This made enough of a difference to be worth
it.
I also tried more complex synchronisation mechanisms - read-write
locks, hash bucket row-locks and so on, but they were at best a
bit slower than using synchronized {}.
The benchmark I was using wasn't particularly fantastic - just one
thread creating 10^7 `garbage' objects in a tight loop whilst the
cleaner thread freed them. No resolution of existing objects, no
multiple threads, and so on. But apart from the allocation rate
it isn't an entirely unrealistic scenario either and i was just
trying to identify raw overheads.
Reflection
I've only started looking at the reflection used for allocating
and releasing objects on the Java side, and in isolation these
are the highest costs of the implementation.
There are ways to reduce these costs but at the expense of extra
boilerplate (for instantiation) or memory requirements (for
release).
Still ongoing. And whilst the relative cost over C is very high,
the absolute cost is still only a few hundred nanoseconds per
object.
From a few small tests it looks like the maximum I could achieve
is a 30% reduction in object instantiation/finalisation costs, but
I don't think it's worth the effort or overheads.
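As an illustration of the instantiation trade-off, here is a minimal sketch contrasting a reflective constructor lookup with an explicit per-class factory. The Frame class and both helper methods are hypothetical and are not the NativeZ API; they only show where the extra boilerplate would come from.

```java
import java.lang.reflect.Constructor;
import java.util.function.LongFunction;

class InstantiationDemo {
    // Hypothetical Java wrapper around a native pointer.
    static class Frame {
        final long p;

        Frame(long p) {
            this.p = p;
        }
    }

    // Reflective path: works for any class with a (long) constructor and
    // needs no per-class code, but pays for the lookup and the invocation.
    static <T> T createReflective(Class<T> type, long p) {
        try {
            Constructor<T> c = type.getDeclaredConstructor(long.class);
            c.setAccessible(true);
            return c.newInstance(p);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Factory path: cheaper per call, but every class has to hand over
    // its own constructor reference somewhere (the extra boilerplate).
    static <T> T createWithFactory(LongFunction<T> factory, long p) {
        return factory.apply(p);
    }

    public static void main(String[] args) {
        Frame a = createReflective(Frame.class, 0x1000L);
        Frame b = createWithFactory(Frame::new, 0x2000L);
        System.out.printf("%x %x%n", a.p, b.p);
    }
}
```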
Makefile foo
I'm still experimenting with this. I used some macros and implicit
rules to get most things building ok, but i'm not sure if it
couldn't be better. The basic makefile is working ok for
multi-module stuff so I think i'm getting there. Most of the work
is just done by the jdk tools as they handle modules and so on
quite well and mostly dictate the disk layout.
I've broken jjmpeg into 3 modules - the core, the javafx related
classes and the awt related classes.
GC JNI, HashTables, Memory
I had a very busy week at work, working on porting libraries and
applications to Java modules - that wasn't really the busy part, I
also looked into making various implementations pluggable using
services and then creating various pluggable implementations,
often utilising native code. Just having some (much faster)
implementation of parts also opened other opportunities and it
sort of cascaded from there.
Anyway along the way I revisited my implementation of
Garbage Collection with
JNI and started working on a modular version that can be
shared between libraries without having to copy core objects, and
then along the way found bugs and things to improve.
Here are some of the more interesting pieces I found along the way.
JNI call overheads
The way i'm writing jni these days is typically to just write the
method signature as if it were a Java method and just mark it
native. Let the jni handle Java to C mappings directly. This is
different to how I first started doing it and flies in the face of the
convention i've typically seen amongst JNI implementations where
the Java just passes the pointers as a long and has a wrapper
function which resolves these longs as appropriate.
The primary reason is to reduce boilerplate and significantly
simplify the Java class writing without having a major impact on
performance. I have done some performance testing before but I
re-ran some tests and they confirm the design decisions used in
zcl for example.
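The two conventions can be sketched from the Java side as follows. The class and method names are hypothetical, no native library is loaded, and the native methods are only declared, never called; main() just counts the hand-written wrapper methods each style needs.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;

class JniStyles {
    // Style described above: the native method takes Java objects directly
    // and the C side resolves the pointer itself, e.g. with GetLongField().
    static class Frame {
        long p; // C pointer
        native void copyTo(Frame dst);
    }

    // Conventional style: Java resolves the pointers and passes them as
    // longs, so each native call also needs a Java wrapper method.
    static class WrappedFrame {
        long p; // C pointer

        void copyTo(WrappedFrame dst) {
            copyTo(p, dst.p);
        }

        private static native void copyTo(long src, long dst);
    }

    // Counts the hand-written (non-native) methods declared by a class.
    static int wrapperCount(Class<?> c) {
        int n = 0;
        for (Method m : c.getDeclaredMethods()) {
            if (!m.isSynthetic() && !Modifier.isNative(m.getModifiers())) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println("direct:  " + wrapperCount(Frame.class) + " wrapper(s)");
        System.out.println("wrapped: " + wrapperCount(WrappedFrame.class) + " wrapper(s)");
    }
}
```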
Array Access
First, I tested some mechanisms for accessing arrays. I passed
two arrays to a native function and had it perform various tests:
- 0: No op;
- 1: GetPrimitiveArrayCritical on both arrays;
- 2: GetArrayElements for read-only arrays (call Release(ABORT));
- 3: GetArrayElements for read-only on one array and read-write
  on the other (call Release(ABORT, COMMIT));
- 4: GetArrayRegion for read-only, to memory allocated using alloca;
- 5: GetArrayRegion and SetArrayRegion for one array, to memory allocated using alloca;
- 6: GetArrayRegion for read-only, to memory allocated using malloc;
- 7: GetArrayRegion and SetArrayRegion for one array, to memory allocated using malloc.
I then ran these tests for different sized float[] arrays, for
1 000 000 iterations, and the results in seconds are below. It's some intel laptop.
Size  NOOP         Critical     Elements                  Region/alloca             Region/malloc
      0            1            2            3            4            5            6            7
   1  0.014585537  0.116005779  0.199563981  0.207630731  0.104293268  0.127865782  0.185149189  0.217530639
   2  0.013524620  0.118654092  0.201340322  0.209417471  0.104695330  0.129843794  0.193392346  0.216096210
   4  0.012828157  0.113974453  0.206195102  0.214937432  0.107255090  0.127068808  0.190165219  0.215024016
   8  0.013321001  0.116550424  0.209304277  0.205794572  0.102955338  0.130785133  0.192472825  0.217064583
  16  0.013228272  0.116148320  0.207285227  0.211022409  0.106344162  0.139751496  0.196179709  0.222189471
  32  0.012778452  0.119130446  0.229446026  0.239275912  0.111609011  0.140076428  0.213169077  0.252453033
  64  0.012838540  0.115225274  0.250278658  0.259230054  0.124799171  0.161163577  0.230502836  0.260111468
 128  0.014115022  0.120103332  0.264680542  0.282062633  0.139830967  0.182051151  0.250609001  0.297405818
 256  0.013412645  0.114502078  0.315914219  0.344503396  0.180337154  0.241485525  0.297850562  0.366212494
 512  0.012669807  0.117750316  0.383725378  0.468324904  0.261062826  0.358558946  0.366857041  0.466997977
1024  0.013393850  0.120466096  0.550091063  0.707360155  0.413604094  0.576254053  0.518436072  0.711689270
2048  0.013493996  0.118718871  0.990865614  1.292385065  0.830819392  1.147347700  0.973258653  1.284913436
4096  0.012639675  0.116153318  1.808592969  2.558903773  1.628814486  2.400586604  1.778098089  2.514406096
Some points of note:
- Raw method invocation is around 14 nanoseconds, pretty much
irrelevant once you do any work.
- Get/ReleaseArrayElements is pretty much the same as using
Get/SetArrayRegion with malloc but with less flexibility.
- For small arrays 2 calls to malloc/free is nearly 50% of the
processing time. Given the gay abandon with which most C
programmers throw these around like they cost nothing, the extra
JNI overhead is modest.
- For larger arrays memcpy time dominates.
- For one way transfers shorter than 64 floats using
Get/SetArrayRegion to the stack or pre-allocated memory is the fastest.
- For all other cases including any-sized two-way transfers,
GetPrimitiveArrayCritical is the fastest. But it has other
overheads and isn't always applicable.
I didn't look at ByteBuffer because it doesn't really fit what i'm
doing with these functions.
Anyway - the overheads are unavoidable with JNI but are quite
modest. The function in question does nothing with the data and
so any meaningful operation will quickly dominate the processing
time.
Object Pointer resolution
The next test I did was to compare various mechanisms for
transferring the native C pointer from Java to C.
I created a Native object with two long fields, final long
p, and long q.
- 0: No operation;
- 1: C invokes getP() method which returns p;
- 2: C invokes getQ() method which returns q;
- 3: C access to .p field;
- 4: C access to .q field;
- 5: The native signature takes a pointer directly, call it resolving the .p field in the caller;
- 6: The native signature takes a pointer directly, call it resolving the .p field via a wrapper function.
Again invoking it 1 000 000 times.
NOOP         getP()       getQ()       (C).p        (C).q        (J).p        J wrapper
0            1            2            3            4            5            6
0.016606942  0.293797182  0.294253973  0.020146810  0.020154508  0.015827028  0.016979563
- final makes no difference.
- method invocation is 15x slower than a field lookup!
- Field lookups are much slower in C than Java, but the absolute
cost is insignificant at ~3.5ns per lookup (0.0201s vs 0.0166s over 10^6 calls).
In short, just passing Java objects directly and having the C
resolve the pointer via a field lookup is slightly slower but
requires much less boilerplate and so is the preferred solution.
Logger
After I sorted out the basic JNI mechanisms I started looking at
the reference tracking implementation (i'll call this NativeZ from
here on).
For debugging and trying to be a more re-usable library I had added
logging to various places in the C code using
Logger.getLogger(tag).fine(String.format(...));
It turns out this was really not a wise idea and the logging calls
alone were taking approximately 50% of the total execution time -
versus java to C to java, hashtable lookups and synchronisation
blocks.
Simply changing to use the Supplier versions of the logging
functions approximately doubled the performance.
Logger.getLogger(tag).fine(String.format(...));
->
Logger.getLogger(tag).fine(() -> String.format(...));
But I also decided to just make including any of the code optional
by guarding each call with a test against a final static boolean
compile-time constant.
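A minimal sketch of the difference, assuming FINE logging is disabled: the eager call builds its message and throws it away, the Supplier version never builds it, and the constant-guarded call can be dropped by the compiler altogether. The logger name, describe() helper and counter are made up for the illustration.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

class LazyLogDemo {
    // Compile-time constant: when false the guarded call is never executed.
    static final boolean DEBUG = false;

    // Counts how often a log message is actually formatted.
    static int formatted = 0;

    static String describe(long p) {
        formatted++;
        return String.format("release %016x", p);
    }

    static int run() {
        formatted = 0;
        Logger log = Logger.getLogger("demo.nativez"); // hypothetical logger name
        log.setLevel(Level.INFO);                      // FINE is disabled

        log.fine(describe(0x1000L));       // eager: formats, then discards
        log.fine(() -> describe(0x2000L)); // lazy: supplier is never invoked
        if (DEBUG) {
            log.fine(describe(0x3000L));   // skipped entirely
        }
        return formatted;
    }

    public static void main(String[] args) {
        System.out.println("messages formatted: " + run());
    }
}
```

Only the eager call ends up formatting anything, which is the cost the Supplier variant avoids.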
This checking indirectly confirmed that the reflection invocations
aren't particularly onerous assuming they're doing any work.
HashMap<Long,WeakReference>
Now the other major component of the NativeZ object tracking is
using a hash-table to map C pointers to Java objects. This serves
two important purposes:
- Allows the Java to resolve separate C pointers to the same object;
- Maintains a hard reference to the WeakReference, without
which they just don't work.
For simplicity I just used a HashMap for this purpose. I knew it
wasn't ideal but I did the work to quantify it.
Using jol
and perusing the source I got some numbers for a jvm using
compressed oops and an 8-byte object alignment.
Object           | Size (bytes) | Notes
HashMap.Node     |           32 | Used for short hash chains.
HashMap.TreeNode |           56 | Used for long hash chains.
Long             |           24 | The node key.
CReference       |           48 | The node value. Subclass of WeakReference.
Thus the incremental overhead for a single C object is either 104
bytes when a linear hashchain is used, or 128 bytes when a tree
is used.
Actually it's a bit more than that because the hashtable (by
default) uses a 75% load factor so also allocates 1.5 pointers for
each object but that's neither here nor there and also a feature
of the algorithm regardless of implementation.
But there are other bigger problems: the Long.hashCode() method just
mixes the low and high words together using xor. If all C
pointers are 8 (or worse, 16) byte aligned you essentially only
get every 8th (or 16th) bucket ever in use. So apart from the
wasted buckets the HashMap is very likely to end up using Trees
to store each chain.
So I wrote another hashtable implementation which addresses this
by using the primitive long stored in the CReference directly as
the key, and using the CReference itself as the bucket nodes. I
also used a much better hash function. This reduced the memory
overhead to just the 48 bytes for the CReference plus a (tunable)
overhead for the root table - anywhere from 1/4 to 1 entry per
node works quite well with the improved hash function.
This uses less memory and runs a bit faster - mostly because the
gc is run less often.
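The idea can be sketched roughly as below: the WeakReference subclass carries the primitive key and its own chain link, so no separate node or boxed Long is needed. This is a simplified illustration, not the NativeZ implementation; the class layout, the fixed table size and the simple shift hash are assumptions.

```java
import java.lang.ref.WeakReference;

class PointerTable {
    // Hypothetical CReference: the weak reference doubles as the bucket node.
    static class CReference extends WeakReference<Object> {
        final long p;    // C pointer, used directly as the key
        CReference next; // next entry in the same bucket

        CReference(Object referent, long p) {
            super(referent);
            this.p = p;
        }
    }

    private final CReference[] table;
    private final int mask;

    // size must be a power of two; this sketch never resizes.
    PointerTable(int size) {
        table = new CReference[size];
        mask = size - 1;
    }

    static int hash(long p) {
        return (int) p >>> 4; // assumes 16-byte aligned pointers
    }

    synchronized void put(CReference ref) {
        int i = hash(ref.p) & mask;
        ref.next = table[i];
        table[i] = ref;
    }

    synchronized CReference get(long p) {
        for (CReference r = table[hash(p) & mask]; r != null; r = r.next) {
            if (r.p == p) {
                return r;
            }
        }
        return null;
    }

    synchronized boolean remove(long p) {
        int i = hash(p) & mask;
        CReference prev = null;
        for (CReference r = table[i]; r != null; prev = r, r = r.next) {
            if (r.p == p) {
                if (prev == null) {
                    table[i] = r.next;
                } else {
                    prev.next = r.next;
                }
                r.next = null;
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        PointerTable t = new PointerTable(16);
        Object owner = new Object(); // keeps the referent strongly reachable
        t.put(new CReference(owner, 0x1000L));
        System.out.println(t.get(0x1000L) != null);
        System.out.println(t.remove(0x1000L));
    }
}
```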
notzed.nativez
So i'm still working on wrapping this all up in a module
notzed.nativez which will include the Java base class and a shared
library for other JNI libraries to link to which includes the
(trivial) interface to the NativeZ object and some helpers to help
write small and robust JNI libraries.
And then of course eventually port jjmpeg and zcl to use it.
Bye Bye Jaxby
So one of the biggest changes affecting my projects with Java 11
is the removal of java.xml.bind from the openjdk. This is a bit
of a pain because the main reason I used it was the convenience,
which is a double pain because not only do i have to undo all that
convenience, all that time using and learning it in the first
place has just been confirmed as wasted.
I tried using the last release as modules but they are
incompatible with the module system because one or two of the
packages are split. I tried just making a module out of them but
couldn't get it to work either. And either i'm really shit at
google-fu or it's just shit but I couldn't for the life of me
find any other reasonable approach so after wasting too much time
on it I bit the bullet and just wrote some SAXParser and
XMLStreamWriter code mandraulically.
Fortunately the xml trees I had made parsing quite simple. First,
none of the element names overlapped so even parsing embedded
structures works without having to keep track of the element
state. Secondly almost all the simple fields were encoded as
attributes rather than elements. So this means almost all objects
can be parsed from the startElement callback, and a single stack
is used to track encapsulated fields. Because I use arrays in a
few places a couple of ancillary lists are used to build them (or
I could just change them to Lists).
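A minimal sketch of that style of parser, under the same assumptions (unique element names, simple fields as attributes): everything is built in startElement and a single stack tracks the enclosing object. The shape/point document format and classes are invented for the example.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxDemo {
    // Hypothetical data classes for the example document.
    static class Point {
        double x, y;
    }

    static class Shape {
        String name;
        List<Point> points = new ArrayList<>();
    }

    static List<Shape> parse(String xml) {
        List<Shape> shapes = new ArrayList<>();
        Deque<Object> stack = new ArrayDeque<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes a) {
                switch (qName) {
                case "shape": {
                    Shape s = new Shape();
                    s.name = a.getValue("name");
                    shapes.add(s);
                    stack.push(s);
                    break;
                }
                case "point": {
                    Point p = new Point();
                    p.x = Double.parseDouble(a.getValue("x"));
                    p.y = Double.parseDouble(a.getValue("y"));
                    // The enclosing object is whatever is on top of the stack.
                    ((Shape) stack.peek()).points.add(p);
                    stack.push(p);
                    break;
                }
                default:
                    stack.push(qName); // placeholder keeps push/pop balanced
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                stack.pop();
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return shapes;
    }

    public static void main(String[] args) {
        String xml = "<shapes><shape name=\"tri\">"
                + "<point x=\"0\" y=\"0\"/><point x=\"1\" y=\"0\"/><point x=\"0\" y=\"1\"/>"
                + "</shape></shapes>";
        for (Shape s : parse(xml)) {
            System.out.println(s.name + ": " + s.points.size() + " points");
        }
    }
}
```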
It's still tedious and error-prone and a pretty shit indictment on
the state of Java SE in 2018 vs other languages but once it's done
it's done and not having a dependency on half a dozen badly
over-engineered packages means it's only done once and i'm not
wasting my time learning another fucking "framework".
I didn't investigate where javaee is headed - it'll no doubt
eventually solve this problem but removing the dependency from
desktop and command-line tools isn't such a bad thing - there
have to be good reasons it was dropped from JavaSE in the first
place.
One might point to json but that's just as bad to use as a DOM
based mechanism which is also just as tedious and error prone.
json only really works with fully dynamic languages where you
don't have to write any of the field bindings, although there are
still plenty of issues with no canonicalised encoding of things
like empty arrays or null strings. In any event I need file
format compatibility so the fact that I also think it's an
unacceptably shit solution is entirely moot.
Modules
By the end of the week i'd modularised my main library and ported
one of the applications that uses it to the new structure. The
application itself also needs quite a bit of modularisation but
that's a job for next week, as is testing and debugging - it runs
but there's a bunch of broken shit.
So using the modules it's actually quite nice - IF you're using
modules all the way down. I didn't have time to look further to
find out if it's just a problem with netbeans but adding jars to
the classpath generally fucks up and it starts adding strange
dependencies to the build. So in a couple of cases I took
existing jars and added a module-info myself. When it works it's
actually really nice - it just works. When it doesn't, well i'm
getting resource path issues in one case.
I also like the fact the tools are the ones dictating the source
and class file structures - not left to 3rd party tools to mess
up.
Unfortunately I suspect modularisation will be a pretty slow-burn
and it will be a while before it benefits the average developer.
Netbeans / CVS
As an update on netbeans I joined the user mailing list and asked
about CVS - apparently it's in the netbeans plugin portal. Except
it isn't, and after providing screenshots of why I would think
that it doesn't exist I simply got ignored.
Yeah ok.
Command line will have to do for me until it decides to show up in
my copy.
Java After Next
So with Oracle loosening the reins a bit (?) on parts of the java
platform like JavaFX i'm a little concerned about where things
will end up.
Outside of the relatively tight core of SE the java
platform there are some pretty shitty "industry standard" pieces.
ant - it's just a horrible to use tool. So horrible it looks like
they've added javascript to address some of its issues (oh yay).
maven has a lot of issues beyond just being slow as fuck. The
ease with which it allows one to bloat out dependencies is not a
positive feature.
So yeah, if the "industry" starts dictating things a bit more,
hopefully they won't have a negative impact.
Java Modules
So I might not be giving a shit and doing it for fun but I'm still
looking into it at work.
After a couple of days of experiments and quite a bit of hacking
i've taken most of the libraries I have and re-combined them into
a set of modules. Ostensibly the modules are grouped by
functionality but I also moved a few bits and pieces around for
dependency reasons.
One handy thing is the module-info (along with netbeans) lets you
quickly determine dependencies between modules, so for example
when I wanted to remove java.desktop and javafx from a library I
could easily find the usages. It has made the library slightly
more difficult to use because i've moved some methods to static
functions (and these functions are used a lot in my prototype code
so there's a lot of no-benefit fixing to be done to port it) but
it seems like a reasonable compromise for the first cut. There
may be other approaches using interfaces or subclasses too,
although I tend to think that falls into over-engineering.
Spi
One of the biggest benefits is the service provider mechanism that
enables pluggability by just including modules on the path. It's
something I should've looked into earlier rather than the messy
ad-hoc stuff i've been doing but I guess things get done
eventually.
I've probably not done a good job with it yet either but it's a
start and easy to modify. There should be a couple of other
places I can take advantage of it as well.
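For reference, a minimal sketch of the service lookup with a pure-Java fallback. The Resampler interface and its names are hypothetical; in a real modular build the consumer's module-info would declare `uses` for the interface and each plug-in module would declare `provides ... with ...`.

```java
import java.util.ServiceLoader;

class SpiDemo {
    // Hypothetical service interface that plug-in modules would implement.
    public interface Resampler {
        String name();
    }

    // Pure-Java default used when no provider module is on the path.
    static class JavaResampler implements Resampler {
        @Override
        public String name() {
            return "java-fallback";
        }
    }

    // Returns the first provider found, otherwise the default.
    static Resampler find() {
        for (Resampler r : ServiceLoader.load(Resampler.class)) {
            return r;
        }
        return new JavaResampler();
    }

    public static void main(String[] args) {
        System.out.println("using: " + find().name());
    }
}
```

Dropping a module that provides a (say, native) implementation onto the module path would then change what find() returns without touching the calling code.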
Redesign
I'm also mid-way through cleaning out a lot of stuff - cut and
paste, newer-better implementations, or just experiments that take
too much code and are rarely used.
I had a lot of stream processing experiments which just ended up
being over-engineered. For example I tried experimenting with
using streams and a Collector to calculate more than just
min/sum/max, instead calculating multi-dimensional statistics
(i.e. all at once) on multi-dimensional data (e.g. image
channels). So I came up with a set of classes (1 to 4
dimensions), collector factories, and so on - it's hundreds of
lines of code (and a lot of bytecode) and I think I only use it in
one or two places in non-performance critical code. So it's going
in the bin and if i do decide to replace it I think I can get by
with at most a single class and a few factory methods.
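As a rough idea of what the single-class version could look like, here is a sketch of one accumulator class with one factory method that collects min/max/sum for every channel in a single pass. The class name and the float[] per-pixel layout are assumptions for the example, not the code being thrown away.

```java
import java.util.Arrays;
import java.util.stream.Collector;
import java.util.stream.Stream;

class ChannelStats {
    long count;
    final double[] min;
    final double[] max;
    final double[] sum;

    ChannelStats(int channels) {
        min = new double[channels];
        max = new double[channels];
        sum = new double[channels];
        Arrays.fill(min, Double.POSITIVE_INFINITY);
        Arrays.fill(max, Double.NEGATIVE_INFINITY);
    }

    // Accumulate one sample; pixel[i] is the value of channel i.
    void accept(float[] pixel) {
        count++;
        for (int i = 0; i < min.length; i++) {
            min[i] = Math.min(min[i], pixel[i]);
            max[i] = Math.max(max[i], pixel[i]);
            sum[i] += pixel[i];
        }
    }

    // Merge partial results (needed for parallel streams).
    ChannelStats combine(ChannelStats other) {
        count += other.count;
        for (int i = 0; i < min.length; i++) {
            min[i] = Math.min(min[i], other.min[i]);
            max[i] = Math.max(max[i], other.max[i]);
            sum[i] += other.sum[i];
        }
        return this;
    }

    static Collector<float[], ChannelStats, ChannelStats> collector(int channels) {
        return Collector.of(() -> new ChannelStats(channels),
                ChannelStats::accept, ChannelStats::combine);
    }

    public static void main(String[] args) {
        ChannelStats s = Stream.of(new float[] {1f, 10f}, new float[] {3f, 20f}, new float[] {2f, 30f})
                .collect(collector(2));
        System.out.println(s.count + " samples, sums " + Arrays.toString(s.sum));
    }
}
```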
The NotAnywhereZone
Whilst looking for some info on netbeans+cvs I tried finding my
own posts, and it seems this whole site has effectively vanished
from the internet. Well with a specific search you can still find
blog posts on google, but not using the date-ranges (maybe the
date headers are wrong here). All you can find on duckduckgo is
the site root page.
So if you're reading this, congratulations on not being a spider
bot!
Not Netbeans 9.0, Java 11
Well that effort was short-lived, no CVS plugin anymore.
It's not that hard to live without, just use the command line
and/or emacs, but today i've already wasted enough time trying to
find out if it will ever return (on which question I found no
answer, clear or otherwise).
It was also going to be a bit of a pain translating an existing
project into a 'modular' one, even though from the perspective of
a makefile it's only a couple of small changes.
Netbeans 9, Java 11
So after months of rewriting license headers netbeans 9 is finally
out. So I finally had a bit more of a serious look at migrating
to openjdk + openjfx 11 for work.
Eh, it's going to be a bit of a pain but I think overall it should
be an improvement worth the effort. When i'm less hungover and
better-slept i'll start looking into jmodularising my projects
too.
One unfortunate bit is that netbeans doesn't seem to support
native libraries in modules so i'll need to use makefiles for
those. This is one of the more interesting features of jmods so
i'm also looking into utilising that a bit more.
At the moment i'm looking into some deep learning stuff so i've
got a lot of time between drinks - pretty much every stage of it
is an obnoxiously slow process.
Lots of other little things to look into as well, and now the next
yearly contract has finally been done I don't have an easy excuse
for putting in fuck-all hours!
Dead Cats Dont Bounce
My cat died sometime in the last few weeks.
He was acting a bit strange and sore for a while. I later found
out my nephew had stepped on him - looking at his fucking phone no
doubt. But following that he did seem to recover - he wasn't
quite himself but he didn't seem to be bothered by anything as
such. Then one day he stopped turning up for food and I haven't
seen him since. However I started to smell dead-animal from a
spot he used to sleep in. Unfortunately (or perhaps fortunately)
I can't get to where it is - deep under a low deck - so I just
have to wait for the flies and bugs to do their work. It's also
right next to the loungeroom so it's a bit unpleasant depending on
the weather.
Rest In Peace gentle killer.
Life goes on ...
In other news.
Work has been slow for a few months - partly my mood, partly the
work, and significantly the contract hasn't been renewed.
There's still some money remaining but it's running out. The org
is just being funny about contractors, they'd rather deal with
multinationals who fleece them and the country than with local
businesses.
Not that I really care, barely been doing 2 days/week of work and
i'm still earning enough to blow way way too much at the pub. At
least those 2 days have been solid lately, customers are
super-happy with everything. The weather is slowly improving but
still has a ways to go to be beer drinking weather, not that it's
stopped me so far. The weird bloke in a kilt that hangs
around a certain pub. Hmmm.
Nephew is pissing me off by just being here (i'm still not sure
how annoyed i should be with him over the whole stepping on the
cat thing). He was only supposed to be living with me for a
couple of months while he prepared the paperwork to go into the
army after emigrating from The Philippines but he decided to do an
apprenticeship instead and he's been here over a year now. Mostly
nothing major but it's the little things like having to clean the
stove up to use it every time he's used it before, stack the
dishwasher properly, remind him to do things, and well just having
someone else banging around the house. His old man is visiting
this week so hopefully I can find out when he's moving out.
I've been lazy lazy for months but I finally did a few things on
the garden and around the house - pruning, mulching, weeding,
clearing up piles of wood. Built a short staircase from the deck
to the ground (one day to be paving). A few days here and there
of warmer weather helped.
My mood has continued to be pretty flat for the most part, and
worse than that every now and then. Often but not always sleep
related - i'm always tired but sometimes more tired than others.
I have to stop reading the news and forums - the world is just so
fucked up and so are too many of the people in it. Every time I
read about big corps or politicians I just want to go drink (or
just sleep).
Odds n Sods
Not playing games much but when I do it's usually No Man's Sky. I
think the game design is getting a little unfocused but it's still
an ok way to blow a few hours now and then. I really dislike the
new sentinel timeout mechanics though.
Looking at building a mini-ITX Ryzen machine but just can't decide
on bits and pieces. APU or 2700x+small GPU? I started building a
small case by hacking away at a bigger one (approx 200x400x400mm
-> 200x200x200mm) but that's still work in progress.
As an ongoing thing i've been poking around a start-up that's
looking at doing some machine vision/learning/artistic stuff, but
i'm just not certain I can commit to it and I just haven't been
coding much outside of work. It has the potential to be a whole
heap of fun but it just hasn't grabbed me so far.
Another shitty `technology' company
Oh FFS.
I have my previous workstation for work sitting idle so I thought
i'd drop in an xubuntu install and try building openjdk &
openjfx on it. It's got a 6x core I7-980 and plenty of RAM so it
should be ok right?
Well all went well until I tried to build webkit, just for
completeness. Result - consistent ICE inside g++. Blast. Well I
thought it was consistent until I tried it with a fresh build of
gcc 7.3, this also crashed but in a different place and when I
went back to the system gcc I noticed the crash whilst repeatable
wasn't in a consistent place. Actually it started crashing
everywhere, even inside various jvm based tasks.
This is typically a symptom of system problems, specifically RAM.
I looked in the BIOS in case it's been overclocked but it is so
ancient there's no settings for RAM, I ran a few memory testers, I
tried various numbers of threads for the build.
Then I remembered Intel and their notorious bugs this year causing
system stability problems in some cases. I tried to find the
options to turn off the bug mitigations but (in part due to isp
maintenance at just that moment) I gave up and just booted with
the 4.10.x kernel.
Oh look, works fine now (well, it compiles cleanly, webkit tests
still fail!)
Perhaps this is a failure of Canonical, or the Linux developers?
No, ultimately it's because Intel cut too many corners and have
shit hardware. Then again any company that could design something
as poor as HPET in this day and age is obviously fucking
incompetent.
On a related note i've been eyeing off a Ryzen system every few
months. I price one up and think about it but ultimately leave it
for the time being. I'm just not doing enough computing beyond
'read internet' to justify it. Another thing I can't decide on is
between some 'low-end' APU system or a beastly 2700X machine. The
RAM is still so $$$ here and you need good ram for either. At
least the last time I specced one up I noticed from some
benchmarks that a 2700X would pretty much cream that old I7-980 at
1/4 of the price (or less, not that I paid for it).
Copyright (C) 2019 Michael Zucchi, All Rights Reserved.
Powered by gcc & me!