Putting things back together
My brother was here a few weeks ago and I took the opportunity of having transport (I don't drive - just never got a license) to get a few things that are a little difficult to transport on the bicycle.
One was to take my old VAF DC-7 Rev 1.0 speakers and get them
reconditioned. It wasn't exactly cheap but they replaced the old
drivers (required a bigger hole) and i'm not sure what else. But
I asked to keep the old drivers just to muck around with, perhaps
build a 'portable speaker' type thing out of them. They are a bit
scratchy from uh, over-use, but I noticed the main problem is the
surrounds had perished. I looked up some info about replacing
them and ended up ordering some cheapies from China - i'm not sure
I can recover them regardless and it's not worth the cost if I
fail.
Part of the process is removing the old rubber, and about all I
can say about it is you have to be patient. I used a very sharp
chisel and it took a couple of hours just to remove one, although
the final 1/4 went much faster than the first once I got the
technique sorted out.
Tools
Another thing I got was a welder, small drill press and some other tools. And that led to a bit of a spending spree that continued after he left, buying a bunch of other workshop tools. They're mostly cheap bits and pieces because i'm not sure how much use i'll get out of them.
It wasn't the reason I got the welder but the first thing I
thought of making was a belt buckle for a very wide kilt belt. I
had asked a small leatherwork shop (shoe repair, belts) in the city whether they could make wide belts but they couldn't and
directed me to a saddlesmith at the other end of town. I dropped
in one day and asked about it - yeah he could make any width belt,
but he didn't have any buckles suitable. So the next weekend I
got the welder out and turned a couple of pieces of wine barrel
ring into a rather large belt buckle.
The welding is pretty shithouse but I haven't welded for years and
the grinding and polishing hides most of it. The buckle has a
loop through which the belt connects and a single pin which
selects the size. The end of the belt comes around out the front
(or can go behind) and a loop holds it in place.
The front finish is a sort of coarse brushed/dented appearance
from using an angle grinder, wire brush and polishing wheel. I'm
still not sure on the finish but I will probably clear coat it.
The belt I got made up is 70mm wide and the loop which connects to
the buckle uses press-studs so I can make more buckles and easily
replace them. 70mm is about the widest that suits the kilts I
have which is just as well because I didn't really know what it
would look like till it was made.
Not particularly cheap either at $99 but at least it's locally
made and very solid leather - it should last forever. My existing
belt was a relatively 'cheap' one from utkilts that uses velcro to adjust the length. But the finish is wearing already (mostly creases), the velcro is coming off around the back, and the buckle - while ok - is a bit cheap. The actual bits are made in Pakistan I believe but the shipping costs from the USA are outrageous - they seem to have gotten slightly better, but to get the same belt ($26) and buckle ($17) again would cost $77 once shipping is added, and they don't handle GST (I have no idea whether this means you get hit with import hassles above the $7 GST cost). And that's the absolute cheapest/slowest option; it ranges up to $130!
Actually I have an idea for another buckle mechanism that I might
try out when I get time, if I do that I might get a 60mm belt
made. The other idea would be a lower profile, hiding the tail of the belt (wrap under), and possibly being hole-less if I can create a binding mechanism that won't damage the front surface of the belt.
And today I turned one of the practice pieces of barrel strap into
a bottle opener. I drilled some holes and filed out the opening
shape by hand.
The finish is a little shit because I cut a bit deep using a
sanding disc on an angle grinder and gave up trying to sand it
out, although it is a lot shinier than it appears in the photo.
Partly I was experimenting with finishes and patterns and i'm
happier with the pattern here, or at least the general approach.
I created a round ended punch from an old broken screwdriver and
used a small portable jackhammer (/ hammer drill) to pound in the
dots. Because i'm just cold working it's a bit difficult to do
much in the pattern department.
Gets me away from the computer screen anyway, i'm kinda burnt out
on that. I'm not really doing enough hours lately for the guy who
pays me (for various reasons beyond our control he's got more
money than hours I want to work!), although the customers who pay
him are quite happy with the output!
I'm at the point where I finally need to get some glasses. I had
another eye test last week and while I can still survive without them, it's to the point that i'm not recognising people from afar and squinting a bit too much when reading at times. I need separate reading and distance scripts unfortunately, so I got a pair of
sunnies for distance and reading glasses for work. They would've
helped with the fabrication above! It's going to take a while to
get used to them, and/or work out whether I get progressive lenses
or whatnot, for example I can't read games very well on my TV, but
that's farther away than the reading glasses work at, sigh. I
guess i'll find out in a week or so.
Pulling things apart
I pulled an old deskjet printer apart the other day. It wasn't a
particularly expensive machine and it broke years ago; it's just
been sitting in the corner of my room collecting dust for the day
I threw it out or took out the useful bits. I guess what I found
most interesting is despite it being a disposable item just how
well put together it was.
- The main guide bar is a very solid piece of machined stainless steel. I always thought it was just a tube.
- Almost all parts could be removed by hand apart from a handful of pieces held by torx screws.
- All the metal parts could be easily removed from the plastic parts - springs, pins, rails and so on.
Basically it looks like it was designed to be repairable. Can't
imagine the equivalent today would be.
Flymo
I also have a more recent machine that still runs, a mains powered
electric hover mower. It's one that has the motor on a separate
spindle to the blade, which is driven by a belt. For a few years it's been 'running' rough due to bung bearings; I had looked at it before but it looked irreparable. The whole base-plate, clutch
and drive pulley assembly can be bought as a FRU but it must be
ordered from England. The last time I tried I couldn't get the
payment to go through and so i've just been living with a mower on
the edge of self-destruction.
Today it was getting so bad I finally had another look at it. And
lo! I managed to get the bearings out, although it took over two hours with the tools I have at hand and an awful lot of swearing!
Anyway the two bearings should be easy to source and hopefully
it'll be back up and running once I put it back together. It
would be a pity to get a whole new mower just because a couple of small cheap parts failed, and getting it repaired would've been prohibitively expensive. Hate wasting stuff.
Apart from all that I took a few days off and have been doing a
lot of gardening, preparing vegetable gardens and whatnot.
Hopefully a year or so of lying basically fallow will work in my favour. I need some more exotic chillies, and home-grown tomatoes can't be beat.
Beer for the Win!
So apparently I won one of these things.
Just in time for summer!
I might have a beer to celebrate! I spent the morning pulling out
weeds so I earned it!
Fast incremental Java builds with openjdk 11 and GNU make
This post won't match the article but I think i've solved all the
main problems needed to make it work. The only thing missing is
ancestor scanning - which isn't trivial but should be
straightforward.
Conceptually it's quite simple and it doesn't take much code but
bloody hell it took a lot of mucking about with make and the javac
TaskListener.
I took the approach I outlined yesterday; I did try to get more out of the AST but couldn't find the info I needed. The module system is making navigating source-code a pain in Netbeans (it won't find modules if the source can't be found). Some of the 'easy' steps turned out to be a complete headfuck. Anyway, some points of interest.
Dependency Tracking
Even though a java file can create any number of classes, one doesn't need to track any of the non-top-level .class files it might create for dependency purposes. Any time a .java file is compiled, all of its generated classes are created at the same time. So if a java file uses, for example, a nested class, only the source file needs to be tracked.
I didn't realise this at first and it got messy fast.
Modified Class List
I'm using a javac plugin (-Xplugin) to track the compilation via a
TaskListener. This notifies the plugin of various compilation
stages including generating class files. The painful bit here is
that you don't get information on the actual class files
generated, only on the source file and the name of the class being
generated. And you can't get the actual name of the class file
for anonymous inner classes (it's in the implementation but hidden
from public view). In short it's a bit messy getting a simple and
complete list of every class file generated from every java file
compiled.
But for various other reasons this isn't terribly important so I
just track the toplevel class files; but it was a tedious
discovery process on a very poorly documented api.
When the compiler plugin gets the COMPILATION finished event it uses the information it gathered (and more it discovers) to generate per-class dependency files similar to `gcc -MD'.
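For reference, a rough sketch of how such a plugin hooks in - this is only an illustration of the API, not the actual plugin; the service registration and the dependency-file writing are elided:

import com.sun.source.util.JavacTask;
import com.sun.source.util.Plugin;
import com.sun.source.util.TaskEvent;
import com.sun.source.util.TaskListener;

public class JavaDependPlugin implements Plugin {

    public String getName() {
        return "javadepend";            // matches -Xplugin:javadepend
    }

    public void init(JavacTask task, String... args) {
        task.addTaskListener(new TaskListener() {
            public void finished(TaskEvent e) {
                switch (e.getKind()) {
                case GENERATE:
                    // only the source file and the type name are available,
                    // not the physical .class file name
                    record(e.getSourceFile().getName(),
                        e.getTypeElement().getQualifiedName().toString());
                    break;
                case COMPILATION:
                    // everything has been compiled; write the per-class
                    // dependency files and purge stale classes here
                    writeDependencyFiles();
                    break;
                default:
                    break;
                }
            }
        });
    }

    void record(String source, String type) { /* gather per-source type lists */ }
    void writeDependencyFiles() { /* emit gcc -MD style .d files */ }
}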
Dependency Generation & Consistency
To find all the (immediate) dependencies the .class file is
processed. The ClassInfo records provide a starting point but all
field and method signatures (descriptors) must be parsed as well.
When an inner class is encountered its containing class is used to determine whether the inner class is still extant in the source code - if not it can be deleted.
And still this isn't quite enough - if you have a package private
additional class embedded inside the .java file there is no
cross-reference between the two apart from the SourceFile
attribute and implied package path. So to determine if this is
stale one needs to check the Modified Class List instead.
The upshot is that you can't just parse the modified class list
and any inner classes that reference them. I scan a whole package
at a time and then look for anomalies.
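As an aside, extracting class references from descriptors is mechanical; a minimal sketch of the idea (not the code used here) is:

import java.util.ArrayList;
import java.util.List;

class Descriptors {
    // Pull the class names out of a field or method descriptor such as
    // "(Ljava/util/List;[[Ljava/lang/String;J)V". Every "L...;" token names a
    // referenced class; primitives and array markers are skipped over.
    static List<String> classesOf(String descriptor) {
        List<String> found = new ArrayList<>();
        for (int i = 0; i < descriptor.length(); i++) {
            if (descriptor.charAt(i) == 'L') {
                int end = descriptor.indexOf(';', i);
                found.add(descriptor.substring(i + 1, end).replace('/', '.'));
                i = end;
            }
        }
        return found;
    }
}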
One-shot compile
Because invoking the compiler is slow - but also because it will
discover and compile classes as necessary - it's highly beneficial
to run it once only. Unfortunately this is not how make works and
so it needs to be manipulated somewhat. After a few false starts I found a simple way that works, shown in the example further below.
The per-module rules are required due to the source-tree naming
conventions used by netbeans (src/[module]/classes/[name] to
build/modules/[module]/[name]), a common-stem based approach is
also possible in which case it wouldn't be required. In practice
it isn't particularly onerous as I use metamake facilities to
generate these per-module rules automatically.
I spent an inordinate amount of time trying to get this to work but kept hitting puzzling (but documented) behaviour with pattern and implicit rule chaining and various other issues. One big one was that when using concrete rules (actual files) to track stages, suddenly everything breaks.

I resorted to just individual javac invocations as one would do for gcc, and trying the compiler server idea to mitigate the costs.
It worked well enough particularly since it parallelises properly.
But after I went to bed I realised i'd fucked up and then spent a
few hours working out a better solution.
Example
This is the prototype i've been using to develop the idea.
modules:=notzed.proto notzed.build

SRCS:=$(shell find src -name '*.java')
CLASSES:=$(foreach mod,$(modules),\
	$(patsubst src/$(mod)/classes/%.java,classes/$(mod)/%.class,$(filter src/$(mod)/%,$(SRCS))))

all: $(CLASSES)
	lists='$(foreach mod,$(modules),$(wildcard status/$(mod).list))' ; \
	built='$(patsubst %.list,%.built,$(foreach mod,$(modules),$(wildcard status/$(mod).list)))' ; \
	files='$(addprefix @,$(foreach mod,$(modules),$(wildcard status/$(mod).list)))' ; \
	if [ -n "$$built" ] ; then \
		javac -Xplugin:javadepend --processor-module-path classes --module-source-path 'src/*/classes' -d classes $$files ; \
		touch $$built ; \
		rm $$lists ; \
	else \
		echo "All classes up to date" ; \
	fi

define makemod=
classes/$1/%.class: src/$1/classes/%.java
	$$(file >> status/$1.list,$$<)

$1: $2
	if [ -f status/$1.list ] ; then \
		javac --module-source-path 'src/*/classes' -d classes @status/$1.list ; \
		rm status/$1.list ; \
		touch status/$1.built ; \
	fi
endef

$(foreach mod,$(modules),$(eval $(call makemod,$(mod),\
	$(patsubst src/$(mod)/classes/%.java,classes/$(mod)/%.class,$(filter src/$(mod)/%,$(SRCS))))))

-include $(patsubst classes/%,status/%.d,$(CLASSES))
In addition there is a compiler plugin which is about 500 lines of
standalone java code. This creates the dependency files (included
at the end above) and purges any stale .class files.
I still need to work out a few details with ancestor dependencies
and a few other things.
Java 11 Modules, Building
I'm basically done modularising the code at work - at least the
active code. I rather indulgently took the opportunity to do a
massive cleanup - pretty well all the FIXMEs, should be FIXMEs,
TODOs, dead-code and de-duplication that's collected over the last
few years. Even quite a few 'would be a bit nicers'. It's not
perfect but it was nice to be able to afford the time to do it.
I'm still trying to decide if I break the projects up into related
super-projects or just put everything in the single one as
modules. I'm aiming toward the latter because basically i'm sick
of typing "cd ../blah" so often, and Netbeans doesn't recompile
the dependencies properly.
I'm going to reset the repository and try using git. I don't like
it but I don't much like mercurial either.
Building
At the moment I have a build system which uses make and compiles
at the module level - i.e. any source changes and the whole module
is recompiled, and one can add explicit module-module dependencies
to control the build order and ensure a consistent build.
One reason I do this is because there is no 1:1 correspondence between build sources and build classes. If you add or remove nested or anonymous (or co-located) classes from a source file, that changes which .class files are generated. So to ensure there are no stale classes I just reset it on every build.
This isn't too bad and absolutely guarantees a consistent build
(assuming one configures the inter-module dependencies properly)
but the compiler is still invoked multiple times which has
overheads.
Building Faster
Really the speed isn't a problem for these projects but out of
interest i'm having a look at a couple of other things.
One is relatively simple - basically use JSR-199 to create a compiler server, something like the one the jdk uses to build itself.
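The gist of the JSR-199 side is small; a hand-wavy sketch, where the paths and options are placeholders and error handling is elided:

import java.io.File;
import java.util.List;
import javax.tools.JavaCompiler;
import javax.tools.JavaFileObject;
import javax.tools.StandardJavaFileManager;
import javax.tools.ToolProvider;

class CompilerServer {
    // hold the compiler and file manager in a long-lived process so repeated
    // requests avoid the jvm + javac start-up cost
    final JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
    final StandardJavaFileManager fileManager =
        compiler.getStandardFileManager(null, null, null);

    boolean compile(List<File> sources) {
        Iterable<? extends JavaFileObject> units =
            fileManager.getJavaFileObjectsFromFiles(sources);
        JavaCompiler.CompilationTask task = compiler.getTask(
            null,            // default output writer
            fileManager,
            null,            // default diagnostic listener (prints to stderr)
            List.of("--module-source-path", "src/*/classes", "-d", "classes"),
            null,            // no classes for annotation processing
            units);
        return task.call();
    }
}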
The more complicated task is incremental builds using GNU Make. I
think I should be able to hook into JavacTask and with a small bit
of extra code create something akin to the "gcc -MD" option for
auto-generating dependencies. It has the added complication of
having to detect and remove stale .class files, and doing it all
in a way that make understands. I've already done a few little
experiments today while I was procrastinating over some weeding.
Using JavacTask it is possible to find out all the .class files
that are generated for a given source file. This is one big part
of the puzzle and covers the first-level dependencies
(i.e. %.class: %.java plus all the co-resident classes). One can
also get the AST and other information but that isn't necessary
here.
To find the other dependencies I wrote a simple class file decoder
which finds all the classes referenced by the binary. Some
relatively simple pattern matching and name resolution should be
able to turn this into a dependency list.
Actually it may not be necessary to use JavacTask for this because
the .class files contain enough information. There is extra
overhead because they must be parsed, but they are simple to
parse.
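For example, the constant pool alone already yields most of the referenced classes; a cut-down sketch of such a decoder (assumed names, descriptor parsing omitted):

import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.HashSet;
import java.util.Set;

class ClassScanner {
    // Walk the constant pool and collect every CONSTANT_Class entry. Field and
    // method descriptors would need the same treatment for a complete
    // dependency list; this only shows the pool scan.
    static Set<String> referencedClasses(InputStream in) throws IOException {
        DataInputStream d = new DataInputStream(in);
        d.readInt();                    // 0xCAFEBABE magic
        d.readUnsignedShort();          // minor version
        d.readUnsignedShort();          // major version
        int count = d.readUnsignedShort();
        String[] utf8 = new String[count];
        int[] classIndex = new int[count];
        for (int i = 1; i < count; i++) {
            int tag = d.readUnsignedByte();
            switch (tag) {
            case 1: utf8[i] = d.readUTF(); break;                       // Utf8
            case 7: classIndex[i] = d.readUnsignedShort(); break;       // Class
            case 8: case 16: case 19: case 20:
                d.readUnsignedShort(); break;       // String, MethodType, Module, Package
            case 15: d.readUnsignedByte(); d.readUnsignedShort(); break; // MethodHandle
            case 5: case 6: d.readLong(); i++; break;   // Long, Double take two slots
            default: d.readInt(); break;                // all remaining tags are 4 bytes
            }
        }
        Set<String> classes = new HashSet<>();
        for (int i = 1; i < count; i++)
            if (classIndex[i] != 0)
                classes.add(utf8[classIndex[i]].replace('/', '.'));
        return classes;
    }
}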
Concurrent Hash Tables
So I got curious about whether the GC in NativeZ would cause any
bottlenecks in highly contested situations - one I already faced
with an earlier iteration of the library. The specific case I had
was running out of memory when many threads were creating
short-lived native objects; the single thread consuming values
from the ReferenceQueue wasn't able to keep up.
The out of memory situation was fairly easily addressed by just
running a ReferenceQueue poll whenever a new object is created,
but I was still curious about the locking overheads.
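Something along these lines (the names are made up, not the actual NativeZ code) is enough to stop the queue from growing without bound:

import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;

class NativeRegistry {
    static final ReferenceQueue<Object> queue = new ReferenceQueue<>();

    static class CReference extends WeakReference<Object> {
        final long p;                   // the C pointer this object wraps
        CReference(Object referent, long p) {
            super(referent, queue);
            this.p = p;
        }
        void release() {
            // would call the native free() for 'p' here
        }
    }

    static CReference register(Object owner, long p) {
        // opportunistically free anything the GC has already queued, so
        // allocation-heavy threads help the cleaner keep up
        CReference dead;
        while ((dead = (CReference) queue.poll()) != null)
            dead.release();
        return new CReference(owner, p);
    }
}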
A field of tables
I made a few variations of hash tables which support the interface I desired: store a value which also provides the key itself, as a primitive long. So it's more of a set, but with the ability to remove items by the key directly. For now i'll just summarise them.
- SynchronisedHash

  This subclasses HashMap and adds the desired interfaces, each of
  which is synchronised to the object. Rather than use the pointer
  directly as the key it is shifted by 4 (all mallocs are 16-byte
  aligned here) to avoid Long.hashCode() pitfalls on pointers.

- ConcurrentHash

  This subclasses ConcurrentHashMap and adds the desired interfaces;
  no locking is required. The same shift trick is used for the key
  as above.

- PointerTable

  This is the use-weakreference-as-node implementation mentioned in
  the last post - a simple single-threaded chained hash table
  implementation. I just synchronised all the entry points. A load
  factor of 2.0 is used for further memory savings.

- CopyHash

  This is a novel(?) approach in which all modifications to the
  bucket chains are implemented by copying the whole linked list of
  nodes and replacing the table entry with the new list using
  compare and set (CAS). A couple of special cases can avoid any
  copies. A CAS failure means someone else updated the chain, so it
  simply retries. Some logic akin to a simplified version of what
  ConcurrentHashMap does is used to resize the table when needed,
  potentially concurrently. It uses a load factor of 1.0. (A minimal
  sketch of the copy-and-CAS idea follows this list.)

  I also tested a fast-path insert which doesn't try to find an
  existing value (this occurs frequently with the design) but that
  so overloaded the remove() mechanism it wasn't actually a good
  fit!

- ArrayHash

  This is similar to CopyHash but instead of a linked list of
  container nodes it either stores the value directly in the table,
  or it stores an array of objects. Again any modifications require
  rewriting the table entry, and again resize is handled specially
  and all threads can contribute. This also has a small modification
  in that the retry loop includes a fibonacci-increasing
  Thread.sleep(0,nanos) delay on failure, which may slow down
  wall-clock time but can improve the cpu load.
I had some ideas for other approaches but i've had enough
for now.
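To make the CopyHash description above concrete, here's a heavily stripped-down sketch of the put() path; it's illustrative only (no resize, no special cases) and not the code that was benchmarked:

import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

class CopyHashSketch {
    static final VarHandle TABLE = MethodHandles.arrayElementVarHandle(Node[].class);

    static class Node {
        final long key;
        final Object value;
        final Node next;
        Node(long key, Object value, Node next) {
            this.key = key; this.value = value; this.next = next;
        }
    }

    final Node[] table = new Node[64];

    void put(long key, Object value) {
        int index = (int) (key >>> 4) & (table.length - 1);
        for (;;) {
            Node head = (Node) TABLE.getVolatile(table, index);
            // build a completely new chain containing the new entry and
            // copies of every old entry except one matching this key
            Node copy = new Node(key, value, null);
            for (Node n = head; n != null; n = n.next)
                if (n.key != key)
                    copy = new Node(n.key, n.value, copy);
            // swap the new chain in; failure means another thread won the
            // race and replaced the chain, so just retry
            if (TABLE.compareAndSet(table, index, head, copy))
                return;
        }
    }
}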
Tests
I created a native class which allocates and frees memory of a
given size, and a simple 'work' method which just adds up the
octets within. I ran two tests, one which just allocated 16 bytes which immediately went out of scope, and another which allocated 1024 bytes, added them up (a small cpu-bound task), then let it fall out of scope.
Note that lookups are never tested in this scenario apart from the implicit lookup during get/put. I did implement get() - which is always non-blocking in the case of the two latter implementations (even during a resize) - but I didn't test or use it here.
I then created 8 000 000 objects in a tight loop in one
or more threads. Once the threads finish I invoke System.gc(),
then wait for all objects to be freed. The machine I'm running it
on has 4 cores/8 threads so I tried it with 1 and with 8 threads.
Phew, ok that's a bit of a mouthful that probably doesn't make
sense, bit tired. The numbers below are pretty rough and are from
a single run using openjdk 11+28, with -Xmx1G.
                    8 threads, alloc   8 threads, alloc+sum   1 thread, alloc   1 thread, alloc+sum
                    x 1 000 000        x 1 000 000            x 8 000 000       x 8 000 000
                    Elapsed  User      Elapsed  User          Elapsed  User     Elapsed  User
  SynchronisedHash  8.3      38        11       54            8.4      27       16       40
  ConcurrentHash    6.9      35        11       52            8.2      27       17       38
  PointerTable      7.2      26        13       44            7.9      17       19       29
  CopyHash          6.6      31        8.2      42            8.1      24       18       33
  ArrayHash         6.0      28        8.2      39            8.5      23       16       24
  ArrayHash*        6.9      23        12       30            8.1      21       17       23

  * indicates using a delay in the retry loop for the remove() call
To be honest I think the only real conclusion is that this machine
doesn't have enough threads for this task to cause a bottleneck in
the hash table! Even in the most contested case (alloc only)
simple synchronised methods are just about as good as anything
else. And whilst this is representative of a real life scenario, it's just a bloody pain to properly test high-concurrency code and i'm just not that into it for a hobby (the journey not the destination and so forth).
I haven't shown the numbers above but in the case of the Copy and
Array implementations I count how many retries are required for
both put and get calls. In the 8-thread cases where there is no
explicit delay it can be in the order of a million times! And yet
they still run faster and use less memory. Shrug.
All of the non java.util based implementations also benefit in
both time and space from using the primitive key directly without
boxing and storing the key as part of the containee object and not
having to support the full Collections and Streams interfaces.
PointerTable also benefits from fewer gc passes due to not
needing any container nodes and having a higher load factor.
BTW one might note that this is pretty bloody slow compared to C or pure Java - there are undeniably high overheads in the JNI back-and-forths.

The other thing of note is that, well, hashtables are hashtables - they're all pretty good at what they're doing here because of the algorithm they share. There's not really all that much practical difference between any of them.
But why?
I'm not sure what possessed me to look into it this deeply but I've
done it now. Maybe i'll post the code a bit later, it may have
enough bugs to invalidate all the results but it was still a
learning experience.
For the lockless algorithms I made use of VarHandles (Java 9's
'safe' interface to Unsafe) to do the CAS and other operations,
and some basic monitor locking for coordinating a resize pass.
The idea of 'read, do work, fail and retry' is something I
originally learnt about using a hardware feature of the CELL SPU
(and Power based PPU). On that you can reserve a memory location
(128 bytes on the SPU, 4 or 8 on the PPU), and if the reservation is lost by the time you write back to it the write fails, so you know it failed and can retry. So rather than [spin-lock-doing-nothing WORK unlock], you spin on [reserve WORK write-or-retry]. I guess it's a speculative reservation. It's not quite as clean when
using READ work CAS (particularly without the 128 byte blocks on
the SPU!) but the net result is (mostly) the same. One significant
difference is that you actually have to write a different value
back and in this instance being able to merely indicate change
could have been useful.
ConcurrentHashMap does something similar but only for the first
insert into an empty hash chain, after that it locks the entry and
only ever appends to the chain.
Some of the trickiest code was getting the resize mechanism to synchronise across threads, but I really only threw it together without much thought using object monitors and AtomicInteger. Occasionally i'm getting hangs but they don't appear to make sense: some number of threads will be blocked waiting to enter a
synchronised method, while a few others are already on a wait()
inside it, all the while another thread is calling it at will -
without blocking. If I get keen i'll revisit this part of the
code.
Other JNI bits/ NativeZ, jjmpeg.
Yesterday I spent a good deal of time continuing to experiment and
tune NativeZ. I also ported the latest version of jjmpeg to a
modularised build and to use NativeZ objects.
Hashing C Pointers
C pointers obtained by malloc are aligned to 16-byte boundaries on
64-bit GNU systems. Thus the lower 4 bits are always zero. Standard
malloc also allocates a contiguous virtual address range which is
extended using sbrk(2) which means the upper bits rarely change. Thus
it is sufficient to generate a hashcode which only takes into account
the lower bits (excluding the first 4).
I did some experimenting with hashing the C pointer values using
various algorithms,
from Knuth's
Magic Number to various integer hashing algorithms
(e.g. hash-prospector),
to Long.hashCode(), to a simple shift (both 64-bit and 32-bit).
The performance analysis was based on Chi-squared distance between
the hash chain lengths and the ideal, using pointers generated
from malloc(N) for different fixed values of N for multiple runs.
Although it wasn't the best statistically, the best performing algorithm was a simple 32-bit, 4-bit shift due to its significantly lower cost. And typically it compared quite well statistically regardless.
static int hashCode(long p) {
return (int)p >>> 4;
}
In the nonsensical event that 28 bits are not sufficient for the hash bucket index, it can be extended to 32 bits:
static int hashCode(long p) {
    return (int)(p >>> 4);
}
And despite all the JNI and reflection overheads, using the two-round
function from the hash-prospector project increased raw execution time
by approximately 30% over the trivial hashCode() above.
Whilst it might not be ideal for 8-bit aligned allocations it's
probably not that bad either in practice. One thing I can say for
certain though is NEVER use Long.hashCode() to hash C pointers!
Concurrency
I also tuned the use of synchronisation blocks very slightly to
make critical sections as short as possible whilst maintaining
correct behaviour. This made enough of a difference to be worth
it.
I also tried more complex synchronisation mechanisms - read-write locks, hash bucket row-locks and so on, but they were at best a bit slower than using synchronized{}.
The benchmark I was using wasn't particularly fantastic - just one thread creating 10^7 `garbage' objects in a tight loop whilst the cleaner thread freed them. No resolution of existing objects, no multiple threads, and so on. But apart from the allocation rate it isn't an entirely unrealistic scenario either and I was just trying to identify raw overheads.
Reflection
I've only started looking at the reflection used for allocating and releasing objects on the Java side, and in isolation these are the highest costs of the implementation.
There are ways to reduce these costs but at the expense of extra
boilerplate (for instantiation) or memory requirements (for
release).
Still ongoing. And whilst the relative cost over C is very high,
the absolute cost is still only a few hundred nanoseconds per
object.
From a few small tests it looks like the maximum I could achieve is a 30% reduction in object instantiation/finalisation costs, but I don't think it's worth the effort or overheads.
Makefile foo
I'm still experimenting with this. I used some macros and implicit rules to get most things building ok, but i'm not sure it couldn't be better. The basic makefile is working ok for multi-module stuff so I think i'm getting there. Most of the work is just done by the jdk tools as they handle modules and so on quite well and mostly dictate the disk layout.
I've broken jjmpeg into 3 modules - the core, the javafx related
classes and the awt related classes.
GC JNI, HashTables, Memory
I had a very busy week with work, porting libraries and applications to Java modules - that wasn't really the busy part; I also looked into making various implementations pluggable using services and then creating various pluggable implementations, often utilising native code. Just having some (much faster) implementations of parts also opened other opportunities and it sort of cascaded from there.
Anyway, along the way I revisited my implementation of Garbage Collection with JNI and started working on a modular version that can be shared between libraries without having to copy the core object, and then along the way found bugs and things to improve.
Here are some of the more interesting pieces I found along the way.
JNI call overheads
The way i'm writing jni these days is typically to just write the method signature as if it were a Java method and mark it native, letting the jni handle Java to C mappings directly. This is different to how I first started doing it and flies in the face of the convention i've typically seen amongst JNI implementations, where the Java just passes the pointers as a long and has a wrapper function which resolves these longs as appropriate.
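In Java-only terms the difference looks something like this (class and method names invented for illustration):

// conventional style: Java resolves the native pointer and every native
// entry point takes a long, usually behind a wrapper method
class FrameA {
    long p;                             // C pointer held as a long
    int decode(byte[] data) { return decode(p, data); }
    private static native int decode(long p, byte[] data);
}

// the style described here: mark the method native directly and let the
// C side pull the pointer out of the object with a field lookup
class FrameB {
    final long p;                       // resolved by the C code
    FrameB(long p) { this.p = p; }
    native int decode(byte[] data);
}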
The primary reason is to reduce boilerplate and significantly simplify the Java class writing without having a major impact on performance. I have done some performance testing before but I re-ran some tests and they confirm the design decisions used in zcl for example.
Array Access
First, I tested some mechanisms for accessing arrays. I passed
two arrays to a native function and had it perform various tests:
- No op;
- GetPrimitiveArrayCritical on both arrays;
- GetArrayElements for read-only arrays (call Release(ABORT));
- GetArrayElements for read-only on one array and read-write on the other (call Release(ABORT, COMMIT));
- GetArrayRegion for read-only, to memory allocated using alloca;
- GetArrayRegion and SetArrayRegion for one array, to memory using alloca;
- GetArrayRegion for read-only, to memory allocated using malloc;
- GetArrayRegion and SetArrayRegion for one array, to memory using malloc.
I then ran these tests for different sized float[] arrays, for
1 000 000 iterations, and the results in seconds are below. It's some intel laptop.
Size  0:NOOP       1:Critical   2:Elements   3:Elements   4:Region/    5:Region/    6:Region/    7:Region/
      (none)       (both)       (RO)         (RO+RW)      alloca (RO)  alloca (RW)  malloc (RO)  malloc (RW)
1 0.014585537 0.116005779 0.199563981 0.207630731 0.104293268 0.127865782 0.185149189 0.217530639
2 0.013524620 0.118654092 0.201340322 0.209417471 0.104695330 0.129843794 0.193392346 0.216096210
4 0.012828157 0.113974453 0.206195102 0.214937432 0.107255090 0.127068808 0.190165219 0.215024016
8 0.013321001 0.116550424 0.209304277 0.205794572 0.102955338 0.130785133 0.192472825 0.217064583
16 0.013228272 0.116148320 0.207285227 0.211022409 0.106344162 0.139751496 0.196179709 0.222189471
32 0.012778452 0.119130446 0.229446026 0.239275912 0.111609011 0.140076428 0.213169077 0.252453033
64 0.012838540 0.115225274 0.250278658 0.259230054 0.124799171 0.161163577 0.230502836 0.260111468
128 0.014115022 0.120103332 0.264680542 0.282062633 0.139830967 0.182051151 0.250609001 0.297405818
256 0.013412645 0.114502078 0.315914219 0.344503396 0.180337154 0.241485525 0.297850562 0.366212494
512 0.012669807 0.117750316 0.383725378 0.468324904 0.261062826 0.358558946 0.366857041 0.466997977
1024 0.013393850 0.120466096 0.550091063 0.707360155 0.413604094 0.576254053 0.518436072 0.711689270
2048 0.013493996 0.118718871 0.990865614 1.292385065 0.830819392 1.147347700 0.973258653 1.284913436
4096 0.012639675 0.116153318 1.808592969 2.558903773 1.628814486 2.400586604 1.778098089 2.514406096
Some points of note:
- Raw method invocation is around 14 nanoseconds, pretty much
irrelevant once you do any work.
- Get/SetArrayElements is pretty much the same as using Get/SetArrayRegion with malloc but with less flexibility.
- For small arrays 2 calls to malloc/free is nearly 50% of the
processing time. Given the gay abandon with which most C
programmers throw these around like they cost nothing, the extra
JNI overhead is modest.
- For larger arrays memcpy time dominates.
- For one way transfers shorter than 64 float using
Get/SetRegion to the stack or pre-allocated memory is the fastest.
- For all other cases including any-sized two-way transfers,
GetPrimitiveArrayCritical is the fastest. But it has other
overheads and isn't always applicable.
I didn't look at ByteBuffer because it doesn't really fit what i'm
doing with these functions.
Anyway - the overheads are unavoidable with JNI but are quite
modest. The function in question does nothing with the data and
so any meaningful operation will quickly dominate the processing
time.
Object Pointer resolution
The next test I did was to compare various mechanisms for
transferring the native C pointer from Java to C.
I created a Native object with two long fields, a final long p and a plain long q.
- No operation;
- C invokes getP() method which returns p;
- C invokes getQ() method which returns q;
- C access to .p field;
- C access to .q field;
- The native signature takes a pointer directly, call it resolving the .p field in the caller;
- The native signature takes a pointer directly, call it resolving the .p field via a wrapper function.
Again invoking it 1 000 000 times.
  Test          Time (s)
  0 NOOP        0.016606942
  1 getP()      0.293797182
  2 getQ()      0.294253973
  3 (C).p       0.020146810
  4 (C).q       0.020154508
  5 (J).p       0.015827028
  6 J wrapper   0.016979563
- final makes no difference.
- method invocation is 15x slower than a field lookup!
- Field lookups are much slower in C than Java, but the absolute cost is insignificant at ~2.5ns per lookup.
In short, just passing Java objects directly and having the C
resolve the pointer via a field lookup is slightly slower but
requires much less boilerplate and so is the preferred solution.
Logger
After I sorted out the basic JNI mechanisms I started looking at
the reference tracking implementation (i'll call this NativeZ from
here on).
For debugging and trying to be a more re-usable library I had added
logging to various places in the C code using
Logger.getLogger(tag).fine(String.format(...));
It turns out this was really not a wise idea and the logging calls
alone were taking approximately 50% of the total execution time -
versus java to C to java, hashtable lookups and synchronisation
blocks.
Simply changing to use the Supplier versions of the logging
functions approximately doubled the performance.
Logger.getLogger(tag).fine(String.format(...));
->
Logger.getLogger(tag).fine(() -> String.format(...));
But I also decided to make including any of the code optional by bracketing each call with a test against a static final boolean compile-time constant.
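i.e. something along these lines, assuming a DEBUG constant; when the constant is false javac drops the guarded branch entirely, so even building the lambda costs nothing:

import java.util.logging.Logger;

class NativeLog {
    static final boolean DEBUG = false;
    static final Logger log = Logger.getLogger("notzed.nativez");

    static void freed(long p) {
        // the whole branch is compiled out when DEBUG is a constant false
        if (DEBUG)
            log.fine(() -> String.format("free %016x", p));
    }
}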
This checking indirectly confirmed that the reflection invocations aren't particularly onerous assuming they're doing any work.
HashMap<Long,WeakReference>
Now the other major component of the NativeZ object tracking is
using a hash-table to map C pointers to Java objects. This serves
two important purposes:
- Allows the Java to resolve separate C pointers to the same object;
- Maintains a hard reference to the WeakReference, without
which they just don't work.
For simplicity I just used a HashMap for this purpose. I knew it
wasn't ideal but I did the work to quantify it.
Using jol
and perusing the source I got some numbers for a jvm using
compressed oops and an 8-byte object alignment.
Object           | Size | Notes
HashMap.Node     |  32  | Used for short hash chains.
HashMap.TreeNode |  56  | Used for long hash chains.
Long             |  24  | The node key.
CReference       |  48  | The node value. Subclass of WeakReference.
Thus the incremental overhead for a single C object is either 104 bytes when a linear hash chain is used, or 128 bytes when a tree is used.
Actually it's a bit more than that because the hashtable (by default) uses a 75% load factor so it also allocates 1.5 pointers for each object, but that's neither here nor there and also a feature of the algorithm regardless of implementation.
But there are other bigger problems: the Long.hashCode() method just mixes the low and high words together using xor. If all C pointers are 8 (or worse, 16) byte aligned you essentially only get every 8th (or 16th) bucket ever in use. So apart from the wasted buckets, the HashMap is very likely to end up using Trees to store each chain.
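A quick throwaway demo of the effect, using the raw hash and a power-of-two mask rather than HashMap internals:

class AlignmentDemo {
    // Long.hashCode() is (int)(v ^ (v >>> 32)), so for 16-byte aligned
    // pointers the low 4 bits of every hash are zero and only 1 in 16
    // buckets of a power-of-two table can ever be used.
    public static void main(String[] args) {
        int buckets = 256;
        boolean[] used = new boolean[buckets];
        for (long p = 0x7f0000000000L; p < 0x7f0000000000L + 16 * 10000; p += 16)
            used[Long.hashCode(p) & (buckets - 1)] = true;
        int n = 0;
        for (boolean b : used) if (b) n++;
        System.out.println(n + " of " + buckets + " buckets used");  // prints 16 of 256
    }
}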
So I wrote another hashtable implementation which addresses this
by using the primitive long stored in the CReference directly as
the key, and using the CReference itself as the bucket nodes. I
also used a much better hash function. This reduced the memory
overhead to just the 48 bytes for the CReference plus a (tunable)
overhead for the root table - anywhere from 1/4 to 1 entry per
node works quite well with the improved hash function.
This uses less memory and runs a bit faster - mostly because the
gc is run less often.
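The shape of it is roughly as follows - a simplified sketch with assumed names, ignoring resize and the concurrent variants discussed earlier:

import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;

class PointerTableSketch {
    static class CReference extends WeakReference<Object> {
        final long p;          // the C pointer, used directly as the key
        CReference next;       // chain link - no separate node objects
        CReference(Object referent, long p, ReferenceQueue<Object> q) {
            super(referent, q);
            this.p = p;
        }
    }

    final CReference[] table = new CReference[256];

    synchronized Object get(long p) {
        for (CReference r = table[(int) (p >>> 4) & (table.length - 1)]; r != null; r = r.next)
            if (r.p == p)
                return r.get();
        return null;
    }

    synchronized void put(CReference ref) {
        int i = (int) (ref.p >>> 4) & (table.length - 1);
        ref.next = table[i];
        table[i] = ref;
    }
}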
notzed.nativez
So i'm still working on wrapping this all up in a module notzed.nativez which will include the Java base class and a shared library for other JNI libraries to link to, providing the (trivial) interface to the NativeZ object and some helpers for writing small and robust JNI libraries.
And then of course eventually port jjmpeg and zcl to use it.
Copyright (C) 2019 Michael Zucchi, All Rights Reserved.
Powered by gcc & me!