Putting things back together
My brother was here a few weeks ago and I took the opportunity of having transport (I don't drive - just never got a license) to get a few things that are a little difficult to transport on the bicycle.
One was to take my old VAF DC-7 Rev 1.0 speakers and get them
reconditioned. It wasn't exactly cheap but they replaced the old
drivers (required a bigger hole) and i'm not sure what else. But
I asked to keep the old drivers just to muck around with, perhaps
build a 'portable speaker' type thing out of them. They are a bit
scratchy from uh, over-use, but I noticed the main problem is the
surrounds had perished. I looked up some info about replacing
them and ended up ordering some cheapies from China - i'm not sure
I can recover them regardless and it's not worth the cost if I
fail.
Part of the process is removing the old rubber, and about all I
can say about it is you have to be patient. I used a very sharp
chisel and it took a couple of hours just to remove one, although
the final 1/4 went much faster than the first once I got the
technique sorted out.
Tools
Another thing I got was a welder, small drill press and some other tools. And that led to a bit of a spending spree that continued after he left, buying a bunch of other workshop tools. They're mostly cheap bits and pieces because i'm not sure how much use i'll get out of them.
It wasn't the reason I got the welder but the first thing I
thought of making was a belt buckle for a very wide kilt belt. I
had asked a small leatherwork shop (shoe repair, belts) in the city whether they could make wide belts but they couldn't and
directed me to a saddlesmith at the other end of town. I dropped
in one day and asked about it - yeah he could make any width belt,
but he didn't have any buckles suitable. So the next weekend I
got the welder out and turned a couple of pieces of wine barrel
ring into a rather large belt buckle.
The welding is pretty shithouse but I haven't welded for years and
the grinding and polishing hides most of it. The buckle has a
loop through which the belt connects and a single pin which
selects the size. The end of the belt comes around out the front
(or can go behind) and a loop holds it in place.
The front finish is a sort of coarse brushed/dented appearance
from using an angle grinder, wire brush and polishing wheel. I'm
still not sure on the finish but I will probably clear coat it.
The belt I got made up is 70mm wide and the loop which connects to
the buckle uses press-studs so I can make more buckles and easily
replace them. 70mm is about the widest that suits the kilts I
have which is just as well because I didn't really know what it
would look like till it was made.
Not particularly cheap either at $99 but at least it's locally
made and very solid leather - it should last forever. My existing
belt was a relatively 'cheap' one from utkilts that uses velcro to adjust the length. But the finish is wearing already (mostly creases), the velcro is coming off around the back, and the buckle - while ok - is a bit cheap. The actual bits are made in Pakistan I believe but the shipping costs from the USA are outrageous - they seem to have gotten slightly better, but to get the same belt ($26) and buckle ($17) again would cost $77 once shipping is added, and they don't handle GST (I have no idea whether this means you get hit with import hassles above the $7 GST cost). And that's the absolute cheapest/slowest option; it ranges up to $130!
Actually I have an idea for another buckle mechanism that I might
try out when I get time, if I do that I might get a 60mm belt
made. The other idea would be a lower profile, hiding the tail of the belt (wrap under), and possibly being hole-less if I can create a binding mechanism that won't damage the front surface of the belt.
And today I turned one of the practice pieces of barrel strap into
a bottle opener. I drilled some holes and filed out the opening
shape by hand.
The finish is a little shit because I cut a bit deep using a
sanding disc on an angle grinder and gave up trying to sand it
out, although it is a lot shinier than it appears in the photo.
Partly I was experimenting with finishes and patterns and i'm
happier with the pattern here, or at least the general approach.
I created a round ended punch from an old broken screwdriver and
used a small portable jackhammer (/ hammer drill) to pound in the
dots. Because i'm just cold working it's a bit difficult to do
much in the pattern department.
Gets me away from the computer screen anyway, i'm kinda burnt out
on that. I'm not really doing enough hours lately for the guy who
pays me (for various reasons beyond our control he's got more
money than hours I want to work!), although the customers who pay
him are quite happy with the output!
I'm at the point where I finally need to get some glasses. I had
another eye test last week and while I can still survive without them, it's to the point that i'm not recognising people from afar and squinting a bit too much when reading at times. I need separate reading and distance scripts unfortunately, so I got a pair of
sunnies for distance and reading glasses for work. They would've
helped with the fabrication above! It's going to take a while to
get used to them, and/or work out whether I get progressive lenses
or whatnot, for example I can't read games very well on my TV, but
that's farther away than the reading glasses work at, sigh. I
guess i'll find out in a week or so.
Pulling things apart
I pulled an old deskjet printer apart the other day. It wasn't a
particularly expensive machine and it broke years ago; it's just
been sitting in the corner of my room collecting dust for the day
I threw it out or took out the useful bits. I guess what I found
most interesting is despite it being a disposable item just how
well put together it was.
- The main guide bar is a very solid piece of machined stainless steel. I always thought it was just a tube.
- Almost all parts could be removed by hand apart from a handful of pieces held by torx screws.
- All the metal parts could be easily removed from the plastic parts - springs, pins, rails and so on.
Basically it looks like it was designed to be repairable. Can't
imagine the equivalent today would be.
Flymo
I also have a more recent machine that still runs, a mains powered
electric hover mower. It's one that has the motor on a separate
spindle to the blade, which is driven by a belt. For a few years it's been 'running' rough due to bung bearings; I had looked at it before but it looked irreparable. The whole base-plate, clutch
and drive pulley assembly can be bought as a FRU but it must be
ordered from England. The last time I tried I couldn't get the
payment to go through and so i've just been living with a mower on
the edge of self-destruction.
Today it was getting so bad I finally had another look at it. And
lo! I managed to get the bearings out, although it took over two hours with the tools I have at hand and an awful lot of swearing!
Anyway the two bearings should be easy to source and hopefully
it'll be back up and running once I put it back together. It
would be a pity to get a whole new mower just because a couple of small cheap parts failed, and getting it repaired would've been prohibitively expensive. Hate wasting stuff.
Apart from all that I took a few days off and have been doing a
lot of gardening, preparing vegetable gardens and whatnot.
Hopefully a year or so of lying basically fallow will work in my favour. I need some more exotic chillies, and home-grown tomatoes can't be beat.
Beer for the Win!
So apparently I won one of these things.
Just in time for summer!
I might have a beer to celebrate! I spent the morning pulling out
weeds so I earned it!
Fast incremental Java builds with openjdk 11 and GNU make
This post won't match the article but I think i've solved all the
main problems needed to make it work. The only thing missing is
ancestor scanning - which isn't trivial but should be
straightforward.
Conceptually it's quite simple and it doesn't take much code but
bloody hell it took a lot of mucking about with make and the javac
TaskListener.
I took the approach I outlined yesterday; I did try to get more out of the AST but couldn't find the info I needed. The module system is making navigating source-code a pain in Netbeans (it won't find modules if the source can't be found). Some of the 'easy' steps turned out to be a complete headfuck. Anyway, some points of interest.
Dependency Tracking
Even though a java file can create any number of classes, one doesn't need to track any of the non-top-level .class files it might create for dependency purposes. Any time a .java file is compiled, all of its generated classes are created at the same time. So if a java file uses, for example, a nested class, only the source file needs to be tracked.
I didn't realise this at first and it got messy fast.
Modified Class List
I'm using a javac plugin (-Xplugin) to track the compilation via a
TaskListener. This notifies the plugin of various compilation
stages including generating class files. The painful bit here is
that you don't get information on the actual class files
generated, only on the source file and the name of the class being
generated. And you can't get the actual name of the class file
for anonymous inner classes (it's in the implementation but hidden
from public view). In short it's a bit messy getting a simple and
complete list of every class file generated from every java file
compiled.
But for various other reasons this isn't terribly important so I
just track the toplevel class files; but it was a tedious
discovery process on a very poorly documented api.
When the compiler plugin gets the COMPILATION finished event it uses the information it gathered (and more it discovers) to generate per-class dependency files similar to `gcc -MD'.
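For reference, a rough sketch of how such a plugin hooks in - this is only an illustration of the API, not the actual plugin; the service registration and the dependency-file writing are elided:

import com.sun.source.util.JavacTask;
import com.sun.source.util.Plugin;
import com.sun.source.util.TaskEvent;
import com.sun.source.util.TaskListener;

public class JavaDependPlugin implements Plugin {

    public String getName() {
        return "javadepend";            // matches -Xplugin:javadepend
    }

    public void init(JavacTask task, String... args) {
        task.addTaskListener(new TaskListener() {
            public void finished(TaskEvent e) {
                switch (e.getKind()) {
                case GENERATE:
                    // only the source file and the type name are available,
                    // not the physical .class file name
                    record(e.getSourceFile().getName(),
                        e.getTypeElement().getQualifiedName().toString());
                    break;
                case COMPILATION:
                    // everything has been compiled; write the per-class
                    // dependency files and purge stale classes here
                    writeDependencyFiles();
                    break;
                default:
                    break;
                }
            }
        });
    }

    void record(String source, String type) { /* gather per-source type lists */ }
    void writeDependencyFiles() { /* emit gcc -MD style .d files */ }
}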
Dependency Generation & Consistency
To find all the (immediate) dependencies the .class file is
processed. The ClassInfo records provide a starting point but all
field and method signatures (descriptors) must be parsed as well.
When an inner class is encountered its containing class is used to determine whether the inner class is still extant in the source code - if not it can be deleted.
And still this isn't quite enough - if you have a package private
additional class embedded inside the .java file there is no
cross-reference between the two apart from the SourceFile
attribute and implied package path. So to determine if this is
stale one needs to check the Modified Class List instead.
The upshot is that you can't just parse the modified class list
and any inner classes that reference them. I scan a whole package
at a time and then look for anomalies.
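As an aside, extracting class references from descriptors is mechanical; a minimal sketch of the idea (not the code used here) is:

import java.util.ArrayList;
import java.util.List;

class Descriptors {
    // Pull the class names out of a field or method descriptor such as
    // "(Ljava/util/List;[[Ljava/lang/String;J)V". Every "L...;" token names a
    // referenced class; primitives and array markers are skipped over.
    static List<String> classesOf(String descriptor) {
        List<String> found = new ArrayList<>();
        for (int i = 0; i < descriptor.length(); i++) {
            if (descriptor.charAt(i) == 'L') {
                int end = descriptor.indexOf(';', i);
                found.add(descriptor.substring(i + 1, end).replace('/', '.'));
                i = end;
            }
        }
        return found;
    }
}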
One-shot compile
Because invoking the compiler is slow - but also because it will
discover and compile classes as necessary - it's highly beneficial
to run it once only. Unfortunately this is not how make works and
so it needs to be manipulated somewhat. After a few false starts I found a simple way that works, shown in the example further below.
The per-module rules are required due to the source-tree naming
conventions used by netbeans (src/[module]/classes/[name] to
build/modules/[module]/[name]), a common-stem based approach is
also possible in which case it wouldn't be required. In practice
it isn't particularly onerous as I use metamake facilities to
generate these per-module rules automatically.
I spent an inordinate amount of time trying to get this to work but kept hitting puzzling (but documented) behaviour with pattern and implicit rule chaining and various other issues. One big one was that when using concrete rules (actual files) to track stages, suddenly everything breaks.

I resorted to just individual javac invocations as one would do for gcc, and trying the compiler server idea to mitigate the costs.
It worked well enough particularly since it parallelises properly.
But after I went to bed I realised i'd fucked up and then spent a
few hours working out a better solution.
Example
This is the prototype i've been using to develop the idea.
modules:=notzed.proto notzed.build

SRCS:=$(shell find src -name '*.java')
CLASSES:=$(foreach mod,$(modules),\
	$(patsubst src/$(mod)/classes/%.java,classes/$(mod)/%.class,$(filter src/$(mod)/%,$(SRCS))))

all: $(CLASSES)
	lists='$(foreach mod,$(modules),$(wildcard status/$(mod).list))' ; \
	built='$(patsubst %.list,%.built,$(foreach mod,$(modules),$(wildcard status/$(mod).list)))' ; \
	files='$(addprefix @,$(foreach mod,$(modules),$(wildcard status/$(mod).list)))' ; \
	if [ -n "$$built" ] ; then \
		javac -Xplugin:javadepend --processor-module-path classes --module-source-path 'src/*/classes' -d classes $$files ; \
		touch $$built ; \
		rm $$lists ; \
	else \
		echo "All classes up to date" ; \
	fi

define makemod=
classes/$1/%.class: src/$1/classes/%.java
	$$(file >> status/$1.list,$$<)

$1: $2
	if [ -f status/$1.list ] ; then \
		javac --module-source-path 'src/*/classes' -d classes @status/$1.list ; \
		rm status/$1.list ; \
		touch status/$1.built ; \
	fi
endef

$(foreach mod,$(modules),$(eval $(call makemod,$(mod),\
	$(patsubst src/$(mod)/classes/%.java,classes/$(mod)/%.class,$(filter src/$(mod)/%,$(SRCS))))))

-include $(patsubst classes/%,status/%.d,$(CLASSES))
In addition there is a compiler plugin which is about 500 lines of
standalone java code. This creates the dependency files (included
at the end above) and purges any stale .class files.
I still need to work out a few details with ancestor dependencies
and a few other things.
Java 11 Modules, Building
I'm basically done modularising the code at work - at least the
active code. I rather indulgently took the opportunity to do a
massive cleanup - pretty well all the FIXMEs, should be FIXMEs,
TODOs, dead-code and de-duplication that's collected over the last
few years. Even quite a few 'would be a bit nicers'. It's not
perfect but it was nice to be able to afford the time to do it.
I'm still trying to decide if I break the projects up into related
super-projects or just put everything in the single one as
modules. I'm aiming toward the latter because basically i'm sick
of typing "cd ../blah" so often, and Netbeans doesn't recompile
the dependencies properly.
I'm going to reset the repository and try using git. I don't like
it but I don't much like mercurial either.
Building
At the moment I have a build system which uses make and compiles
at the module level - i.e. any source changes and the whole module
is recompiled, and one can add explicit module-module dependencies
to control the build order and ensure a consistent build.
One reason I do this is because there is no 1:1 correspondence between build sources and build classes. If you add or remove nested or anonymous (or co-located) classes from a source file, that changes which .class files are generated. So to ensure there are no stale classes I just reset it on every build.
This isn't too bad and absolutely guarantees a consistent build
(assuming one configures the inter-module dependencies properly)
but the compiler is still invoked multiple times which has
overheads.
Building Faster
Really the speed isn't a problem for these projects but out of
interest i'm having a look at a couple of other things.
One is relatively simple - basically use JSR-199 to create a compiler server, something like the one the jdk uses to build itself.
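The gist of the JSR-199 side is small; a hand-wavy sketch, where the paths and options are placeholders and error handling is elided:

import java.io.File;
import java.util.List;
import javax.tools.JavaCompiler;
import javax.tools.JavaFileObject;
import javax.tools.StandardJavaFileManager;
import javax.tools.ToolProvider;

class CompilerServer {
    // hold the compiler and file manager in a long-lived process so repeated
    // requests avoid the jvm + javac start-up cost
    final JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
    final StandardJavaFileManager fileManager =
        compiler.getStandardFileManager(null, null, null);

    boolean compile(List<File> sources) {
        Iterable<? extends JavaFileObject> units =
            fileManager.getJavaFileObjectsFromFiles(sources);
        JavaCompiler.CompilationTask task = compiler.getTask(
            null,            // default output writer
            fileManager,
            null,            // default diagnostic listener (prints to stderr)
            List.of("--module-source-path", "src/*/classes", "-d", "classes"),
            null,            // no classes for annotation processing
            units);
        return task.call();
    }
}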
The more complicated task is incremental builds using GNU Make. I
think I should be able to hook into JavacTask and with a small bit
of extra code create something akin to the "gcc -MD" option for
auto-generating dependencies. It has the added complication of
having to detect and remove stale .class files, and doing it all
in a way that make understands. I've already done a few little
experiments today while I was procrastinating over some weeding.
Using JavacTask it is possible to find out all the .class files
that are generated for a given source file. This is one big part
of the puzzle and covers the first-level dependencies
(i.e. %.class: %.java plus all the co-resident classes). One can
also get the AST and other information but that isn't necessary
here.
To find the other dependencies I wrote a simple class file decoder
which finds all the classes referenced by the binary. Some
relatively simple pattern matching and name resolution should be
able to turn this into a dependency list.
Actually it may not be necessary to use JavacTask for this because
the .class files contain enough information. There is extra
overhead because they must be parsed, but they are simple to
parse.
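For example, the constant pool alone already yields most of the referenced classes; a cut-down sketch of such a decoder (assumed names, descriptor parsing omitted):

import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.HashSet;
import java.util.Set;

class ClassScanner {
    // Walk the constant pool and collect every CONSTANT_Class entry. Field and
    // method descriptors would need the same treatment for a complete
    // dependency list; this only shows the pool scan.
    static Set<String> referencedClasses(InputStream in) throws IOException {
        DataInputStream d = new DataInputStream(in);
        d.readInt();                    // 0xCAFEBABE magic
        d.readUnsignedShort();          // minor version
        d.readUnsignedShort();          // major version
        int count = d.readUnsignedShort();
        String[] utf8 = new String[count];
        int[] classIndex = new int[count];
        for (int i = 1; i < count; i++) {
            int tag = d.readUnsignedByte();
            switch (tag) {
            case 1: utf8[i] = d.readUTF(); break;                       // Utf8
            case 7: classIndex[i] = d.readUnsignedShort(); break;       // Class
            case 8: case 16: case 19: case 20:
                d.readUnsignedShort(); break;       // String, MethodType, Module, Package
            case 15: d.readUnsignedByte(); d.readUnsignedShort(); break; // MethodHandle
            case 5: case 6: d.readLong(); i++; break;   // Long, Double take two slots
            default: d.readInt(); break;                // all remaining tags are 4 bytes
            }
        }
        Set<String> classes = new HashSet<>();
        for (int i = 1; i < count; i++)
            if (classIndex[i] != 0)
                classes.add(utf8[classIndex[i]].replace('/', '.'));
        return classes;
    }
}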
Concurrent Hash Tables
So I got curious about whether the GC in NativeZ would cause any
bottlenecks in highly contested situations - one I already faced
with an earlier iteration of the library. The specific case I had
was running out of memory when many threads were creating
short-lived native objects; the single thread consuming values
from the ReferenceQueue wasn't able to keep up.
The out of memory situation was fairly easily addressed by just
running a ReferenceQueue poll whenever a new object is created,
but I was still curious about the locking overheads.
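Something along these lines (the names are made up, not the actual NativeZ code) is enough to stop the queue from growing without bound:

import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;

class NativeRegistry {
    static final ReferenceQueue<Object> queue = new ReferenceQueue<>();

    static class CReference extends WeakReference<Object> {
        final long p;                   // the C pointer this object wraps
        CReference(Object referent, long p) {
            super(referent, queue);
            this.p = p;
        }
        void release() {
            // would call the native free() for 'p' here
        }
    }

    static CReference register(Object owner, long p) {
        // opportunistically free anything the GC has already queued, so
        // allocation-heavy threads help the cleaner keep up
        CReference dead;
        while ((dead = (CReference) queue.poll()) != null)
            dead.release();
        return new CReference(owner, p);
    }
}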
A field of tables
I made a few variations of hash tables which support the interface I desired: store a value which also provides the key itself, as a primitive long. So it's more of a set, but with the ability to remove items by the key directly. For now i'll just summarise them.
- SynchronisedHash

  This subclasses HashMap and adds the desired interfaces, each of
  which is synchronised to the object. Rather than use the pointer
  directly as the key it is shifted by 4 (all mallocs are 16-byte
  aligned here) to avoid Long.hashCode() pitfalls on pointers.

- ConcurrentHash

  This subclasses ConcurrentHashMap and adds the desired interfaces;
  no locking is required. The same shift trick is used for the key
  as above.

- PointerTable

  This is the use-weakreference-as-node implementation mentioned in
  the last post - a simple single-threaded chained hash table
  implementation. I just synchronised all the entry points. A load
  factor of 2.0 is used for further memory savings.

- CopyHash

  This is a novel(?) approach in which all modifications to the
  bucket chains are implemented by copying the whole linked list of
  nodes and replacing the table entry with the new list using
  compare and set (CAS). A couple of special cases can avoid any
  copies. A CAS failure means someone else updated the chain, so it
  simply retries. Some logic akin to a simplified version of what
  ConcurrentHashMap does is used to resize the table when needed,
  potentially concurrently. It uses a load factor of 1.0. (A minimal
  sketch of the copy-and-CAS idea follows this list.)

  I also tested a fast-path insert which doesn't try to find an
  existing value (this occurs frequently with the design) but that
  so overloaded the remove() mechanism it wasn't actually a good
  fit!

- ArrayHash

  This is similar to CopyHash but instead of a linked list of
  container nodes it either stores the value directly in the table,
  or it stores an array of objects. Again any modifications require
  rewriting the table entry, and again resize is handled specially
  and all threads can contribute. This also has a small modification
  in that the retry loop includes a fibonacci-increasing
  Thread.sleep(0,nanos) delay on failure, which may slow down
  wall-clock time but can improve the cpu load.
I had some ideas for other approaches but i've had enough
for now.
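To make the CopyHash description above concrete, here's a heavily stripped-down sketch of the put() path; it's illustrative only (no resize, no special cases) and not the code that was benchmarked:

import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

class CopyHashSketch {
    static final VarHandle TABLE = MethodHandles.arrayElementVarHandle(Node[].class);

    static class Node {
        final long key;
        final Object value;
        final Node next;
        Node(long key, Object value, Node next) {
            this.key = key; this.value = value; this.next = next;
        }
    }

    final Node[] table = new Node[64];

    void put(long key, Object value) {
        int index = (int) (key >>> 4) & (table.length - 1);
        for (;;) {
            Node head = (Node) TABLE.getVolatile(table, index);
            // build a completely new chain containing the new entry and
            // copies of every old entry except one matching this key
            Node copy = new Node(key, value, null);
            for (Node n = head; n != null; n = n.next)
                if (n.key != key)
                    copy = new Node(n.key, n.value, copy);
            // swap the new chain in; failure means another thread won the
            // race and replaced the chain, so just retry
            if (TABLE.compareAndSet(table, index, head, copy))
                return;
        }
    }
}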
Tests
I created a native class which allocates and frees memory of a
given size, and a simple 'work' method which just adds up the
octets within. I ran two tests, one which just allocated 16 bytes which immediately went out of scope, and another which allocated 1024 bytes, added them up (a small cpu-bound task), then let it fall out of scope.
Note that lookups are never tested in this scenario apart from the implicit lookup during get/put. I did implement get() - which is always non-blocking in the case of the two latter implementations (even during a resize) - but I didn't test or use it here.
I then created 8 000 000 objects in a tight loop in one
or more threads. Once the threads finish I invoke System.gc(),
then wait for all objects to be freed. The machine I'm running it
on has 4 cores/8 threads so I tried it with 1 and with 8 threads.
Phew, ok that's a bit of a mouthful that probably doesn't make
sense, bit tired. The numbers below are pretty rough and are from
a single run using openjdk 11+28, with -Xmx1G.
                    8 threads, alloc   8 threads, alloc+sum   1 thread, alloc   1 thread, alloc+sum
                    x 1 000 000        x 1 000 000            x 8 000 000       x 8 000 000
                    Elapsed  User      Elapsed  User          Elapsed  User     Elapsed  User
  SynchronisedHash  8.3      38        11       54            8.4      27       16       40
  ConcurrentHash    6.9      35        11       52            8.2      27       17       38
  PointerTable      7.2      26        13       44            7.9      17       19       29
  CopyHash          6.6      31        8.2      42            8.1      24       18       33
  ArrayHash         6.0      28        8.2      39            8.5      23       16       24
  ArrayHash*        6.9      23        12       30            8.1      21       17       23

  * indicates using a delay in the retry loop for the remove() call
To be honest I think the only real conclusion is that this machine
doesn't have enough threads for this task to cause a bottleneck in
the hash table! Even in the most contested case (alloc only)
simple synchronised methods are just about as good as anything
else. And whilst this is representative of a real life scenario, it's just a bloody pain to properly test high-concurrency code and i'm just not that into it for a hobby (the journey not the destination and so forth).
I haven't shown the numbers above but in the case of the Copy and
Array implementations I count how many retries are required for
both put and get calls. In the 8-thread cases where there is no
explicit delay it can be in the order of a million times! And yet
they still run faster and use less memory. Shrug.
All of the non java.util based implementations also benefit in
both time and space from using the primitive key directly without
boxing and storing the key as part of the containee object and not
having to support the full Collections and Streams interfaces.
PointerTable also benefits from fewer gc passes due to not
needing any container nodes and having a higher load factor.
BTW one might note that this is pretty bloody slow compared to C or pure Java - there are undeniably high overheads in the JNI back-and-forths.

The other thing of note is that, well, hashtables are hashtables - they're all pretty good at what they're doing here because of the algorithm they share. There's not really all that much practical difference between any of them.
But why?
I'm not sure what possessed me to look into it this deeply but I've
done it now. Maybe i'll post the code a bit later, it may have
enough bugs to invalidate all the results but it was still a
learning experience.
For the lockless algorithms I made use of VarHandles (Java 9's
'safe' interface to Unsafe) to do the CAS and other operations,
and some basic monitor locking for coordinating a resize pass.
The idea of 'read, do work, fail and retry' is something I
originally learnt about using a hardware feature of the CELL SPU
(and Power based PPU). On that you can reserve a memory location
(128 bytes on the SPU, 4 or 8 on the PPU), and if the reservation is lost by the time you write back to it the write fails, so you know it failed and can retry. So rather than [spin-lock-doing-nothing WORK unlock], you spin on [reserve WORK write-or-retry]. I guess it's a speculative reservation. It's not quite as clean when
using READ work CAS (particularly without the 128 byte blocks on
the SPU!) but the net result is (mostly) the same. One significant
difference is that you actually have to write a different value
back and in this instance being able to merely indicate change
could have been useful.
ConcurrentHashMap does something similar but only for the first
insert into an empty hash chain, after that it locks the entry and
only ever appends to the chain.
Some of the trickiest code was getting the resize mechanism to synchronise across threads, but I really only threw it together without much thought using object monitors and AtomicInteger. Occasionally i'm getting hangs but they don't appear to make sense: some number of threads will be blocked waiting to enter a
synchronised method, while a few others are already on a wait()
inside it, all the while another thread is calling it at will -
without blocking. If I get keen i'll revisit this part of the
code.
Other JNI bits/ NativeZ, jjmpeg.
Yesterday I spent a good deal of time continuing to experiment and
tune NativeZ. I also ported the latest version of jjmpeg to a
modularised build and to use NativeZ objects.
Hashing C Pointers
C pointers obtained by malloc are aligned to 16-byte boundaries on
64-bit GNU systems. Thus the lower 4 bits are always zero. Standard
malloc also allocates a contiguous virtual address range which is
extended using sbrk(2) which means the upper bits rarely change. Thus
it is sufficient to generate a hashcode which only takes into account
the lower bits (excluding the first 4).
I did some experimenting with hashing the C pointer values using
various algorithms,
from Knuth's
Magic Number to various integer hashing algorithms
(e.g. hash-prospector),
to Long.hashCode(), to a simple shift (both 64-bit and 32-bit).
The performance analysis was based on Chi-squared distance between
the hash chain lengths and the ideal, using pointers generated
from malloc(N) for different fixed values of N for multiple runs.
Although it wasn't the best statistically, the best performing algorithm was a simple 32-bit, 4-bit shift due to its significantly lower cost. And typically it compared quite well statistically regardless.
static int hashCode(long p) {
return (int)p >>> 4;
}
In the nonsensical event that 28 bits are not sufficient for the hash bucket index, it can be extended to 32 bits:
static int hashCode(long p) {
    return (int)(p >>> 4);
}
And despite all the JNI and reflection overheads, using the two-round
function from the hash-prospector project increased raw execution time
by approximately 30% over the trivial hashCode() above.
Whilst it might not be ideal for 8-bit aligned allocations it's
probably not that bad either in practice. One thing I can say for
certain though is NEVER use Long.hashCode() to hash C pointers!
Concurrency
I also tuned the use of synchronisation blocks very slightly to
make critical sections as short as possible whilst maintaining
correct behaviour. This made enough of a difference to be worth
it.
I also tried more complex synchronisation mechanisms - read-write locks, hash bucket row-locks and so on, but they were at best a bit slower than using synchronized{}.
The benchmark I was using wasn't particularly fantastic - just one thread creating 10^7 `garbage' objects in a tight loop whilst the cleaner thread freed them. No resolution of existing objects, no multiple threads, and so on. But apart from the allocation rate it isn't an entirely unrealistic scenario either and I was just trying to identify raw overheads.
Reflection
I've only started looking at the reflection used for allocating and releasing objects on the Java side, and in isolation these are the highest costs of the implementation.
There are ways to reduce these costs but at the expense of extra
boilerplate (for instantiation) or memory requirements (for
release).
Still ongoing. And whilst the relative cost over C is very high,
the absolute cost is still only a few hundred nanoseconds per
object.
From a few small tests it looks like the maximum I could achieve is a 30% reduction in object instantiation/finalisation costs, but I don't think it's worth the effort or overheads.
Makefile foo
I'm still experimenting with this. I used some macros and implicit rules to get most things building ok, but i'm not sure it couldn't be better. The basic makefile is working ok for multi-module stuff so I think i'm getting there. Most of the work is just done by the jdk tools as they handle modules and so on quite well and mostly dictate the disk layout.
I've broken jjmpeg into 3 modules - the core, the javafx related
classes and the awt related classes.
GC JNI, HashTables, Memory
I had a very busy week with work, porting libraries and applications to Java modules - that wasn't really the busy part; I also looked into making various implementations pluggable using services and then creating various pluggable implementations, often utilising native code. Just having some (much faster) implementations of parts also opened other opportunities and it sort of cascaded from there.
Anyway, along the way I revisited my implementation of Garbage Collection with JNI and started working on a modular version that can be shared between libraries without having to copy the core object, and then along the way found bugs and things to improve.
Here are some of the more interesting pieces I found along the way.
JNI call overheads
The way i'm writing jni these days is typically to just write the method signature as if it were a Java method and mark it native, letting the jni handle Java to C mappings directly. This is different to how I first started doing it and flies in the face of the convention i've typically seen amongst JNI implementations, where the Java just passes the pointers as a long and has a wrapper function which resolves these longs as appropriate.
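In Java-only terms the difference looks something like this (class and method names invented for illustration):

// conventional style: Java resolves the native pointer and every native
// entry point takes a long, usually behind a wrapper method
class FrameA {
    long p;                             // C pointer held as a long
    int decode(byte[] data) { return decode(p, data); }
    private static native int decode(long p, byte[] data);
}

// the style described here: mark the method native directly and let the
// C side pull the pointer out of the object with a field lookup
class FrameB {
    final long p;                       // resolved by the C code
    FrameB(long p) { this.p = p; }
    native int decode(byte[] data);
}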
The primary reason is to reduce boilerplate and significantly simplify the Java class writing without having a major impact on performance. I have done some performance testing before but I re-ran some tests and they confirm the design decisions used in zcl for example.
Array Access
First, I tested some mechanisms for accessing arrays. I passed
two arrays to a native function and had it perform various tests:
- No op;
- GetPrimitiveArrayCritical on both arrays;
- GetArrayElements for read-only arrays (call Release(ABORT));
- GetArrayElements for read-only on one array and read-write on the other (call Release(ABORT, COMMIT));
- GetArrayRegion for read-only, to memory allocated using alloca;
- GetArrayRegion and SetArrayRegion for one array, to memory using alloca;
- GetArrayRegion for read-only, to memory allocated using malloc;
- GetArrayRegion and SetArrayRegion for one array, to memory using malloc.
I then ran these tests for different sized float[] arrays, for
1 000 000 iterations, and the results in seconds are below. It's some intel laptop.
Size  0:NOOP       1:Critical   2:Elements   3:Elements   4:Region/    5:Region/    6:Region/    7:Region/
      (none)       (both)       (RO)         (RO+RW)      alloca (RO)  alloca (RW)  malloc (RO)  malloc (RW)
1 0.014585537 0.116005779 0.199563981 0.207630731 0.104293268 0.127865782 0.185149189 0.217530639
2 0.013524620 0.118654092 0.201340322 0.209417471 0.104695330 0.129843794 0.193392346 0.216096210
4 0.012828157 0.113974453 0.206195102 0.214937432 0.107255090 0.127068808 0.190165219 0.215024016
8 0.013321001 0.116550424 0.209304277 0.205794572 0.102955338 0.130785133 0.192472825 0.217064583
16 0.013228272 0.116148320 0.207285227 0.211022409 0.106344162 0.139751496 0.196179709 0.222189471
32 0.012778452 0.119130446 0.229446026 0.239275912 0.111609011 0.140076428 0.213169077 0.252453033
64 0.012838540 0.115225274 0.250278658 0.259230054 0.124799171 0.161163577 0.230502836 0.260111468
128 0.014115022 0.120103332 0.264680542 0.282062633 0.139830967 0.182051151 0.250609001 0.297405818
256 0.013412645 0.114502078 0.315914219 0.344503396 0.180337154 0.241485525 0.297850562 0.366212494
512 0.012669807 0.117750316 0.383725378 0.468324904 0.261062826 0.358558946 0.366857041 0.466997977
1024 0.013393850 0.120466096 0.550091063 0.707360155 0.413604094 0.576254053 0.518436072 0.711689270
2048 0.013493996 0.118718871 0.990865614 1.292385065 0.830819392 1.147347700 0.973258653 1.284913436
4096 0.012639675 0.116153318 1.808592969 2.558903773 1.628814486 2.400586604 1.778098089 2.514406096
Some points of note:
- Raw method invocation is around 14 nanoseconds, pretty much
irrelevant once you do any work.
- Get/SetArrayElements is pretty much the same as using Get/SetArrayRegion with malloc but with less flexibility.
- For small arrays 2 calls to malloc/free is nearly 50% of the
processing time. Given the gay abandon with which most C
programmers throw these around like they cost nothing, the extra
JNI overhead is modest.
- For larger arrays memcpy time dominates.
- For one way transfers shorter than 64 float using
Get/SetRegion to the stack or pre-allocated memory is the fastest.
- For all other cases including any-sized two-way transfers,
GetPrimitiveArrayCritical is the fastest. But it has other
overheads and isn't always applicable.
I didn't look at ByteBuffer because it doesn't really fit what i'm
doing with these functions.
Anyway - the overheads are unavoidable with JNI but are quite
modest. The function in question does nothing with the data and
so any meaningful operation will quickly dominate the processing
time.
Object Pointer resolution
The next test I did was to compare various mechanisms for
transferring the native C pointer from Java to C.
I created a Native object with two long fields, a final long p and a plain long q.
- No operation;
- C invokes getP() method which returns p;
- C invokes getQ() method which returns q;
- C access to .p field;
- C access to .q field;
- The native signature takes a pointer directly, call it resolving the .p field in the caller;
- The native signature takes a pointer directly, call it resolving the .p field via a wrapper function.
Again invoking it 1 000 000 times.
  Test          Time (s)
  0 NOOP        0.016606942
  1 getP()      0.293797182
  2 getQ()      0.294253973
  3 (C).p       0.020146810
  4 (C).q       0.020154508
  5 (J).p       0.015827028
  6 J wrapper   0.016979563
- final makes no difference.
- method invocation is 15x slower than a field lookup!
- Field lookups are much slower in C than Java, but the absolute cost is insignificant at ~2.5ns per lookup.
In short, just passing Java objects directly and having the C
resolve the pointer via a field lookup is slightly slower but
requires much less boilerplate and so is the preferred solution.
Logger
After I sorted out the basic JNI mechanisms I started looking at
the reference tracking implementation (i'll call this NativeZ from
here on).
For debugging and trying to be a more re-usable library I had added
logging to various places in the C code using
Logger.getLogger(tag).fine(String.format(...));
It turns out this was really not a wise idea and the logging calls
alone were taking approximately 50% of the total execution time -
versus java to C to java, hashtable lookups and synchronisation
blocks.
Simply changing to use the Supplier versions of the logging
functions approximately doubled the performance.
Logger.getLogger(tag).fine(String.format(...));
->
Logger.getLogger(tag).fine(() -> String.format(...));
But I also decided to make including any of the code optional by bracketing each call with a test against a static final boolean compile-time constant.
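i.e. something along these lines, assuming a DEBUG constant; when the constant is false javac drops the guarded branch entirely, so even building the lambda costs nothing:

import java.util.logging.Logger;

class NativeLog {
    static final boolean DEBUG = false;
    static final Logger log = Logger.getLogger("notzed.nativez");

    static void freed(long p) {
        // the whole branch is compiled out when DEBUG is a constant false
        if (DEBUG)
            log.fine(() -> String.format("free %016x", p));
    }
}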
This checking indirectly confirmed that the reflection invocations aren't particularly onerous assuming they're doing any work.
HashMap<Long,WeakReference>
Now the other major component of the NativeZ object tracking is
using a hash-table to map C pointers to Java objects. This serves
two important purposes:
- Allows the Java to resolve separate C pointers to the same object;
- Maintains a hard reference to the WeakReference, without
which they just don't work.
For simplicity I just used a HashMap for this purpose. I knew it
wasn't ideal but I did the work to quantify it.
Using jol
and perusing the source I got some numbers for a jvm using
compressed oops and an 8-byte object alignment.
Object           | Size | Notes
HashMap.Node     |  32  | Used for short hash chains.
HashMap.TreeNode |  56  | Used for long hash chains.
Long             |  24  | The node key.
CReference       |  48  | The node value. Subclass of WeakReference.
Thus the incremental overhead for a single C object is either 104 bytes when a linear hash chain is used, or 128 bytes when a tree is used.
Actually it's a bit more than that because the hashtable (by default) uses a 75% load factor so it also allocates 1.5 pointers for each object, but that's neither here nor there and also a feature of the algorithm regardless of implementation.
But there are other bigger problems: the Long.hashCode() method just mixes the low and high words together using xor. If all C pointers are 8 (or worse, 16) byte aligned you essentially only get every 8th (or 16th) bucket ever in use. So apart from the wasted buckets, the HashMap is very likely to end up using Trees to store each chain.
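A quick throwaway demo of the effect, using the raw hash and a power-of-two mask rather than HashMap internals:

class AlignmentDemo {
    // Long.hashCode() is (int)(v ^ (v >>> 32)), so for 16-byte aligned
    // pointers the low 4 bits of every hash are zero and only 1 in 16
    // buckets of a power-of-two table can ever be used.
    public static void main(String[] args) {
        int buckets = 256;
        boolean[] used = new boolean[buckets];
        for (long p = 0x7f0000000000L; p < 0x7f0000000000L + 16 * 10000; p += 16)
            used[Long.hashCode(p) & (buckets - 1)] = true;
        int n = 0;
        for (boolean b : used) if (b) n++;
        System.out.println(n + " of " + buckets + " buckets used");  // prints 16 of 256
    }
}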
So I wrote another hashtable implementation which addresses this
by using the primitive long stored in the CReference directly as
the key, and using the CReference itself as the bucket nodes. I
also used a much better hash function. This reduced the memory
overhead to just the 48 bytes for the CReference plus a (tunable)
overhead for the root table - anywhere from 1/4 to 1 entry per
node works quite well with the improved hash function.
This uses less memory and runs a bit faster - mostly because the
gc is run less often.
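The shape of it is roughly as follows - a simplified sketch with assumed names, ignoring resize and the concurrent variants discussed earlier:

import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;

class PointerTableSketch {
    static class CReference extends WeakReference<Object> {
        final long p;          // the C pointer, used directly as the key
        CReference next;       // chain link - no separate node objects
        CReference(Object referent, long p, ReferenceQueue<Object> q) {
            super(referent, q);
            this.p = p;
        }
    }

    final CReference[] table = new CReference[256];

    synchronized Object get(long p) {
        for (CReference r = table[(int) (p >>> 4) & (table.length - 1)]; r != null; r = r.next)
            if (r.p == p)
                return r.get();
        return null;
    }

    synchronized void put(CReference ref) {
        int i = (int) (ref.p >>> 4) & (table.length - 1);
        ref.next = table[i];
        table[i] = ref;
    }
}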
notzed.nativez
So i'm still working on wrapping this all up in a module notzed.nativez which will include the Java base class and a shared library for other JNI libraries to link to, providing the (trivial) interface to the NativeZ object and some helpers for writing small and robust JNI libraries.
And then of course eventually port jjmpeg and zcl to use it.
Copyright (C) 2019 Michael Zucchi, All Rights Reserved.
Powered by gcc & me!