Concurrent Hash Tables
So I got curious about whether the GC in NativeZ would cause any bottlenecks in highly contended situations - a situation I had already faced with an earlier iteration of the library. The specific case I had was running out of memory when many threads were creating short-lived native objects; the single thread consuming values from the ReferenceQueue wasn't able to keep up.
The out-of-memory situation was fairly easily addressed by just polling the ReferenceQueue whenever a new object is created, but I was still curious about the locking overheads.
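Roughly, the idea looks like this; a minimal sketch with made-up names, not the actual NativeZ code:

    import java.lang.ref.ReferenceQueue;
    import java.lang.ref.WeakReference;
    import java.util.concurrent.ConcurrentHashMap;

    // Sketch only: every allocation polls the ReferenceQueue first, so the
    // allocating threads share the cleanup work instead of leaving it all to
    // a single consumer thread.
    class NativeAllocator {
        static final ReferenceQueue<Object> QUEUE = new ReferenceQueue<>();
        // References must stay reachable until they are enqueued; in the real
        // library that is the job of the pointer-keyed table discussed below.
        static final ConcurrentHashMap<Long, NativeRef> LIVE = new ConcurrentHashMap<>();

        static class NativeRef extends WeakReference<Object> {
            final long pointer;
            NativeRef(Object owner, long pointer) {
                super(owner, QUEUE);
                this.pointer = pointer;
            }
        }

        static void reapPending() {
            NativeRef ref;
            while ((ref = (NativeRef) QUEUE.poll()) != null) {
                LIVE.remove(ref.pointer);
                free(ref.pointer);
            }
        }

        static long allocate(Object owner, int size) {
            reapPending();                        // amortise cleanup across allocating threads
            long p = malloc(size);
            LIVE.put(p, new NativeRef(owner, p)); // keep the reference alive until reaped
            return p;
        }

        static native long malloc(int size);      // placeholder JNI calls
        static native void free(long pointer);
    }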
A field of tables
I made a few variations of hash table supporting the interface I wanted, which was to store a value that also provides the key itself, as a primitive long. So it's more of a set, but with the ability to remove items by the key directly. For now I'll just summarise the implementations, after a rough sketch of the interface shape.
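Something along these lines; this is just a paraphrase of the shape, the names aren't from the actual code:

    // Paraphrase of the interface shape only: values carry their own primitive
    // long key, so it's really a set which can also remove entries by key.
    interface PointerSet<T extends PointerSet.Keyed> {
        interface Keyed {
            long key();          // e.g. the native pointer the object wraps
        }

        T put(T value);          // insert, keyed by value.key()
        T get(long key);         // lookup by primitive key, no boxing
        T remove(long key);      // remove by primitive key
    }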
- SynchronisedHash
This subclasses HashMap and adds the desired interfaces, each of which is synchronised on the object.
Rather than use the pointer directly as the key it is shifted right by 4 bits (all mallocs are 16-byte aligned here, so the low 4 bits are always zero) to avoid Long.hashCode() pitfalls on pointers.
- ConcurrentHash
This subclasses ConcurrentHashMap and adds the desired interfaces, but no locking is required.
The same shift trick is used for the key as above.
- PointerTable
This was the use-weakreference-as-node implementation mentioned in the last post - a simple single-threaded chained hash table. I just synchronised all the entry points.
A load factor of 2.0 is used for further memory savings.
- CopyHash
This is a novel(?) approach in which all modifications to the bucket chains are implemented by copying the whole linked list of nodes and replacing the table entry with the new list using compare-and-set (CAS); there's a rough sketch of the idea after this list. A couple of special cases can avoid any copies.
In the event of a CAS failure it means someone else updated the bucket, so it simply retries. Some logic akin to a simplified version of what ConcurrentHashMap does is used to resize the table when needed, potentially concurrently.
It uses a load factor of 1.0.
I also tested a fast-path insert which doesn't try to find an existing value (this occurs frequently with the design) but that so overloaded the remove() mechanism it wasn't actually a good fit!
- ArrayHash
This is similar to CopyHash but instead of a linked list of container nodes it either stores the value directly in the table, or it stores an array of objects. Again any modifications require rewriting the table entry. Again resize is handled specially and all threads can contribute.
This also has a small modification in that the retry loop includes a fibonacci-increasing Thread.sleep(0, nanos) delay if it fails, which may slow down wall-clock time but can improve the cpu load.
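To make the copy-and-CAS idea a bit more concrete, here's a very rough reconstruction of the put path using AtomicReferenceArray; no resize, no special cases, and definitely not the actual code:

    import java.util.concurrent.atomic.AtomicReferenceArray;

    // Rough sketch of the CopyHash approach: rebuild the whole bucket chain and
    // swap it in with a single CAS, retrying if someone else won the race.
    class CopySketch {
        static final class Node {
            final long key;
            final Object value;
            final Node next;
            Node(long key, Object value, Node next) {
                this.key = key; this.value = value; this.next = next;
            }
        }

        final AtomicReferenceArray<Node> table = new AtomicReferenceArray<>(1024);

        void put(long key, Object value) {
            int i = (int) (key & (table.length() - 1));
            for (;;) {
                Node head = table.get(i);
                // Copy the chain, dropping any existing entry for this key,
                // then prepend the new node.
                Node copy = new Node(key, value, copyWithout(head, key));
                if (table.compareAndSet(i, head, copy))
                    return;
                // CAS failed: someone else updated the bucket, so just retry.
                // ArrayHash additionally backs off here with a fibonacci-increasing
                // Thread.sleep(0, nanos) to trade wall-clock time for cpu load.
            }
        }

        static Node copyWithout(Node n, long key) {
            if (n == null)
                return null;
            Node rest = copyWithout(n.next, key);
            return n.key == key ? rest : new Node(n.key, n.value, rest);
        }
    }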
I had some ideas for other approaches but I've had enough for now.
Tests
I created a native class which allocates and frees memory of a given size, and a simple 'work' method which just adds up the octets within. I ran two tests: one which just allocated 16 bytes which immediately went out of scope, and another which allocated 1024 bytes, added them up (a small cpu-bound task), then let it fall out of scope.
Note that lookups are never tested in this scenario apart from the implicit lookup during get/put. I did implement get(), which is always non-blocking in the case of the two latter implementations (even during a resize), but I didn't test or use it here.
I then created 8 000 000 objects in a tight loop in one or more threads. Once the threads finish I invoke System.gc(), then wait for all objects to be freed. The machine I'm running it on has 4 cores/8 threads so I tried it with 1 and with 8 threads.
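In outline the test looks something like this; it's a self-contained stand-in rather than the actual harness, and NativeThing here is just a Java stub for the JNI-backed class:

    // Rough shape of the test only.  NativeThing stands in for the native class
    // which allocates/frees memory and can sum its octets.
    class AllocTest {
        static class NativeThing {
            final byte[] data;                       // really a native pointer in NativeZ
            NativeThing(int size) { data = new byte[size]; }
            long sum() { long s = 0; for (byte b : data) s += b & 0xff; return s; }
        }

        static void run(int nThreads, int perThread, int size, boolean sum)
                throws InterruptedException {
            Thread[] threads = new Thread[nThreads];
            for (int t = 0; t < nThreads; t++) {
                threads[t] = new Thread(() -> {
                    for (int i = 0; i < perThread; i++) {
                        NativeThing thing = new NativeThing(size);
                        if (sum)
                            thing.sum();             // the small cpu-bound task
                        // thing immediately falls out of scope
                    }
                });
                threads[t].start();
            }
            for (Thread t : threads)
                t.join();
            System.gc();
            // ... then wait for the reference queue / hash table to drain,
            // i.e. for every native object to have been freed.
        }
    }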
Phew, OK, that's a bit of a mouthful that probably doesn't make sense - bit tired. The numbers below are pretty rough and are from a single run using OpenJDK 11+28, with -Xmx1G.
                  8 threads, alloc     8 threads, alloc+sum 1 thread, alloc      1 thread, alloc+sum
                  x 1 000 000          x 1 000 000          x 8 000 000          x 8 000 000
                  Elapsed  User        Elapsed  User        Elapsed  User        Elapsed  User

SynchronisedHash     8.3    38            11    54             8.4    27            16    40
ConcurrentHash       6.9    35            11    52             8.2    27            17    38
PointerTable         7.2    26            13    44             7.9    17            19    29
CopyHash             6.6    31            8.2   42             8.1    24            18    33
ArrayHash            6.0    28            8.2   39             8.5    23            16    24
ArrayHash*           6.9    23            12    30             8.1    21            17    23

* indicates using a delay in the retry loop for the remove() call
To be honest I think the only real conclusion is that this machine doesn't have enough threads for this task to cause a bottleneck in the hash table! Even in the most contended case (alloc only) simple synchronised methods are just about as good as anything else. And whilst this is representative of a real-life scenario, it's just a bloody pain to properly test high-concurrency code and I'm just not that into it for a hobby (the journey not the destination and so forth).
I haven't shown the numbers above but in the case of the Copy and Array implementations I count how many retries are required for both put and get calls. In the 8-thread cases where there is no explicit delay it can be in the order of a million times! And yet they still run faster and use less memory. Shrug.
All of the non-java.util based implementations also benefit in both time and space from using the primitive key directly without boxing, from storing the key as part of the containee object, and from not having to support the full Collections and Streams interfaces. PointerTable also benefits from fewer gc passes due to not needing any container nodes and having a higher load factor.
BTW one might note that this is pretty bloody slow compared to C or pure Java - the JNI back-and-forths carry undeniably high overheads.
The other thing of note is that, well, hashtables are hashtables - they're all pretty good at what they're doing here because of the algorithm they share. There's not really all that much practical difference between any of them.
But why?
I'm not sure what possessed me to look into it this deeply, but I've done it now. Maybe I'll post the code a bit later; it may have enough bugs to invalidate all the results but it was still a learning experience.
For the lockless algorithms I made use of VarHandles (Java 9's 'safe' interface to Unsafe) to do the CAS and other operations, and some basic monitor locking for coordinating a resize pass.
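The array-element CAS itself doesn't amount to much; something along these lines (a sketch, not the actual code):

    import java.lang.invoke.MethodHandles;
    import java.lang.invoke.VarHandle;

    // Sketch of the VarHandle usage: an array-element VarHandle provides the
    // compare-and-set used in the read / copy / CAS retry loops above.
    class TableCas {
        static final VarHandle SLOT = MethodHandles.arrayElementVarHandle(Object[].class);

        static boolean swapBucket(Object[] table, int index, Object expected, Object update) {
            // Returns false if another thread changed the slot first, in which
            // case the caller re-reads the bucket and retries.
            return SLOT.compareAndSet(table, index, expected, update);
        }
    }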
The idea of 'read, do work, fail and retry' is something I originally learnt from a hardware feature of the CELL SPU (and the Power-based PPU). On those you can reserve a memory location (128 bytes on the SPU, 4 or 8 on the PPU), and if the reservation is lost by the time you write back to it the write fails, you know it failed, and you can retry. So rather than [spin-lock-doing-nothing WORK unlock], you spin on [reserve WORK write-or-retry]. I guess it's a speculative reservation. It's not quite as clean when using READ work CAS (particularly without the 128-byte blocks of the SPU!) but the net result is (mostly) the same. One significant difference is that with CAS you actually have to write a different value back, and in this instance being able to merely indicate change could have been useful.
ConcurrentHashMap does something similar, but only for the first insert into an empty hash chain; after that it locks the entry and only ever appends to the chain.
Some of the trickiest code was getting the resize mechanism to synchronise across threads, but I really only threw it together without much thought using object monitors and AtomicInteger. Occasionally I'm getting hangs but they don't appear to make sense: some number of threads will be blocked waiting to enter a synchronised method, while a few others are already in a wait() inside it, all the while another thread is calling it at will - without blocking. If I get keen I'll revisit this part of the code.