`parallel' streams

I had a task which I thought naturally fitted the Java streams stuff so I tried it out. Turns out it isn't so hot for this case.

The task is to load a set of data from files, process the data, and collate the results. It's quite cpu intensive so is a good fit for parallelisation on modern cpus. Queuing theory would suggest the most efficient processing pipeline would be to run each processing task on its own thread rather than trying to break the tasks up internally.

I tried a couple of different approaches:

Files.find().forEach() (serial to compare)
Files.find().parallel().collect(custom concurrent collector)
Files.find().parallel().flatMap().collect(toList())

The result was a bit pants. At best the parallel versions utilised 2 whole cores, and the total execution times were 1.0x, 0.77x, and 0.76x of the serial case respectively. The machine is some Intel laptop with 4 HT cores (i.e. 8x threads).

I thought maybe it just wasn't throwing enough threads at it and stalling on the i/o, so I tried a separate flatMap() stage to just load the data.

Files.find().parallel().flatMap(load).flatMap(process).collect(toList())

But that made no difference and basically ran the same as the custom collector implementation.

So I hand-rolled a trivial multi-thread processing graph:

I/O x 1: Files.find().forEach(load | queue)
Processing x 9: queue | process | outqueue
Collator x 1: outqueue | List.add()

With a few sentinel messages to handle finishing off and cleanup.

Result was all 8x "cores" fully utilised and a running time 0.30x of the serial case.
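
A rough sketch of that kind of graph, not the actual code: one loader thread feeds a bounded queue, a fixed number of workers process items, and the calling thread collates. The class name QueuePipeline, the use of an empty Optional as the sentinel, and the queue bound are all assumptions made for illustration; error handling is left out, and it assumes at least one worker and non-null results.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Function;

class QueuePipeline {
    // load | queue -> process -> outqueue | collate
    // An empty Optional is the (assumed) sentinel meaning "no more work".
    static <I, O> List<O> run(List<I> inputs, Function<I, O> process, int workers) {
        // Bounded so the loader can't race arbitrarily far ahead of the workers.
        BlockingQueue<Optional<I>> queue = new LinkedBlockingQueue<>(workers * 2);
        BlockingQueue<Optional<O>> outqueue = new LinkedBlockingQueue<>();

        // I/O x 1: feed every input, then one sentinel per worker.
        new Thread(() -> {
            try {
                for (I in : inputs)
                    queue.put(Optional.of(in));
                for (int i = 0; i < workers; i++)
                    queue.put(Optional.empty());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }).start();

        // Processing x N: each worker runs until it takes a sentinel,
        // then passes a sentinel of its own on to the collator.
        for (int i = 0; i < workers; i++) {
            new Thread(() -> {
                try {
                    Optional<I> msg;
                    while ((msg = queue.take()).isPresent())
                        outqueue.put(Optional.of(process.apply(msg.get())));
                    outqueue.put(Optional.empty());
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }).start();
        }

        // Collator x 1: done once every worker has signalled it is finished.
        List<O> results = new ArrayList<>();
        try {
            for (int finished = 0; finished < workers; ) {
                Optional<O> msg = outqueue.take();
                if (msg.isPresent())
                    results.add(msg.get());
                else
                    finished++;
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while collating", e);
        }
        return results;
    }

    public static void main(String[] args) {
        List<Integer> in = new ArrayList<>();
        for (int i = 1; i <= 100; i++)
            in.add(i);
        List<Integer> out = run(in, x -> x * x, 8);
        System.out.println(out.size() + " results");
    }
}
```

Results come back in completion order rather than input order, which is fine for a collate-into-a-list step but would need an index carried along if the order mattered.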

I didn't record the numbers but I also had a different implementation that parallelised parts of the numerical calculation instead, also using streams, via IntStream.range().parallel() (depending on the problem size). Surprisingly this had much better CPU utilisation (5x cores?) and improved runtime. It's surprising because that is a much finer-grained concurrency with higher overheads, and it isn't applied to the full calculation.
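
For illustration only, the finer-grained approach looks something like the following. The calculation (a square root and a scale), the class name and the threshold value are made up; the point is just switching the index range to parallel when the problem is big enough.

```java
import java.util.stream.IntStream;

class ParallelRange {
    // Hypothetical cut-off: below this the fork/join overhead isn't worth it.
    static final int PARALLEL_THRESHOLD = 10_000;

    // Stand-in numerical kernel: every index writes only its own slot of dst,
    // so no synchronisation is needed between threads.
    static void apply(double[] src, double[] dst) {
        IntStream range = IntStream.range(0, src.length);
        if (src.length >= PARALLEL_THRESHOLD)
            range = range.parallel();
        range.forEach(i -> dst[i] = Math.sqrt(src[i]) * 0.5);
    }

    public static void main(String[] args) {
        double[] src = new double[100_000];
        for (int i = 0; i < src.length; i++)
            src[i] = i;
        double[] dst = new double[src.length];
        apply(src, dst);
        System.out.println(dst[4]); // sqrt(4) * 0.5
    }
}
```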

I've delved into the stream implementation a bit trying to understand how to implement my own Spliterators and whatnot, and it's an extraordinarily large amount of code for these rather middling results.

Not that it isn't a difficult problem to solve in a general way; the stream "executor" doesn't know that I have tasks and i/o which are slow and with latency compared to many small cpu-bound tasks which it seems to be tuned for.

Still a bit disappointing.

jjmpeg & stuff

Well for whatever reason I got stuck into redoing jjmpeg and seem to have written most of the code (90%?) after a couple of weekends. It was mostly mandraulic and a bit tedious but somehow surprisingly relaxing and engaging; a short stint of unchallenging work can be a nice change. A couple of features are still missing but the main core is done.

Unfortunately my hope that the ffmpeg api was more bindable didn't really pan out but it isn't really any worse either. Some of the nastiest stuff doesn't really need to be dealt with fortunately.

I transformed most of the getters and setters into a small number of simple macros, and thus that part is only about as much work as the previous implementation despite not needing a separate compilation stage. I split most of the objects into separate files to make them simpler to maintain and added some table-based initialisation helpers to reduce the source lines and code footprint.

It's pretty small - counting `;' there's only 750 lines of C and 471 lines of Java sources. The 0.x version has 800 lines of C and 900 lines of Java, a big portion of which is generated from an 800 line (rather unmaintainable) Perl script. And the biggest reduction is the compiled size, the jar shrank from 274KB to 73KB, with only a modest increase from 55KB to 71KB in the (stripped) shared library size (although the latter doesn't include the dvb or utility classes).

There's still a lot of work to do though, I still need to test that anything actually works and port over the i/o classes and enum tables at the least, and a few more things probably. This is the boring stuff so it'll depend on my mood.

Fuck PCs

In other news I finally killed my PC - I tried one more time to play with the BIOS and after a few updates it got so unstable it just crashed during an update and bricked the motherboard. Blah. I discovered I could order a new BIOS rom so i've done that and i'll see if i can recover it, otherwise I might get another mobo if I can still get AM2+ boards here, or just get another machine. I'll probably look into the latter anyway as it's always been a bit of a hassle (despite working flawlessly when it does, and it's a very nice small machine).

jjmpeg?

Well i've had reason to visit jjmpeg again for something and although it's still doing the job, it's a very very long way behind in version support (0.10.x?). I've added a couple of things here and there (recently AVFormatContext.open_input so I could open compressed webcam streams) but i'm not particularly interested in dropping another release.

But ... along the way I started looking into writing a new version that will be up to date with current ffmpeg. It's a pretty slow burner and i'm going to be pretty busy with something (relatively interesting, moderately related) for the next couple of months.

But regardless here are a few what-if's should I continue with the effort.

The old generator + garbage collection support required 4 classes per object.
1. Abstract autogenerated native WeakReference based accessors with native-oriented methods (passing `pointers').
2. Manually written native accessors as above and the glue to make it all work.
3. Abstract autogenerated public accessors with public-oriented methods (passing objects).
4. Manually written public accessors as above and the glue to make it all work.
Whilst most of it is autogenerated the generator sucks to maintain and it's a bit of a mess. I've also since learnt that cutting down the number of classes is desirable.

So instead i'll use the "CObject" mechanism with the WeakReference being a simple native pointer holder object which also knows how to free it. In this case at most 2 custom classes are required - one for autogenerated code (if that happens) and any helper/custom code.

A few things require reflection going this route but the overheads should be acceptable.

Native memory was wrapped in native ByteBuffer objects.

Originally the goal was to have java just access fields directly but this wasn't practical as the structures change depending on the compile options, so you end up with both the C and Java code being system specific, and the Java code requires a compiler to implement it (C being handled by gcc). A side-goal was to make the Java library bit-size independent without resorting to long - although that's all ByteBuffer uses.

Because the objects are just wrapped on the pointer there is the possibility that multiple objects can be created to reference the same underlying C object (e.g. getStreams().get(0) repeated). Whilst this isn't as bad as it sounds one has to ensure the objects aren't holding any of their own state java-side for example. It also turns out that a direct ByteBuffer isn't terribly fast either from the Java side or looking up from the C side (not sure why on the latter).

CObject just uses a long directly, which also precludes the likelihood of poking around C internals by accident or otherwise. It also ensures unique objects reference unique pointers - this requires some overhead but it isn't onerous.

Using two concrete classes per object allowed the internal details of passing pointers (ByteBuffer) around to be hidden from the luser.
- But it requires a lot of scaffolding! The same method written at least 2 times!
- Although the C call gets a ByteBuffer directly, looking up the host pointer still requires a JNIEnv callback.
CObject likewise uses an accessor to retrieve the native pointer, but because it's the super-class of all objects the objects can simply be passed in directly. That is native methods just look like java methods and so there is no need for any trampolines between method interfaces. It does require a bit more support on the JNI side when returning objects, but it's trivial code.
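
To make the idea a bit more concrete, here is a pure-Java stand-in for that sort of scheme; it is a guess at the shape, not the jjmpeg or zcl code. All the names here (CObjectSketch, Handle, resolve, reap) are hypothetical, and the "free" step is just a callback where a real binding would call into C. Each pointer maps to at most one live Java object, and a weak handle remembers the pointer and how to release it once the object has been collected.

```java
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongConsumer;
import java.util.function.LongFunction;

class CObjectSketch {
    // Base class of every bound object: it only holds the native pointer.
    static class CObject {
        final long p;
        CObject(long p) { this.p = p; }
    }

    // Weak handle: keeps the pointer and the way to free it, not the object.
    static final class Handle extends WeakReference<CObject> {
        final long p;
        final LongConsumer release;
        Handle(CObject o, ReferenceQueue<CObject> queue, LongConsumer release) {
            super(o, queue);
            this.p = o.p;
            this.release = release;
        }
    }

    private static final Map<Long, Handle> live = new HashMap<>();
    private static final ReferenceQueue<CObject> dead = new ReferenceQueue<>();

    // Return the one Java object for pointer p, creating it if necessary.
    static synchronized <T extends CObject> T resolve(long p, LongFunction<T> create, LongConsumer release) {
        reap();
        Handle h = live.get(p);
        CObject o = (h != null) ? h.get() : null;
        if (o == null) {
            o = create.apply(p);
            live.put(p, new Handle(o, dead, release));
        }
        @SuppressWarnings("unchecked")
        T t = (T) o; // assumes a pointer is always resolved as the same type
        return t;
    }

    // Free the native side of any objects the collector has dropped.
    static synchronized void reap() {
        Handle h;
        while ((h = (Handle) dead.poll()) != null) {
            // Only release if this handle still owns the pointer.
            if (live.remove(h.p, h))
                h.release.accept(h.p);
        }
    }

    public static void main(String[] args) {
        CObject a = resolve(0x1000L, CObject::new, p -> System.out.println("free " + p));
        CObject b = resolve(0x1000L, CObject::new, p -> System.out.println("free " + p));
        System.out.println(a == b);
    }
}
```

The map lookup is the overhead mentioned above: it's what stops something like getStreams().get(0) handing back a new wrapper each time.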

An alternative would be to use a long and pass that around but then you still need the public/native separation and all the hassle that entails.

A lot of the current binding is autogenerated.

Once the generator was written this was fairly easy to maintain but getting the generator complete enough was a lot of work. The biggest issue is that the api just isn't very consistent, and some parts just don't map very nicely to a Java api. Things such as out parameters - or worse, absolute snot like AVDictionary that should never have existed (onya libav!).

Each case required special-case code in the generator, often extra support code, and sometimes a fall-back to manually writing the whole lot.

Working with zcl - and in that case the OpenCL apis are much cleaner and more consistent - I discovered it was somewhat less work just to do it manually and not really any harder to maintain afterwards. At least in the case of the original ffmpeg the inconsistency was simply because it wasn't originally intended as a public library, and I suspect the newer versions might be a bit better.

I'm still undecided about simple data accessors as a good case can be made for saving the typing if there are many to write. So perhaps they could still be autogenerated (to an abstract super class as now), or they could be parameterised like they are in zcl (i.e. internally getInt(field) with public wrappers). Another half and half option would be to use the C preprocessor to do most of the ugly work and still write the Java headers by hand. Probably the last one.
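
As a sketch of the parameterised option: a single internal getInt(field)/setInt(field) pair with thin public wrappers on top. Everything here is hypothetical (the class names, the field ids, and especially the int array standing in for the C struct); in a real binding the two internal methods would be native and the switch on the field id would live on the C side.

```java
class AccessorSketch {
    // Hypothetical field ids shared between the Java and C sides.
    static final int FIELD_WIDTH = 0;
    static final int FIELD_HEIGHT = 1;
    static final int FIELD_COUNT = 2;

    static class CodecContextSketch {
        // Stand-in for the native structure so the example runs without JNI.
        private final int[] fields = new int[FIELD_COUNT];

        // The only two methods that would need a native implementation.
        int getInt(int field) { return fields[field]; }
        void setInt(int field, int value) { fields[field] = value; }

        // Public wrappers: one line each, easy to write by hand or to
        // stamp out with a macro or generator.
        public int getWidth() { return getInt(FIELD_WIDTH); }
        public void setWidth(int width) { setInt(FIELD_WIDTH, width); }
        public int getHeight() { return getInt(FIELD_HEIGHT); }
        public void setHeight(int height) { setInt(FIELD_HEIGHT, height); }
    }

    public static void main(String[] args) {
        CodecContextSketch c = new CodecContextSketch();
        c.setWidth(640);
        c.setHeight(480);
        System.out.println(c.getWidth() + "x" + c.getHeight());
    }
}
```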

Does anyone else care?

RAM saga

Powered down last night because of an approaching thunderstorm ... half my ram gone again.

disable_mtrr_cleanup did nothing. disable_mtrr_trim would hang the boot.

I noticed the RAM speed was wrong in the BIOS again so i reset it, and lo and behold it all showed up, but only until the next reboot. Back to the BIOS and just changed the ram from one speed to another and back again - RAM returns!

Erg.

I hate my life.

And I wish I was dead.

I hate peecees

Well I was up till 3am fucking around with this bloody machine.

After verifying the hardware actually works it seems that the whole problem with my RAM not being found is the damn BIOS. I downloaded a bunch of BIOSs intending to try an older one and realised I hadn't installed the latest anyway. So I dropped that in and lo and behold the memory came back. Yay.

So now I had that, I thought i'd try and get OpenCL working again. Ok, installed the latest (17.40) amdgpu-pro ... and fuck. Unsupported something or other in the kernel module. Sigh. I discovered that 17.30 did apparently work ... but it took a bit of digging to find a download link for it as AMD doesn't link to older releases (at this point i also found out midori has utterly shithouse network code. sigh). I finally found the right one (that wasn't corrupt), installed it and finally ...

Oh, back to 3.5G ram again. FAAARK.

At this point the power went out for the whole street so I had a shower using my phone's torch and went to bed.

Did some more digging when I got up (I was going to give up and say fuck it, but i've come this far), tried manually adding the ram using memmap, and finally confirmed the problem was the BIOS. So i tried an older one. That worked.

But only for a while, and then that broke as well. So trying the previous one ... groan.

Maybe it's time to cut my losses, it's already 1pm, the sun is out, heading for a lovely 31 degrees.

I also got rocm installed but i don't know if it works on this hardware, although at least the kernel is running ok so far.

io scheduler, jfs, ssd

I didn't have much to do today and came across some articles about jfs and io schedulers and thought i'd run a few tests while polishing off a bottle of red (Noons Twelve Bells 2011). Actually i'd been waiting for a so-called "mate" to show up but he decided he was too busy to even let me know until after the day was over. Not like i've been suicidally depressed this week or anything.

The test I chose was to compile open jdk 9 which is a pretty i/o intensive and complex build.

My hardware is a Kaveri A10-7850K APU, "4G" of memory (sigh) on an ITX motherboard, and a Samsung SSD 840 EVO 250GB drive. For each test I blew away the build directory, re-ran configure, set the scheduler "echo foo > /sys/block/sda/queue/scheduler", and flushed the buffer caches "echo 3 > /proc/sys/vm/drop_caches" before running "time make images". I forced all cpu cores to a fixed 3.0GHz using the performance governor.

At first I kept getting ICE's in the default compiler so I installed gcc-4.9 ... and got ICE's again. Inconsistently though, and we all know what ICE's in gcc means (it almost always means broken hardware, i.e. RAM). Sigh. It was after a suspend-resume so that might've had something to do with it.

Well I rebooted and fudged around in the BIOS intending to play with the RAM clock but noticed I'd set the voltage too low. So what the hell I set it 1.60v and rebooted. And whattyaknow, suddenly I have 8G ram working again (for now), first time in a year. Fucking PCs.

Anyway so I ran the tests with each i/o scheduler on the filesystem and had no ICE's this time (which was disappointing because it means the hardware isn't likely to be busted after all, just temperamental). I used the default compiler. As it was a fresh reboot I ran builds from the same 80x24 xterm and the only other luser applications running beyond the panels and window manager were 1 other root xterm and an emacs to record the results.

cfq

real    10m39.440s
user    28m55.256s
sys     2m40.140s

deadline

real    10m36.500s
user    28m44.236s
sys     2m40.788s

noop

real    10m42.683s
user    28m43.960s
sys     2m41.036s

As expected from the articles i'd read, deadline (not the default in ubuntu!) is the best. But it's hardly a big deal. Some articles also suggested using noop for SSD storage but that is "clearly" worse (odd that it increases sys time if it doesn't do anything). I only ran one test each in the order shown so it's only an illustrative result, but TBH I'm just not that interested, especially if the results are so close anyway. If there was something more significant I might care.

I guess i'll switch to deadline anyway, why not.
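
To put a number on "hardly a big deal", the wall-clock times above can be compared directly. This is just arithmetic on the three "real" figures already listed; the class and method names are made up for the example.

```java
class TimingCompare {
    // Convert the "real" format printed by time, e.g. "10m39.440s", to seconds.
    static double seconds(String t) {
        int m = t.indexOf('m');
        int minutes = Integer.parseInt(t.substring(0, m));
        double secs = Double.parseDouble(t.substring(m + 1, t.length() - 1));
        return minutes * 60 + secs;
    }

    // How much slower a run was than the best run, as a percentage.
    static double percentSlower(String time, String best) {
        return (seconds(time) - seconds(best)) / seconds(best) * 100.0;
    }

    public static void main(String[] args) {
        String deadline = "10m36.500s";
        System.out.printf("cfq:  %.2f%% slower than deadline%n", percentSlower("10m39.440s", deadline));
        System.out.printf("noop: %.2f%% slower than deadline%n", percentSlower("10m42.683s", deadline));
    }
}
```

Both come out at under 1% (roughly half a percent for cfq and about one percent for noop), which for a single run of each is well within what I'd expect from noise.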

No doubt much of the compilation ran from buffer cache - in fact this is over 2x faster than when I compiled it on Slackware64 a few days ago - at the time I only had 4G system ram (less the framebuffer) and a few fat apps running. But I also had different bios settings (slower ram speed) and btrfs rather than jfs.

As an aside I remember in the early days of Evolution (2000-2-x) when it took 45 minutes for a complete build on a desktop tower, and the code wasn't even very big at the time. That dell piece of crap had woefully shitful i/o. Header files really kill C compilation performance though.

Still openjdk is pretty big, and i'm using a pretty woeful CPU here as well. Steamroller, lol. Where on earth are the ryzen apu's amd??

notzed@minized:~/hg/jdk9u$ find . -type f \
  -a \( -name '*.java' -o -name '*.cpp' -o -name '*.c' \) \
   | xargs grep \; | wc -l
3063124

(some of the c and c++ is platform specific and not built but it's a good enough measure).

Midori user style

Well I found a way to make Midori usable for me as a browser-of-text using the user stylesheet thing.

Took some theme and stripped out the crap and came up with this ... it turns all the text readable but leaves most of the rest intact, which is an improvement on how firefox rendered its colour and style overrides. It just overrode everything which broke a lot of style-sheet driven GUI toolkits amongst other sins.

* {
    color: #000 !important;
    text-shadow: 0 0 0px #000 !important;
    box-shadow: none !important;
    background-color: #777 !important;
    border-color: #000 !important;
    border-top-color: #000 !important;
    border-bottom-color: #000 !important;
    border-left-color: #000 !important;
    border-right-color: #000 !important;
}

div, body {
    background: transparent !important;
}

a, a * {
    color: #002255 !important;
    text-decoration: none !important;
}

a:hover, a:hover *, a:visited:hover, a:visited:hover *, span[onclick]:hover, div[onclick]:hover, [role="link"]:hover, [role="link"]:hover *, [role="button"]:hover *, [role="menuitem"]:hover, [role="menuitem"]:hover *, .link:hover, .link:hover * {
    color: #005522 !important;
    text-decoration: none !important;
}

a:visited, a:visited * {
    color: #550022 !important;
}

I don't know if i'll stick with it yet but it's a contender; its built-in javascript blocker looks a lot better than some shitty plugin which will break every ``upgrade'' too. It doesn't seem to run javascript terribly fast, but that's the language's fault for being so shithouse.
