About Me

Michael Zucchi

 B.E. (Comp. Sys. Eng.)

  also known as Zed
  to his mates & enemies!

notzed at gmail
fosstodon.org/@notzed

Saturday, 25 April 2015, 17:36

ahh

So I discovered from observation that the whole performance thing is basically some fundamental speed/space trade-off in the compiler. It seems that for any (abstract) class which implements a loop and calls a method defined in a sub-class inside that loop, the sub-class method will be inlined (if possible and/or based on other parameters) for up to two sub-classes, but if you add a third then both of those inlined versions are thrown away and replaced with a single version they all share which uses a method call instead.

This dynamic recompilation (or rebinding) is actually pretty cool but has some implications for the whole mechanism on which java streams/lambdas are built. It's only much of a "problem" for tasks that don't do much work or for platforms where call overhead is high and where the operation is being called multiple-millions of times. But lambdas rather encourage the former.

Micro-benchmarks overstate the impact and in any event it can still be worked around or avoided in performance-critical code. But regardless I know what to look for now if i see funny results.

I did some experimentation with yet another way to access and process the pixels: by row accessor. This is still on-going but my current approach is to have a Consumer which is supplied a type-specific row accessor. This allows flat-mapped access via a get(i)/set(i,v) interface pair and exposes the internal channel ordering (which is fixed and interleaved). To handle different storage formats the forEach function will perform data conversion and be supplied data-directional hints so that each implementation isn't forced to implement every possible storage format.
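
Roughly the shape I have in mind is something like this - just a sketch with made-up names, not the actual api:

// A byte-variant row accessor: a flat get/set view over one row of interleaved
// channel data.  The image's forEach would hand one of these per row to the
// consumer, converting to/from the underlying storage format as hinted.
interface ByteRow {
    int width();                 // pixels in the row
    int channels();              // interleaved channel count
    int get(int i);              // i = x * channels() + channel
    void set(int i, int v);
}

interface ByteRowConsumer {
    void accept(ByteRow row);    // the consumer still writes its own inner loop
}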

Performance seems to be pretty good and there don't seem to be any weird speed-degradation gotchas so far. The Consumer code still needs to implement a loop but it's somewhat cleaner than having to have an array+base index passed around and used everywhere and the inner loop being local seems to be the easiest way to guarantee good performance. Oh and I have a way to handle multi-image combining operations by just defining multi-row interfaces which can be passed with a single parameter.

I don't know yet whether I want to also go "full-on-stream" with this as well as the tiles, and/or whether i even want a "Pixel" interface anywhere. I'm getting ideas to feed back and forth between them though like the memory transfer hints and automatic format conversion which might also be applicable to tiles. Having a `Pixel' is still kind of handy for writing less code even if there are performance trade-offs from using it. Then again, how many interfaces do I want to write and build the scaffolding for? Then there is the question of whether these accessors are used in the base apis themselves outside of just supplying ways to read or write the information.

Well that was a short diversion to end the day but I actually spent most of today (actually almost a full working day's worth) implementing what I thought was a fairly modest task: adding a parameterisable coordinate mapper to an all-edge-cases handling "set pixels" function. I'm using it as part of the tile extraction so that for example you can extract a tile from anywhere and have the edges extended in different ways that suit your use.

I already had a mapper enum to handle this for other cases but performing the calculation per-pixel turned out to be rather slow, actually oddly slow now I think about it - like 100x slower than a straight copy and it's not really doing a lot of work. I tried writing an iterator which performed the calculation incrementally and although it was somewhat faster it was still fairly slow and exposed too much of the mess to the caller. After a few other ideas I settled on adding a batch row-operation method to the enum where it handles the whole calculation for a given primitive data type.
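
For the record the per-pixel mapper is roughly this sort of thing - my own quick sketch of the idea, not the enum as it stands - cheap enough per call but it adds up when it's run for every pixel:

// Map an out-of-range coordinate back into [0, size) according to an edge policy.
enum EdgeMapper {
    CLAMP {
        int map(int x, int size) { return x < 0 ? 0 : x >= size ? size - 1 : x; }
    },
    WRAP {
        int map(int x, int size) { return Math.floorMod(x, size); }
    },
    MIRROR {
        // mirror with edge repeat: ..., 1, 0 | 0, 1, ..., size-1 | size-1, size-2, ...
        int map(int x, int size) {
            int p = Math.floorMod(x, 2 * size);
            return p < size ? p : 2 * size - 1 - p;
        }
    };

    abstract int map(int x, int size);
}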

The mirror-with-edge-repeat case turned out to be way more code than I would like but it's something like 20x faster than a trivial per-pixel mapped version. It breaks the target row into runs of in-order and reverse-order pixels and copies them accordingly. I can get a bit more performance (~20-30%) by coding specific versions for each channel count/element size; but it's already enough code to have both float and byte versions. Hmm, maybe this is something that could take a row-pair-accessor, 4 arguments instead of 8, .. ahh but then i'd probably want to add bulk operations to the interface and it's starting to get complicated.

Well now i'm super-tired and not thinking straight so time to call it a night. My foot seems to be going south again and i've been downing all the tea and water I can stomach so i'll probably be up all night anyway :(

Tagged hacking, java.
Saturday, 25 April 2015, 04:02

Streaming tiles

Yesterday I pretty much got lost all day poking around streams and iterators and puzzling over compiler nuances. It was a horrid cold and windy day and my foot's in no state to do anything so I didn't really have much else to do on an rdo. I learnt way more than I really wanted to ...

I was trying to come up with ways to make pixel-programming 'easy' and fit it into the stream framework yet still run at near-optimal efficiency. Streams of primitives are fine for single-channel images but since you can only return a single value the typical 'function' interfaces just don't work.

This means that you have to return multi-element values in some other structure and given it's stored that way; just writing directly to the storage array with an offset works.

My initial approach was to internalise the loop over rows or spans of consecutive pixels and just pass offsets and a length. i.e. all operators are batch/vector operators. This lets one amortise per-row addressing costs and the compiler can do some other optimisations fairly simply.

The problem is that you're still left writing yet-another-fucking-for-loop each time you write a new function ... and apart from that you've lost the positional information which is sometimes required. Ho hum.

So then I tried functions which operate on a single pixel but are given the coordinates and the channel number as parameters. This looked fairly promising and the performance is still good but it doesn't really solve the initial problem in that each channel is still being processed separately.
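
i.e. something along these lines (a sketch, names made up):

// One callback per channel value; position and channel are supplied so nothing
// is lost, but multi-channel operations still see the channels one at a time.
@FunctionalInterface
interface PixelChannelOp {
    int apply(int x, int y, int c, int value);  // new value for channel c at (x,y)
}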

Pixel

Then I tried a Pixel accessor object. This somewhat surprisingly ... was actually very fast.

Well not really. Is it? I thought it was? Things get a little odd here. I had various other cases but the most recent and relevant one is that as soon as the same class of an iterator is used for more than two callbacks suddenly the performance plummets by 5-10x; and stays there forever more. This would explain the strange results that I encountered in the previous post. It seemed like a compiler bug so I tried some older jdks and they were identical, but since i don't feel like removing all the lambdas they were all 1.8.x. It may just be a space/speed tradeoff rather than a bug.

So, ... well shit to that.

And I tried everything ... I first noticed that sharing a spliterator class between different target classes seemed to be the problem, but then i found i could use a general spliterator and then use a custom iterator for the inner-most loop, but then I found even that drops in speed once you have too many functions. So I tried per-target iterators and spliterators and forEach() implementations without luck.

Since I had some pretty poor performance from immutable base types before, I abused the interface somewhat anyway: each pixel is the same instance and I just change the parameters for each consumer invocation. This means the object can't be passed out of the first map() or forEach() stage. It still works for many cases though.

For example:

// greyscale quantise to 16 levels
src.pixels().forEach((Pixel p) -> {
    int v = p.get(0) & 0xf0;
    p.set(v | v>>4);
});

Just for kicks I might see how well an immutable accessor works - the pixel data will still be stored in the original image but the 'Pixel' object will track its location so it can be passed through a map function. I can't see much use for chaining per-pixel processing in this way though as you can just add the extra steps in the same function.

If i'm really bored I might try one where the pixel data itself is also copied so you'd need to use a collector to reconstruct an image. Again; I can't see how this would be useful though.

Tile

Because I thought i'd solved the Pixel performance problem; and I had at least exhausted all my current ideas, I thought i'd tackle the next big one: tile processing.

So the goal here is to break an image into tiles, process each one independently, and then re-form the result. This ... actually looks a lot like a stream processor and indeed it fits the model quite well assuming you retain the tile home position through the stream (does this make it not really a stream? I don't know, I think it's close enough and the arguments otherwise are a bit pointless).

After a couple of iterations (you can do a lot in 14 hours) I'm getting pretty close to something i'm pretty pleased with. It supports a copy mode which can pass tiles through a processing pipeline and an accessor mode which can be used for readonly access or per-pixel operations. Tiles can be overlapped. I've got a collector which forms the tiles back into a single image based on their coordinates.

A few more ideas to try but this is the sort of thing I've got working already:

// process an image by tiles into a new image
BytePixels1 src = ...;
dst = src.tiles(64, 64, 0, 0)
    .map((Tile t) -> {
            System.out.printf("do stuff @ %d,%d\n", t.getX(), t.getY());
            return t;
        })
    .collect(new TileCollector(src.create()));

Because there are fewer of them there aren't so many concerns with gc blow-out as with the pixel case so these are pipelineable objects.
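
The collector itself can be pretty simple because each tile carries its own position and the tiles don't overlap on the output - something like the following sketch, assuming a setTile() on the image (made-up name, not necessarily how TileCollector actually does it):

import java.util.stream.Collector;

static Collector<Tile, BytePixels1, BytePixels1> intoImage(BytePixels1 dst) {
    return Collector.of(
        () -> dst,                                       // every worker writes into the one image
        (img, t) -> img.setTile(t.getX(), t.getY(), t),  // tiles cover disjoint regions
        (a, b) -> a,                                     // same object on both sides; nothing to merge
        Collector.Characteristics.CONCURRENT,
        Collector.Characteristics.UNORDERED);
}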

The practical api

I'm not going for api perfection but i'm working on practicality with a heavy dose of performance.

So I like the tile streaming bit: this is just a necessary feature and it's a pleasantly good fit for java streams and works well with threads.

I'm still not sure on approach to pixels yet, so far i've got convenience or performance but not both (at least not in the general case). This latter target seems to be in a bit of a fight with the compiler but there might be a way to put the specialisation in the correct place so the compiler knows how to optimise it or at least not slow down other cases. Another issue is that any temporary storage needs to be allocated in each callback as you don't have access to thread-instance information in order to index pre-allocated information. So anything that needs extra workspace will add even more overhead.

Ideally I would like to support java streams ... but it may be more practical to just use different mechanisms for pixels and leave the stream and multi-thread handling to tiles. i.e. i could just have serial map(), forEach(), and reduce() calls which can be invoked per-tile and leave the combining to the higher levels. Yes that sounds reasonable.

And there are also some cases where the stream interface doesn't really fit such as a calculation across the same location of different images (apart from using a spliterator to create a multi-source accessor?).

As an old boss used to say, just gotta suck it and see. Time to suck?

Steaming piles

The syntax for generics is giving me the shits a bit. Nesting gets ugly very quickly and when things get complicated it's a bit of pot-luck working out what is needed. This is obviously something that could be learnt but it's not very useful knowledge and i'd rather be doing more interesting things. So I'm mostly just twiddling strings and seeing what netbeans tells me - when it decides to respond.

Something has 'happened' to netbeans too. It's using way too much memory and running far too slow. I'm constantly getting out of memory errors which then throw the whole system into a confused state. I've tried closing all the projects i'm not working on, regularly closing all windows, and even disabling all the plugins i'm not immediately using right now. Barely makes any difference. It starts at 500MB and keeps growing until I have to restart it.

And that's besides all of its really annoying behaviours which are apparently not bugs. I've turned off all the tooltips and everything automatic I can find but it still wants to pop up this shitty argument list thing every time I do an expansion via ctrl-space: i'm constantly having to hit escape just to fuck that shit off. And it seems to have given up guessing types altogether now. Type `Foo f = new <ctrl-space>' - No, I do not want or could possibly want to ever create a new AbstractMethodError. Its 'smart' build system is constantly out of date - i never know when i'll need to clean-rebuild so rather than face that hassle I just have to do it every time; and even then it executes some other set of classes so can be out of sync. It seems to randomly decide when it'll let you run the same application twice or not. That's one of those utterly pointless features that just isn't necessary and should be removed as soon as it doesn't work perfectly every single time - there's been enough bug reports about it.

And we're talking fairly small toy projects with little or no dependencies written by one man. This image library is only 5KLOC of code lines or 12KLOC of source.

Emacs is looking more attractive but that has big problems formatting straightforward C these days (all too often forcing a file close - reopen for a momentary reprieve) and with the size of the java runtime api it's just not practical typing out every function call, package, and type name.

Tagged hacking, java.
Thursday, 23 April 2015, 07:47

spliterators are just iterators

I've been mucking about on-and-off with a 2D pixel library (i.e. images) and as one part of that I was trying to utilise the stream functions from java 8 and/or at least look into some functional callbacks.

I can't just use the jre stuff because they don't have byte or float streams and they don't handle 2D data; I wanted to be able to support sub-regions of a rectangular array accessed by offset with different strides.

It took me a little while to wrap my head around spliterators because I was looking at them as if their primary job was to split the work up into smaller chunks to be worked across threads; but this is completely the wrong way to frame them. They are instead just iterators that can be broken into non-overlapping parts.

Once I realised that I had no trouble coming up with some 2D array streams of various types and trying to work out how to get better performance.

Given the code complexity i'm fairly impressed with the performance of the stream code in general but it doesn't always seem to interact with the jvm very well. Some code I have gets slower once hotspot has completed its optimisations; which is odd. And sometimes it's really not too fast at all.

So I have a test-case which creates a 2D byte array of 1024x1024 elements and sums up every element as an integer 1000 times in a tight loop.

The basic object i'm using for these tests is:

class BytePixels {
    public final byte[] pixels;
    public final int width, height, stride, offset;

    public int offset(int x, int y)  { return x+y*stride+offset; }
    public int getb(int x, int y)    { return pixels[offset(x,y)] & 0xff; }
    public int getb(int i)           { return pixels[i] & 0xff; }
}

The machine has a quad-core cpu and 4 threads are used wherever concurrency applies.

I'll go through some of the cases I tried and show the inner loops. For the most part i'm assuming offset is not zero, since the zero case is a little simpler, but the difference in runtime is minimal. I'm applying obvious micro-optimisations like copying loop bounds or array pointers to local variables.

The first examples are just using basic Java & jre facilities.

376ms Nested loops with getb(x,y)

This loops over the dimensions of the image and calls getb(x,y) to retrieve each element.

for (int y = 0; y < height; y++)
    for (int x = 0; x < width; x++)
        sum += bp.getb(x, y);

This sets a base-line for the most obvious implementation.

387ms Nested loops with pixels[o++]

This loops over the dimensions but removes the internal array calculation by calculating the offset at the start of each row and then accessing the pixel values directly.

for (int y = 0; y < height; y++)
    for (int x = 0, o = bp.offset(0, y); x < width; x++)
        sum += pixels[o++] & 0xff;

This is supposedly optimised ... but executes marginally slower or about the same. Ahah what? I can't explain this one.

328ms Nested loops with pixels[o++] take 2

This is the same as the previous but changes the inner loop bounds checking so that the incrementing value is the same one used as the bounds check.

for (int y = 0; y < height; y++)
    for (int o = bp.offset(0, y), e = o + width; o < e; o++)
        sum += pixels[o] & 0xff;

Given the simplicity of the change; this makes a pretty big difference. I must remember this for the future.

I only tried this whilst writing this post so went back and applied it to some of the other algorithms ...

2980ms map/reduce in parallel with flat indices.

This uses IntStream.range().parallel() to create a stream of all indices in the array and uses .map() to turn the indices into element values. Because range is only 1-dimensional this is not a general solution, merely 'best case'.

IntStream.range(0, width * height)
    .parallel()
    .map((i) -> getb(i))
    .sum();

So ... it is not very 'best'. Actually a further wrinkle is that unlike the other examples this one runs slower after hotspot has completed its optimisations and the first run is the quickest. And we're talking 2-3x slower here, not just a bit.

3373ms map/reduce in parallel with 2D indices.

This uses IntStream.range().parallel() to create a stream of all indices within the bounds of the image and then remaps these back to 2D coordinates inside the .map() function. This is the general solution which implements the required features.

IntStream.range(0, width * height)
    .parallel()
    .map((i) -> getb(offset + (i / width) * stride + (i % width)))
    .sum();

Not surprisingly it's a bit slower, and like the previous example also gets slower once optimised.

Basically; that's a no then, this just isn't going to cut it. So time to see if a custom spliterator will do it.

1960ms map/reduce in parallel with custom 2D spliterator

This uses a custom 2D spliterator that splits (at most) into rows and then uses map() to retrieve the pixel values.

StreamSupport.intStream(new Spliterator2DIndex(width, height, offset, stride), true)
    .map((i) -> getb(i))
    .sum();

Still "slow as shit" to use the correct technical jargon.

This is about when I worked out how to use spliterators and tried the next obvious thing: using the spliterator to do the map() since that seems to be "problematic".

125ms stream reduce in parallel with custom 2D array spliterator

The custom 2D spliterator performs the array lookups itself and feeds out the values as the stream. It supports the full features required.

StreamSupport.intStream(new Spliterator2DByteArray(width, height, offset, stride, pixels), true)
    .sum();

Ok, so now this is more like it. It's finally managed to beat that single threaded code.

Until this point I was ready to ditch even bothering with the stream code; one can deal with a bit of a performance drop for some convenience and code re-use, but 10x is just unacceptable.

115ms stream reduce in parallel with custom 1D array spliterator

This loops over all elements of an array and feeds out the array values. This is a non-conformant solution to determine the overheads of the 2D indexing.

StreamSupport.intStream(new SpliteratorByteArray(offset, width * height, pixels), true)
    .sum();

The overheads of the 2D case are not zero but they are modest.
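
For reference the value-producing 2D spliterator is roughly this shape - a cut-down sketch of the idea rather than the actual Spliterator2DByteArray - it feeds pixel values straight into the stream and splits on row boundaries:

import java.util.Spliterator;
import java.util.function.IntConsumer;

class ByteArray2DSpliterator implements Spliterator.OfInt {
    private final byte[] pixels;
    private final int width, stride, offset;
    private int row, rowLimit;   // rows [row, rowLimit) remain
    private int x;               // position within the current row

    ByteArray2DSpliterator(byte[] pixels, int width, int offset, int stride, int row, int rowLimit) {
        this.pixels = pixels;
        this.width = width;
        this.offset = offset;
        this.stride = stride;
        this.row = row;
        this.rowLimit = rowLimit;
    }

    public Spliterator.OfInt trySplit() {
        int remaining = rowLimit - row;
        if (remaining < 2 || x != 0)
            return null;                          // too small or mid-row: don't split
        int mid = row + remaining / 2;
        Spliterator.OfInt prefix = new ByteArray2DSpliterator(pixels, width, offset, stride, row, mid);
        row = mid;                                // this instance keeps the second half
        return prefix;
    }

    public boolean tryAdvance(IntConsumer action) {
        if (row >= rowLimit)
            return false;
        action.accept(pixels[x + row * stride + offset] & 0xff);
        if (++x == width) {
            x = 0;
            row++;
        }
        return true;
    }

    public long estimateSize() {
        return (long) (rowLimit - row) * width - x;
    }

    public int characteristics() {
        return SIZED | SUBSIZED | IMMUTABLE;
    }
}

// used as: StreamSupport.intStream(new ByteArray2DSpliterator(pixels, width, offset, stride, 0, height), true).sum()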

I tried a bunch of other things right through to a completely different multi-threaded forEach() call and polled queues and a more OpenCL-like approach to the processing model; my best result was under 100ms.

By `OpenCL approach' I made the thread index an explicitly available parameter so that for example a reduction step can just write a partial result directly to a pre-allocated array of values by 'group id' rather than having to allocate and pass back working storage. This together with an exotic single-reader work queue produced the best result. I also learned how to use park(), ... maybe something for a later post.
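
As a rough sketch of that idea (my simplified version here, not the code I benchmarked): each worker gets an explicit 'group id' and writes its partial sum straight into a pre-allocated slot, so nothing is allocated per task.

static long parallelSum(byte[] pixels, int width, int height, int stride, int offset, int nt)
        throws InterruptedException {
    long[] partial = new long[nt];               // one slot per group id
    Thread[] workers = new Thread[nt];
    int rowsPer = (height + nt - 1) / nt;
    for (int t = 0; t < nt; t++) {
        final int id = t;
        workers[t] = new Thread(() -> {
            long sum = 0;
            int y0 = id * rowsPer, y1 = Math.min(height, y0 + rowsPer);
            for (int y = y0; y < y1; y++)
                for (int o = y * stride + offset, e = o + width; o < e; o++)
                    sum += pixels[o] & 0xff;
            partial[id] = sum;                   // indexed by group id
        });
        workers[t].start();
    }
    long total = 0;
    for (int t = 0; t < nt; t++) {
        workers[t].join();
        total += partial[t];
    }
    return total;
}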

Optimisations

I found it particularly critical to optimise the spliterator iteration function forEachRemaining(). Even trivial changes can have a big impact.

The difference between this:

private final byte[] pixels;
private final int width, stride;
private int height, offset;
private int x, y;

public void forEachRemaining(IntConsumer action) {
    while (y < height) {
        while (x < width)
            action.accept(pixels[(x++) + y * stride + offset] & 0xff);
        x = 0;
        y += 1;
    }
}

And this:

private final byte[] pixels;
private final int width, stride;
private int height, offset;
private int x, y;

public void forEachRemaining(IntConsumer action) {
    int y = this.y;
    int height = this.height;
    int x = this.x;
    byte[] pixels = this.pixels;
    int offset = this.offset;

    while (y < height) {
        for (int o = x + y * stride + offset, e = y * stride + offset + width; o < e; o++)
            action.accept(pixels[o] & 0xff);
        y += 1;
        x = 0;
    }
    this.x = x;
    this.y = y;
}

Is an over 600% difference in running time.

For images

These streams aren't really very useful for a lot of image processing; only handling single channel data and losing the positional information leaves only a few basic and not widely useful operations. The other stream interfaces like collections just won't scale very well to images either: they require intermediate blocks to be allocated rather than just referencing sub-ranges.

So I have been using and probably will continue to use the streams in these cases as a multi-threaded forEach() and gain the concurrency serialisation implicitly by each task working on independent values. But I can also just borrow ideas and implement similar interfaces even if they are not compatible, e.g. a reduce call that just runs directly on the data without forming a stream first. This might let me get some of the code re-usability and brevity benefits without trying to force a not-quite-right processing model onto the problem.

But being able to create custom/multi-dimensional spliterators is pretty handy all the same.

Tagged hacking, java.
Saturday, 18 April 2015, 07:07

less shitty internet and more shitty gout

So I got my NBN working again - by power cycling the NBN modem. This isn't terribly easy as it's got a battery backup and a difficult to access/remove power connector. Internode were no help at all and seem to have stopped even responding to support requests now, or at least mine. I only got one response a couple of weeks ago.

Going back to the ADSL did however let me confirm that there's something not quite right with it, as I managed to get some multiplayer DRIVECLUB games in - whilst it wasn't perfect as their servers are still pretty shitty at least it worked a few times - and it's never worked even once using the NBN. To confirm I've flipped back and forth a few times and it's still the same: ADSL works sometimes, NBN never works (it might let me start but always kicks me out). Even using the same shitty fritz-box; so maybe it's not the modem but reports elsewhere suggest it's a piece of junk. Why internode recommend them is beyond me, although i'm sure a firmware update could fix it if it's just that.

I'm not particularly annoyed that DRIVECLUB doesn't work itself because i'm just not into that kind of competition (albeit there is a small spark when it works, but wears thin quickly against deaf and dumb strangers) but it shits me that i've got something that doesn't work properly. I reached level 50 a few days ago - I don't understand when people call the game unapproachable, as it's easy to just work at your own pace and you can still eventually complete the game. I'm very much middle-of-the-road on all the leaderboards and "challenges" and aren't even playing that much each week.

When I feel like blowing an afternoon I'll try setting up a router+dhcp+etc on a GNU box and see if that will work properly just to confirm it's the router. From what I can tell my old billion VG won't support PPPoE without going over the ADSL link - I tried but failed to get anywhere.

Gout kicked up again a few days ago but i'm not really sure why this time. I caught up with some people over a few beers a couple of times and so on with no ill effects but then a week later it just started again. All I can think of is not enough water one day or climbing up a tree barefoot to do some pruning (clumsily). Or the attack is just more delayed than I thought and it was due to something I ate/did a few days earlier. It's mostly ok unless I try to walk but I had to go to the shop today because I ran out of drugs and that was pretty limpy - and to my shock the supermarket was out of ibuprofen, although fortunately the chemist is next to it. Going to be fun getting my bike shoe off. The rate slowed but i'm still losing weight - I had my appetite back there for a bit but it's deserted me again although I'm also putting in a conscious effort too. Come on 80kg, just 2 more to go.

I'm avoiding the news and barely watching tv (working, reading, or playing games instead) but what little I pick up is probably more depressing than these localised issues. But if someone like tony-"fucknut"-abbott can be voted in by the will of the people then what are you gonna do? I just got a mail from the FSF about the TPP ... i just can't believe these slimy cunts who would sell their own nations souvereignty out for the benefit of those who least need it. Why aren't they in a gaol cell or hanging from a tree for such obvious treason?

Well writing this blew a couple of hours in-between doing the washing and some vacuuming; although it's raining now so I gotta find somewhere to hang it. But what to do with the rest of the weekend ...

Update: Internode ended up calling me one night, sort of to apologise for the delay and to follow it up. Guy was a bit annoying, you know the type trying to be `helpful' whilst constantly pushing the conversation forward and not admitting to anything. When I mentioned DRIVECLUB he just did a web search, yeah like i didn't already try that. I never wanted to have to contact support in the first place and hopefully I don't have to again.

Foot's been worse. I have no idea why as right now i'm not really eating enough for it to be diet related and i'm drinking a normal amount of water. Unless it's the weight loss itself, which I guess is just a necessary cost since I need to lose it so i'm taking advantage of the fact I don't feel hungry most of the time right now.

Tagged biographical, rants.
Sunday, 05 April 2015, 03:44

shitty internet & shitty stuff

Well my NBN is still down. Not even a response from internode after a service request 2.5 days ago. It's Easter which is a big holiday in Australia and I don't particularly begrudge the lack of response, but I see from whirlpool posts they are still working. I see from others it might not be something they can fix anyway.

So i'm using my old ADSL - the one I thought i "switched" from; but alas they just kept both running and billed. I think it's because i used their crappy on-line form to request it and it's been a drawn out collection of misunderstanding and mistakes since. For an internet company they seem pretty poor at doing everything over the internet.

Yay more shit

So I got hit with gout a couple of months ago as a nice birthday present which pretty much screwed up any tentative plans I had for that let alone the future. After the first couple of days of pain it just turned into another sore foot; I can hobble a bit by putting weight on the outside edge but even walking around the garden too much has consequences. Surprisingly, cycling is worse than walking and may hint at the trigger. It sometimes swells a bit but without shoes that isn't very noticeable. A couple of weeks after an attack it starts to get to the point I can ride a few km or go to the supermarket but it still gets sore afterwards. For the most part even at its worst the pain was mostly annoying - certainly painful but still able to sleep with drugs - although after the second attack I had to go to the shop to get groceries and walking in shoes was one of the more painful things i've ever done to myself; I should've waited for the ibuprofen to kick in properly and probably just gone bare foot. As I live alone I would have had to go eventually though.

What is most frustrating about it is that after a couple of months of summer doing pretty much nothing i was ready to get a bit of exercise and mix with humanity again; so i went for a 2 hour ride (took it mostly easy but pushed a little for the exercise), got drunk at the pub, then came home and had some biscuits and cheese for "dinner". Then suddenly it felt like i'd broken my big toe and i wasn't really sure I hadn't ... But I knew instantly it was gout (i think i'd looked it up before due to my other foot trouble). And that was pretty much summer and exercise (and life?) all over. Getting so bored sitting around at home is a big reason I started hacking code again. We also had a notably mild - and completely dry - summer this year so it was doubly frustrating I couldn't really enjoy it. As an aside some of my initial searching turned up a similar story of a guy the same age as me going for a "moderate" bike ride before his first attack.

It was a pretty depressing read looking up information about it on the internet: pretty much everything i eat was "off the list", pretty much everything i do for fun was "off the list". Then I started to notice it was almost all exclusively old wives' tales full of completely contradictory information. Even the limited medical information is contradictory: e.g. it's all about the purines, well, except when they're in vegetables when they apparently don't count. Even beer which is the worst thing for gout doesn't have much in the way of purines. Also surprising that there seems to have been so little in the way of hard studies done for such a common ailment. What few medical suggestions are available seem to be targeted at the obese: it is unclear for example whether sugars, fats, and oils are bad for gout or just bad for fat people and being fat is bad for everything. I would consider myself overweight but plenty wouldn't these days due to their distorted sense of "normal" and efforts not to "offend" (typical family greeting however: when did you turn into such a fat cunt? - as it should be). The irony of suggesting lots of exercise when you can't even walk or cycle is not terribly comforting.

So my initial readings pretty much put me off eating almost everything at the start; and now i've lost most interest in food. When I went to a doctor a couple of weeks afterwards he said meat should be ok but I've barely had any since; doing a roast just seems like asking for trouble. Pangs of hunger come and go but are easier to ignore, and small meals and snacks seem to fill me up much faster and for longer. I've lost nearly 1Kg/week so far so now i'm interested in seeing how far I can go. At this rate i'll break through 80 in another month and if i can do that at least it will be something gained from this sorry episode.

So now after a bit of time and experiments i've kinda worked out that tons of water is about all I can do to help. Together with ibuprofen if it's too sore or swollen. Conversely the main problem seems to be dehydration or more particularly a lack of urination. I usually drink plenty of water unless i'm too busy to notice, especially when cycling, but I also sweat profusely when doing anything physical so maybe it just isn't ever enough. I can have some wine and beer if i drink a lot of water - but it often gets to the point of being only slightly intoxicated while being waterlogged so it isn't remotely fun. If i feel it coming on (a pricking on the toe) absolutely drowning in water seems to prevent it going further but doing that together with visiting the dunny all night is starting to get pretty obnoxious. I'm having tons of coffee - some evidence suggests it reduces the chance of gout but again no study to say if it helps it - but at least it's something i can safely imbibe (i don't really get any buzz from it; it stops me sleeping properly if i have it too late is all). According to the internets (green) tea is supposed to do wonders but the medical evidence says otherwise: i concur with the latter. It does however have one important potential use: it lets you drink much more water without exploding. I prefer green to black without milk anyway, and black with milk doesn't seem to agree with me anymore. The doctor said no red wine but red seems to be better than white and wine better than beer. I really miss being properly beerly-drunk.

No doubt the beer and food over a lifetime helped it on its way but I think the trigger was probably dehydration. In that 4 hour period of the bike ride+pub I had ~2L of beer and ~2L of water but only urinated once I got home. Maybe i've also just got a naturally high level of uric acid. It's unlikely to be family related; they all drink like fish - more often and intensely than I do.

I had another attack Thursday night and didn't feel like fighting it with water. It ended up fairly mild but my foot's still swollen and it feels like it's reset it's 'state' to two weeks ago. It's not like I was walking around properly anyway so hobbling a bit more is quickly becoming a matter of "so fucking what".

Since i'm pissed off with my internet too, despite having access to it, it's probably time to sit in the garden reading as i've done the last 2 days. With a bottle of water. And maybe something a little stronger - in small amounts :( It also needs a mow and other maintenance but my foot needs more rest.

Tagged rants.
Thursday, 02 April 2015, 18:22

another variation on histogram equalise

Yeah I dunno this was something I thought up whilst getting up this morning. I can't even remember why I would have been thinking of such a thing.

The original:

Standard histogram equalise (it's slightly different from the gimp due to some range handling but it's just as shithouse):

My other really cheap tunable histogram equalise based on truncation on the default setting. Maybe a bit dark in the dark but it is tunable further.

The new one tries to solve the same problem but in a completely different way:

And another test image which probably shows it even better:

Here the standard operator just makes a pig's breakfast of pretty much everything:

The simple limit works pretty well here:

And here I think the new one pips it. It retains about the same amount of detail within the coins but adds more contrast to the surface they're resting on without burning it out.

The previous algorithm applies a hard ceiling to the histogram bins after they are calculated based on the mean.

This algorithm instead applies a localised sameness test to each pixel before it is added to the histogram, and it is simply not counted if it fails the test. This means it simply drops any areas of similar or slow-changing intensity out of the histogram calculation. There are some tweaks needed to handle extreme cases but as can be seen it works pretty well as is on challenging input.
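
In sketch form the accumulation step is something like this - my own reconstruction, and the real sameness test is a bit more involved than a single left-neighbour difference:

// Only pixels that differ "enough" from a local reference are counted, so flat
// or slowly-changing regions don't contribute to the histogram at all.
static int[] selectiveHistogram(byte[] pixels, int width, int height, int threshold) {
    int[] hist = new int[256];
    for (int y = 0; y < height; y++) {
        for (int x = 1; x < width; x++) {
            int v = pixels[y * width + x] & 0xff;
            int left = pixels[y * width + x - 1] & 0xff;
            if (Math.abs(v - left) > threshold)   // the "sameness" test
                hist[v]++;
        }
    }
    return hist;
}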

But it's somewhat more expensive to calculate so the previous algorithm is probably more desirable given the similarity of the results.

I just snarfed the base pics from over here.

Unrelated musings and rants

Hmm, 2am again. My fibre NBN was down again tonight and I discovered my old ADSL is still connected and still being billed. Yay. What the hell internode, make it bloody clear that "switch plans" isn't a switch, if it isn't a switch. This whole thing really hasn't been anywhere near as smooth as it should've been; then again they charged me for permanent dial-up for about 6 years after i got ADSL so they certainly have form.

Probably should've gone to bed after that and a bit of TV but it's been a long week of trying to decipher some odd(?) C++ which wasn't very pleasant and I needed to 'relax'. Although the C++ thing ended with some success.

I wrote a fairly long rant against it yesterday but decided not to post it. But in brief it started with the pleasant surprise of the performance of jaxb on a 200MB xml file (hah, not memory performance, but running-time), another in discovering that you can put a whole array of primitives into an attribute which drops the verbosity considerably - 200MB was after finding this out so it could've been much worse. A realisation of self-stupidity after spending an hour or so writing an s-expression loader from scratch for a completely one-off, throw-away data-transfer job (wasn't thinking straight there). Two lines of calling jaxb and a few annotated holding classes were only about 3x slower at reading the xml than the C++ all-hand-written all-in-header-files macro-template-stuff was at reading its own proprietary binary format. For the task that would probably do but I want to run this a few times and 5 seconds is too long, so that led to some ensuing mirth once I tried using the java default all-i-did-was-add-an-interface serialisation via Object*Stream: about 5x faster and 30% smaller than the C++. And finally some satisfaction in porting some number crunching algorithm across and having it run 50% faster, with simple plain code using consistent syntax that compiles in a 10th of a second rather than 30, and no multiple-screen error messages for using the wrong type of bracket in one place.

Not that C++ programmers have writing needlessly complex code all to themselves, it seems to be a universal skill and one widely practiced. C++ does seem to attract adepts of the highest skill in the art though and fortunately for these code-fiends it has the most facilities to aid them in their vitally important work. They certainly seem to take pride in it. Java is probably a close second truth be told, but probably for very different reasons. The whole weird self-help like book/training scene and probably just a lot of newbies who don't know better, at least at one time (any time i look back ~5 years i always see things I wouldn't do that way now, or worse).

That is not abstraction

Now I remember one thing I thought of whilst writing the post yesterday based on my recent C++ exposure.

The C++ language just doesn't seem to understand what abstraction is for. It seems so busy hiding as many actually important machine-level details as it can whilst exposing in intricate detail as much fluffy non-essential information about its own type system so the compiler has any hope of understanding what you intended together with some vague (and basically worthless) guarantee that all this noise will produce somehow "optimal" code - for some interpretation of "optimal" which doesn't seem to take into account physical hardware `as such' (not surprisingly given the preceding paragraph).

When these features are used in libraries (or perhaps conglomerations of syntax more succinctly described as "the headers are a platform") the information hiding can be taken to an absurd level: they've actually managed to hide the ability to understand even the syntax of a single line of program code in a widely used language by just looking at it. Now not only do you need to know the basic syntax and semantics of the language itself, you also need to have an intimate knowledge of the entire platform included in the header files before you know the meaning of basic expressions. Even in the one place where semantics cannot be deduced without extraneous knowledge - i.e. function invocation - C++ allows the removal of the one small mnemonic you had as a guide. Just think of all the man-years spent getting this snot to compile.

But back to the point - by doing this you haven't hidden anything - you've just exposed everything. Because without knowing everything you can't understand any of it.

Maybe it was a specifically weird bit of code but i did some searches and it seems the things it does are common and even seen as desirable. But people also believe the strangest things and the internet is on average simply full of shit.

Tagged graphics, java, rants.
Tuesday, 31 March 2015, 16:07

other little things

So it turned out i was quite wrong on the prefix sorted table thing, it can be done in less memory and more efficiently and there's a whole field of research newly reinvigorated by search engines and genome stitching and modern hardware capabilities. But for a casual observer most of it is a bit complex for a weekend hack and needs some pretty messy pointer-filled code so doesn't translate well to Java; I tried a few things but gave up on that idea since my naive approaches weren't much worse in speed than the code i could find on the sort of data i'm playing with. If it was that important I would just use some C, but it really isn't. But like most of this stuff it's good to know it's out there should I actually need it some day. And the maven stuff everyone in java-land insists on now using was giving me the utter shits too; e.g. oh, let's just import a whole fecking platform just so we can use one trivial one-line function that just calls java.util.* anyway, ... once. Why why?

Then I got a bit side-tracked in trying to re-use the system primitive array sort to sort sub-strings by sorting them 4-bytes at a time packed into a long. The lower 31 bits are the offset into the array (i.e. base pointer) and the next 32 bits are 4 bytes to sort in order with truncation handled using 0 padding. As it only uses 63 bits I can use a signed long sort to sort 4 bytes of string at a time. This gets sorted then ranges with a common prefix are recursively sorted again, and so on. Managed about a 2x speedup over a straight memcmp sort; which is probably nothing to sneeze at but not really enough here. Then I tried 8-bit radix sorting the first pass: gains a little bit more. And then a 16-bit radix first pass which was over 2x speed-up again, although by this time I didn't feel like writing multiple whole sorts just to test ideas and they still have undesirable worst-case behaviour. Actually a 6x speedup (if the code is functionally correct, which i didn't get to checking) is pretty decent: but it's delving into territory where far better and more scalable algorithms for the problem exist and i already have one that works well enough for me as a (better than) "70% solution".
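
The packing itself is simple enough, something like this (my reconstruction of it):

// 4 string bytes in the upper 32 bits (big-endian so the ordering matches a
// byte-wise compare), the offset in the lower 31 bits; 63 bits total so a plain
// signed long sort orders the keys correctly.
static long key(byte[] data, int offset) {
    long prefix = 0;
    for (int i = 0; i < 4; i++) {
        int b = offset + i < data.length ? data[offset + i] & 0xff : 0;  // 0-pad at the end
        prefix = (prefix << 8) | b;
    }
    return (prefix << 31) | offset;
}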

Sorting is something i poke at every now and then just for interest's sake so i may still revisit it later. I can't even remember if i've done radix sorting before, but i probably should have given it is parallelisable in a massively-parallel sense.

I also had the headache of trying to troll through a big C++ library yesterday trying to understand something. Ugh, it's like reading through a whole C application written using pre-processor macros, one for every case imaginable but few actually ever realised. Apparently this is a desirable thing in C++. But yeah, abstraction isn't really about hiding so much detail you can't work out what something is doing without intimately knowing everything about it; that's pretty much the opposite. And those compile times. And those error messages. With all that i'm surprised how much work the language and compiler still forces on the developer, and yet it is trivial to write incorrect code. But I digress, I had enough of that yesterday.

Limits

I did have another idea for the string matcher either this morning or yesterday which I finally just remembered and tried: just truncate the maximum hash bucket length to some fixed maximum. It drops the match rate (particularly for popular symbols) and thus the compression rate but it drastically improves run-time speed and worst-case behaviour. Saves some memory too. Ok I guess everyone else does this already but I tried to get by without it; but alas could not. Because i'm using a bucket-based hash table i've got some more options on how i deal with it; i don't just have to throw them away.

Actually the way i implemented it I think it has some interesting side-effects together with the new delta format; common prefixes are updated more often and thus more likely to be close by or in the exact-address table so the address compression works better. Initially I tried replacing a random record with the new entry: this is usually a good idea for mathematical reasons "I recognise but don't completely understand", but generating the random numbers proved to be too expensive for short files. So I just have a rolling index shared across all records which is super-cheap and should be relatively random in a local sense and merely practical in all others.

So here's some variations and times for my two current format encoders on the kjv text - on the smaller files it doesn't make much difference. The limit is (2^n - 1) because i store the chain length/insertion point in the first index and i double the size each time i grow it. I included one result when using random replacement: as suggested it does improve the compression - even if only very slightly! Here the cost is somewhat hidden by the string compares. The dez home page has the benchmarks of the other software i've tried; they are not even remotely close in terms of output size even compared to the fastest and "worst" below.

                   time                  size
             dez-1      dez-2*      dez-1    dez-2
limit
3            0.232623   0.319473    2596997  1970223
7            0.334969   0.433130    2328817  1831470
15           0.429424   0.521525    2157175  1757288
15 (random)  0.572432   0.666268    2156093  1756776
31           0.545488   0.626991    2028280  1702969
63           0.732696   0.822705    1934709  1659212

unlimited    9.724143   9.853381    1731250  1539501

     empty string to KJV bible "delta" encode

* actually i'm going to just call dez-2 `dez-1'.

The unlimited case is a bit (well 70%!) faster than previous because I added a small optimisation to the string length matcher: it first checks to see if the next character after the current best length is a match, if not it doesn't bother scanning the rest of the string at all. But still, whether 10s or 17s, it's still pretty shit (the main issue with this text is it is indented, so 25K+ substrings begin with newline+several spaces and they all end up in the same hash bucket).

For a given limit both format encoders are being fed an identical list of edits, which shows just how much better the second format is and why the first will vanish next time I do a code-drop.

Tagged dez, hacking, java.
Sunday, 29 March 2015, 11:53

more in less

Curiosity got the better of me again and now there's a 3:30am night and early rise behind me and pretty much another whole weekend "down the drain".

I had another go trying to grok VCDIFF format but failed again: not sure if it's my encoder or decoder but I can't get them to agree on the address compression and so it fails on most data.

Putting that aside I experimented with some ideas on address compression from VCDIFF, and then ended up creating a new format based on what i learnt from it and other experiments. My work with VCDIFF demonstrated the utility of address compression and dual operation instructions.

Anyway I think it turned out quite neat and tidy so here's a summary. It might be useful if you're trying to understand VCDIFF - it's very different but borrows similar mechanisms.

Opcodes

There are basically two families of opcodes: dual-operation and single operation. The MSb selects which so each has a full 7 bits of information to play with. This allows for quite a wide range of data lengths and operations to be encoded into a single byte.

Only 6 individual operations are defined together with 2 dual operations.

Integers are encoded like many formats as unsigned 7-bit big-endian with the MSb as a continue bit. Addresses are encoded independently using a separate mechanism as described below.
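
For completeness the integer coding is just the usual thing, something like this sketch (not the dez source):

// Unsigned value as big-endian groups of 7 bits; the MSb is set on every byte
// except the last.
static void encodeInt(java.io.ByteArrayOutputStream out, int value) {
    int shift = 28;                                    // highest possible 7-bit group
    while (shift > 0 && (value >>> shift) == 0)
        shift -= 7;
    while (shift > 0) {
        out.write(0x80 | ((value >>> shift) & 0x7f));  // continue bit set
        shift -= 7;
    }
    out.write(value & 0x7f);                           // final byte, continue bit clear
}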

There are a couple of other format parameters:

smallest
The smallest copy size possible. Encoder dependent.
split
The split point for the single operation opcodes. Fixed at 100 but could be changed.

Dual operation instructions

The encoder tracks the last instruction and if possible can combine two into the same instruction. This is one of the things vcdiff uses to get more compact deltas.

0Taaabbb
  0 - selects a dual instruction
  T - the type of the first operation, 0=ADD 1=COPY
aaa - 3-bit length for first operation.  This allows a range of
      (1-8) byte appends or (0-7)+smallest copy.
bbb - 3-bit length for second operation.  It is always COPY.
      This allows a range of (0-7)+smallest copy.

3 bits allows for an ADD of 1-8 bytes and a COPY of (0-7)+smallest bytes. With a smallest size of 6 bytes this allows ADD(1-8)+COPY(6-13) or COPY(6-13)+COPY(6-13).

That exhausts all the bits, so these are the only dual instructions possible. Any data or addresses required are encoded in order in the following bytes.

I also investigated a fixed function 3-bit ADD and 4-bit COPY but it wasn't as good (albeit with very limited investigation).

Single operation instructions

The single op instructions allow for longer immediate copies as well as extended instructions which encode the length as a separate parameter.

Rather than splitting the value into bit-selected blocks, a single 128-value number is broken into several ranges which are interpreted differently.

1nnnnnnn
  1 - selects single operation
  n - 7-bit number interpreted via inclusive ranges as follows:

000 - 099   copy (0-99)+smallest bytes
100 - 123   add (1-24) bytes
      124   read length.  copy length+100+smallest bytes.
      125   read length.  add length+24+1 bytes.
      126   read length.  run of length+3 bytes.
      127   eof/reserved or something

The split was based on some statistical analysis of a couple of files: copies cover a larger range than adds, and runs are rare. Well and 100 is easier to work with.

For a smallest of size 6, this allows a single instruction to encode a copy of 6-105 bytes or an append of 1-24 bytes. These cover the majority of cases for the data i've been testing with (not that it is a very big set) and the behaviour of the matcher.

The multi-byte encoding is such that the 6+ and 4+ bits' worth already implied by the opcode are removed from the stored lengths, which can save the occasional `overflow' byte.

Addresses

This format generated slightly smaller deltas than the previous format but I knew from my experiments with VCDIFF that address compression could increase the gains. The problem here is now i've run out of bits to use so I had to come up with a solution which encoded addresses independently of any operation codes.

Setting aside even a few bits for each address would be too costly so after a bit of thought I came up with a solution based on the observation that most addresses will be >127 and require 2 bytes at least anyway, therefore if I just encode the rare all-7-bit addresses in 16 bits instead it leaves a full 7 bits to use for other encoding schemes whilst retaining the `natural use' of those bits for longer values.

The next problem is how to use those 7 bits. VCDIFF uses 3 bits to select from a near/same table to choose either a positive offset from one of the last 4 addresses or, together with an extra octet, a specific address from a table of 768 (using the default code table). I did some experiments to find out which aids more: 'near' is used more often but 'same' saves a lot each time it is used; but it needs 11 bits to do it. Too much here. I also found that adding a sign to the near address offsets improved the results.

I chose a trade-off which has features of both but requires fewer bits. It combines the same and near table into the same array and adds a sign to the offsets of near addresses. Because I don't need the sign bit for 'same' addresses I can use that to increase the address space of the same table. This allows a full 6 bits to be used for the match table and 5 for the near table.

It is necessarily slower than VCDIFF because I perform a linear search over these values to find an exact match (64 elements) or the nearest match (32 elements). The tables could be overlapped: they are just the last 32 or 64 addresses encoded or decoded stored using a cyclic index. Like VCDIFF both encoder and decoder must maintain this in sync.

This is the encoding used:

00nnnnnn - This is an exact match and a complete address.
           n is an index into a 64-element table.
01Smmmmm - This is a near match.  S is a sign bit and m is
           an index into a 32-element table.  The offset follows.
1aaaaaaa* 0aaaaaaa - An absolute address.  To avoid clashing with
          the above it is forced to at least 2 bytes length.

Note that absolute addresses are just encoded as simple integers: with no `wasted bits' for the other information if their value is greater than 127; which is very likely in the common case.

The encoder is free to choose the smallest option of these for any address in the stream.
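
In decode form the scheme looks something like this - my own sketch with the cyclic table handling simplified, so don't take the details as gospel:

// Sketch only: a plain cyclic array of the last 64 addresses stands in for the
// overlapped same/near tables, and the integer decoder is included as a helper.
class AddressDecoder {
    private final int[] recent = new int[64];
    private int head;

    int decodeAddress(java.nio.ByteBuffer in) {
        int b = in.get() & 0xff;
        int addr;
        if ((b & 0x80) != 0) {
            // absolute: big-endian 7-bit groups with a continue bit, always >= 2 bytes
            addr = b & 0x7f;
            do {
                b = in.get() & 0xff;
                addr = (addr << 7) | (b & 0x7f);
            } while ((b & 0x80) != 0);
        } else if ((b & 0x40) != 0) {
            // 01Smmmmm: near match, signed offset from the m'th recent address
            int sign = (b & 0x20) != 0 ? -1 : 1;
            addr = recent[b & 0x1f] + sign * decodeInt(in);
        } else {
            // 00nnnnnn: exact match, the complete address comes from the table
            addr = recent[b & 0x3f];
        }
        remember(addr);                  // encoder and decoder keep this in sync
        return addr;
    }

    private int decodeInt(java.nio.ByteBuffer in) {
        int v = 0, b;
        do {
            b = in.get() & 0xff;
            v = (v << 7) | (b & 0x7f);
        } while ((b & 0x80) != 0);
        return v;
    }

    private void remember(int addr) {
        recent[head] = addr;
        head = (head + 1) % recent.length;
    }
}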

Results

These are with the same encoder settings of a 'smallest' of 6 bytes, as with the other benchmark data from the home page.

                       dez-1.2     dez-?.?     gzip -4
  GPL2 to GPL3         13 591      12 053
  jjmpeg.dll           10 809       8 770
  bible (compress)  1 731 250   1 539 501   1 550 998 

Runtime is a tiny bit longer for the shorter files due to the address table lookup, although i haven't optimised the address table scan yet. It's still 180x slower than gzip on the KJV.

This actually beats my current VCDIFF encoder but since that is still broken it's pretty much useless to compare to it. Even a single bug can radically alter the patch size.

But one (admittedly small) plus is that unlike VCDIFF this format is fully streamed and doesn't require staging. Actually another plus is that the code is quite a bit simpler due to a more orthogonal instruction set and few special cases, but it only has an implied-fixed code table so isn't as flexible.

Can it go smaller?

Because the string matcher performs an exhaustive search it may find multiple equivalent-length matches for a given target sub-string. This provides an obvious opportunity for selecting a source address that can be represented in fewer bytes than others. This could save some more bytes at the cost of encoding time.

Update: Boy, didn't even let the paint dry on this one. Tried the last idea.

                       dez-1.2     dez-?.?     addr opt    gzip -4
  GPL2 to GPL3         13 591      12 053      11 965
  jjmpeg.dll           10 809       8 770       8 725
  bible (compress)  1 731 250   1 539 501   1 507 072   1 550 998 

Well it's more than zero, so I guess that's a yes?

Tagged hacking, java.
Copyright (C) 2019 Michael Zucchi, All Rights Reserved. Powered by gcc & me!