About Me

Michael Zucchi

 B.E. (Comp. Sys. Eng.)

  also known as Zed
  to his mates & enemies!

notzed at gmail
fosstodon.org/@notzed

Thursday, 02 April 2015, 18:22

another variation on histogram equalise

Yeah I dunno this was something I thought up whilst getting up this morning. I can't even remember why I would have been thinking of such a thing.

The original:

Standard histogram equalise (it's slightly different from the GIMP's due to some range handling but it's just as shithouse):

My other really cheap tunable histogram equalise based on truncation, at the default setting. Maybe a bit dark in the dark areas but it is tunable further.

The new one tries to solve the same problem but in a completely different way:

And another test image which probably shows it even better:

Here the standard operator just makes a pig's breakfast of pretty much everything:

The simple limit works pretty well here:

And here I think the new one pips it. It retains about the same amount of detail within the coins but adds more contrast to the surface they're resting on without burning it out.

The previous algorithm applies a hard ceiling, based on the mean, to the histogram bins after they are calculated.

This algorithm instead applies a localised sameness test to each pixel before it is added to the histogram, and pixels that fail the test are simply not counted. This means it drops any areas of similar or slow-changing intensity out of the histogram calculation. There are some tweaks needed to handle extreme cases but, as can be seen, it works pretty well as-is on challenging input.

But it's somewhat more expensive to calculate, so the previous algorithm is probably more desirable given the similarity of the results.
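
For what it's worth, here's a rough Java sketch of my reading of that description - not the actual code, and the 3x3 neighbourhood and threshold are guesses:

  // Pixels whose local neighbourhood varies by less than 'threshold' are
  // excluded from the histogram; the cumulative mapping is then applied to
  // every pixel as usual.
  static void equalise(int[] grey, int width, int height, int threshold) {
      int[] hist = new int[256];
      for (int y = 1; y < height - 1; y++) {
          for (int x = 1; x < width - 1; x++) {
              int c = grey[y * width + x];
              int lo = c, hi = c;
              for (int dy = -1; dy <= 1; dy++)
                  for (int dx = -1; dx <= 1; dx++) {
                      int v = grey[(y + dy) * width + (x + dx)];
                      if (v < lo) lo = v;
                      if (v > hi) hi = v;
                  }
              if (hi - lo > threshold)
                  hist[c]++;              // only 'busy' pixels are counted
          }
      }
      int total = 0;
      for (int n : hist)
          total += n;
      int[] map = new int[256];
      int sum = 0;
      for (int i = 0; i < 256; i++) {
          sum += hist[i];
          map[i] = total > 0 ? (int) ((long) sum * 255 / total) : i;
      }
      for (int i = 0; i < grey.length; i++)
          grey[i] = map[grey[i]];
  }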

I just snarfed the base pics from over here.

Unrelated musings and rants

Hmm, 2am again. My fibre NBN was down again tonight and I discovered my old ADSL is still connected and still being billed. Yay. What the hell, Internode: make it bloody clear that "switch plans" isn't a switch, if it isn't a switch. This whole thing really hasn't been anywhere near as smooth as it should've been; then again they charged me for permanent dial-up for about 6 years after I got ADSL, so they certainly have form.

Probably should've gone to bed after that and a bit of TV, but it's been a long week of trying to decipher some odd(?) C++ which wasn't very pleasant and I needed to 'relax'. Although the C++ thing ended with some success.

I wrote a fairly long rant against it yesterday but decided not to post it. In brief it started with the pleasant surprise of the performance of JAXB on a 200MB XML file (hah, not memory performance, but running time), and another in discovering that you can put a whole array of primitives into an attribute, which drops the verbosity considerably - the 200MB was after finding this out so it could've been much worse. Then a realisation of self-stupidity after spending an hour or so writing an s-expression loader from scratch for a completely one-off, throw-away data-transfer job (wasn't thinking straight there). Two lines of calling JAXB and a few annotated holding classes were only about 3x slower at reading the XML than the C++ all-hand-written, all-in-header-files macro-template-stuff was at reading its own proprietary binary format. For the task that would probably do, but I want to run this a few times and 5 seconds is too long, so that led to some mirth once I tried the Java default all-i-did-was-add-an-interface serialisation via Object*Stream: about 5x faster and 30% smaller than the C++. And finally some satisfaction in porting a number-crunching algorithm across and having it run 50% faster, with simple plain code using consistent syntax that compiles in a 10th of a second rather than 30, and no multiple-screen error messages for using the wrong type of bracket in one place.

Not that C++ programmers have writing needlessly complex code all to themselves; it seems to be a universal skill and one widely practiced. C++ does seem to attract adepts of the highest skill in the art though, and fortunately for these code-fiends it has the most facilities to aid them in their vitally important work. They certainly seem to take pride in it. Java is probably a close second truth be told, but probably for very different reasons: the whole weird self-help-like book/training scene and probably just a lot of newbies who don't know better, at least at one time (any time I look back ~5 years I always see things I wouldn't do that way now, or worse).

That is not abstraction

Now I remember one thing I thought of whilst writing the post yesterday based on my recent C++ exposure.

The C++ language just doesn't seem to understand what abstraction is for. It seems so busy hiding as many actually important machine-level details as it can, whilst exposing in intricate detail as much fluffy non-essential information about its own type system as the compiler needs to have any hope of understanding what you intended, together with some vague (and basically worthless) guarantee that all this noise will produce somehow "optimal" code - for some interpretation of "optimal" which doesn't seem to take into account physical hardware `as such' (not surprisingly given the preceding paragraph).

When these features are used in libraries (or perhaps conglomerations of syntax more succinctly described as "the headers are a platform") the information hiding can be taken to an absurd level: they've actually managed to hide the ability to understand even the syntax of a single line of program code in a widely used language by just looking at it. Now not only do you need to know the basic syntax and semantics of the language itself, you also need to have an intimate knowledge of the entire platform included in the header files before you know the meaning of basic expressions. Even in the one place where semantics cannot be deduced without extraneous knowledge - i.e. function invocation - C++ allows the removal of the one small mnemonic you had as a guide. Just think of all the man-years spent getting this snot to compile.

But back to the point - by doing this you haven't hidden anything - you've just exposed everything. Because without knowing everything you can't understand any of it.

Maybe it was a specifically weird bit of code, but I did some searches and it seems the things it does are common and even seen as desirable. But people also believe the strangest things, and the internet is on average simply full of shit.

Tagged graphics, java, rants.
Tuesday, 31 March 2015, 16:07

other little things

So it turned out I was quite wrong on the prefix-sorted table thing: it can be done in less memory and more efficiently, and there's a whole field of research newly reinvigorated by search engines, genome stitching and modern hardware capabilities. But for a casual observer most of it is a bit complex for a weekend hack and needs some pretty messy pointer-filled code, so it doesn't translate well to Java; I tried a few things but gave up on that idea since my naive approaches weren't much worse in speed than the code I could find on the sort of data I'm playing with. If it was that important I would just use some C, but it really isn't. But like most of this stuff it's good to know it's out there should I actually need it some day. And the Maven stuff everyone in Java-land now insists on using was giving me the utter shits too; e.g. oh, let's just import a whole fecking platform just so we can use one trivial one-line function that just calls java.util.* anyway, ... once. Why, why?

Then I got a bit side-tracked in trying to re-use the system primitive array sort to sort sub-strings by sorting them 4 bytes at a time packed into a long. The lower 31 bits are the offset into the array (i.e. base pointer) and the next 32 bits are 4 bytes to sort in order, with truncation handled using 0 padding. As it only uses 63 bits I can use a signed long sort to sort 4 bytes of string at a time. This gets sorted, then ranges with a common prefix are recursively sorted again, and so on. Managed about a 2x speedup over a straight memcmp sort, which is probably nothing to sneeze at but not really enough here. Then I tried 8-bit radix sorting the first pass: gains a little bit more. And then a 16-bit radix first pass, which was over 2x speed-up again, although by this time I didn't feel like writing multiple whole sorts just to test ideas and they still have undesirable worst-case behaviour. Actually a 6x speedup (if the code is functionally correct, which I didn't get to checking) is pretty decent: but it's delving into territory where far better and more scalable algorithms for the problem exist and I already have one that works well enough for me as a (better than) "70% solution".
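
Roughly what I mean by the packing, as a sketch rather than the code I actually used:

  // Pack the 4 bytes at 'off' (zero padded past the end) into bits 62-31 and
  // the offset itself into bits 30-0.  Bit 63 stays 0, so a plain signed
  // Arrays.sort() orders by the 4-byte prefix with ties broken by offset.
  static long packKey(byte[] data, int off) {
      long prefix = 0;
      for (int i = 0; i < 4; i++) {
          int b = (off + i < data.length) ? (data[off + i] & 0xff) : 0;
          prefix = (prefix << 8) | b;
      }
      return (prefix << 31) | off;
  }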

Sorting is something I poke at every now and then just for interest's sake so I may still revisit it later. I can't even remember if I've done radix sorting before, but I probably should have given it is parallelisable in a massively-parallel sense.

I also had the headache of trying to troll through a big C++ library yesterday trying to understand something. Ugh, it's like reading through a whole C application written using pre-processor macros, one for every case imaginable but few actually ever realised. Apparently this is a desirable thing in C++. But yeah, abstraction isn't really about hiding so much detail you can't work out what something is doing without intimately knowing everything about it; that's pretty much the opposite. And those compile times. And those error messages. With all that I'm surprised how much work the language and compiler still force on the developer, and yet it is trivial to write incorrect code. But I digress, I had enough of that yesterday.

Limits

I did have another idea for the string matcher either this morning or yesterday which I finally just remembered and tried: just truncate the maximum hash bucket length to some fixed maximum. It drops the match rate (particularly for popular symbols) and thus the compression rate but it drastically improves run-time speed and worst-case behaviour. Saves some memory too. Ok, I guess everyone else does this already but I tried to get by without it; alas, I could not. Because I'm using a bucket-based hash table I've got some more options on how I deal with it; I don't just have to throw entries away.

Actually the way I implemented it I think it has some interesting side-effects together with the new delta format; common prefixes are updated more often and thus are more likely to be close by or in the exact-address table, so the address compression works better. Initially I tried replacing a random record with the new entry: this is usually a good idea for mathematical reasons "I recognise but don't completely understand", but generating the random numbers proved to be too expensive for short files. So I just have a rolling index shared across all records which is super-cheap and should be relatively random in a local sense and merely practical in all others.
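
Something along these lines, as a sketch of the idea rather than the implementation (the flat bucket layout and names are made up):

  // Each bucket holds at most LIMIT offsets; once full, new entries overwrite
  // a slot chosen by a cheap rolling counter shared across all buckets.
  class CappedIndex {
      static final int LIMIT = 15;            // one of the (2^n - 1) limits
      final int[][] buckets;
      final int[] counts;
      int rolling;                            // shared rolling replacement index

      CappedIndex(int nbuckets) {             // nbuckets assumed a power of two
          buckets = new int[nbuckets][LIMIT];
          counts = new int[nbuckets];
      }

      void add(int hash, int offset) {
          int b = hash & (buckets.length - 1);
          if (counts[b] < LIMIT) {
              buckets[b][counts[b]++] = offset;
          } else {
              buckets[b][rolling] = offset;
              rolling = (rolling + 1) % LIMIT;
          }
      }
  }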

So here are some variations and times for my two current format encoders on the KJV text - on the smaller files it doesn't make much difference. The limit is (2^n - 1) because I store the chain length/insertion point in the first index and I double the size each time I grow it. I included one result when using random replacement: as suggested it does improve the compression - even if only very slightly! Here the cost is somewhat hidden by the string compares. The dez home page has the benchmarks of the other software I've tried; they are not even remotely close in terms of output size, even compared to the fastest and "worst" below.

                   time                  size
             dez-1      dez-2*      dez-1    dez-2
limit
3            0.232623   0.319473    2596997  1970223
7            0.334969   0.433130    2328817  1831470
15           0.429424   0.521525    2157175  1757288
15 (random)  0.572432   0.666268    2156093  1756776
31           0.545488   0.626991    2028280  1702969
63           0.732696   0.822705    1934709  1659212

unlimited    9.724143   9.853381    1731250  1539501

     empty string to KJV bible "delta" encode

* Actually I'm going to just call dez-2 `dez-1'.

The unlimited case is a bit (well, 70%!) faster than previously because I added a small optimisation to the string length matcher: it first checks to see if the next character after the current best length is a match; if not it doesn't bother scanning the rest of the string at all. But still, whether 10s or 17s, it's still pretty shit (the main issue with this text is that it is indented, so 25K+ substrings begin with newline+several spaces and they all end up in the same hash bucket).

For a given limit both format encoders are being fed an identical list of edits, which shows just how much better the second format is and why the first will vanish next time I do a code-drop.

Tagged dez, hacking, java.
Sunday, 29 March 2015, 11:53

more in less

Curiosity got the better of me again and now there's a 3:30am night and an early rise behind me and pretty much another whole weekend "down the drain".

I had another go trying to grok VCDIFF format but failed again: not sure if it's my encoder or decoder but I can't get them to agree on the address compression and so it fails on most data.

Putting that aside I experimented with some ideas on address compression from VCDIFF, and then ended up creating a new format based on what i learnt from it and other experiments. My work with VCDIFF demonstrated the utility of address compression and dual operation instructions.

Anyway I think it turned out quite neat and tidy so here's a summary. It might be useful if you're trying to understand VCDIFF - it's very different but borrows similar mechanisms.

Opcodes

There are basically two families of opcodes: dual-operation and single-operation. The MSb selects which, so each has a full 7 bits of information to play with. This allows for quite a wide range of data lengths and operations to be encoded into a single byte.

Only 6 individual operations are defined together with 2 dual operations.

Integers are encoded, as in many formats, as unsigned 7-bit big-endian with the MSb as a continue bit. Addresses are encoded independently using a separate mechanism as described below.
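
That is, something like this sketch (not lifted from dez):

  // Unsigned integers as big-endian 7-bit groups, MSb set on every byte
  // except the last.
  static void writeUint(java.io.OutputStream out, int v) throws java.io.IOException {
      int shift = 28;                         // up to five 7-bit groups in 32 bits
      while (shift > 0 && (v >>> shift) == 0)
          shift -= 7;
      while (shift > 0) {
          out.write(0x80 | ((v >>> shift) & 0x7f));
          shift -= 7;
      }
      out.write(v & 0x7f);
  }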

There are a couple of other format parameters:

smallest
The smallest copy size possible. Encoder dependent.
split
The split point for the single operation opcodes. Fixed at 100 but could be changed.

Dual operation instructions

The encoder tracks the last instruction and if possible can combine two into the same instruction. This is one of the things VCDIFF uses to get more compact deltas.

0Taaabbb
  0 - selects a dual instruction
  T - the type of the first operation, 0=ADD 1=COPY
aaa - 3-bit length for first operation.  This allows a range of
      (1-8) byte appends or (0-7)+smallest copy.
bbb - 3-bit length for second operation.  It is always COPY.
      This allows a range of (0-7)+smallest copy.

3 bits allows for an ADD of 1-8 bytes and a COPY of (0-7)+smallest bytes. With a smallest size of 6 bytes this allows ADD(1-8)+COPY(6-13) or COPY(6-13)+COPY(6-13).

This exhausts all the bits, so these are the only dual instructions possible. Any data or addresses required are encoded in order in the following bytes.

I also investigated a fixed function 3-bit ADD and 4-bit COPY but it wasn't as good (albeit with very limited investigation).
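
In code the dual opcode byte works out to something like this sketch, with 'smallest' being the format parameter above:

  // Build a 0Taaabbb dual-operation byte.  type: 0 = ADD+COPY, 1 = COPY+COPY.
  // The first length is 1-8 for ADD or smallest..smallest+7 for COPY; the
  // second operation is always a COPY of smallest..smallest+7 bytes.
  static int dualOp(int type, int len1, int len2, int smallest) {
      int a = (type == 0) ? (len1 - 1) : (len1 - smallest);
      int b = len2 - smallest;
      return (type << 6) | (a << 3) | b;      // top bit 0 selects a dual op
  }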

Single operation instructions

The single op instructions allow for longer immediate copies as well as extended instructions which encode the length as a separate parameter.

Rather than splitting the value into bit-selected fields, the single 128-value number is broken into several ranges which are interpreted differently.

1nnnnnnn
  1 - selects single operation
  n - 7-bit number interpreted via inclusive ranges as follows:

000 - 099   copy (0-99)+smallest bytes
100 - 123   add (1-24) bytes
      124   read length.  copy length+100+smallest bytes.
      125   read length.  add length+24+1 bytes.
      126   read length.  run of length+3 bytes.
      127   eof/reserved or something

The split was based on some statistical analysis of a couple of files: copies cover a larger range than adds, and runs are rare. Well, and 100 is easier to work with.

For a smallest of size 6, this allows a single instruction to encode a copy of 6-105 bytes or an append of 1-24 bytes. These cover the majority of cases for the data i've been testing with (not that it is a very big set) and the behaviour of the matcher.
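
Decoding the 7-bit value then reduces to a few range tests; a sketch which just dumps the decoded operations, where readUint() is assumed to be the matching reader for the integer encoding above:

  static void decodeSingle(int n, java.io.InputStream in, int smallest)
          throws java.io.IOException {
      if (n <= 99)
          System.out.println("COPY " + (n + smallest));            // length in opcode
      else if (n <= 123)
          System.out.println("ADD " + (n - 99));                   // add of 1-24 bytes
      else if (n == 124)
          System.out.println("COPY " + (readUint(in) + 100 + smallest));
      else if (n == 125)
          System.out.println("ADD " + (readUint(in) + 24 + 1));
      else if (n == 126)
          System.out.println("RUN " + (readUint(in) + 3));
      else
          System.out.println("EOF/reserved");
  }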

The multi-byte encoding is such that the 6+ and 4+ amounts already implied by the opcode are removed from the encoded lengths, which can save the occasional `overflow' byte.

Addresses

This format generated slightly smaller deltas than the previous format but I knew from my experiments with VCDIFF that address compression could increase the gains. The problem here is that now I've run out of bits to use, so I had to come up with a solution which encodes addresses independently of any operation codes.

Setting aside even a few bits for each address would be too costly, so after a bit of thought I came up with a solution based on the observation that most addresses will be >127 and require at least 2 bytes anyway; therefore if I just encode the rare all-7-bit addresses in 16 bits instead, it leaves a full 7 bits to use for other encoding schemes whilst retaining the `natural use' of those bits for longer values.

The next problem is how to use those 7 bits. VCDIFF uses 3 bits to select from a near/same table to choose either a positive offset from one of the last 4 addresses or, together with an extra octet, a specific address from a table of 768 (using the default code table). I did some experiments to find out which aids more: 'near' is used more often but 'same' saves a lot each time it is used; but it needs 11 bits to do it. Too much here. I also found that adding a sign to the near address offsets improved the results.

I chose a trade-off which has features of both but requires fewer bits. It combines the same and near table into the same array and adds a sign to the offsets of near addresses. Because I don't need the sign bit for 'same' addresses I can use that to increase the address space of the same table. This allows a full 6 bits to be used for the match table and 5 for the near table.

It is necessarily slower than VCDIFF because I perform a linear search over these values to find an exact match (64 elements) or the nearest match (32 elements). The tables could be overlapped: they are just the last 32 or 64 addresses encoded or decoded stored using a cyclic index. Like VCDIFF both encoder and decoder must maintain this in sync.

This is the encoding used:

00nnnnnn - This is an exact match and a complete address.
           n is an index into a 64-element table.
01Smmmmm - This is a near match.  S is a sign bit and m is
           an index into a 32-element table.  The offset follows.
1aaaaaaa* 0aaaaaaa - An absolute address.  To avoid clashing with
          the above it is forced to at least 2 bytes length.

Note that absolute addresses are just encoded as simple integers, with no `wasted bits' for the other information if their value is greater than 127, which is very likely the common case.

The encoder is free to choose the smallest option of these for any address in the stream.
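
As a sketch - hand-waving the table maintenance, which the real encoder and decoder keep in sync - choosing and writing an address might look like:

  // same[] holds the last 64 addresses, near[] the last 32; writeUint() is
  // the integer writer from above.  The choice here is simplified: exact
  // match first, then a near match if the offset fits in one byte, else an
  // absolute address forced to at least 2 bytes.
  static void writeAddress(java.io.OutputStream out, int addr,
                           int[] same, int[] near) throws java.io.IOException {
      for (int i = 0; i < same.length; i++)
          if (same[i] == addr) {
              out.write(i);                           // 00nnnnnn: exact match
              return;
          }
      int best = 0;
      for (int i = 1; i < near.length; i++)           // nearest recent address
          if (Math.abs(addr - near[i]) < Math.abs(addr - near[best]))
              best = i;
      int offset = addr - near[best];
      if (Math.abs(offset) < 128) {
          int sign = offset < 0 ? 0x20 : 0;
          out.write(0x40 | sign | best);              // 01Smmmmm + offset
          writeUint(out, Math.abs(offset));
      } else if (addr < 128) {
          out.write(0x80);                            // force a 2-byte absolute
          out.write(addr);
      } else {
          writeUint(out, addr);                       // already 2+ bytes, MSb set
      }
  }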

Results

These are with the same encoder settings of a 'smallest' of 6 bytes, as with the other benchmark data from the home page.

                       dez-1.2     dez-?.?     gzip -4
  GPL2 to GPL3         13 591      12 053
  jjmpeg.dll           10 809       8 770
  bible (compress)  1 731 250   1 539 501   1 550 998 

Runtime is a tiny bit longer for the shorter files due to the address table lookup, although I haven't optimised the address table scan yet. It's still 180x slower than gzip on the KJV.

This actually beats my current VCDIFF encoder but since that is still broken it's pretty much useless to compare to it. Even a single bug can radically alter the patch size.

But one (admittedly small) plus is that unlike VCDIFF this format is fully streamed and doesn't require staging. Actually another plus is that the code is quite a bit simpler due to a more orthogonal instruction set and fewer special cases, but it only has an implied fixed code table so isn't as flexible.

Can it go smaller?

Because the string matcher performs an exhaustive search it may find multiple equivalent-length matches for a given target sub-string. This provides an obvious opportunity for selecting a source address that can be represented in fewer bytes than others. This could save some more bytes at the cost of encoding time.

Update: Boy, didn't even let the paint dry on this one. Tried the last idea.

                       dez-1.2     dez-?.?     addr opt    gzip -4
  GPL2 to GPL3         13 591      12 053      11 965
  jjmpeg.dll           10 809       8 770       8 725
  bible (compress)  1 731 250   1 539 501   1 507 072   1 550 998 

Well it's more than zero, so I guess that's a yes?

Tagged hacking, java.
Saturday, 28 March 2015, 09:50

dez 1.2

So I came up with a couple more ideas to try ... actually they all pretty much amounted to nothing but here is a small update anyway. See the notes on the home page.

Some things I tried and discarded: incrementally chaining the hash table (avoiding the realloc+copy); inlining the hash function; calculating all hashes of source and target at the start; doing the same in parallel. Most of them amounted to no gain although the parallel hashing helped quite a bit in some cases but not often enough to be worth the memory required.

Thus this is mostly a post about the expanded home page, where I added a bit more detail and more benchmarks.

I was a bit surprised with the memory use compared to the "competition" because dez amounts to a full exhaustive search, albeit index assisted, whereas the others do not.

Tagged code, dez, hacking, java.
Friday, 27 March 2015, 10:44

synthz 0.0

Decided to code-drop the audio synthesiser I hacked up a couple of weeks ago.

Probably of no use to anybody but what the hell. Maybe one day I'll make some better sfx for that space invaders I did.

It's got its own home page up on the usual spot.

Tagged code, games, hacking, java.
Friday, 27 March 2015, 07:00

dez 1.0

I packaged up the best version of the delta generator into dez-1.0. It and some details are available on the dez home page.

A couple of nights ago I tried one of the hashtable ideas from a previous post. It made a big improvement so I kept whittling it away until there really wasn't much of a hash table left apart from a couple of arrays. Fastest version using the least memory. Open addressing just didn't handle some bad cases at all well.

I didn't see the point in keeping the others around so I only kept this one in the release.

And that should be me done with this for a while.

Tagged code, hacking, java.
Tuesday, 24 March 2015, 12:18

Getting the runs.

As is usually the case I kept poking last night and added a runlength detector to the internal matcher loop of the delta processor.

To start with, any sequence which begins with 3 common bytes is not added to the hash table. For my problem file this reduces the hashtable load significantly.

Then when scanning the target data, any location which starts with 3 common bytes is not looked up in the hash table and is just converted to a byte run instead.
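
In other words the test at each location is something like this (a sketch of my reading of "3 common bytes"):

  // A location starts a run if the next 3 bytes hold the same value.
  static boolean isRunStart(byte[] d, int i) {
      return i + 2 < d.length && d[i] == d[i + 1] && d[i + 1] == d[i + 2];
  }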

I had to add runs to the byte encoding, and in doing so I noticed that ADD and RUN are typically tiny, so I just made the command+length byte always 1 byte with 6 bits of length. But I might change that to 5 bits of length so that it can still have a continue bit, since most are under 32 bytes as well, and that will protect against the occasional long length.

So something like this:

   00XXXXXX            - copy + 6 bit length
   10XXXXXX CXXXXXXX*  - copy + extended length
   011XXXXX            - add + 5 bit length
   111XXXXX CXXXXXXX*  - add + extended length
   010XXXXX            - run + 5 bit count
   110XXXXX CXXXXXXX*  - run + extended count

Expansion space? Why? If need be I could let $00 be an extended command byte or something.
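
For what it's worth, classifying a command byte in that layout is just a couple of mask tests; a sketch of my reading of it, with the extended-length combination elided:

  static String classify(int cmd) {
      boolean extended = (cmd & 0x80) != 0;           // top bit: more length follows
      if ((cmd & 0x40) == 0)
          return "COPY len=" + (cmd & 0x3f) + (extended ? "+ext" : "");
      if ((cmd & 0x20) != 0)
          return "ADD len=" + (cmd & 0x1f) + (extended ? "+ext" : "");
      return "RUN count=" + (cmd & 0x1f) + (extended ? "+ext" : "");
  }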

The delta sizes are slightly larger but the runtime is ~2 orders of magnitude better on this particular problem file (jjmpeg.dll) with similar settings. And it's like ~1s vs 10ms rather than 10ms vs 100us.

(6,1)      delta        size
 rev    orig     new    
  59  155323  155323    155323
  57   15509   15626    154093
  55    8168    8227    154093
  53   10425   10490    153839
  47    9942    9958    153095
  45   15879   16000    145425
  40   16584   16701    146274
  36   14561   14632    144658
  22   10751   10832    144145
  20      38      38    144145

           jjmpeg.dll

These are the deltas for jjmpeg.dll from the trunk of jjmpeg. It's a reverse delta, so the top line is the latest version with no processing and each subsequent line is an edit list which turns the previous version into it.

One probably obvious thing I hadn't realised is that reverse deltas work particularly well on source code - or anything that is added to over time. Since most edits are additions, a reverse delta is mostly just deletes (actually, just no commands at all).

 rev   delta    size
 147    3977    3977
 146      31    3801
 145      26    3671
  85     101    3177
  53      28    3180
  52      39    2591
  45      49    2488
  16      48    1645
  13      15    1556
  11      42    1300
   6      25    1194
   2      20    1051

     AVNative.java

I'm not working on a revision control system for source code, it just provides easy test data.

So with that I'm pretty well done with that* ... unless I can do something about the hash table.

With short matching patterns required to get the best results certain data gets repeated a lot which is problematic for the open addressed hashing. Most solutions I can think of just blow the memory out since as soon as you start adding links it at least doubles the memory.

After a bit of thought a couple of possibly practical ideas spring to mind ...

Overflow and keep.
Once a search exceeds a certain range drop subsequent records into an overflow record. The record in the table can be a negative index to the overflow record.

Clashes still cause 'difficulty' because the match scan will have to deal with multiple overflow records. Although they could include the hash key to help filter them out.

Overflow all.
This is similar but rather than leave the entries there all values for the same key are extracted and placed into the overflow record and the rest are marked deleted. The first available slot is set to point to the overflow record.

The problem here is that the hash keys are not stored in the table and they are required to be able to filter the entries since they could include other hash keys. But! They can be regenerated quite cheaply.

Hybrid and keep.
Have a second smaller chained (or otherwise) hash table and if any search exceeds some limit store values in that instead.

This is the simplest but would require two lookups each time.

Hybrid all.
Combine the last two. Move all the other values of that key to the new table too.

This would only require two lookups for values which are not present in the overflow table, which by definition occur less often. It's not really the lookups which are the problem though.

Just thoughts, haven't tried anything.

The journey never ends

So while thinking about the next problem: the scalability of the brute force search of all the substrings with the same keys ... I thought of an exact solution that combines the string search with the index.

It's pretty obvious so it probably has someone's name attached to it. Update: Well, sort of - it's a longest common prefix table, or LCP table. The below is definitely a naive approach with O(N log N) running time, but it is space efficient and simple to understand.

First, the whole file is pre-processed into a common prefix tree.

  1. Sort a list of every substring in the source;
  2. Iterate over the sorted list and count the length of the common prefix;
  3. (optional) build an index of the first row of each possible byte.

Because it just uses pointers/indices it doesn't take up any memory for the data and it only needs one entry for each location.

This data structure is basically a directed graph where each offset in 'starts' is a branch and each value in 'commons' is a link to the branch below.

 0 @ 15 c=1 '  Test.'
 1 @ 16 c=1 ' Test.'
 2 @ 7  c=1 ' a test.  Test.'
 3 @ 4  c=1 ' is a test.  Test.'
 4 @ 9  c=0 ' test.  Test.'
 5 @ 21 c=1 '.'
 6 @ 14 c=0 '.  Test.'
 7 @ 17 c=1 'Test.'
 8 @ 0  c=0 'This is a test.  Test.'
 9 @ 8  c=0 'a test.  Test.'
10 @ 18 c=4 'est.'
11 @ 11 c=0 'est.  Test.'
12 @ 1  c=0 'his is a test.  Test.'
13 @ 5  c=3 'is a test.  Test.'
14 @ 2  c=0 'is is a test.  Test.'
15 @ 6  c=2 's a test.  Test.'
16 @ 3  c=1 's is a test.  Test.'
17 @ 19 c=3 'st.'
18 @ 12 c=0 'st.  Test.'
19 @ 20 c=2 't.'
20 @ 13 c=1 't.  Test.'
21 @ 10 c=0 'test.  Test.'

This is the result for 'This is a test. Test.'. '@' is the position in the original string, and 'c=' is the number of characters of common prefix with the string below.
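
A naive sketch of building that table - boxed indices so the standard sort comparator can be used, which is exactly the limitation mentioned below:

  // Sort every suffix position, then record the common-prefix length with
  // the row below (the '@' and 'c=' columns in the dump above).
  static void buildPrefixTable(byte[] s) {
      Integer[] starts = new Integer[s.length];
      for (int i = 0; i < s.length; i++)
          starts[i] = i;

      java.util.Arrays.sort(starts, (a, b) -> {
          int i = a, j = b;
          while (i < s.length && j < s.length && s[i] == s[j]) {
              i++;
              j++;
          }
          if (i == s.length || j == s.length)
              return (s.length - a) - (s.length - b); // shorter suffix first
          return (s[i] & 0xff) - (s[j] & 0xff);
      });

      int[] commons = new int[s.length];
      for (int k = 0; k + 1 < s.length; k++) {
          int i = starts[k], j = starts[k + 1], c = 0;
          while (i + c < s.length && j + c < s.length && s[i + c] == s[j + c])
              c++;
          commons[k] = c;
      }
  }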

Using this table it is a simple walk to find the longest string from these which matches an arbitrary string.

And of course ... this is exactly the problem I was solving originally a few days ago.

Sorted Array Delta

So I started from scratch and wrote another delta builder based only on this technique. I only implemented the searches of the source buffer but it does do runs. I had to use a custom quick-sort as the java Arrays sort will not let me override the comparator for primitive arrays. I use an index for the first byte to jump to the right location to speed up the first level of the tree.

So how's it go?

Run-time wise it's something like 1.5-10x slower than the best hash-based version I have depending on the data - the closest end is with jjmpeg.dll. The sorting stage can dominate the processing time: for jjmpeg.dll it's 5/6ths of the total processing time but for GPL2 it's only 1/2.

But it only needs half the memory of the best of the rest, so it's not all bad.

Rubbish

I continue to be surprised at how well Java copes with lots of object allocation and garbage; despite using 10x the memory, using a HashMap<Integer,ArrayList<Integer>> is pretty much on par with the open addressed hash table even when it's working well, and much faster compared to the failure cases for the dll file. It's tons faster than reference counting (and threads, Java is a dream for threads) and I never found explicit allocation very fast to start with. Not quite as quick as alloca().
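
That collections version is nothing fancier than the obvious (a sketch):

  // Hash key -> list of offsets, letting the collections do all the work.
  java.util.Map<Integer, java.util.List<Integer>> index = new java.util.HashMap<>();

  void add(int hash, int offset) {
      index.computeIfAbsent(hash, k -> new java.util.ArrayList<>()).add(offset);
  }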

* - i'm never done with that.

Tagged hacking, java.
Monday, 23 March 2015, 09:31

deltas and diffs again

I had a quick look at cleaning up the delta code I wrote yesterday and got sidetracked into making a more optimised version - one where the hash table and other bits are just in-lined into a single class.

Firstly, it's about 20% faster so far: as I suspected, simple micro-optimisations really count here.

Secondly, it's somewhat better at producing compact deltas. Oops, I guess I had a bug somewhere in the code - the original still produces working output but it doesn't find as many matches as it should.

With the new encoder the (6,1) case creates ~16K of delta compared to ~18K previously. And now using a smaller block size of (4,1) does even better at ~14K5 where previously it only hindered.

I also realised just how simple it was to add the currently-decoded target data into the search space of the encoder and copy space of the decoder. Literally a couple of lines in the decoder and under 10 in the encoder.

This drops the test case of GPL2 to GPL3 text to under 13 000 bytes for DEZ1(4,1) and about 13K5 for DEZ1(6,1).

This also allows it to work as a functional if not particularly terrific compressor. An empty buffer to GPL3 "delta" is about 17K. "gzip -9" (== GZipOutputStream defaults) is 12k with comparable execution times (gzip a bit better and scales better). Not that it is very useful in general but concatenating 10x GPL3 texts and then performing the same operation produces just an ~18K delta vs ~100K+ gzip stream.

It's pretty memory hungry on the best setting though; windows could be used.

I'm also curious as to applying this to line-by-line diffs.

Update: Ahah, not so good with some binary data. It creates pretty good deltas but the performance drops a few orders of magnitude. Any data with lots of '00 00 00' type sequences blows out the hash collision rate and breaks the performance of the open hashing algorithm. A straight Java collections implementation (with ArrayLists and boxed Integers and all) scales a lot better (and surprisingly runs about the same speed in general) although it uses way more memory.

It can be mitigated by limiting the search for an empty slot or other trade-offs at a pretty hefty cost to the generated delta size. Maybe detecting runs would help some.

lambdas and finals

I just tried using -verbose:gc to get some basic memory statistics and it looks like the previous implementation uses way more temporary memory than it should. It's not allocating any differently so I can only think it's something to do with the foreach/lambda code, and this probably explains the extra runtime.

To confirm, I moved a local variable I had accessed as a final to an object field: yep, it drops ~500K of garbage. Moving the same lambda code to a field variable as well saved another ~500K too, but even then it's still ~1M more than the other implementation.

"hmm".

Tagged hacking, java.
Copyright (C) 2019 Michael Zucchi, All Rights Reserved. Powered by gcc & me!