versioning, deltas and diffs
I've been jumping around a bit with hacking lately and this weekend i've been playing with a versioned database and related stuff. I have a fairly long article written about it but it's not quite ready yet. I came up with a few variations and some of the features included cheap copies, cheap branches, proper renaming and so on. Apart from the fun-factor I do have an end-use in mind but i'm still some way off.
One goal was also efficient storage so I worked on a system which stored changes as reverse deltas. After some searching i came across the vcdiff format and some java code to work with it - but much of each implementation is pretty undecipherable code and to be honest just not that hot and I had a niggling interest to look a bit further into the problem.
So today I basically wrote my own from scratch and surprisingly it creates smaller diffs than jvcdiff given that it is literally an afternoons hacking and a naive approach at best (at least on a single example). Runtime is comparable although in either case they are very quick. I'm still trying to get hard statistics on memory use and gc load but i'm pretty sure it's better too.
I also looked at badiff which works with very large files, but that isn't something I need and it generated very fat diffs, used a ton of memory, and wasn't very fast (putting it lightly, see below).
Anyway the problem turned out to be quite interesting along the way.
Delta Algorithm?
It's more or less based on Bentley-McIlroy's paper `Data Compression Using Long Common Strings' although that paper is so brief I didn't get a great deal from it after a skim read. It's guts I think is more or less based on Rabin-Karp's paper `Efficient randomised pattern-matching algorithms' but that was too dense for weekend reading so after my eyes glazed over I did a quick scan of the wikipedia article and then thought about what problem it was actually solving.
In more human terms I think it basically devolves into this: compare every sub-string in the target array to every sub-string in the source array to identify the longest match. Use it to do shit. The details come in trying to make this practical and efficient and what `shit' you do with `it'.
So the following algorithm is kinda probably the same thing but only based on a loose understanding thereof.
# algorithm to generate a list of editing commands # which transforms one array into another hsize = length of sub-string hstep = bytes to step in input htable = hash table mapping hash value to location # build hash table for each location in source in steps of hstep htable.addHash(source, location) end # find matches for each location in target hash = generateHash(target, location) for each candidate in htable.lookup(hash) determine longest actual match of source @ candidate against target @ location end if there was a match and longest > threshold if any data not matched output an append command endif output copy command location += match length fi end add any left over
As one can see it's actually rather simple and mirrors the one-sentence English text description in a straightforward way. The hash table is only used as a helper to narrow the search space to possible candidates rather than having to check them all.
It requires the full source data but could trivially stream the target data. Restoring the target requires the source plus the list of editing commands.
hsize is typically 4-8 and determines the runtime and compression rate. A smaller number gives more potential candidates to match against (and more hashing clashes) but a larger hash will not match smaller strings.
hstep sets how often the source is sampled. Using a hstep of 1 gives the best result by far but results in a large hashtable and longer running time.
Rolling hashes are used so the hash length has no effect on that calculation but it does affect the runtime in other ways.
There are two interesting problems here - rolling hashes and the hash table itself.
Rolling hash
I tried to implement a Rabin-Karp hash but lost patience trying to get my head around the modulo maths. I think i just had some indexing bug rather than mathematical ones but in the end I felt more comfortable with a cyclic hash based on rotating bit arithmetic. The mathematics are still messy but even the 6510 cpu has a 'ror' instruction and so the actual code is trivial for a digital computer.
// Copyright (C) 2015 Michael Zucchi // License is GNU GPL v3 or later public class CyclicHash { int b; int hash; int fix0; static final int[] random = new int[256]; static { Random r = new Random(); for (int i = 0; i < random.length; i++) random[i] = r.nextInt(); } public CyclicHash(int b) { this.b = b; this.fix0 = ((b - 1) * 9) & 31; } public int init(byte[] data, int off) { hash = 0; for (int i = 0; i < b; i++) hash = rotateLeft(hash, 9) ^ random[data[i + off] & 0xff]; return hash; } public int update(byte leave, byte enter) { int leaving = rotateLeft(random[leave & 0xff], fix0); int entering = random[enter & 0xff]; hash = rotateLeft(hash ^ leaving, 9) ^ entering; return hash; } }
A random table is used to create the hash values for each possible byte. To add a new value to the hash it is first rotated and then xor'd with the random hash for the new byte. To remove an old value one only has to rotate by the length of the block and xor it again. Rotation is supposed to be modulo 32 but the java implementation in Integer is not and doesn't match the javadocs - to make sure I do it first (actually it worked before i did this so i know it works but this simplifies the call). The shift amount of 9 i plucked out of my arse.
Choosing better `random' values has some effect but this works.
Approximate Hash Table
I don't just want a simple hash table I want a hash table which lists all the potential match candidates against a given hash value. In java this could be expressed as:
HashMap<Integer,List<Integer>> htable;
However this will (should, i imagine) have scalability and gc issues and boxed primitives are quite heavy weight to start with.
There are some good optimised java libraries for primitive hash tables but i don't think they support multiple entries off the same hash key and they tend to be gigantic libraries targeted at different problems. Anyway it's the sort of problem i find interesting (fun, actually) so I came up with my own.
I sat outside in the last of the rays of the sun with knuth and a beer and read up on open addressing hash tables. Typically i'm partial to the chained variety for their conceptual simplicity but from looking at some libraries like trove it seems open addressing has benefits in some cases and particularly for java. Armed with the knowledge and the setting sun I came up with a space efficient 'approximate' hash table which suits the problem at hand.
// Copyright (C) 2015 Michael Zucchi // License is GNU GPL v3 or later class MatchMap { int[] values; int mask; int logN; static final int EMPTY = Integer.MAX_VALUE; public MatchMap(int N) { int size; logN = 31 - Integer.numberOfLeadingZeros(N); size = 1 << (logN + 2); mask = size - 1; values = new int[size]; Arrays.fill(values, EMPTY); } public void add(int hash, int value) { int i = hash & mask; int c = ((hash >> logN) | 1) & mask; while (values[i] != EMPTY) i = (i + c) & mask; values[i] = value; } public void foreach(int hash, IntConsumer func) { int i = hash & mask; int c = ((hash >> logN) | 1) & mask; int v; while ((v = values[i]) != EMPTY) { func.accept(v); i = (i + c) & mask; } } }
It uses open addressing with double hashing and doesn't bother with store the hash key, only the value. Knowing the size is important - but also known in this application. Deletes are a bit messy but fortunately not required. The table is sized so that the load factor is well less than 0.5.
Adding a new entry just starts probing at the modulo-size location and searches for an empty slot, stepping by the second hash (higher bits in the function or'd with 1 to make it odd). Knuth suggests using an odd number for the step which makes a lot of sense; it will eventually visit every location. This is an effective technique against clustering that occurs if linear probing is performed (I also tried that).
Retrieval just does the same thing. Any values are considered potential matches even if they were stored by a completely different hash key. Given that each location needs to be compared directly with the data arrays it doesn't make mistakes very costly and it even seems to improve the delta output in some cases.
One might note here ... this in total ... is probably less code than just the extra scaffolding required to use a java collections hashtable+list for the same purpose and probably 4-10x more space efficient. open-vcdiff has ~1KLOC for this purpose although that's about half comments explaining it's complexity - i don't even care to understand it enough to know if any of that is actually justified.
These two functions are 80% of the smarts and about 30% of the code. The rest is just to wrap it in some 'efficient' byte-streaming encoding, settle on a magic number, and write a decoder. vcdiff encoding would be nice but the encoding is a little more work than I have time left this weekend.
A result
I created a delta from the text versions of GPL 2.0 to GPL 3.0 using this and compared to jvcdiff and badiff. As I said i don't think jvcdiff is the best implementation and it appears to just be a port of open-vcdiff, but it's what I have handy. For jvcdiff i'm forcing a single window but otherwise using the defaults.
badiff jvcdiff DEZ1 (6,1) (6,2) (6,4) (8,1) (8,8) (16,16) source size 18092 target size 35147 patch size 47833 28829 18064 19053 20421 19128 23159 28877 encode 0.174870 0.002000 0.001550 0.001200 0.000950 0.001250 0.000720 0.001000 decode 0.000700 0.000170 0.000150 0.000150 0.000150 0.000120 0.000090 0.000060 hardware is kaveri cpu, jvm is oracle java 8 64-bit.
FWIW gzip is about 16K for both concatenated together, but that isn't really the same thing.
The value in brackets is (hsize,hstep). A larger hstep will reduce the size of the hash table generated and thus affect memory use, possibly significantly. hsize affects the minimum length of a match as well as hashing conflicts at the low end.
Times are in seconds and approx/rounded from the last few of 200 runs - jvcdiff has much more complex code so it takes nearly this many before it is fully optimised by the jvm, before that it is approximately 0.01s for encode. Because it's fast it might need even more loops for the jvm to notice. Due to randomisation of the hash in these test runs the DEZ1 encoding can vary between runs.
Dunno what's up with badiff, maybe this data isn't suitable for it and/or it's tuned to larger datasets.
As I said I don't have any hard data on the memory footprint but for DEZ1 the (x,1) worst-case it uses a 64K element int array for the hash table, plus the memory for the input and output buffers and half a dozen small utility objects. Decoding needs no extra memory apart from input, patch, and output buffer; and the output size is encoded in the delta so it only needs one allocation of non-garbage.
After writing most of this I tried the (16,16) option and considering the patch size is close to jvcdiff perhaps that matches the parameters it uses. Actually I kinda knew about this but forgot about it. Even then it is surprising that the delta size is about the same as jvcdiff uses much more advanced encoding techniques from VCDIFF that should reduce the size.
Format
There's probably not much point going into a lot of detail in the file format but lets just say i chose the most compact and simplest format I could.
Integers are encoded using a big-endian stream of 7-bits data + 1 bit continue indicator, much the same as vcdiff and dozens of other binary encodings.
Because I only have two opcodes as a further optimisation I encode the opcode as a single bit along with the length. This allows lengths of up to 63 bytes (hmm, 64 if i add one ...) to be encoded together with the opcode in a single byte.
I just encode the copy source address as an integer every time which means they will tend to take 3-4 bytes in typical files. This is where a lot of the smarts in vcdiff comes from but it also requires more faffing about and runtime overheads.
Improvements
So a relatively cheap and possibly significant improvement is to include the the incremental output as part of the search space: i.e. the part of the output which will have been decoded up until that time. As long as the hash table is big enough that should be relatively simple in the encoder and isn't much more work in the decoder.
I don't see the need for run-length encoding as it rarely comes up in real text (and yeah, real men use tabs).
Since the code is so simple it is open to gains from micro-optimisations. Being simple does help the compiler a lot to start with though and hotspot compiles faster code with less lead-time than the more complex implementations.
The hash table is a critical part of the performance and could be tuned further, not that there is much code to tune.
Currently any match of 4+ is encoded as a copy even if an add might be smaller. I don't know if 4 is even the best value. A look-ahead of a few characters may also yield longer matches and more compact encodings. Other address compression techniques as used in vcdiff could be used.
Scalability is unknown. My application is 'post-sized' text so it should be fine there. Increasing the block size (hsize) tends to speed up the encoder as it reduces the search space, but it misses small matches so the delta size increases. A two-tiered approach might work: a longer hash for the whole file and a smaller hash for a local region.
Having said all that, 'simplicity' and 'good enough' also have their own considerable charm and appeal so I might just clean it up and do some better testing and tuning and leave it at that.
Closing Thoughts
Well ... I guess that was a fun afternoon, ... which seems to have gone well beyond midnight, oops - these long articles take hours to write. Given I went in blind i'm pretty surprised that the results are at least comparable to existing code with such modest effort.
I'm also surprised - or maybe not surprised if i let my ego talk - at how simple the algorithms actually are compared to the complexity of any implementation i've seen. And it's a little difficult to see this case as a matter of being a subject expert thus making it appear simpler than it really is. The Bentley/McIlroy paper mentions a working implementation in ~150LOC of C which was hard to believe looking at these libraries; but now i do.
And also ... back to work (later) today. It's been an indescribable 3 month break, but not as good as that sounds. I guess I lost 5 kilos in the last 6 weeks.