deltas and diffs again
I had a quick look at cleaning up the delta code I wrote yesterday and got sidetracked into making a more optimised version - one where the hash table and other bits are just in-lined into a single class.
Firstly, its about 20% faster so far: as i suspected simple micro-optimisations really count here.
Secondly, it's somewhat better at producing compact deltas. Oops, I guess I had a bug somewhere in the code - the original still produces working output but it doesn't find as many matches as it should.
With the new encoder the (6,1) case creates ~16K of delta compared to ~18K previously. And now using a smaller block size of (4,1) does even better at ~14K5 where previously it only hindered.
I also realised just how simple it was to add the currently-decoded target data into the search space of the encoder and copy space of the decoder. Literally a couple of lines in the decoder and under 10 in the encoder.
This drops the test case of GPL2 to GPL3 text to under 13 000 bytes for DEZ1(4,1) and about 13K5 for DEZ1(6,1).
This also allows it to work as a functional if not particularly terrific compressor. An empty buffer to GPL3 "delta" is about 17K. "gzip -9" (== GZipOutputStream defaults) is 12k with comparable execution times (gzip a bit better and scales better). Not that it is very useful in general but concatenating 10x GPL3 texts and then performing the same operation produces just an ~18K delta vs ~100K+ gzip stream.
It's pretty memory hungry on the best setting though; windows could be used.
I'm also curious as to applying this to line-by-line diffs.
Update: Ahah, not so good with some binary data. It creates pretty good deltas but the performance drops a few orders of magnitude. Any data with lots of '00 00 00' type sequences blow out the hash collision rate and break the performance of the open hashing algorithm. A straight java collections implementation (with arraylists and boxed integer's and all) scales a lot better (and surprisingly runs about the same speed in general) although it uses way more memory.
It can be mitigated by limiting the search for an empty slot or other trade-offs at a pretty hefty cost to the generated delta size. Maybe detecting runs would help some.
lambdas and finals
I just tried using -verbose:gc to get some basic memory statistics and it looks like the previous implementation uses way more temporary memory than it should. It's not allocating any differently so i can only think it's something to do with the foreach/lambda code, and this probably explains the extra runtime.
To confirm i moved a local variable i had accessed as a final to the an object field: yep, it drops ~500K of garbage. Moving the same lambda code to a field variable as well saved another ~500K too, but even then it's still ~1M more than the other implementation.
"hmm".
versioning, deltas and diffs
I've been jumping around a bit with hacking lately and this weekend i've been playing with a versioned database and related stuff. I have a fairly long article written about it but it's not quite ready yet. I came up with a few variations and some of the features included cheap copies, cheap branches, proper renaming and so on. Apart from the fun-factor I do have an end-use in mind but i'm still some way off.
One goal was also efficient storage so I worked on a system which stored changes as reverse deltas. After some searching i came across the vcdiff format and some java code to work with it - but much of each implementation is pretty undecipherable code and to be honest just not that hot and I had a niggling interest to look a bit further into the problem.
So today I basically wrote my own from scratch and surprisingly it creates smaller diffs than jvcdiff given that it is literally an afternoons hacking and a naive approach at best (at least on a single example). Runtime is comparable although in either case they are very quick. I'm still trying to get hard statistics on memory use and gc load but i'm pretty sure it's better too.
I also looked at badiff which works with very large files, but that isn't something I need and it generated very fat diffs, used a ton of memory, and wasn't very fast (putting it lightly, see below).
Anyway the problem turned out to be quite interesting along the way.
Delta Algorithm?
It's more or less based on Bentley-McIlroy's paper `Data Compression Using Long Common Strings' although that paper is so brief I didn't get a great deal from it after a skim read. It's guts I think is more or less based on Rabin-Karp's paper `Efficient randomised pattern-matching algorithms' but that was too dense for weekend reading so after my eyes glazed over I did a quick scan of the wikipedia article and then thought about what problem it was actually solving.
In more human terms I think it basically devolves into this: compare every sub-string in the target array to every sub-string in the source array to identify the longest match. Use it to do shit. The details come in trying to make this practical and efficient and what `shit' you do with `it'.
So the following algorithm is kinda probably the same thing but only based on a loose understanding thereof.
# algorithm to generate a list of editing commands
# which transforms one array into another
hsize = length of sub-string
hstep = bytes to step in input
htable = hash table mapping hash value to location
# build hash table
for each location in source in steps of hstep
htable.addHash(source, location)
end
# find matches
for each location in target
hash = generateHash(target, location)
for each candidate in htable.lookup(hash)
determine longest actual match
of source @ candidate against target @ location
end
if there was a match and longest > threshold
if any data not matched
output an append command
endif
output copy command
location += match length
fi
end
add any left over
As one can see it's actually rather simple and mirrors the one-sentence English text description in a straightforward way. The hash table is only used as a helper to narrow the search space to possible candidates rather than having to check them all.
It requires the full source data but could trivially stream the target data. Restoring the target requires the source plus the list of editing commands.
hsize is typically 4-8 and determines the runtime and compression rate. A smaller number gives more potential candidates to match against (and more hashing clashes) but a larger hash will not match smaller strings.
hstep sets how often the source is sampled. Using a hstep of 1 gives the best result by far but results in a large hashtable and longer running time.
Rolling hashes are used so the hash length has no effect on that calculation but it does affect the runtime in other ways.
There are two interesting problems here - rolling hashes and the hash table itself.
Rolling hash
I tried to implement a Rabin-Karp hash but lost patience trying to get my head around the modulo maths. I think i just had some indexing bug rather than mathematical ones but in the end I felt more comfortable with a cyclic hash based on rotating bit arithmetic. The mathematics are still messy but even the 6510 cpu has a 'ror' instruction and so the actual code is trivial for a digital computer.
// Copyright (C) 2015 Michael Zucchi
// License is GNU GPL v3 or later
public class CyclicHash {
int b;
int hash;
int fix0;
static final int[] random = new int[256];
static {
Random r = new Random();
for (int i = 0; i < random.length; i++)
random[i] = r.nextInt();
}
public CyclicHash(int b) {
this.b = b;
this.fix0 = ((b - 1) * 9) & 31;
}
public int init(byte[] data, int off) {
hash = 0;
for (int i = 0; i < b; i++)
hash = rotateLeft(hash, 9) ^ random[data[i + off] & 0xff];
return hash;
}
public int update(byte leave, byte enter) {
int leaving = rotateLeft(random[leave & 0xff], fix0);
int entering = random[enter & 0xff];
hash = rotateLeft(hash ^ leaving, 9) ^ entering;
return hash;
}
}
A random table is used to create the hash values for each possible byte. To add a new value to the hash it is first rotated and then xor'd with the random hash for the new byte. To remove an old value one only has to rotate by the length of the block and xor it again. Rotation is supposed to be modulo 32 but the java implementation in Integer is not and doesn't match the javadocs - to make sure I do it first (actually it worked before i did this so i know it works but this simplifies the call). The shift amount of 9 i plucked out of my arse.
Choosing better `random' values has some effect but this works.
Approximate Hash Table
I don't just want a simple hash table I want a hash table which lists all the potential match candidates against a given hash value. In java this could be expressed as:
HashMap<Integer,List<Integer>> htable;
However this will (should, i imagine) have scalability and gc issues and boxed primitives are quite heavy weight to start with.
There are some good optimised java libraries for primitive hash tables but i don't think they support multiple entries off the same hash key and they tend to be gigantic libraries targeted at different problems. Anyway it's the sort of problem i find interesting (fun, actually) so I came up with my own.
I sat outside in the last of the rays of the sun with knuth and a beer and read up on open addressing hash tables. Typically i'm partial to the chained variety for their conceptual simplicity but from looking at some libraries like trove it seems open addressing has benefits in some cases and particularly for java. Armed with the knowledge and the setting sun I came up with a space efficient 'approximate' hash table which suits the problem at hand.
// Copyright (C) 2015 Michael Zucchi
// License is GNU GPL v3 or later
class MatchMap {
int[] values;
int mask;
int logN;
static final int EMPTY = Integer.MAX_VALUE;
public MatchMap(int N) {
int size;
logN = 31 - Integer.numberOfLeadingZeros(N);
size = 1 << (logN + 2);
mask = size - 1;
values = new int[size];
Arrays.fill(values, EMPTY);
}
public void add(int hash, int value) {
int i = hash & mask;
int c = ((hash >> logN) | 1) & mask;
while (values[i] != EMPTY)
i = (i + c) & mask;
values[i] = value;
}
public void foreach(int hash, IntConsumer func) {
int i = hash & mask;
int c = ((hash >> logN) | 1) & mask;
int v;
while ((v = values[i]) != EMPTY) {
func.accept(v);
i = (i + c) & mask;
}
}
}
It uses open addressing with double hashing and doesn't bother with store the hash key, only the value. Knowing the size is important - but also known in this application. Deletes are a bit messy but fortunately not required. The table is sized so that the load factor is well less than 0.5.
Adding a new entry just starts probing at the modulo-size location and searches for an empty slot, stepping by the second hash (higher bits in the function or'd with 1 to make it odd). Knuth suggests using an odd number for the step which makes a lot of sense; it will eventually visit every location. This is an effective technique against clustering that occurs if linear probing is performed (I also tried that).
Retrieval just does the same thing. Any values are considered potential matches even if they were stored by a completely different hash key. Given that each location needs to be compared directly with the data arrays it doesn't make mistakes very costly and it even seems to improve the delta output in some cases.
One might note here ... this in total ... is probably less code than just the extra scaffolding required to use a java collections hashtable+list for the same purpose and probably 4-10x more space efficient. open-vcdiff has ~1KLOC for this purpose although that's about half comments explaining it's complexity - i don't even care to understand it enough to know if any of that is actually justified.
These two functions are 80% of the smarts and about 30% of the code. The rest is just to wrap it in some 'efficient' byte-streaming encoding, settle on a magic number, and write a decoder. vcdiff encoding would be nice but the encoding is a little more work than I have time left this weekend.
A result
I created a delta from the text versions of GPL 2.0 to GPL 3.0 using this and compared to jvcdiff and badiff. As I said i don't think jvcdiff is the best implementation and it appears to just be a port of open-vcdiff, but it's what I have handy. For jvcdiff i'm forcing a single window but otherwise using the defaults.
badiff jvcdiff DEZ1
(6,1) (6,2) (6,4) (8,1) (8,8) (16,16)
source size 18092
target size 35147
patch size 47833 28829 18064 19053 20421 19128 23159 28877
encode 0.174870 0.002000 0.001550 0.001200 0.000950 0.001250 0.000720 0.001000
decode 0.000700 0.000170 0.000150 0.000150 0.000150 0.000120 0.000090 0.000060
hardware is kaveri cpu, jvm is oracle java 8 64-bit.
FWIW gzip is about 16K for both concatenated together, but that isn't really the same thing.
The value in brackets is (hsize,hstep). A larger hstep will reduce the size of the hash table generated and thus affect memory use, possibly significantly. hsize affects the minimum length of a match as well as hashing conflicts at the low end.
Times are in seconds and approx/rounded from the last few of 200 runs - jvcdiff has much more complex code so it takes nearly this many before it is fully optimised by the jvm, before that it is approximately 0.01s for encode. Because it's fast it might need even more loops for the jvm to notice. Due to randomisation of the hash in these test runs the DEZ1 encoding can vary between runs.
Dunno what's up with badiff, maybe this data isn't suitable for it and/or it's tuned to larger datasets.
As I said I don't have any hard data on the memory footprint but for DEZ1 the (x,1) worst-case it uses a 64K element int array for the hash table, plus the memory for the input and output buffers and half a dozen small utility objects. Decoding needs no extra memory apart from input, patch, and output buffer; and the output size is encoded in the delta so it only needs one allocation of non-garbage.
After writing most of this I tried the (16,16) option and considering the patch size is close to jvcdiff perhaps that matches the parameters it uses. Actually I kinda knew about this but forgot about it. Even then it is surprising that the delta size is about the same as jvcdiff uses much more advanced encoding techniques from VCDIFF that should reduce the size.
Format
There's probably not much point going into a lot of detail in the file format but lets just say i chose the most compact and simplest format I could.
Integers are encoded using a big-endian stream of 7-bits data + 1 bit continue indicator, much the same as vcdiff and dozens of other binary encodings.
Because I only have two opcodes as a further optimisation I encode the opcode as a single bit along with the length. This allows lengths of up to 63 bytes (hmm, 64 if i add one ...) to be encoded together with the opcode in a single byte.
I just encode the copy source address as an integer every time which means they will tend to take 3-4 bytes in typical files. This is where a lot of the smarts in vcdiff comes from but it also requires more faffing about and runtime overheads.
Improvements
So a relatively cheap and possibly significant improvement is to include the the incremental output as part of the search space: i.e. the part of the output which will have been decoded up until that time. As long as the hash table is big enough that should be relatively simple in the encoder and isn't much more work in the decoder.
I don't see the need for run-length encoding as it rarely comes up in real text (and yeah, real men use tabs).
Since the code is so simple it is open to gains from micro-optimisations. Being simple does help the compiler a lot to start with though and hotspot compiles faster code with less lead-time than the more complex implementations.
The hash table is a critical part of the performance and could be tuned further, not that there is much code to tune.
Currently any match of 4+ is encoded as a copy even if an add might be smaller. I don't know if 4 is even the best value. A look-ahead of a few characters may also yield longer matches and more compact encodings. Other address compression techniques as used in vcdiff could be used.
Scalability is unknown. My application is 'post-sized' text so it should be fine there. Increasing the block size (hsize) tends to speed up the encoder as it reduces the search space, but it misses small matches so the delta size increases. A two-tiered approach might work: a longer hash for the whole file and a smaller hash for a local region.
Having said all that, 'simplicity' and 'good enough' also have their own considerable charm and appeal so I might just clean it up and do some better testing and tuning and leave it at that.
Closing Thoughts
Well ... I guess that was a fun afternoon, ... which seems to have gone well beyond midnight, oops - these long articles take hours to write. Given I went in blind i'm pretty surprised that the results are at least comparable to existing code with such modest effort.
I'm also surprised - or maybe not surprised if i let my ego talk - at how simple the algorithms actually are compared to the complexity of any implementation i've seen. And it's a little difficult to see this case as a matter of being a subject expert thus making it appear simpler than it really is. The Bentley/McIlroy paper mentions a working implementation in ~150LOC of C which was hard to believe looking at these libraries; but now i do.
And also ... back to work (later) today. It's been an indescribable 3 month break, but not as good as that sounds. I guess I lost 5 kilos in the last 6 weeks.
from link to dlsym call
I've been spending an inordinate amount of time the last few days playing with make and doing some work on zcl. Been staying up way too late and then doing it all again the next day. Been a bit of fun poking at little problems though.
The java makefile system i've been working on has got a little fiddly due to some features i'm trying to implement; but so far all problems have been solvable in a fairly straightforward way and it's mostly now taking time because i'm writing some fairly long article(s) about the work. Most of the complication is with generated source and resource files and dealing with the command line to jar which is like but not quite the same as tar.
So I took a break to play with zcl again. Because the sdk version of libOpenCL may not match the system version i've added an option to compile with dynamic linking - i think the linker can also do this automagically but it's something i'd need to learn about first. But just changing it over turned out to be pretty easy an infact most of the work is hidden in a single auto-generated file together with a few small bits of checking to throw exceptions when functions are not available.
I wrote a horrible bit of perl that parses the very strict format of cl.h and creates a list of static function pointers and an init function to set them up, as well as C macros which automagically change the source to go via the function pointers rather than via dynamic linking.
So for example it creates the following bits for a given function:
...
static cl_int (*pclBuildProgram)(cl_program, cl_uint, const cl_device_id *, const char *, void (CL_CALLBACK *)(cl_program, void *), void *);
...
static void init_clfunctions(void) {
...
pclBuildProgram = cl_symbol("clBuildProgram");
...
}
...
#define clBuildProgram (*pclBuildProgram)
...
And I just include this at the head of the file and call init during the JNI OnLoad. Saved most of the work since none of the rest of the code needs to change unless I want to add pointer null checks. This opens the possibility of a "sdk-free" build of zcl since without the need to link it can get everything it needs from the header files which I can bundle.
I also looked into a mechanism for implementing CL extensions - I think i've worked out a decent solution (~interfaces on the platform) but haven't (fully) implemented any yet. I'm having second thoughts on whether the array interfaces for asynchronous calls for buffers are really adding enough to counter the work required to implement them and the bloat of the api size, but for now they will remain.
It's not really so "simple binding for java" anymore at ~10KLOC, but then again most of the parts are simple on their own.
After I remembered how to get going in OpenCL I did a little bit of testing on some of the new interfaces in opencl 2.0 - so far they seem to function; although for some reason my catalyst driver is only reporting the CPU device so it's not the best test - i think i need to reboot the machine because of memory fragmentation. Netbeans has turned into (more of) a pig of late and a few hundred tabs in firefox tends to swallow (all) memory. I guess I made a mistake on the memory size in this machine.
ahh shit, google code is being scrapped
Joy. I guess it's not making google enough money.
I'll make a copy of all my projects and then delete them. Probably sooner rather than later (probably immediately because i'd rather get it out of the way, i have a script going now). I'm keeping all the commit history for now although i never find it particularly useful. Links in old posts may break.
Some of them may or may not appear later somewhere else, or they may just sit on my hdd until i forget about them and delete them by accident or otherwise.
Wherever they end end up though it wont be in one of the other commercial services because i'm not interested in doing this again.
I need to do the same to this blog, ... but that can wait.
Update: So ... it seems someone did notice. Don't worry i've got a full backup of everything. In part I was pissed off with google and in part I wanted to make it immediately obvious it was going away to see if anyone actually noticed. Who could know from the years of feedback i've never received.
If you are interested in particular projects then add a comment about them anywhere on this blog. All comments are moderated so they wont appear till i let them through but there is no need to post more than once (i'll delete any nonsense or spam unless it makes you look like a bit of a wanker). I will see about enabling anonymous comments for the moment if they are not already on. I can then decide what to do depending on the interest.
I will not be using github or sourceforge nor the like.
In a follow-up post i'll see where i'm going with them as well as pondering the future location and shape of the blog itself. I've already written some stuff to take the blogger atom stuff, strip out the posts, download images, and fix the urls.
(in resp to Peter below) I knew the writing was on the wall when google code removed binary hosting: it's kinda useless for it's purpose without it. This is why all my new projects subsequent to that date are hosted differently.
glNext/Vulkan/SPIR/stuff
Decided it was time i had a look into 'the state of things' wrt these new apis. I've not been keeping up with OpenCL of late either so I guess I should refresh on that (oh that's right, c++ kernels, well fuck that shit).
Vulkan definitely looks interesting. I mean vulkan looks like a huge amount of boilerplate from what little is available so far (the GDC 2015 slides from khronos) but it all makes plenty of sense. Pity it's still work in progress so all I can do is read about snippets. One hopes JavaFX will use it as a backend in the future though, and maybe i'll do a ZVK to compliment ZCL myself (it's good way to learn about an api).
There's a bizarre thread on the khronos forums where one poster wants to scare away beginner programmers from Vulkan least they get confused by such wild concepts as memory allocation and asynchronous synchronisation. That's ... well that's just wrong. All that information hiding in OpenGL is what makes it such a pain to understand beyond trivial cases, particularly since fixed pipeline went away. If people are using C (as they must to use it) then malloc is hardly an insurmountable barrier. And asynchronous sync is the only technique which makes multi-threading code manageable and really should be the main technique for MT code taught in any software engineering course. To be honest I think it's all that learning about critical sections and mutexes which do all the damage and make multi-threading seem so complicated and they should be something unlearnt as quickly as possible (as a general MT sync technique that is, obviously they are needed if only to implement more useful primitives like queues and barriers).
Anyway we know how it will play out for beginners and learning professionals alike: we'll all just use google to find the first example of the boilerplate that compiles and then keep using that for years. If I was khronos I'd make sure it's their up to date examples and documentation that using google finds and not some shitty agglomeration site full of out-dated shit.
SPIR-V
SPIR looks a little "odd" to me at first glance, but i'm not a compiler writer so it probably should. To my mind i would describe it (a bit) like assembly language that has been pre-processed and annotated ready for an optimising compiler and then it's internal object representation flattened into a flat global space referenced by index. Rather than registers or addresses results are referenced by the instruction that produces them. It should make a frontend compiler fairly simple as the compiler doesn't need to deal with registers or stacks, and it makes a backend a bit simpler as some of the work is already done. And apparently it's not a file format, although it sure looks like one and i can't imagine it would make sense to anyone to create a different one to store the same information.
I wonder what this mean for HSA (or infact, any other compiler system using it's own virtual processor)? I've not looked into it so maybe something is already announced but if not i suspect SPIR is in it's future too. BRIG takes quite a different approach in that a specific register-based virtual machine is targeted and it's up to the frontend compiler to perform register optimisation and the backend isn't (supposed to be) much more than a machine code translator (which is still strictly speaking, a compiler). I just can't see hardware vendors keeping both maintained.
Oh my gods, it's full of compilers ...
One realisation I had when playing with the soft-GPU code on the parallella was that modern GPU drivers are mostly just compilers with a bit of memory and queue management bolted on the side. And similarly that modern GPUs are really just moderately wide vector CPUs with a little bit logic for fragment generation and texture lookup. of HSA itself is more about compiler writers than the end-user too.
It's just a bit of a personal bummer because although compilers are a bit interesting they're not enough interesting to me to spend the years required to get up to speed.
dun dit dun
Following on from the last post I did some i/o sutff - went with an s-expression based thing since it's a trivial parser and easy to use - then I rewrote the whole model and some of the ui again. Added zoom for instance. Then I played with it for a while but couldn't manage to create the sort of sounds I was trying to create. I couldn't find modules that did what I wanted to do and trying to construct them from primitives got messy and didn't work either.
I was starting to lose interest.
So then I wrote a whole new synthesiser from scratch followed by enough of a midi sequencer to get a midi of Lazy Jones working - and I think it sounded better than the bundled Java midi synthesiser (more hard/harsh just like the original). It uses a thread to render a frame of samples at a time based on a master clock and uses a simple 'runLater()' style mechanism to synchronise parameter updates on the various components so threading issues are just hidden.
I tried a few other files and none worked either at all or very well but after another day of hacking I've got fully polyphonic voices and adsr envelopes and so on and so forth. Most midi files apart from Lazy Jones still sound kinda shit because I don't have any real instruments or the midi processing units (and well, it's midi), but they do play with the right timing and tone (I think, hard to know since midi files usually sound a bit out to me). For efficiency I created a multi channel oscillator and envelope bank which works pretty well but limits the instruments to simple waveform + ADSR envelope.
Most of it was pretty straightforward considering i just did it off the top of my head but I found I had to do some filtering on the amplitude changes to avoid pops when using a sooth waveform. One problem I have is that square and sawtooth are just so much louder than triangle; some filtering seems to help but I guess that's just the way it is and it's up to the musician to compensate.
Probably wont get much farther than that but it was a fun few days. It can be made an interesting real-time problem and parts are paralellisable so if I was keen i'd probably try some epiphany code but i don't think i'm that keen and it's fairly light cpu wise anyway.
I guess I forgot about the space invaders sfx. If the latency is acceptable i might try just using my synthesiser as a real-time sound mechanism, maybe maybe.