Getting the runs.
As is usually the case I kept poking last night and added a runlength detector to the internal matcher loop of the delta processor.
To start with, any sequence which begins with 3 common bytes are not added to the hash table. For my problem file this reduces the hashtable load significantly.
Then when scanning the target data any location which starts with 3 common bytes are not looked up in the hash table and just converted to a byte run instead.
I had to add runs to the byte encoding and to do this and i noticed that ADD and RUN are typically tiny so just made the command+length byte always 1 byte with 6 bits of length. But I might change that to 5 bits of length so that it can still have a continue bit since most are under 32 bytes as well and that will protect against the occasional long length.
So something like this:
00XXXXXX - copy + 6 bit length 10XXXXXX CXXXXXXX* - copy + extended length 011XXXXX - add + 5 bit length 111XXXXX CXXXXXXX* - add + extended length 010XXXXX - run + 5 bit count 110XXXXX CXXXXXXX* - run + extended count
Expansion space? Why? If need be I could let $00 be an extended command byte or something.
The delta sizes are slightly larger but the runtime is ~2 orders better on this particular problem file (jjmpeg.dll) with similar settings. And it's like ~1s vs 10ms rather than 10ms vs 100us.
(6,1) delta size rev orig new 59 155323 155323 155323 57 15509 15626 154093 55 8168 8227 154093 53 10425 10490 153839 47 9942 9958 153095 45 15879 16000 145425 40 16584 16701 146274 36 14561 14632 144658 22 10751 10832 144145 20 38 38 144145 jjmpeg.dll
These are the deltas for jjmpeg.dll from the trunk of jjmpeg. It's a reverse delta so the top line is the latest version with no processing and each subsequent is an editing list which turns the previous version into it.
One probably obvious thing I hadn't realised is that reverse deltas work particularly well on source code - or anything that is added to over time. Since most edits are editions, a reverse delta is mostly just deletes (actually, just no commands at all).
rev delta size 147 3977 3977 146 31 3801 145 26 3671 85 101 3177 53 28 3180 52 39 2591 45 49 2488 16 48 1645 13 15 1556 11 42 1300 6 25 1194 2 20 1051 AVNative.java
I'm not working on a revision control system for source code, it just provides easy test data.
So with that i'm pretty well done with the that* ... unless i can do something about the hash table.
With short matching patterns required to get the best results certain data gets repeated a lot which is problematic for the open addressed hashing. Most solutions I can think of just blow the memory out since as soon as you start adding links it at least doubles the memory.
After a bit of thought a couple of possibly practical ideas spring to mind ...
- Overflow and keep.
- Once a search exceeds a certain range drop subsequent records into an overflow record. The record in the table can be a negative index to the overflow record.
Clashes still cause 'difficulty' because the match scan will have to deal with multiple overflow records. Although they could include the hash key to help filter them out.
- Overflow all.
- This is similar but rather than leave the entries there all values for the same key are extracted and placed into the overflow record and the rest are marked deleted. The first available slot is set to point to the overflow record.
The problem here is that the hash keys are not stored in the table and they are required to be able to filter the entries since they could include other hash keys. But! They can be regenerated quite cheaply.
- Hybrid and keep.
- Have a second smaller chained (or otherwise) hash table and if any search exceeds some limit store values in that instead.
This is the simplest but would require two lookups each time.
- Hybrid all.
- Combine the last two. Move all the other values of that key to the new table too.
This would only require two lookups for values which are not present in the overflow table, which by definition occur less often. It's not really the lookups which are the problem though.
Just thoughts, haven't tried anything.
The journey never ends
So while thinking about the next problem: the scalability of the brute force search of all the substrings with the same keys ... I thought of an exact solution that combines the string search with the index.
It's pretty obvious so probably has someone's name Update: Well sort of, it's a longest common prefix table, or LCP table. The below is definitely a naive approach with O(N log N) running time, but it is space efficient and simple to understand.
First, the whole file is pre-processed into a common prefix tree.
- Sort a list of every substring in the source;
- Iterate over the sorted list and count the length of the common prefix;
- (optional) build an index of the first row of each possible byte.
Because it just uses pointers/indices it doesn't take up any memory for the data and it only needs one entry for each location.
This data structure is basically a directed graph where each offset in 'starts' is a branch and each value in 'commons' is a link to the branch below.
0 @ 15 c=1 ' Test.' 1 @ 16 c=1 ' Test.' 2 @ 7 c=1 ' a test. Test.' 3 @ 4 c=1 ' is a test. Test.' 4 @ 9 c=0 ' test. Test.' 5 @ 21 c=1 '.' 6 @ 14 c=0 '. Test.' 7 @ 17 c=1 'Test.' 8 @ 0 c=0 'This is a test. Test.' 9 @ 8 c=0 'a test. Test.' 10 @ 18 c=4 'est.' 11 @ 11 c=0 'est. Test.' 12 @ 1 c=0 'his is a test. Test.' 13 @ 5 c=3 'is a test. Test.' 14 @ 2 c=0 'is is a test. Test.' 15 @ 6 c=2 's a test. Test.' 16 @ 3 c=1 's is a test. Test.' 17 @ 19 c=3 'st.' 18 @ 12 c=0 'st. Test.' 19 @ 20 c=2 't.' 20 @ 13 c=1 't. Test.' 21 @ 10 c=0 'test. Test.'
This is the result for 'This is a test. Test.'. '@' is the position in the original string, and 'c=' is the number of characters of common prefix with the string below.
Using this table it is a simple walk to find the longest string from these which matches an arbitrary string.
And of course ... this is exactly the problem I was solving originally a few days ago.
Sorted Array Delta
So I started from scratch and wrote another delta builder based only on this technique. I only implemented the searches of the source buffer but it does do runs. I had to use a custom quick-sort as the java Arrays sort will not let me override the comparator for primitive arrays. I use an index for the first byte to jump to the right location to speed up the first level of the tree.
So how's it go?
Run-time wise it's something like 1.5-10x slower than the best hash based version I have depending on the data - the closest end is with jjmpeg.dll. The sorting stage can dominate the processing time, for jjmpeg.dll it's 5/6ths of the total processing time but for GPL2 it's only 1/2.
But it only needs half the memory of the best of the rest, so it's not all bad.
Rubbish
I continue to be surprised at how well java copes with lots of object allocation and garbage; despite using 10x the memory using a HashMap<Integer,ArrayList<<nteger>> is pretty much on par to the open addressed hash table even when it's working well and much faster compared to the failure cases for the dll file. It's tons faster than reference counting (and threads, java is dreams for threads) and I never found explicit allocation very fast to start with. Not quite as quick as alloca().
* - i'm never done with that.