dez - Binary Delta Encoder Library

This is a basic binary delta generator for Java. It operates only on in-memory arrays.

It is based on the algorithm from Bentley & McIllroy's paper ``Data Compression Using Long Common Strings'' with some simplifications that take advantage of modern memory limits and processing power.

It only includes an encoder and decoder for a proprietary delta format 'DEZ1' (documented in the source) but other formats (e.g. vcdiff or gdiff) are possible by implementing a new decoder and matching ByteDeltaEncoder without needing to touch the core classes.

One aim was also to make a simple bit of well-engineered code as a goal in itself. There is a clean separation between the string matcher and encoding format and the string matcher itself is almost fully self-contained in 300 lines of commented and well-spaced Java.

Details of Note

It uses a rolling 'polycyclic' hash implemented via rotate instructions. It is very easy to understand, fast, and should suffice.

Because every candidate that maps to the same hash code is kept conceptually it uses an instance of HashMap<Integer,List<Integer>>. But using java collections for this generates a lot of garbage and a large fixed overhead due to containers and boxing. Instead a custom hash table has been implemented as in-line code to gain both memory overhead reduction and opportunity for micro optimisations. The insertion/append function is under 15 lines of simple code and iteration is a simple array scan. Unecessary features like re-hash or delete are not implemented.

Look at the source for more details.

Benchmarks

Because it can check each location for match candidates and includes the target decoded-so-far it can produce reasonably compact deltas and can be used as a compressor if the source is empty.

Runtime performance is good-to-excellent for typical text files but as it does not window the candidate search space it slows down with large files (>>1MB). But it can be tuned to produce a reasonable running time over a wide range and windowing could be added to improve scalability.

As a point of comparison to other java libraries, the following are the results for creating a delta from a few different files. The bible test checks using it for straight compression.

      badiff-1.2  jvcdiff-2.0 dez-1.1     dez-1.1
      (udiff)                 (6,4,1)     (16,4,16)

      GPL 2.0 (18 092) to GPL 3.0 (35 147)

      encode time     0.17        0.0012      0.0023      0.00067 
      decode time     0.00059     0.00012     0.00013     0.000057
      patch size        32 940      28 829      13 591      27 267
      peak heap 1      105 767K      3 939K      1 972K      1 419K
      peak heap 100    115 500K      2 901K      1 962K      1 410K

      jjmpeg.dll r36 (144 658) to jjmpeg.dll r35 (144 145)

      encode time     1.1         0.0033      0.0072      0.0017 
      decode time     0.0026      0.00029     0.00017     0.00015
      patch size        26 885      15 622      10 819      18 935
      peak heap 1      105 339K      7 326K      4 434K      1 666K
      peak heap 100    139 841K      5 282K      5 527K      1 649K

      empty string (0) to KJV bible text (5 218 805)
      gzip --fast
      encode time     1.9         0.46       17.          1.1         0.098
      decode time     0.0094      0.017       0.023       0.016       0.048
      patch size     5 218 814   4 866 337   1 731 265   3 631 972   1 788 558
      peak heap 1      311 631K    105 186K     51 746K     23 964K
      peak heap 2      497 405K    105 479K     56 951K     23 958K

Times in seconds, best of 100 runs. The bible is best of 10 runs.
CPU is kaveri, Java is Java SE from Oracle jdk 8.0
Peak heap N is the maximum reported by -verbose:gc for run N and bracketing the runs with System.gc(). I couldn't get a reliable result for the number of gc iterations or the time spent in gc but dez is best.
The dez parameters (b,smallest,step): b is the block size of the hash code, smallest is the smallest substring considered for copying, and step is the step size in the candidate data. The two settings shown are approximately 'best' vs 'fastest' but there is a wide tunability to different types and sizes of files.

badiff

Maybe badiff doesn't like this data or isn't very good for in-memory arrays but it seems to have little to recomment it. At best it takes a couple of orders longer and needs a couple of orders more memory, only to create bigger deltas.

jvcdiff

Encodes fast but produces middling delta compression. Both are due to it's use of a 16-byte non-overlapping block size as this leads to less to search but fewer potential matches.

dez1

At it's best setting encoding time can scale poorly but it produces much smaller results and uses equal or less memory to do it. At the given fast setting it can still produce better results but in half the time with even lower memory requirements.

Not shown but the code simplicity also aids hotspot in optimisation so peak performance comes after fewer iterations and takes less effort.

Given the simplicity of the delta format these results are encouraging. The memory usage scales better than I expected and easily justifies the rather small cost of developing custom data structures.

Using the vcdiff format might reduce the size of the deltas further but an implementation is not planned at this time.

License

This program is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

Downloads

Sourcode here. You may also browse the repository.

dez-2.0.tar.gz source code.
26/4/2019: It's been 4 years, I think it's ready enough.
- Data format now frozen;
- Added a CRC32 checksum to the patch format;
- The single-opcode split point is now an encoder parameter and patch value;
- Switched to modular Java 11 and OpenJDK 11;
- New common java.make;
- New experimental junit.make for junit(4) tests.
dez-1.2.tar.gz source code.
28/3/2015: Made a couple of minor changes:
- The matcher can now copy overlapping strings from the target as occurs in the source paper. But this doesn't really come up much because runs are encoded separately and repeating adjacent patterns of 3 or more elements are not very common.
- Rather than a separate length array the length is stored as the first element of the chain. This reduces memory usage very slightly. It doesn't seem to affect running time particularly either way.
dez-1.1.tar.gz source code.
27/3/2015: Fixed a silly typo.
dez-1.0.tar.gz source code.
27/3/2015: Initial release.

Contact

notzed on various mail servers, primarily gmail.com.