Super Cereal!

I got way too caught up with writing a new serialiser over the last couple of days. Actually I finished off another one I had so I ended up with two.

There are two cases i'm interested in. One is tight coupling where simplicity and performance outweights extensibility; basically for IPC. The other is where extensibility and size are the main considerations; for object serialisation / data storage.

So I have an XDR-like implementation for the former. The layout of items is the same as XDR (sans mistakes) but it uses native ordering of elements, so i dubbed it XDRN for xdr-native.

For the latter i have -yes- yet another tagged format. Each field is tagged and each object is also a tagged container. The header is at least 2 bytes - a control byte and a tag byte. I can't be bothered typing out all the details - here is whatI have in the source code at the moment.

      This is a streamable self-describing byte-oriented binary format.
      It is a general purpose format and supports a super-set of the
      ez_blob descriptor.  It supports primitive and struct types and
      sequences thereof and there is room for extension.

      Each item beings with a descriptor byte, then followed by a tag id,
      a possible count, and the payload.

      xxxxttcc control byte

      xxxx type code

      0 uint8       unsigned int, value zero-extended
      1 uint16
      2 uint32
      3 uint64
      - reserved
      5 float16
      6 float32
      7 float64
      - reserved
      f struct

      note that for int/float types, (code&3) == log2 of element size in bytes

      tt log2 of tag size in bytes

      0 1 byte
      1 2 byte
      2 4 byte
      3 reserved, no tag?

      cc log2 of count size in bytes, used to indicate sequence length or non-sequence.

      0 1 byte
      1 2 byte
      2 4 byte
      3 none, single item follows

      ff is struct-end code

      A header is a control byte followed by an optional 1/2/4 byte-length tag,
      followed by an optional 1/2/4 byte-length count.

      A structure payload is a list of tagged fields until a struct-end
      code.  A structure sequence is a list of count struct-encoded blocks.

      Integers can be stored in the smallest number of bytes, i.e. with
      all leading $00 bytes removed.

So basically each field has a type, a tag, and a count. Scalar values are with a special count code so don't require a count value. It also differentiates between scalars and single-item sequences. Sequences all have a count and no end sentinal.

It's versatile enough to hold most likely structures but isn't universal. String encoding is application layer. No 128+bit primitives (but there is room to add them). No map type, but there is room to add it (it could just application layer). Probably the only significant one is a 32-bit limit on sequence (array) lengths (for some level of significant!). There are only 96+1 valid codes defined now so there is room in a single control byte for some but not all of these but it's not likely to be as tidy.

One example: tt+cc only defines 12 codes, one could swap tt,cc when tt=11 and thus use all codes and support 1/2/4/8 byte counts with 1/2/4 byte tags.

    ttcc

    00cc   tag size 1, count size 1/2/4/8
    01cc   tag size 2, count size 1/2/4/8
    10cc   tag size 4, count size 1/2/4/8
    11tt   count size 0, tag size 1/2/4
    1111   spare (primtive) / sentinal (struct)

Ok, maybe that would work, and it's not really any more complex in the code. It could use a lookup table but shifts would probably be faster. And this still leaves room for 8 more data types.

I went through a few similar iterations to get to this point. It has a couple of noteworthy features.

write streamable

It doesn't need to calculate information it doesn't know in advance. For example the size of an encoded object. This was a mess in my initial attempts and sometimes required multiple recursive passes.

self describing / read streamable

To be robust to data format changes it needs to be able to skip over data it doesn't understand. The tag defines the field so can be used to identify known ones. The data type and length fields combine together to define the number of bytes to skip for unknown fields. An unitendified sequence of structs must be skipped one at a time, but they provide enough information to do so.

compact

Well, relatively compact for the features it provides.

Tags and integers only use the significant bytes. The minimum overhead for scalar values is 2 bytes per field for control+8 bit tag, which will cover almost everything. The minimum overhead for sequences if 3 bytes (control, tag, count), and for structures is also 3 bytes (control, tag, sentinal).

Fields all have default values and such values are simply not encoded into the byte stream.

I dunno I feel it's a bit over-engineered, but I couldn't see a way to simplify it as I really need that tag. It takes about 2x the amount of code to implement vs the xdrn implementation although a lot of that is mapping to the ez_blob descriptor tables. As it is a self-describing format it may be useful to have a map or stream based api too, and an implementation of either would be straightforward.

Internally both use a common robust i/o mechanism which is simple and reliable. This helps protect against common coding errors like buffer under/overruns. I may expose this as an api in itself.

I'm pretty useless at writing tests (can't be good at everything!) but I have tried to write a more comprehensive set of tests here. Particularly if i'm dumping information into a database I don't want it breaking.

I could've used an existing design, but well, where's the fun in that?

About Me

Tags

Super Cereal!