Super Cereal!
I got way too caught up with writing a new serialiser over the last couple of days. Actually I finished off another one I had so I ended up with two.
There are two cases I'm interested in. One is tight coupling where simplicity and performance outweigh extensibility; basically for IPC. The other is where extensibility and size are the main considerations; for object serialisation / data storage.
So I have an XDR-like implementation for the former. The layout of items is the same as XDR (sans mistakes) but it uses native ordering of elements, so I dubbed it XDRN for xdr-native.
For the latter I have -yes- yet another tagged format. Each field is tagged and each object is also a tagged container. The header is at least 2 bytes - a control byte and a tag byte. I can't be bothered typing out all the details - here is what I have in the source code at the moment.
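To make the "XDR layout, native ordering" idea concrete, here is a rough Python sketch. It is not the actual XDRN code: the function names are made up for illustration, and it only assumes the standard XDR rules (4-byte units, variable-length data stored as a 4-byte length followed by the bytes zero-padded to a multiple of 4). Whatever "sans mistakes" changes is not modelled here.

```python
import struct

def xdrn_pack_uint32(value: int) -> bytes:
    # "=" selects the machine's native byte order with standard sizes.
    # Plain XDR would use ">" (big-endian) here instead.
    return struct.pack("=I", value)

def xdrn_pack_opaque(data: bytes) -> bytes:
    # XDR-style variable-length data: 4-byte length, then the bytes,
    # zero-padded so the total stays a multiple of 4.
    pad = -len(data) % 4
    return struct.pack("=I", len(data)) + data + b"\x00" * pad

def xdrn_unpack_opaque(buf: bytes, offset: int = 0):
    # Returns (data, offset of the next item).
    (length,) = struct.unpack_from("=I", buf, offset)
    start = offset + 4
    end = start + length
    if end > len(buf):
        raise ValueError("truncated opaque data")
    return buf[start:end], end + (-length % 4)
```

Since both ends of an IPC link run on the same machine, skipping the byte swap costs nothing in portability and keeps the code trivial.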
This is a streamable self-describing byte-oriented binary format.

It is a general purpose format and supports a super-set of the ez_blob descriptor. It supports primitive and struct types and sequences thereof and there is room for extension.

Each item begins with a descriptor byte, then followed by a tag id, a possible count, and the payload.

  xxxxttcc   control byte

  xxxx   type code
     0     uint8      unsigned int, value zero-extended
     1     uint16
     2     uint32
     3     uint64
     -     reserved
     5     float16
     6     float32
     7     float64
     -     reserved
     f     struct

     note that for int/float types, (code&3) == log2 of element size in bytes

  tt     log2 of tag size in bytes
     0     1 byte
     1     2 byte
     2     4 byte
     3     reserved, no tag?

  cc     log2 of count size in bytes, used to indicate sequence length or non-sequence.
     0     1 byte
     1     2 byte
     2     4 byte
     3     none, single item follows

  ff     is struct-end code

A header is a control byte followed by an optional 1/2/4 byte-length tag, followed by an optional 1/2/4 byte-length count.

A structure payload is a list of tagged fields until a struct-end code.

A structure sequence is a list of count struct-encoded blocks.

Integers can be stored in the smallest number of bytes, i.e. with all leading $00 bytes removed.
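As a sanity check on the description above, here is a small Python sketch of header parsing. The names (read_uint, read_header, TYPE_NAMES) are hypothetical, and the byte order of the tag and count is an assumption (little-endian here) because the notes above don't specify one.

```python
STRUCT_END = 0xFF  # "ff is struct-end code"

# Type codes listed in the description; anything else is treated as reserved.
TYPE_NAMES = {
    0x0: "uint8", 0x1: "uint16", 0x2: "uint32", 0x3: "uint64",
    0x5: "float16", 0x6: "float32", 0x7: "float64", 0xF: "struct",
}

def read_uint(buf, pos, size):
    """Read a size-byte unsigned int; little-endian is an assumption."""
    end = pos + size
    if end > len(buf):
        raise ValueError("truncated input")
    return int.from_bytes(buf[pos:end], "little"), end

def read_header(buf, pos=0):
    """Return (type_code, tag, count, next_pos).

    count is None for a scalar (cc == 3); type_code is None for struct-end.
    """
    control, pos = read_uint(buf, pos, 1)
    if control == STRUCT_END:
        return None, None, None, pos
    type_code = control >> 4       # xxxx
    tt = (control >> 2) & 3        # log2 of tag size
    cc = control & 3               # log2 of count size, 3 = no count
    if type_code not in TYPE_NAMES:
        raise ValueError("reserved type code")
    if tt == 3:
        raise ValueError("reserved tag size")
    tag, pos = read_uint(buf, pos, 1 << tt)
    count = None
    if cc != 3:
        count, pos = read_uint(buf, pos, 1 << cc)
    return type_code, tag, count, pos
```

For example, the two bytes 23 07 parse as a uint32 scalar with tag 7, and the payload starts at offset 2.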
So basically each field has a type, a tag, and a count. Scalar values are marked with a special count code so they don't require a count value. This also differentiates between scalars and single-item sequences. Sequences all have a count and no end sentinel.
It's versatile enough to hold most likely structures but isn't universal. String encoding is application layer. No 128+bit primitives (but there is room to add them). No map type, but there is room to add it (it could just be application layer). Probably the only significant limitation is a 32-bit limit on sequence (array) lengths (for some level of significant!). There are only 96+1 valid codes defined now, so there is room in a single control byte for some but not all of these, but it's not likely to be as tidy.
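A quick Python sketch of the writing side shows the scalar vs single-item sequence distinction. The function names are made up for illustration, little-endian payloads are an assumption, and only unsigned integers are handled.

```python
def size_code(value, max_code=2):
    """Smallest log2 byte width (0 = 1 byte, 1 = 2 bytes, ...) holding value."""
    if value < 0:
        raise ValueError("unsigned values only")
    for code in range(max_code + 1):
        if value < 1 << (8 << code):
            return code
    raise ValueError("value too large")

def encode_header(type_code, tag, count=None):
    tt = size_code(tag)                             # 1, 2 or 4 byte tag
    cc = 3 if count is None else size_code(count)   # 3 = scalar, no count
    control = (type_code << 4) | (tt << 2) | cc
    out = bytes([control]) + tag.to_bytes(1 << tt, "little")
    if count is not None:
        out += count.to_bytes(1 << cc, "little")
    return out

def encode_uint_scalar(tag, value):
    # uint8..uint64 are type codes 0..3, so the width code is the type code.
    code = size_code(value, max_code=3)
    return encode_header(code, tag) + value.to_bytes(1 << code, "little")

def encode_uint_sequence(tag, values):
    # Every element uses the width needed by the largest one.
    code = size_code(max(values, default=0), max_code=3)
    body = b"".join(v.to_bytes(1 << code, "little") for v in values)
    return encode_header(code, tag, len(values)) + body
```

With this, encode_uint_scalar(1, 5) gives the 3 bytes 03 01 05, while encode_uint_sequence(1, [5]) gives the 4 bytes 00 01 01 05: same value, but the control byte says whether it is a scalar or a one-element array.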
One example: tt+cc only defines 12 codes, one could swap tt,cc when tt=11 and thus use all codes and support 1/2/4/8 byte counts with 1/2/4 byte tags.
  ttcc
  00cc   tag size 1, count size 1/2/4/8
  01cc   tag size 2, count size 1/2/4/8
  10cc   tag size 4, count size 1/2/4/8
  11tt   count size 0, tag size 1/2/4
  1111   spare (primitive) / sentinel (struct)
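Decoding that alternative nibble really is just a couple of shifts. A minimal Python sketch (the function name is hypothetical):

```python
def decode_ttcc(nibble):
    """Decode the low 4 bits of the control byte under the swapped scheme.

    Returns (tag_size, count_size) in bytes, where count_size 0 means
    "no count, single item", or None for the spare/sentinel code 1111.
    """
    hi = (nibble >> 2) & 3
    lo = nibble & 3
    if hi == 3:
        if lo == 3:
            return None          # 1111: spare (primitive) / sentinel (struct)
        return 1 << lo, 0        # 11tt: no count, tag size 1/2/4
    return 1 << hi, 1 << lo      # ttcc: tag size 1/2/4, count size 1/2/4/8
```

All 16 codes are accounted for: 12 with a count, 3 without, and 1 spare.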
Ok, maybe that would work, and it's not really any more complex in the code. It could use a lookup table but shifts would probably be faster. And this still leaves room for 8 more data types.
I went through a few similar iterations to get to this point. It has a couple of noteworthy features.
- write streamable
It doesn't need to calculate information it doesn't know in advance. For example the size of an encoded object. This was a mess in my initial attempts and sometimes required multiple recursive passes.
- self describing / read streamable
To be robust to data format changes it needs to be able to skip over data it doesn't understand. The tag defines the field so can be used to identify known ones. The data type and length fields combine together to define the number of bytes to skip for unknown fields. An unidentified sequence of structs must be skipped one at a time, but they provide enough information to do so.
- compact
Well, relatively compact for the features it provides.
Tags and integers only use the significant bytes. The minimum overhead for scalar values is 2 bytes per field for control+8 bit tag, which will cover almost everything. The minimum overhead for sequences is 3 bytes (control, tag, count), and for structures is also 3 bytes (control, tag, sentinel).
Fields all have default values and such values are simply not encoded into the byte stream.
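The skipping logic described under "self describing / read streamable" can be sketched in Python like this. It is only an illustration of the idea using the currently defined control byte layout (xxxxttcc, ff as struct-end), not the real implementation; the names are made up and the little-endian count is an assumption.

```python
STRUCT_CODE = 0xF
STRUCT_END = 0xFF
PRIMITIVE_CODES = {0, 1, 2, 3, 5, 6, 7}   # uint8..uint64, float16..float64

def _read_uint(buf, pos, size):
    end = pos + size
    if end > len(buf):
        raise ValueError("truncated input")
    return int.from_bytes(buf[pos:end], "little"), end

def skip_field(buf, pos):
    """Skip one field starting at its control byte; return the next offset."""
    control, pos = _read_uint(buf, pos, 1)
    type_code, tt, cc = control >> 4, (control >> 2) & 3, control & 3
    if tt == 3:
        raise ValueError("reserved tag size")
    _tag, pos = _read_uint(buf, pos, 1 << tt)      # tag is read and ignored
    count = 1                                      # cc == 3: single item
    if cc != 3:
        count, pos = _read_uint(buf, pos, 1 << cc)
    if type_code == STRUCT_CODE:
        # Structs carry no length, so each one is walked field by field.
        for _ in range(count):
            pos = skip_struct_body(buf, pos)
        return pos
    if type_code not in PRIMITIVE_CODES:
        raise ValueError("reserved type code")
    # For int/float types (code & 3) is log2 of the element size.
    end = pos + (count << (type_code & 3))
    if end > len(buf):
        raise ValueError("truncated input")
    return end

def skip_struct_body(buf, pos):
    """Skip tagged fields up to and including the struct-end code."""
    while True:
        if pos >= len(buf):
            raise ValueError("missing struct-end code")
        if buf[pos] == STRUCT_END:
            return pos + 1
        pos = skip_field(buf, pos)
```

Primitive sequences are skipped in one step from type and count, while an unknown struct (or sequence of structs) costs one pass over its fields. Nothing needs to know the tags involved.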
I dunno I feel it's a bit over-engineered, but I couldn't see a way to simplify it as I really need that tag. It takes about 2x the amount of code to implement vs the xdrn implementation although a lot of that is mapping to the ez_blob descriptor tables. As it is a self-describing format it may be useful to have a map or stream based api too, and an implementation of either would be straightforward.
Internally both use a common robust i/o mechanism which is simple and reliable. This helps protect against common coding errors like buffer under/overruns. I may expose this as an api in itself.
I'm pretty useless at writing tests (can't be good at everything!) but I have tried to write a more comprehensive set of tests here. Particularly if I'm dumping information into a database I don't want it breaking.
I could've used an existing design, but well, where's the fun in that?