Minimalistic Perfect Hashing

As a small diversion I played a little bit with minimal hash functions over the weekend.

The goal here is to convert a static list of words into a token value which could be used to implement a string switch statement or other such uses.

The GNU gperf tool is a production-ready general-purpose solution to this problem but I wanted to see if I could reduce the code size and try a more naive approach.

The approach is to just perform an exhaustive search over a limited set of possible functions. The function I am using is a multiply and shift. The input is an integer formed from the next 1, 2, or 4 bytes. I map these to a power-of-two table and just find a combination which produces no collisions over the table size. Then there are a couple of tables used to map these to an index and to a string which is used to verify the string is the same and not just the hashcode.

Using the 32 keywords for the C language this is one solution that takes only a single multiply-shift stage:

#include <string.h>
int hash_keyword(const char *word)
{
    static const char *words =
        "char\0" "long\0" "register\0" "if\0" "signed\0" "unsigned\0"
        "float\0" "switch\0" "int\0" "for\0" "while\0" "volatile\0"
        "static\0" "auto\0" "do\0" "union\0" "enum\0" "typedef\0"
        "struct\0" "sizeof\0" "const\0" "extern\0" "case\0" "default\0"
        "return\0" "break\0" "continue\0" "short\0" "else\0" "void\0"
        "double\0" "goto\0";
    static const unsigned char index[] =
        { 166, 0, 166, 166, 166, 4, 8, 16, 166, 18, 166, 166, 166, 166,
166, 166, 166, 24, 32, 37, 43, 166, 46, 166, 49, 166, 166, 54, 62, 166, 68, 72, 74, 79, 166,
166, 83, 90, 96, 102, 107, 166, 166, 113, 166, 117, 166, 124, 130, 166, 166, 166, 135, 166, 143,
166, 166, 166, 166, 148, 152, 156, 162, 166, };
    static const char value[] =
        { -1, 3, -1, -1, -1, 17, 18, 15, -1, 21, -1, -1, -1, -1, -1, -1,
-1, 28, 12, 25, 16, -1, 13, -1, 31, -1, -1, 30, 23, -1, 0, 7, 27, 10, -1, -1, 26, 24,
22, 4, 11, -1, -1, 2, -1, 6, -1, 19, 1, -1, -1, -1, 5, -1, 20, -1, -1, -1, -1, 9, 29,
8, 14, -1, };
    const unsigned char *p = (void *) word;
    unsigned int h = 0;
    {
        unsigned int v = 0;
        for (int i = 0; i < 4 && *p; i++)
            v = (v << 8) | (*p++);
        h ^= (v * 54991) >> 12;
    }
    h &= 63;
    return (strcmp(word, words + index[h]) == 0)
        ? value[h] : -1;
}

This is not a minimal perfect hash function because the hash size is 64 for 32 input words, but it does calculate the result using a single hash table calculation and probe.

Using smaller sample inputs and more steps provides more solutions but it requires more code at runtime. This one compiles into .text=87, .rodata=352 bytes.

By comparison equivalent functionality using gperf creates 595 bytes of .text and constants and 720 bytes of .data. The primary reason it is so much larger is because it stores the word and index in a structure which will take at least 16 bytes extra per entry (8 bytes for the string pointer, and 8 bytes alignment including the index value). This more than offsets the smaller table it uses.

The runtime should be comparable although being 1/3 the size should help the cpu caches.

It's slow to search and might not produce a solution but with some tuning it has worked for a limited number of cases i've tried. There are some other options it could try:

Use the word length. This finds more solutions;
The samples needn't be linearly spaced;
Other operations apart from multiply-shift;
Other combining operators than EOR.

I think it's already as minimal as possible in terms of .text and .data size, at least on amd64.

As a further comparison I compared a smaller problem to using a linear search. This case was just for 6 strings. Apart from the hash generator I wrote a cascaded if-else-if and a compacted linear search where all strings are stored in 1 string with embedded nuls used to separate them.

            .text   .(ro)data
phash	       79      64
if-else	      206      41
linear	       74      42

So the only one that's smaller uses atypical coding anyway.

Blogz Live

The blog is now using blogz as the driver.

The code is mostly the same so there's no big differences but I did change the tag-history urls to use the path for the postid rather than a query string. So those have changed but as they should be excluded from indexing and it doesn't make sense to use them as a link it shouldn't break anything.

I didn't do a lot of testing - there isn't a lot to test - but if i find things i'll fix them.

I did have a small few bugs I had to fix so i'll have to do another 0.3 series release I guess.

I'm doing stuff on 0.4 now, so it might not be far away. That'll move to using a database as the index which again wont make for very large visual changes although I will be able to add titles to the newer/older post linkes. Once that's done I can look at a bunch of other stuff from comments to vote buttons.

And another thing ...

What about memory mappable tagged format?

Whilst this couldn't be done for a struct mapping it would be useful for arrays.

Not hard to add, but needs agreement on both ends. Build-time decision? Runtime decision? Stream meta-tags?

The puzzle never ends.

Super Cereal!

I got way too caught up with writing a new serialiser over the last couple of days. Actually I finished off another one I had so I ended up with two.

There are two cases i'm interested in. One is tight coupling where simplicity and performance outweights extensibility; basically for IPC. The other is where extensibility and size are the main considerations; for object serialisation / data storage.

So I have an XDR-like implementation for the former. The layout of items is the same as XDR (sans mistakes) but it uses native ordering of elements, so i dubbed it XDRN for xdr-native.

For the latter i have -yes- yet another tagged format. Each field is tagged and each object is also a tagged container. The header is at least 2 bytes - a control byte and a tag byte. I can't be bothered typing out all the details - here is whatI have in the source code at the moment.

      This is a streamable self-describing byte-oriented binary format.
      It is a general purpose format and supports a super-set of the
      ez_blob descriptor.  It supports primitive and struct types and
      sequences thereof and there is room for extension.

      Each item beings with a descriptor byte, then followed by a tag id,
      a possible count, and the payload.

      xxxxttcc control byte

      xxxx type code

      0 uint8       unsigned int, value zero-extended
      1 uint16
      2 uint32
      3 uint64
      - reserved
      5 float16
      6 float32
      7 float64
      - reserved
      f struct

      note that for int/float types, (code&3) == log2 of element size in bytes

      tt log2 of tag size in bytes

      0 1 byte
      1 2 byte
      2 4 byte
      3 reserved, no tag?

      cc log2 of count size in bytes, used to indicate sequence length or non-sequence.

      0 1 byte
      1 2 byte
      2 4 byte
      3 none, single item follows

      ff is struct-end code

      A header is a control byte followed by an optional 1/2/4 byte-length tag,
      followed by an optional 1/2/4 byte-length count.

      A structure payload is a list of tagged fields until a struct-end
      code.  A structure sequence is a list of count struct-encoded blocks.

      Integers can be stored in the smallest number of bytes, i.e. with
      all leading $00 bytes removed.

So basically each field has a type, a tag, and a count. Scalar values are with a special count code so don't require a count value. It also differentiates between scalars and single-item sequences. Sequences all have a count and no end sentinal.

It's versatile enough to hold most likely structures but isn't universal. String encoding is application layer. No 128+bit primitives (but there is room to add them). No map type, but there is room to add it (it could just application layer). Probably the only significant one is a 32-bit limit on sequence (array) lengths (for some level of significant!). There are only 96+1 valid codes defined now so there is room in a single control byte for some but not all of these but it's not likely to be as tidy.

One example: tt+cc only defines 12 codes, one could swap tt,cc when tt=11 and thus use all codes and support 1/2/4/8 byte counts with 1/2/4 byte tags.

    ttcc

    00cc   tag size 1, count size 1/2/4/8
    01cc   tag size 2, count size 1/2/4/8
    10cc   tag size 4, count size 1/2/4/8
    11tt   count size 0, tag size 1/2/4
    1111   spare (primtive) / sentinal (struct)

Ok, maybe that would work, and it's not really any more complex in the code. It could use a lookup table but shifts would probably be faster. And this still leaves room for 8 more data types.

I went through a few similar iterations to get to this point. It has a couple of noteworthy features.

write streamable

It doesn't need to calculate information it doesn't know in advance. For example the size of an encoded object. This was a mess in my initial attempts and sometimes required multiple recursive passes.

self describing / read streamable

To be robust to data format changes it needs to be able to skip over data it doesn't understand. The tag defines the field so can be used to identify known ones. The data type and length fields combine together to define the number of bytes to skip for unknown fields. An unitendified sequence of structs must be skipped one at a time, but they provide enough information to do so.

compact

Well, relatively compact for the features it provides.

Tags and integers only use the significant bytes. The minimum overhead for scalar values is 2 bytes per field for control+8 bit tag, which will cover almost everything. The minimum overhead for sequences if 3 bytes (control, tag, count), and for structures is also 3 bytes (control, tag, sentinal).

Fields all have default values and such values are simply not encoded into the byte stream.

I dunno I feel it's a bit over-engineered, but I couldn't see a way to simplify it as I really need that tag. It takes about 2x the amount of code to implement vs the xdrn implementation although a lot of that is mapping to the ez_blob descriptor tables. As it is a self-describing format it may be useful to have a map or stream based api too, and an implementation of either would be straightforward.

Internally both use a common robust i/o mechanism which is simple and reliable. This helps protect against common coding errors like buffer under/overruns. I may expose this as an api in itself.

I'm pretty useless at writing tests (can't be good at everything!) but I have tried to write a more comprehensive set of tests here. Particularly if i'm dumping information into a database I don't want it breaking.

I could've used an existing design, but well, where's the fun in that?

More developments

libeze and blogz are now on code.zedzone.au.

I still haven't moved zedzone to use it but I probably will this week.

I've already prototyped all the code required to move blogz to using lmdb as an index. Conceptually it's very simple, just 3 tables and a few indexes. But there's a lot of boiler-platish code to map it to tables and enforce referential integrity across multiple secondary indices.

Although I haven't implemented it yet the tables support the required meta-data for versioning, renaming, tagging and so on. It will also support comments, wiki-like, and book-chapter like articles. So it will be a stepping stone toward that.

And votes.

I'm thinking about another blob serialiser that uses a tagged format. This is to support simpler schema upgrades should they be required. On the other hand it may never be necessary and it start to fall into over-engineering so I need to see how it falls out once i've gotten there.

A big bloody rant

Ok it's time for another rant. I had a list of things to rant about and I lost it ... so I created another one. I may remember on the way.

Children or pearl clutching nannikins may want to look away.

In short; i'm getting pretty sick of junk on the internet, junk in software development, junk in politics. There's the other crap like anti-vaxxers and climate change denialists - but well that's more of a collective fuckup that we'll all have to reckon with in the end.

Alright where to begin. Firefox. Fuck firefox. Every time it upgrades it seems to break somefucking thing. Now the preferences have been lobotomised for no obvious reason: rather than a click you have to scroll forever to find what you want. And the default scroll using a wheel-mouse is so frustratingly slow it's a total headfuck. Why can't they just include a proper javascript and image blocker in the base fucking application rather than a buggy plugin (oh no, extension!) that breaks every upgrade. And then when you run it it wants to run you through a fucking tutorial about all the great shit it does. Yeah i've been using a browser for 20 years, they're just not that special you know. The pocket. I don't even know what it is but I know I don't want it. Fancy url rendering: look it's just a bit of text, I want to see what it is, you don't need to hide the http or the www or the com. It doesn't make it any more usable or more readable. Multi process; which just means heavy swapping. Javascript is just such a stupid language the last thing anyone should be doing is encouraging more of it.

cmake, got what a load of junk. After years of failures I finally am able to actually compile some projects with cmake out of the box. But it's still usually fucked and the only solution seems to be to go edit the usually non-documented application-specific configuration files.

C++ is such a shit language. Templates are joke. You have<to::tell::the::compilerExactlyWhat>youMean.EveryTime(YouUseIt(X)). And it's still uses gobs of memory and runs like a total pig and gives overly verbose and low-information error dumps. Everyone uses templates because it inlining absolutely everything makes the fastest and most efficient code. Actually, no, it really doesn't. Over-inlining starves registers (particularly on the terminally register-starved x86 platform), blows out the L1 cache, and wastes memory and disk space. Sure you can use it judiciously but but the time you've done that you may as well have just use something else.

Python. Spaces are important. Oh no, actually the number of white-space characters are important. I haven't got time to wait for the molasses of a cursor to step-one-space-at-a-fuck-ing-time-to-get-to-where-the-action-is. I saw a comment on reddit yesterday about how python is so cool because of the indenting. It really isn't. With C or any other sane language you can type whatever you want and the compiler can work it out. The bit that python is missing is you can then automatically format it in various more readable ways that show exactly how the compiler will treat it. With python you're stuck with what you see. Add a block? Now you how to indent everything else and make sure you do it right, now just add a couple of braces and hit re-fromat. Ugh.

This isn't even the real problem with python. I recently had to try to compile pytorch, a 'lightweight' (ROFLMAO what the fuck?) artificial neural network engine and the abomonation that was once caffe. The webpage says use anaconda. Ok 4-fucking-gigabytes later I get something that doesn't work; this version isn't compiled with the features I need (probably because of fucking software patents and ffmpeg). So I spent a couple of days trying to get it to build from source (actually I tried that first but had to go back to it). Fighting with cmake and whatnot I finally get it compiled. But then the python doesn't run. Just a meaningless backtrace (which is pretty much my experience with anything python). Ok so some package isn't there. Ok so ... one uses pip or whatever to install it like you might with apt or yum. Except it isn't really like apt or yum at all. It isn't even like slackpkg. All it does is download a tar and drop in on the filesystem. Dependencies? What are they? Versions? Hah, surely you jest! It's just a bunch of files on the system.

So you run the script again. Again it breaks. Again you install another package. Then lmdb wont even bloody install because setup.py is broken. The internets claim this is a pip bug but then you fuck up and end up with 3 copies of pip installed (/usr, /usr/local, ~/.local) ... and well you fuck it all off and start again. Eventually it turns out lmdb is a compile option hidden somewhere inside the cmake scripts of the fucking custom OpenCV build you had to compile when you started on this journey a couple of days ago.

Javascript. I mean I can't even manage to encompass the size of the fuckup of this conceptually let alone put it into words.

Web page feedback. You go to AMD or wherever to look up some hardware or software details and before the front page has even loaded the page goes white and an overlay asks you to provide feedback for your 'web' experience. Here's some fucking feedback:

                     /"\
                    |\./|
                    |   |
                    |   |
                    |>~<|
                    |   |
                 /'\|   |/'\..
             /~\|   |   |   | \
            |   =[@]=   |   |  \
            |   |   |   |   |   \
            | ~   ~   ~   ~ |`   )
            |                   /
             \                 /
              \               /
               \    _____    /
                |--//''`\--|
                | (( +==)) |
                |--\_|_//--|

Jesus, even that was hard to find: nobody knows what the fuck ASCII is anymore. It's all windows-cpfuck-you-too or some other junk.

Highly related is a web-site that keeps asking you to subscribe. Or keeps giving you 'helpful hints' on how to use it. You know what, I can read, if it's not obvious it doesn't fucking matter. I don't need to be stepped in turn through every single feature that is so unique it's on every other bloody website. I'm sure you're very proud of getting it to work, well good for you. Now you just fucked it up by bugging the fuck out of me.

Or bloody achievements. I just don't bother with any forums anymore but I am on the FSF member forum although it's pretty dead and I just don't fit in (story of my life). I mean I get a happy-face-stamp for writing my first post. Ok thanks I guess? What am i, in grade fucking one? There's enough trouble with the world infanticising everything, do we really need it on every bloody forum we visit? I know i know, it drivers user engagement and all the metrics. Well fuck the metrics.

Speaking of metrics, why the fuck can't the yanks get with the programme? Even when they do use them they can't spell it properly. It's a bloody metre, a meter is something with a dial. Fuck the freedom units.

Modern user interfaces. aka the hamburger menu. aka the stack of shit menu. Yeah so apparently a small row or column of easly read text is too much for the kool kidz these days and it's all hip 2D iconography and animations and shit. It's a shit style that will age poorly. Every time a site or application updates menu's go missing or get moved somewhere else. Jesus fucking stick to something you don't have to actually change this this just for the sake of it. You know that people will just get used to whatever you have; the cognitive overload of having to relean it every time massively increases the real costs of any 'improvement' you might make.

Glaring white pages. Well this goes back to the original internet explorer, M$ windows, M$ word (yes I said itt! There's good solid reasoning behing it!). I regularly spend 12+ hours a day on a computer, if I was being blasted with that white shit I would be blinder than I am already. I don't get headaches for a reason.

Constant updates, rolling releases, fluid apis. It's just too much. Sometimes things are just done and don't need updating anymore. But if you do that you dissapear off the google search results. You're no longer part of the zeitgeist! Heavens! Can't have that! No you need a fancy logo and an annual rebranding `event' and need never maintain anything again, just throw out a new release when you've got enough new buttons to make it look like it matters! Gotta rush to the end of that sprint!

If you can't keep up, well you've just been out-evolved by the equivalent of a telephone sanitiser.

There was more i'm sure but that'll do for now.

Oh hang on, one more rather important one in my notes. Back to firefox. I normally run firefox which is finely tuned to be actually usable. I don't use many plugins - disable javascript, disable fonts and colours and that's about it. But I customise the fuck out of everything else: no fucking animations of smooth-this-or-fucking-that, middle-click to open a url, etc etc. I even have a userContent.css that hides a whole of shit on sites I commonly use (like the banners on stackoverflow and so on). But I tried a naked profile in order to test out the stylesheets i've worked on for code and docs. Boy what a fucking joke. All the reminders about shit I don't want. They even fucked up ctrl-tab so it goes forward but wont return where you came from (isntead it opens a menu with very different keyboard navigation) - i mean what's 25 years of convention for anyway?

But the most alarming one was when I went to download my own fucking sourcecode. Apparently a tar.gz file is now a dangerous download.

Yeah fuck off mozilla you bunch of stupid cunts.

Now I just feel like shit, this has been really depressing.

Sad face.

blogz 0.3

I did a bit of work on blogz today and pushed out a 0.3 release. This is not really tested and i'm still not using in on zedzone but I will be testing it and perhaps do another 0.x release if I have bugs. But the reason it's out now is that this is the snapshot base that I will load into git and then work on that from now on.

If it all works out the next step will be working on using a database for indices - at the very least it will be some lmdb thing, but it may go the whole hog and start to look at using crez for the backend so I can look into some more interesting ideas, although this is a lot more work. This will also allow me to look into a comment system eventually.

See the blogz homepage for the downloads as usual.

L-O-L like anyone gives a flying fuck.

Moar Sauce!

I've now moved my main projects over to using modular Java 11 as well as using git. I've quite a few projects remaining.

I'm undecided on many of them whether they just go in the bin, get updated and published (and abandoned) or get worked on actively again.

blogz is almost ready.

This is mostly a git setup but I'm not actually using it yet on this site either, although i've synchronised the code between them now. Once I do both it'll be there.

playerz is a mess.

This is in the midst of a large amount of changes and I lost track of where I was at. Its a fairly interesting little project though so I will get back to it.

izlib is unfinished

I think I discussed this years ago but basically it's unpublished and well from unfinished. It's basically an image processing toolkit and an experimental platform for Java Streams and API refinement.

I don't really have any plans but I noticed the other day there's a lot of code just sitting there rotting away.

mediaz is archived

mediaz was a project I had on google code which was sort of a predecessor to izlib as well as a collection of early JavaFX experiments which are more or less tutorials for others. And the basic start of a layered bitmap image editor (imagez).

It was about the only project anyone asked about when google code went down. I have the repository archived so I guess I could at least convert it.

termz is unfinished

This was a little fun project using OpenCL to drive a terminal emulator display. It's pretty much pointless and could be done directly with OpenGL.

cdez is rotting

I actually have a C version of dez 1.2 too. It could be updated and published or something.

rez and crez are complicated

rez is a project that implements a `personal' versioned blob store. It supports free branches and cheap copies (iirc), renames, metadata and all that jazz, and uses dez for compact storage. Berkely DB (JE) is used as an embedded database.

This is a project i've worked on and off for years (15?+) with separate C and Java implementations. The original driver was for a web CMS. The last time I worked on it was a year ago porting it to C (including a C versin of dez), for this very blog - although in the end I stayed with the simple blogz.

It's explicitly not `enterprise' oriented on purpose.

There's probably plenty of alternatives around now and I don't really know what to do with it. I never quite got the API down to the point I was really happy with. The C version should probably use lmdb now.

What I have done actually works though so I should probably drop it out somewhere. The problems aren't all that complicated but I think I solved them in a fairly tidy and compact and reusable way.

libeze - 2.0

I just need to drop this into git as is.

This was also in the midst of a large number of changes but I kept them out for the last release. But I have a good number of bits and pieces I can add to it. playerz is the main driver here, but that doesn't need much.

SynthZ - the popular one

I'm not doing anything with this, but it has been downloaded fairly often. Maybe I will git it.

I did learn some more about SourceDataLine so the soft keyboard is real-time now!

wanki is who the hell knows

This was a wiki engine using texinfo markup and with the abiity to properly organise and export multi-page documents. It's always been pre-alpha and gone through a number of varients, from C to JavaEE.

It was the original driver for what became (c)rez.

Who knows, maybe one day, I still think it's got some merit. Wiki's still have major trouble with multi-page documents ordered like a book.

DuskZ is unfinished

An attempt at working on a simple game. I got caught up in pointless (de)serialisation stuff and other sort of unimportant details. I'm sorry to the lad that I chatted to about it. He is a nice fellow. But it was sort of at a bad time personally and I just dont have the interest to work on it anymore. I hate letting people down.

I don't know if there's really anything salvagable at this point as all I did was break a bunch of stuff.

socles is archived

This was an OpenCL image processing library. Nobody cared and eventually neither did I. There might be something salvagable for an eventual izlib backend, maybe.

low level arm code, puppybits, zedos

puppybitz is somewhat stale but maybe there's some stuff useful there.

zedos stopped when I hit the USB driver. Fuck intel for making that junk.

paralella is DEAD TO ME

The parallella stuff i'm not longer working on and it will be staying where it is (at least it's published).

I still have some boards but the whole thing soured me on kickstarter and I will never put money into anything of the sort again. It seems it's mostly used for these sort of projects as a 'lets get some zero-cost bridging finance so our real investors have more confidence' which is pretty much fucked-up capitalism at it's finest.

Wherein capital should be the one taking risks to invest in capital to make more of it. This instead is, well, get the plebs to give us free money and take all the risk for no return!

Android apps are going nowhere

Again the source is already out, it's unlikely I will do more but whether I do will be on a case-by-case basis.

I'm sure there is more - that's just what I found from my archive of google code and a quick look at what I have sitting on THIS computer. I've got drives and backups elsehwere, who knows.

Probably if anything I should start checkpointing more often and dumping shit on code.zedzone.au. Until I had that I didn't really have anywhere to put the random otherwise not really publishable-in-themselves experiments which abound.

It will be a while before this list is fully processed.

About Me

Tags