Sunday, 05 May 2019, 08:29

Minimalistic Perfect Hashing

As a small diversion I played around a little bit with minimal perfect hash functions over the weekend.

The goal here is to map each word of a static list to a small token value, which can then be used to implement a string switch statement or other such things.

The GNU gperf tool is a production-ready, general-purpose solution to this problem, but I wanted to see if a more naive approach could reduce the code size.

The approach is just to perform an exhaustive search over a limited set of possible functions. The function I'm using is a multiply and shift, where the input is an integer formed from the next 1, 2, or 4 bytes of the word. These hash into a power-of-two sized table, and the search just looks for a combination which produces no collisions at that table size. A couple of small tables then map each hash slot to a token index and to a string, the latter being used to verify the word itself matches and not just its hashcode.
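
The search itself doesn't need to be clever. Something along the lines of the following sketch is enough to turn up workable constants (this is not the actual generator; it only searches the 4-byte sample against a fixed 64-entry table, and keys[] is truncated for brevity):

#include <stdio.h>

/* Sample up to the first 4 bytes of a word as a big-endian integer. */
static unsigned int sample4(const char *word)
{
    const unsigned char *p = (const unsigned char *) word;
    unsigned int v = 0;
    for (int i = 0; i < 4 && *p; i++)
        v = (v << 8) | (*p++);
    return v;
}

int main(void)
{
    /* ... the full 32-keyword list goes here ... */
    static const char *keys[] = {
        "char", "long", "register", "if", "signed", "unsigned",
        "float", "switch",
    };
    int nkeys = sizeof keys / sizeof keys[0];

    /* Exhaustive search over odd multipliers and shifts for a
       combination with no collisions in the 64-entry table. */
    for (unsigned int mul = 1; mul < 65536; mul += 2) {
        for (int shift = 0; shift < 32; shift++) {
            unsigned long long used = 0;  /* occupancy bitmap, 1 bit per slot */
            int ok = 1;

            for (int i = 0; ok && i < nkeys; i++) {
                unsigned int h = ((sample4(keys[i]) * mul) >> shift) & 63;

                if (used & (1ULL << h))
                    ok = 0;
                else
                    used |= 1ULL << h;
            }
            if (ok) {
                printf("multiplier=%u shift=%d\n", mul, shift);
                return 0;
            }
        }
    }
    printf("no solution\n");
    return 1;
}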

Using the 32 keywords of the C language, this is one solution that takes only a single multiply-shift stage:

#include <string.h>
int hash_keyword(const char *word)
{
    /* All 32 keywords packed into a single NUL-separated string. */
    static const char *words =
        "char\0" "long\0" "register\0" "if\0" "signed\0" "unsigned\0"
        "float\0" "switch\0" "int\0" "for\0" "while\0" "volatile\0"
        "static\0" "auto\0" "do\0" "union\0" "enum\0" "typedef\0"
        "struct\0" "sizeof\0" "const\0" "extern\0" "case\0" "default\0"
        "return\0" "break\0" "continue\0" "short\0" "else\0" "void\0"
        "double\0" "goto\0";
    /* Offset into 'words' of the candidate keyword for each of the 64
       hash slots; 197 points at a trailing NUL so empty slots never match. */
    static const unsigned char index[] = {
        197,   0, 197, 197, 197,   5,  10,  19, 197,  22, 197, 197, 197, 197, 197, 197,
        197,  29,  38,  44,  51, 197,  55, 197,  59, 197, 197,  65,  74, 197,  81,  86,
         89,  95, 197, 197, 100, 108, 115, 122, 128, 197, 197, 135, 197, 140, 197, 148,
        155, 197, 197, 197, 161, 197, 170, 197, 197, 197, 197, 176, 181, 186, 193, 197,
    };
    /* Token returned for each slot; -1 marks an empty slot. */
    static const char value[] = {
        -1,  3, -1, -1, -1, 17, 18, 15, -1, 21, -1, -1, -1, -1, -1, -1,
        -1, 28, 12, 25, 16, -1, 13, -1, 31, -1, -1, 30, 23, -1,  0,  7,
        27, 10, -1, -1, 26, 24, 22,  4, 11, -1, -1,  2, -1,  6, -1, 19,
         1, -1, -1, -1,  5, -1, 20, -1, -1, -1, -1,  9, 29,  8, 14, -1,
    };
    const unsigned char *p = (const unsigned char *) word;
    unsigned int h = 0;
    {
        /* Sample up to the first 4 bytes of the word ... */
        unsigned int v = 0;
        for (int i = 0; i < 4 && *p; i++)
            v = (v << 8) | (*p++);
        /* ... and mix them with a single multiply-shift stage. */
        h ^= (v * 54991) >> 12;
    }
    h &= 63;
    /* Verify the string itself matches, not just its hashcode. */
    return (strcmp(word, words + index[h]) == 0)
        ? value[h] : -1;
}
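
Usage is then just a call and a test against -1 (the actual token numbering depends on the order of the keyword list fed to the generator):

#include <stdio.h>

int hash_keyword(const char *word);

int main(void)
{
    /* a keyword returns its token, anything else returns -1 */
    printf("while  -> %d\n", hash_keyword("while"));
    printf("wibble -> %d\n", hash_keyword("wibble"));
    return 0;
}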

This is not a minimal perfect hash function because the table size is 64 for the 32 input words, but it does calculate the result using a single hash calculation and a single table probe.

Using smaller sample inputs and more stages provides more candidate solutions (a two-stage variant is sketched below), but it requires more code at runtime. This one compiles into .text=87 and .rodata=352 bytes.
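
For illustration, a two-stage variant using two 2-byte samples might look something like this (the constants here are made up for the example, not a searched solution):

/* Hypothetical two-stage hash: two 2-byte samples, each mixed with its
   own multiply-shift, xor'd together.  Constants are illustrative only. */
static unsigned int hash2(const char *word)
{
    const unsigned char *p = (const unsigned char *) word;
    unsigned int h = 0;

    {
        unsigned int v = 0;
        for (int i = 0; i < 2 && *p; i++)
            v = (v << 8) | (*p++);
        h ^= (v * 12347) >> 3;
    }
    {
        unsigned int v = 0;
        for (int i = 0; i < 2 && *p; i++)
            v = (v << 8) | (*p++);
        h ^= (v * 54991) >> 9;
    }
    return h & 63;
}

Each extra stage costs another sampling loop and multiply at runtime.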

By comparison, equivalent functionality generated by gperf creates 595 bytes of .text and constants, and 720 bytes of .data. The primary reason it is so much larger is that it stores each word and its index in a structure, which takes at least 16 extra bytes per entry (8 bytes for the string pointer, plus the index value padded out to 8 bytes by alignment). This more than offsets the smaller hash table it uses.
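
That is, each entry ends up stored in something shaped roughly like this (a simplified sketch of what those sizes imply, not the exact gperf output):

/* 16 bytes per keyword on amd64, before counting the string itself. */
struct keyword {
    const char *name;   /* 8-byte pointer to the keyword text */
    int token;          /* the index value, padded to 8 bytes by alignment */
};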

The runtime should be comparable, although being a third of the size should help the CPU caches.

It's slow to search and might not find a solution at all, but with some tuning it has worked for the limited number of cases I've tried. There are some other options it could try to widen the search.

I think it's already as minimal as possible in terms of .text and .data size, at least on amd64.

As a further comparison I tried a smaller problem, just 6 strings, against a linear search. Apart from the hash generator output I wrote a cascaded if-else-if and a compacted linear search where all the strings are stored in one string with embedded NULs separating them.

          .text  .(ro)data
phash        79         64
if-else     206         41
linear       74         42

So the only one that's smaller uses atypical coding anyway.
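
That atypical coding is the compacted linear search, which goes roughly along these lines (a sketch with hypothetical strings, not the six actually tested):

#include <string.h>

/* Compacted linear search: all the strings live in one table separated
   by embedded NULs, and the token is just how far the scan got. */
int linear_lookup(const char *word)
{
    static const char words[] =
        "alpha\0" "beta\0" "gamma\0" "delta\0" "epsilon\0" "zeta\0";
    const char *p = words;
    int i = 0;

    while (*p) {
        if (strcmp(word, p) == 0)
            return i;
        p += strlen(p) + 1;   /* skip to the next embedded string */
        i++;
    }
    return -1;
}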

Tagged code, hacking.