About Me
Michael Zucchi
B.E. (Comp. Sys. Eng.)
also known as Zed
to his mates & enemies!
< notzed at gmail >
< fosstodon.org/@notzed >
Minimalistic Perfect Hashing
As a small diversion I played a little bit with minimal hash
functions over the weekend.
The goal here is to convert a static list of words into a token
value which could be used to implement a string switch statement
or other such uses.
The GNU gperf tool is a production-ready general-purpose solution
to this problem but I wanted to see if I could reduce the code
size and try a more naive approach.
The approach is to just perform an exhaustive search over a
limited set of possible functions. The function I am using is a
multiply and shift. The input is an integer formed from the next
1, 2, or 4 bytes. I map these to a power-of-two table and just
find a combination which produces no collisions over the table
size. Then there are a couple of tables used to map these to an
index and to a string which is used to verify the string is the
same and not just the hashcode.
Using the 32 keywords for the C language this is one solution that
takes only a single multiply-shift stage:
#include <string.h>
int hash_keyword(const char *word)
{
static const char *words =
"char\0" "long\0" "register\0" "if\0" "signed\0" "unsigned\0"
"float\0" "switch\0" "int\0" "for\0" "while\0" "volatile\0"
"static\0" "auto\0" "do\0" "union\0" "enum\0" "typedef\0"
"struct\0" "sizeof\0" "const\0" "extern\0" "case\0" "default\0"
"return\0" "break\0" "continue\0" "short\0" "else\0" "void\0"
"double\0" "goto\0";
static const unsigned char index[] =
{ 166, 0, 166, 166, 166, 4, 8, 16, 166, 18, 166, 166, 166, 166,
166, 166, 166, 24, 32, 37, 43, 166, 46, 166, 49, 166, 166, 54, 62, 166, 68, 72, 74, 79, 166,
166, 83, 90, 96, 102, 107, 166, 166, 113, 166, 117, 166, 124, 130, 166, 166, 166, 135, 166, 143,
166, 166, 166, 166, 148, 152, 156, 162, 166, };
static const char value[] =
{ -1, 3, -1, -1, -1, 17, 18, 15, -1, 21, -1, -1, -1, -1, -1, -1,
-1, 28, 12, 25, 16, -1, 13, -1, 31, -1, -1, 30, 23, -1, 0, 7, 27, 10, -1, -1, 26, 24,
22, 4, 11, -1, -1, 2, -1, 6, -1, 19, 1, -1, -1, -1, 5, -1, 20, -1, -1, -1, -1, 9, 29,
8, 14, -1, };
const unsigned char *p = (void *) word;
unsigned int h = 0;
{
unsigned int v = 0;
for (int i = 0; i < 4 && *p; i++)
v = (v << 8) | (*p++);
h ^= (v * 54991) >> 12;
}
h &= 63;
return (strcmp(word, words + index[h]) == 0)
? value[h] : -1;
}
This is not a minimal perfect hash function because the hash size
is 64 for 32 input words, but it does calculate the result using a
single hash table calculation and probe.
Using smaller sample inputs and more steps provides more solutions
but it requires more code at runtime. This one compiles into
.text=87, .rodata=352 bytes.
By comparison equivalent functionality using gperf creates 595
bytes of .text and constants and 720 bytes of .data. The primary
reason it is so much larger is because it stores the word and
index in a structure which will take at least 16 bytes extra per
entry (8 bytes for the string pointer, and 8 bytes alignment
including the index value). This more than offsets the smaller
table it uses.
The runtime should be comparable although being 1/3 the size
should help the cpu caches.
It's slow to search and might not produce a solution but with some
tuning it has worked for a limited number of cases i've tried.
There are some other options it could try:
- Use the word length. This finds more solutions;
- The samples needn't be linearly spaced;
- Other operations apart from multiply-shift;
- Other combining operators than EOR.
I think it's already as minimal as possible in terms of .text and
.data size, at least on amd64.
As a further comparison I compared a smaller problem to using a
linear search. This case was just for 6 strings. Apart from the
hash generator I wrote a cascaded if-else-if and a compacted linear
search where all strings are stored in 1 string with embedded nuls
used to separate them.
.text .(ro)data
phash 79 64
if-else 206 41
linear 74 42
So the only one that's smaller uses atypical coding anyway.
Blogz Live
The blog is now using blogz as
the driver.
The code is mostly the same so there's no big differences but I
did change the tag-history urls to use the path for the postid
rather than a query string. So those have changed but as they
should be excluded from indexing and it doesn't make sense to use
them as a link it shouldn't break anything.
I didn't do a lot of testing - there isn't a lot to test - but
if i find things i'll fix them.
I did have a small few bugs I had to fix so i'll have to do
another 0.3 series release I guess.
I'm doing stuff on 0.4 now, so it might not be far away. That'll
move to using a database as the index which again wont make for
very large visual changes although I will be able to add titles to
the newer/older post linkes. Once that's done I can look at a
bunch of other stuff from comments to vote buttons.
And another thing ...
What about memory mappable tagged format?
Whilst this couldn't be done for a struct mapping it would be
useful for arrays.
Not hard to add, but needs agreement on both ends. Build-time
decision? Runtime decision? Stream meta-tags?
The puzzle never ends.
Super Cereal!
I got way too caught up with writing a new serialiser over the
last couple of days. Actually I finished off another one I had so
I ended up with two.
There are two cases i'm interested in. One is tight coupling
where simplicity and performance outweights extensibility;
basically for IPC. The other is where extensibility and size are
the main considerations; for object serialisation / data storage.
So I have an XDR-like implementation for the former. The layout
of items is the same as XDR (sans mistakes) but it uses native
ordering of elements, so i dubbed it XDRN for xdr-native.
For the latter i have -yes- yet another tagged format. Each field
is tagged and each object is also a tagged container. The header
is at least 2 bytes - a control byte and a tag byte. I can't be
bothered typing out all the details - here is whatI have in the
source code at the moment.
This is a streamable self-describing byte-oriented binary format.
It is a general purpose format and supports a super-set of the
ez_blob descriptor. It supports primitive and struct types and
sequences thereof and there is room for extension.
Each item beings with a descriptor byte, then followed by a tag id,
a possible count, and the payload.
xxxxttcc control byte
xxxx type code
0 uint8 unsigned int, value zero-extended
1 uint16
2 uint32
3 uint64
- reserved
5 float16
6 float32
7 float64
- reserved
f struct
note that for int/float types, (code&3) == log2 of element size in bytes
tt log2 of tag size in bytes
0 1 byte
1 2 byte
2 4 byte
3 reserved, no tag?
cc log2 of count size in bytes, used to indicate sequence length or non-sequence.
0 1 byte
1 2 byte
2 4 byte
3 none, single item follows
ff is struct-end code
A header is a control byte followed by an optional 1/2/4 byte-length tag,
followed by an optional 1/2/4 byte-length count.
A structure payload is a list of tagged fields until a struct-end
code. A structure sequence is a list of count struct-encoded blocks.
Integers can be stored in the smallest number of bytes, i.e. with
all leading $00 bytes removed.
So basically each field has a type, a tag, and a count. Scalar
values are with a special count code so don't require a count
value. It also differentiates between scalars and single-item
sequences. Sequences all have a count and no end sentinal.
It's versatile enough to hold most likely structures but isn't
universal. String encoding is application layer. No 128+bit
primitives (but there is room to add them). No map type, but
there is room to add it (it could just application layer).
Probably the only significant one is a 32-bit limit on sequence
(array) lengths (for some level of significant!). There are only
96+1 valid codes defined now so there is room in a single control
byte for some but not all of these but it's not likely to be as
tidy.
One example: tt+cc only defines 12 codes, one could swap
tt,cc when tt=11 and thus use all codes and support 1/2/4/8 byte
counts with 1/2/4 byte tags.
ttcc
00cc tag size 1, count size 1/2/4/8
01cc tag size 2, count size 1/2/4/8
10cc tag size 4, count size 1/2/4/8
11tt count size 0, tag size 1/2/4
1111 spare (primtive) / sentinal (struct)
Ok, maybe that would work, and it's not really any more complex in
the code. It could use a lookup table but shifts would probably
be faster. And this still leaves room for 8 more data types.
I went through a few similar iterations to get to this point. It
has a couple of noteworthy features.
- write streamable
It doesn't need to calculate information it doesn't know
in advance. For example the size of an encoded object. This
was a mess in my initial attempts and sometimes required
multiple recursive passes.
- self describing / read streamable
To be robust to data format changes it needs to be able to
skip over data it doesn't understand. The tag defines the
field so can be used to identify known ones. The data type
and length fields combine together to define the number of
bytes to skip for unknown fields. An unitendified sequence of
structs must be skipped one at a time, but they provide enough
information to do so.
- compact
Well, relatively compact for the features it provides.
Tags and integers only use the significant bytes. The minimum
overhead for scalar values is 2 bytes per field for control+8
bit tag, which will cover almost everything. The minimum
overhead for sequences if 3 bytes (control, tag, count), and
for structures is also 3 bytes (control, tag, sentinal).
Fields all have default values and such values are simply
not encoded into the byte stream.
I dunno I feel it's a bit over-engineered, but I couldn't see a
way to simplify it as I really need that tag. It takes about 2x
the amount of code to implement vs the xdrn implementation
although a lot of that is mapping to the ez_blob descriptor
tables. As it is a self-describing format it may be useful to
have a map or stream based api too, and an implementation of
either would be straightforward.
Internally both use a common robust i/o mechanism which is simple
and reliable. This helps protect against common coding errors
like buffer under/overruns. I may expose this as an api in
itself.
I'm pretty useless at writing tests (can't be good at everything!)
but I have tried to write a more comprehensive set of tests here.
Particularly if i'm dumping information into a database I don't
want it breaking.
I could've used an existing design, but well, where's the fun in
that?
More developments
libeze and blogz are now on code.zedzone.au.
I still haven't moved zedzone to use it but I probably will this
week.
I've already prototyped all the code required to move blogz to
using lmdb as an index. Conceptually it's very simple, just 3
tables and a few indexes. But there's a lot of boiler-platish code
to map it to tables and enforce referential integrity across
multiple secondary indices.
Although I haven't implemented it yet the tables support the
required meta-data for versioning, renaming, tagging and so on.
It will also support comments, wiki-like, and book-chapter like
articles. So it will be a stepping stone toward that.
And votes.
I'm thinking about another blob serialiser that uses a tagged
format. This is to support simpler schema upgrades should they be
required. On the other hand it may never be necessary and it
start to fall into over-engineering so I need to see how it falls
out once i've gotten there.
A big bloody rant
Ok it's time for another rant. I had a list of things to rant
about and I lost it ... so I created another one. I may remember
on the way.
Children or pearl clutching nannikins may want to look away.
In short; i'm getting pretty sick of junk on the internet, junk in
software development, junk in politics. There's the other crap
like anti-vaxxers and climate change denialists - but well that's
more of a collective fuckup that we'll all have to reckon with in
the end.
Alright where to begin. Firefox. Fuck firefox. Every time it
upgrades it seems to break somefucking thing. Now the preferences
have been lobotomised for no obvious reason: rather than a click
you have to scroll forever to find what you want. And the default
scroll using a wheel-mouse is so frustratingly slow it's a total
headfuck. Why can't they just include a proper javascript and
image blocker in the base fucking application rather than a buggy
plugin (oh no, extension!) that breaks every upgrade. And then
when you run it it wants to run you through a fucking tutorial
about all the great shit it does. Yeah i've been using a browser
for 20 years, they're just not that special you know. The pocket.
I don't even know what it is but I know I don't want it. Fancy
url rendering: look it's just a bit of text, I want to see what it
is, you don't need to hide the http or the www or the com. It
doesn't make it any more usable or more readable. Multi process;
which just means heavy swapping. Javascript is just such a stupid
language the last thing anyone should be doing is encouraging more
of it.
cmake, got what a load of junk. After years of failures I finally
am able to actually compile some projects with cmake out of the
box. But it's still usually fucked and the only solution seems to
be to go edit the usually non-documented application-specific
configuration files.
C++ is such a shit language. Templates are joke. You
have<to::tell::the::compilerExactlyWhat>youMean.EveryTime(YouUseIt(X)).
And it's still uses gobs of memory and runs like a total pig and gives overly verbose and low-information error dumps.
Everyone uses templates because it inlining absolutely everything
makes the fastest and most efficient code. Actually, no, it
really doesn't. Over-inlining starves registers (particularly on
the terminally register-starved x86 platform), blows out the L1
cache, and wastes memory and disk space. Sure you can use it
judiciously but but the time you've done that you may as well have
just use something else.
Python. Spaces are important. Oh no, actually the number of
white-space characters are important. I haven't got time to wait
for the molasses of a cursor to
step-one-space-at-a-fuck-ing-time-to-get-to-where-the-action-is.
I saw a comment on reddit yesterday about how python is so cool
because of the indenting. It really isn't. With C or any other
sane language you can type whatever you want and the compiler can
work it out. The bit that python is missing is you can then
automatically format it in various more readable ways that show
exactly how the compiler will treat it. With python you're stuck
with what you see. Add a block? Now you how to indent everything
else and make sure you do it right, now just add a couple of
braces and hit re-fromat. Ugh.
This isn't even the real problem with python. I recently had to
try to compile pytorch, a 'lightweight' (ROFLMAO what the fuck?)
artificial neural network engine and the abomonation that was once
caffe. The webpage says use anaconda. Ok 4-fucking-gigabytes
later I get something that doesn't work; this version isn't
compiled with the features I need (probably because of fucking
software patents and ffmpeg). So I spent a couple of days trying
to get it to build from source (actually I tried that first but
had to go back to it). Fighting with cmake and whatnot I finally
get it compiled. But then the python doesn't run. Just a
meaningless backtrace (which is pretty much my experience with
anything python). Ok so some package isn't there. Ok so ... one
uses pip or whatever to install it like you might with apt or yum.
Except it isn't really like apt or yum at all. It isn't even like
slackpkg. All it does is download a tar and drop in on the
filesystem. Dependencies? What are they? Versions? Hah, surely
you jest! It's just a bunch of files on the system.
So you run the script again. Again it breaks. Again you install
another package. Then lmdb wont even bloody install because
setup.py is broken. The internets claim this is a pip bug but
then you fuck up and end up with 3 copies of pip installed (/usr,
/usr/local, ~/.local) ... and well you fuck it all off and start
again. Eventually it turns out lmdb is a compile option hidden
somewhere inside the cmake scripts of the fucking custom OpenCV
build you had to compile when you started on this journey a couple
of days ago.
Javascript. I mean I can't even manage to encompass the size of
the fuckup of this conceptually let alone put it into words.
Web page feedback. You go to AMD or wherever to look up some
hardware or software details and before the front page has even
loaded the page goes white and an overlay asks you to provide
feedback for your 'web' experience. Here's some fucking feedback:
/"\
|\./|
| |
| |
|>~<|
| |
/'\| |/'\..
/~\| | | | \
| =[@]= | | \
| | | | | \
| ~ ~ ~ ~ |` )
| /
\ /
\ /
\ _____ /
|--//''`\--|
| (( +==)) |
|--\_|_//--|
Jesus, even that was hard to find: nobody knows what the fuck
ASCII is anymore. It's all windows-cpfuck-you-too or some other
junk.
Highly related is a web-site that keeps asking you to subscribe.
Or keeps giving you 'helpful hints' on how to use it. You know
what, I can read, if it's not obvious it doesn't fucking matter. I
don't need to be stepped in turn through every single feature that
is so unique it's on every other bloody website. I'm sure you're
very proud of getting it to work, well good for you. Now you just
fucked it up by bugging the fuck out of me.
Or bloody achievements. I just don't bother with any forums
anymore but I am on the FSF member forum although it's pretty dead
and I just don't fit in (story of my life). I mean I get a
happy-face-stamp for writing my first post. Ok thanks I guess?
What am i, in grade fucking one? There's enough trouble with the
world infanticising everything, do we really need it on every
bloody forum we visit? I know i know, it drivers user engagement
and all the metrics. Well fuck the metrics.
Speaking of metrics, why the fuck can't the yanks get with the
programme? Even when they do use them they can't spell it
properly. It's a bloody metre, a meter is something with a dial.
Fuck the freedom units.
Modern user interfaces. aka the hamburger menu. aka the stack of
shit menu. Yeah so apparently a small row or column of easly read
text is too much for the kool kidz these days and it's all hip 2D
iconography and animations and shit. It's a shit style that will
age poorly. Every time a site or application updates menu's go
missing or get moved somewhere else. Jesus fucking stick to
something you don't have to actually change this this just for the
sake of it. You know that people will just get used to whatever
you have; the cognitive overload of having to relean it every time
massively increases the real costs of any 'improvement' you might
make.
Glaring white pages. Well this goes back to the original internet
explorer, M$ windows, M$ word (yes I said itt! There's good solid
reasoning behing it!). I regularly spend 12+ hours a day on a
computer, if I was being blasted with that white shit I would be
blinder than I am already. I don't get headaches for a reason.
Constant updates, rolling releases, fluid apis. It's just too
much. Sometimes things are just done and don't need updating
anymore. But if you do that you dissapear off the google search
results. You're no longer part of the zeitgeist! Heavens! Can't
have that! No you need a fancy logo and an annual rebranding
`event' and need never maintain anything again, just throw out a
new release when you've got enough new buttons to make it look
like it matters! Gotta rush to the end of that sprint!
If you can't keep up, well you've just been out-evolved by the
equivalent of a telephone sanitiser.
There was more i'm sure but that'll do for now.
Oh hang on, one more rather important one in my notes. Back to
firefox. I normally run firefox which is finely tuned to be
actually usable. I don't use many plugins - disable javascript,
disable fonts and colours and that's about it. But I customise
the fuck out of everything else: no fucking animations of
smooth-this-or-fucking-that, middle-click to open a url, etc etc.
I even have a userContent.css that hides a whole of shit on sites
I commonly use (like the banners on stackoverflow and so on). But
I tried a naked profile in order to test out the stylesheets i've
worked on for code and docs. Boy what a fucking joke. All the
reminders about shit I don't want. They even fucked up ctrl-tab
so it goes forward but wont return where you came from (isntead it
opens a menu with very different keyboard navigation) - i mean
what's 25 years of convention for anyway?
But the most alarming one was when I went to download my own
fucking sourcecode. Apparently a tar.gz file is now a dangerous
download.
Yeah fuck off mozilla you bunch of stupid cunts.
Now I just feel like shit, this has been really depressing.
Sad face.
blogz 0.3
I did a bit of work on blogz today and pushed out a 0.3 release.
This is not really tested and i'm still not using in on zedzone
but I will be testing it and perhaps do another 0.x release if I
have bugs. But the reason it's out now is that this is the
snapshot base that I will load into git and then work on that from
now on.
If it all works out the next step will be working on using a
database for indices - at the very least it will be some lmdb
thing, but it may go the whole hog and start to look at using crez
for the backend so I can look into some more interesting ideas,
although this is a lot more work. This will also allow me to look
into a comment system eventually.
See the blogz homepage for the
downloads as usual.
L-O-L like anyone gives a flying fuck.
Moar Sauce!
I've now moved my main projects over to using modular Java 11 as
well as using git. I've quite a few projects remaining.
I'm undecided on many of them whether they just go in the bin,
get updated and published (and abandoned) or get worked on
actively again.
-
blogz
is almost ready.
This is mostly a git setup but I'm not actually
using it yet on this site either, although i've synchronised the
code between them now. Once I do both it'll be there.
- playerz is a mess.
This is in the midst of a large amount of changes and I
lost track of where I was at. Its a fairly interesting
little project though so I will get back to it.
- izlib is unfinished
I think I discussed this years ago but basically it's
unpublished and well from unfinished. It's basically an image
processing toolkit and an experimental platform for Java Streams
and API refinement.
I don't really have any plans but I
noticed the other day there's a lot of code just sitting there
rotting away.
- mediaz is archived
mediaz was a project I had on google code which was sort of
a predecessor to izlib as well as a collection of early JavaFX
experiments which are more or less tutorials for others. And
the basic start of a layered bitmap image editor (imagez).
It was about the only project anyone asked about when google
code went down. I have the repository archived so I guess I
could at least convert it.
- termz is unfinished
This was a little fun project using OpenCL to drive a
terminal emulator display. It's pretty much pointless and could
be done directly with OpenGL.
- cdez is rotting
I actually have a C version of dez 1.2 too. It could be
updated and published or something.
- rez and crez are complicated
rez is a project that implements a `personal' versioned
blob store. It supports free branches and cheap copies
(iirc), renames, metadata and all that jazz, and uses dez for
compact storage. Berkely DB (JE) is used as an embedded
database.
This is a project i've worked on and off for years (15?+) with
separate C and Java implementations. The original driver was
for a web CMS. The last time I worked on it was a year ago
porting it to C (including a C versin of dez), for this very
blog - although in the end I stayed with the simple blogz.
It's explicitly not `enterprise' oriented on purpose.
There's probably plenty of alternatives around now and I don't
really know what to do with it. I never quite got the API
down to the point I was really happy with. The C version
should probably use lmdb now.
What I have done actually works though so I should probably
drop it out somewhere. The problems aren't all that
complicated but I think I solved them in a fairly tidy and
compact and reusable way.
- libeze - 2.0
I just need to drop this into git as is.
This was also in the midst of a large number of changes but I
kept them out for the last release. But I have a good number
of bits and pieces I can add to it. playerz is the main
driver here, but that doesn't need much.
- SynthZ - the popular one
I'm not doing anything with this, but it has been
downloaded fairly often. Maybe I will git it.
I did learn some more about SourceDataLine so the soft
keyboard is real-time now!
- wanki is who the hell knows
This was a wiki engine using texinfo markup and with the
abiity to properly organise and export multi-page documents. It's
always been pre-alpha and gone through a number of varients, from
C to JavaEE.
It was the original driver for what became (c)rez.
Who knows, maybe one day, I still think it's got some merit.
Wiki's still have major trouble with multi-page documents
ordered like a book.
- DuskZ is unfinished
An attempt at working on a simple game. I got caught up in
pointless (de)serialisation stuff and other sort of unimportant
details. I'm sorry to the lad that I chatted to about it. He is
a nice fellow. But it was sort of at a bad time personally and I
just dont have the interest to work on it anymore. I hate letting people down.
I don't know if there's really anything salvagable at this
point as all I did was break a bunch of stuff.
- socles is archived
This was an OpenCL image processing library. Nobody cared
and eventually neither did I. There might be something salvagable
for an eventual izlib backend, maybe.
- low level arm code, puppybits, zedos
puppybitz is somewhat stale but maybe there's some stuff useful there.
zedos stopped when I hit the USB driver. Fuck intel for making that junk.
- paralella is DEAD TO ME
The parallella stuff i'm not longer working on and it will
be staying where it is (at least it's published).
I still have some boards but the whole thing soured me on
kickstarter and I will never put money into anything of the
sort again. It seems it's mostly used for these sort of
projects as a 'lets get some zero-cost bridging finance so our
real investors have more confidence' which is pretty much
fucked-up capitalism at it's finest.
Wherein capital
should be the one taking risks to invest in capital to make
more of it. This instead is, well, get the plebs to give us
free money and take all the risk for no return!
- Android apps are going nowhere
Again the source is already out, it's unlikely I will do
more but whether I do will be on a case-by-case basis.
I'm sure there is more - that's just what I found from my archive
of google code and a quick look at what I have sitting on THIS
computer. I've got drives and backups elsehwere, who knows.
Probably if anything I should start checkpointing more often and
dumping shit on code.zedzone.au. Until I had that I didn't
really have anywhere to put the random otherwise not really
publishable-in-themselves experiments which abound.
It will be a while before this list is fully processed.
Copyright (C) 2019 Michael Zucchi, All Rights Reserved.
Powered by gcc & me!