Kinect2 device for FFmpeg, libfreenect2
I've been working on a kinect2 device backend for FFmpeg the last
few days. Actually it's only about a day's work so far and i've
got the code talking to the c++
(sigh) libfreenect2,
building, and most of the glue written - I just haven't tested it
with the hardware yet. Hopefully it will be quite straightforward,
but FFmpeg is a fairly complex library and there are a lot of
details that could be wrong.
One feature it has is that the jpeg frames are not decoded, which
means cheaper recording and no loss of capture quality. I had to
make some minor modifications to libfreenect2 for this to be possible.
It exports 3x streams: the RGB data as jpeg, the IR data as
grey16, and the depth data as grey16. I have options to enable
various subsets of these streams, so for example depth+ir decoding
can be skipped as it requires a good amount of flops. Cameras are
referenced by serial number or index. Device queries work as do
some basic capture settings. I'm also considering other options
which libfreenect2 provides such as streams with the depth/rgb
aligned to each other.
Once I have it at a working state and ported to git head (FFmpeg
git was down when i started working on it) I will see if FFmpeg is
interested in it. The fact that it requires c++ and a patched
libfreenect2 might be a downside but there is already a c++ device
in the source tree. Otherwise i'll just upload it
to code.
This was going to be for work but they decided they'd rather use
some junk matlab (ffs) on a shit platform (fffs) so i'm a little
annoyed about the whole thing. While it should be possible to get
this to work on their chosen shit platform as well it's a bit more
involved.
Parallel Streams, Blocking Queues
I've been using Java Streams a bit to do various bits of work.
One useful feature is the ability to go wide using parallel
streams. But I've often found it doesn't perform all that well.
- It only uses one thread per cpu thread so i/o heavy tasks are
underworking the machine;
- fork/join breaks work up by powers of 2 and not all jobs fit
this very well;
- Although it uses work stealing it is still basically
statically scheduled;
- Overheads.
I have written a Stream Parallel.map(Stream, Function) call which
wraps a stream mapping process in one that farms it out to a pool
of threads and recombines it afterwards. This works well for some
tasks, particularly as you can set the number of threads, but it
is still quite coarse and you can't recursively call it (actually
you can, it just launches lots of threads).
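The post doesn't include the Parallel.map source, so the following
is only a sketch of the idea - farm the mapping out to an explicit
fixed-size pool and recombine the results - not the actual
implementation:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class Parallel {
        // Map src through fn on an explicit pool of nthreads threads
        // rather than the common fork/join pool, then re-stream the
        // results. Collection is eager, so ordering is preserved.
        public static <T, R> Stream<R> map(Stream<T> src,
                        Function<T, R> fn, int nthreads) {
                ExecutorService pool = Executors.newFixedThreadPool(nthreads);
                try {
                        // submit one task per element
                        List<Future<R>> tasks = src
                                .map(t -> pool.submit(() -> fn.apply(t)))
                                .collect(Collectors.toList());
                        // wait for and gather all results
                        List<R> results = new ArrayList<>(tasks.size());
                        for (Future<R> f : tasks) {
                                try {
                                        results.add(f.get());
                                } catch (InterruptedException
                                                | ExecutionException e) {
                                        throw new RuntimeException(e);
                                }
                        }
                        return results.stream();
                } finally {
                        pool.shutdown();
                }
        }
}

This also shows why it is coarse: every element becomes a task and
the whole stream is realised before the result can be consumed.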
Anyway so i'm trying to think of a way to break up multiple
levels of parallel streams into smaller blocks of tasks that can
be more freely scheduled, whilst trying to fit it within the
Stream processing model, which is pull oriented.
I'm still nutting out the details but for now I have written a
lockless, bounded, blocking, priority queue. It only supports a
fixed set of discrete priority levels.
I did some micro-benchmarks against ArrayBlockingQueue (without
priority) and i'm surprised how well it worked - from about 5x to
20x faster depending on the contention level.
Each priority level has its own lockless queue implemented using
an array and 3 counters. The array is accessed using cyclic
addressing so all operations are O(1).
static class Queue<T> {
        volatile int head;   // next slot to publish (put side)
        volatile int tail;   // next slot to consume (poll side)
        volatile int alloc;  // next slot to reserve (put side)
        final T[] queue;     // power-of-two sized array
}
The trick is that it doesn't implement any waiting operations
because it uses external arbitration to avoid the need to.
This makes put()
particularly simple. I'm using
pseudo-code atomics below, but these are implemented using
VarHandles.
void put(T value) {
        int a = atomic_inc(alloc);
        int b = a + 1;

        volatile_set(queue, a & (queue.length - 1), value);

        while (!atomic_cas(head, a, b))
                Thread.onSpinWait();
}
First, the allocation cannot fail; it simply assigns a working
slot for the new item. The slot is then filled, and the
atomic_cas() (compare-and-set) is used to ensure that the head
pointer is incremented sequentially regardless of the number of
threads which reserved slots.
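Since the post says the pseudo-atomics are implemented with
VarHandles, here is a sketch of what the real put() might look
like - illustrative only, assuming the backing array is created
as (T[]) new Object[size] inside the Queue class above:

import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

// inside Queue<T>
static final VarHandle ALLOC;
static final VarHandle HEAD;
// element access; works because T[] erases to Object[]
static final VarHandle ARRAY =
        MethodHandles.arrayElementVarHandle(Object[].class);

static {
        try {
                MethodHandles.Lookup l = MethodHandles.lookup();
                ALLOC = l.findVarHandle(Queue.class, "alloc", int.class);
                HEAD = l.findVarHandle(Queue.class, "head", int.class);
        } catch (ReflectiveOperationException e) {
                throw new ExceptionInInitializerError(e);
        }
}

void put(T value) {
        // atomic_inc(alloc): reserve the next free slot
        int a = (int) ALLOC.getAndAdd(this, 1);
        int b = a + 1;

        // volatile_set(): store the item into its slot
        ARRAY.setVolatile((Object[]) queue, a & (queue.length - 1), value);

        // atomic_cas(head, a, b): publish reserved slots strictly in order
        while (!HEAD.compareAndSet(this, a, b))
                Thread.onSpinWait();
}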
The poll()
method is slightly more complex.
T poll() {
        int h, n, t;
        T node;

        do {
                t = tail;
                h = head;
                if (h == t)
                        return null;
                node = volatile_get(queue, t & (queue.length - 1));
                n = t + 1;
        } while (!atomic_cas(tail, t, n));

        return node;
}
First it checks if the queue is empty and if so simply exits. It
then takes the item at the current queue tail and updates the
tail counter. If the tail pointer changed because another poll()
completed first, then it just retries.
The order of reading the head and tail counters is important here!
If tail is read second it is possible to double-read the same
value.
This isn't a full queue implementation as a number of important features are still missing:
- Limiting the number of put()s so that the queue isn't overwritten;
- Blocking on full-write when the queue is full, without busy-waiting;
- Blocking on empty-read when the queue is empty, without busy waiting.
All of these can be almost trivially implemented using a pair of
Semaphores and an atomic integer; a sketch follows the list.
- A semaphore with (queue.length-1) reservations limits put()
calls; a successful poll() releases a reservation.
- The same semaphore also provides the blocking on full-write.
- An atomically updated counter and another semaphore are used to
implement wake-up on empty-read.
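None of that layer is shown in the post either, so as an
illustration only, here is the simpler two-semaphore variant (the
author's version replaces the item semaphore with an atomic
counter plus a semaphore, so readers are only woken on an actual
empty-read):

import java.util.concurrent.Semaphore;

class BlockingWrapper<T> {
        final Queue<T> q;       // the lockless queue above
        final Semaphore space;  // free slots: blocks put() when full
        final Semaphore items = new Semaphore(0); // filled slots

        BlockingWrapper(Queue<T> q) {
                this.q = q;
                // (queue.length-1) reservations limit put() calls
                this.space = new Semaphore(q.queue.length - 1);
        }

        void put(T value) throws InterruptedException {
                space.acquire();   // waits while the queue is full
                q.put(value);
                items.release();   // wake one blocked reader
        }

        T take() throws InterruptedException {
                items.acquire();   // waits while the queue is empty
                T node = q.poll(); // non-null: a permit means an item
                space.release();   // a slot is free for a writer again
                return node;
        }
}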
It's a bit tricky to benchmark and the results are quite noisy
despite setting the CPU to a specific (low) clock speed.
But in general this is around 10x faster than using
ArrayBlockingQueue for "low-contested" situations (6x writers, 6x
readers on a 12x thread cpu). In a "high-contested" situation
(32x writers, 32x readers), it's more like 15-20x faster, and
scales better. Despite tight loops the ArrayBlockingQueue is
unable to saturate the CPU resources (via top) and much of the
time is spent in overhead (?somewhere?). Profiling in netbeans
didn't offer any particular insight on where.
These are of course highly-contrived situations but the
performance was a pleasant surprise. It might not work on systems
with a weaker memory model than AMD-64 but I don't have access to
such exotic systems.
This still doesn't solve the base problem I was working on but
it might be a useful part thereof.
jjmpeg callbacks
Well I went and sorted out the AVIOContext callback code rather
than posting the previous post so there is more of a gap than the
date would imply ...
It turns out my idea to pass a reference to this to open2() was a
bit dumb! Because of course open2() is the routine which creates
the pointer that this refers to in the first place. Wrong!
Anyway the weak references were on the right track, and once I got
it working further testing showed that I had to use a weak
reference for the custom i/o handlers as well. In many cases it
would still have worked but if you created a non-static anonymous
class inside a method which stored the AVIOContext as a member
variable then it would leak all three objects. A non-static
anonymous inner class keeps a reference to this, which keeps a
reference to the AVIOContext which keeps a reference to the
non-static anonymous inner class.
So anyway the C now just uses weak references and the Java has a
holder for the AVIOInterrupt or AVIOHandler object with which it
is associated and some exploratory testing seems to confirm it
works as expected.
I reverted the nativez changes too, the reference holder wasn't a
completely terrible idea but it was duplicating existing JVM
functionality unnecessarily.
jjstuff n stuff
Well i've mostly been busy with work for the last couple of weeks.
I more or less ended up taking a couple of weeks off around Easter
and ANZAC Day so I had to do some more hours. There's no
deadlines at the moment and I was waiting for some resources to
become available so there wasn't much else to do anyway, and
between all the hacking I did I managed to get a little of the
waning sun before winter really hit a week ago.
I did a little bit on jjmpeg over the weekend. Partly as I was
using it to prototype some ideas I was evaluating for work; I will
probably end up doing it in C but it's an easy platform to
experiment in.
I did a fair bit of cleanup around JJMediaWriter, although it
still needs work. Video works well but audio is probably broken.
The VideoPlay demo is a little nicer and I moved to the java.time
stuff for date formatting. Fixed a bug with error handling.
Another bug I fixed was with the initialisation of native methods
for functions which don't derive from AVObject. AVObject contains
the code which loads the native library so if you don't use it
first you don't get any native linkage. I hacked a simple
mechanism using a NOP function that seems to work properly. You
can't just load it in another class.
JNI Callbacks
I spent a bit too much time working on filling out AVIOContext to
make it more 'complete'. AVIOContext is used to read/write files
and network connections and allows for user-supplied callbacks to
implement custom i/o. jjmpeg supports both ffmpeg and custom i/o
but there was an updated open2() api for the former which takes an
interrupt-check callback when opening a context. This posed a
problem because for the custom i/o case I had a free pointer
(opaque) which I use to track a hard reference to the Java side
callbacks, but for the ffmpeg i/o case it is used by ffmpeg
itself.
jjmpeg relies on the object pointers being the actual objects so I
can't easily put this through a proxy object that contains both
pointers. Actually there is only one place it is used but even
then it would be very messy to make it work sanely.
The solution I came up with was to have the Java object retain a
reference via another NativeZ managed object which retains any
number of JNI side references which gets freed JNI side when it
gets cleaned up. I then just poke an instance of this onto the
AVIOContext so it lasts at least as long as the AVIOContext
itself. I put this manager object into NativeZ because it might
be useful elsewhere.
While it should work, it does seem a bit messier than it might
be. Just now I thought perhaps I could just store the Java
listener in the Java object, but the C callback still needs a
reference to this to retrieve the listener, and it can't hold a
hard reference otherwise the object will leak. But I just looked
up the JNI docs and found the WeakGlobalRef API so that should
probably work. I will try it and if so I will use that instead
and roll back the code I added to NativeZ as well.
Documentation & The Rest
I also started a longer term and possibly pointless exercise of
cleaning up the accessor methods and documentation. The accessor
methods just access the struct fields directly but most fields are
also accessible via the pretty messy libavutil/opt.h interface
which is mirrored in part in AVOptions. Originally I went through
the structures and just inserted the ones that looked like they
might be useful but there's a lot of cruft in there and AVOption
covers all the important ones. So I went through AVCodecContext
and retained the useful looking ones and cleared out the rest. I
also cross-referenced the options table and documented the
corresponding names where they exist.
After that I got the javadocs to build. I had to fix a bunch of
badly formatted comments before I read the javadoc manual and a
bunch of broken comments that get imported by the constant
extractors. It all looks pretty bare though. It will take a lot
of work to make them something worth using. Will see, will see.
I even did a little OpenCL today. Just a tiny bit of code to run
a random fern inferencer. It's been so long since i've written
any it took a few iterations to get things going. Didn't need to
make any changes to zcl though, so that's a plus. But even with
all the niceties of zcl it's still a bit, well, messy. The CLTask
is a good kernel of an idea (pun intended) but there might be ways
to build on that to be something a little more convenient. How
deep I look into this will depend on how much OpenCL I end up
writing but that probably won't be much.
Tutorial
Oh another thing I did over the weekend was some Java tutoring for
a mate. He's doing JavaFX, so yeah it wasn't much effort even
with the hangover I had acquired from the night before.
I was somewhat surprised at how bad the assignment was though.
For a first assignment the scope wasn't so bad but right from the
start things were a bit strange.
Must be Eclipse, must be JavaFX11. Why JavaFX11? Who knows, it's
not really that different as a toolkit compared to JavaFX8 apart
from being a much bigger pain to install not to mention the
complication with the module stuff. Just what you want for an
introductory course. Eclipse is pretty much total pants too if
you ask me.
There is a page for an eclipse module or plugin or whatever it is
but the 'for the lazy' page is so short it seems to be missing all
the details. And the screenshots are so low resolution you can't
read any of the text. The only remotely likely looking available
option didn't match the description and looked wrong - and it
didn't work when you installed it regardless. We did get it
running but I had to put in JVM arguments and frob around a bit.
Admittedly I need to do that in netbeans too but at least it
supports modules and you don't need to use explicit --add-module
JVM args.
(It being an apple machine, I wasn't familiar with its weird
filesystem hierarchy and the funky mess of a thing 'Finder' has
turned into, so that didn't help either.)
Back to the assignment itself. Rather than listing the
requirements, it stepped you through the parts that needed to be
done. This is after first insisting that it wasn't a guideline of
the steps to follow - even though you couldn't do the assignment
without following them. In addition to this odd idea, the
formatting was completely miserable, like an email where carriage
returns have been inserted randomly between sentences. What text
there was was also often contradictory, and when it wasn't it
offered questionable solutions to common problems.
For example one paragraph would talk about placing code in a
constructor then the next sentence would say the same thing needed
to go in some other method. In one case this was talking about
the root Application where you pretty much never use the
constructor and you put this stuff in start() so it was just
wrong. In another case it talked about creating a 'XYStage' and
then the code suggested for the constructor is 'stage = new
Stage()'. It even mentioned deriving from Stage although not
consistently! For the rest of the assignment it kept talking
about this stage variable which shouldn't exist. I really don't
like using the word wrong for software since so many ways will
work - poor, snot, junk, totally shithouse, sure all of those, but
there really was a lot of utter wrong here!
There were even strange stylistic choices that indicate the
author doesn't write or read code in any language. Who creates
any rectilinear object like a Window using (sizex,sizey,x,y)
syntax rather than (x,y,width,height)? And let's start out
teaching bad ideas like using undecorated windows so you can
implement your own buggy and non-accessible window manipulation
features yourself. Garish colour suggestions, check! No
namespaces. Variables which change names or types from paragraph
to paragraph, got you covered there too.
Bit bored tonight. I almost did some more work-related stuff but
decided I needed a break from it. Not much else to do though
which is why I posted this. Well it looks like i'm now playing
with jjmpeg for a little while at least.
Minimalistic Perfect Hashing
As a small diversion I played a little bit with minimal perfect
hash functions over the weekend.
The goal here is to convert a static list of words into a token
value which could be used to implement a string switch statement
or other such uses.
The GNU gperf tool is a production-ready general-purpose solution
to this problem but I wanted to see if I could reduce the code
size and try a more naive approach.
The approach is to just perform an exhaustive search over a
limited set of possible functions. The function I am using is a
multiply and shift. The input is an integer formed from the next
1, 2, or 4 bytes of the word. I map these to a power-of-two table
and just find a combination which produces no collisions over the
table size. Then there are a couple of tables used to map the
hash to an index and to a string, the latter used to verify the
word actually matches and not just its hashcode.
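The post doesn't show the generator itself; a sketch of the
exhaustive search it describes might look like the following
(Java just for brevity here, and the search bounds - odd 16-bit
multipliers, shifts 0-31 - are my own choice):

import java.util.Arrays;

class PHashSearch {
        // Form an integer from up to nbytes leading bytes of the
        // word, the same way the generated C code does.
        static int sample(String word, int nbytes) {
                int v = 0;
                for (int i = 0; i < nbytes && i < word.length(); i++)
                        v = (v << 8) | (word.charAt(i) & 0xff);
                return v;
        }

        // Exhaustively search for a multiply-shift that maps every
        // word into a power-of-two table with no collisions.
        // Returns { multiplier, shift } or null if none is found.
        static int[] search(String[] words, int nbytes, int tableSize) {
                int[] v = new int[words.length];
                for (int i = 0; i < words.length; i++)
                        v[i] = sample(words[i], nbytes);

                boolean[] used = new boolean[tableSize];
                for (int mul = 1; mul < 65536; mul += 2) {
                        for (int shift = 0; shift < 32; shift++) {
                                Arrays.fill(used, false);
                                boolean ok = true;
                                for (int i = 0; ok && i < v.length; i++) {
                                        int h = ((v[i] * mul) >>> shift)
                                                & (tableSize - 1);
                                        if (used[h])
                                                ok = false;
                                        else
                                                used[h] = true;
                                }
                                if (ok)
                                        return new int[] { mul, shift };
                        }
                }
                return null;
        }
}

Run over the 32 C keywords with nbytes=4 and tableSize=64 it finds
solutions of the same form as the (54991, 12) pair used below.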
Using the 32 keywords for the C language this is one solution that
takes only a single multiply-shift stage:
#include <string.h>

int hash_keyword(const char *word)
{
        static const char *words =
                "char\0" "long\0" "register\0" "if\0" "signed\0" "unsigned\0"
                "float\0" "switch\0" "int\0" "for\0" "while\0" "volatile\0"
                "static\0" "auto\0" "do\0" "union\0" "enum\0" "typedef\0"
                "struct\0" "sizeof\0" "const\0" "extern\0" "case\0" "default\0"
                "return\0" "break\0" "continue\0" "short\0" "else\0" "void\0"
                "double\0" "goto\0";
        static const unsigned char index[] = {
                166,   0, 166, 166, 166,   4,   8,  16, 166,  18, 166, 166, 166, 166, 166, 166,
                166,  24,  32,  37,  43, 166,  46, 166,  49, 166, 166,  54,  62, 166,  68,  72,
                 74,  79, 166, 166,  83,  90,  96, 102, 107, 166, 166, 113, 166, 117, 166, 124,
                130, 166, 166, 166, 135, 166, 143, 166, 166, 166, 166, 148, 152, 156, 162, 166,
        };
        static const char value[] = {
                 -1,   3,  -1,  -1,  -1,  17,  18,  15,  -1,  21,  -1,  -1,  -1,  -1,  -1,  -1,
                 -1,  28,  12,  25,  16,  -1,  13,  -1,  31,  -1,  -1,  30,  23,  -1,   0,   7,
                 27,  10,  -1,  -1,  26,  24,  22,   4,  11,  -1,  -1,   2,  -1,   6,  -1,  19,
                  1,  -1,  -1,  -1,   5,  -1,  20,  -1,  -1,  -1,  -1,   9,  29,   8,  14,  -1,
        };
        const unsigned char *p = (void *) word;
        unsigned int h = 0;

        {
                unsigned int v = 0;

                for (int i = 0; i < 4 && *p; i++)
                        v = (v << 8) | (*p++);
                h ^= (v * 54991) >> 12;
        }

        h &= 63;

        return (strcmp(word, words + index[h]) == 0) ? value[h] : -1;
}
This is not a minimal perfect hash function because the hash size
is 64 for 32 input words, but it does calculate the result using a
single hash table calculation and probe.
Using smaller sample inputs and more steps provides more solutions
but it requires more code at runtime. This one compiles into
.text=87, .rodata=352 bytes.
By comparison equivalent functionality using gperf creates 595
bytes of .text and constants and 720 bytes of .data. The primary
reason it is so much larger is because it stores the word and
index in a structure which will take at least 16 bytes extra per
entry (8 bytes for the string pointer, and 8 bytes alignment
including the index value). This more than offsets the smaller
table it uses.
The runtime should be comparable although being 1/3 the size
should help the cpu caches.
It's slow to search and might not produce a solution but with some
tuning it has worked for a limited number of cases i've tried.
There are some other options it could try:
- Use the word length. This finds more solutions;
- The samples needn't be linearly spaced;
- Other operations apart from multiply-shift;
- Other combining operators than EOR.
I think it's already as minimal as possible in terms of .text and
.data size, at least on amd64.
As a further comparison I compared a smaller problem to using a
linear search. This case was just for 6 strings. Apart from the
hash generator I wrote a cascaded if-else-if and a compacted linear
search where all strings are stored in 1 string with embedded nuls
used to separate them.
              .text   .(ro)data
    phash        79          64
    if-else     206          41
    linear       74          42
So the only one that's smaller uses atypical coding anyway.
Blogz Live
The blog is now using blogz as
the driver.
The code is mostly the same so there's no big differences but I
did change the tag-history urls to use the path for the postid
rather than a query string. So those have changed but as they
should be excluded from indexing and it doesn't make sense to use
them as a link it shouldn't break anything.
I didn't do a lot of testing - there isn't a lot to test - but
if i find things i'll fix them.
I did have a few small bugs I had to fix so i'll have to do
another 0.3 series release I guess.
I'm doing stuff on 0.4 now, so it might not be far away. That'll
move to using a database as the index, which again won't make for
very large visual changes although I will be able to add titles
to the newer/older post links. Once that's done I can look at a
bunch of other stuff from comments to vote buttons.
And another thing ...
What about memory mappable tagged format?
Whilst this couldn't be done for a struct mapping it would be
useful for arrays.
Not hard to add, but needs agreement on both ends. Build-time
decision? Runtime decision? Stream meta-tags?
The puzzle never ends.
Super Cereal!
I got way too caught up with writing a new serialiser over the
last couple of days. Actually I finished off another one I had so
I ended up with two.
There are two cases i'm interested in. One is tight coupling
where simplicity and performance outweigh extensibility;
basically for IPC. The other is where extensibility and size are
the main considerations; for object serialisation / data storage.
So I have an XDR-like implementation for the former. The layout
of items is the same as XDR (sans mistakes) but it uses native
ordering of elements, so i dubbed it XDRN for xdr-native.
For the latter i have -yes- yet another tagged format. Each field
is tagged and each object is also a tagged container. The header
is at least 2 bytes - a control byte and a tag byte. I can't be
bothered typing out all the details - here is what I have in the
source code at the moment.
This is a streamable self-describing byte-oriented binary format.
It is a general purpose format and supports a super-set of the
ez_blob descriptor. It supports primitive and struct types and
sequences thereof and there is room for extension.
Each item begins with a descriptor byte, followed by a tag id, a
possible count, and the payload.
  xxxxttcc control byte

  xxxx type code
     0  uint8    unsigned int, value zero-extended
     1  uint16
     2  uint32
     3  uint64
     -  reserved
     5  float16
     6  float32
     7  float64
     -  reserved
     f  struct

     note that for int/float types, (code & 3) == log2 of element
     size in bytes

  tt log2 of tag size in bytes
     0  1 byte
     1  2 byte
     2  4 byte
     3  reserved, no tag?

  cc log2 of count size in bytes, used to indicate sequence
     length or non-sequence
     0  1 byte
     1  2 byte
     2  4 byte
     3  none, single item follows

  ff is struct-end code
A header is a control byte followed by an optional 1/2/4 byte-length tag,
followed by an optional 1/2/4 byte-length count.
A structure payload is a list of tagged fields until a struct-end
code. A structure sequence is a list of count struct-encoded blocks.
Integers can be stored in the smallest number of bytes, i.e. with
all leading $00 bytes removed.
So basically each field has a type, a tag, and a count. Scalar
values use a special count code so don't require a count value.
It also differentiates between scalars and single-item sequences.
Sequences all have a count and no end sentinel.
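To make the layout concrete, here is a sketch of the header
encoding - in Java rather than the actual C implementation, and
the little-endian byte order for the tag and count fields is my
assumption, as the post doesn't specify it:

import java.io.ByteArrayOutputStream;

class TagHeader {
        // log2 of the bytes needed to hold v: 0/1/2 -> 1/2/4 bytes
        static int sizeCode(long v) {
                if (v < 0x100) return 0;
                if (v < 0x10000) return 1;
                return 2;
        }

        static void writeLE(ByteArrayOutputStream out, long v, int nbytes) {
                for (int i = 0; i < nbytes; i++)
                        out.write((int) (v >> (8 * i)) & 0xff);
        }

        // count < 0 encodes a scalar: cc = 3, no count field follows
        static void writeHeader(ByteArrayOutputStream out,
                        int type, int tag, long count) {
                int tt = sizeCode(tag);
                int cc = count < 0 ? 3 : sizeCode(count);

                out.write((type << 4) | (tt << 2) | cc); // xxxxttcc
                writeLE(out, tag, 1 << tt);              // 1/2/4 byte tag
                if (cc != 3)
                        writeLE(out, count, 1 << cc);    // 1/2/4 byte count
        }
}

A scalar field with a one-byte tag thus costs 2 header bytes
(control + tag), matching the minimum overhead quoted later.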
It's versatile enough to hold most likely structures but isn't
universal. String encoding is application layer. No 128+ bit
primitives (but there is room to add them). No map type, but
there is room to add it (or it could just be application layer).
Probably the only significant limitation is the 32-bit limit on
sequence (array) lengths (for some level of significant!). There
are only 96+1 valid codes defined now so there is room in a
single control byte for some but not all of these extensions,
though it's not likely to be as tidy.
One example: tt+cc only defines 12 codes, so one could swap
tt,cc when tt=11 and thus use all codes and support 1/2/4/8 byte
counts with 1/2/4 byte tags.

  ttcc
  00cc  tag size 1, count size 1/2/4/8
  01cc  tag size 2, count size 1/2/4/8
  10cc  tag size 4, count size 1/2/4/8
  11tt  count size 0, tag size 1/2/4
  1111  spare (primitive) / sentinel (struct)
Ok, maybe that would work, and it's not really any more complex in
the code. It could use a lookup table but shifts would probably
be faster. And this still leaves room for 8 more data types.
I went through a few similar iterations to get to this point. It
has a couple of noteworthy features.
- write streamable
It doesn't need to calculate information it doesn't know
in advance. For example the size of an encoded object. This
was a mess in my initial attempts and sometimes required
multiple recursive passes.
- self describing / read streamable
To be robust to data format changes it needs to be able to
skip over data it doesn't understand. The tag defines the
field so can be used to identify known ones. The data type
and length fields combine together to define the number of
bytes to skip for unknown fields. An unidentified sequence of
structs must be skipped one at a time, but they provide enough
information to do so.
- compact
Well, relatively compact for the features it provides.
Tags and integers only use the significant bytes. The minimum
overhead for scalar values is 2 bytes per field for control+8
bit tag, which will cover almost everything. The minimum
overhead for sequences is 3 bytes (control, tag, count), and
for structures is also 3 bytes (control, tag, sentinel).
Fields all have default values and such values are simply
not encoded into the byte stream.
I dunno I feel it's a bit over-engineered, but I couldn't see a
way to simplify it as I really need that tag. It takes about 2x
the amount of code to implement vs the xdrn implementation
although a lot of that is mapping to the ez_blob descriptor
tables. As it is a self-describing format it may be useful to
have a map or stream based api too, and an implementation of
either would be straightforward.
Internally both use a common robust i/o mechanism which is simple
and reliable. This helps protect against common coding errors
like buffer under/overruns. I may expose this as an api in
itself.
I'm pretty useless at writing tests (can't be good at everything!)
but I have tried to write a more comprehensive set of tests here.
Particularly if i'm dumping information into a database I don't
want it breaking.
I could've used an existing design, but well, where's the fun in
that?