parallel 64 bit tests using NEON
So the next bit of code i've been working on requires a lot of 64-bit bit tests. Essentially using the tests as a 64-element 1-bit lookup table.
My first approach was to take an obvious C implementation, and convert it to NEON and perform the bit tests in parallel.
The C is basically like this:
int i, count=0; for (i=0;i<N;i++) { uint64_t test = tests[i]; uint64_t bit = (uint64_t)1 << bits[i]; count += (test & bit) ? 1 : -1; }
It works, and one can perform the tests - but the tests need to be performed using 64-bit arithmetic - which takes a lot of register space, and a large amount of fuffing to get the data in the right format ('fuffing' is my technical term for data conversion/re-arrangement), and then the results must be converted back to 32-bits to form the selection masks.
So it's a great deal of code, much of which isn't doing anything but fuffing.
But the idea I had last night was to perform the testing using 8 bit values, and then scale these up to 32-bits to form the masks.
If one were writing the above code to avoid the 64-bit accesses in C, you could just use addressing to convert the 64-bit test to an 8 bit test - but on the correct byte instead.
int i, count=0; for (i=0;i<N;i++) { int bitno = bits[i]; uint8_t test = testsbyte[i * 8 + (bitno >> 3)]; uint8_t bit = bitno & 7; count += (test & bit) ? 1 : -1; }
This requires per-byte addressing ... which you don't get for neon memory accesses per element ... except with vtbl.
So the following performs 8x 64-bit bit tests in only 5 instructions, and the actual tests are only using 8-bit arithmetic.
@ d0 has 8 indices to test @ d1 has the 64-bit mask to test against @ d2 is set to 7 in 8 bytes @ d3 is set to 1 in 8 bytes vand.u8 d4,d2,d0 @ test & 7 vshl.u8 d4,d3,d4 @ (1 << (test & 7)) vshr.u8 d5,d0,#3 @ test >> 3 vtbl.u8 d5,{d16},d5 @ data[test >> 3] vtst.u8 d4,d5 @ data[test >> 3] & (1 << (test & 7))
Changing the classifier to use code similar to this (but unrolled/scheduled better) gave me a 100% speed-up on the beagleboard (still unverified code ...).
Unfortunately, I still need nearly as many instructions just to convert the 8 bit mask into 32-bits for a selection mask (i need 32 bits for what i'm doing). But some mathematics might let me change that.