vectors and bits
Updated, see the end of the post
Yesterday I started poking around with the SIMD unit. Wow, is that a way to eat up time or what.
Wasn't quite sure what to do with it, so I played at first with writing an RGB888 to RGB565 converter. Didn't get to testing it, but it brought back memories of the SPU hacking I did before - the instruction set has a lot of similarities, although NEON is filled out more. And like with the SPU, there are so many ways to do the same thing that it can be a bit overwhelming trying to find a good way of solving a problem. Particularly if you don't really know which instructions are there, or what they do. There seem to be some interesting ones though, like vsri, which lets you insert the upper bits of each element into the lower bits of each element in another register (without clobbering the rest of its contents). I still seem to be wedded to the vtbl instruction as I was with the shufb instruction on SPU, although I think it's not always the best route. I really missed the spu_timing tool though - although the issue rules and latencies are simpler.
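To make the shift-right-and-insert idea concrete, here is a minimal scalar sketch in plain C of an RGB888 to RGB565 conversion, done once with ordinary masks and once with a one-lane model of a 16-bit shift-right-and-insert. It is not the untested NEON code mentioned above; the function names (vsri16, rgb888_to_rgb565, rgb888_to_rgb565_vsri) are made up for illustration, and the model only assumes the documented behaviour of the instruction: the top n bits of the destination survive, the rest is replaced by the source shifted right by n.

```c
#include <assert.h>
#include <stdint.h>

/* One-lane model of a 16-bit shift-right-and-insert (hypothetical helper):
 * keep the top n bits of d, fill the remaining low bits with m >> n. */
static uint16_t vsri16(uint16_t d, uint16_t m, unsigned n)
{
    assert(n >= 1 && n <= 16);
    uint16_t keep = (uint16_t)(0xFFFFu << (16 - n));  /* mask of the top n bits */
    return (uint16_t)((d & keep) | (m >> n));
}

/* Straightforward conversion: 5 bits red, 6 bits green, 5 bits blue. */
static uint16_t rgb888_to_rgb565(uint8_t r, uint8_t g, uint8_t b)
{
    return (uint16_t)(((r & 0xF8) << 8) | ((g & 0xFC) << 3) | (b >> 3));
}

/* Same result built from two shift-right-and-insert steps. Each channel is
 * first widened so that it sits in the top 8 bits of a 16-bit lane. */
static uint16_t rgb888_to_rgb565_vsri(uint8_t r, uint8_t g, uint8_t b)
{
    uint16_t x = (uint16_t)(r << 8);          /* red in bits 15..8            */
    x = vsri16(x, (uint16_t)(g << 8), 5);     /* keep 5 red bits, add green   */
    x = vsri16(x, (uint16_t)(b << 8), 11);    /* keep red+green, add blue     */
    return x;
}
```

On a vector unit the same three steps would run on several pixels at once; whether that beats a table-lookup approach is exactly the kind of thing that needs measuring.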
That idea didn't seem to be going anywhere in particular, so I thought I'd look at some specific stuff I need, and for which I have very slow implementations - font rendering and rect fill, although I only got around to looking at rect fill, and that still doesn't work 100%. I just did it using ARM code though. For such an old architecture I was a little surprised at the lack of info available for such tasks - at least as it applies to searching using Google. Maybe it's too old, and the new stuff is hidden away in proprietary and embedded systems, and nobody does software rendering anymore.
And then I totally lost track of the time reading about the DSP ... at 4am I thought it was time to `call it a night' - that's what I get for having coffee and chips for dinner (and in short: there are no free tools to use it, and the Linux driver uses binary blobs - of course).
Today I filled out the rect fill code a bit and tried various implementations, including some NEON variants. Oh, I also `discovered' the performance counting unit - wow, you can track a lot of stuff, from branches taken to cache and memory stats to stalls. Very nice.
Oh NEON. Fucking hell. Spent about 4 hours tracking down why the NEON instructions just threw an undefined instruction exception. After a couple of hours of digging I came across a reference to the Coprocessor Access Control Register, but that didn't really help (oh and a thread on the beagleboard group where people just say to turn CONFIG_NEON to y ... sigh). So here I was trying to turn on clocks and power and other PRCM registers ... and then I remembered something about a bit in a status register to enable/disable the whole shebang. A bit more tracking down (I've got nearly 10K pages of documentation to search now) and I discovered the FPEXC register and VMSR/VMRS instructions (my memory was wrong, but it was a lucky guess). Although the binutils I'm using doesn't support them ... sigh. Finally found a workaround using MRC/MCR from Linux - about the only thing I've managed to find in there when tracking things down (a lot of stuff is so abstracted it's very hard to follow). Gee that was frustrating.
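For reference, a rough sketch of that enable sequence in C with GCC inline assembly. This is not the code from the post: the function and macro names are hypothetical, and the register encodings are assumptions based on the ARMv7-A documentation as I understand it (CPACR at CP15 c1,c0,2 with the cp10/cp11 access fields in bits 23..20, FPEXC.EN at bit 30, and FPEXC reached through coprocessor 10 register cr8, which is the MRC/MCR form used instead of VMRS/VMSR). It has to run in a privileged mode, and the hardware part is only compiled on ARM targets.

```c
#include <assert.h>
#include <stdint.h>

#define CPACR_CP10_CP11_FULL (0xFu << 20)  /* assumed: full access to cp10 and cp11 */
#define FPEXC_EN             (1u << 30)    /* assumed: global VFP/NEON enable bit   */

/* Pure bit arithmetic, so it can be checked on any host. */
static inline uint32_t cpacr_with_neon(uint32_t cpacr)   { return cpacr | CPACR_CP10_CP11_FULL; }
static inline uint32_t fpexc_with_enable(uint32_t fpexc) { return fpexc | FPEXC_EN; }

/* Host-side sanity check of the masks; touches no hardware. */
static inline void neon_bits_selfcheck(void)
{
    assert(cpacr_with_neon(0) == 0x00F00000u);
    assert(fpexc_with_enable(0) == 0x40000000u);
}

#if defined(__arm__) && defined(__GNUC__)
/* Hypothetical bare-metal helper: privileged mode only. */
static inline void neon_enable(void)
{
    uint32_t v;

    /* Step 1: allow access to coprocessors 10 and 11 via CPACR. */
    __asm__ volatile("mrc p15, 0, %0, c1, c0, 2" : "=r"(v));
    v = cpacr_with_neon(v);
    __asm__ volatile("mcr p15, 0, %0, c1, c0, 2" : : "r"(v));

    /* Prefetch flush so the new access rights apply to what follows. */
    __asm__ volatile("mcr p15, 0, %0, c7, c5, 4" : : "r"(0) : "memory");

    /* Step 2: set FPEXC.EN, using the MRC/MCR encoding of VMRS/VMSR. */
    __asm__ volatile("mrc p10, 7, %0, cr8, cr0, 0" : "=r"(v));
    v = fpexc_with_enable(v);
    __asm__ volatile("mcr p10, 7, %0, cr8, cr0, 0" : : "r"(v));
}
#endif
```

Setting only the access control register matches the symptom described above: NEON instructions keep raising undefined instruction exceptions until the enable bit in FPEXC is set as well.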
Anyway, so I came up with some total cycle counts for various implementations of a 'rectangular block colour fill for RGB565'.
These are all with *NO CACHE* or write buffers, so they don't really mean anything other than relative to each other. You have to turn the MMU on to turn on data caches and write buffers, perhaps that is the next thing to try.
Code      Total     Slowest  Fastest
C short   36308222  1.00     5.25
C long    18307488  0.50     2.64
ARM asm   15877960  0.43     2.29     - uses 4x strd (writes 8 bytes/instruction)
NEON       9735680  0.26     1.40     - uses 2x writes of 2xD regs, 64 bit aligned
NEON2      9134690  0.25     1.32     - uses 1x write of 4xD regs, 64 bit aligned
NEON3      9311284  0.25     1.34     - uses 2x writes of 4xD regs, 128 bit aligned
NEON4      9191652  0.25     1.33     - uses vstm of 8xD regs
sDMA       6910682  0.19     1.00
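The Slowest and Fastest columns are each row's total divided by the slowest total (C short) and by the fastest total (sDMA). A small sketch of that arithmetic, with a made-up helper name; going by the numbers, the figures look truncated to two decimals rather than rounded (ARM asm works out to about 0.437 and is listed as 0.43), so the helper truncates too.

```c
#include <assert.h>
#include <stdint.h>

/* Ratio a/b in hundredths, truncated (hypothetical helper). */
static uint32_t ratio_x100(uint64_t a, uint64_t b)
{
    assert(b != 0);
    return (uint32_t)(a * 100u / b);
}

/* Totals taken from the table. */
static const uint64_t TOTAL_C_SHORT = 36308222u;  /* slowest */
static const uint64_t TOTAL_ARM_ASM = 15877960u;
static const uint64_t TOTAL_SDMA    = 6910682u;   /* fastest */
```

For example, ratio_x100(TOTAL_ARM_ASM, TOTAL_C_SHORT) gives the 0.43 in the Slowest column, and ratio_x100(TOTAL_ARM_ASM, TOTAL_SDMA) gives the 2.29 in the Fastest column, both as integer hundredths.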
The NEON implementations use ARM code for the non-aligned 'edges', and none of them are particularly fantastic code.
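The 'wide aligned writes in the middle, narrow writes on the edges' structure can be sketched in portable C. This is an illustration only, not any of the implementations timed above: fill_rect565 and its parameters are invented names, stride is assumed to be in pixels, and the middle uses 32-bit stores (two pixels each) where the NEON and strd variants would write far more per instruction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fill a w x h block at (x, y) of an RGB565 framebuffer (hypothetical API). */
static void fill_rect565(uint16_t *fb, size_t stride,
                         size_t x, size_t y, size_t w, size_t h,
                         uint16_t colour)
{
    assert(((uintptr_t)fb & 1u) == 0);           /* pixels are 16-bit aligned */
    const uint32_t pair = ((uint32_t)colour << 16) | colour;

    for (size_t row = 0; row < h; row++) {
        uint16_t *p = fb + (y + row) * stride + x;
        size_t n = w;

        /* Leading edge: one narrow write if the row start is not 32-bit aligned. */
        if (n > 0 && ((uintptr_t)p & 3u) != 0) {
            *p++ = colour;
            n--;
        }
        /* Aligned middle: two pixels per write. memcpy keeps this free of
         * aliasing problems; compilers typically turn it into one 32-bit store. */
        for (; n >= 2; n -= 2, p += 2)
            memcpy(p, &pair, sizeof pair);
        /* Trailing edge: at most one pixel left over. */
        if (n)
            *p = colour;
    }
}
```

Both halves of pair hold the same colour, so the result does not depend on byte order. A wider variant would follow the same pattern with more edge cases to reach 64 or 128 bit alignment.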
Hrm, I thought the ARM asm one was ok when I was running it by itself - I guess twice as fast as something is quite noticeable - but obviously it's kind of slow.
Looking in more detail at a couple of them:
drawRect() C long
  total cycles=18307488
  dwrite intns=169668
  ext writes  =169671
  iexec       =701230
  istall      =1201453

drawRect() ARM asm
  total cycles=15877960
  dwrite intns=168963
  ext writes  =168965
  iexec       =310508
  istall      =182922
Update: Should've tested more, the long version was still just a 'short' version, it just wrote half the width ... so all bogus. Will revisit in a newer post. The code in question is all in puppy bits: