Yawn.
Too damn tired, all the late nights have caught up with me. I don't think the uncounted beers at the pub last night helped either.
After the last post I mucked about with a couple of implementations of 'job queues' on the Cell. I wrote the whole lot up before testing on real hardware - and thought i'd really messed one up. But it turns out it was salvageable and i'd only made a small mistake. For the simple 1-PPU-writer-only-SPU-reader queue I can send about 1.3m 'jobs' to an SPU per second (the 128 byte `jobs' are sent one way and once received, simply marked as 'done') for a single SPU and about 1m jobs/sec when all 6 are used. Which seems reasonable scalability; the contention isn't getting in the way too much. For the more complex any-writer-any-reader queue (only being used with the same ppu-writer-n-spu-reader test driver), it drops down to about 1m/s for 1 SPU and 750K/s for all 6. Which could probably be improved - but the jobs aren't actually doing anything so creating artificially high contention anyway.
They seem to be stable and reliable, no races or deadlocks.
I'll keep poking to see if I can improve them.