So that's the second rewrite.
I find I always have to rewrite things at least once and usually 3 times to get it pretty well right, at least for problems I haven't encountered before.
So, after writing the post last night (and forgetting to post it) I had a thought about two of the later points - supporting different data types, and implementing line-by-line compositing.
At first I just sat down to put some ideas down - how would it run in the degenerate case of only processing a line at a time - but I ended up writing a whole new compositor. Then I did some timings and it looked pretty reasonable: a tad faster than the other compositor, but at least it wasn't slower, and now I had something I could throw at threads with abandon. Then I realised the test was compositing 8-bit data, which carries a fair hit of data conversion, and for that same case the new code was actually 50% faster. Nice. Particularly considering it's doing a whole extra memcpy for every data layer, and a lot (lot) more hit tests (although they are simpler) (and this is always tricky with HotSpot - maybe it noticed they were always true in my micro-benchmark and compiled them out ...).
So the code supports more data formats, executes faster, uses a lot less temporary memory, is trivially easy to convert to run across multiple threads, ... and is about half the total lines of code. Yes, I will give myself a pat on the back. Hmm, that felt odd.
I'm borrowing a few ideas from the way GPUs execute and from some of the research I did on CELL - processing loops are much more efficient when data is accessed in a native format, as the data conversion costs quickly overwhelm simple computations. If you're doing something more than once, it's almost always cheaper to do the data conversion in a separate step - let alone the effort required to write specialised code for every case. So all the algorithms just work with floats as they did before - I didn't need to touch the blending kernels. I just added some batch interfaces to retrieve or store the data line by line into a pre-allocated buffer. This is the only point at which data and format conversion needs to take place. It might not be as fast as technically possible, but it's quite quick, it's a lot less code to write and it's all simpler code as well - and compilers like simple code. Likely to be more cache friendly too, which CPUs also like. They like that a lot.
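To make that concrete, here's a rough sketch of the kind of per-line batch interface I mean - the names (LineSource, getLineRGBA, setLineRGBA, ByteRGBALayer) are made up for illustration, not the actual interfaces in the code:

// A sketch of a per-line batch interface - names are illustrative only.
interface LineSource {
    int getWidth();
    int getHeight();

    // Fetch one row into a caller-supplied float buffer (interleaved RGBA,
    // 4 * width elements). Format conversion happens here and only here.
    void getLineRGBA(int y, float[] dst);

    // Store one row back, converting from float to the native storage format.
    void setLineRGBA(int y, float[] src);
}

// Example backing store: 8-bit interleaved RGBA in a flat byte array.
class ByteRGBALayer implements LineSource {
    private final byte[] pixels;        // width * height * 4 bytes
    private final int width, height;

    ByteRGBALayer(byte[] pixels, int width, int height) {
        this.pixels = pixels;
        this.width = width;
        this.height = height;
    }

    public int getWidth() { return width; }
    public int getHeight() { return height; }

    public void getLineRGBA(int y, float[] dst) {
        int off = y * width * 4;
        for (int i = 0; i < width * 4; i++)
            dst[i] = (pixels[off + i] & 0xff) * (1.0f / 255.0f);
    }

    public void setLineRGBA(int y, float[] src) {
        int off = y * width * 4;
        for (int i = 0; i < width * 4; i++) {
            int v = (int) (src[i] * 255.0f + 0.5f);
            pixels[off + i] = (byte) (v < 0 ? 0 : v > 255 ? 255 : v);
        }
    }
}

The point is that the blending kernels never see anything but the float buffer; supporting another storage format just means another implementation of the two conversion methods.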
Then I spent a good chunk of this afternoon (had some time off work) and evening converting all the rest of the code to use this new compositor, and actually redoing the whole 'layer' object. Filters and effects now have to work differently too - they can either work with the specific types, work generically on data line by line, or make whole copies. A lot of operations are easy to convert to line-based operations, and doing so adds a couple of benefits. Firstly, anything requiring temporary memory might only need to store a line of it at a time; secondly, if the work, once broken into lines, is independent or locally independent, then it can be split to run across multiple threads.
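As a rough illustration (again with made-up names), a line-based filter against that same hypothetical interface only ever needs one line of scratch memory, and every row is processed independently:

// A line-based operation against the sketch interface above: one line of
// temporary float storage, and each row is independent of the others.
class GainFilter {
    private final float gain;

    GainFilter(float gain) { this.gain = gain; }

    void apply(LineSource layer) {
        float[] line = new float[layer.getWidth() * 4];  // one line of scratch
        for (int y = 0; y < layer.getHeight(); y++) {
            layer.getLineRGBA(y, line);      // convert in
            for (int i = 0; i < line.length; i++)
                line[i] *= gain;             // the kernel itself works in float
            layer.setLineRGBA(y, line);      // convert out
        }
    }
}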
With little effort I added a thread-dispatching frontend to the compositor, and now it's using all the cores on this machine, which never otherwise seem to get much work to do. I haven't yet converted the Gaussian blur, but that will save a lot of memory: I had to extract the packed pixels to padded planes anyway, and now I won't need to do that as a separate step.
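The dispatch itself doesn't need to be anything fancy. Something along these lines - a sketch only, the worker interface and banding are just for illustration - carves the rows into bands and runs them on a pool sized to the machine:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch of a thread-dispatching front-end: split the rows into bands and
// hand each band to a worker running on its own thread.
class LineDispatcher {
    interface LineWorker {
        void processLines(int yStart, int yEnd);    // half-open [yStart, yEnd)
    }

    static void dispatch(final LineWorker worker, int height)
            throws InterruptedException {
        int nthreads = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(nthreads);
        int band = (height + nthreads - 1) / nthreads;
        List<Callable<Void>> jobs = new ArrayList<Callable<Void>>();
        for (int y = 0; y < height; y += band) {
            final int y0 = y;
            final int y1 = Math.min(y + band, height);
            jobs.add(new Callable<Void>() {
                public Void call() {
                    worker.processLines(y0, y1);
                    return null;
                }
            });
        }
        pool.invokeAll(jobs);    // blocks until every band has finished
        pool.shutdown();
    }
}

This only works because the lines are independent - each worker keeps its own line buffer and never touches rows outside its band.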
The drawing tools didn't need any changes - they're just working with BufferedImages wrapping the data - and when I tried changing the temporary layer to a 4-byte format the paint tools worked just fine, and somewhat quicker. I tried to leverage a bit more information from the WritableRaster and the SampleModel, but there isn't really much more I need from them. I'm limiting the code to a couple of specific image formats whose layout I know, so the code can go straight to the backing array rather than through accessors, to reduce the required address arithmetic, which adds up pretty fast (or perhaps, doesn't add up fast enough).
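For example, for an image created as TYPE_INT_ARGB you can grab the backing array directly - a sketch of the idea, assuming that specific format; anything more general would have to go through the SampleModel:

import java.awt.image.BufferedImage;
import java.awt.image.DataBufferInt;

class RawAccess {
    // Only safe because the image was created with this exact type.
    static int[] pixelsOf(BufferedImage img) {
        // TYPE_INT_ARGB is backed by a single int[] with one packed ARGB
        // pixel per element, scanline after scanline.
        return ((DataBufferInt) img.getRaster().getDataBuffer()).getData();
    }

    public static void main(String[] args) {
        BufferedImage img = new BufferedImage(64, 64, BufferedImage.TYPE_INT_ARGB);
        int[] data = pixelsOf(img);
        data[10 * 64 + 10] = 0xff00ff00;   // write a green pixel at (10, 10) directly
    }
}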
Code is piling up quickly - I've hit 8KLOC already, although there's a bit of stale stuff I'm keeping around since nothing is in version control. I had some pretty nasty experiences with Mercurial over the last week for work, so that's fallen heavily out of favour. Heavily. I'm even considering CVS - I know its quirks, and at least it knows how to fucking merge properly, which is only about the most fucking important thing for a fucking source management tool and the only thing it really has to fucking get right (fucking). Not that I need to do any merging with myself. I may rant more on that later, but maybe I've already wasted enough time with it.