ahh
So I discovered from observation that the whole performance thing is basically some fundamental speed/space trade-off in the compiler. It seems that for any (abstract) class which implements a loop and calls a method defined in a sub-class inside that loop, it will be inlined (if possible and/or based on other parameters) for up to two sub-classes, but if you add a third those both will be thrown away and replaced with a version they all share which uses a method call instead.
This dynamic recompilation (or rebinding) is actually pretty cool but has some implications for the whole mechanism on which java streams/lambdas are built. It's only much of a "problem" for tasks that don't do much work or for platforms where call overhead is high and where the operation is being called multiple-millions of times. But lambdas rather encourage the former.
Micro-benchmarks overstate the impact and in any event it can still be worked around or avoided in performance-critical code. But regardless I know what to look for now if i see funny results.
I did some experimentation with yet another way to access and process the pixels: by row accessor. There are still on-going but my current approach is to have a Consumer which is supplied a type-specific row accessor. This allows flat-mapped access via a get(i)/set(i,v) interface pair and exposes the internal channel ordering (which is fixed and interleaved). To handle different storage formats the forEach function will perform data conversion and be supplied data-directional hints so that each implementation isn't forced to implement every possible storage format.
Performance seems to be pretty good and there don't seem to be any weird speed-degradation gotchas so far. The Consumer code still needs to implement a loop but it's somewhat cleaner than having to have an array+base index passed around and used everywhere and the inner loop being local seems to be the easiest way to guarantee good performance. Oh and I have a way to handle multi-image combining operations by just defining multi-row interfaces which can be passed with a single parameter.
I don't know yet whether I want to also go "full-on-stream" with this as well as the tiles, and/or whether i even want a "Pixel" interface anywhere. I'm getting ideas to feed back and forth between them though like the memory transfer hints and automatic format conversion which might also be applicable to tiles. Having a `Pixel' is still kind of handy for writing less code even if there are performance trade-offs from using it. Then again, how many interfaces do I want to write and build the scaffolding for? Then there is the question of whether these accesors are used in the base apis themselves outside of just supplying ways to read or write the information.
Well that was a short diversion to end the day but I actually spent most of today (actually almost a full working day's worth) implementing what I thought was a fairly modest task: adding a parameterisable coordinate mapper to an all-edge-cases handling "set pixels" function. I'm using it as part of the tile extraction so that for example you can extract a tile from anywhere and have the edges extended in different ways that suit your use.
I already had a mapper enum to handle this for other cases but performing the calculation per-pixel turned out to be rather slow, actually oddly slow now I think about it - like 100x slower then a straight copy and it's not really doing a lot of work. I tried writing an iterator which performed the calculation incrementally and although it was somewhat faster it was still fairly slow and exposed too much of the mess to the caller. After a few other ideas I settled on adding a batch row-operation method to the enum where it handles the whole calculation for a given primitive data type.
The mirror-with-edge-repeat case turned out to way more code than I would like but it's something like 20x faster than a trivial per-pixel mapped version. It breaks the target row into runs of in-order and reverse-order pixels and copies them accordingly. I can get a bit more performance (~20-30%) coding specific versions for each channel count/element size; but it's already enough code to have both float and byte versions. Hmm, maybe this is something that could take a row-pair-accessor, 4 arguments instead of 8, .. ahh but then i'd probably want to add bulk operations to the interface and it's starting to get complicated.
Well now i'm super-tired and not thinking straight so time to call it a night. My foot seems to be going south again and i've been downing all the tea and water I can stomach so i'll probably be up all night anyway :(