This article is some notes and background on developing an image container / library for Java, and how a Streams interface was incorporated into the design.
The library was developed as a simple container for research and protoyping of image processing algorithms. The ultimate target was usually OpenCL/GPU but often the Java performance was good enough that the extra cost of OpenCL development proved unecessary.
It's primary goal is simply to get out of the way of the important bit - algorithmic development; it's not a "an image processing framework", merely a container. A single line of code is required to load an image. To display it inside a viewer with pan/zoom and drag and drop takes only a single line more.
Part of that goal is that it also be relatively efficient. This was accomplished by exposing the array and necessary addressing metrics directly as well as codifying the storage type and channel count into explicit classes. The former allows the compiler to generate "fastest" code, and the latter allows optimised implementations to override general-purpose ones.
Here is a summary of the part of the interface. Not shown are format-specific functions and some of the helper classes.
public abstract class Image2D<T extends Image2D, FT extends FloatImage, BT extends ByteImage> implements Serializable { public final int width; public final int height; public final int stride; public final int offset; protected Image2D(int width, int height, int stride, int offset) { } public static final byte ftob(float f) { } public static final float btof(byte b) { } public static final int ftoi(float f) { } public static final float itof(int i) { } public T create() { } public T create(int width, int height) { } public abstract FT createFloatImage(int width, int height); public FT createFloatImage() { } public abstract BT createByteImage(int width, int height); public BT createByteImage() { } public boolean inside(int x, int y) { } public int offset(int x, int y) { } public int getWidth() { } public int getHeight() { } public abstract int getChannels(); public abstract float getf(int x, int y, int c); public abstract int getb(int x, int y, int c); public abstract void setf(int x, int y, int c, float v); public abstract void setb(int x, int y, int c, int v); public abstract void getf(int sx, int sy, int w, int h, float[] dst, int doff, int dstride); public abstract void getb(int sx, int sy, int w, int h, byte[] dst, int doff, int dstride); public abstract void setf(int sx, int sy, int w, int h, float[] src, int soff, int sstride); public abstract void setb(int sx, int sy, int w, int h, byte[] src, int soff, int sstride); public abstract void fillf(float v); public abstract void fillf(float v, int c); public void set(int x0, int y0, ByteImage src) { } public void set(int x0, int y0, FloatImage src) { } public void set(int dx, int dy, BufferedImage src, int sx, int sy, int w, int h) { } public void set(BufferedImage src) { } public void set(int x0, int y0, int c, FloatImage src, int sc) { } public void set(int x0, int y0, int c, ByteImage src, int sc) { } public abstract float samplef(float x, float y, int c, Sampler smp); public abstract void samplef(float x, float y, Sampler smp, float[] dst, int doff); public abstract int sampleb(float x, float y, int c, Sampler smp); public abstract void sampleb(float x, float y, Sampler smp, byte[] dst, int doff); public abstract void sample(AffineTransform at, Sampler smp, T dst); public abstract T halve(T dst); public T halve() { } public abstract T subImage(int x, int y, int w, int h); public T subImage(ROI roi) { } public FT toFloatImage() { } public abstract FT toFloatImage(FT dst); public BT toByteImage() { } public abstract BT toByteImage(BT dst); final public FloatImage1 toFloatImage1() { } public abstract FloatImage1 toFloatImage1(FloatImage1 dst); final public ByteImage1 toByteImage1() { } public abstract ByteImage1 toByteImage1(ByteImage1 dst); public FloatImage1 toFloatImage1(int c) { } public FloatImage1 toFloatImage1(FloatImage1 dst, int c) {} public ByteImage1 toByteImage1(int c) { } public abstract BufferedImage toBufferedImage(boolean autoscale); public BufferedImage toBufferedImage() { } public abstract WritableImage toImage(WritableImage dst, boolean autoScale); public WritableImage toImage(WritableImage dst) { } public WritableImage toImage(boolean autoScale) { } public WritableImage toImage() { } } public abstract class ByteImage<T extends ByteImage, FT extends FloatImage, BT extends ByteImage> extends Image2D<T, FT, BT> { public final byte[] pixels; public ByteImage(int width, int height, byte[] pixels, int stride, int offset) { } /* ... provides some implementations common to byte images */ } public class ByteImage1 extends ByteImage<ByteImage1, FloatImage1, ByteImage1> implements Image2D1 { public ByteImage1(int width, int height) { } public ByteImage1(int width, int height, byte[] pixels) { } public ByteImage1(int width, int height, byte[] pixels, int stride, int offset) { } /* ... overrides provides some optimised functions */ } /* etc, you get the picture */
As can also be seen above despite it not doing a whole lot it's still quite a large api. And this represents the result of several iterations which attempted to pare down the functionality to only the core.
It served it's purpose well but one drawback is the need to write so many double for loops for sometimes trivial operations. The generic system wasn't also terribly well used but more on that later.
When Java8 arrived at General Availability the new "Streams" interfaces were explored. It promised a few features:
for
loop proliferation.
Concurrency hadn't been provided by Image2D
but was
implemented in an ad-hoc manner via some basic Java code. Whilst
this is almost trivial in Java it still requires some extra work.
Many approaches were tried.
IntStream.range().map(array lookup 1D).forEach(int | double)
Replaced the for
loop with a slightly slower, more cumbersome to use one. Only useful for single-channel images or limited operations.
IntStream.range().map(array lookup 2D).forEach(int | double)
>Replaced two for
loop with a slightly slower, slightly more cumbersome to use one. Only suitable for single-channel images or limited operations.
IntStream.range().map(new Pixel(array lookup 2D)).forEach(Pixel).collect()
>High GC load and a bit slow.
IntStream.range(rows).(row loop).collect()
Allowed easy concurrency but only replaced one loop with anohter.
IntStream.range(rows).(row loop on).collect()
Allowed easy concurrency but only replaced one loop with anohter.
And a few other variations. In short, it seemed hard to apply in this situation. Or in many others to be honest. It seemed to be over-engineered and just not that useful.
Spliterators weren't covered very much in the introductory material available for the new Streams feature but they change the game significantly.
It's possible to both taylor the work division to the problem domain and to simplify the generation of custom streams. The GC load can also be mitigated (footnote: the GC load isn't very expensive in terms of running time due to the efficiency of Java memory management, but in certain contexts it still helps to reduce it).
Again, many more experiments, some of them:
The initial work was attempting to efficiently split a packed
1D data into regions that could be traversed using linear addressing
- the most efficient for Java (or C, or assembly too). Some of
these got very complicated whilst trying to avoid integer %
and
/
.
Similar to IntStream.range() but possible to maintain per-band state.
For tile processing.
The other issue was how to operate on the data, pass it in and retrieve the calculation results.
Pixel
object was
usually avoided but was still experimented with.
An initial intent was to provide direct-access to the target objects where it was possible, and so this added another layer to this problem.
But all of this didn't solve anything or provide any benefit whatsoever; the interfaces were ugly and in all cases one ended up writing more boilerplate than simply using for loops with the occasional parallel for-each. Code-reuse was also more difficult than simply throwing some static functions into a library.
So after several iterations a solution based on rows of basic types
was settled on. One of the discarded iterations provided row pair
types for binary operators types but that was a poor design choice.
There was also a long-running episode of providing specific
map()
and forEach()
calls but these were all eventually
discarded.
Here's a summary of the additional classes and interfaces.
public class Image2D ... { public <X extends Image2D> X toImage2DByte(X dst, Function<ByteRow, ByteRow> convert) { } public <X extends Image2D> X toImage2DFloat(X dst, Function<FloatRow, FloatRow> convert) { } public Stream<ByteRow> byteRows(boolean parallel) { } public Stream<FloatRow> floatRows(boolean parallel) { } public Collector<FloatRow, T, T> floatCollector() { } public Collector<ByteRow, T, T> byteCollector() { } public void setByteRows(boolean parallel, Function<ByteRow, ByteRow> func) { } public void setFloatRows(boolean parallel, Function<FloatRow, FloatRow> func) { } } public interface ImageRow { public int getY(); public void setY(int y); public int getWidth(); public int getChannels(); public int getSize(); public void read(Image2D dst, int y); public void write(Image2D dst, int y); } public class ByteRow implements ImageRow { private final int width; private final int channels; public final byte[] pixels; // used by spliterator int y; public int getb(int i) { } public void setb(int i, int v) { } } public class FloatRow implements ImageRow { private final int width; private final int channels; public final float[] pixels; int y; public float getf(int i) { } public void setf(int i, float v) { } }
The ImageRow
implementations hide the internals to still allow
for reference rows or other tricks. The setByteRows()
and
setFloatRows()
are to work-around the problem that sometimes you
don't want to read the old contents first.
This Image2D
design has been in use as part of a much larger
library for enough time to discover some flaws.
get()
and set()
syntax actually really sucks to
use. Especially if you want to do a read/modify/write operation.
Image2D
is broken and wrong.
setXXXXRows
is a "work-around". The base api shouldn't
include workarounds of it's own design. And that's a tidy one -
you should've seen the monstrosly over-engineered solution that
proceeded that.
And whilst this feature is there, it still isn't used very widely. Partly this is just due to time and that streams in general are also being used sparingly in the given software. But it also stems from needing more work and planning than a simple for loop engendres.
Enter Pixels. This is an equivalent object but part of an as yet unpublished Free Software library.
A new implementation but learning from the previous experience. It
predates the current Image2D
intefaces shown above but not by
much.
And it made almost all the same mistakes and then added some more.
Try again.
A couple of cold evenings and another few attempts and lets see how this one fares.
The core object:
public abstract class Pixels implements Pixels0 { public final int width; public final int height; public Pixels(int width, int height) { } public static final byte ftob(float f) { } public static final float btof(byte b) { } public static final int ftoi(float f) { } public static final float itof(int i) { } public static final int btoi(byte b) { } public static final byte itob(int i) { } public abstract int offset(int y); public abstract int offset(int x, int y); public abstract int getCC(); public abstract <T extends Pixels> T subImage(int x, int y, int w, int h); public abstract void sample(float x, float y, float[] dst, int doff, PixelSampler smp, PixelIndexer idx); public void sample(float x, float y, float[] dst, int doff) { } public abstract <RT extends PixelRow> Stream<RT> rows(); public Stream<PixelRow.Byte4Row> byte4Rows() { } public Stream<PixelRow.Float4Row> float4Rows() { } public abstract PixelRow.Byte4Row readRow(PixelRow.Byte4Row row); public abstract void writeRow(PixelRow.Byte4Row row); public abstract PixelRow.Float4Row readRow(PixelRow.Float4Row row); public abstract void writeRow(PixelRow.Float4Row row); public abstract <RT extends PixelTile> Stream<RT> tiles(int twidth, int theight, int xborder, int yborder, PixelIndexer pxi); public abstract <T extends Pixels0> T create(); public abstract <T extends Pixels0> T toPixels(); }
Quite small as far as that goes. It also includes no "slow path" fallback implementation.
And part of the hierarchy for byte images:
public abstract class Byte0Pixels extends Pixels { public final byte[] pixels; public Byte0Pixels(int width, int height, byte[] pixels) { } public void sample(float x, float y, float[] dst, int doff, PixelSampler smp, PixelIndexer idx) { } } public class Byte1Pixels extends Byte0Pixels implements Pixels1 { public Byte1Pixels(int width, int height) { } public Byte1Pixels(int width, int height, byte[] pixels) { } public Byte1Pixels create() { } public Byte1Pixels toPixels() { } public int offset(int y) { } public int offset(int x, int y) { } public int getCC() { } public Byte1Pixels subImage(int x, int y, int w, int h) { } public Stream<PixelRow.Byte1Row> rows() { } public Stream<PixelRow.Byte1Row> byte1Rows() { } public Stream<PixelRow.Float1Row> float1Rows() { } public PixelRow.Byte4Row readRow(PixelRow.Byte4Row row) { } public PixelRow.Byte1Row readRow(PixelRow.Byte1Row row) { } public PixelRow.Float4Row readRow(PixelRow.Float4Row row) { } public PixelRow.Float1Row readRow(PixelRow.Float1Row row) { } public void writeRow(PixelRow.Byte4Row row) { } public void writeRow(PixelRow.Byte1Row row) { } public void writeRow(PixelRow.Float4Row row) { } public void writeRow(PixelRow.Float1Row row) { } public Stream<PixelTile.Byte1Tile> tiles(int twidth, int theight, int xborder, int yborder, PixelIndexer pxi) { } public PixelTile.Byte1Tile readTile(PixelTile.Byte1Tile tile, PixelIndexer pxi) { } public void writeTile(PixelTile.Byte1Tile tile) { } public void fill(float r) { } public void fill(int r) { } public static class Byte1View extends Byte1Pixels { private final int stride; private final int offset; public Byte1View(int width, int height, int stride, int offset, byte[] pixels) { } public Byte1Pixels toPixels() { } public int offset(int y) { } public int offset(int x, int y) { } } }
Almost all the code is just copying data around, perhaps doing some float conversion and clamping.
And some of the PixelRow
stuff to get a flavour of what is included.
public abstract class PixelRow { public int y; public abstract int getCC(); public abstract static class FloatRow extends PixelRow { public final float[] pixels; public FloatRow(float[] pixels) { } } public static class Float1Row extends FloatRow { public Float1Row(int width) { } } /// ... public static <RT extends PixelRow> Stream<RT> rows( int width, int height, boolean parallel, IntFunction<RT> createRow, UnaryOperator<RT> readRow) { } public static UnaryOperator<Byte3Row> byte3Permute(int i0, int i1, int i2) { } public static UnaryOperator<Byte4Row> byte4Permute(int i0, int i1, int i2, int i3) { } public static UnaryOperator<Float3Row> float3Permute(int i0, int i1, int i2) { } public static UnaryOperator<Float4Row> float4Permute(int i0, int i1, int i2, int i3) { } static class RowSpliterator<RT extends PixelRow> implements Spliterator<RT> { /// ... } static class PixelRowCollector<R extends PixelRow, P> implements Collector<R, P, P> { /// ... } }
This starts to introduce some (hopefully reusable!) map functions, and a stream generator and collector.
The basic hierarchy is similar with some small changes and some simplification of the api.
Byte4Row
and Float4Row
which allows for easy code re-use in many
common cases.
Image
or BufferedImage
by
simply providing a new UnaryOperator
to PixelRow.rows()
.
And because sometimes it really just is easier writing a function to
process a single pixel, support is included for Pixel
objects
and operators. But this time i just use a map operation on a stream
of rows. It's both more efficient and takes less code. Win win!
public class BytePixel { public int x; public int y; public int r, g, b, a; public static UnaryOperator<PixelRow.Float1Row> mapFloat1(UnaryOperator<BytePixel> op) { } public static UnaryOperator<PixelRow.Float3Row> mapFloat3(UnaryOperator<BytePixel> op) { } public static UnaryOperator<PixelRow.Float4Row> mapFloat4(UnaryOperator<BytePixel> op) { } public static UnaryOperator<PixelRow.Byte1Row> mapByte1(UnaryOperator<BytePixel> op) { } public static UnaryOperator<PixelRow.Byte3Row> mapByte3(UnaryOperator<BytePixel> op) { } public static UnaryOperator<PixelRow.Byte4Row> mapByte4(UnaryOperator<BytePixel> op) { } }
There is only two pixel types, BytePixel
(integer) and
FloatPixel
since the cost of wasted memory is insignificant.
Probably the most interesting here is the way generics have been used. It took more than a little fighting with the compiler to understand how they can be used in a simple way without turning it into a vomit-filled bucket of angle brackets.
One convenience is that you can override a locally templated return type in the implementation. So for example:
public abstract <T extends Pixels> T subImage(int x, int y, int w, int h); /// ... public class Byte1Pixels { public Byte1Pixels subImage(int x, int y, int w, int h) { } }
The original approach templated the instance-type into the base class. An obvious solution, but it turns out a rather naive one.
Unfortunately the same does not apply to templated arguments. Authors Note: I can't remember why I tried these, but I found a way not to need them so I guess it doesn't matter..
One feature I find quite ingenous is the way the reusable
PixelRow.rows()
stream factory works. I provided convenience
methods in each instance but they are simple one liners which invoke
this function.
public class PixelRow { public static <RT extends PixelRow> Stream<RT> rows( int width, int height, boolean parallel, IntFunction<RT> createRow, UnaryOperator<RT> readRow) { } } public Stream<PixelRow.Byte1Row> byte1Rows() { return PixelRow.rows(width, height, true, PixelRow.Byte1Row::new, this::readRow); }
Note here that the readRow
function is overriden for each of the
supported channel and data formats but doesn't need to be defined
explicitly. In this case the Supplier IntFunction
is
what is being used to bind the generic RT
to Byte1Row
and
that in turn forces the correct readRow
method to be bound. And
additionally it binds the Stream
element type as well so applies
down-stream.
Whilst it isn't entirely obvious why this is such a great thing in this isolated case it carries forward to all users of the stream. It just saves a lot of typing and allows for easier code reuse.
And just a small sample of how this api can be used and how streams (sort of) make it easier.
public Byte1Pixels filter(Byte1Pixels src, Byte1Pixels dst) { byte[] map = new byte[256]; int[] hist = src.rows().collect(byte1Histogram()); double m = Arrays.stream(hist).average().getAsDouble(); int ml = max(1, (int) (m * limit)); int total = 0; for (int i = 0; i < hist.length; i++) total += (hist[i] = min(hist[i], ml)); int sum = 0; for (int i = 0; i < hist.length; i++) { map[i] = (byte) (sum * 255 / total); sum += hist[i]; } src.rows().map(byte1Map(map)).forEach(dst::writeRow); return dst; }
This is an implementation of an adaptive histogram equalisation. The buckets can be limited to a proportion of the mean - this allows a softening of the effect which often produces more natural results Author Note: I wrote about it somewhere on my blog.
So ... what isn't shown ... is the parallel histogram reduction implementation, and the output pixel mapper. Fortunately Java 8 Streams make this relatively straight forward to write. But more importantly the code you do write is already able to run concurrently and in a reusable state.
static Collector<Byte1Row, int[], int[]> byte1Histogram() { return new Collector<Byte1Row, int[], int[]>() { public Supplier<int[]> supplier() { return () -> new int[256]; } public BiConsumer<int[], Byte1Row> accumulator() { return (int h[], Byte1Row u) -> { for (int i = 0; i < u.pixels.length; i++) h[u.pixels[i] & 0xff] += 1; }; } public BinaryOperator<int[]> combiner() { return (int h[], int[] u) -> { for (int i = 0; i < h.length; i++) h[i] += u[i]; return h; }; } public Function<int[], int[]> finisher() { return (int[] r) -> r; } public Set<Collector.Characteristics> characteristics() { return EnumSet.of(Characteristics.CONCURRENT, Characteristics.IDENTITY_FINISH); } }; } static UnaryOperator<Byte1Row> byte1Map(byte[] map) { return (Byte1Row r) -> { for (int i = 0; i < r.pixels.length; i++) r.pixels[i] = map[r.pixels[i] & 0xff]; return r; }; }
This demonstrates that sometimes you need to do some designing and planning with streams, but that design effort can be put to good use in many cases by providing re-usable modules as a "free" side effect.
As an aside, I was typically including Characteristics.UNORDERED
in my collectors as it seemed like the right flag to include. But
if your Spliterator
reports ORDERED
then this leads to a
suboptimal execution of concurrent streams. Well so it seems.
Another small example, separable convolution.
public Byte1Pixels filter(Byte1Pixels src, Byte1Pixels dst) { int padx = (xkernel.length + 1) / 2; int pady = (ykernel.length + 1) / 2; Float1Pixels tmp = new Float1Pixels(src.width, src.height); PixelRow.rows(src.width + xkernel.length, src.height, true, allocate(padx), readRow(src, padx)) .map(convolve(xkernel)) .forEach(tmp::writeRow); PixelRow.rows(src.height + ykernel.length, src.width, true, allocate(pady), readCol(tmp, pady)) .map(convolve(ykernel)) .forEach(writeCol(dst)); return dst; }
Wow nice! Runs multi-core for free as well.
Unfortunately it's not quite so small ...
IntFunction<PixelRow.Float1Row> allocate(int pad) { return (int n) -> new PixelRow.Float1Row(n + pad * 2); } UnaryOperator<PixelRow.Float1Row> readRow(Byte1Pixels src, int pad) { return (PixelRow.Float1Row t) -> { for (int i = 0, o = src.offset(t.y); i < src.width; i++) t.pixels[i + pad] = btof(src.pixels[o + i]); for (int i = 0; i < pad; i++) { t.pixels[i] = t.pixels[2 * pad - 1 - i]; t.pixels[i + src.width + pad] = t.pixels[src.width + pad - 1 - i]; } return t; }; } UnaryOperator<PixelRow.Float1Row> convolve(float[] kernel) { return (PixelRow.Float1Row t) -> { for (int i = 0; i < t.pixels.length - kernel.length; i++) { float v = 0; for (int j = 0; j < kernel.length; j++) v += kernel[j] * t.pixels[i + j]; t.pixels[i] = v; } return t; }; } UnaryOperator<PixelRow.Float1Row> readCol(Float1Pixels src, int pad) { return (PixelRow.Float1Row t) -> { for (int i = 0; i < src.width; i++) t.pixels[i + pad] = (src.pixels[src.offset(t.y, i)]); for (int i = 0; i < pad; i++) { t.pixels[i] = t.pixels[2 * pad - 1 - i]; t.pixels[i + src.width + pad] = t.pixels[src.width + pad - 1 - i]; } return t; }; } Consumer<PixelRow.Float1Row> writeCol(Byte1Pixels dst) { return (PixelRow.Float1Row t) -> { for (int i = 0; i < dst.width; i++) dst.pixels[dst.offset(t.y, i)] = ftob(t.pixels[i]); }; }
So already the api design 'fails' at re-use. Column access and padded rows with some sort of edge mapping (in this case mirror) are common enough that they should be in the core api.
Shit.
Saved something for version 2 I guess.
Because it turns out pretty tidy, here's another example. This is
converting to/from JavaFX Image
.
public class PixelFX { public static void fromImage(Image src, Pixels dst) { rows(src).forEach(dst::writeRow); } public static WritableImage toImage(Pixels src, WritableImage dst) { return src.byte4Rows() .collect(fromByte4Rows(dst)); } public static Stream<PixelRow.Byte4Row> rows(Image img) { int width = (int) img.getWidth(); int height = (int) img.getHeight(); return PixelRow.rows(width, height, true, PixelRow.Byte4Row::new, readRow(img)) .map(PixelRow.byte4Permute(2, 1, 0, 3)); } public static UnaryOperator<PixelRow.Byte4Row> readRow(Image img) { int width = (int) img.getWidth(); PixelReader pr = img.getPixelReader(); WritablePixelFormat<ByteBuffer> pf = PixelFormat.getByteBgraPreInstance(); return (PixelRow.Byte4Row row) -> { pr.getPixels(0, row.y, width, 1, pf, row.pixels, 0, 0); return row; }; } public static Collector<PixelRow.Byte4Row, WritableImage, WritableImage> fromByte4Rows(WritableImage img) { WritablePixelFormat<ByteBuffer> pf = PixelFormat.getByteBgraPreInstance(); PixelWriter pw = img.getPixelWriter(); UnaryOperator<PixelRow.Byte4Row> permute = PixelRow.byte4Permute(2, 1, 0, 3); return new PixelRow.PixelRowCollector<>(img, (WritableImage t, PixelRow.Byte4Row row) -> { row = permute.apply(row); pw.setPixels(0, row.y, row.pixels.length >> 2, 1, pf, row.pixels, 0, 0); }); } }
byte4Permute()
is a trivial row operator which does the obvious.
Take note at how every one of the static methods above are reusable
for more specialised cases.
Happy hacking.
notzed on various mail servers, primarily gmail.com.