About Me

Michael Zucchi

 B.E. (Comp. Sys. Eng.)

  also known as Zed
  to his mates & enemies!

notzed at gmail >
fosstodon.org/@notzed >

Tags

android (44)
beagle (63)
biographical (104)
blogz (9)
business (1)
code (77)
compilerz (1)
cooking (31)
dez (7)
dusk (31)
esp32 (4)
extensionz (1)
ffts (3)
forth (3)
free software (4)
games (32)
gloat (2)
globalisation (1)
gnu (4)
graphics (16)
gsoc (4)
hacking (459)
haiku (2)
horticulture (10)
house (23)
hsa (6)
humour (7)
imagez (28)
java (231)
java ee (3)
javafx (49)
jjmpeg (81)
junk (3)
kobo (15)
libeze (7)
linux (5)
mediaz (27)
ml (15)
nativez (10)
opencl (120)
os (17)
panamaz (5)
parallella (97)
pdfz (8)
philosophy (26)
picfx (2)
players (1)
playerz (2)
politics (7)
ps3 (12)
puppybits (17)
rants (137)
readerz (8)
rez (1)
socles (36)
termz (3)
videoz (6)
vulkan (3)
wanki (3)
workshop (3)
zcl (4)
zedzone (26)
Friday, 08 August 2014, 04:52

tunable histogram equalisation

So this is a rather simple but quite effective improvement to the basic histogram equalisation operation for automatic image correction.

I got the main idea from a paper: ``A Modified Histogram Equalization for Contrast Enhancement Preserving the Small Parts in Images''. Bit of a mouthful for what is just a bounded histogram.

I also made made another small modification which makes it tunable. Allowing for a fairly smooth range from what should be the same as 'normalise' in paint programs, up to the fully sun-seared over-exposed normal result from histogram equalisation.

Here's an example of the range of output.

The value is the proportion of the mean to which the histogram input is limited: a value of 0.0 should be equivalent to a contrast stretch or normalise operation, 1.0 matches the paper, and some large number (depending on the input, but approximately greater than 2) will be the same as a basic histogram equalisation.

One has to look closely with this particular image because they are already fairly balanced but a smooth range of 'enhancement' should be apparent.

I also played with a colour version which applies the histogram to the Y channel of a YUV image.

As of now it does tend to upset the colour balance a little bit and tends toward a metallic effect; but it's arguable better than what the gimp's equalise does to colour images.

Yikes. Although the image on the right is arguably more agreeable - the colour is very different from the source. Histogram equalising each component separately is effectively applying a white-balance correction where the colour temperature is some sort of flat grey: this sometimes works ok but it messes with the colour balance by definition.

I have some thoughts on applying the algorithm to floating point values using polynomial curve fitting, but I haven't tried them out yet. This would be to prevent binning from quantising the output.

For such a simple adjustment to the algorithm it's quite a good result - histogram equalisation is usually too harsh to use for much on it's own.

Tagged hacking.
Thursday, 07 August 2014, 10:49

On my god, it's full of local binary patterns!

I recently had need to look into feature descriptors. I've previously played with SURF and looked into others but I wasn't really happy with the complexity of the generator and needed something Java anyway.

A small search turned up FREAK (why the silly 'catchy' acronym names? Maybe it started with S-USANs?) which looked doable so I had a bit of a play. There is code available but it's OpenCV and C++ and the version I saw just wasn't very good code. I wrote up my own because I had some different needs for what I was looking at and porting the C++/OpenCV was going to be a pain.

I guess they work as advertised but i'm not sure they're what I want right now. I tried porting the AGAST detector as well but it wasn't really getting what I was after - i'm after specific features not just 'good features'.

The paper does include this interesting diagram though:

Although the paper doesn't reference them this diagram is pretty much a description of local binary patterns. The FREAK descriptor itself is just a very long local binary pattern with optional orientation normalisation.

Perhaps more interestingly is that this specific diagram could just as well be a description for how my fast object detector works. It is effectively a direct implementation of this diagram.

Assuming the above diagram is representative of human vision I guess one could say that the whole of visual reality is made of local binary patterns.

Tagged ml, philosophy.
Sunday, 03 August 2014, 12:18

bummer, that bandwidth thing again

I did some profiling by clearing the framebuffer directly from the epiphany.

Short summary:

My screen is 1280x1024, 24-bit - don't know if that can be configured to fewer bits as i have no serial console and that's just how it starts up (at least it's not widescreen).

I know it's not the case but it appears as if the memory transfers are somehow synchronised with the framebuffer DMA. It's only getting about 60 fps. At any rate, they're running at the same speed.

Tagged hacking, parallella.
Sunday, 03 August 2014, 05:24

that sucked a bit

I guess i should've known when i woke up feeling underslept with a headache. It was a dreadfully cold and stormy day so I ... well hacked again.

I had intended to just get a triangle out of an epiphany core, but it just didn't end up happening. I had to muck around getting the rev1 to a 'working' state which took surprisingly long. There are a lot of weird shitty changes to ubuntu (i mean, runlevels, why the fuck would anyone want those???) that took me a while to wade through. I did run the C code I have which outputs to the framebuffer, which worked but is a bit slow. I did seem to have weird issues with USB but a reboot more or less fixed that, with it running through a powered hub. nfs was super-slow till i set it to nfs3 mode. Apparently that's a thing that happens. I also ran one of the ezesdk tests and that worked ... i wasn't sure if that would.

And god, what the fuck have they done to the gtk+ version of emacs? It's like a really ugly version of notepad. I wish I had have found the emacs-lucid packages an hour earlier, would've saved my throat a good deal of violence. But I guess at some point the lucid version wont work anymore, I might have to find another editor (yeah it's that bad).

And what's with all the config/startup shit in debian and ubuntu? Run one tool and it gives a completely obtuse but apparently deep and meaningful message about using another tool, the other tool doesn't know what the fuck's going on and in the end you just edit a file by hand? Why is there even more than one? apt-get/dpkg/whatever else is bad enough. What sort of genious thought that "update-rc.d" was a nice name for a command anyone might ever want to run, ever? Trying to find solutions using a search engine is becoming pointless: it's the blind leading the blind and everything is years out of date. Try finding out how to disable screen blanking on the console for example?

This worked for me:

  echo "setterm -blank 0 -powersave off -powerdown 0" >> /etc/rc.local

Net"work"Manager was still spewing pointless shit to the logs whilst trying to "dynamically manage" a fixed ethernet cable ... so fuck that right off. Although i wish more shit actually logged happenings: it seems almost nothing logs when anything goes wrong. I don't see how dedicating a whole virtual terminal to the almost completely information-free "boot.log" is of use to anyone. The packaging system seems to have turned into a sort of enterprise configuration management tool: and you know what, they suck. They're slow and cumbersome and buggy as all shit, and we know because it's linux some doe-eyed fool will come along in a year or two with a new and even more broken 'fix' for all the brokenness.

I can't believe after 20 years of this shit ... it's now way more broken than how it started. At least back then the problems were restricted to hardware support. Now that's fantastic the software has all been fucked up by people poking their noses into places they have no fucking business being.

And i'm still underslept with a headache, with added fun of cold and hungry.

Rasterisers

After the last post I kind of remembered one reason to split the work across cores: calculating the reciprocal of 1/w is somewhat expensive unless it can be batched up.

So I was up way too late last night just trying different snippets of code to address that. I think I will go the branchless loop thing that performs the z-buffer test and in-triangle tests separately and then outputs a compact set of coordinates. The compiler was doing some funky stuff but I got some hand-rolled code down to like 10 cycles per pixel (and that could include the 1/w interpolation too); the only real problem with that being the memory required for the output :-/

A separate loop can then just calculate 1/(1/w) to a table (at something like 16 cycles per pixel), and the final loop can then interpolate all the varying values without having to many any decisions about which are live pixels. Without this kind of split there isn't enough registers to keep everything in registers within the inner loops.

Because of the memory it may have to do all this in several batches - slivers of 64 pixels each or somesuch.

Hello Triangle

But I kinda gave up after today and just worked on a "simple as possible" Java "gpu" to try and have something positive to hang onto after a miserable day (and i started before I got nfs fixed). I needed something which is the distilled/captured knowledge of what I know "so far' as a training simulator. There's still some stuff I need to work out wrt the 3d maths and it's just easier playing with some simple code to do it.

This for example is the code which generates the typical hello world example from OpenGL:

float[] vertices = {
        -0.75f, -0.75f, 0, 1,
        0.75f, 0, 0, 1,
        -0.75f, 0.75f, 0, 1,};

void helloTriangle() {
        Viewport vp = new Viewport(0, 0, width, height);
        PrimitiveTriangle tt = new PrimitiveTriangle();

        tt.setup(vp, 0, vertices);

        // red, green, blue
        tt.setVarying(0, 1, 0, 0);
        tt.setVarying(1, 0, 1, 0);
        tt.setVarying(2, 0, 0, 1);
 
        float uniformA = 1.0f;

        tt.draw(pbuffer, zbuffer, width, (float[] varying, float[] pixels, int x) -> {
                        pixels[x + 0] = varying[0];
                        pixels[x + 1] = varying[1];
                        pixels[x + 2] = varying[2];
                        pixels[x + 3] = uniformA;
                });
}

This is functionally equivalent to the low-level part of a gpu driver/hardware after the vertex shader (more or less).

Here the lambda expression is the fragment shader. The main problem with using Java as a fragment shader language is how ugly vector/matrix/complex maths ends up being when you need to use flat arrays for efficiency.

Right now this isn't really much more than a training tool and intellectual curiosity, but it's food for thought that computer systems (cpu+gpu+other) and compiler technology is explicitly working toward a point where code such as the above would be the way you "program" graphics drivers. And it would run just as fast as any other driver software. There will probably still be a need for some fixed-function units but these could also be encapsulated as co-processors. The reason this would be possible now when it wasn't previously is due to the technology that makes HSA possible.

A Saturday passes ...

I had a lot of trouble with the matrices and some with the triangle direction - as is usually the case with 3d maths. After playing with some opengl3 tutorials I got it enough worked out to get somewhere. I also played with framebuffer, javafx output, and parallel streams for the tile rendering. Oh and using fixed-point for the triangle edge calculations, which fix some rare edge cases with the edges and might be easier to optimise. And trying to optimise the reciprocal calculation and changing back to the fglrx driver as side-missions (so i could run the gl3 examples - for whatever reason mesa wasn't doing the job, except i forgot which kernel is the one I need to use and the one that mostly works causes some nasty bugs in X). Well you know, a ton of stuff - i lost track of time and suddenly it was 5am.

I should really add some lighting but it's quite mesmerising in full-frame-rate motion all the same. Ok result for a week of late nights piss-farting about.

Still no epiphany code; next perhaps?

Tagged hacking, rants.
Thursday, 31 July 2014, 12:36

lost in dots

Continued to play with software rendering. Ported the code to C and got it running on the framebuffer - still only on the workstation so I don't have any idea how it will run on the parallella.

The Java seems to be faster than the C TBH but that is probably rendering to X verses the framebuffer which is very slow. gcc isn't having the best time of the critical inner loop either. Currently i'm not running any fragment shaders and I have half an idea to just try creating a small soft-3d engine all in java since the jvm can handle some of that. Maybe not.

I thought i'd look at GLES2+ to see how it handles certain things, ... and decided to base some of the API on that because it dictates to a good degree the backend implementation and is based on 20+ years of industry experience. Apart from the shader compilers (which i'm going to side-step entirely) the core components are quite simple. The biggest one from my perspective is the one i've been focusing on so far - the rasteriser and varying interpolator.

I hadn't really thought about how many varyings there need to be, and 8x4 is a good amount. I hadn't got to thinking about uniforms yet either.

I played a bit with code to implement glDrawArrays() which is mostly just data conversion and flattening. My first obvious thought was to select elements from the arrays and turn them into rows (i.e. array of structures) and then batch-process them using the vertex shader. Locality of reference and all that. But after thinking about the inner loop of the rasteriser it will be more efficient to use a structure of arrays approach. It also makes it easier to SIMDise anything that needs to run on the ARM - quite likely the vertex shaders and primitive setup. It makes reading the vertex array data itself much more efficient anyway since each array type is an outer-loop decision rather than an inner-loop one although could have some poor cache behaviour if that isn't taken into consideration.

The potential size of the various data-structures has made me rethink the approach I was going to take. I was going to load as many triangles into the rasteriser as would fit in LDS and process them all at once. But that just wont be enough to be any use so it will have to cycle through them multiple times for different rendering bands. If I drop it down to just a pair of buffers it simplifies the i/o stuff and doesn't make any real difference to the number of times things are read from the external memory.

Due to memory constraints i'm going to have to split the rasteriser+zbuffer test to one or more cores and the fragment rendering to other cores. Each varying interpolator requires 3 floats per element, or 12 floats per 4-element vector. A simple gourad-shaded polygon will require at least 128 bytes to store everything needed (+some control stuff). I was going to pass the uniforms around with the triangle but they are too big so instead I will pass incremental updates to uniform elements in a processing stream, and the rasteriser will make sure the fragment shaders have the updates when necessary as well.

The queueing will be a bit of a pain to get right, but i'm ignoring it for now.

Rasteriser inner loop

Because of the number of varying elements I had to change the way I was thinking of writing the inner loop. I can't keep everything in registers and will need to load the interpolation addends as I go - this is why the data has to be stored as a structure of arrays because it always getting element 0 of a 3-element vector and that's more efficient to load using dword loads.

Also, since fadd is just as costly as fmadd I can remove the incremental update of the coefficients from always having to be executed - i.e. when a z-buffer test fails. It would also allow me to split the work into a couple of passes - one that generates live pixels, and another which turns them into interpolated fragments which could run branchless. I'm not sure if that is worth it yet though since it doesn't remove the branch.

So a single horizontal scan will be something like the following. This is the biggest case.

  (v0,v1,v2) = edge equation results for start location
  (zw, 1/w)  = interpolants for z/w, 1/w
  (p0-p31)   = interpolants for varying values

  (e0,e1,e2) = update values for edges
  (ez, ew)   = update values for z/w, 1/w

   x = start;
   lastx = start
start:
   in = v0.sign | v1.sign | v2.sign;
   if in
     zt = load zw-buffer[x]
       if (zw > zt)  // i think some of my maths is flipped
         store zw-buffer[x] = zw
         // use delta of X so that 3-instruction fmadd can be used!
         xdiff = x - lastx;
         lastx = x;

         ; imagine this is all magically scheduled properly ...

         1/w += xdiff * ew

         dload ep0-ep1
         p0 += xdiff * ep0
         p1 += xdiff * ep1
         dload ep2-ep3
         p2 += xdiff * ep2
         p3 += xdiff * ep3
         .. etc

         write target fragment record
         write 1/w
         write z/w
         // (i don't think i need v0-v3)
         write p0-p31
      fi
  fi
  v0 += e0;
  v1 += e1;
  v2 += e2;
  zw += ez;
while !done

I'm still experimenting with ways to work out how 'start' and '!done' are implemented. I've got a bunch of different algorithms although I need to get it on-hardware to get more accurate timing information. The stuff in ATTILA uses a sort of flood-fill algorithm but for various reasons I don't want to use that. I have a pretty reliable way to get both the start and end point of a line using a binary search, ... but its costly if run every line (it's still not perfect either but I think that's down to float being a shit :-/).

My current winner uses a 8x8 search every 8 lines and then uses that as the bounds. And that's even with a broken test which often over-runs (i've subsequently found the maths that work but haven't included it yet).

For small triangles just using the bounding box is pretty much as good as it gets, so may suffice.

Removing the loop branches

Actually it might be possible to have 2 branchless loops ...

  (v0,v1,v2) = edge equation results for start location
  (zw)       = interpolant for z/w,

  tmp        = room for hit locations
loop1 (over bounds):
  in = v0.sign | v1.sign | v2.sign;
  zt = read zwbuffer[x]
  in &= (zt - zw).sign
  zt = cmove(in, zw)
  write zwbuffer[x] = zt
  tmp[0] = x  
  tmp = cmove(in, tmp+1)

All branch-free. At the end, tmp points to the end of an array which includes the X values of all visible pixels.

The second loop just interpolates everything for the x locations stored in the tmp array.

  lastx = 0
loop2 (over tmp-tmp.start):
  load x
  increment everything by (x-lastx)
  output
  lastx = x

As with a previous post about getting good fmadd performance, it should be just about possible to dual-issue every fmadd and remove all stalls. All writes are written directly to the fragment shader core for that row?

Requires working memory for the X coordinates though; but that should be doable since this is all the core will be running.

However, having said all that; the fragment output from a single row can more than exceed the fragment processor memory so they need to be throttled. Back to loops and branches I guess but it was an interesting puzzle.

As one might tell i'm not really that concerned with getting a working solution, more in investigating the ways to fit the peculiarities of the hardware. There's just a lot to write even if i've got a fairly good idea of a workable solution and had all the components necessary :-/

Or maybe ...

Just do everything in a single-loop on the same core?

Yes maybe, hmm, dunno now. Really depends on if there's enough memory for it to fit and whether loading the triangle Nx times vs 1x times is slower than all the comms overheads. Unless the mesh is saturated; I would say no chance.

It's much easier and probably the first step to performance analysis anyway.

I guess I was more lost in dots than I realised when I made the subject (it was for the pic).

Tagged graphics, hacking, parallella.
Tuesday, 29 July 2014, 04:34

better edges

Been poking more at the renderer. I have a fairly clean set of routines now that can be used to implement most of what I need: generating fragments, interpolating arbitrary parameters, handling a depth test.

And well, just running it and looking at the pretty pictures ...

The Z/W buffer stretched to 8-bit:

Add a few more, use a different 'fragment shader', ... polygonal shade-bobs.

I'm still just running stuff in Java on my workstation but i've been doing a lot of experimenting with micro-optimisations an trying to work out how i'm going to split the work up.

My current idea is to process the scene by complete scan-lines rather than tiles. This greatly simplifies the looping and address arithmetic but would not be very efficient for texture lookups. I'm basing the following on a loose understanding of a gpu pipeline but erring on the side of epu-efficiency.

I'm not sure where the first and second passes will reside, and for small triangles they will possibly run in the same loop. It may make sense to place some of the decision work on the arm so as to reduce the need for multiple passes over the triangles for each 'run'. In any event several rows will be processed at once since there will be several epus dedicated as fragment/row processors.

It's interesting that due to the FPU and branch latency you're forced to consider SIMD style algorithms for efficiency in a lot of cases. i.e. unrolling loops, branchless logic, re-arranging processing so decisions are taken elsewhere. I'm still juggling the code around to find out what works best on-paper although I should start poking on-board soon.

That better edge test

I'm also surprised at just how little code is required for the inner loop. Only 3 flops and 4 ialu ops (+branch) are required to implement a length-bounded 'inside triangle run' test.

This is an example of scanning right looking for the first set pixel in the triangle.

   ; v0, v1, v2 = edge equation value (already calculated)
   ; e0, e1, e1 = edge equation x coefficient
   ; i          = length left
1: and   code,v0,v1    ; negative test for v0 & v1
   fadd  v0,v0,e0
   and   code,code,v2  ; negative test for v0 & v1 & v2
   fadd  v1,v1,e1
   sub   i,i,#1
   fadd  v2,v2,e2
   orr   code,code,i   ; negative test for i
   bgte  1b 

5 cycles + branch = 8. Can be unrolled.

This is the equivalent of:

  do {
   code = (v0 < 0 && v1 < 0 && v2 < 0) || (i < 0));
   v0 += e0;
   v1 += e1;
   v2 += e2;
   i -= 1;
  } while (code == 0);

Rather than move the negative bit somewhere else as in the previous post I found just leaving them in-place reduces the instruction count. This also lets me combine float and integer comparisons in the same way.

A similar calculation can be used as the exit condition for the fragment generating loop. As each parameter is interpolated using a single addition within the loop it makes for a very simple loop. The only "problem" is that the calculations are so simple that branches become more expensive.

Tagged graphics, hacking, java.
Sunday, 27 July 2014, 05:40

edges

Just been playing with edge equations this morning.

Here it's recursively determining the fill area of the triangle, red is no content, green is all fill, blue is partial fill. Dunno how useful it is in this form but it looks nifty.

If the 3 edge equation results for each corner of a tile are turned into bits then the equations for each case are simple bit logic.

        int ec0 = edgeCode(e, x0, y0);
        int ec1 = edgeCode(e, x0 + tsize, y0);
        int ec2 = edgeCode(e, x0, y0 + tsize);
        int ec3 = edgeCode(e, x0 + tsize, y0 + tsize);

        int and = ec0 & ec1 & ec2 & ec3;
        int orr = ec0 | ec1 | ec2 | ec3;


        all_filled = and == 7;
        all_empty = orr != 7;

Rather than rely on floating point compare (aka subtract) which adds further latency to the calculation and thus cannot be directly pipelined, I form form the edgeCode directly using integer arithmetic.

public static int edgeCode(float[] e, float x, float y) {
        float v0 = x * e[0] + y * e[1] + e[2];
        float v1 = x * e[3] + y * e[4] + e[5];
        float v2 = x * e[6] + y * e[7] + e[8];
        
        int c0 = Float.floatToRawIntBits(v0);
        int c1 = Float.floatToRawIntBits(v1);
        int c2 = Float.floatToRawIntBits(v2);

        return (c0 >>> 31) | ((c1 >>> 31) << 1) | ((c2 >>> 31) << 2);
}

(>>> is a LSR op in Java).

Since epiphany (and most decent ISAs) share float and int registers the above is going to translate directly into clean machine code. This stuff might need to live on the ARM too and is SIMDable.

Actually there's a bunch of optimisations possible that reduce that instruction count and if using power-of two tile sizes and fixed-point arithmetic everything can be reduced to simple integer addition; but I haven't explored that yet.

Obviously this is working toward toward one important requirement: the renderer will have to tile to take advantage of the LDS, and it also needs pipelineable/simdable algorithms. But that's enough for this weekend, things to do ...

Tagged graphics, hacking, parallella.
Saturday, 26 July 2014, 14:23

rasterise

After thinking about the old C64 and Amiga games I thought i'd look into something I used to play with back then but haven't really touched properly in a long time: 'vector gfx'.

Since the parallella doesn't have a gpu it leaves it to software.

I looked into how hardware does it presently and it seems to be down to the technique described in Triangle Scan Conversion using 2D Homogeneous Coordinates so that's what I looked at. I got a basic 2d half-space triangle rasteriser going quickly but wanted a quicker solution for something more capable (and I couldn't find some of the references on the net) so did a hunt and came across the ATTILA project which has all the bits needed. I've only done a cursory scan for what i was interested in right now but I expect i'll becoming quite acquainted with it should i continue working on this for any time.

I extracted some of the low-level bits from it, set up some vertex handling and matrix code and ended up with a very basic solid-colour rasteriser for a 'hello cube' demo:

Definitely not going to break any speed records but it does run at full-frame rate on this pc even if it's only flat shaded. (it's java+javafx on a kaveri pc).

Getting this stuff working on the epiphany will be ... well interesting I guess.

Ahah, I just sussed out the parameter interpolation, an important bit I needed before looking at epiphany code.

Tagged graphics, hacking, parallella.
Newer Posts | Older Posts
Copyright (C) 2019 Michael Zucchi, All Rights Reserved. Powered by gcc & me!