some notes

Some waking-up thoughts to jot down for later. It's too nice to be inside today. I have stuff I should be doing but I'm a little immobile after hurting my foot again, so I might just sit in the sun drinking. I thought it was better and over-tested it last weekend - and I wasn't even drinking :(. I can get around OK - it just doesn't heal at all if I don't rest it enough.

Expand the state machine on the controller and make all DMA asynchronous - currently some is not.
Use both DMA channels in the controller asynchronously, probably 0 for external reads and 1 for internal writes.
A single non-chained DMA request can load up to two distinct objects if they are within 32K of each other (why oh why weren't the strides 32-bit!), halving the DMA request rate.
Implement an asynchronous queue primitive combining eport + remote queue + async DMA, or another primitive if it simplifies execution. For example, yesterday I managed to avoid the need for a double DMA to write the "dma complete" status on a variable-length record: I just wrote the record backwards!
A controller for the line-mode rasteriser should significantly address the source bandwidth problem.
Investigate synchronising the write stage to take advantage of the improved write performance of the memory interface. This probably needs to run asynchronously via interrupt code.
The rasteriser can avoid the need to calculate the edge equations per-pixel if it knows a given region is fully in-range. This reduces the inner loop by 3 flops and 3 iops but is hard to take full advantage of due to flop latency.
The C compiler is doing a fair job of the rasteriser loop but is still making a bit of a pig's breakfast of the address calculation. I think the 21 instructions can be reduced to 15, but to be effective I need to convert a large swathe of code to assembly.
Use edge equations to accurately index the tiles.
Investigate optimised renderers for special cases (for particles?). Flat shading reduces the fragment processor to a simple write (or alpha blend). Flat shading + disabled z buffer + full rectangle reduces to a simple rectangular write, etc.
Investigate primitive synthesis on-core. e.g. particles.
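
To make the "fully in-range" idea concrete, here is a rough sketch in C. It is not the actual rasteriser code, just an illustration under a few assumptions of mine: edge equations have the form a*x + b*y + c, the vertices are ordered so that "inside" means non-negative on all three edges, and samples are taken at integer pixel coordinates. The names edge_t, tri_edges, tile_classify and tile_coverage are made up for the example. Because each edge equation is linear, testing it at the four corners of a tile is enough to decide whether the whole tile is inside, outside, or needs per-pixel tests.

```c
#include <assert.h>

/* Hypothetical edge equation: e(x,y) = a*x + b*y + c, inside when e >= 0. */
typedef struct { float a, b, c; } edge_t;

typedef enum { TILE_OUTSIDE, TILE_PARTIAL, TILE_INSIDE } tile_class_t;

/* Edge through (x0,y0) -> (x1,y1); points to the left evaluate positive. */
static edge_t edge_make(float x0, float y0, float x1, float y1)
{
    edge_t e = { y0 - y1, x1 - x0, x0 * y1 - x1 * y0 };
    return e;
}

static float edge_eval(edge_t e, float x, float y)
{
    return e.a * x + e.b * y + e.c;
}

/* Build the three edges of a triangle (assumed ordered so inside is >= 0). */
static void tri_edges(edge_t e[3], float x0, float y0,
                      float x1, float y1, float x2, float y2)
{
    e[0] = edge_make(x0, y0, x1, y1);
    e[1] = edge_make(x1, y1, x2, y2);
    e[2] = edge_make(x2, y2, x0, y0);
}

/* Classify a w*h tile of pixels whose first pixel is (x0,y0). */
static tile_class_t tile_classify(const edge_t e[3], int x0, int y0, int w, int h)
{
    assert(w > 0 && h > 0);
    float xa = (float)x0, xb = (float)(x0 + w - 1);
    float ya = (float)y0, yb = (float)(y0 + h - 1);
    int full = 0;

    for (int i = 0; i < 3; i++) {
        int in = (edge_eval(e[i], xa, ya) >= 0) + (edge_eval(e[i], xb, ya) >= 0)
               + (edge_eval(e[i], xa, yb) >= 0) + (edge_eval(e[i], xb, yb) >= 0);
        if (in == 0)
            return TILE_OUTSIDE;   /* all corners fail one edge: nothing to draw */
        if (in == 4)
            full++;                /* whole tile passes this edge */
    }
    return full == 3 ? TILE_INSIDE : TILE_PARTIAL;
}

/* Count covered pixels; *edge_tests counts pixels that needed per-pixel tests. */
static int tile_coverage(const edge_t e[3], int x0, int y0, int w, int h,
                         int *edge_tests)
{
    tile_class_t c = tile_classify(e, x0, y0, w, h);
    if (c == TILE_OUTSIDE)
        return 0;
    if (c == TILE_INSIDE)
        return w * h;              /* fast path: no edge equations in the loop */

    int covered = 0;
    for (int y = y0; y < y0 + h; y++) {
        for (int x = x0; x < x0 + w; x++) {
            (*edge_tests)++;
            if (edge_eval(e[0], (float)x, (float)y) >= 0 &&
                edge_eval(e[1], (float)x, (float)y) >= 0 &&
                edge_eval(e[2], (float)x, (float)y) >= 0)
                covered++;
        }
    }
    return covered;
}
```

A real inner loop would step the edge values incrementally rather than re-evaluating them for every pixel; the full evaluation here is only to keep the sketch short. The same corner classification could also be one way to index which tiles a triangle touches.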

I had some further thoughts on yesterday's results. Even though it's half the speed of the line renderer, considering the complexity of the interactions and the forced extra read/write cycle across another core for each fragment, it's probably actually fairly good. The main bottleneck seems to be the mismatch between rasterisation time and fragment rendering time, which has nothing to do with the architecture. The fragment shaders are only trivial 3-term colour interpolations, though; if they were more complex, shifting the rasterisation to another core would leave more time for them to execute. So I will still hook it up to the GL frontend to test it, along with other backends which can use the same or similar controller setup.

That said, given the possibility of other highly optimised special cases, I think a combined implementation will still be the ultimate target.
