bummer, that bandwidth thing again
I did some profiling by clearing the framebuffer directly from the epiphany.
Short summary:
- Using DMA takes about 20% longer than using the CPU (it is still useful, though, because it can run asynchronously to the CPU).
- Using int writes is 1/2 the speed of long writes (but this is a known feature of the design).
- Writing sequential addresses is about 2x faster than not (again: known).
- Writing with one core is 2x faster than with multiple cores (due to last point, known again).
- Trying to get the compiler to do a long write is a pain. volatile seems to do it for this case.
- Using hardware loops was nearly 10% faster than not, but you need 16 instructions in the loop (for easier loop count setup), which makes it too bulky. I don't understand the reason, because I can add a couple of nops before it makes any difference to the execution time; it must be something to do with the write-to-mesh pipeline mechanism.
- A simple C loop on the ARM writing to the memory-mapped framebuffer using int is about 5x faster than the epiphany.
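As a rough illustration of the "long writes to sequential addresses" points above, here is a minimal sketch of a clear loop in C. The name `clear_fb64` and its parameters are made up for this example and are not the code that was profiled. It assumes the buffer is 8-byte aligned and that, as observed above, the volatile qualifier is enough to get one wide store per element; the compiler is not required to do that on every target.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: fill nbytes at fb with a repeating 64-bit pattern.
 * Stores go to ascending addresses, one 64-bit element at a time.
 * Assumptions: fb is 8-byte aligned and nbytes is a multiple of 8.
 */
static void clear_fb64(void *fb, size_t nbytes, uint64_t pattern)
{
    /* Check the stated preconditions before touching memory. */
    assert(((uintptr_t)fb & 7u) == 0);
    assert((nbytes & 7u) == 0);

    /* volatile stops the compiler from dropping or merging the stores. */
    volatile uint64_t *dst = (volatile uint64_t *)fb;
    size_t count = nbytes / sizeof(uint64_t);

    for (size_t i = 0; i < count; i++)
        dst[i] = pattern;
}
```

A pattern of 0 clears to black whatever the pixel format is; any other colour on a packed 24-bit display would need a pattern that repeats every 24 bytes, which this sketch does not handle.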
My screen is 1280x1024, 24-bit. I don't know if that can be configured to fewer bits, as I have no serial console and that's just how it starts up (at least it's not widescreen).
It's only getting about 60 fps. I know it's not the case, but it appears as if the memory transfers are somehow synchronised with the framebuffer DMA. At any rate, they're running at the same speed.
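To put a number on that, here is a small back-of-the-envelope calculation in C. The helper names are hypothetical, and the bytes-per-pixel value is an assumption: 3 if the 24-bit mode is packed, 4 if each pixel is padded to 32 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: bytes in one full frame. */
static uint64_t frame_bytes(uint32_t width, uint32_t height,
                            uint32_t bytes_per_pixel)
{
    assert(bytes_per_pixel > 0);
    return (uint64_t)width * height * bytes_per_pixel;
}

/* Hypothetical helper: bytes written per second at a given frame rate. */
static uint64_t bytes_per_second(uint64_t frame, uint32_t fps)
{
    return frame * fps;
}

/*
 * 1280 x 1024 at 3 bytes per pixel is 3,932,160 bytes per frame,
 * which at 60 fps is 235,929,600 bytes per second.
 * At 4 bytes per pixel it is 5,242,880 bytes per frame,
 * which at 60 fps is 314,572,800 bytes per second.
 */
```

So a full-screen clear at about 60 fps works out to roughly 236 MB/s if the pixels are packed, or about 315 MB/s if they are padded to 32 bits.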