async copies
I had a play with this idea far too late into the night last night. A dma memcpy is a fine idea and has some benefits, but it doesn't really take advantage of the hardware features, nor does it provide much opportunity to deal with some of its problems.
You can resort to manual dma, but that is pretty fiddly and bulky code-wise, and it's such a common operation that it makes sense to have it available in the runtime. And once it's in the runtime, the runtime itself can use it too.
My current thoughts on the api and implementation follow.
    // in case an int isn't enough at some point
    typedef unsigned int ez_async_t;

    // Enqueue 1d copy
    ez_async_t ez_async_memcpy(void *dst, void *src, size_t size);

    // Enqueue 2d copy
    ez_async_t ez_async_memcpy2d(void *dst, size_t dstride,
                                 void *src, size_t sstride,
                                 size_t width, size_t height);

    // Wait till dma is done
    void ez_async_wait(ez_async_t aid);

    // Query completion status
    int ez_async_complete(ez_async_t aid);
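To make the 2d parameters concrete, here is a synchronous reference for what I'd expect the 2d call to move. It is only a sketch: ref_memcpy2d is a made-up name, and treating width and both strides as byte counts is an assumption, not something the api above pins down.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical synchronous reference for the 2d copy semantics.
 * Assumption: width, dstride and sstride are all in bytes.
 * Copies `height` rows of `width` bytes; consecutive rows start
 * `sstride` bytes apart in src and `dstride` bytes apart in dst. */
static void ref_memcpy2d(void *dst, size_t dstride,
                         const void *src, size_t sstride,
                         size_t width, size_t height)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Rows must not overlap within either buffer */
    assert(width <= dstride && width <= sstride);

    for (size_t y = 0; y < height; y++) {
        memcpy(d, s, width);
        d += dstride;
        s += sstride;
    }
}
```

Pulling a 3x2 byte tile out of an 8-byte-wide image into a packed buffer would then be ref_memcpy2d(tile, 3, image + 8 + 2, 8, 3, 2).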
To drive it I have a short cyclic queue of dma header blocks which are written to by the memcpy functions. If the queue is empty when the memcpy is invoked it fires off the dma.
The interesting stuff happens when the queue is not empty, i.e. a dma is still outstanding. All the memcpy does in that case is write the queue block, increment the head index and return. Ok, that's not really the interesting bit; the interesting bit is how the work is picked up. The dma-complete interrupt routine checks whether the queue still has entries and, if it does, starts the next one.
If the queue is full then the memcpy calls just wait until a slot is free.
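The queue behaviour described above can be modelled on a host machine without any dma hardware. This is a rough sketch only: every name here (model_queue, model_enqueue, model_isr) is invented for illustration, jobs are plain ints instead of dma descriptors, and a full queue returns -1 where the real code would spin.

```c
#include <assert.h>

#define QUEUE_SIZE 4u              /* must be a power of two */

struct model_queue {
    unsigned int head;             /* next slot to fill */
    unsigned int tail;             /* slot of the job in flight */
    int jobs[QUEUE_SIZE];          /* stand-in for dma descriptors */
    int running;                   /* job the "hardware" is on, -1 if idle */
};

static void model_init(struct model_queue *q)
{
    assert((QUEUE_SIZE & (QUEUE_SIZE - 1)) == 0);
    q->head = q->tail = 0;
    q->running = -1;
}

/* Returns the job id, or -1 where the real code would wait for a slot */
static int model_enqueue(struct model_queue *q, int job)
{
    if (q->head - q->tail >= QUEUE_SIZE)
        return -1;

    unsigned int id = q->head;
    q->jobs[id & (QUEUE_SIZE - 1)] = job;
    q->head = id + 1;

    /* Queue was empty, so nothing will trigger the isr: start now */
    if (id == q->tail)
        q->running = job;

    return (int)id;
}

/* What the dma-complete interrupt does: retire one job, start the next */
static void model_isr(struct model_queue *q)
{
    q->tail++;
    if (q->head != q->tail)
        q->running = q->jobs[q->tail & (QUEUE_SIZE - 1)];
    else
        q->running = -1;
}
```

Only the first enqueue into an empty queue touches the hardware; every later job is started from the interrupt, which is what keeps the enqueue path so cheap.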
I haven't implemented chaining. It should be possible, but it might not make a practical difference and it adds quite a bit of complication. Message mode is probably something more important to consider though. This api would use channel 1, leaving channel 0 for application use (or other runtime functions with higher priority), although if it was flexible enough there shouldn't be a need for manual dma, and perhaps the api could support both channels. It would then need two separate queues and isrs to handle them, and would force all code to use these interfaces (it could always take an externally supplied dma request too).
This is the current (untested again) interrupt handler. I managed to get all calculations into only 3 registers to reduce the interrupt overheads.
    _ez_async_isr:
        ;; make some scratch
        ;; this can cheat the ABI a bit since this isr has full control
        ;; over the machine while it is executing
        strd    r0,[sp,#-2]
        strd    r2,[sp,#-1]

        ;; isr must save status if it does any alu ops
        movfs   r3,status

        ;; Advance tail
        mov     r2,%low(_dma_queue + 8)
        ldrd    r0,[r2,#-1]     ; load head, tail
        add     r1,r1,#1        ; update tail
        str     r1,[r2,#-1]

        ;; Check for empty queue
        sub     r0,r0,r1
        beq     4f

        ;; Calc record address
        mov     r0,#dma_queue_size-1
        and     r0,r0,r1        ; tail & size-1
        lsl     r0,r0,#5        ; << record size
        add     r0,r0,r2        ; &dma_queue.queue[(tail & (size - 1))]

        ;; Form DMA control word
        lsl     r0,r0,#16       ; dmacon = (addr << 16) | startup
        add     r0,r0,#(1<<3)

        ;; Start DMA
        movts   dma1config,r0

    4:
        ;; restore state
        movts   status,r3
        ldrd    r2,[sp,#-1]
        ldrd    r0,[sp,#-2]
        rti
When I wrote this I thought using 32 bytes for each queue record would be a smart idea because it simplifies the addressing, but a multiply by 24 is only 2 more instructions and a scratch register over a multiply by 32, so it might be a liveable expense. The address of the queue array is loaded directly, which saves having to add the offset of .queue when calculating the dma request location.
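For comparison, the two record-offset calculations look like this in C. This is just a sketch of the arithmetic with made-up function names; the actual instruction count depends on what the assembler or compiler makes of it.

```c
#include <assert.h>

/* 32-byte records: a single shift */
static unsigned int offset32(unsigned int index)
{
    return index << 5;
}

/* 24-byte records: 24*i = 16*i + 8*i, so one extra shift, one add,
 * and a scratch register to hold the second partial product */
static unsigned int offset24(unsigned int index)
{
    return (index << 4) + (index << 3);
}

/* Sanity check over a small queue's worth of indices */
static void check_offsets(void)
{
    for (unsigned int i = 0; i < 16; i++) {
        assert(offset32(i) == 32 * i);
        assert(offset24(i) == 24 * i);
    }
}
```

With a 4-entry queue the padding only costs 32 bytes in total, so the choice is really between a little memory and a couple of cycles in the isr.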
The enqueue is simple enough.
    // Pads the dma struct to 32-bytes for easier arithmetic
    struct ez_dma_entry_t {
        ez_dma_desc_t dma;
        int reserved0;
        int reserved1;
    };

    // a power of 2, of course
    #define dma_queue_size 4

    struct ez_dma_queue_t {
        volatile unsigned int head;
        volatile unsigned int tail;
        struct ez_dma_entry_t queue[dma_queue_size];
    };

    struct ez_dma_queue_t dma_queue;

    ...
    uint head = dma_queue.head;
    int empty;

    // Wait until there's room.  The unsigned difference head - tail
    // is the number of queued jobs and stays correct across wrap-around
    // (head - dma_queue_size >= tail would spin forever while head < size).
    while (head - dma_queue.tail >= dma_queue_size)
        ;

    ... dma_queue.queue[head & (dma_queue_size-1)] is setup here

    ez_irq_disable();
    // Enqueue job
    dma_queue.head = head + 1;
    // Check if start needed
    empty = head == dma_queue.tail;
    ez_irq_enable();

    if (empty) {
        // DMA is idle, start it
        ez_dma_start(E_REG_DMA1CONFIG, dma);
    }

    return head;
The code tries to keep the irq-disabled block as short as possible since that's just generally a good idea. Because the only other party touching the queue is the interrupt handler, disabling interrupts here is basically equivalent to taking a mutex to protect a critical section.
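Since the enqueue returns the head index as the job id and the isr bumps tail once per finished job, the completion query and the wait could plausibly be built on nothing but tail. This is a guess at how they might look, not the actual implementation: the model_ names and the model_tail variable are stand-ins for the real queue state.

```c
#include <assert.h>

typedef unsigned int ez_async_t;

/* Stand-in for dma_queue.tail, incremented by the isr per finished job */
static volatile unsigned int model_tail;

/* Job `aid` has finished once tail has moved past it.  Comparing the
 * signed difference rather than the raw values keeps the answer right
 * when the counters wrap (this relies on the usual two's complement
 * conversion from unsigned to int). */
static int model_async_complete(ez_async_t aid)
{
    return (int)(model_tail - aid) > 0;
}

/* Spin until the isr has retired the job */
static void model_async_wait(ez_async_t aid)
{
    while (!model_async_complete(aid))
        ;
}
```

On the real hardware the wait loop would probably want to idle the core rather than spin, but that is a separate question.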