More thoughts on the software code cache
So I veered off on another tangent ... just how to implement the software instruction cache from the previous post.
Came up with a few 'interesting' preliminary ideas although it definitely needs more work.
function linkage table
Each function call is routed through a function linkage table (flt) entry. Each entry is a 16-byte record which contains both code and data:
;; assembly format, some d* macros define the offset of the given type
dstruct
	dint32	flt_move	; the move instruction
	dint32	flt_branch	; the branch instruction
	dint16	flt_section	; section address (or index?)
	dint16	flt_offset	; instruction offset
	dint16	flt_reserved0	; padding
	dint16	flt_next	; list of records in this section
dend	flt_sizeof
The move instruction loads the index of the function itself into the scratch register r12 and then branches either to a function which loads the section containing the code, or to a function which handles the call itself. The runtime manages this. Unfortunately, due to nesting it can't just invoke the target function directly; the data fields are what the runtime uses to do its job.
section table
Each section table entry is also 16 bytes long and keeps track of the current base address of the section:
dstruct
	dint16	sc_base		; loaded base address or ~0 if not loaded
	dint16	sc_flt		; function list for this section
	dint16	sc_size		; code size
	dint16	sc_reserved0
	dint16	sc_next		; LRU list
	dint16	sc_prev
	dint32	sc_code		; global code location
dend	sc_sizeof
section header
To avoid needing an auxiliary sort at garbage-collection time, the sections loaded into memory also carry a small header, 8 bytes long (for alignment).
dstruct
	dint32	sh_section	; the section record this belongs to
	dint32	sh_size		; the size of the whole section + data
dend	sh_sizeof
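For reference, the three records can be mirrored in C. This is a sketch assuming the d* macros pack fields with no padding; the struct names are mine, but the field names and widths come from the definitions above:

```c
#include <stdint.h>

struct flt {                    /* function linkage table entry, 16 bytes */
	uint32_t flt_move;      /* the move instruction */
	uint32_t flt_branch;    /* the branch instruction */
	uint16_t flt_section;   /* section address (or index?) */
	uint16_t flt_offset;    /* instruction offset */
	uint16_t flt_reserved0; /* padding */
	uint16_t flt_next;      /* list of records in this section */
};

struct section {                /* section table entry, 16 bytes */
	uint16_t sc_base;       /* loaded base address or ~0 if not loaded */
	uint16_t sc_flt;        /* function list for this section */
	uint16_t sc_size;       /* code size */
	uint16_t sc_reserved0;
	uint16_t sc_next;       /* LRU list */
	uint16_t sc_prev;
	uint32_t sc_code;       /* global code location */
};

struct sh {                     /* in-memory section header, 8 bytes */
	uint32_t sh_section;    /* the section record this belongs to */
	uint32_t sh_size;       /* the size of the whole section + data */
};

/* the sizes matter: the runtime indexes and aligns on them */
_Static_assert(sizeof(struct flt) == 16, "flt record is 16 bytes");
_Static_assert(sizeof(struct section) == 16, "section record is 16 bytes");
_Static_assert(sizeof(struct sh) == 8, "section header is 8 bytes");
```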
Function calls, hit
If the section is loaded ("cache hit"), flt_branch points to a function which bounces to the actual function call and, more importantly, makes sure the calling function is loaded in memory before returning to it; that is the tricky bit.
Approximate algorithm:
rsp = return stack pointer
ssp = section pointer

docall(function, lr)
	; save return address and current section
	rsp.push(lr - ssp.sc_base);
	rsp.push(ssp);
	; get section and calc entry point
	ssp = function.flt_section
	entry = ssp.sc_base + function.flt_offset
	; set rebound handler
	lr = docall_return
	; call function
	goto entry

docall_return()
	; restore ssp
	ssp = rsp.pull();
	; if still here, return to it (possibly different location)
	if (ssp.sc_base != ~0) {
		lr = rsp.pull() + ssp.sc_base;
		goto lr;
	}
	; must load in section
	load_section(ssp)
	; return to it
	lr = rsp.pull() + ssp.sc_base;
	goto lr;
I think I have everything there. It's fairly straightforward if a little involved.
If the section is the same it could avoid most of the work, but the linker won't generate such code unless the code uses function pointers. The function loaded by the stub (the flt record) might just be an id (supporting 65K functions) or an address (i.e. all on-core functions).
I have a preliminary shot at it which adds about 22 cycles to each cross-section function call in the case the section is present.
Function calls, miss
If the section for the function is not loaded, then the branch will instead point to a stub which loads the section first before basically invoking docall() itself.
doload(function, lr)
	; save return address and current section
	rsp.push(lr - ssp.sc_base);
	rsp.push(ssp);
	; load in section
	ssp = function.flt_section
	load_section(ssp);
	; calculate function entry
	entry = ssp.sc_base + function.flt_offset
	; set rebound handler (same)
	lr = docall_return
	; call function
	goto entry
load_section() is where the fun starts.
Cache management
So I thought of a couple of ways to manage the cache but settled on a solution which uses garbage collection and movable objects. This ensures every possible byte is available for function use, and I'm pretty certain it will take less code to implement.
This is where the sh struct comes into play - the cache needs both an LRU-ordered list and a memory-location-ordered list, and this is the easiest way to implement it.
Anyway, I've written up a bit of C to test the idea and I'm pretty sure it's sound. It's fairly long but each step is simple. I'm using AmigaOS/exec linked lists as they('re cool and) fit this type of problem well.
loader_ctx {
	list loaded;
	list unloaded;
	int alloc;
	int limit;
	void *code;
} ctx;

load_section(ssp) {
	needed = ctx.limit - ssp.sc_size - sizeof(sh);
	if (ctx.alloc > needed) {
		; not enough room - garbage collect based on LRU order
		; scan LRU ordered list for sections which still fit
		used = 0;
		wnode = ctx.loaded.head;
		nnode = wnode.next;
		while (nnode) {
			nused = wnode.sc_size + used + sizeof(sh);
			if (nused > needed)
				break;
			used = nused;
			wnode = nnode;
			nnode = nnode.next;
		}
		; mark the rest as removed
		while (nnode) {
			wnode.sc_base = -1;
			;; fix all entry points to "doload"
			unload_functions(wnode.sc_flt);
			wnode.remove();
			ctx.unloaded.addhead(wnode);
			wnode = nnode;
			nnode = nnode.next;
		}
		; compact anything left, in address order
		src = 0;
		dst = 0;
		while (dst < used) {
			sh = ctx.code + src;
			wnode = sh.sh_section;
			size = sh.sh_size;
			; has been expunged, skip it
			if (wnode.sc_base == -1) {
				src += size;
				continue;
			}
			; move if necessary
			if (src != dst) {
				memmove(ctx.code + dst, ctx.code + src, size);
				wnode.sc_base = dst + sizeof(sh);
			}
			src += size;
			dst += size;
		}
		; all free space is now at the end
		ctx.alloc = dst;
	}

	; load in new section
	;; create section header
	sh = ctx.code + ctx.alloc;
	sh.sh_section = ssp;
	sh.sh_size = ssp.sc_size + sizeof(sh);
	;; allocate section memory
	ssp.sc_base = ctx.alloc + sizeof(sh);
	ctx.alloc += ssp.sc_size + sizeof(sh);
	;; copy in code from global shared memory
	memcpy(ctx.code + ssp.sc_base, ssp.sc_code, ssp.sc_size);
	;; fix all entry points to "docall"
	load_functions(ssp.sc_flt);
	;; move to loaded list
	ssp.remove();
	ctx.loaded.addhead(ssp);
}
The last couple of lines could also be run at each function call to keep the section LRU list exact, which is probably worth the extra overhead. Because the LRU order is only used to decide what to expunge, and the memory order is used for packing, it doesn't seem to need to move sections very often - which is obviously desirable. It might look like a lot of code, but considering this is all that is required in totality it isn't that much.
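That per-call LRU touch is just an unlink and an add-to-head. A minimal C sketch, assuming exec-style list primitives (all the names here are mine, not from the real implementation):

```c
#include <stddef.h>

struct node { struct node *next, *prev; };
struct list { struct node *head, *tail, *tailpred; };

/* exec-style remove: no list header needed */
static void node_remove(struct node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
}

static void list_addhead(struct list *l, struct node *n)
{
	n->next = l->head;
	n->prev = (struct node *)&l->head;
	l->head->prev = n;
	l->head = n;
}

/* sc_node is hypothetical shorthand for the sc_next/sc_prev pair */
struct section { struct node sc_node; /* ... other fields ... */ };

/* keep the LRU order exact: on every cross-section call, move the
 * target section to the head of the loaded list */
static void lru_touch(struct list *loaded, struct section *ssp)
{
	node_remove(&ssp->sc_node);
	list_addhead(loaded, &ssp->sc_node);
}
```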
The functions load_functions() and unload_functions() just set a synthesised branch instruction in the function stubs as appropriate.
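A sketch of what that patching might look like in C, assuming the flt records are chained by index through flt_next with 0xffff as the terminator (an assumption); make_branch() is a placeholder for synthesising the real Epiphany pc-relative branch encoding, which I won't reproduce here:

```c
#include <stdint.h>

struct flt {                    /* function linkage table entry, 16 bytes */
	uint32_t flt_move;
	uint32_t flt_branch;
	uint16_t flt_section;
	uint16_t flt_offset;
	uint16_t flt_reserved0;
	uint16_t flt_next;      /* index of next record, 0xffff terminates */
};

/* placeholder: a real version packs the displacement into the
 * Epiphany branch instruction format */
static uint32_t make_branch(uint32_t from, uint32_t to)
{
	return to - from;
}

/* walk a section's flt chain and point every stub at the given
 * handler; load_functions() would pass docall, unload_functions()
 * doload */
static void set_functions(struct flt *table, uint16_t first,
                          uint32_t table_base, uint32_t handler)
{
	for (uint16_t i = first; i != 0xffff; i = table[i].flt_next) {
		/* address of the flt_branch slot of this stub */
		uint32_t pc = table_base + i * sizeof(struct flt) + 4;

		table[i].flt_branch = make_branch(pc, handler);
	}
}
```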
Worth it?
Dunno, and it's quite a lot of work to try it out - all the code above basically needs to be in assembly language for various reasons. And the loader needs to create all the data structures as well: the flt table, the section table, and the section blocks themselves. And ... there needs to be some more relocation work done if the sections use relative relocs (i.e. -mshort-calls) when they are loaded or moved - not that this is particularly onerous, mind you.
AmigaOS exec Lists
The basic list operations for an exec list are always efficient but turn out to be particularly compact in Epiphany code, provided the object is guaranteed to be 8-byte aligned - which it should be due to the ABI.
For example, node.remove() is only 3 instructions:
	; r0 = node pointer
	ldrd	r2,[r0]		; n.next, n.prev
	str	r2,[r3]		; n.prev.next = n.next
	str	r3,[r2,#1]	; n.next.prev = n.prev
The good thing about exec lists is that they don't need the list header to remove a node due to some (possibly) modern-c-breaking address-aliasing tricks, but asm has no such problems.
list.addHead() is only 4 instructions if you choose your registers wisely (5 otherwise).
	; r3 = list pointer
	; r0 = node pointer
	ldr	r2,[r3]		; l.head
	strd	r2,[r0]		; n.next = l.head, n.prev = &l.head
	str	r0,[r2,#1]	; l.head.prev = n
	str	r0,[r3]		; l.head = n
By design, &l.head == l.
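In C the same structure looks like the following sketch (field and function names are mine); the overlapped header is what lets remove() work without knowing which list the node is on:

```c
#include <stddef.h>

struct node { struct node *next, *prev; };

/* exec-style list header: &l->head acts as a node whose next is the
 * first element, and &l->tail as a node whose prev is the last
 * element; tail itself is always NULL and terminates iteration */
struct list { struct node *head, *tail, *tailpred; };

static void list_init(struct list *l)
{
	l->head = (struct node *)&l->tail;
	l->tail = NULL;
	l->tailpred = (struct node *)&l->head;
}

/* the sentinels make both ends uniform, so no list header and no
 * special cases are needed */
static void node_remove(struct node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
}

static void list_addhead(struct list *l, struct node *n)
{
	n->next = l->head;
	n->prev = (struct node *)&l->head;
	l->head->prev = n;
	l->head = n;
}
```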
Unfortunately the m68k list-iteration trick of loading through d0 (which sets the condition codes directly) doesn't translate quite as nicely, but it's still better than using a container list and still only uses 2 registers for the loop:
	; iterate whole list
	; r2 = list header
	ldr	r0,[r2]		; wn = head (work node)
	ldr	r1,[r0]		; nn = wn.next (next node)
	sub	r1,r1,0		; while (nn) {
	beq	2f
1:
	; r0 is node pointer and free to be removed from list/reused
	; node pointer is also one's 'data' so doesn't need an additional de-reference
	mov	r0,r1		; wn = nn
	ldr	r1,[r1]		; nn = nn.next
	sub	r1,r1,0
	bne	1b
2:
Again the implementation has a hidden bonus in that the working node can be removed and moved to another list or freed without changing the loop construct or requiring additional registers.
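A C rendering of that pattern, a sketch assuming an exec-style list whose sentinel terminates with a NULL next (the names are mine): nn is fetched before the loop body runs, so the working node can be unlinked with no extra bookkeeping.

```c
#include <stddef.h>

struct node { struct node *next, *prev; };

/* a section record as seen by the walk; the embedded node is
 * hypothetical shorthand for the sc_next/sc_prev pair */
struct section_node {
	struct node n;
	int expunged;
};

/* walk the list, unlinking expunged sections as we go (the same
 * address-aliasing trick as the asm, with the usual C caveats) */
static int prune(struct node *head)
{
	struct node *wn, *nn;
	int removed = 0;

	for (wn = head; (nn = wn->next) != NULL; wn = nn) {
		struct section_node *s = (struct section_node *)wn;

		if (s->expunged) {
			wn->prev->next = wn->next;
			wn->next->prev = wn->prev;
			removed++;
		}
	}
	return removed;
}
```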
For comparison, the same operation in m68k assembly (devpac syntax might be off, it's been 20 years):
	; a0 = list pointer
	move.l	(a0),a1		; wn = l.head (work node)
	move.l	(a1),d0		; nn = wn.next (next node)
	beq.s	2f		; while (nn) {
.loop:
	; a1 is node pointer, free to be removed from list/reused
	move.l	d0,a1		; wn = nn
	move.l	(a1),d0		; nn = wn.next
	bne.s	.loop
2:
See lists.c from puppybits for a C implementation.