short hsa on kaveri note

I finally got around to upgrading to the latest hsa kernel and so far it seems as though hsa now continues to work after a suspend. Well i've suspended and resumed the machine once and the one of the aparapi-lambda tests continues to function with hsail.

Although this machine reboots rather fast it was just an inconvenience that pretty much put a stop to me playing around with hsa. I think there were other reasons too actually but I guess it's one less at least.

It's the 'v0.5' tag of git://people.freedesktop.org/~gabbayo/linux which I built from source rather than use the ubuntu images in Linux-HSA-Drivers-And-Images-AMD (although i grabbed the firmwares from there).

...

I've got 1 day left on my current bit of work, then another break, and then i'm not sure what i'm going to do. There is pretty much perpetually on-going 'bits of work' if I want it but i'm considering not continuing for various reasons (and have been for a while). Have to weigh that against trying to find some interesting work though. Ho hum.

Selfie

Had nothing better to do on my RDO but sit at the pub drinking jugs (translation for americans: rdo=rostered day off, jugs=pitchers). Took the pic with my cheap tablet on the front-facing camera so it's a bit ordinary but the subject matter hardly requires better ...

A very unseasonably warm 25C+ late autumn day, bloody wonderful. Bit of eye candy but I was well occupied reminiscing about happier timers and better company. Many people I knew/know and miss due to trans-state movements, emigration, and other. All the rest were busy today - work, pah.

Everyone moved on but I just got old and fat and grey. Yeah it's the same shirt I wore out in Bangalore but i've been working on the chops (may as well look the part if i'm going to be eccentric). Must be a bit of the Celtic blood showing itself there.

An untitled whine.

Maven is the wrong tool

So i've been working on a team project for the last few weeks. The people are decent but the processes are insane and the tool and technology choices are ... odd.

One of the tools is of course maven, which for some baffling reason has become quite popular in the java world. It's really a pretty horrid bit of software - very slow, unreliable (essentially you're forced to 'make clean; make all' every time and even then that doesn't guarantee a repeatable build), obtuse error messages which don't explain what's going on, plugin system with terrible discovery, shitty documentation, a dependency system which creates more problems than it solves. Like many java technologies it seems to have been created to sell books, training, and 'enterprise' versions rather than solve an existing need; and yet somehow created an evangelical following of essentially the blind leading the blind.

But you know, a lot of software has faults - although a build system which cannot create repeatable (incremental!!) builds is less than worthless in my mind. I've only been there 3x4-day weeks and probably spent 2 days waiting for things to run which makes it a very expensive bit of junk (as far as I can tell, so has every other developer).

However the primary problem with maven is that like git it is a configuration management tool.

But unlike git which can at least be (ab)used to become a developer tool, maven just doesn't have the required facilities to start with.

Subversion wtf?

I thought i'd see if the HSA stuff had been updated so I started updating my local checkouts yesterday. Apart from git being a bit of a git, subversion seems to have added 'interactive' bullshit now. So rather than just dump conflict markers in a file and show them it prompts you for what to do.

Yeah, no thanks.

Not sure I could be bothered building it right now anyway even if it wasn't still checking out the c++ snot from gcc.

Atom, wtf?

I came across a link to the "next generation" editor from github - Atom. Really pretty baffling why they're even working on some shitty editor, using 'web' technologies and then turning it into a macintosh application.

But really what surprised me most was the number of people saying things like "it's slow and it's buggy and even though it's exactly like some existing programmers editor I paid for, i'm going to be using this from now on".

What? Why? Do they think there'll be some sort of gold-rush on plugins? Why would they be willing to interfere with their productivity just to be part of something new?

Amazing the strange decisions people make based on hype.

And people still buy text editors? There's dozens of free ones out there and many many of them are particularly excellent. Maybe it's a macintosh thing. Like on the Amiga little developers can make niche products and get people to buy them.

Apparently the buzz is because it's extensible using an 'easy to use' language, whilst still being pretty and newbie friendly. Yeah I dunno, javascript is pretty shit when used as an app language.

Agile, the antonym of development methodologies

I've worked out how to treat everything that comes out of 'Agile' development. Just put a not in-front of each and every word of jargon from the agile bible (again another load of bullshit created to sell self-help books).

It's not a sprint, it's not a backlog item, and it's certainly not fucking agile in any dictionary-sense-of-the-word.

One can only imaging the authors of the 'best selling' book on the subject erroneously used the antonym section of the thesaurus when trying to generate their silly jargon rather than the synonym one.

A death in the family

Samantha finally died ... or what was left of her.

This is her second frame and the frame cracked in the same place - along the rear weld of the reinforcing strut behind the bottom bracket. The rear wheel is also new (the original just needs a couple of spokes and a re-true). I replaced all the cables when I changed the frame last time. The seat post needed changing when I changed frames, both brakes ended up being replaced (the springs on the arms broke off). At least one new chain - maybe two, and another rear cog cluster. So yeah, not much left of the original bike: front wheel, derailers, levers, handlebars, pedals, and forks.

This was originally a Specialized mountain bike I bought in Boston when I was working for Ximian in 2000, so i got a pretty good run out of it, and the replacement frame was only $150 (old but new stock). It's my shopping cart and commuter so even if i'm not doing as much distance as I am right now it's always had a regular weekly use and occasional heavy loads (like 10kg bags of flour, 25kg bags of potting mix, or a late-night dinky).

I have a Wheeler I bought even earlier (1998 or 1999?) which has been sitting outside with a broken rear wheel (fucking idiot at the bike shop on The Parade in Norwood over-tightened the spokes so much they pulled out of the rim - avoid, they're snobby wankers anyway) so over Easter I did a bit of a recondition on that. Gave it a good clean, sanded and touched up all the corrosion spots in the paintwork, cleaned and oiled the chain, then moved over the wheels, mudguards, pannier rack, lights, bottle cage, pump, and saddle from Sam.

The brand new rear wheel by this time was a little loose so I did a re-true on that and replaced a missing spoke on the front wheel and re-trued that (at least, as best I could), re-aligned the rear derailer and so on.

The spigot nut slips inside it's plastic case so I didn't really tighten the front wheel enough so that started working loose after a week. The axle was quite bent when I put it on (i'm not sure how that happened) so then the following weekend I decided to try to fix that as well - took out the bearing and re-greased, hammered the bent axle slightly straighter on a piece of railway line I use as a small anvil (trying to without damaging the thread), cleaned it up a bit and put it back on really tight.

Considering I haven't done any maintenance for years that isn't so bad, although I'm sure after all that something else is going to break next.

The Wheeler frame is built like a tank - oversized tubing and heavy reinforcing on the main welds. Unlike the other two frames there is no flexing whatsoever when I stand on the pedals; so that should last a few years yet. I mostly used the Specialized once I came back from Boston so it probably only had a couple of years use so far (it seemed lighter than the wheeler even though the wheeler had some better components).

Ahh I finally remembered the name I had for the Wheeler - Michelle. Just a running joke with some mates to name my bikes with female names.

Sewing Machine

I've been doing some very early riding of late, and together with a #2 haircut, a very well ventilated helmet, and some cold mornings ... it can get pretty chilly on the forehead (and the wingnuts). Cold here is around 10 degrees.

I was trying to find something I could use to keep some of the wind off and I came across an old knee warmer. The elastic is gone and the other knee got lost when I lent it to someone so it was convenient that it fits (if a bit snugly) over my head and ears. So I de-stitched the elastic (all 4 rows of strong polyester - certainly built well, I think it was an Australian made one) and brought out the sewing machine to just sew up the edge (hem?).

The machine just didn't seem to want to catch a stitch so I ended up pulling the whole bottom mechanism apart and trying to work out what was going on. In the process I finally learnt the magic that makes them work - I always wondered how the top-fed cotton was looped around the bottom-fed cotton. The mechanism is deceptively simple yet relies on precise alignment and some engineering of the needle shape. As the needle reaches it's lowest point, a rotating slot goes past and picks up the cotton that is sitting next to it and loops it around the bobbin feed. It relies on the needle being at the right place at the right time to fairly close tolerances and the flexibility of the cotton itself.

So after taking apart what I could and realising that nothing at all is adjustable I did a bit of an oil and put it back together and then I realised that the rotating spindle around the bobbin-cage has to be time-aligned with the rest of rotating parts ... but after some adjustments I got that back in the right spot.

Still didn't fucking work! If anything it was dropping even more stitches.

But then I looked closer at the needle I was using - and it had an oddly thin shape at the bottom I guessed it wasn't close enough to the rotating slot for the cotton to be picked up reliably. So I found another one, slotted it in, ... and away it went. I can't remember why I put that needle in now but it's clearly not made for this machine and at least i'll know where to look next time. It was with a box of shit the machine came with and must've come from another machine.

Actually because the material is so stretchy the stitching is hard to get right anyway, but I repaired some other things that needed sewing up and made a morning of it. After i'd cut it in half I realised I could have made a pretty creepy balaclava out of it but perhaps it's better I didn't ;-)

I think the machine is as old as me, Singer 248. It's built like something out of the steam age - the main body and platform is one big cast iron block and the drive shaft and cam-driven rockers are seemingly over-sized steel rods. It's got holes for pedal-driven belt but runs off a small motor. I was just about to throw it out so i'm pleased to find it was just a needle and I had an interesting morning learning how it worked. I couldn't see why it wouldn't be working otherwise - none of the parts seem particularly worn and it doesn't look like it's been used a lot.

Not that I have much to sew. Maybe I can make a slip-on cover for an e-book or tablet with the remainder of the knee warmer.

Update: Ok, so i got that wrong: i ran out of the cotton i was using and when i changed it it stopped catching again. I also found the adjustment screws and cams for the needle positioning. In the end the main problem was the bobbin cage timing, knew I should've recorded it's position before I took the belt off. After a few tries I managed to get it to catch on both sides of a zig-zag stitch although the stretchy material and needle were other factors. Fiddly, but I guess it occupied an afternoon ...

Cold and Wet

Another shitty cold wet weekend.

Probably wont bother rebuilding the HSA tools or doing any hacking (work and some poor sleep is completely wearing me out, and my eyes need a rest), might clean the kitchen, cook something, watch some footy if it's on and maybe crack a red.

I've only got a couple more weeks to go on this (really short) contract and then i'm not sure what's going to happen. I think the plan that others made was for me to continue on it for a few years but i've stated i'm not willing to work on the project further - although the location, technology, and team organisation is a contributing factor, there is a more serious and unrelated reason I can't mention.

I could survive on a few months/year for on-going work on previous tasks but whether that is acceptable is not up to me.

With the federal government seemingly hell-bent on turning Australia into some Victorian-era upstairs/downstairs nightmare of total class warfare (I just can't ... just can't .... fathom where they're headed with their current ideologically driven nonsense: a ruling class which takes all of the nation's wealth and works the underclass till death?) the whole work situation around here is only going to get tighter. Not that there is much call for parallel/performance programmers around here that I can tell.

zZed is listening to ...

Turrican_2/mdat.intro_and_title on (manual) loop.

I had to compile UADE from source - the audacious plugin needs to be disabled since it only supports version 2.0. I couldn't get the TMFX plugin for xmms to work (although i was impressed that slackware still includes xmms).

But yeah ... just sitting back with a couple of G&T's and reminiscing about 'good times' ...

Computers were so much more fun back in the early 90s. Weekend hacking was trying to see how few scan-lines you could get a MOD player to execute in or see how many bobs you could get on the screen within a frame - not writing a fucking ELF loader. Actually those 'good times' were pretty lonely and miserable too; maybe that's why I return whenever I'm again feeling shit.

I plugged one of the small amps I got into my workstation and hooked it up to some desktop speaker modules I acquired at some point to try them out (some sony vaio ones, they're decent enough). I thought the remote on my amp wasn't working so I took it apart and had a look. Oh dear the soldering and component mounting was surprisingly shithouse - some of the pins were nearly touching and the circuit board was covered in splattered solder. I reflowed the main amp IC and a few other pins and added some extra jumpers for the main output signals, did a bit of a clean-up, (the tracks get very thin in places) and made sure the power socket was properly soldered in (it was well short of enough solder). In the end I realised the remote only controls the MP3 module anyway so it was pretty much a pointless exercise apart from a bit of soldering practice.

Last night I finally started playing Dragons Dogma after downloading it from PS+ a few months ago. After Oblivion and Dragon Age it's character creation system is actually pretty decent - people look like people; and the cut-scenes almost look pre-rendered even with your custom character in them. Well pretty people anyway, and if i'm going to look at them for a few days solid that what i'd rather be looking at even if you mostly just see their backs and awkward gait. If I keep on playing it anyway; just can't seem to get much into games for the last couple of years but occasionally something keeps me busy for a while.

Had a weird and disturbing dream though. One of the "trying to run but feet can't get any purchase" variety which is pretty much a recurring nightmare of the last 20 years from my body trying to wake me up either from sleep apnoea or overheating. At least this one was dodging meteorites which was something new and novel, although by the end my mobility was so impaired I was reduced to crawling and still not being able to make forward progress (an alarmingly unpleasant experience, real or not). The Armageddon themed ones are always the most interesting although waking up from them is still horrible.

Ran out of Tonic, trying to find a quaffer in the cellar which isn't off ... they're all off so far. Damn.

A kernel runtime

Yesterday I had to get out of the house but in the morning I had a little bit of a play trying to work out how to implement a kernel runtime. That is one that loads small self-contained routines in sequence which may work on parts or a whole problem at a time. One feature I want to support is the ability to have persistent local storage so that multiple waves of code might work on the same buffers before exporting the results off-core - it may not end up being practical but I'm hoping it will be useful.

At first I thought of trying to embed multiple kernels in the same binary but that doesn't really work - things like the library functions are coalesced across all kernels at best and everything needs a unique name which is just a bit of a pain to work with. And it basically just devolved into using overlays which requires a linker script and are just generally a bit involved.

Whilst I was out 'doing stuff' I came up with another solution which I think solves the usability problem and at this stage I think should support everything I want.

A runtime executive (the exec) provides some global functions which are shared amongst all cores, possibly all of libezecore (e-lib);
Each kernel is compiled into a standalone binary which includes any ezecore functions it needs that aren't in the exec, it has no startup routine;
Persistent data structures are placed in special sections such as .data.kernel. The normal mechanism of using weak symbols is to to reference them from other kernels/cores;
At runtime, the full set of interacting kernels must be supplied for a given workgroup.

This last point breaks the queuing abstraction somewhat but I don't see any practical alternative because the total persistent data space needs to be known in advance. Well unless an alternative mechanism is used such as supplying a separate data object as a base elf file.

At load-link time there should be enough information to assign unique locations for any shared buffers and then relocate the kernel code to a common location (or multiple) which the exec can DMA in as required.

Hmm, well I guess I see if it will work when I get around to coding something up.

a better (debugging) printf

As a bit of a side-mission I thought i'd have a poke at the printf problem on epiphany. Using it drags in a whole pile of floating point snot and stdio and it completely blows out the text space so it wont fit on an epu.

#include <stdio.h>

int main(int argc, char **ragv) {
 printf("test! %f\n", 1.0);
}

$ e-gcc e-test.c
$ e-size a.out
   text    data     bss     dec     hex filename
  42282    2304      88   44674    ae82 a.out

The only way to use it is to drop it into the external memory which has some performance issues. The performance issues aren't critical for a debugging function but even then e-hal doesn't install a listener so it doesn't actually work.

So the solution i'm going to use ... just use a stub and dump the printf data to a queue which can then be processed by eze-host. The stub can hopefully be small enough to fit into LDS and the queuing provides implicit buffering that should let the code run fast enough to debug most problems.

The problem here comes in that printf is a varargs call and varargs by definition doesn't know how many arguments it has ... so the stub needs to parse the format string and marshall the data out to another structure and then the host has to interpret this.

Fortunately for me the varargs format on both epiphany and arm appear to be the same, and whats more, 'va_list' is just a simple pointer which simplifies the host processing considerably. Or appears to be from some investigations.

So the approach is basically:

ezecore:

void
ez_printf(const char *__restrict fmt, ...) {
   va_list ap;

   va_start(fmt, ap);

   ... scan fmt and work out how big the argument list is
   ...  copy any strings referenced and change the pointer

   ... allocate a queue slot and copy all data there
  
   va_end(ap);
}

Some of the more esoteric features of printf just don't need to be supported like output parameters and so the work is just some marshalling based on the fmt string. This still takes a bit of code so i have to try to shrink it as much as possible whilst not losing any important functionality. I think long double can go for instance.

The c compiler has already promoted every argument to 4 or 8 (or 16?) bytes long, and if it wasn't for strings it could just memcpy the va_list once it knew how long it was.

ezehost:

... process syscall queue, find printf:

int (*aprintf)(const char *fmt, uint32_t *a) = (int(*)(const char *, uint32_t *))vprintf;

void do_printf(const char *fmt, uint32_t *args) {
 aprintf(fmt, args);
}

This last 'hack' of just rewriting the argument types is wildly unportable but it works on arm with gcc. Without it you're basically forced to write your own printf or do some deeper (also non-portable) poking since there's no way to create a va_list portably in code.

One last pain is that that varags abi promotes all floats to doubles. So just using varargs with floats drags in 800 bytes or so of code to perform float to double conversion. I wrote a much smaller one that wont be fully standards compliant but should suffice for debugging purposes (well, probably).

Update: Hmm, I played with it and I dunno, parsing the format and using the va_arg stuff is still a bit of code. Probably acceptable; ... but

I guess two other alternatives are available:

Use a trap and move all the processing to the host.

Blocks but the code-size is absolutely minimal, just a stub which calls a stub which is a trap;
Just copy a fixed-sized block of the ap across and have that as a known limitation. Along with demanding that any %s argument strings have to be in .shared sections.

Better performance but the limitations are error prone.

There's always something ...

Update 2: So I had a look at this today ... and basically I decided to just go with the trap version because it's the smallest bit of code both on-core and on-host (and TBH, it's the first thing I got working and it's not interesting enough to keep investigating further.)

Because of a couple of decisions it turned out to be pretty easy actually. All I do is proxy it straight to the host and because the varargs abi is identical between the two cpus I don't even have to do any argument rewriting.

Rather than fudge around with tricking C to do what I want the epu stub is much easier just implemented in assembly language. The varargs call is like any other - it just gets the first 4 arguments in registers and the rest are stored on the stack. I could just handle this on the host but it's easier if I just convert it to an array on-epu first and then trap directly to the host.

// LGPL3
        ;; c-prototype
        ;; void ez_dprintf(const char *fmt, ...);

        ;; stores r1-r3 on the stack and changes r1 to point to it

        .balign 4
_ez_dprintf:
        strd    r2,[sp],#-2
        str     r1,[sp,#3]
        add     r1,sp,#12

        ;; r0 = fmt
        ;; r1 = args
        trap    #16
        
        add     sp,sp,#16
        rts
        export  _ez_dprintf

On the host code I launch a monitor thread which polls the DEBUGSTATUS register (unfortunately polling is the only possibility right now). This turns to 1 when the core is halted from a trap instruction (and a couple of others). At this point the host is free to peek and poke pretty much anything on the core including all registers so it just looks up r0 and r1 and then invokes vprintf directly using the type-rewriting trick mentioned above.

// LGPL3
// note this is not using the epiphany sdk
// runs in a polling loop:
        uint status = ee_read_reg(dev, r, c, E_REG_DEBUGSTATUS);

        if (status & 1) {
                e_core_t *ecore = ee_get_core(dev, r, c);
                int pc = ee_read_reg(dev, r, c, E_REG_PC);
                unsigned short insn = ((unsigned short *)(ecore->mems.base + pc))[-1];

                // check for trap instruction
                if ((insn & 0x3ff) == 0x3e2) {
                        // check trap code, there are 6 bits for the code
                        switch (insn>>10) {
                        case 16: { // dprintf
                                char *efmt = ez_host_addr(wg, ecore->mems.base,
                                                          ee_read_reg(dev, r, c, E_REG_R0));
                                uint *eargs = ez_host_addr(wg, ecore->mems.base,
                                                           ee_read_reg(dev, r, c, E_REG_R1));

                                aprintf(efmt, eargs);
                                break;
                        }
                        }
                }
                e_resume(dev, r, c);
        }

Now ... the next trick is to make sure string arguments are in the right location. I modified the eze-loader to just put all constant strings (SHF_STRINGS in the section header flags) into the shared external memory block and ezehost puts this as the same virtual address location for both the host and the epiphany so at least in most cases it "just works". This may have side-effects although i can't see why an epu should be working with constant strings locally in the general case and particularly for debugging statements it's nice that it takes as little memory as possible on-core (added bonus: the host-accesses don't need to go across the mesh either).

To use a %s specifier on non-constant strings the caller currently has to just make sure the buffer is in the shared memory block. If necessary the host code can be modified to fix any string addresses by parsing the format.

The buffering version I started with is probably still worth looking into but this is a start. Problems with this include the latency of the call and the fact that it behaves almost like a barrier in practice and thus interferes with the running state.

If I re-try the size test:

float f = 12.0f;
int main(void) {
        ez_dprintf("test! %f\n", f);
}

$ e-size e-test-printf.elf
   text    data     bss     dec     hex filename
    208       4       0     212      d4 e-test-printf.elf

I moved the float to a variable to force the double conversion at runtime, which uses my lightweight (non-compliant?) implementation. The whole on-core overhead including the double converter is only 90 bytes.

More thoughts on the software code cache

So I veered off on another tangent ... just how to implement the software instruction cache from the previous post.

Came up with a few 'interesting' preliminary ideas although it definitely needs more work.

function linkage table

Each function call is mapped to a function linkage table. This is a 16-byte record which contains some code and data:

 ;; assembly format, some d* macros define the offset of the given type

        dstruct
        dint32  flt_move        ; the move instruction
        dint32  flt_branch      ; the branch instruction
        dint16  flt_section     ; section address (or index?)
        dint16  flt_offset      ; instruction offset
        dint16  flt_reserved0   ; padding
        dint16  flt_next        ; list of records in this section
        dend    flt_sizeof

The move instruction loads the index of the function itself into scratch register r12 and then either branches to a function which loads the section that contains the code, or a function which handles the function call itself. The runtime manages this. Unfortunately due to nesting it can't just invoke the target function directly. The data is needed to implement the functionality.

section table

The section table is also 16 bytes long and keeps track of the current base address of the section

        dstruct
        dint16  sc_base         ; loaded base address or ~0 if not loaded
        dint16  sc_flt          ; function list for this section
        dint16  sc_size         ; code size
        dint16  sc_reserved0
        dint16  sc_next         ; LRU list
        dint16  sc_prev
        dint32  sc_code         ; global code location
        dend    sc_sizeof

section header

To avoid the need of an auxiliary sort required at garbage collection time, the sections loaded into memory also have a small header of 8 bytes (for alignment).

        dstruct
        dint32  sh_section      ; the section record this belongs to
        dint32  sh_size         ; the size of the whole section + data
        dend    sh_sizeof

Function calls, hit

If the section is loaded ("cache hit") flt_branch points to a function which bounces to the actual function call and more importantly makes sure the calling function is loaded in memory before returning to it, which is the tricky bit.

Approximate algorithm:

rsp = return stack pointer
ssp = section pointer

docall(function, lr)
   ; save return address and current section
   rsp.push((lr-ssp.sc_base));
   rsp.push(ssp);

   ; get section and calc entry point   
   ssp = function.flt_section
   entry = ssp.sc_base + function.flt_offset

   ; set rebound handler
   lr = docall_return

   ; call function
   goto entry

docall_return()
   ; restore ssp
   ssp = rsp.pull();

   ; if still here, return to it (possibly different location)
   if (ssp.sc_base) {
      lr = rsp.pull() + ssp.sc_base;
      goto lr;
   }

   ; must load in section
   load_section(ssp)

   ; return to it
   lr = rsp.pull() + ssp.sc_base;
   goto lr;

I think I have everything there. It's fairly straightforward if a little involved.

If the section is the same it could avoid most of the work but the linker wont generate such code unless the code uses function pointers. The function loaded by the stub (flt record) might just be an id (support 65K functions) or an address (i.e. all on-core functions).

I have a preliminary shot at it which adds about 22 cycles to each cross-section function call in the case the section is present.

Function calls, miss

If the section for the function is not loaded, then the branch will instead point to a stub which loads the section first before basically invoking docall() itself.

doload(function, lr)
   ; save return address and current section
   rsp.push((lr-ssp));
   rsp.push(ssp);

   ; load in section
   ssp = function.flt_section
   load_section(ssp);

   ; calculate function entry
   entry = ssp.sc_base + function.flt_offset

   ; set rebound handler (same)
   lr = docall_return

   ; call function
   goto entry

load_section() is where the fun starts.

Cache management

So I thought of a couple of ways to manage the cache but settled on a solution which uses garbage collection and movable objects. This ensures every byte possible is available for function use and i'm pretty certain will take less code to implement.

This is where the sh struct comes in to play - the cache needs both an LRU ordered list and a memory-location ordered list and this is the easiest way to implement it.

Anyway i've written up a bit of C to test the idea and i'm pretty sure it's sound. It's fairly long but each step is simple. I'm using AmigOS/exec linked lists as they('re cool and) fit this type of problem well.

loader_ctx {
  list loaded;
  list unloaded;
  int alloc;
  int limit;
  void *code;
} ctx;

load_section(ssp) {
   needed = ctx.limit - ssp.sc_size - sizeof(sh);

   if (ctx.alloc > needed) {
      ; not enough room - garbage collect based on LRU order

      ; scan LRU ordered list for sections which still fit
      used = 0;
      wnode = ctx.loaded.head;
      nnode = wnode.next;
      while (nnode) {
         nused = wnode.sc_size + used + sizeof(sh);

         if (nused > needed)
            break;

         used = nused;

         wnode = nnode;
         nnode = nnode.next;
      }

      ; mark the rest as removed
      while (nnode) {
         wnode.sc_base = -1;

         ;; fix all entry points to "doload"
         unload_functions(wnode.sc_flt);

         wnode.remove();
         ctx.unloaded.addhead(wnode);

         wnode = nnode;
         nnode = nnode.next;
      }

      ; compact anything left, in address order
      src = 0;
      dst = 0;
      while (dst < used) {
         sh = ctx.code + src;
         wnode = sh.sh_section;
         size = sh.sh_size;

         ; has been expunged, skip it
         if (wnode.sc_base == -1) {
            src += size;
            continue;
         }

         ; move if necessary
         if (src != dst) {
            memmove(ctx.code + dst, ctx.code + src, size);
            wnode.sc_base = dst + sizeof(sh);
         }

         src += size;
         dst += size;
      }
   }

   ; load in new section
   ;; create section header
   sh = ctx.code + ctx.alloc;
   sh.sh_section = ssp;
   sh.sh_size = ssp.size + sizeof(sh);

   ;; allocate section memory
   ssp.sc_base = ctx.alloc + sizeof(sh);
   ctx.alloc += ssp.size + sizeof(sh);

   ;; copy in code from global shared memory
   memcpy(ssp.sc_base, ssp.sc_code, ssp.sc_size);

   ;; fix all entry points to "docall"
   load_functions(ssp.sc_flt);

   ;; move to loaded list
   ssp.remove();
   ctx.loaded.addhead(ssp);   
}

The last couple of lines could also be used at each function call to ensure the section LRU list is correct, which is probably worth the extra overhead. Because the LRU order is only used to decide what to expunge and the memory order is used for packing it doesn't seem to need to move functions very often - which is obviously desirable. It might look like a lot of code but considering this is all that is required in totality it isn't that much.

The functions load_functions() and unload_functions() just set a synthesised branch instruction in the function stubs as appropriate.

Worth it?

Dunno and It's quite a lot of work to try it out - all the code above basically needs to be in assembly language for various reasons. And the loader needs to create all the data-structures needed as well, which is the flt table, the section table, and the section blocks themselves. And ... there needs to be some more relocation stuff done if the sections use relative relocs (i.e. -mshort-calls) when they are loaded or moved - not that this is particularly onerous mind you.

AmigaOS exec Lists

The basic list operations for an exec list are always efficient but turn out to be particularly compact in epiphany code, if the object is guaranteed to be 8-byte aligned which it should be due to the abi.

For example, node.remove() is only 3 instructions:

; r0 = node pointer
        ldrd    r2,[r0]         ; n.next, n.prev
        str     r2,[r3]         ; n.prev.next = n.next
        str     r3,[r2,#1]      ; n.next.prev = n.prev

The good thing about exec lists is that they don't need the list header to remove a node due to some (possibly) modern-c-breaking address-aliasing tricks, but asm has no such problems.

list.addTail() is only 4 instructions if you choose your registers wisely (5 otherwise).

; r3 = list pointer
; r0 = node pointer
        ldr     r2,[r3]         ; l.head
        strd    r2,[r0]         ; n.next = l.head, n.prev = &l.head
        str     r0,[r2,#1]      ; l.head.prev = n
        str     r0,[r3]         ; l.head = n

By design, &l.head == l.

Unfortunately the list iteration trick of going through d0 loads (which set the cc codes directly) on m68K doesn't translate quite as nicely, but it's still better than using a container list and still only uses 2 registers for the loop:

; iterate whole list
; r2 = list header
        ldr     r0,[r2]         ; wnhead  (work node)
        ldr     r1,[r0]         ; nn = wn.next (next node)
        sub     r1,r1,0         ; while (nn) {
        beq     2f
1:
        ; r0 is node pointer and free to be removed from list/reused
        ; node pointer is also ones 'data' so doesn't need an additional de-reference

        mov     r0,r1           ; wn = nn
        ldr     r1,[r1]         ; nn = nn.next
        sub     r1,r1,0
        bne     1b
2:

Again the implementation has a hidden bonus in that the working node can be removed and moved to another list or freed without changing the loop construct or requiring additional registers.

For comparison, the same operation using m68K (devpac syntax might be out, been 20 years):

; a0 = list pointer
        mov.l   (a0),a1         ; wn = l.head  (work node)
        mov.l   (a1),d0         ; nn = wn.next (next node)
        beq.s    2f             ; while (nn) {
.loop:
        ; a1 is node pointer, free to be removed from list/reused

        mov.l   d0,a1           ; wn = nn
        mov.l   (a1),d0         ; nn = wn.next
        bne.s   .loop
2:

See lists.c from puppybits for a C implementation.

software instruction cache

So I was posting on the parallella forums and had an idea for something to investigate in the future.

Basically have the ezesdk loader create an automatic software code cache based on sections.

The thinking is this:

The relocatable elf files contain RELOC hunks for all function calls outside of the current section (and inside if they are not compiled with -mshort-calls);
These need to be resolved by the loader anyway;
The loader could point these anywhere - including to a global relocation table which included loader-generated entries;
The global relocation table could call a stub which loads the code anywhere in memory because the code is now relocatable (since every function call goes via the stubs).

Well, it's a little more complex than that because return addresses from the stack also need to make sure they track the current location of the caller and if it needs to be re-loaded into the cache. Still off the top of my head, ... this may be possible. For example each stub could track the current callee module in a separate return stack which can be updated should any section be unloaded or relocated. Interrupts would need special handling.

Idea needs to stew.

I haven't had the energy to do much but a 0.0 release of 'ezesdk' isn't too far off. I did the license headers, updated the readme from elf-loader, and tweaked the makefiles a few times. Still playing with some of the apis too.

About Me

Tags