EPU ELF loader, reloc, etc.
I've come to the point where I need to start looking at placing different code on different EPUs and having them talk to each other via on-chip writes ...
But the current SDK is a bit clunky here. One basically has to write custom linker scripts, custom loader calls, and then manually link various bits together either with custom compile-time steps or manual linking (or even hardcoded absolute addresses).
So ... I've been looking into writing my own loader which will take care of some of the issues:
- Allow symbolic lookup from host code, à la OpenGL ES;
- Allow standard C resolution of symbols across cores;
- Allow multi-core code to be loaded into different topologies/different hardware setups automatically.
Symbolic lookup
This is relatively straightforward and just involves a bit of poking around the ELF file. Since ELF is designed for this kind of thing, it takes very little code in the simple case.
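As a rough idea of what the "simple case" can look like, here is a minimal sketch of a by-name lookup over an ELF32 image that has already been read into host memory. It assumes the `Elf32_*` types and `SHT_SYMTAB` from glibc's `<elf.h>`, a well-formed file whose byte order matches the host, and no hash table (just a linear scan). The function name `elf32_find_symbol` is made up for illustration and is not part of any SDK.

```c
#include <elf.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Find a named symbol in an ELF32 image held in memory.
 * Returns a pointer into the image, or NULL if the name is not present.
 * Assumption: the image is well formed; no bounds checking is done. */
const Elf32_Sym *elf32_find_symbol(const void *image, const char *name)
{
    const uint8_t *base = image;
    const Elf32_Ehdr *eh = image;
    const Elf32_Shdr *sh = (const Elf32_Shdr *)(base + eh->e_shoff);

    for (int i = 0; i < eh->e_shnum; i++) {
        if (sh[i].sh_type != SHT_SYMTAB)
            continue;

        /* sh_link of a symbol table names its string table section. */
        const Elf32_Sym *sym = (const Elf32_Sym *)(base + sh[i].sh_offset);
        const char *str = (const char *)(base + sh[sh[i].sh_link].sh_offset);
        size_t count = sh[i].sh_size / sizeof(Elf32_Sym);

        for (size_t j = 0; j < count; j++)
            if (strcmp(str + sym[j].st_name, name) == 0)
                return &sym[j];
    }
    return NULL;
}
```

The returned `st_value` is still a link-time address, so host code would have to translate it to wherever the loader actually placed that core before reading or writing it.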
Cross-core symbols
Fortunately the linker can do most of this; I just need a linker script, but one that can be shared across multiple implementations.
My idea is to have the linker work against "virtual" cores which are simply 1MB apart in the address space. Section attributes can place code or data blocks into individual cores or shared memory or TLS blocks.
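To make the "1MB apart" idea concrete, here is a small sketch of the address arithmetic. The assumptions are mine: virtual core n is linked at n * 1MB, virtual cores are laid out row-major across a work-group that starts at mesh position (row0, col0), and a physical global address is formed as row << 26 | col << 20 | offset. That last layout is my understanding of Epiphany addressing and should be checked against the architecture reference. All names are hypothetical.

```c
#include <stdint.h>

#define VCORE_SHIFT 20u                        /* 1MB per virtual core */
#define VCORE_MASK  ((1u << VCORE_SHIFT) - 1u)

/* Which virtual core does a link-time address fall in? */
static inline unsigned vcore_index(uint32_t vaddr)
{
    return vaddr >> VCORE_SHIFT;
}

/* Offset of the address inside that core's 1MB window. */
static inline uint32_t vcore_offset(uint32_t vaddr)
{
    return vaddr & VCORE_MASK;
}

/* Assumed physical layout: 6-bit row, 6-bit column, 20-bit offset. */
static inline uint32_t core_global_base(unsigned row, unsigned col)
{
    return ((uint32_t)row << 26) | ((uint32_t)col << 20);
}

/* Translate a link-time address to a physical one for a work-group
 * that starts at (row0, col0) and is `cols` cores wide. */
uint32_t vaddr_to_global(uint32_t vaddr, unsigned row0, unsigned col0,
                         unsigned cols)
{
    unsigned n = vcore_index(vaddr);
    unsigned row = row0 + n / cols;
    unsigned col = col0 + n % cols;
    return core_global_base(row, col) | vcore_offset(vaddr);
}
```

With a 4-wide work-group at (32, 8), an address in virtual core 3 lands on the core at (32, 11), and one in virtual core 5 wraps to (33, 9).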
Relocating loader
Because the cores are "virtual", the loader can then re-arrange them to suit the target topology and/or work-group. I'm going to rely on the linker producing relocatable output so I'm able to do this - basically it retains the relocation entries (the reloc hunks) in the final binary.
I'm not relying on position independent code for this - and actually that would just make life harder.
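A sketch of what the load-time fix-up might look like, under heavy assumptions: the relocation record here is a made-up, simplified struct (just the offset of a 32-bit word holding an absolute link-time address), whereas a real loader would have to decode the toolchain's actual relocation types, which can be split across instruction fields. `core_base[]` is a hypothetical table of physical base addresses chosen for each virtual core.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified relocation record: byte offset within the
 * loaded section of a 32-bit word holding an absolute link-time address. */
struct reloc32 {
    uint32_t offset;
};

/* Swap a virtual core's 1MB window for the physical base picked for
 * that core, keeping the low 20 bits. Addresses outside the virtual
 * cores (e.g. external memory) are left untouched. */
static uint32_t remap(uint32_t vaddr, const uint32_t *core_base, size_t ncores)
{
    size_t n = vaddr >> 20;
    if (n >= ncores)
        return vaddr;
    return core_base[n] | (vaddr & 0xFFFFFu);
}

/* Patch every relocated word in a section buffer.
 * Returns the number of words patched; out-of-range records are skipped. */
size_t apply_relocs(uint8_t *section, size_t size,
                    const struct reloc32 *rel, size_t nrel,
                    const uint32_t *core_base, size_t ncores)
{
    size_t done = 0;
    for (size_t i = 0; i < nrel; i++) {
        uint32_t word;
        if (rel[i].offset > size || size - rel[i].offset < sizeof word)
            continue;
        /* memcpy avoids unaligned access on the host. */
        memcpy(&word, section + rel[i].offset, sizeof word);
        word = remap(word, core_base, ncores);
        memcpy(section + rel[i].offset, &word, sizeof word);
        done++;
    }
    return done;
}
```

This is also why position independent code doesn't help: the interesting addresses are absolute references into other cores, and those need rewriting whichever way the local code is addressed.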
Linker too?
The problem is that the linker is going to spew if I try to put the same code into local blocks on different cores ... you know, simple things like maths routines that really need to be local to the core. The alternative is to build a different complete binary for each core ... but then you're stuck with no way to automatically resolve addresses across cores and you're back where you started.
So it's going to have to get a lot more involved than just a simple load and reloc.
I'm just hoping I can somehow leverage/trick the linker into creating a single executable that has most of the linking work done, and then I'm able to finish it off at runtime without having to do everything. Perhaps just duplicate all the sections common to all cores and then relocate and link in the per-core blocks.
Hmm, I think I need to think about this a bit more first.
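One possible reading of the "duplicate the common sections" idea, purely as a sketch: the shared code is linked once into an extra window of its own, and when the loader relocates for a given core, any address in that window resolves to that core's own local copy, while addresses in per-core windows resolve to whichever core owns them. The `struct layout` and `resolve_for_core` names are invented here, and it assumes the linker script keeps the common window's offsets clear of the per-core blocks.

```c
#include <stddef.h>
#include <stdint.h>

#define WINDOW_SHIFT 20u
#define WINDOW_MASK  0xFFFFFu

/* Hypothetical layout description handed to the loader. */
struct layout {
    uint32_t common_window;     /* index of the 1MB window holding shared code */
    const uint32_t *core_base;  /* physical base address per virtual core */
    size_t ncores;
};

/* Resolve a link-time address as seen from core `self`. */
uint32_t resolve_for_core(const struct layout *l, uint32_t vaddr, size_t self)
{
    uint32_t window = vaddr >> WINDOW_SHIFT;
    uint32_t offset = vaddr & WINDOW_MASK;

    if (window == l->common_window)
        return l->core_base[self] | offset;    /* this core's own copy */
    if (window < l->ncores)
        return l->core_base[window] | offset;  /* cross-core reference */
    return vaddr;                              /* outside the cores */
}
```

The cost is that the relocation pass has to run once per core over the duplicated sections, since the same word resolves differently depending on who is asking.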