About Me

Michael Zucchi

 B.E. (Comp. Sys. Eng.)

  also known as Zed
  to his mates & enemies!

notzed at gmail >
fosstodon.org/@notzed >

Tags

android (44)
beagle (63)
biographical (104)
blogz (9)
business (1)
code (77)
compilerz (1)
cooking (31)
dez (7)
dusk (31)
esp32 (4)
extensionz (1)
ffts (3)
forth (3)
free software (4)
games (32)
gloat (2)
globalisation (1)
gnu (4)
graphics (16)
gsoc (4)
hacking (459)
haiku (2)
horticulture (10)
house (23)
hsa (6)
humour (7)
imagez (28)
java (231)
java ee (3)
javafx (49)
jjmpeg (81)
junk (3)
kobo (15)
libeze (7)
linux (5)
mediaz (27)
ml (15)
nativez (10)
opencl (120)
os (17)
panamaz (5)
parallella (97)
pdfz (8)
philosophy (26)
picfx (2)
players (1)
playerz (2)
politics (7)
ps3 (12)
puppybits (17)
rants (137)
readerz (8)
rez (1)
socles (36)
termz (3)
videoz (6)
vulkan (3)
wanki (3)
workshop (3)
zcl (4)
zedzone (26)
Wednesday, 02 April 2014, 14:11

async copies

I had a play with this idea far too late into the night last night. A dma memcpy is a fine idea and has some benefits but isn't really taking advantage of the hardware features nor providing much opportunity to deal with some of it's problems.

You can resort to manual dma but that is pretty fiddly and bulky code-wise and it's such a common operation it makes sense to have it available in the runtime. And once it's in the runtime the runtime itself can use it too.

My current thoughts on the api and implementation follow.

 // incase an int isn't enough at some point
 typedef unsigned int ez_async_t;

 // Enqueue 1d copy
 ez_async_t ez_async_memcpy(void *dst, void *src, size_t size);

 // Enqueue 2d copy
 ez_async_t ez_async_memcpy2d(void *dst, size_t dstride,
                              void *src, size_t sstride,
                              size_t width, size_t height);

 // Wait till dma is done
 void ez_async_wait(ez_async_t aid);

 // Query completion status
 int ez_async_complete(ez_async_t aid);

To drive it I have a short cyclic queue of dma header blocks which are written to by the memcpy functions. If the queue is empty when the memcpy is invoked it fires off the dma.

The interesting stuff happens when the queue is not empty - i.e. a dma is still outstanding. All it does in that case is write the queue block and increment the head index and return. Ok that's not really the interesting bit but how the work is picked up. A dma complete interrupt routine checks to see if the queue is not empty and if not then starts the next one.

If the queue is full then the memcpy calls just wait until a slot is free.

I haven't implemented chaining but it should be possible to implement - but it might not make a practical difference and adds quite a bit of complication. Message mode is probably something more important to consider though. This api would use channel 1, leaving 0 for application use (or other runtime functions with higher priority) although if it was flexible enough there shouldn't be a need for manual dma and perhaps the api could support both. It would then need two separate queues and isr's to handle them and would force all code to use these interfaces (it could always take an externally supplied dma request too).

This is the current (untested again) interrupt handler. I managed to get all calculations into only 3 registers to reduce the interrupt overheads.

_ez_async_isr:
        ;; make some scratch
        ;; this can cheat the ABI a bit since this isr has full control
        ;; over the machine while it is executing
        strd    r0,[sp,#-2]
        strd    r2,[sp,#-1]

        ;; isr must save status if it does any alu ops
        movfs   r3,status

        ;; Advance tail
        mov     r2,%low(_dma_queue + 8)
        ldrd    r0,[r2,#-1]             ; load head, tail
        add     r1,r1,#1                ; update tail
        str     r1,[r2,#-1]

        ;; Check for empty queue
        sub     r0,r0,r1
        beq     4f

        ;; Calc record address
        mov     r0,#dma_queue_size-1
        and     r0,r0,r1        ; tail & size-1
        lsl     r0,r0,#5        ; << record size
        add     r0,r0,r2        ; &dma_queue.queue[(tail & (size - 1))]

        ;; Form DMA control word
        lsl     r0,r0,#16       ; dmacon = (addr << 16) | startup
        add     r0,r0,#(1<<3)
        
        ;; Start DMA
        movts   dma1config,r0
4:
        ;; restore state
        movts   status,r3
        
        ldrd    r2,[sp,#-1]
        ldrd    r0,[sp,#-2]
        rti

When I wrote this I thought using 32-bytes for each queue record would be a smart idea because it simplifies the addressing, but multiply by 24 is only 2 more instructions and a scratch register over multiply by 32 so might a liveable expense. The address of the queue is loaded directly so that saves having to add the offset to .queue when calculating the dma request location.

The enqueue is simple enough.

// Pads the dma struct to 32-bytes for easier arithmetic
struct ez_dma_entry_t {
        ez_dma_desc_t dma;
        int reserved0;
        int reserved1;
};

// a power of 2, of course
#define dma_queue_size 4

struct ez_dma_queue_t {
        volatile unsigned int head;
        volatile unsigned int tail;
        struct ez_dma_entry_t queue[dma_queue_size];
};

struct ez_dma_queue_t dma_queue;

...

        uint head = dma_queue.head;
        int empty;

        // Wait until there's room
        while (head - dma_queue_size >= dma_queue.tail)
                ;

 ... dma_queue[head & (dma_queue_size-1)] is setup here

        ez_irq_disable();

        // Enqueue job
        dma_queue.head = head + 1;

        // Check if start needed
        empty = head == dma_queue.tail;

        ez_irq_enable();

        if (empty) {
                // DMA is idle, start it
                ez_dma_start(E_REG_DMA1CONFIG, dma);
        }

        return head;

The code tries to keep the irq disable block as short as possible since that's just generally a good idea. With interrupts this is basically equivalent to a mutex to protect a critical section.

Tagged code, hacking, parallella.
Tuesday, 01 April 2014, 14:33

The journey. It continues.

So I spent the last couple of nights staying up way too late poking at my epiphany runtime library. Last night I basically finished the first cut of the feature complete version with a C implementation backend where i've finally settled on specific implementations of each function tuned mostly for size.

Most of the functions are intended to be inlined and only those more than about 10 instructions are worth putting into their own separate function - this in-lining is something the C compiler can do that isn't possible with assembly language. In every case these functions produce measurably smaller overall code by being inlined either because the function invocation itself takes more to form than the total work done and/or because it allows the callee to become a leaf and thus completely changes the way registers and stack can be assigned and used.

I spent a lot of time swearing at the compiler. Probably if i hadn't been up so late the night before I would merely have been puzzled by it. I played quite a bit with barrier() but ended up with something that is basically identical to the assembly version I came up with, but a good bit bulkier because the compiler doesn't try to use the small registers (0-7) for the whole function. It also does some redundant testing, but I think this is about as good as i'm going to get and it does pick up a couple of optimisations. I go into detail below.

I tried implementing the barrier of the the previous post whereby it tests 4 states per inner iteration but I decided to stay with the byte version due to the code-size increase. I also decided that because a workgroup barrier has to be workgroup wide, there's no need to ever allow the developer to assign memory for it and i've included the barrier memory as part of the runtime setup code (or it can be created by a linker script) which guarantees it will always be in the same location across all cores in the workgroup even if they are running different binaries. This can go somewhere fixed - either before or after the executive (aka kernel aka bios) or after the stack at the end of memory.

One of the 'interesting' loops was where the remote cores are notified of the barrier continuation. In e-lib it uses an array of pointers to implement this which leads to a trivial inner loop at the cost of extra memory reads. I decided to use the workgroup information to update the barrier directly which means it needs a bit more calculation which increases code-size but the loops are still quite small. This is the final implementation which I created by converting the assembly version literally into C after giving up trying to fight with the compiler for a better one.

        volatile unsigned char *root = ez_global_core(barrier, 0, 0);
        // Notify from farthest away
        int r = ez_config->group_rows-1;
        int ce = ez_config->group_cols-1;
        do {
                int c = ce;
                
                do {
                        root[((r<<6) | c) << 20] = 0;
                } while (--c >= 0);
        } while (--r >= 0);

Looking at the listing output (oh boy, finding that option to as really brought back memories, it's -Wa,-a to gcc) one observes that the loop setup takes about half the code with the loop implementation taking up the other. The inner loop is only 6 instructions long.

// build with -O2 -msmall16
   6 0000 0305                  mov r0,#40
   7 0002 C420                  ldr r1,[r0,#1]
   8 0004 4C010040              ldr r16,[r0,#2]
   9 0008 CC210040              ldr r17,[r0,#3]

  10 000c 0B800220              mov ip,_barrier
  11 0010 9606                  lsl r0,r1,#20
  12 0012 7F900A24              orr ip,ip,r0

  13 0016 9B03FF48              add r16,r16,#-1
  14 001a 9B27FF48              add r17,r17,#-1

  15 001e 0360                  mov r3,#0
  16                    .L5:
  17 0020 DF400608              lsl r2,r16,#6
  18 0024 EF040208              mov r0,r17
  19                    .L3:
  20 0028 7A21                  orr r1,r0,r2
  21 002a 9626                  lsl r1,r1,#20
  22 002c 9303                  add r0,r0,#-1
  23 002e 99700004              strb r3,[ip,r1]
  24 0032 3320                  sub r1,r0,#0
  25 0034 70FA                  bgte .L3

  26 0036 9B03FF48              add r16,r16,#-1
  27 003a 3B000008              sub r0,r16,#0
  28 003e 70F1                  bgte .L5

This is almost identical to the assembly I came up with beforehand as the emboldened sections which indicate the non-redundant instructions show.

(untested code follows)

   6 0000 0360                  mov     r3,#0           ; for fragment

   7 0002 0325                  mov     r1,#ez_config
   8 0004 C444                  ldr     r2,[r1,#ezc_group_id]
   9 0006 E484                  ldrd    r4,[r1,#ezc_group_rows/2]
  10 0008 0B000200              mov     r0,%low(_barrier)
  11 000c 964A                  lsl     r2,r2,#20       ; groupid << 20
  12 000e 7A48                  orr     r2,r2,r0        ; root = get_global_address(barrier, 0, 0)
  13 0010 B390                  sub     r4,r4,#1        ; r = rows - 1
  14                    1:
  15 0012 3B360000              sub     r1,r5,#1        ; c = cols - 1
  16 0016 D6D0                  lsl     r6,r4,#6        ; rshift = r << 6
  17                    2:
  18 0018 FAF8                  orr     r7,r6,r1        ; o = (r << 6) | c
  19 001a 96FE                  lsl     r7,r7,#20       ; o <<= 20
  20 001c 916B                  strb    r3,[r2,r7]      ; root[o] = 0
  21 001e B324                  sub     r1,r1,#1        ; while (--c >= 0)
  22 0020 70FC                  bgte    2b

  23 0022 B390                  sub     r4,r4,#1        ; while (--r >= 0)
  24 0024 70F7                  bgte    1b

The register usage is because I extracted it from a bigger fragment - I may have broken it too but since it's only 4 accountable instructions different to the C compiler it should be ok apart from typos.

Being able to use the lower 8 registers really makes a big difference to code size as you'd imagine. Despite only being 4 instructions shorter this is 38 bytes compared to 64. Due to the ABI it does come at the cost of having to save/restore 4 registers which is 16 bytes of instructions and some time but it doesn't take much code before that is made up.

The other point of note is the direct use of the condition codes after the loop index adjustment. This saves one instruction per loop here plus the scratch register which would otherwise be required (line 24 and 27 of the first listing). This is really something I would expect the compiler to pick up. Perhaps less expected is the double-word load which saves one instruction, and the other instruction is saved by changing the redundant register move on line 18 in listing one into a substraction on line 15 in listing two which means the one on line 14 in listing one can be discarded.

The full barrier implementation needs to use a volatile read of a byte array and I found the compiler (and oddly, the arm compiler as well although only in the signed char case) does some weird and unnecessary data size extension when you read from a volatile char array which adds two pointless instructions for every byte load.

So with that the total implementation in C hits 144 bytes vs 88 for the assembly - which is fair enough I guess but there is room for improvement and only small things can make quite a difference when you're dealing with such small functions.

The e_barrier() code in e-lib compiled with the same options (-O2 -msmall16) hits 236, but much of that is because of the need to perform some multiplies. It's actually much worse than that because e_barrier_init() which is also a necessary part of the implementation compiles to 346 bytes ... :-/. I'm pretty sure the e_barrier_init()/e_barrier() pair has a race condition as well which can't be solved without another ... barrier.

If e_group_config is modified to include group_size and core_index values this drops them down to 132 and 224 bytes respectively which in terms of on-core code is still 2.5x the C one above (and 4.0x the asm). Strangely enough the notify loop uses implicit condition code tests on it's loop counter! But it added a separate counter register to implement this so that is probably why it isn't available on 'normal' calculations. Actually an externally initialised version of this implementation could be made very small; but then the data-size will dominate.

If only that hardware barrier worked properly in this case.

I did some further experiments while writing this and although it's kind of academic because of it needing 5x the runtime memory I fiddled with an implementation that uses an array of pointers assuming they have been initialised by the runtime / loader. I can get it down to 70 bytes for the barrier() call.

Tagged code, hacking, parallella.
Sunday, 30 March 2014, 16:34

a (slightly) better barrier, signals

So I kept thinking about the last point on the last post. That is, having a job-based runtime similar to hsa queuing.

One feature that employs is a signalling mechanism that I think lets waiters wait on multiple events before continuing. I came up with a possible epiphany-friendly primitive to support that.

First there is the sending object - this is what each sender gets to signal against. It requires a static allocation of slots by the runtime so that they can signal with a single asynchronous write without needing an atomic index update that a semaphore like object would require.

struct ez_signal_t {
 uint slot;
 ez_signal_bank_t *bank;
};

void ez_signal(ez_signal_t s) {
 s.bank->s.signals[s.slot] = 0xff;
}

The signal bank is basically just an array of bytes per sender. But also allows for word access for a more efficient wait.

struct ez_signal_bank_t {
 uint size;
 uint reserved;
 union {
  ubyte signals[8];
  uint block[2];
 } s;
 // others follow
 // must be rounded up to 4 bytes
};

(the structure is aligned to 8 because of the union, it's probably not worth it and i will use an assembly version anyway)

The bank is initialised by setting every byte within size to 0, but those outside of that range which might fall into a word slot are pre-filled with the signal value of 0xff. This allows the waiter to execute a simple uint based loop without having to worry about edge conditions.

void ez_signal_wait(ez_signal_bank_t *bank) {
        uint size = bank->size;
        volatile uint *si = bank->s.block;

        size = ((size+3) >> 2);
        while (size > 0) {
                while (si[size-1] != 0xffffffff)
                        ;
                size -= 1;
        }
}

The same technique could be applied to barrier() as well - a tiny improvement but one nonetheless. It does need some edge case handling for the reset and init though which will take more code.

(Hmm, on closer inspection it isn't clear whether the hsa signalling objects actually support multiple writers or not - it seems perhaps not. However they do operate at a higher level of abstraction and may need multiple writers internally in order to implement the public interface when multiple cores are involved. Update: On further inspection it looks like they are basically signal semaphores but with a flexible condition check on the counter and the ability to use opencl atomics on the index.)

Actually another thought I had was that since the barrier is tied directly to the workgroup why is the user code even calling init on it's own structure anyway? It may as well be a parameter-less function that just uses a pre-defined area as the barrier memory which is setup by the loader and/or the init routine. I think the current barrier implementation mechanism may have an issue because you can't initialise the barrier data structures in parallel without it being a race condition and so even the setup needs another synchronisation mechanism to implement it properly. By placing it in an area that can be initialised by the loader/runtime that race goes away quietly to die. And if the barrier is chip-wide or the WAND routing ever gets fixed it could just use a hardware barrier implementation instead.

This thought may be extended to other data structures and facilities that are needed at runtime. Why have the error prone and bulky code to setup a dma structure when the logic could be encoded in a routine with simple arguments that covers the most common cases with just two functions: 1D and 2D? And if you're doing that why not add an async copy capability with interrupt driven queuing as well? Well it may not be worth it in the end but its worth exploring to find out.

gdb sim

So I actually used the gdb simulator for the first time yesterday to step through some assembly language to verify some code. So far I've just debugged using the run-till-it-stops-crashing method. Bit of a pain it doesn't seem to think assembly language is actually a thing though:

$ e-gdb
...

(gdb) target sim
Connected to the simulator.
(gdb) file a.out
Reading symbols from /home/notzed/src/elf-loader/a.out...done.
(gdb) load
Loading section ivt_reset, size 0x4 lma 0x0
Loading section .reserved_crt0, size 0xc lma 0x58
Loading section NEW_LIB_RO, size 0x170 lma 0x64
Loading section NEW_LIB_WR, size 0x450 lma 0x1d8
Loading section GNU_C_BUILTIN_LIB_RO, size 0x6 lma 0x628
Loading section .init, size 0x24 lma 0x62e
Loading section .text, size 0x268 lma 0x660
Loading section .fini, size 0x1a lma 0x8c8
Loading section .ctors, size 0x8 lma 0x8e4
Loading section .dtors, size 0x8 lma 0x8ec
Loading section .jcr, size 0x4 lma 0x8f4
Loading section .data, size 0x4 lma 0x8f8
Loading section .rodata, size 0x8 lma 0x900
Start address 0x0
Transfer rate: 17632 bits in <1 sec.
(gdb) b ez_signal_init
Breakpoint 1 at 0x85a
(gdb) r
Starting program: /home/foo/src/elf-loader/a.out 

Breakpoint 1, 0x0000085a in ez_signal_init ()
(gdb) stepi
0x0000085e in ez_signal_init ()
(gdb) 
0x00000860 in ez_signal_init ()
(gdb) 
0x00000862 in ez_signal_init ()
(gdb) 
0x00000864 in ez_signal_init ()

Yay? Had better tools on the C=64.

But I came across this article which provides something of a workaround. Not the best but it will suffice.

(gdb) display /3i $pc
1: x/3i $pc
=> 0x864 <ez_signal_init+10>:   lsl r1,r1,0x2
   0x866 <ez_signal_init+12>:   lsl r3,r3,0x3
   0x868 <ez_signal_init+14>:   beq 0x880 <ez_signal_init+38>
(gdb) stepi
0x00000866 in ez_signal_init ()
1: x/3i $pc
=> 0x866 <ez_signal_init+12>:   lsl r3,r3,0x3
   0x868 <ez_signal_init+14>:   beq 0x880 <ez_signal_init+38>
   0x86a <ez_signal_init+16>:   mov r2,0xffff
(gdb) 
0x00000868 in ez_signal_init ()
1: x/3i $pc
=> 0x868 <ez_signal_init+14>:   beq 0x880 <ez_signal_init+38>
   0x86a <ez_signal_init+16>:   mov r2,0xffff
   0x86e <ez_signal_init+20>:   movt r2,0xffff
(gdb) 
0x0000086a in ez_signal_init ()
1: x/3i $pc
=> 0x86a <ez_signal_init+16>:   mov r2,0xffff
   0x86e <ez_signal_init+20>:   movt r2,0xffff
   0x872 <ez_signal_init+24>:   lsr r2,r2,r3
(gdb) 

set confirm off may also reduce offence, but otherwise it was all a bit easier than I remember it seemed when I read about it as the few commands above show.

I tend to only use debuggers these days as a tool of absolutely last resort. Most 'bugs' I work on are 'known bugs' like missing or incomplete features where such a tool isn't any help. And debugger features which should be useful like variable watches are often just too much of a pain to setup and work with or trying to step to the 5000th iteration of some loop which also tends to be painful with debuggers even if they have support for it. Obviously i'm not having to debug production code in-place these days, thankfully.

But I remember using it heavily in the bad old days when working on Evolution which uses threads a fair bit - whether each updated revision of gdb would even be able to produce usable stack traces of a threaded application was a bit of a lucky dip. I remember not upgrading for years at a time once I got a combination of kernel and gdb that actually worked. It's probably the single most influential experience that lead me to become very conservative with my tool updates. There's nothing more frustrating than having to fix a breakage for something that wasn't broken in the first place.

Tagged hacking, parallella.
Saturday, 29 March 2014, 15:56

what happens after the cpu starts

So I started thinking about the startup code that you might want if you were executing code on the epiphany as a kernel processor as opposed to running a "standard" C interface. One problem is you can't be loading arbitrary code while the cpu is not in a halted state so there has to be some fixed block of code which is always there which can contain that state.

data

To get my bearings I created a full memory map of the entire epiphany 16-core device. Below is an excerpt from the lowest addresses on each core as I had it initially. Yeah ... didn't really feel like watching tv.

 +----------+---- local ram start for each core
 |     0000 | sync
 |     0004 | swe
 |     0008 | memfault
 |     000c | timer0
 |     0010 | timer1
 |     0014 | message
 |     0018 | dma0
 |     001c | dma1
 |     0020 | wand
 |     0024 | swi
 +----------+------
 |     0028 | extmem
 |     002c | group_id
 |     0030 | group_rows
 |     0034 | group_cols
 |     0038 | core_row
 |     003c | core_col
 |     0040 | group_size
 |     0044 | core_index
 +----------+------
 |     0048 | 
 |     004c | 
 |     0050 | 
 |     0054 | 
 +----------+------
 |     0058 | .text initial code
 |          |

The base load address matched the current sdk because I was originally compatible with e-lib.

I tried a couple of ideas but settled on leaving room for 8 words (easy indexing) and found a good use for every slot.

 |          |
 +----------+------
 |     0048 | argv0
 |     004c | argv1
 |     0050 | argv2
 |     0054 | argv3
 |     0058 | entry
 |     005c | sp
 |     0060 | imask
 |     0064 | exit code
 +----------+------
 |     0068 | .text.init
 |          |

Being able to load r0-r3 and set the stack pointer from the host allows any argument list to be created - i.e. the 'main' routine takes native types as arguments rather than nothing at all. This can help simplify some startup issues because you don't need to synchronise with anything to get the data on startup and for simple cases might be all you need.

Being able to set the entry point allows multiple 'kernels' to be stored in the same binary and be invoked without having to reload the code. It also 'fixes' the problem of having to trampoline to a 32-bit address via the startup routine if you want an off-core main (which may have limited uses).

And the imask just allows some initial system state to be configured. There may need to be more.

And finally the exit code allows the kernel to return some value. Actually i'm not sure how useful it is but there was an empty slot and it was easy to add.

code

So to the reset interrupt handler. This is just a rough sketch of where my thoughts are at the moment and i haven't tried it on the machine so it may contain some silly mistakes or other thinkos.

        .section .text.ivt0
_ivt_sync:      
        b.l     _isr_sync       ;sync

Currently in ez-loader the section name .text.ivt0 sets the interrupt vector although because it can create the ivt entries on the fly it may be something that can be removed. Or have it work as an override in which case all the following code is never used. The compiler outputs various sections for interrupt handlers too and I intend to remain compatible with those even though interrupt handlers in c can be a bit nasty due to the size of the register file and lack of multi push/pull instructions.

        .section .text.init
_isr_sync:
        ;; load initial state
        mov     r7,#tcb_off
        ldrd    r0,[r7,#tcb_argv0_1]      ; argv0, argv1
        ldrd    r2,[r7,#tcb_argv2_3]      ; argv2, argv3
        ldrd    r12,[r7,#tcb_entry_sp]    ; entry, sp
        ldr     r4,[r7,#tcb_imask]        ; imask

Then we come to the meat and potatoes. First the state is loaded from the 'task control block'. Double-word loads are used where possible and most values are loaded directly into their final desired location.

        ;; set frame pointer
        mov     fp,sp

        ;; set imask
        movts   imask,r4

A couple of values then need to be set by hand. There may need to be other init work done here such as clearing ipend.

        ;; set entry point
        movts   iret,r12

Rather than drop down to user-mode to run setup routines as required by a full c runtime this just jumps straight to the kernel entry point via iret. This saves the need to resolve the pre-main 'start' trampoline routine too.

        ;; link to exit routine
        mov     lr,%low(_exit)

Similarly because main isn't being called via a user-mode stub using bl or jalr the link register needs to be set manually. This can go also directly to _exit without passing go. Because this is intended to be on-core it only needs to load the bottom 16 bits of the address.

        ;; launch main directly
        rti

And after that ... there's nothing more to do. rti will clear the interrupt state, enable interrupts and jump to the chosen kernel entry point. r0-r3 and the stack will contain any args required, and when it finishes it will return from the function ...

_exit:  mov     r7,#tcb_off
        str     r0,[r7,#tcb_exit_code]
1:      idle
        b.s     1b
        .size   _exit, .-_exit

... and end up directly at exit. This can save out the return code from the function and then 'shuts down' via an idle and repeat loop. And at this point new code can (relatively) safely be loaded, or just the entry and arguments changed to relaunch the code with a sync signal. It may need another field for to indicate the core state such as running or exited and that might be more useful than having an exit code.

The nice thing about the dynamic loader I have is that I can separate this startup mechanism from the code completely - with a bit more work it doesn't even need to be linked to it. Because code is relocated at runtime I can change the startup code or tcb structure without forcing a recompile and only the workgroup config stuff needs to remain fixed.

For example another option may be to have a job dispatch loop replace most of the above and avoid the need to have to add it externally. It could even potentially be loading in the code itself from asynchronously en-queued jobs like hsa does. Possibly even using the hsa dispatch packet or some subset of it. Hmm, thoughts.

Tagged code, hacking, parallella.
Saturday, 29 March 2014, 10:30

ez-lib-tiny

So I just finished writing two versions of the 'tiny' elib equivalent - but completely untested so it will have bugs and may change once those are fixed. One is in C, and one is all in assembly.

The functionality is basically the same as e-lib but uses a different runtime e_config and the apis were modified slightly in order to gain efficiency and sometimes ease of use. For example the dma code takes a the lower 16-bits of the dma config register base address rather than an index which makes no real difference to the user but shaves a few precious bytes off the implementation. Another example is the irq functions which take take a mask and a set of bits so any combination of bits can be modified in a single invocation rather than having to call multiple times for different ones. Some of the other functions like the global address mapping functions i de-parameterised into separate special purpose functions since generally you know what type of address you have and it makes quite a difference if you specialise and this is the sort of stuff the compiler needs some help with. I just noticed I missed some of the coreid manipulation stuff though.

The ultimate goal is actually to put a lot of stuff in inline blocks which will actually generate less code and help the compiler but to compare the total code sizes I just generated a file with every function in it and mirrored the functionality identically between assembly and c.

Anyway ... (C compiled with -O2 -mshort-calls -std=gnu99)

notzed@minized:~/src/elf-loader$ e-size ez-lib-tiny-c.o
   text    data     bss     dec     hex filename
    932       0       0     932     3a4 ez-lib-tiny-c.o

notzed@minized:~/src/elf-loader$ e-size ez-lib-tiny.o
   text    data     bss     dec     hex filename
    696       0       0     696     2b8 ez-lib-tiny.o

As a comparison the current e-lib.a contains 3072 bytes of code (and a few more trivial functions). Also related is the c startup/support code (crt0 etc) which takes about another 2k above a minimal start-up implementation I am now using (none of which is included in the above numbers).

The smaller one is of course in assembly language. The biggest pieces were the timer stuff because it can't really be parameterised, and particularly the barrier implementation. I think i spotted some bugs in the timer implementation with it's handling of the config register (sigh, why did everything have to go through the one register) but i'm not sure they're bugs. In actual code this will actually shrink even in worst case because quite a few of these will be inlined instead, and generate code fragments which are smaller than a function invocation and without side effects like clobbering work registers.

I managed to squeeze the barrier implementation into only the lowest 8 registers which helped reduce the code-size quite a bit. The whole implementation is 88 bytes - it's amazing how quickly they add up (only 38 instructions). Of course I haven't verified it yet so it could all be terribly broken too.

I made some changes to the api to simplify the code and usage. It only takes a single array pointer which must be a local-core address of group-size bytes of memory. The memory must be at the same address in every core which allows the implementation to implicitly calculate any address required.

This is the C version, and the assembly is just a straightforward hand-compilation of that.

void ez_barrier_init(ez_barrier_t *barrier_array) {
        for (int i=0;i<ez_config->group_size;i++)
                barrier_array[i] = 0;
}

Rather than add code to special-case core0, just have every core clear their barrier block. Only one byte of the non-control core is actually used but why add the extra code to the library if it doesn't hurt.

void ez_barrier(ez_barrier_t *barrier_array) {
        volatile ez_barrier_t *ba = barrier_array;
        int index = ez_config->core_index;

        if (index == 0) {
                // Wait for all others (not us, we know we're here)
                for (int i=ez_config->group_size-1;i>0;i--) {
                        while (ba[i] == 0)
                                ;
                        // We can reset the local flag immediately
                        ba[i] = 0;
                }

                // Notify (do us because it doesn't hurt and simplifies the loops)
                for (int r=ez_config->group_rows;r>0;r--) {
                        for (int c=ez_config->group_cols;c>0;c--) {
                                volatile ez_barrier_t *rb = ez_global_core(barrier_array, r-1, c-1);

                                rb[0] = 0;
                        }
                }
        } else {
                volatile ez_barrier_t *root = ez_global_core(barrier_array, 0, 0);

                // Mark local signal and notify root
                ba[0] = 1;
                root[index] = 1;

                // Await clear
                while (ba[0])
                        ;
        }
}

In contrast to the assembly version this compiles into 140 bytes (-O2). I know it's pretty much pissing in the wind at this point but, you know, hobbies generally are. The one in e-lib hits 252 but a sizeable chunk of that is the floating point unit config manipulation needed for an integer multiply - not needed in my case because I tweaked the workgroup config to include the same info pretty much exactly for this purpose (and wanting a flat index for the current core is extremely common in practice, even in 2d kernels). The e-lib version does require more run-time memory though: 16 + 64*4 = 80 bytes vs just 16 for a 4x4 epiphany.

(as an aside, it's a real bummer the hardware barrier has bugs which probably prevent it from working on anything but a whole-chip workgroup. That would've been much faster and taken much less code to implement. I will add a separate api entry point for hardware barriers; a whole-chip barrier is certainly useful for many applications.

I might be able to apply some of the optimisation techniques I employed in the assembly version to further improve it too - for example the assembly version basically collapses all 'calls' to ez_global_core() into a single calculation in the prologue. This feedback from coding in assembly and trying to work out what the compile is doing has happened a few times although it isn't very reliable and might make no difference. Note that the compiler has in-lined almost all of the local calls like ez_global_core() already which helps code-size quite a bit (this is a version of e_get_global_address() that doesn't have special case code for global addresses or E_SELF).

I removed the separate reset loop on the controller core code path since each slot in the array has a dedicated user and there's no need to synchronise with the others before clearing the local flags.

Oh, and in an attempt to help improve the latency it starts writing from the farthest away core first. If one takes a single row it should end up having all the writes arrive at approximately the same time as the train of writes arrives at each destination in lock-step. That's the idea anyway. I need to check it against the routing algorithm to see if it should be by columns instead of rows although it might not make any difference.

I should've really been out enjoying another unusually warm day but ... yeah i didn't. Mowed the lawn though and nearly headed into the city for an afternoon drink, but somehow got distracted for about 8 hours and now it's dark and the lights are still off (Hmmm. Just fixed). Better go hunt for food I guess.

Tagged hacking, parallella.
Saturday, 29 March 2014, 09:09

add/or, or, add?

Another micro-optimisation that gcc isn't grabbing.

Example code:

 unsigned int a;
 unsigned int b;

 unsigned int foo(void *dma) {
     return ((unsigned int)dma << 16) | 1;
 }

 -->
00000000 _foo:
   0:   0216            lsl r0,r0,0x10
   2:   2023            mov r1,0x1
   4:   00fa            orr r0,r0,r1
   6:   194f 0402       rts

This is part of the calculation to form a dma start code.

This compiles literally as expected - but orr requires a register argument so it needs an additional register and the load (unfortunately, would be nice to have a fully orthogonal instruction set on such matters). Because the lower 5 bits will always be clear thanks to the shift one can just use an add instead.

 unsigned int fooa(void *dma) {
     return ((unsigned int)dma << 16) + 1;
 }

 -->
00000000 _fooa:
   0:   0216            lsl r0,r0,0x10
   2:   0093            add r0,r0,1
   4:   194f 0402       rts

If the constant is greater than 3 bits (or something that can be made with a negative 3-bit number) then the code-size will grow by two bytes - however it is still only 2 instructions and requires no auxiliary register and allows setting up to 10 bits with a constant.

Actually trying to inline some dma setting stuff hit some interesting issues. Because the base pointer to the dma control packet is only turned into an integer it isn't referenced - it's possible for the compiler to completely optimise the initialisation code out of existence.

static dma_start(int chan, ez_dma_desc_t *dma) {
   uint start = ((uint)dma << 16) | 1;

   set_reg(E_DMA0CONFIG, start);
}

static dma_run(int chan, ez_dma_desc_t *dma) {
   dma_wait(chan);
   dma_start(chan, dma);
   dma_wait(chan);
}

void ez_dma_memcpy(int chan, void *dst, void *src, size_t size) {
 uint align = ((uint) dst | (uint)src | (uint)size) & 0x7;
 uint shift = dma_shift[align];
 ez_dma_desc_t dma;

 dma.config = 3 + (shift << 5);
 dma.inner_stride = 0x00010001 << shift;
 dma.count = 0x10000 | (size >> shift);
 dma.outer_stride = 0x00010001 << shift;
 dma.src_addr = src;
 dma.dst_addr = dst;

 dma_run(chan, &dma);
}

If this code is in the same file optimisation may decide to inline all the functions even they weren't marked as such. Since the only use of dma is as an integer it "loses" it's pointedness and thus it's reference to the object. I dunno, I suppose it's a valid optimisation but an interesting gotcha nonetheless. It is fixed by making the dma declaration volatile.

Actually I had some other strange behaviour with this routine. Originally I was using an initialiser to set the content of dma. As in:

    ez_dma_desc_t dma = {
        .config = 3 + (shift << 5),
        .inner_stride = 0x00010001 << shift,
        etc
    };

But yeah, this did weird shit. It seemed to build the full content of the structure on the stack in a staging buffer, and then copy it to the actual structure, also on the stack. In -Os mode this even uses a call memcpy? *shrug* I'm really not sure why it should do that unless it's trying to implement some alignment restriction but I'm pretty sure structs are 8-byte aligned anyway (they would have to be). It's a syntax I use all the time because it's so handy ...

ez-lib-tiny

I've been working on more compact 'e-lib' for epiphany, this is mostly gained by generous use of inline and inline asm where appropriate in many cases converting a function call into fewer instructions than the invocation sequence. But I have also investigated the code size of almost every routine compared to hand crafted code.

The compiler always loses, sometimes by over a 100% increase over the hand-crafted code size.

Bullshit quote of the day: "compilers can create better code than you can".

So obviously I've been looking at a lot of compiler output in the last few days and that's clearly far from the truth.

Actually i'm pretty bummed out that compilers are still not able to do a really great job "in this day and age", even with cpu that has such a simple instruction set and a ton of registers which should be very compiler friendly. I think it just shows what a complex optimisation problem translating high level text into machine code really is.

I'm not having a go at compiler writers but those who claim how good they are usually from a position of ignorance and because they read it somewhere and it sounded authoritative. Optimising compilers are dreadfully complex things and trying to convert expert knowledge into an algorithm isn't an easy task at all.

Tagged hacking, parallella.
Thursday, 27 March 2014, 03:03

unix or gnu: sh n awk. finding the next lun.

Had a query from a mate about writing a sh script for a specific purpose. He had one or more files containing a natural number on each line (luns) and wanted to find out where the next whole was.

Apart from being a bit ... i dunno, surprised that a "unix sysadmin" of 25 years (i.e. a lifer) wasn't a total Bourne Shell fiend with lashings of awk and perl for dessert ... it seemed a simple lunch-break 'challenge' so I came up with a couple of solutions.

First I just used bash. It's a bit clumsy though.

#!/bin/sh

# usage: nextlun file ...

n=0
cat $* | sort -n | while read c; do
    if [ $c -ne $n ] ; then
        echo $n
        exit 1;
    fi
    n=`expr $n + 1`
done
# above runs in sub-shell so doesn't update n
if [ $? -eq 0 ]; then
    last=`cat $* | sort -n | tail -1`
    echo `expr $last + 1`
fi

I probably should've known because I use it quite often but I didn't realise (or forgot) "while read" runs in a sub-shell.

However, awk makes this much easier and is the sort of thing it's really good at. The algorithm is identical though.

#!/bin/sh

cat $* | sort -n \
 | awk -e 'BEGIN { n=0; } { if (n != $0) { exit 0; } else { n=n+1; } } END { print n; }' 

Neither handles blank lines properly, so an easy fix should that be necessary in the awk:

#!/bin/sh

cat $* | sort -n \
 | awk -e 'BEGIN { n=0; } /[0-9]+/ { if (n != $0) { exit 0; } else { n=n+1; } } END { print n; }' 

A grep could perform the same duty in the bash version as well.

This is the core of what makes "unix" actually something worth using. The whole system itself is the "integrated development environment". And all that power is available to any user who wants it without having to buy some overpriced application to do it.

Imagine how many people would use a spreadsheet for such a simple task - and then have to do it manually every time to rub salt into the wound. So not only do you have to pay real cash for the privilege, you have to keep on paying with your own time - which is something you can never buy back for any price.

Update: As a further emphasis on the last point I was relaying this anecdote to a mate of mine, one who is also in an area where scripts should be a comfortable notion (dba on unix systems). He actually suggested using "excel" to do the same task. But then again he did earn his wings as a dba being paid by the hour ... so perhaps can be forgiven for not minding a bit of menial busy work ;-)

Tagged code, hacking.
Wednesday, 26 March 2014, 12:51

inlining register reads

42.

I've been looking at the way the epiphany on-core library implements a couple of functions in order to improve them. One of the simplest are the special register read/write functions used for dma and other purposes.

 unsigned e_reg_read(e_core_reg_id_t reg_id);

 unsigned e_reg_read(e_core_reg_id_t reg_id)
 {
        volatile register unsigned reg_val;
        unsigned *addr;

        // TODO: function affects integer flags. Add special API for STATUS
        switch (reg_id)
        {
        case E_REG_CONFIG:
                __asm__ __volatile__ ("MOVFS %0, CONFIG" : "=r" (reg_val) : );
                return reg_val;
        case E_REG_STATUS:
                __asm__ __volatile__ ("MOVFS %0, STATUS" : "=r" (reg_val) : );
                return reg_val;
        default:
                addr = (unsigned *) e_get_global_address(e_group_config.core_row,
                                     e_group_config.core_col, (void *) reg_id);
                return *addr;
        }
 }

As alluded to in the comments this actually breaks reading the status register anyway ... and it is incomplete. There are 42 special registers but because they are not contiguous and the actual memory address is passed to the function the compiler generates either a giant jump table or a long sequence of nested branches searching for the switch target.

And apart from this any calling code needs to go via the a function invocation which may be as simple as a branch but is more likely to be a 32-bit load followed by a jsr, and then the function itself needs to implement a switch.

Update: I added some markup to the output. Bold is the code that is required to do the actual job, for the first examples it includes the e_reg_read function implementation but in the ideal case it is a single instruction. Italic is bad code either due to incorrect implementation or the compiler going whacko the didlio for whatever reason.

00000000 _e_reg_read:
   0:   40e2            mov r2,r0
   2:   000b 0042       mov r0,0x400
   6:   01eb 1002       movt r0,0xf
   a:   d65c 2700       str lr,[sp],-0x4
   e:   283a            sub r1,r2,r0
  10:   2800            beq 60 <_e_reg_read+0x60>
  12:   008b 0042       mov r0,0x404
  16:   01eb 1002       movt r0,0xf
  1a:   283a            sub r1,r2,r0
  1c:   1700            beq 4a <_e_reg_read+0x4a>
  1e:   000b 0002       mov r0,0x0
  22:   200b 0002       mov r1,0x0
  26:   000b 1002       movt r0,0x0
  2a:   200b 1002       movt r1,0x0
  2e:   600b 0002       mov r3,0x0
  32:   0044            ldr r0,[r0]
  34:   2444            ldr r1,[r1]
  36:   600b 1002       movt r3,0x0
  3a:   0d52            jalr r3
  3c:   d64c 2400       ldr lr,[sp,+0x4]
  40:   0044            ldr r0,[r0]
  42:   b41b 2402       add sp,sp,16
  46:   194f 0402       rts
  4a:   0512            movfs r0,status
  4c:   15dc 0400       str r0,[sp,+0x3]
  50:   15cc 0400       ldr r0,[sp,+0x3]
  54:   d64c 2400       ldr lr,[sp,+0x4]
  58:   b41b 2402       add sp,sp,16
  5c:   194f 0402       rts
  60:   0112            movfs r0,config
  62:   15dc 0400       str r0,[sp,+0x3]
  66:   15cc 0400       ldr r0,[sp,+0x3]
  6a:   d64c 2400       ldr lr,[sp,+0x4]
  6e:   b41b 2402       add sp,sp,16
  72:   194f 0402       rts
  76:   01a2    nop

And an example call:

unsigned a;
void foo(void) {
 a = e_reg_read(E_REG_STATUS);
}
 --> e-gcc -std=gnu99 -O2  -c -o e-foo.o e-foo.c

00000000 _foo:
   0:   008b 0042       mov r0,0x404
   4:   200b 0002       mov r1,0x0
   8:   d55c 2700       str lr,[sp],-0x2
   c:   200b 1002       movt r1,0x0
  10:   01eb 1002       movt r0,0xf
  14:   0552            jalr r1
  16:   400b 0002       mov r2,0x0
  1a:   400b 1002       movt r2,0x0
  1e:   0854            str r0,[r2]
  20:   d54c 2400       ldr lr,[sp,+0x2]
  24:   04e2            mov r0,r1
  26:   b41b 2401       add sp,sp,8
  2a:   194f 0402       rts
  2e:   01a2    nop

Can we do better?

inline

So the solution is to inline it. Simply moving e_reg_read to an inline function in a header helps. Well sort of helps.

 static inline unsigned ex_reg_read(e_core_reg_id_t reg_id)
 .. exactly the same

int foo_inline(void) {
        a = ex_reg_read(E_REG_STATUS);
}

 --> e-gcc -std=gnu99 -O2  -c -o e-foo.o e-foo.c

00000000 _foo_inline:
   0:   b41b 24ff       add sp,sp,-8
   4:   0512            movfs r0,status
   6:   15dc 0400       str r0,[sp,+0x3]
   a:   35cc 0400       ldr r1,[sp,+0x3]
   e:   000b 0002       mov r0,0x0
  12:   000b 1002       movt r0,0x0
  16:   2054            str r1,[r0]
  18:   810b 2002       mov r12,0x8
  1c:   b61f 248a       add sp,sp,r12
  20:   194f 0402       rts

Yeah not sure what's going on there to make it go through the stack. Maybe the type or something. What the hell? (oh I later worked it out: the unnecessary volatile on the reg_val value is forcing a store to and read from the stack which is not desirable at all, i filed a bug in prickhub).

Actually I'm going backwards here, I actually already wrote this and wanted to compare to what currently happens, so lets just forget all that and see what I came up with.

static inline uint32_t
ez_reg_read(e_core_reg_id_t id) {
        register uint32_t v;

        switch (id) {
        case E_REG_CONFIG:
                asm volatile ("movfs %0,config" : "=r" (v));
                break;
        case E_REG_STATUS:
                asm volatile ("movfs %0,status" : "=r" (v));
                break;
        default:
                v = *((volatile uint32_t *) e_get_global_address(e_group_config.core_row,
                     e_group_config.core_col, (void *) id));
                break;
        }
        return v;
}

And an example.

void fooz_inline(void) {
    a = ez_reg_read(E_REG_STATUS);
}

 --> e-gcc -std=gnu99 -O2  -c -o e-foo.o e-foo.c

00000024 _fooz_inline:
   0:   2512            movfs r1,status
   2:   000b 0002       mov r0,0x0
   6:   000b 1002       movt r0,0x0
   a:   2054            str r1,[r0]
   c:   194f 0402       rts

Ok that's what I wanted. This helps the compiler generate much better code too since the result can go in any register, no scratch registers need to be saved, etc.

So I took this and proceeded to add all 42 of the special registers ... and then compiled an example that had more than one register read and ... damn. Suddenly it decides that it doesn't want to inline it anymore and turns every get into a function call to a 912 byte function. Oops. __attribute__((always_inline)) fixed that at least. Although it's different depending on the compiler - on the device I don't need to add the always_inline thing (or maybe some tiny detail was different).

However, things get a bit nasty when a non-constant expression is passed. Suddenly it inlines as much of that gigantic switch statement as it needs - potentially the whole lot if it doesn't know the range. Obviously not much use.

gcc has one last trick ... __builtin_constant_p(x). This returns true if the parameter is probably a constant. Since for this particular case on the epiphany any special register can just be read by a global memory access (as it uses already for a fallback) this can be used to decide the path to use.

#define ez_reg_read(x) (__builtin_constant_p(x) \
    ? _ez_reg_read(x) \
    : (*((volatile uint32_t *) e_get_global_address( \
       e_group_config.core_row, e_group_config.core_col, (void *) x))))

The macro decides whether to call _ez_reg_read() which will compile into a single movfs instruction, or fall-back to the memory load path (this may have some nasty unrolled-loop cases, hopefully unlikely). Although ... It's probably not terribly important to support non-constant parameters because any tools that need it can do it themselves and the api could just be for known registers (the enum implies that already).

Given how much code the current implementation compiles into it doesn't seem worth special casing any registers at all and it could just fall back to a global memory load every time. I'm lead to believe that movfs/movts goes through the same physical path (unfortunately) and since the required address is the key to the function there's little more to do.

I think elib can be shrunk significantly by changing some of the parameterisation and judicious use of inlining.

So ... whilst playing with a version of e_dma_wait() using this ... I think I found another gcc bug when __builtin_constant_p() is used, so yeah, non-constant args are in the bin.

Update: So it seems I missed the brackets on the macro which someone kindly pointed out on the forums ... oops. Actually before that I had written up the implementation and realised that the macro wasn't doing anything magical and __builtin_constant_p() can just go into the inline function itself.

Below is the code so far which results in identical output.

I thought it wouldn't work with no optimisation turned on but it seems to do the right thing. It still in-lines the ez_reg_read() but drops to the memory access path. This wont work for the config and status registers because they need special handling according to Andreas but I just realised I had separate entry points for those anyway so it should be ok. I'm not sure why C would ever be reading status anyway.

ez_regval_t 
EZ_ALWAYS_INLINE
ez_reg_read(ez_regid_t id) {
 register uint32_t v;

 if (__builtin_constant_p(id)) {
  switch (id) {
   // bank 0
  case E_REG_CONFIG:
   asm volatile ("movfs %0,config" : "=r" (v));
   break;
  case E_REG_STATUS:
   asm volatile ("movfs %0,status" : "=r" (v));
   break;
  case E_REG_PC:
   asm volatile ("movfs %0,pc" : "=r" (v));
   break;
  case E_REG_DEBUGSTATUS:
   asm volatile ("movfs %0,debug" : "=r" (v));
   break;
  ... every other named register ...
  default:
   // unknown register, who cares
   v = 0;
   break;
  }
 } else {
  v = *(volatile uint32_t *)ez_global_core_self((void *)id);
 }

 return v;
}
Tagged code, hacking, parallella.
Newer Posts | Older Posts
Copyright (C) 2019 Michael Zucchi, All Rights Reserved. Powered by gcc & me!