ez-lib-tiny

So I just finished writing two versions of the 'tiny' elib equivalent - but completely untested so it will have bugs and may change once those are fixed. One is in C, and one is all in assembly.

The functionality is basically the same as e-lib but uses a different runtime e_config and the apis were modified slightly in order to gain efficiency and sometimes ease of use. For example the dma code takes a the lower 16-bits of the dma config register base address rather than an index which makes no real difference to the user but shaves a few precious bytes off the implementation. Another example is the irq functions which take take a mask and a set of bits so any combination of bits can be modified in a single invocation rather than having to call multiple times for different ones. Some of the other functions like the global address mapping functions i de-parameterised into separate special purpose functions since generally you know what type of address you have and it makes quite a difference if you specialise and this is the sort of stuff the compiler needs some help with. I just noticed I missed some of the coreid manipulation stuff though.

The ultimate goal is actually to put a lot of stuff in inline blocks which will actually generate less code and help the compiler but to compare the total code sizes I just generated a file with every function in it and mirrored the functionality identically between assembly and c.

Anyway ... (C compiled with -O2 -mshort-calls -std=gnu99)

notzed@minized:~/src/elf-loader$ e-size ez-lib-tiny-c.o
   text    data     bss     dec     hex filename
    932       0       0     932     3a4 ez-lib-tiny-c.o

notzed@minized:~/src/elf-loader$ e-size ez-lib-tiny.o
   text    data     bss     dec     hex filename
    696       0       0     696     2b8 ez-lib-tiny.o

As a comparison the current e-lib.a contains 3072 bytes of code (and a few more trivial functions). Also related is the c startup/support code (crt0 etc) which takes about another 2k above a minimal start-up implementation I am now using (none of which is included in the above numbers).

The smaller one is of course in assembly language. The biggest pieces were the timer stuff because it can't really be parameterised, and particularly the barrier implementation. I think i spotted some bugs in the timer implementation with it's handling of the config register (sigh, why did everything have to go through the one register) but i'm not sure they're bugs. In actual code this will actually shrink even in worst case because quite a few of these will be inlined instead, and generate code fragments which are smaller than a function invocation and without side effects like clobbering work registers.

I managed to squeeze the barrier implementation into only the lowest 8 registers which helped reduce the code-size quite a bit. The whole implementation is 88 bytes - it's amazing how quickly they add up (only 38 instructions). Of course I haven't verified it yet so it could all be terribly broken too.

I made some changes to the api to simplify the code and usage. It only takes a single array pointer which must be a local-core address of group-size bytes of memory. The memory must be at the same address in every core which allows the implementation to implicitly calculate any address required.

This is the C version, and the assembly is just a straightforward hand-compilation of that.

void ez_barrier_init(ez_barrier_t *barrier_array) {
        for (int i=0;i<ez_config->group_size;i++)
                barrier_array[i] = 0;
}

Rather than add code to special-case core0, just have every core clear their barrier block. Only one byte of the non-control core is actually used but why add the extra code to the library if it doesn't hurt.

void ez_barrier(ez_barrier_t *barrier_array) {
        volatile ez_barrier_t *ba = barrier_array;
        int index = ez_config->core_index;

        if (index == 0) {
                // Wait for all others (not us, we know we're here)
                for (int i=ez_config->group_size-1;i>0;i--) {
                        while (ba[i] == 0)
                                ;
                        // We can reset the local flag immediately
                        ba[i] = 0;
                }

                // Notify (do us because it doesn't hurt and simplifies the loops)
                for (int r=ez_config->group_rows;r>0;r--) {
                        for (int c=ez_config->group_cols;c>0;c--) {
                                volatile ez_barrier_t *rb = ez_global_core(barrier_array, r-1, c-1);

                                rb[0] = 0;
                        }
                }
        } else {
                volatile ez_barrier_t *root = ez_global_core(barrier_array, 0, 0);

                // Mark local signal and notify root
                ba[0] = 1;
                root[index] = 1;

                // Await clear
                while (ba[0])
                        ;
        }
}

In contrast to the assembly version this compiles into 140 bytes (-O2). I know it's pretty much pissing in the wind at this point but, you know, hobbies generally are. The one in e-lib hits 252 but a sizeable chunk of that is the floating point unit config manipulation needed for an integer multiply - not needed in my case because I tweaked the workgroup config to include the same info pretty much exactly for this purpose (and wanting a flat index for the current core is extremely common in practice, even in 2d kernels). The e-lib version does require more run-time memory though: 16 + 64*4 = 80 bytes vs just 16 for a 4x4 epiphany.

(as an aside, it's a real bummer the hardware barrier has bugs which probably prevent it from working on anything but a whole-chip workgroup. That would've been much faster and taken much less code to implement. I will add a separate api entry point for hardware barriers; a whole-chip barrier is certainly useful for many applications.

I might be able to apply some of the optimisation techniques I employed in the assembly version to further improve it too - for example the assembly version basically collapses all 'calls' to ez_global_core() into a single calculation in the prologue. This feedback from coding in assembly and trying to work out what the compile is doing has happened a few times although it isn't very reliable and might make no difference. Note that the compiler has in-lined almost all of the local calls like ez_global_core() already which helps code-size quite a bit (this is a version of e_get_global_address() that doesn't have special case code for global addresses or E_SELF).

I removed the separate reset loop on the controller core code path since each slot in the array has a dedicated user and there's no need to synchronise with the others before clearing the local flags.

Oh, and in an attempt to help improve the latency it starts writing from the farthest away core first. If one takes a single row it should end up having all the writes arrive at approximately the same time as the train of writes arrives at each destination in lock-step. That's the idea anyway. I need to check it against the routing algorithm to see if it should be by columns instead of rows although it might not make any difference.

I should've really been out enjoying another unusually warm day but ... yeah i didn't. Mowed the lawn though and nearly headed into the city for an afternoon drink, but somehow got distracted for about 8 hours and now it's dark and the lights are still off (Hmmm. Just fixed). Better go hunt for food I guess.

add/or, or, add?

Another micro-optimisation that gcc isn't grabbing.

Example code:

 unsigned int a;
 unsigned int b;

 unsigned int foo(void *dma) {
     return ((unsigned int)dma << 16) | 1;
 }

 -->
00000000 _foo:
   0:   0216            lsl r0,r0,0x10
   2:   2023            mov r1,0x1
   4:   00fa            orr r0,r0,r1
   6:   194f 0402       rts

This is part of the calculation to form a dma start code.

This compiles literally as expected - but orr requires a register argument so it needs an additional register and the load (unfortunately, would be nice to have a fully orthogonal instruction set on such matters). Because the lower 5 bits will always be clear thanks to the shift one can just use an add instead.

 unsigned int fooa(void *dma) {
     return ((unsigned int)dma << 16) + 1;
 }

 -->
00000000 _fooa:
   0:   0216            lsl r0,r0,0x10
   2:   0093            add r0,r0,1
   4:   194f 0402       rts

If the constant is greater than 3 bits (or something that can be made with a negative 3-bit number) then the code-size will grow by two bytes - however it is still only 2 instructions and requires no auxiliary register and allows setting up to 10 bits with a constant.

Actually trying to inline some dma setting stuff hit some interesting issues. Because the base pointer to the dma control packet is only turned into an integer it isn't referenced - it's possible for the compiler to completely optimise the initialisation code out of existence.

static dma_start(int chan, ez_dma_desc_t *dma) {
   uint start = ((uint)dma << 16) | 1;

   set_reg(E_DMA0CONFIG, start);
}

static dma_run(int chan, ez_dma_desc_t *dma) {
   dma_wait(chan);
   dma_start(chan, dma);
   dma_wait(chan);
}

void ez_dma_memcpy(int chan, void *dst, void *src, size_t size) {
 uint align = ((uint) dst | (uint)src | (uint)size) & 0x7;
 uint shift = dma_shift[align];
 ez_dma_desc_t dma;

 dma.config = 3 + (shift << 5);
 dma.inner_stride = 0x00010001 << shift;
 dma.count = 0x10000 | (size >> shift);
 dma.outer_stride = 0x00010001 << shift;
 dma.src_addr = src;
 dma.dst_addr = dst;

 dma_run(chan, &dma);
}

If this code is in the same file optimisation may decide to inline all the functions even they weren't marked as such. Since the only use of dma is as an integer it "loses" it's pointedness and thus it's reference to the object. I dunno, I suppose it's a valid optimisation but an interesting gotcha nonetheless. It is fixed by making the dma declaration volatile.

Actually I had some other strange behaviour with this routine. Originally I was using an initialiser to set the content of dma. As in:

    ez_dma_desc_t dma = {
        .config = 3 + (shift << 5),
        .inner_stride = 0x00010001 << shift,
        etc
    };

But yeah, this did weird shit. It seemed to build the full content of the structure on the stack in a staging buffer, and then copy it to the actual structure, also on the stack. In -Os mode this even uses a call memcpy? *shrug* I'm really not sure why it should do that unless it's trying to implement some alignment restriction but I'm pretty sure structs are 8-byte aligned anyway (they would have to be). It's a syntax I use all the time because it's so handy ...

ez-lib-tiny

I've been working on more compact 'e-lib' for epiphany, this is mostly gained by generous use of inline and inline asm where appropriate in many cases converting a function call into fewer instructions than the invocation sequence. But I have also investigated the code size of almost every routine compared to hand crafted code.

The compiler always loses, sometimes by over a 100% increase over the hand-crafted code size.

Bullshit quote of the day: "compilers can create better code than you can".

So obviously I've been looking at a lot of compiler output in the last few days and that's clearly far from the truth.

Actually i'm pretty bummed out that compilers are still not able to do a really great job "in this day and age", even with cpu that has such a simple instruction set and a ton of registers which should be very compiler friendly. I think it just shows what a complex optimisation problem translating high level text into machine code really is.

I'm not having a go at compiler writers but those who claim how good they are usually from a position of ignorance and because they read it somewhere and it sounded authoritative. Optimising compilers are dreadfully complex things and trying to convert expert knowledge into an algorithm isn't an easy task at all.

unix or gnu: sh n awk. finding the next lun.

Had a query from a mate about writing a sh script for a specific purpose. He had one or more files containing a natural number on each line (luns) and wanted to find out where the next whole was.

Apart from being a bit ... i dunno, surprised that a "unix sysadmin" of 25 years (i.e. a lifer) wasn't a total Bourne Shell fiend with lashings of awk and perl for dessert ... it seemed a simple lunch-break 'challenge' so I came up with a couple of solutions.

First I just used bash. It's a bit clumsy though.

#!/bin/sh

# usage: nextlun file ...

n=0
cat $* | sort -n | while read c; do
    if [ $c -ne $n ] ; then
        echo $n
        exit 1;
    fi
    n=`expr $n + 1`
done
# above runs in sub-shell so doesn't update n
if [ $? -eq 0 ]; then
    last=`cat $* | sort -n | tail -1`
    echo `expr $last + 1`
fi

I probably should've known because I use it quite often but I didn't realise (or forgot) "while read" runs in a sub-shell.

However, awk makes this much easier and is the sort of thing it's really good at. The algorithm is identical though.

#!/bin/sh

cat $* | sort -n \
 | awk -e 'BEGIN { n=0; } { if (n != $0) { exit 0; } else { n=n+1; } } END { print n; }'

Neither handles blank lines properly, so an easy fix should that be necessary in the awk:

#!/bin/sh

cat $* | sort -n \
 | awk -e 'BEGIN { n=0; } /[0-9]+/ { if (n != $0) { exit 0; } else { n=n+1; } } END { print n; }'

A grep could perform the same duty in the bash version as well.

This is the core of what makes "unix" actually something worth using. The whole system itself is the "integrated development environment". And all that power is available to any user who wants it without having to buy some overpriced application to do it.

Imagine how many people would use a spreadsheet for such a simple task - and then have to do it manually every time to rub salt into the wound. So not only do you have to pay real cash for the privilege, you have to keep on paying with your own time - which is something you can never buy back for any price.

Update: As a further emphasis on the last point I was relaying this anecdote to a mate of mine, one who is also in an area where scripts should be a comfortable notion (dba on unix systems). He actually suggested using "excel" to do the same task. But then again he did earn his wings as a dba being paid by the hour ... so perhaps can be forgiven for not minding a bit of menial busy work ;-)

inlining register reads

42.

I've been looking at the way the epiphany on-core library implements a couple of functions in order to improve them. One of the simplest are the special register read/write functions used for dma and other purposes.

 unsigned e_reg_read(e_core_reg_id_t reg_id);

 unsigned e_reg_read(e_core_reg_id_t reg_id)
 {
        volatile register unsigned reg_val;
        unsigned *addr;

        // TODO: function affects integer flags. Add special API for STATUS
        switch (reg_id)
        {
        case E_REG_CONFIG:
                __asm__ __volatile__ ("MOVFS %0, CONFIG" : "=r" (reg_val) : );
                return reg_val;
        case E_REG_STATUS:
                __asm__ __volatile__ ("MOVFS %0, STATUS" : "=r" (reg_val) : );
                return reg_val;
        default:
                addr = (unsigned *) e_get_global_address(e_group_config.core_row,
                                     e_group_config.core_col, (void *) reg_id);
                return *addr;
        }
 }

As alluded to in the comments this actually breaks reading the status register anyway ... and it is incomplete. There are 42 special registers but because they are not contiguous and the actual memory address is passed to the function the compiler generates either a giant jump table or a long sequence of nested branches searching for the switch target.

And apart from this any calling code needs to go via the a function invocation which may be as simple as a branch but is more likely to be a 32-bit load followed by a jsr, and then the function itself needs to implement a switch.

Update: I added some markup to the output. Bold is the code that is required to do the actual job, for the first examples it includes the e_reg_read function implementation but in the ideal case it is a single instruction. Italic is bad code either due to incorrect implementation or the compiler going whacko the didlio for whatever reason.

00000000 _e_reg_read:
   0:   40e2            mov r2,r0
   2:   000b 0042       mov r0,0x400
   6:   01eb 1002       movt r0,0xf
   a:   d65c 2700       str lr,[sp],-0x4
   e:   283a            sub r1,r2,r0
  10:   2800            beq 60 <_e_reg_read+0x60>
  12:   008b 0042       mov r0,0x404
  16:   01eb 1002       movt r0,0xf
  1a:   283a            sub r1,r2,r0
  1c:   1700            beq 4a <_e_reg_read+0x4a>
  1e:   000b 0002       mov r0,0x0
  22:   200b 0002       mov r1,0x0
  26:   000b 1002       movt r0,0x0
  2a:   200b 1002       movt r1,0x0
  2e:   600b 0002       mov r3,0x0
  32:   0044            ldr r0,[r0]
  34:   2444            ldr r1,[r1]
  36:   600b 1002       movt r3,0x0
  3a:   0d52            jalr r3
  3c:   d64c 2400       ldr lr,[sp,+0x4]
  40:   0044            ldr r0,[r0]
  42:   b41b 2402       add sp,sp,16
  46:   194f 0402       rts
  4a:   0512            movfs r0,status
  4c:   15dc 0400       str r0,[sp,+0x3]
  50:   15cc 0400       ldr r0,[sp,+0x3]
  54:   d64c 2400       ldr lr,[sp,+0x4]
  58:   b41b 2402       add sp,sp,16
  5c:   194f 0402       rts
  60:   0112            movfs r0,config
  62:   15dc 0400       str r0,[sp,+0x3]
  66:   15cc 0400       ldr r0,[sp,+0x3]
  6a:   d64c 2400       ldr lr,[sp,+0x4]
  6e:   b41b 2402       add sp,sp,16
  72:   194f 0402       rts
  76:   01a2    nop

And an example call:

unsigned a;
void foo(void) {
 a = e_reg_read(E_REG_STATUS);
}
 --> e-gcc -std=gnu99 -O2  -c -o e-foo.o e-foo.c

00000000 _foo:
   0:   008b 0042       mov r0,0x404
   4:   200b 0002       mov r1,0x0
   8:   d55c 2700       str lr,[sp],-0x2
   c:   200b 1002       movt r1,0x0
  10:   01eb 1002       movt r0,0xf
  14:   0552            jalr r1
  16:   400b 0002       mov r2,0x0
  1a:   400b 1002       movt r2,0x0
  1e:   0854            str r0,[r2]
  20:   d54c 2400       ldr lr,[sp,+0x2]
  24:   04e2            mov r0,r1
  26:   b41b 2401       add sp,sp,8
  2a:   194f 0402       rts
  2e:   01a2    nop

Can we do better?

inline

So the solution is to inline it. Simply moving e_reg_read to an inline function in a header helps. Well sort of helps.

 static inline unsigned ex_reg_read(e_core_reg_id_t reg_id)
 .. exactly the same

int foo_inline(void) {
        a = ex_reg_read(E_REG_STATUS);
}

 --> e-gcc -std=gnu99 -O2  -c -o e-foo.o e-foo.c

00000000 _foo_inline:
   0:   b41b 24ff       add sp,sp,-8
   4:   0512            movfs r0,status
   6:   15dc 0400       str r0,[sp,+0x3]
   a:   35cc 0400       ldr r1,[sp,+0x3]
   e:   000b 0002       mov r0,0x0
  12:   000b 1002       movt r0,0x0
  16:   2054            str r1,[r0]
  18:   810b 2002       mov r12,0x8
  1c:   b61f 248a       add sp,sp,r12
  20:   194f 0402       rts

Yeah not sure what's going on there to make it go through the stack. Maybe the type or something. What the hell? (oh I later worked it out: the unnecessary volatile on the reg_val value is forcing a store to and read from the stack which is not desirable at all, i filed a bug in prickhub).

Actually I'm going backwards here, I actually already wrote this and wanted to compare to what currently happens, so lets just forget all that and see what I came up with.

static inline uint32_t
ez_reg_read(e_core_reg_id_t id) {
        register uint32_t v;

        switch (id) {
        case E_REG_CONFIG:
                asm volatile ("movfs %0,config" : "=r" (v));
                break;
        case E_REG_STATUS:
                asm volatile ("movfs %0,status" : "=r" (v));
                break;
        default:
                v = *((volatile uint32_t *) e_get_global_address(e_group_config.core_row,
                     e_group_config.core_col, (void *) id));
                break;
        }
        return v;
}

And an example.

void fooz_inline(void) {
    a = ez_reg_read(E_REG_STATUS);
}

 --> e-gcc -std=gnu99 -O2  -c -o e-foo.o e-foo.c

00000024 _fooz_inline:
   0:   2512            movfs r1,status
   2:   000b 0002       mov r0,0x0
   6:   000b 1002       movt r0,0x0
   a:   2054            str r1,[r0]
   c:   194f 0402       rts

Ok that's what I wanted. This helps the compiler generate much better code too since the result can go in any register, no scratch registers need to be saved, etc.

So I took this and proceeded to add all 42 of the special registers ... and then compiled an example that had more than one register read and ... damn. Suddenly it decides that it doesn't want to inline it anymore and turns every get into a function call to a 912 byte function. Oops. __attribute__((always_inline)) fixed that at least. Although it's different depending on the compiler - on the device I don't need to add the always_inline thing (or maybe some tiny detail was different).

However, things get a bit nasty when a non-constant expression is passed. Suddenly it inlines as much of that gigantic switch statement as it needs - potentially the whole lot if it doesn't know the range. Obviously not much use.

gcc has one last trick ... __builtin_constant_p(x). This returns true if the parameter is probably a constant. Since for this particular case on the epiphany any special register can just be read by a global memory access (as it uses already for a fallback) this can be used to decide the path to use.

#define ez_reg_read(x) (__builtin_constant_p(x) \
    ? _ez_reg_read(x) \
    : (*((volatile uint32_t *) e_get_global_address( \
       e_group_config.core_row, e_group_config.core_col, (void *) x))))

The macro decides whether to call _ez_reg_read() which will compile into a single movfs instruction, or fall-back to the memory load path (this may have some nasty unrolled-loop cases, hopefully unlikely). Although ... It's probably not terribly important to support non-constant parameters because any tools that need it can do it themselves and the api could just be for known registers (the enum implies that already).

Given how much code the current implementation compiles into it doesn't seem worth special casing any registers at all and it could just fall back to a global memory load every time. I'm lead to believe that movfs/movts goes through the same physical path (unfortunately) and since the required address is the key to the function there's little more to do.

I think elib can be shrunk significantly by changing some of the parameterisation and judicious use of inlining.

~~So ... whilst playing with a version of e_dma_wait() using this ... I think I found another gcc bug when __builtin_constant_p() is used, so yeah, non-constant args are in the bin.~~

Update: So it seems I missed the brackets on the macro which someone kindly pointed out on the forums ... oops. Actually before that I had written up the implementation and realised that the macro wasn't doing anything magical and __builtin_constant_p() can just go into the inline function itself.

Below is the code so far which results in identical output.

I thought it wouldn't work with no optimisation turned on but it seems to do the right thing. It still in-lines the ez_reg_read() but drops to the memory access path. This wont work for the config and status registers because they need special handling according to Andreas but I just realised I had separate entry points for those anyway so it should be ok. I'm not sure why C would ever be reading status anyway.

ez_regval_t 
EZ_ALWAYS_INLINE
ez_reg_read(ez_regid_t id) {
 register uint32_t v;

 if (__builtin_constant_p(id)) {
  switch (id) {
   // bank 0
  case E_REG_CONFIG:
   asm volatile ("movfs %0,config" : "=r" (v));
   break;
  case E_REG_STATUS:
   asm volatile ("movfs %0,status" : "=r" (v));
   break;
  case E_REG_PC:
   asm volatile ("movfs %0,pc" : "=r" (v));
   break;
  case E_REG_DEBUGSTATUS:
   asm volatile ("movfs %0,debug" : "=r" (v));
   break;
  ... every other named register ...
  default:
   // unknown register, who cares
   v = 0;
   break;
  }
 } else {
  v = *(volatile uint32_t *)ez_global_core_self((void *)id);
 }

 return v;
}

epiphany stack frame

I came across this months ago but forgot some of the finer details.

I thought that 8-byte empty area looked odd in generated code but I think it's there to simplify stacking saved registers due to the lack of pre-decrement addressing modes (otherwise it doesn't make much sense - leaf function use wouldn't matter if it had such addressing modes).

It's in the gcc source.

/* EPIPHANY stack frames look like:

             Before call                       After call
        +-----------------------+       +-----------------------+
        |                       |       |                       |
   high |  local variables,     |       |  local variables,     |
   mem  |  reg save area, etc.  |       |  reg save area, etc.  |
        |                       |       |                       |
        +-----------------------+       +-----------------------+
        |                       |       |                       |
        |  arguments on stack.  |       |  arguments on stack.  |
        |                       |       |                       |
  SP+8->+-----------------------+FP+8m->+-----------------------+
        | 2 word save area for  |       |  reg parm save area,  |
        | leaf funcs / flags    |       |  only created for     |
  SP+0->+-----------------------+       |  variable argument    |
                                        |  functions            |
                                 FP+8n->+-----------------------+
                                        |                       |
                                        |  register save area   |
                                        |                       |
                                        +-----------------------+
                                        |                       |
                                        |  local variables      |
                                        |                       |
                                  FP+0->+-----------------------+
                                        |                       |
                                        |  alloca allocations   |
                                        |                       |
                                        +-----------------------+
                                        |                       |
                                        |  arguments on stack   |
                                        |                       |
                                  SP+8->+-----------------------+
   low                                  | 2 word save area for  |
   memory                               | leaf funcs / flags    |
                                  SP+0->+-----------------------+

compiler strangeness

So I hit a strange issue with gcc. Well i don't know ... not 'strange', just unexpected. It probably doesn't matter much on x86 because it has so few registers and such a shitty breadth of addressing modes but on arm and epiphany it generates some pretty shit load/store code outside of an unexpected optimisation flag (and -O3, and even then only sometimes?).

Easiest to demonstrate in the epiphany instruction set.

A simple example:

extern const e_group_config_t e_group_config;
int id[8];
void foo(void) {
 id[0] = e_group_config.core_row;
 id[1] = e_group_config.core_col;
}

 -->

   0:   000b 0002       mov r0,0x0
                        0: R_EPIPHANY_LOW       _e_group_config+0x1c
   4:   000b 1002       movt r0,0x0
                        4: R_EPIPHANY_HIGH      _e_group_config+0x1c
   8:   2044            ldr r1,[r0]
   a:   000b 0002       mov r0,0x0
                        a: R_EPIPHANY_LOW       .bss
   e:   000b 1002       movt r0,0x0
                        e: R_EPIPHANY_HIGH      .bss
  12:   2054            str r1,[r0]
  14:   000b 0002       mov r0,0x0
                        14: R_EPIPHANY_LOW      _e_group_config+0x20
  18:   000b 1002       movt r0,0x0
                        18: R_EPIPHANY_HIGH     _e_group_config+0x20
  1c:   2044            ldr r1,[r0]
  1e:   000b 0002       mov r0,0x0
                        1e: R_EPIPHANY_LOW      .bss+0x4
  22:   000b 1002       movt r0,0x0
                        22: R_EPIPHANY_HIGH     .bss+0x4
  26:   2054            str r1,[r0]

Err, what?

It's basically going to the linker to resolve every memory reference (all those R_* reloc records), even for the array array. At first I thought this was just an epiphany-gcc thing but i cross checked on amd64 and arm with the same result. Curious.

Curious also ...

extern const e_group_config_t e_group_config;
int id[8];
void foo(void) {
 int *idp = id;
 const e_group_config_t *ep = &e_group_config;

 idp[0] = ep->core_row;
 idp[1] = ep->core_col;
}

 -->
   0:   200b 0002       mov r1,0x0
                        0: R_EPIPHANY_LOW       _e_group_config+0x1c
   4:   200b 1002       movt r1,0x0
                        4: R_EPIPHANY_HIGH      _e_group_config+0x1c
   8:   2444            ldr r1,[r1]
   a:   000b 0002       mov r0,0x0
                        a: R_EPIPHANY_LOW       .bss
   e:   000b 1002       movt r0,0x0
                        e: R_EPIPHANY_HIGH      .bss
  12:   2054            str r1,[r0]
  14:   200b 0002       mov r1,0x0
                        14: R_EPIPHANY_LOW      _e_group_config+0x20
  18:   200b 1002       movt r1,0x0
                        18: R_EPIPHANY_HIGH     _e_group_config+0x20
  1c:   2444            ldr r1,[r1]
  1e:   20d4            str r1,[r0,0x1]

This fixes the array references, but not the struct references.

If one hard-codes the pointer address (which is probably a better idea anyway - yes it really is) and uses the pointer-to-array trick, then things finally reach the most-straightforward-compilation I get by just looking at the code and thinking in assembly (which is how i always look at memory-accessing code).

#define e_group_config ((const e_group_config_t *)0x28)
int id[8];
void foo(void) {
  int *idp = id;
  idp[0] = e_group_config->core_row;
  idp[1] = e_group_config->core_col;[/code]
}

 -->

   0:   2503            mov r1,0x28
   2:   47c4            ldr r2,[r1,0x7]
   4:   000b 0002       mov r0,0x0
                        4: R_EPIPHANY_LOW       .bss
   8:   000b 1002       movt r0,0x0
                        8: R_EPIPHANY_HIGH      .bss
   c:   4054            str r2,[r0]
   e:   244c 0001       ldr r1,[r1,+0x8]
  12:   20d4            str r1,[r0,0x1]

Bit of a throwing-hands-in-the-air moment.

Using -O3 on the original example gives something reasonable:

   0:   200b 0002       mov r1,0x0
                        0: R_EPIPHANY_LOW       _e_group_config+0x1c
   4:   200b 1002       movt r1,0x0
                        4: R_EPIPHANY_HIGH      _e_group_config+0x1c
   8:   4444            ldr r2,[r1]
   a:   000b 0002       mov r0,0x0
                        a: R_EPIPHANY_LOW       .bss
   e:   24c4            ldr r1,[r1,0x1]
  10:   000b 1002       movt r0,0x0
                        10: R_EPIPHANY_HIGH     .bss
  14:   4054            str r2,[r0]
  16:   20d4            str r1,[r0,0x1]

Which is what it should've been doing to start with. After testing every optimisation flag different between -O3 and -O2 I found that it was -ftree-vectorize that activates this 'optimisation'.

I can only presume the cost model of offset address calculations is borrowing too much from x86 where the lack of registers and addressing modes favours pre-calculation every time. -O[s23] compile this the same on amd64 as one would expect.

   0:   8b 05 00 00 00 00       mov    0x0(%rip),%eax        # 6 
                        2: R_X86_64_PC32        e_group_config+0x18
   6:   89 05 00 00 00 00       mov    %eax,0x0(%rip)        # c 
                        8: R_X86_64_PC32        .bss-0x4
   c:   8b 05 00 00 00 00       mov    0x0(%rip),%eax        # 12 
                        e: R_X86_64_PC32        e_group_config+0x1c
  12:   89 05 00 00 00 00       mov    %eax,0x0(%rip)        # 18 
                        14: R_X86_64_PC32       .bss+0x7c

It might seem insignificant but the initial code size is 40 bytes vs 24 for the optimised (or 20 using hard address) - these minor things can add up pretty fast.

Looks like epiphany will need a pretty specific set of optimisation flags to get decent code (just using -O3 on it's own usually bloats the code too much).

Alternate runtime

I'm actually working toward an alternate runtime for epiphany cores. Just the e-lib stuff and loader anyway.

I was looking at creating a more epiphany optimised version of e_group_config and e_mem_config, both to save a few bytes and make access more efficient. I was just making sure every access could fit into a 16-bit instruction when a test build surprised me.

I've come up with this group-info structure which leads to more compact code for a variety of reasons:

struct ez_config_t {
    uint16_t reserved0;
    uint16_t reserved1;

    uint16_t group_size;
    uint16_t group_rows;
    uint16_t group_cols;

    uint16_t core_index;
    uint16_t core_row;
    uint16_t core_col;

    uint32_t group_id;
    void *extmem;

    uint32_t reserved2;
    uint32_t reserved3;
};

The layout isn't random - shorts are all within a 3-bit offset so a single 16-bit instruction can load them. The whole structure supports some expansion slots all which fit in with the 3-bit offset constraint for the data-type, and there is room for some bytes if necessary.

To test it I access every value once:

        #define ez_configp ((ez_config_t *)(0x28))
        int *idp = id;
        idp[0] = ez_configp->group_size;
        idp[1] = ez_configp->group_rows;
        idp[2] = ez_configp->group_cols;
        idp[3] = ez_configp->core_index;
        idp[4] = ez_configp->core_row;
        idp[5] = ez_configp->core_col;

        idp[6] = ez_configp->group_id;
        idp[7] = (int32_t)ez_configp->extmem;

 -->
   0:   2503            mov r1,0x28
   2:   000b 0002       mov r0,0x0
   6:   4524            ldrh r2,[r1,0x2]
   8:   000b 1002       movt r0,0x0
   c:   4054            str r2,[r0]
   e:   45a4            ldrh r2,[r1,0x3]
  10:   40d4            str r2,[r0,0x1]
  12:   4624            ldrh r2,[r1,0x4]
  14:   4154            str r2,[r0,0x2]
  16:   46a4            ldrh r2,[r1,0x5]
  18:   41d4            str r2,[r0,0x3]
  1a:   4724            ldrh r2,[r1,0x6]
  1c:   4254            str r2,[r0,0x4]
  1e:   47a4            ldrh r2,[r1,0x7]
  20:   42d4            str r2,[r0,0x5]
  22:   4744            ldr r2,[r1,0x6]
  24:   27c4            ldr r1,[r1,0x7]
  26:   4354            str r2,[r0,0x6]
  28:   23d4            str r1,[r0,0x7]

And the compiler's done exactly what you would expect here. Load the object base address and then simply access everything via an indexed access taking advantage of the hand-tuned layout to use a 16-bit instruction for all of them too.

I've included a couple of pre-calculated flat index values because these things are often needed in practical code and certainly to implement any group-wide primitives. This is somewhat better than the existing api which must calculate them on the fly.

    int *idp = id;
    idp[0] = e_group_config.group_rows * e_group_config.group_cols;
    idp[1] = e_group_config.group_rows;
    idp[2] = e_group_config.group_cols;
    idp[3] = e_group_config.group_row * e_group_config.group_cols + e_group_config.group_col;
    idp[4] = e_group_config.group_row;
    idp[5] = e_group_config.group_col;

    idp[6] = e_group_config.group_id;
    idp[7] = (int32_t)e_emem_config.base;

 --> -Os with default fpu mode
   0:   000b 0002       mov r0,0x0
   4:   000b 1002       movt r0,0x0
   8:   804c 2000       ldr r12,[r0,+0x0]
   c:   000b 0002       mov r0,0x0
  10:   000b 1002       movt r0,0x0
  14:   4044            ldr r2,[r0]
  16:   000b 4002       mov r16,0x0
  1a:   000b 0002       mov r0,0x0
  1e:   000b 1002       movt r0,0x0
  22:   010b 5002       movt r16,0x8
  26:   2112            movfs r1,config
  28:   0392            gid
  2a:   411f 4002       movfs r18,config
  2e:   487f 490a       orr r18,r18,r16
  32:   410f 4002       movts config,r18
  36:   0192            gie
  38:   0392            gid
  3a:   611f 4002       movfs r19,config
  3e:   6c7f 490a       orr r19,r19,r16
  42:   610f 4002       movts config,r19
  46:   0192            gie
  48:   0a2f 4087       fmul r16,r2,r12
  4c:   80dc 2000       str r12,[r0,+0x1]
  50:   800b 2002       mov r12,0x0
  54:   800b 3002       movt r12,0x0
  58:   4154            str r2,[r0,0x2]
  5a:   005c 4000       str r16,[r0]
  5e:   104c 4400       ldr r16,[r12,+0x0]
  62:   800b 2002       mov r12,0x0
  66:   800b 3002       movt r12,0x0
  6a:   412f 0807       fmul r2,r16,r2
  6e:   904c 2400       ldr r12,[r12,+0x0]
  72:   025c 4000       str r16,[r0,+0x4]
  76:   82dc 2000       str r12,[r0,+0x5]
  7a:   4a1f 008a       add r2,r2,r12
  7e:   41d4            str r2,[r0,0x3]
  80:   400b 0002       mov r2,0x0
  84:   400b 1002       movt r2,0x0
  88:   4844            ldr r2,[r2]
  8a:   4354            str r2,[r0,0x6]
  8c:   400b 0002       mov r2,0x0
  90:   400b 1002       movt r2,0x0
  94:   48c4            ldr r2,[r2,0x1]
  96:   43d4            str r2,[r0,0x7]
  98:   0392            gid
  9a:   611f 4002       movfs r19,config
  9e:   6c8f 480a       eor r19,r19,r1
  a2:   6ddf 480a       and r19,r19,r3
  a6:   6c8f 480a       eor r19,r19,r1
  aa:   610f 4002       movts config,r19
  ae:   0192            gie
  b0:   0392            gid
  b2:   011f 4002       movfs r16,config
  b6:   008f 480a       eor r16,r16,r1
  ba:   01df 480a       and r16,r16,r3
  be:   008f 480a       eor r16,r16,r1
  c2:   010f 4002       movts config,r16
  c6:   0192            gie

 --> -O3 with -mfp-mode=int
   0:   000b 0002       mov r0,0x0
   4:   000b 1002       movt r0,0x0
   8:   2044            ldr r1,[r0]
   a:   000b 0002       mov r0,0x0
   e:   000b 1002       movt r0,0x0
  12:   6044            ldr r3,[r0]
  14:   000b 0002       mov r0,0x0
  18:   000b 1002       movt r0,0x0
  1c:   804c 2000       ldr r12,[r0,+0x0]
  20:   4caf 4007       fmul r18,r3,r1
  24:   000b 4002       mov r16,0x0
  28:   000b 5002       movt r16,0x0
  2c:   000b 0002       mov r0,0x0
  30:   662f 4087       fmul r19,r1,r12
  34:   000b 1002       movt r0,0x0
  38:   204c 4800       ldr r17,[r16,+0x0]
  3c:   000b 4002       mov r16,0x0
  40:   4044            ldr r2,[r0]
  42:   000b 5002       movt r16,0x0
  46:   000b 0002       mov r0,0x0
  4a:   00cc 4800       ldr r16,[r16,+0x1]
  4e:   000b 1002       movt r0,0x0
  52:   491f 480a       add r18,r18,r2
  56:   605c 4000       str r19,[r0]
  5a:   80dc 2000       str r12,[r0,+0x1]
  5e:   2154            str r1,[r0,0x2]
  60:   41dc 4000       str r18,[r0,+0x3]
  64:   6254            str r3,[r0,0x4]
  66:   42d4            str r2,[r0,0x5]
  68:   235c 4000       str r17,[r0,+0x6]
  6c:   03dc 4000       str r16,[r0,+0x7]

Unless the code has no flops the fpumode=int is probably not very useful but this probably represents the best it could possibly do. And there's some real funky config register shit going on there in the -Os version but that just has to be a bug.

Oh blast, and the absolute loads are back anyway!

My hands might not stay attached if i keep throwing them up in the air at this point.

For the hexadecimal challenged (i.e. me) each fragment is 42, 200, and 112 bytes long respectively. And each uses 3, 9, or 9 registers.

Saturday evening elf-loader hackery

Wasn't much on TV so I kept poking fairly late last night. I had a look at a Java binding to the code.

It's kind of looking a bit like OpenCL but without any of the queuing stuff.

I tried creating an empty demo to try out the api and I'm going to need a bit more runtime support to make it practical. So at present this is how the api might work for a mandelbrot painter.

First the communication structures that live in the epiphany code.

#define WIDTH 512
#define HEIGHT 512

struct shared {
 float left, top, right, bottom;
 jbyte status[16]; 
};

// Shared comm block
struct shared shared EZ_SHARED;
// RGBA pixels
byte pixels[WIDTH * HEIGHT * 4] EZ_SHARED;

And then an example main.

        EZPlatform plat = EZPlatform.init("system.hdf", EZ_SHARED_POINTERS);
        EZWorkgroup wg = plat.createWorkgroup(0, 0, 4, 4);

        EZProgram eprog = EZProgram.load("emandelbrot.elf");

        // Halt the cores
        wg.reset();

        // Bind program to all cores
        wg.bind(eprog, 0, 0, 4, 4);

        // Link/load the code
        wg.load();

        // Access comms structures
        ByteBuffer shared = wg.mapSymbol("_shared");
        ByteBuffer pixels = wg.mapSymbol("_pixels");

        // Job parameters
        shared.putFloat(0).putFloat(0).putFloat(1).putFloat(1).put(new byte[16]).rewind();

        // Start calculation
        wg.start();

        // Wait for all jobs to finish
        for (int i = 0; i < 16; i++) {
            while (shared.get(i + 16) == 0)
                try {
                    Thread.sleep(1);
                } catch (InterruptedException ex) {
                    Logger.getLogger(Test.class.getName()).log(Level.SEVERE, null, ex);
                }
        }
        
        // Use pixels.
        // ...

It's all pretty straightforward and decent until the job completion stuff. I probably want some way of abstracting that to something re-usable. Perhaps one day there will be some hardware support as well negating the need to poll the result. But just being able to look up structures by name is a big plus over the way you have to do it with the existing tools.

This (non-existent) example is just a one-shot execution but it already supports a persistent server mode. Perhaps it would also be useful to be able to support multi-kernel one-shot operation, e.g. choose the kernel and then a SYNC will launch a different main. If I do that then supporting kernel arguments would become useful although it's only worth it if the latency is ok versus the code size of a dispatch loop approach.

At the moment the .load() function is probably the interesting one. Internally this first relocates and links all the code to an arm-local buffer. Then it just memcpy's this to each core they are bound to. This state is remembered so it is possible to switch the functionality of a whole workgroup with a relatively cheap call. I don't think there's enough memory to do anything sophisticated like double-buffer the code though and given the alu to bandwidth mismatch as it is it probably wouldn't be much help anyway.

I do already have an 'EPort' primitive I included in the Java api. It's basically a non-locking cyclic counter which can be used to implement single writer / single reader queues very efficiently on the epiphany just using local memory reads and remote memory writes (i.e. non-blocking if not full and no mesh impact if it is). It's a bit limited though as for example you can only reserve or consume one slot at a time. Still useful mind you and it works with host-core as well as core-core.

I need to brush up again on some of the hardware workgroup support to see what other efficient primitives can be implemented (weird, the 4.13.x revision of the arch reference has vanished from the parallella site). Should be able to get a barrier at least although it's a bit more work having it work off-chip. Personally I think a mutex has no place in massive parallel hardware, although without a hardware atomic counter or mailboxes options are limited.

But maybe another day. I thought i'd had enough beer on Thursday (pretty much the last day of summer, 32 degrees and a warm balmy evening - absolutely awesome) but after finding out what the new contract is focussed on I'm ready for a Sunday Session even if it's just in my own back yard.

Saturday arvo elf-loader hack-a-thon

So I hacked a bit more of the elf loader today. Initially it was just documenting where I got to so I could work out where to go next. I wrote up a bit more background and detail on how it works. Then I cleared out the out of date experimental stuff and focussed on the ez_ interfaces.

But then I got a bit side-tracked ...

Compact startup

First I thought i'd see if i could replace crt0 with my own. Apart from an initialisation issue with bss and data there isn't really anything wrong with the bundled one ... apart from it dragging in a bunch of C support stuff which isn't necessary if writing small kernel code that doesn't need the full libc.

So I took the crt0.s and stripped out a bunch of stuff. The trampoline to a potentially off-core start routine (it's going to be on-core). The atexit init (not necessary). The constructors init (hopefully not necessary). The .bss clearing (it's problematic when you have possibly more than one block of bss such as shared and given that .data isn't re-initialised anyway the C language behaviour is already broken). And the argument setting (three zero's isn't useful for anything and isn't even correct).

I toyed with the idea of passing arguments to kernel but decided to just have a void main instead.

I just pass -nostartfiles e-crt0minimal.o to the link line to replace the standard start-up code.

 e-gcc -Wl,-r -o e-test-reloc.elf -nostartfiles e-crt0minimal.o e-test-reloc.o -le-lib

Worth it? Probably ...

Minimal crt0:

$ e-size e-test-workgroup-a.elf
   text    data     bss     dec     hex filename
    548      64     120     732     2dc e-test-workgroup-a.elf

Standard crt0:

$ e-size e-test-workgroup-a.elf
   text    data     bss     dec     hex filename
   1418    1196     128    2742     ab6 e-test-workgroup-a.elf

When you've only got 32K to play with saving 2K isn't to be sniffed at.

Matching addresses (aka fake hsa)

Next I looked at just using mmap to map the shared memory so pointers can be shared between the epiphany and the host arm directly. I tried to use e_get_platform_info() to get the list of memory blocks but for some odd reason that zeros out the memory array pointer? Odd. So ... I just access the struct directly via an extern instead.

This is just an implementation of stuff from a previous post but using the platform_info to find the addresses.

I have no idea whether this will work on a multi-epiphany setup but since I don't have one it's not something i'll lose sleep over :)

COM symbols

About this point I noticed that some symbols weren't being allocated any location in the output file and thus could not be resolved during the loader-linker execution. These were symbols marked with the section id of COMMON. I had hit this before but I had forgotten all about it. Last time I solved it using a linker script but I found I can just pass '-d' to the linker to achieve the same result which puts any such values into bss.

Automatic remote-core on-chip symbol resolution

Then I had a look into implementing fully automatic resolution of remote but on-chip symbols. The options are limited but the desired target core can be indicated by the symbol name.

For example:

program a:
   extern int buffer[12];

program b:
   // current cell relative +0, +1
   extern int buffer_0_1[12];

   // group relative 0,1
   extern int buffer$0$1[12];

Or some variation thereof. This is easy enough to parse and implement, and not too ugly to use.

But it would mean that the binary for every core would need to be linked individually and thus it wouldn't be possible to just copy the same code across to multiple cores when they share the implementation. For this reason I've dropped the idea for now. Having to use e_get_global_address() on a weak symbol isn't too difficult.

About Me

Tags

ez-lib-tiny

add/or, or, add?

ez-lib-tiny

Bullshit quote of the day: "compilers can create better code than you can".

unix or gnu: sh n awk. finding the next lun.

inlining register reads

inline

epiphany stack frame

compiler strangeness

Alternate runtime

Saturday evening elf-loader hackery

Saturday arvo elf-loader hack-a-thon

Compact startup

Matching addresses (aka fake hsa)

COM symbols

Automatic remote-core on-chip symbol resolution

About Me

Tags

ez-lib-tiny

add/or, or, add?

ez-lib-tiny

Bullshit quote of the day: "compilers can create better code than you can".

unix or gnu: sh n awk. finding the next lun.

inlining register reads

inline

epiphany stack frame

compiler strangeness

Alternate runtime

Saturday evening elf-loader hackery

Saturday arvo elf-loader hack-a-thon

Compact startup

Matching addresses (aka fake hsa)

*COM* symbols

Automatic remote-core on-chip symbol resolution

COM symbols