inlining register reads
42.
I've been looking at the way the epiphany on-core library implements a couple of functions in order to improve them. One of the simplest are the special register read/write functions used for dma and other purposes.
unsigned e_reg_read(e_core_reg_id_t reg_id); unsigned e_reg_read(e_core_reg_id_t reg_id) { volatile register unsigned reg_val; unsigned *addr; // TODO: function affects integer flags. Add special API for STATUS switch (reg_id) { case E_REG_CONFIG: __asm__ __volatile__ ("MOVFS %0, CONFIG" : "=r" (reg_val) : ); return reg_val; case E_REG_STATUS: __asm__ __volatile__ ("MOVFS %0, STATUS" : "=r" (reg_val) : ); return reg_val; default: addr = (unsigned *) e_get_global_address(e_group_config.core_row, e_group_config.core_col, (void *) reg_id); return *addr; } }
As alluded to in the comments this actually breaks reading the status register anyway ... and it is incomplete. There are 42 special registers but because they are not contiguous and the actual memory address is passed to the function the compiler generates either a giant jump table or a long sequence of nested branches searching for the switch target.
And apart from this any calling code needs to go via the a function invocation which may be as simple as a branch but is more likely to be a 32-bit load followed by a jsr, and then the function itself needs to implement a switch.
Update: I added some markup to the output. Bold is the code that is required to do the actual job, for the first examples it includes the e_reg_read function implementation but in the ideal case it is a single instruction. Italic is bad code either due to incorrect implementation or the compiler going whacko the didlio for whatever reason.
00000000 _e_reg_read: 0: 40e2 mov r2,r0 2: 000b 0042 mov r0,0x400 6: 01eb 1002 movt r0,0xf a: d65c 2700 str lr,[sp],-0x4 e: 283a sub r1,r2,r0 10: 2800 beq 60 <_e_reg_read+0x60> 12: 008b 0042 mov r0,0x404 16: 01eb 1002 movt r0,0xf 1a: 283a sub r1,r2,r0 1c: 1700 beq 4a <_e_reg_read+0x4a> 1e: 000b 0002 mov r0,0x0 22: 200b 0002 mov r1,0x0 26: 000b 1002 movt r0,0x0 2a: 200b 1002 movt r1,0x0 2e: 600b 0002 mov r3,0x0 32: 0044 ldr r0,[r0] 34: 2444 ldr r1,[r1] 36: 600b 1002 movt r3,0x0 3a: 0d52 jalr r3 3c: d64c 2400 ldr lr,[sp,+0x4] 40: 0044 ldr r0,[r0] 42: b41b 2402 add sp,sp,16 46: 194f 0402 rts 4a: 0512 movfs r0,status 4c: 15dc 0400 str r0,[sp,+0x3] 50: 15cc 0400 ldr r0,[sp,+0x3] 54: d64c 2400 ldr lr,[sp,+0x4] 58: b41b 2402 add sp,sp,16 5c: 194f 0402 rts 60: 0112 movfs r0,config 62: 15dc 0400 str r0,[sp,+0x3] 66: 15cc 0400 ldr r0,[sp,+0x3] 6a: d64c 2400 ldr lr,[sp,+0x4] 6e: b41b 2402 add sp,sp,16 72: 194f 0402 rts 76: 01a2 nop
And an example call:
unsigned a; void foo(void) { a = e_reg_read(E_REG_STATUS); } --> e-gcc -std=gnu99 -O2 -c -o e-foo.o e-foo.c 00000000 _foo: 0: 008b 0042 mov r0,0x404 4: 200b 0002 mov r1,0x0 8: d55c 2700 str lr,[sp],-0x2 c: 200b 1002 movt r1,0x0 10: 01eb 1002 movt r0,0xf 14: 0552 jalr r1 16: 400b 0002 mov r2,0x0 1a: 400b 1002 movt r2,0x0 1e: 0854 str r0,[r2] 20: d54c 2400 ldr lr,[sp,+0x2] 24: 04e2 mov r0,r1 26: b41b 2401 add sp,sp,8 2a: 194f 0402 rts 2e: 01a2 nop
Can we do better?
inline
So the solution is to inline it. Simply moving e_reg_read to an inline function in a header helps. Well sort of helps.
static inline unsigned ex_reg_read(e_core_reg_id_t reg_id) .. exactly the same int foo_inline(void) { a = ex_reg_read(E_REG_STATUS); } --> e-gcc -std=gnu99 -O2 -c -o e-foo.o e-foo.c 00000000 _foo_inline: 0: b41b 24ff add sp,sp,-8 4: 0512 movfs r0,status 6: 15dc 0400 str r0,[sp,+0x3] a: 35cc 0400 ldr r1,[sp,+0x3] e: 000b 0002 mov r0,0x0 12: 000b 1002 movt r0,0x0 16: 2054 str r1,[r0] 18: 810b 2002 mov r12,0x8 1c: b61f 248a add sp,sp,r12 20: 194f 0402 rts
Yeah not sure what's going on there to make it go through the stack. Maybe the type or something. What the hell? (oh I later worked it out: the unnecessary volatile on the reg_val value is forcing a store to and read from the stack which is not desirable at all, i filed a bug in prickhub).
Actually I'm going backwards here, I actually already wrote this and wanted to compare to what currently happens, so lets just forget all that and see what I came up with.
static inline uint32_t ez_reg_read(e_core_reg_id_t id) { register uint32_t v; switch (id) { case E_REG_CONFIG: asm volatile ("movfs %0,config" : "=r" (v)); break; case E_REG_STATUS: asm volatile ("movfs %0,status" : "=r" (v)); break; default: v = *((volatile uint32_t *) e_get_global_address(e_group_config.core_row, e_group_config.core_col, (void *) id)); break; } return v; }
And an example.
void fooz_inline(void) { a = ez_reg_read(E_REG_STATUS); } --> e-gcc -std=gnu99 -O2 -c -o e-foo.o e-foo.c 00000024 _fooz_inline: 0: 2512 movfs r1,status 2: 000b 0002 mov r0,0x0 6: 000b 1002 movt r0,0x0 a: 2054 str r1,[r0] c: 194f 0402 rts
Ok that's what I wanted. This helps the compiler generate much better code too since the result can go in any register, no scratch registers need to be saved, etc.
So I took this and proceeded to add all 42 of the special registers ... and then compiled an example that had more than one register read and ... damn. Suddenly it decides that it doesn't want to inline it anymore and turns every get into a function call to a 912 byte function. Oops. __attribute__((always_inline))
fixed that at least. Although it's different depending on the compiler - on the device I don't need to add the always_inline thing (or maybe some tiny detail was different).
However, things get a bit nasty when a non-constant expression is passed. Suddenly it inlines as much of that gigantic switch statement as it needs - potentially the whole lot if it doesn't know the range. Obviously not much use.
gcc has one last trick ... __builtin_constant_p(x)
. This returns true if the parameter is probably a constant. Since for this particular case on the epiphany any special register can just be read by a global memory access (as it uses already for a fallback) this can be used to decide the path to use.
#define ez_reg_read(x) (__builtin_constant_p(x) \ ? _ez_reg_read(x) \ : (*((volatile uint32_t *) e_get_global_address( \ e_group_config.core_row, e_group_config.core_col, (void *) x))))
The macro decides whether to call _ez_reg_read() which will compile into a single movfs instruction, or fall-back to the memory load path (this may have some nasty unrolled-loop cases, hopefully unlikely). Although ... It's probably not terribly important to support non-constant parameters because any tools that need it can do it themselves and the api could just be for known registers (the enum implies that already).
Given how much code the current implementation compiles into it doesn't seem worth special casing any registers at all and it could just fall back to a global memory load every time. I'm lead to believe that movfs/movts goes through the same physical path (unfortunately) and since the required address is the key to the function there's little more to do.
I think elib can be shrunk significantly by changing some of the parameterisation and judicious use of inlining.
So ... whilst playing with a version of e_dma_wait() using this ... I think I found another gcc bug when __builtin_constant_p() is used, so yeah, non-constant args are in the bin.
Update: So it seems I missed the brackets on the macro which someone kindly pointed out on the forums ... oops. Actually before that I had written up the implementation and realised that the macro wasn't doing anything magical and __builtin_constant_p() can just go into the inline function itself.
Below is the code so far which results in identical output.
I thought it wouldn't work with no optimisation turned on but it seems to do the right thing. It still in-lines the ez_reg_read() but drops to the memory access path. This wont work for the config and status registers because they need special handling according to Andreas but I just realised I had separate entry points for those anyway so it should be ok. I'm not sure why C would ever be reading status anyway.
ez_regval_t EZ_ALWAYS_INLINE ez_reg_read(ez_regid_t id) { register uint32_t v; if (__builtin_constant_p(id)) { switch (id) { // bank 0 case E_REG_CONFIG: asm volatile ("movfs %0,config" : "=r" (v)); break; case E_REG_STATUS: asm volatile ("movfs %0,status" : "=r" (v)); break; case E_REG_PC: asm volatile ("movfs %0,pc" : "=r" (v)); break; case E_REG_DEBUGSTATUS: asm volatile ("movfs %0,debug" : "=r" (v)); break; ... every other named register ... default: // unknown register, who cares v = 0; break; } } else { v = *(volatile uint32_t *)ez_global_core_self((void *)id); } return v; }