inlining register reads
I've been looking at the way the epiphany on-core library implements a couple of functions in order to improve them. Among the simplest are the special register read/write functions used for dma and other purposes.
unsigned e_reg_read(e_core_reg_id_t reg_id);
unsigned e_reg_read(e_core_reg_id_t reg_id)
{
volatile register unsigned reg_val;
unsigned *addr;
// TODO: function affects integer flags. Add special API for STATUS
switch (reg_id)
{
case E_REG_CONFIG:
__asm__ __volatile__ ("MOVFS %0, CONFIG" : "=r" (reg_val) : );
return reg_val;
case E_REG_STATUS:
__asm__ __volatile__ ("MOVFS %0, STATUS" : "=r" (reg_val) : );
return reg_val;
default:
addr = (unsigned *) e_get_global_address(e_group_config.core_row,
e_group_config.core_col, (void *) reg_id);
return *addr;
}
}
As alluded to in the comments this actually breaks reading the status register anyway ... and it is incomplete. There are 42 special registers but because they are not contiguous and the actual memory address is passed to the function the compiler generates either a giant jump table or a long sequence of nested branches searching for the switch target.
And apart from this, any calling code needs to go via a function invocation, which may be as simple as a branch but is more likely to be a 32-bit address load followed by a jsr, and then the function itself needs to implement the switch.
Update: I added some markup to the output. Bold is the code that is required to do the actual job; for the first examples it includes the e_reg_read function implementation, but in the ideal case it is a single instruction. Italic is bad code, either due to incorrect implementation or the compiler going whacko the didlio for whatever reason.
00000000 _e_reg_read:
0: 40e2 mov r2,r0
2: 000b 0042 mov r0,0x400
6: 01eb 1002 movt r0,0xf
a: d65c 2700 str lr,[sp],-0x4
e: 283a sub r1,r2,r0
10: 2800 beq 60 <_e_reg_read+0x60>
12: 008b 0042 mov r0,0x404
16: 01eb 1002 movt r0,0xf
1a: 283a sub r1,r2,r0
1c: 1700 beq 4a <_e_reg_read+0x4a>
1e: 000b 0002 mov r0,0x0
22: 200b 0002 mov r1,0x0
26: 000b 1002 movt r0,0x0
2a: 200b 1002 movt r1,0x0
2e: 600b 0002 mov r3,0x0
32: 0044 ldr r0,[r0]
34: 2444 ldr r1,[r1]
36: 600b 1002 movt r3,0x0
3a: 0d52 jalr r3
3c: d64c 2400 ldr lr,[sp,+0x4]
40: 0044 ldr r0,[r0]
42: b41b 2402 add sp,sp,16
46: 194f 0402 rts
4a: 0512 movfs r0,status
4c: 15dc 0400 str r0,[sp,+0x3]
50: 15cc 0400 ldr r0,[sp,+0x3]
54: d64c 2400 ldr lr,[sp,+0x4]
58: b41b 2402 add sp,sp,16
5c: 194f 0402 rts
60: 0112 movfs r0,config
62: 15dc 0400 str r0,[sp,+0x3]
66: 15cc 0400 ldr r0,[sp,+0x3]
6a: d64c 2400 ldr lr,[sp,+0x4]
6e: b41b 2402 add sp,sp,16
72: 194f 0402 rts
76: 01a2 nop
And an example call:
unsigned a;
void foo(void) {
a = e_reg_read(E_REG_STATUS);
}
--> e-gcc -std=gnu99 -O2 -c -o e-foo.o e-foo.c
00000000 _foo:
0: 008b 0042 mov r0,0x404
4: 200b 0002 mov r1,0x0
8: d55c 2700 str lr,[sp],-0x2
c: 200b 1002 movt r1,0x0
10: 01eb 1002 movt r0,0xf
14: 0552 jalr r1
16: 400b 0002 mov r2,0x0
1a: 400b 1002 movt r2,0x0
1e: 0854 str r0,[r2]
20: d54c 2400 ldr lr,[sp,+0x2]
24: 04e2 mov r0,r1
26: b41b 2401 add sp,sp,8
2a: 194f 0402 rts
2e: 01a2 nop
Can we do better?
inline
So the solution is to inline it. Simply moving e_reg_read to an inline function in a header helps. Well, sort of helps.
static inline unsigned ex_reg_read(e_core_reg_id_t reg_id)
.. exactly the same
void foo_inline(void) {
a = ex_reg_read(E_REG_STATUS);
}
--> e-gcc -std=gnu99 -O2 -c -o e-foo.o e-foo.c
00000000 _foo_inline:
0: b41b 24ff add sp,sp,-8
4: 0512 movfs r0,status
6: 15dc 0400 str r0,[sp,+0x3]
a: 35cc 0400 ldr r1,[sp,+0x3]
e: 000b 0002 mov r0,0x0
12: 000b 1002 movt r0,0x0
16: 2054 str r1,[r0]
18: 810b 2002 mov r12,0x8
1c: b61f 248a add sp,sp,r12
20: 194f 0402 rts
Yeah, not sure what's going on there to make it go through the stack. Maybe the type or something. What the hell? (Oh, I later worked it out: the unnecessary volatile on the reg_val value forces a store to and a read from the stack, which is not desirable at all; I filed a bug in prickhub.)
Actually I'm going backwards here - I'd already written this and wanted to compare it to what currently happens - so let's just forget all that and see what I came up with.
static inline uint32_t
ez_reg_read(e_core_reg_id_t id) {
register uint32_t v;
switch (id) {
case E_REG_CONFIG:
asm volatile ("movfs %0,config" : "=r" (v));
break;
case E_REG_STATUS:
asm volatile ("movfs %0,status" : "=r" (v));
break;
default:
v = *((volatile uint32_t *) e_get_global_address(e_group_config.core_row,
e_group_config.core_col, (void *) id));
break;
}
return v;
}
And an example.
void fooz_inline(void) {
a = ez_reg_read(E_REG_STATUS);
}
--> e-gcc -std=gnu99 -O2 -c -o e-foo.o e-foo.c
00000024 _fooz_inline:
0: 2512 movfs r1,status
2: 000b 0002 mov r0,0x0
6: 000b 1002 movt r0,0x0
a: 2054 str r1,[r0]
c: 194f 0402 rts
Ok that's what I wanted. This helps the compiler generate much better code too since the result can go in any register, no scratch registers need to be saved, etc.
So I took this and proceeded to add all 42 of the special registers ... and then compiled an example that had more than one register read and ... damn. Suddenly it decides that it doesn't want to inline it anymore and turns every get into a function call to a 912-byte function. Oops. __attribute__((always_inline)) fixed that at least. Although it's different depending on the compiler - on the device I don't need to add the always_inline thing (or maybe some tiny detail was different).
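For the record, forcing it is just a function attribute on the inline function from before; a minimal sketch only (the EZ_ALWAYS_INLINE macro that appears in the final version further down is presumably just hiding the same attribute):
static inline __attribute__((always_inline)) unsigned
ex_reg_read(e_core_reg_id_t reg_id)
.. body exactly the same as before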
However, things get a bit nasty when a non-constant expression is passed. Suddenly it inlines as much of that gigantic switch statement as it needs - potentially the whole lot if it doesn't know the range. Obviously not much use.
gcc has one last trick ... __builtin_constant_p(x). This returns true if gcc knows the parameter is a compile-time constant. Since for this particular case on the epiphany any special register can just be read by a global memory access (which it already uses as a fallback) this can be used to decide the path to use.
#define ez_reg_read(x) (__builtin_constant_p(x) \
? _ez_reg_read(x) \
: (*((volatile uint32_t *) e_get_global_address( \
e_group_config.core_row, e_group_config.core_col, (void *) x))))
The macro decides whether to call _ez_reg_read() which will compile into a single movfs instruction, or fall-back to the memory load path (this may have some nasty unrolled-loop cases, hopefully unlikely). Although ... It's probably not terribly important to support non-constant parameters because any tools that need it can do it themselves and the api could just be for known registers (the enum implies that already).
Given how much code the current implementation compiles into it doesn't seem worth special-casing any registers at all and it could just fall back to a global memory load every time. I'm led to believe that movfs/movts goes through the same physical path (unfortunately) and since the required address is the key to the function there's little more to do.
I think elib can be shrunk significantly by changing some of the parameterisation and judicious use of inlining.
So ... whilst playing with a version of e_dma_wait() using this ... I think I found another gcc bug when __builtin_constant_p() is used, so yeah, non-constant args are in the bin.
Update: So it seems I missed the brackets on the macro which someone kindly pointed out on the forums ... oops. Actually before that I had written up the implementation and realised that the macro wasn't doing anything magical and __builtin_constant_p() can just go into the inline function itself.
Below is the code so far, which results in identical output.
I thought it wouldn't work with no optimisation turned on but it seems to do the right thing: it still inlines ez_reg_read() but drops to the memory access path. This won't work for the config and status registers because they need special handling according to Andreas, but I just realised I had separate entry points for those anyway so it should be ok. I'm not sure why C would ever be reading status anyway.
ez_regval_t
EZ_ALWAYS_INLINE
ez_reg_read(ez_regid_t id) {
register uint32_t v;
if (__builtin_constant_p(id)) {
switch (id) {
// bank 0
case E_REG_CONFIG:
asm volatile ("movfs %0,config" : "=r" (v));
break;
case E_REG_STATUS:
asm volatile ("movfs %0,status" : "=r" (v));
break;
case E_REG_PC:
asm volatile ("movfs %0,pc" : "=r" (v));
break;
case E_REG_DEBUGSTATUS:
asm volatile ("movfs %0,debug" : "=r" (v));
break;
... every other named register ...
default:
// unknown register, who cares
v = 0;
break;
}
} else {
v = *(volatile uint32_t *)ez_global_core_self((void *)id);
}
return v;
}
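ez_global_core_self() isn't defined here; judging by the fallback path in the original e_reg_read it would be doing the equivalent of this (my assumption, not the actual runtime source):
static inline void *
ez_global_core_self(void *ptr) {
    return e_get_global_address(e_group_config.core_row,
                                e_group_config.core_col, ptr);
}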
epiphany stack frame
I came across this months ago but forgot some of the finer details.
I thought that 8-byte empty area looked odd in generated code but I think it's there to simplify stacking saved registers due to the lack of pre-decrement addressing modes (otherwise it doesn't make much sense - leaf function use wouldn't matter if it had such addressing modes).
It's in the gcc source.
/* EPIPHANY stack frames look like:
Before call After call
+-----------------------+ +-----------------------+
| | | |
high | local variables, | | local variables, |
mem | reg save area, etc. | | reg save area, etc. |
| | | |
+-----------------------+ +-----------------------+
| | | |
| arguments on stack. | | arguments on stack. |
| | | |
SP+8->+-----------------------+FP+8m->+-----------------------+
| 2 word save area for | | reg parm save area, |
| leaf funcs / flags | | only created for |
SP+0->+-----------------------+ | variable argument |
| functions |
FP+8n->+-----------------------+
| |
| register save area |
| |
+-----------------------+
| |
| local variables |
| |
FP+0->+-----------------------+
| |
| alloca allocations |
| |
+-----------------------+
| |
| arguments on stack |
| |
SP+8->+-----------------------+
low | 2 word save area for |
memory | leaf funcs / flags |
SP+0->+-----------------------+
compiler strangeness
So I hit a strange issue with gcc. Well, I don't know ... not 'strange', just unexpected. It probably doesn't matter much on x86 because it has so few registers and such a shitty breadth of addressing modes, but on arm and epiphany it generates some pretty shit load/store code unless an unexpected optimisation flag is used (and -O3, and even then only sometimes?).
Easiest to demonstrate in the epiphany instruction set.
A simple example:
extern const e_group_config_t e_group_config;
int id[8];
void foo(void) {
id[0] = e_group_config.core_row;
id[1] = e_group_config.core_col;
}
-->
0: 000b 0002 mov r0,0x0
0: R_EPIPHANY_LOW _e_group_config+0x1c
4: 000b 1002 movt r0,0x0
4: R_EPIPHANY_HIGH _e_group_config+0x1c
8: 2044 ldr r1,[r0]
a: 000b 0002 mov r0,0x0
a: R_EPIPHANY_LOW .bss
e: 000b 1002 movt r0,0x0
e: R_EPIPHANY_HIGH .bss
12: 2054 str r1,[r0]
14: 000b 0002 mov r0,0x0
14: R_EPIPHANY_LOW _e_group_config+0x20
18: 000b 1002 movt r0,0x0
18: R_EPIPHANY_HIGH _e_group_config+0x20
1c: 2044 ldr r1,[r0]
1e: 000b 0002 mov r0,0x0
1e: R_EPIPHANY_LOW .bss+0x4
22: 000b 1002 movt r0,0x0
22: R_EPIPHANY_HIGH .bss+0x4
26: 2054 str r1,[r0]
Err, what?
It's basically going to the linker to resolve every memory reference (all those R_* reloc records), even for the id array. At first I thought this was just an epiphany-gcc thing but I cross-checked on amd64 and arm with the same result. Curious.
Curious also ...
extern const e_group_config_t e_group_config;
int id[8];
void foo(void) {
int *idp = id;
const e_group_config_t *ep = &e_group_config;
idp[0] = ep->core_row;
idp[1] = ep->core_col;
}
-->
0: 200b 0002 mov r1,0x0
0: R_EPIPHANY_LOW _e_group_config+0x1c
4: 200b 1002 movt r1,0x0
4: R_EPIPHANY_HIGH _e_group_config+0x1c
8: 2444 ldr r1,[r1]
a: 000b 0002 mov r0,0x0
a: R_EPIPHANY_LOW .bss
e: 000b 1002 movt r0,0x0
e: R_EPIPHANY_HIGH .bss
12: 2054 str r1,[r0]
14: 200b 0002 mov r1,0x0
14: R_EPIPHANY_LOW _e_group_config+0x20
18: 200b 1002 movt r1,0x0
18: R_EPIPHANY_HIGH _e_group_config+0x20
1c: 2444 ldr r1,[r1]
1e: 20d4 str r1,[r0,0x1]
This fixes the array references, but not the struct references.
If one hard-codes the pointer address (which is probably a better idea anyway - yes it really is) and uses the pointer-to-array trick, then things finally reach the most straightforward compilation I'd expect from just looking at the code and thinking in assembly (which is how I always look at memory-accessing code).
#define e_group_config ((const e_group_config_t *)0x28)
int id[8];
void foo(void) {
int *idp = id;
idp[0] = e_group_config->core_row;
idp[1] = e_group_config->core_col;
}
-->
0: 2503 mov r1,0x28
2: 47c4 ldr r2,[r1,0x7]
4: 000b 0002 mov r0,0x0
4: R_EPIPHANY_LOW .bss
8: 000b 1002 movt r0,0x0
8: R_EPIPHANY_HIGH .bss
c: 4054 str r2,[r0]
e: 244c 0001 ldr r1,[r1,+0x8]
12: 20d4 str r1,[r0,0x1]
Bit of a throwing-hands-in-the-air moment.
Using -O3 on the original example gives something reasonable:
0: 200b 0002 mov r1,0x0
0: R_EPIPHANY_LOW _e_group_config+0x1c
4: 200b 1002 movt r1,0x0
4: R_EPIPHANY_HIGH _e_group_config+0x1c
8: 4444 ldr r2,[r1]
a: 000b 0002 mov r0,0x0
a: R_EPIPHANY_LOW .bss
e: 24c4 ldr r1,[r1,0x1]
10: 000b 1002 movt r0,0x0
10: R_EPIPHANY_HIGH .bss
14: 4054 str r2,[r0]
16: 20d4 str r1,[r0,0x1]
Which is what it should've been doing to start with. After testing every optimisation flag different between -O3 and -O2 I found that it was -ftree-vectorize that activates this 'optimisation'.
I can only presume the cost model of offset address calculations is borrowing too much from x86, where the lack of registers and addressing modes favours pre-calculation every time. -O[s23] all compile this the same on amd64, as one would expect.
0: 8b 05 00 00 00 00 mov 0x0(%rip),%eax # 6
2: R_X86_64_PC32 e_group_config+0x18
6: 89 05 00 00 00 00 mov %eax,0x0(%rip) # c
8: R_X86_64_PC32 .bss-0x4
c: 8b 05 00 00 00 00 mov 0x0(%rip),%eax # 12
e: R_X86_64_PC32 e_group_config+0x1c
12: 89 05 00 00 00 00 mov %eax,0x0(%rip) # 18
14: R_X86_64_PC32 .bss+0x7c
It might seem insignificant but the initial code size is 40 bytes vs 24 for the optimised (or 20 using hard address) - these minor things can add up pretty fast.
Looks like epiphany will need a pretty specific set of optimisation flags to get decent code (just using -O3 on its own usually bloats the code too much).
Alternate runtime
I'm actually working toward an alternate runtime for epiphany cores. Just the e-lib stuff and loader anyway.
I was looking at creating a more epiphany optimised version of e_group_config and e_mem_config, both to save a few bytes and make access more efficient. I was just making sure every access could fit into a 16-bit instruction when a test build surprised me.
I've come up with this group-info structure which leads to more compact code for a variety of reasons:
struct ez_config_t {
uint16_t reserved0;
uint16_t reserved1;
uint16_t group_size;
uint16_t group_rows;
uint16_t group_cols;
uint16_t core_index;
uint16_t core_row;
uint16_t core_col;
uint32_t group_id;
void *extmem;
uint32_t reserved2;
uint32_t reserved3;
};
The layout isn't random - the shorts all sit within a 3-bit offset of the base so a single 16-bit instruction can load them. The whole structure allows for some expansion slots, all of which fit within the 3-bit offset constraint for their data type, and there is room for some bytes if necessary.
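As a sanity check on that constraint (my addition, not part of the runtime): the 16-bit ldrh encoding only has a 3-bit displacement scaled by the access size, so every uint16_t field has to sit within byte offsets 0..14 of the base. Assuming a compiler with C11 _Static_assert, something like this catches any future rearrangement at compile time:
#include <stddef.h>
#include <stdint.h>
#define CHECK_HALF(f) \
    _Static_assert(offsetof(struct ez_config_t, f) <= 14, \
                   #f " is out of range of a 16-bit ldrh")
CHECK_HALF(group_size);
CHECK_HALF(group_rows);
CHECK_HALF(group_cols);
CHECK_HALF(core_index);
CHECK_HALF(core_row);
CHECK_HALF(core_col);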
To test it I access every value once:
#define ez_configp ((ez_config_t *)(0x28))
int *idp = id;
idp[0] = ez_configp->group_size;
idp[1] = ez_configp->group_rows;
idp[2] = ez_configp->group_cols;
idp[3] = ez_configp->core_index;
idp[4] = ez_configp->core_row;
idp[5] = ez_configp->core_col;
idp[6] = ez_configp->group_id;
idp[7] = (int32_t)ez_configp->extmem;
-->
0: 2503 mov r1,0x28
2: 000b 0002 mov r0,0x0
6: 4524 ldrh r2,[r1,0x2]
8: 000b 1002 movt r0,0x0
c: 4054 str r2,[r0]
e: 45a4 ldrh r2,[r1,0x3]
10: 40d4 str r2,[r0,0x1]
12: 4624 ldrh r2,[r1,0x4]
14: 4154 str r2,[r0,0x2]
16: 46a4 ldrh r2,[r1,0x5]
18: 41d4 str r2,[r0,0x3]
1a: 4724 ldrh r2,[r1,0x6]
1c: 4254 str r2,[r0,0x4]
1e: 47a4 ldrh r2,[r1,0x7]
20: 42d4 str r2,[r0,0x5]
22: 4744 ldr r2,[r1,0x6]
24: 27c4 ldr r1,[r1,0x7]
26: 4354 str r2,[r0,0x6]
28: 23d4 str r1,[r0,0x7]
And the compiler's done exactly what you would expect here. Load the object base address and then simply access everything via an indexed access taking advantage of the hand-tuned layout to use a 16-bit instruction for all of them too.
I've included a couple of pre-calculated flat index values because these things are often needed in practical code and certainly to implement any group-wide primitives. This is somewhat better than the existing api which must calculate them on the fly.
int *idp = id;
idp[0] = e_group_config.group_rows * e_group_config.group_cols;
idp[1] = e_group_config.group_rows;
idp[2] = e_group_config.group_cols;
idp[3] = e_group_config.group_row * e_group_config.group_cols + e_group_config.group_col;
idp[4] = e_group_config.group_row;
idp[5] = e_group_config.group_col;
idp[6] = e_group_config.group_id;
idp[7] = (int32_t)e_emem_config.base;
--> -Os with default fpu mode
0: 000b 0002 mov r0,0x0
4: 000b 1002 movt r0,0x0
8: 804c 2000 ldr r12,[r0,+0x0]
c: 000b 0002 mov r0,0x0
10: 000b 1002 movt r0,0x0
14: 4044 ldr r2,[r0]
16: 000b 4002 mov r16,0x0
1a: 000b 0002 mov r0,0x0
1e: 000b 1002 movt r0,0x0
22: 010b 5002 movt r16,0x8
26: 2112 movfs r1,config
28: 0392 gid
2a: 411f 4002 movfs r18,config
2e: 487f 490a orr r18,r18,r16
32: 410f 4002 movts config,r18
36: 0192 gie
38: 0392 gid
3a: 611f 4002 movfs r19,config
3e: 6c7f 490a orr r19,r19,r16
42: 610f 4002 movts config,r19
46: 0192 gie
48: 0a2f 4087 fmul r16,r2,r12
4c: 80dc 2000 str r12,[r0,+0x1]
50: 800b 2002 mov r12,0x0
54: 800b 3002 movt r12,0x0
58: 4154 str r2,[r0,0x2]
5a: 005c 4000 str r16,[r0]
5e: 104c 4400 ldr r16,[r12,+0x0]
62: 800b 2002 mov r12,0x0
66: 800b 3002 movt r12,0x0
6a: 412f 0807 fmul r2,r16,r2
6e: 904c 2400 ldr r12,[r12,+0x0]
72: 025c 4000 str r16,[r0,+0x4]
76: 82dc 2000 str r12,[r0,+0x5]
7a: 4a1f 008a add r2,r2,r12
7e: 41d4 str r2,[r0,0x3]
80: 400b 0002 mov r2,0x0
84: 400b 1002 movt r2,0x0
88: 4844 ldr r2,[r2]
8a: 4354 str r2,[r0,0x6]
8c: 400b 0002 mov r2,0x0
90: 400b 1002 movt r2,0x0
94: 48c4 ldr r2,[r2,0x1]
96: 43d4 str r2,[r0,0x7]
98: 0392 gid
9a: 611f 4002 movfs r19,config
9e: 6c8f 480a eor r19,r19,r1
a2: 6ddf 480a and r19,r19,r3
a6: 6c8f 480a eor r19,r19,r1
aa: 610f 4002 movts config,r19
ae: 0192 gie
b0: 0392 gid
b2: 011f 4002 movfs r16,config
b6: 008f 480a eor r16,r16,r1
ba: 01df 480a and r16,r16,r3
be: 008f 480a eor r16,r16,r1
c2: 010f 4002 movts config,r16
c6: 0192 gie
--> -O3 with -mfp-mode=int
0: 000b 0002 mov r0,0x0
4: 000b 1002 movt r0,0x0
8: 2044 ldr r1,[r0]
a: 000b 0002 mov r0,0x0
e: 000b 1002 movt r0,0x0
12: 6044 ldr r3,[r0]
14: 000b 0002 mov r0,0x0
18: 000b 1002 movt r0,0x0
1c: 804c 2000 ldr r12,[r0,+0x0]
20: 4caf 4007 fmul r18,r3,r1
24: 000b 4002 mov r16,0x0
28: 000b 5002 movt r16,0x0
2c: 000b 0002 mov r0,0x0
30: 662f 4087 fmul r19,r1,r12
34: 000b 1002 movt r0,0x0
38: 204c 4800 ldr r17,[r16,+0x0]
3c: 000b 4002 mov r16,0x0
40: 4044 ldr r2,[r0]
42: 000b 5002 movt r16,0x0
46: 000b 0002 mov r0,0x0
4a: 00cc 4800 ldr r16,[r16,+0x1]
4e: 000b 1002 movt r0,0x0
52: 491f 480a add r18,r18,r2
56: 605c 4000 str r19,[r0]
5a: 80dc 2000 str r12,[r0,+0x1]
5e: 2154 str r1,[r0,0x2]
60: 41dc 4000 str r18,[r0,+0x3]
64: 6254 str r3,[r0,0x4]
66: 42d4 str r2,[r0,0x5]
68: 235c 4000 str r17,[r0,+0x6]
6c: 03dc 4000 str r16,[r0,+0x7]
Unless the code has no flops, -mfp-mode=int is probably not very useful, but this probably represents the best it could possibly do. And there's some real funky config register shit going on in the -Os version, but that just has to be a bug.
Oh blast, and the absolute loads are back anyway!
My hands might not stay attached if i keep throwing them up in the air at this point.
For the hexadecimal-challenged (i.e. me), each fragment is 42, 200, and 112 bytes long respectively. And each uses 3, 9, or 9 registers.
Saturday evening elf-loader hackery
Wasn't much on TV so I kept poking fairly late last night. I had a look at a Java binding to the code.
It's kind of looking a bit like OpenCL but without any of the queuing stuff.
I tried creating an empty demo to try out the api and I'm going to need a bit more runtime support to make it practical. So at present this is how the api might work for a mandelbrot painter.
First the communication structures that live in the epiphany code.
#define WIDTH 512
#define HEIGHT 512
struct shared {
float left, top, right, bottom;
uint8_t status[16];
};
// Shared comm block
struct shared shared EZ_SHARED;
// RGBA pixels
uint8_t pixels[WIDTH * HEIGHT * 4] EZ_SHARED;
And then an example main.
EZPlatform plat = EZPlatform.init("system.hdf", EZ_SHARED_POINTERS);
EZWorkgroup wg = plat.createWorkgroup(0, 0, 4, 4);
EZProgram eprog = EZProgram.load("emandelbrot.elf");
// Halt the cores
wg.reset();
// Bind program to all cores
wg.bind(eprog, 0, 0, 4, 4);
// Link/load the code
wg.load();
// Access comms structures
ByteBuffer shared = wg.mapSymbol("_shared");
ByteBuffer pixels = wg.mapSymbol("_pixels");
// Job parameters
shared.putFloat(0).putFloat(0).putFloat(1).putFloat(1).put(new byte[16]).rewind();
// Start calculation
wg.start();
// Wait for all jobs to finish
for (int i = 0; i < 16; i++) {
while (shared.get(i + 16) == 0)
try {
Thread.sleep(1);
} catch (InterruptedException ex) {
Logger.getLogger(Test.class.getName()).log(Level.SEVERE, null, ex);
}
}
// Use pixels.
// ...
It's all pretty straightforward and decent until the job completion stuff. I probably want some way of abstracting that to something re-usable. Perhaps one day there will be some hardware support as well negating the need to poll the result. But just being able to look up structures by name is a big plus over the way you have to do it with the existing tools.
This (non-existent) example is just a one-shot execution but it already supports a persistent server mode. Perhaps it would also be useful to be able to support multi-kernel one-shot operation, e.g. choose the kernel and then a SYNC will launch a different main. If I do that then supporting kernel arguments would become useful although it's only worth it if the latency is ok versus the code size of a dispatch loop approach.
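Roughly what the dispatch-loop alternative might look like on the core side - entirely illustrative, none of these names exist in the current code and the shared-flag layout is made up:
#include <stdint.h>
typedef void (*kernel_fn)(void);
extern void mandelbrot_main(void);
extern void julia_main(void);
static const kernel_fn kernels[] = { mandelbrot_main, julia_main };
// host writes kernel index + arguments then sets go; core clears go when done
volatile struct { uint32_t kernel; uint32_t go; } job EZ_SHARED;
int main(void) {
    for (;;) {
        while (!job.go)
            ;
        kernels[job.kernel]();
        job.go = 0;
    }
}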
At the moment the .load() function is probably the interesting one. Internally this first relocates and links all the code to an arm-local buffer. Then it just memcpy's this to each core it is bound to. This state is remembered so it is possible to switch the functionality of a whole workgroup with a relatively cheap call. I don't think there's enough memory to do anything sophisticated like double-buffer the code though, and given the alu-to-bandwidth mismatch as it is it probably wouldn't be much help anyway.
I do already have an 'EPort' primitive I included in the Java api. It's basically a non-locking cyclic counter which can be used to implement single writer / single reader queues very efficiently on the epiphany just using local memory reads and remote memory writes (i.e. non-blocking if not full and no mesh impact if it is). It's a bit limited though as for example you can only reserve or consume one slot at a time. Still useful mind you and it works with host-core as well as core-core.
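My reading of how such a port hangs together, as a sketch only (this is not the actual EPort source and the names are invented): each side polls a counter that lives in its own local memory and only ever writes the other side's counter with a remote store, so waiting generates no mesh traffic at all.
#include <stdint.h>
typedef struct ez_port {
    volatile uint32_t local;   /* updated remotely by the peer, polled locally */
    volatile uint32_t *remote; /* global (mesh) address of the peer's 'local' field */
    uint32_t value;            /* this side's running count */
    uint32_t size;             /* slot count; power of two so the counters wrap safely */
} ez_port_t;
/* writer: wait for a free slot, returns the slot index to fill */
static inline uint32_t ez_port_reserve(ez_port_t *p) {
    while (p->value - p->local >= p->size)
        ;                      /* spin on local memory only */
    return p->value % p->size;
}
/* writer: publish the slot with a single remote write */
static inline void ez_port_post(ez_port_t *p) {
    *p->remote = ++p->value;
}
/* reader: wait for an item, returns the slot index to read */
static inline uint32_t ez_port_wait(ez_port_t *p) {
    while (p->local == p->value)
        ;                      /* nothing posted yet */
    return p->value % p->size;
}
/* reader: hand the slot back to the writer */
static inline void ez_port_done(ez_port_t *p) {
    *p->remote = ++p->value;
}
The 'remote' pointers would be set up with e_get_global_address() pointing at the peer's structure; the one-slot-at-a-time limitation mentioned above falls out of the reserve/post and wait/done pairing.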
I need to brush up again on some of the hardware workgroup support to see what other efficient primitives can be implemented (weird, the 4.13.x revision of the arch reference has vanished from the parallella site). Should be able to get a barrier at least, although it's a bit more work having it work off-chip. Personally I think a mutex has no place in massively parallel hardware, although without a hardware atomic counter or mailboxes options are limited.
But maybe another day. I thought i'd had enough beer on Thursday (pretty much the last day of summer, 32 degrees and a warm balmy evening - absolutely awesome) but after finding out what the new contract is focussed on I'm ready for a Sunday Session even if it's just in my own back yard.
Saturday arvo elf-loader hack-a-thon
So I hacked a bit more of the elf loader today. Initially it was just documenting where I got to so I could work out where to go next. I wrote up a bit more background and detail on how it works. Then I cleared out the out of date experimental stuff and focussed on the ez_ interfaces.
But then I got a bit side-tracked ...
Compact startup
First I thought I'd see if I could replace crt0 with my own. Apart from an initialisation issue with bss and data there isn't really anything wrong with the bundled one ... apart from it dragging in a bunch of C support stuff which isn't necessary when writing small kernel code that doesn't need the full libc.
So I took the crt0.s and stripped out a bunch of stuff. The trampoline to a potentially off-core start routine (it's going to be on-core). The atexit init (not necessary). The constructors init (hopefully not necessary). The .bss clearing (it's problematic when you have possibly more than one block of bss, such as shared, and given that .data isn't re-initialised anyway the C language behaviour is already broken). And the argument setting (three zeros isn't useful for anything and isn't even correct).
I toyed with the idea of passing arguments to kernel but decided to just have a void main instead.
I just pass -nostartfiles e-crt0minimal.o to the link line to replace the standard start-up code.
e-gcc -Wl,-r -o e-test-reloc.elf -nostartfiles e-crt0minimal.o e-test-reloc.o -le-lib
Worth it? Probably ...
Minimal crt0:
$ e-size e-test-workgroup-a.elf
text data bss dec hex filename
548 64 120 732 2dc e-test-workgroup-a.elf
Standard crt0:
$ e-size e-test-workgroup-a.elf
text data bss dec hex filename
1418 1196 128 2742 ab6 e-test-workgroup-a.elf
When you've only got 32K to play with saving 2K isn't to be sniffed at.
Matching addresses (aka fake hsa)
Next I looked at just using mmap to map the shared memory so pointers can be shared between the epiphany and the host arm directly. I tried to use e_get_platform_info() to get the list of memory blocks but for some odd reason that zeros out the memory array pointer? Odd. So ... I just access the struct directly via an extern instead.
This is just an implementation of stuff from a previous post but using the platform_info to find the addresses.
I have no idea whether this will work on a multi-epiphany setup but since I don't have one it's not something i'll lose sleep over :)
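The gist of the mapping as a sketch (not the actual code - the device node, physical base and size would really come from the platform_info / hdf data rather than being hard-wired like this):
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
/* map the shared external block at the same virtual address the epiphany
   sees it at, so raw pointers can be passed in either direction */
static void *map_shared(off_t phys_base, void *epiphany_base, size_t size) {
    int fd = open("/dev/mem", O_RDWR | O_SYNC);
    if (fd == -1)
        return MAP_FAILED;
    void *p = mmap(epiphany_base, size, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_FIXED, fd, phys_base);
    close(fd);
    return p;
}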
*COM* symbols
About this point I noticed that some symbols weren't being allocated any location in the output file and thus could not be resolved during the loader-linker execution. These were symbols marked with the section id of COMMON. I had hit this before but I had forgotten all about it. Last time I solved it using a linker script but I found I can just pass '-d' to the linker to achieve the same result which puts any such values into bss.
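On top of the earlier link line that is presumably just one more flag passed through to ld, something like:
e-gcc -Wl,-r -Wl,-d -o e-test-reloc.elf -nostartfiles e-crt0minimal.o e-test-reloc.o -le-lib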
Automatic remote-core on-chip symbol resolution
Then I had a look into implementing fully automatic resolution of remote but on-chip symbols. The options are limited but the desired target core can be indicated by the symbol name.
For example:
program a:
extern int buffer[12];
program b:
// current cell relative +0, +1
extern int buffer_0_1[12];
// group relative 0,1
extern int buffer$0$1[12];
Or some variation thereof. This is easy enough to parse and implement, and not too ugly to use.
But it would mean that the binary for every core would need to be linked individually and thus it wouldn't be possible to just copy the same code across to multiple cores when they share the implementation. For this reason I've dropped the idea for now. Having to use e_get_global_address() on a weak symbol isn't too difficult.
'easy' elf loader for parallella
I prettied up some of the stuff I did a few months ago on the parallella code loader and uploaded it to the home page.
It is still very much work in progress (just a bunch of experiments) but it is currently able to take several distinct epiphany-core programs and relocate and cross-link them to any on-core topology - at runtime. Remote addresses in other cores can be partially resolved automatically (to the local core address offset - suitable for e_get_global_address()) by using weak symbols and the host can resolve symbols by name. By default sections go on-core but .section directives can redirect individual records to specific banks or to global memory. bss/text/data are all supported for any such section using standard names (no 'code' sections!).
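In the source that redirection is just the usual section attribute; purely illustrative here since the section names the loader actually recognises aren't spelled out in this post:
// placeholder names, not necessarily what the loader expects
int oncore_buffer[256];                                       // default: placed on-core
int big_table[4096] __attribute__((section(".data.external"))); // redirected to global memory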
Linker scripts are not needed for any of this and the only 'special sauce' is that the epiphany binaries be linked with -r. I mention this because this was the primary driving factor for me to write any of this. I would probably like to replace crt0 as well but that is something for the future (basically remove the bss init stuff).
When I next poke at it I want to work toward wrapping it in an accessible Java API. There is a bit to be done before that though (I think - it's been a while since I looked at it).
thoughts on opencl + array methods
I was going to have a quick look at removing the erroneous asynchronous Get/SetPrimitiveArrayCritical() stuff from zcl this morning but I've hit a complication too far for my tired brain.
I changed to just allocating a staging buffer and using Get/Set*ArrayRegion() for the read/writeBuffer commands. I only allocate enough memory for the transfer and copy the transfer size around and so on. It's a bit bulky but it's fairly straightforward.
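The buffer path is simple enough to sketch - this is just the shape of it (not the actual zcl source, and shown as a blocking variant for simplicity), using the real JNI region calls and a plain malloc'd staging buffer:
#include <stdlib.h>
#include <jni.h>
#include <CL/cl.h>
/* write 'count' floats from a java array into a cl buffer via a staging copy */
static cl_int write_buffer_region(JNIEnv *env, cl_command_queue q, cl_mem mem,
                                  size_t offset, jfloatArray array,
                                  jsize first, jsize count) {
    size_t bytes = (size_t)count * sizeof(jfloat);
    jfloat *staging = malloc(bytes);
    if (!staging)
        return CL_OUT_OF_HOST_MEMORY;
    /* copy only the region being transferred out of the java heap */
    (*env)->GetFloatArrayRegion(env, array, first, count, staging);
    /* blocking write keeps the staging buffer lifetime trivial */
    cl_int res = clEnqueueWriteBuffer(q, mem, CL_TRUE, offset, bytes,
                                      staging, 0, NULL, NULL);
    free(staging);
    return res;
}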
Then I started looking at the image interfaces and realised doing the same thing is somewhat more complex - either I have to copy the whole array to/from the staging buffer each time (if the get/set only updates a portion of the image, for example) or I have to flatten the transfer myself. The former pretty much makes the function pointless and the latter bulks out the binding and may require lots of jvm calls.
So now I'm deciding whether to just force synchronous transfers for all array interfaces, because they probably have some use despite synchronisation being the mind-killer, or to just delete them altogether and drop a ton of code. Since I've already got all the code the former will probably be the approach I take. The event callback stuff I'm using to finish up the transaction seems pretty expensive anyway so there may not be much net difference (against a net use count of zero for the library, at that - it's just something to pass the time).
On another note I decided to use cvs as my local repository backend to store this stuff. I think the all-day-never-finished checkout of gcc finally tipped me over, but I never liked subversion because it's too slow and is just shit at merging. I was surprised netbeans detected it and offered to install the cvs plugin automatically (and I was a little surprised it was already installed too). I don't need or want to use tools that weren't designed for my use-case.
Ho hum, back to work Tuesday. My boss actually apologised for taking so long to get the contract sorted but yeah, I'm not complaining! Looks like it'll mostly be a continuation of one of the projects I'm not terribly keen on too. Bummer I guess. All I really care about right now is sleep though.
Update: (I kept poking) I just removed the async handling code and force a blocking call. Get/SetPrimitiveArrayCritical() is used to access the arrays directly. I'll do a release another day though.
sumatra, graal, etc.
I didn't really wanna get stuck building stuff all day but that seems to have happened.
Sumatra uses auto* so it built easily no problem. Bummer there's no javafx but hopefully that isn't far off.
graal is a bit of a pain because it uses a completely custom build/update/everything mega-tool written in a single 4KLOC piece of python. Ok when it works, but it gives a meaningless backtrace if it doesn't. Well, it is early alpha software I guess. Still ... why?
Anyway ... I tried building against 'make install' from sumatra but that doesn't work; you need to point your JAVA_HOME at the Sumatra tree as the docs tell you to. My mistake there.
So it turns out getting the hsail tools to build had some point after all. The hsailasm downloaded by the "build" tool (in lib/okra-1.8-with-sim.jar) won't work against the libelf 0.x included in slackware (might be easier just building libelf 1.x). So I added the path to the hsailasm I built myself ... and ...
export PATH=/home/notzed/hsa/HSAIL-Instruction-Set-Simulator/build/HSAIL-Tools:$PATH
./mx.sh --vm server unittest -XX:+TraceGPUInteraction \
-XX:+GPUOffload -G:Log=CodeGen hsail.test.IntAddTest
[HSAIL] library is libokra_x86_64.so
[HSAIL] using _OKRA_SIM_LIB_PATH_=/tmp/okraresource.dir_7081062365578722856/libokra_x86_64.so
[GPU] registered initialization of Okra (total initialized: 1)
JUnit version 4.8
.[thread:1] scope:
[thread:1] scope: GraalCompiler
[thread:1] scope: GraalCompiler.CodeGen
Nothing to do here
Nothing to do here
Nothing to do here
version 0:95: $full : $large;
// static method HotSpotMethod
kernel &run (
align 8 kernarg_u64 %_arg0,
align 8 kernarg_u64 %_arg1,
align 8 kernarg_u64 %_arg2
) {
ld_kernarg_u64 $d0, [%_arg0];
ld_kernarg_u64 $d1, [%_arg1];
ld_kernarg_u64 $d2, [%_arg2];
workitemabsid_u32 $s0, 0;
@L0:
cmp_eq_b1_u64 $c0, $d0, 0; // null test
cbr $c0, @L1;
@L2:
ld_global_s32 $s1, [$d0 + 12];
cmp_ge_b1_u32 $c0, $s0, $s1;
cbr $c0, @L12;
@L3:
cmp_eq_b1_u64 $c0, $d2, 0; // null test
cbr $c0, @L4;
@L5:
ld_global_s32 $s1, [$d2 + 12];
cmp_ge_b1_u32 $c0, $s0, $s1;
cbr $c0, @L11;
@L6:
cmp_eq_b1_u64 $c0, $d1, 0; // null test
cbr $c0, @L7;
@L8:
ld_global_s32 $s1, [$d1 + 12];
cmp_ge_b1_u32 $c0, $s0, $s1;
cbr $c0, @L10;
@L9:
cvt_s64_s32 $d3, $s0;
mul_s64 $d3, $d3, 4;
add_u64 $d1, $d1, $d3;
ld_global_s32 $s1, [$d1 + 16];
cvt_s64_s32 $d1, $s0;
mul_s64 $d1, $d1, 4;
add_u64 $d2, $d2, $d1;
ld_global_s32 $s2, [$d2 + 16];
add_s32 $s2, $s2, $s1;
cvt_s64_s32 $d1, $s0;
mul_s64 $d1, $d1, 4;
add_u64 $d0, $d0, $d1;
st_global_s32 $s2, [$d0 + 16];
ret;
@L1:
mov_b32 $s0, -7691;
@L13:
ret;
@L4:
mov_b32 $s0, -6411;
brn @L13;
@L10:
mov_b32 $s0, -5403;
brn @L13;
@L7:
mov_b32 $s0, -4875;
brn @L13;
@L12:
mov_b32 $s0, -8219;
brn @L13;
@L11:
mov_b32 $s0, -6939;
brn @L13;
};
[HSAIL] heap=0x00007f47a8017a40
[HSAIL] base=0x95400000, capacity=108527616
External method:com.oracle.graal.compiler.hsail.test.IntAddTest.run([I[I[II)V
installCode0: ExternalCompilationResult
[HSAIL] sig:([I[I[II)V args length=3, _parameter_count=4
[HSAIL] static method
[HSAIL] HSAILKernelArguments::do_array, _index=0, 0xdd563828, is a [I
[HSAIL] HSAILKernelArguments::do_array, _index=1, 0xdd581718, is a [I
[HSAIL] HSAILKernelArguments::do_array, _index=2, 0xdd581778, is a [I
[HSAIL] HSAILKernelArguments::not pushing trailing int
Time: 0.213
OK (1 test)
Yay? I think?
Maybe not ... it seems that it's only using the simulator. I tried using LD_LIBRARY_PATH and -Djava.library.path to redirect to the libokra from the Okra-Interface-to-HSA-Device library but that just hangs after the "base=0x95..." line after dumping the hsail. strace isn't showing anything obvious so I'm not sure what's going on. Might've hit some ubuntu compatibility issue at last, or just a mismatch in versions of libokra.
On the other hand ... it was noticeable that something was happening with the gpu as the mouse started to judder, yet a simple ctrl-c killed it cleanly. Just that alone, once it makes it into OpenCL, will be worth its weight in cocky shit rather than just hard-locking X as it does with catalyst.
Having just typed that ... one test too many and it decided to crash into an unkillable process and do weird stuff (and not long after I had to reboot the system). But at least that is to be expected for alpha software and I've been pretty surprised by the overall system stability all things considered.
I think next time I'll just have a closer look at aparapi because at least I have that working with the APU, and I'm a bit sick of compiling other people's code and their strange build systems. Sumatra and graal are very large and complex projects and a bit more involved than I'm really interested in right now. I haven't used aparapi before anyway so I should have a look.
If the slackware vs ubuntu thing becomes too much of a hassle I might just go and buy another hdd and dual-boot; I already have to multi-boot to switch between opencl+accelerated javafx vs apu.
Update: Actually there may be something more to it. I just tried creating my own aparapi thing and it crashed in the same way, so maybe I was missing some env variable or it was due to a suspend/resume cycle.
So I just had another go at getting the graal test running on hsail and I think it worked:
export JAVA_HOME=/home/notzed/hsa/sumatra-dev/build/linux-x86_64-normal-server-release/images/j2sdk-image
export PATH=${JAVA_HOME}/bin:${PATH}
export LD_LIBRARY_PATH=/home/notzed/hsa/Okra-Interface-to-HSA-Device/okra/dist/bin
./mx.sh --vm server unittest -XX:+TraceGPUInteraction -XX:+GPUOffload -G:Log=CodeGen hsail.test.IntAddTest
[HSAIL] library is libokra_x86_64.so
[GPU] registered initialization of Okra (total initialized: 1)
JUnit version 4.8
...
Time: 0.198
OK (1 test)
Still, i'm not sure what to do with it yet ...
I was going to have a play today but I just got the word on work starting again (I can probably push it out to Monday) so I might just go to the pub or just for a ride - way too nice to be inside getting monitor burn. I foolishly decided to walk into the city yesterday for lunch with a mate I haven't seen for years (and did a few pubs on the way home - I was in no rush!) but just ended up with a nice big blister and sore feet for my troubles. It's about a 45-minute walk into the city but I don't do much walking.
I want to see if anything interesting comes out of Sony's GDC talk first though (the vr one? - yes, the VR one, it's just started).
And ... done. Interesting, but still early days. Even if they release a model for the public at a mass-market price it's still going to have to be a long-term project. Likely the first 5-10 years will just be experimentation and getting the technology to the point where it is good and cheap enough.
Update: Yep, I'm pretty sure it's just a problem with suspend/resume. I just tried running it after a resume and it panicked the kernel.