epiphany stack frame

I came across this months ago but forgot some of the finer details.

I thought that 8-byte empty area looked odd in generated code but I think it's there to simplify stacking saved registers due to the lack of pre-decrement addressing modes (otherwise it doesn't make much sense - leaf function use wouldn't matter if it had such addressing modes).

It's in the gcc source.

/* EPIPHANY stack frames look like:

             Before call                       After call
        +-----------------------+       +-----------------------+
        |                       |       |                       |
   high |  local variables,     |       |  local variables,     |
   mem  |  reg save area, etc.  |       |  reg save area, etc.  |
        |                       |       |                       |
        +-----------------------+       +-----------------------+
        |                       |       |                       |
        |  arguments on stack.  |       |  arguments on stack.  |
        |                       |       |                       |
  SP+8->+-----------------------+FP+8m->+-----------------------+
        | 2 word save area for  |       |  reg parm save area,  |
        | leaf funcs / flags    |       |  only created for     |
  SP+0->+-----------------------+       |  variable argument    |
                                        |  functions            |
                                 FP+8n->+-----------------------+
                                        |                       |
                                        |  register save area   |
                                        |                       |
                                        +-----------------------+
                                        |                       |
                                        |  local variables      |
                                        |                       |
                                  FP+0->+-----------------------+
                                        |                       |
                                        |  alloca allocations   |
                                        |                       |
                                        +-----------------------+
                                        |                       |
                                        |  arguments on stack   |
                                        |                       |
                                  SP+8->+-----------------------+
   low                                  | 2 word save area for  |
   memory                               | leaf funcs / flags    |
                                  SP+0->+-----------------------+

compiler strangeness

So I hit a strange issue with gcc. Well i don't know ... not 'strange', just unexpected. It probably doesn't matter much on x86 because it has so few registers and such a shitty breadth of addressing modes but on arm and epiphany it generates some pretty shit load/store code outside of an unexpected optimisation flag (and -O3, and even then only sometimes?).

Easiest to demonstrate in the epiphany instruction set.

A simple example:

extern const e_group_config_t e_group_config;
int id[8];
void foo(void) {
 id[0] = e_group_config.core_row;
 id[1] = e_group_config.core_col;
}

 -->

   0:   000b 0002       mov r0,0x0
                        0: R_EPIPHANY_LOW       _e_group_config+0x1c
   4:   000b 1002       movt r0,0x0
                        4: R_EPIPHANY_HIGH      _e_group_config+0x1c
   8:   2044            ldr r1,[r0]
   a:   000b 0002       mov r0,0x0
                        a: R_EPIPHANY_LOW       .bss
   e:   000b 1002       movt r0,0x0
                        e: R_EPIPHANY_HIGH      .bss
  12:   2054            str r1,[r0]
  14:   000b 0002       mov r0,0x0
                        14: R_EPIPHANY_LOW      _e_group_config+0x20
  18:   000b 1002       movt r0,0x0
                        18: R_EPIPHANY_HIGH     _e_group_config+0x20
  1c:   2044            ldr r1,[r0]
  1e:   000b 0002       mov r0,0x0
                        1e: R_EPIPHANY_LOW      .bss+0x4
  22:   000b 1002       movt r0,0x0
                        22: R_EPIPHANY_HIGH     .bss+0x4
  26:   2054            str r1,[r0]

Err, what?

It's basically going to the linker to resolve every memory reference (all those R_* reloc records), even for the array array. At first I thought this was just an epiphany-gcc thing but i cross checked on amd64 and arm with the same result. Curious.

Curious also ...

extern const e_group_config_t e_group_config;
int id[8];
void foo(void) {
 int *idp = id;
 const e_group_config_t *ep = &e_group_config;

 idp[0] = ep->core_row;
 idp[1] = ep->core_col;
}

 -->
   0:   200b 0002       mov r1,0x0
                        0: R_EPIPHANY_LOW       _e_group_config+0x1c
   4:   200b 1002       movt r1,0x0
                        4: R_EPIPHANY_HIGH      _e_group_config+0x1c
   8:   2444            ldr r1,[r1]
   a:   000b 0002       mov r0,0x0
                        a: R_EPIPHANY_LOW       .bss
   e:   000b 1002       movt r0,0x0
                        e: R_EPIPHANY_HIGH      .bss
  12:   2054            str r1,[r0]
  14:   200b 0002       mov r1,0x0
                        14: R_EPIPHANY_LOW      _e_group_config+0x20
  18:   200b 1002       movt r1,0x0
                        18: R_EPIPHANY_HIGH     _e_group_config+0x20
  1c:   2444            ldr r1,[r1]
  1e:   20d4            str r1,[r0,0x1]

This fixes the array references, but not the struct references.

If one hard-codes the pointer address (which is probably a better idea anyway - yes it really is) and uses the pointer-to-array trick, then things finally reach the most-straightforward-compilation I get by just looking at the code and thinking in assembly (which is how i always look at memory-accessing code).

#define e_group_config ((const e_group_config_t *)0x28)
int id[8];
void foo(void) {
  int *idp = id;
  idp[0] = e_group_config->core_row;
  idp[1] = e_group_config->core_col;[/code]
}

 -->

   0:   2503            mov r1,0x28
   2:   47c4            ldr r2,[r1,0x7]
   4:   000b 0002       mov r0,0x0
                        4: R_EPIPHANY_LOW       .bss
   8:   000b 1002       movt r0,0x0
                        8: R_EPIPHANY_HIGH      .bss
   c:   4054            str r2,[r0]
   e:   244c 0001       ldr r1,[r1,+0x8]
  12:   20d4            str r1,[r0,0x1]

Bit of a throwing-hands-in-the-air moment.

Using -O3 on the original example gives something reasonable:

   0:   200b 0002       mov r1,0x0
                        0: R_EPIPHANY_LOW       _e_group_config+0x1c
   4:   200b 1002       movt r1,0x0
                        4: R_EPIPHANY_HIGH      _e_group_config+0x1c
   8:   4444            ldr r2,[r1]
   a:   000b 0002       mov r0,0x0
                        a: R_EPIPHANY_LOW       .bss
   e:   24c4            ldr r1,[r1,0x1]
  10:   000b 1002       movt r0,0x0
                        10: R_EPIPHANY_HIGH     .bss
  14:   4054            str r2,[r0]
  16:   20d4            str r1,[r0,0x1]

Which is what it should've been doing to start with. After testing every optimisation flag different between -O3 and -O2 I found that it was -ftree-vectorize that activates this 'optimisation'.

I can only presume the cost model of offset address calculations is borrowing too much from x86 where the lack of registers and addressing modes favours pre-calculation every time. -O[s23] compile this the same on amd64 as one would expect.

   0:   8b 05 00 00 00 00       mov    0x0(%rip),%eax        # 6 
                        2: R_X86_64_PC32        e_group_config+0x18
   6:   89 05 00 00 00 00       mov    %eax,0x0(%rip)        # c 
                        8: R_X86_64_PC32        .bss-0x4
   c:   8b 05 00 00 00 00       mov    0x0(%rip),%eax        # 12 
                        e: R_X86_64_PC32        e_group_config+0x1c
  12:   89 05 00 00 00 00       mov    %eax,0x0(%rip)        # 18 
                        14: R_X86_64_PC32       .bss+0x7c

It might seem insignificant but the initial code size is 40 bytes vs 24 for the optimised (or 20 using hard address) - these minor things can add up pretty fast.

Looks like epiphany will need a pretty specific set of optimisation flags to get decent code (just using -O3 on it's own usually bloats the code too much).

Alternate runtime

I'm actually working toward an alternate runtime for epiphany cores. Just the e-lib stuff and loader anyway.

I was looking at creating a more epiphany optimised version of e_group_config and e_mem_config, both to save a few bytes and make access more efficient. I was just making sure every access could fit into a 16-bit instruction when a test build surprised me.

I've come up with this group-info structure which leads to more compact code for a variety of reasons:

struct ez_config_t {
    uint16_t reserved0;
    uint16_t reserved1;

    uint16_t group_size;
    uint16_t group_rows;
    uint16_t group_cols;

    uint16_t core_index;
    uint16_t core_row;
    uint16_t core_col;

    uint32_t group_id;
    void *extmem;

    uint32_t reserved2;
    uint32_t reserved3;
};

The layout isn't random - shorts are all within a 3-bit offset so a single 16-bit instruction can load them. The whole structure supports some expansion slots all which fit in with the 3-bit offset constraint for the data-type, and there is room for some bytes if necessary.

To test it I access every value once:

        #define ez_configp ((ez_config_t *)(0x28))
        int *idp = id;
        idp[0] = ez_configp->group_size;
        idp[1] = ez_configp->group_rows;
        idp[2] = ez_configp->group_cols;
        idp[3] = ez_configp->core_index;
        idp[4] = ez_configp->core_row;
        idp[5] = ez_configp->core_col;

        idp[6] = ez_configp->group_id;
        idp[7] = (int32_t)ez_configp->extmem;

 -->
   0:   2503            mov r1,0x28
   2:   000b 0002       mov r0,0x0
   6:   4524            ldrh r2,[r1,0x2]
   8:   000b 1002       movt r0,0x0
   c:   4054            str r2,[r0]
   e:   45a4            ldrh r2,[r1,0x3]
  10:   40d4            str r2,[r0,0x1]
  12:   4624            ldrh r2,[r1,0x4]
  14:   4154            str r2,[r0,0x2]
  16:   46a4            ldrh r2,[r1,0x5]
  18:   41d4            str r2,[r0,0x3]
  1a:   4724            ldrh r2,[r1,0x6]
  1c:   4254            str r2,[r0,0x4]
  1e:   47a4            ldrh r2,[r1,0x7]
  20:   42d4            str r2,[r0,0x5]
  22:   4744            ldr r2,[r1,0x6]
  24:   27c4            ldr r1,[r1,0x7]
  26:   4354            str r2,[r0,0x6]
  28:   23d4            str r1,[r0,0x7]

And the compiler's done exactly what you would expect here. Load the object base address and then simply access everything via an indexed access taking advantage of the hand-tuned layout to use a 16-bit instruction for all of them too.

I've included a couple of pre-calculated flat index values because these things are often needed in practical code and certainly to implement any group-wide primitives. This is somewhat better than the existing api which must calculate them on the fly.

    int *idp = id;
    idp[0] = e_group_config.group_rows * e_group_config.group_cols;
    idp[1] = e_group_config.group_rows;
    idp[2] = e_group_config.group_cols;
    idp[3] = e_group_config.group_row * e_group_config.group_cols + e_group_config.group_col;
    idp[4] = e_group_config.group_row;
    idp[5] = e_group_config.group_col;

    idp[6] = e_group_config.group_id;
    idp[7] = (int32_t)e_emem_config.base;

 --> -Os with default fpu mode
   0:   000b 0002       mov r0,0x0
   4:   000b 1002       movt r0,0x0
   8:   804c 2000       ldr r12,[r0,+0x0]
   c:   000b 0002       mov r0,0x0
  10:   000b 1002       movt r0,0x0
  14:   4044            ldr r2,[r0]
  16:   000b 4002       mov r16,0x0
  1a:   000b 0002       mov r0,0x0
  1e:   000b 1002       movt r0,0x0
  22:   010b 5002       movt r16,0x8
  26:   2112            movfs r1,config
  28:   0392            gid
  2a:   411f 4002       movfs r18,config
  2e:   487f 490a       orr r18,r18,r16
  32:   410f 4002       movts config,r18
  36:   0192            gie
  38:   0392            gid
  3a:   611f 4002       movfs r19,config
  3e:   6c7f 490a       orr r19,r19,r16
  42:   610f 4002       movts config,r19
  46:   0192            gie
  48:   0a2f 4087       fmul r16,r2,r12
  4c:   80dc 2000       str r12,[r0,+0x1]
  50:   800b 2002       mov r12,0x0
  54:   800b 3002       movt r12,0x0
  58:   4154            str r2,[r0,0x2]
  5a:   005c 4000       str r16,[r0]
  5e:   104c 4400       ldr r16,[r12,+0x0]
  62:   800b 2002       mov r12,0x0
  66:   800b 3002       movt r12,0x0
  6a:   412f 0807       fmul r2,r16,r2
  6e:   904c 2400       ldr r12,[r12,+0x0]
  72:   025c 4000       str r16,[r0,+0x4]
  76:   82dc 2000       str r12,[r0,+0x5]
  7a:   4a1f 008a       add r2,r2,r12
  7e:   41d4            str r2,[r0,0x3]
  80:   400b 0002       mov r2,0x0
  84:   400b 1002       movt r2,0x0
  88:   4844            ldr r2,[r2]
  8a:   4354            str r2,[r0,0x6]
  8c:   400b 0002       mov r2,0x0
  90:   400b 1002       movt r2,0x0
  94:   48c4            ldr r2,[r2,0x1]
  96:   43d4            str r2,[r0,0x7]
  98:   0392            gid
  9a:   611f 4002       movfs r19,config
  9e:   6c8f 480a       eor r19,r19,r1
  a2:   6ddf 480a       and r19,r19,r3
  a6:   6c8f 480a       eor r19,r19,r1
  aa:   610f 4002       movts config,r19
  ae:   0192            gie
  b0:   0392            gid
  b2:   011f 4002       movfs r16,config
  b6:   008f 480a       eor r16,r16,r1
  ba:   01df 480a       and r16,r16,r3
  be:   008f 480a       eor r16,r16,r1
  c2:   010f 4002       movts config,r16
  c6:   0192            gie

 --> -O3 with -mfp-mode=int
   0:   000b 0002       mov r0,0x0
   4:   000b 1002       movt r0,0x0
   8:   2044            ldr r1,[r0]
   a:   000b 0002       mov r0,0x0
   e:   000b 1002       movt r0,0x0
  12:   6044            ldr r3,[r0]
  14:   000b 0002       mov r0,0x0
  18:   000b 1002       movt r0,0x0
  1c:   804c 2000       ldr r12,[r0,+0x0]
  20:   4caf 4007       fmul r18,r3,r1
  24:   000b 4002       mov r16,0x0
  28:   000b 5002       movt r16,0x0
  2c:   000b 0002       mov r0,0x0
  30:   662f 4087       fmul r19,r1,r12
  34:   000b 1002       movt r0,0x0
  38:   204c 4800       ldr r17,[r16,+0x0]
  3c:   000b 4002       mov r16,0x0
  40:   4044            ldr r2,[r0]
  42:   000b 5002       movt r16,0x0
  46:   000b 0002       mov r0,0x0
  4a:   00cc 4800       ldr r16,[r16,+0x1]
  4e:   000b 1002       movt r0,0x0
  52:   491f 480a       add r18,r18,r2
  56:   605c 4000       str r19,[r0]
  5a:   80dc 2000       str r12,[r0,+0x1]
  5e:   2154            str r1,[r0,0x2]
  60:   41dc 4000       str r18,[r0,+0x3]
  64:   6254            str r3,[r0,0x4]
  66:   42d4            str r2,[r0,0x5]
  68:   235c 4000       str r17,[r0,+0x6]
  6c:   03dc 4000       str r16,[r0,+0x7]

Unless the code has no flops the fpumode=int is probably not very useful but this probably represents the best it could possibly do. And there's some real funky config register shit going on there in the -Os version but that just has to be a bug.

Oh blast, and the absolute loads are back anyway!

My hands might not stay attached if i keep throwing them up in the air at this point.

For the hexadecimal challenged (i.e. me) each fragment is 42, 200, and 112 bytes long respectively. And each uses 3, 9, or 9 registers.

Saturday evening elf-loader hackery

Wasn't much on TV so I kept poking fairly late last night. I had a look at a Java binding to the code.

It's kind of looking a bit like OpenCL but without any of the queuing stuff.

I tried creating an empty demo to try out the api and I'm going to need a bit more runtime support to make it practical. So at present this is how the api might work for a mandelbrot painter.

First the communication structures that live in the epiphany code.

#define WIDTH 512
#define HEIGHT 512

struct shared {
 float left, top, right, bottom;
 jbyte status[16]; 
};

// Shared comm block
struct shared shared EZ_SHARED;
// RGBA pixels
byte pixels[WIDTH * HEIGHT * 4] EZ_SHARED;

And then an example main.

        EZPlatform plat = EZPlatform.init("system.hdf", EZ_SHARED_POINTERS);
        EZWorkgroup wg = plat.createWorkgroup(0, 0, 4, 4);

        EZProgram eprog = EZProgram.load("emandelbrot.elf");

        // Halt the cores
        wg.reset();

        // Bind program to all cores
        wg.bind(eprog, 0, 0, 4, 4);

        // Link/load the code
        wg.load();

        // Access comms structures
        ByteBuffer shared = wg.mapSymbol("_shared");
        ByteBuffer pixels = wg.mapSymbol("_pixels");

        // Job parameters
        shared.putFloat(0).putFloat(0).putFloat(1).putFloat(1).put(new byte[16]).rewind();

        // Start calculation
        wg.start();

        // Wait for all jobs to finish
        for (int i = 0; i < 16; i++) {
            while (shared.get(i + 16) == 0)
                try {
                    Thread.sleep(1);
                } catch (InterruptedException ex) {
                    Logger.getLogger(Test.class.getName()).log(Level.SEVERE, null, ex);
                }
        }
        
        // Use pixels.
        // ...

It's all pretty straightforward and decent until the job completion stuff. I probably want some way of abstracting that to something re-usable. Perhaps one day there will be some hardware support as well negating the need to poll the result. But just being able to look up structures by name is a big plus over the way you have to do it with the existing tools.

This (non-existent) example is just a one-shot execution but it already supports a persistent server mode. Perhaps it would also be useful to be able to support multi-kernel one-shot operation, e.g. choose the kernel and then a SYNC will launch a different main. If I do that then supporting kernel arguments would become useful although it's only worth it if the latency is ok versus the code size of a dispatch loop approach.

At the moment the .load() function is probably the interesting one. Internally this first relocates and links all the code to an arm-local buffer. Then it just memcpy's this to each core they are bound to. This state is remembered so it is possible to switch the functionality of a whole workgroup with a relatively cheap call. I don't think there's enough memory to do anything sophisticated like double-buffer the code though and given the alu to bandwidth mismatch as it is it probably wouldn't be much help anyway.

I do already have an 'EPort' primitive I included in the Java api. It's basically a non-locking cyclic counter which can be used to implement single writer / single reader queues very efficiently on the epiphany just using local memory reads and remote memory writes (i.e. non-blocking if not full and no mesh impact if it is). It's a bit limited though as for example you can only reserve or consume one slot at a time. Still useful mind you and it works with host-core as well as core-core.

I need to brush up again on some of the hardware workgroup support to see what other efficient primitives can be implemented (weird, the 4.13.x revision of the arch reference has vanished from the parallella site). Should be able to get a barrier at least although it's a bit more work having it work off-chip. Personally I think a mutex has no place in massive parallel hardware, although without a hardware atomic counter or mailboxes options are limited.

But maybe another day. I thought i'd had enough beer on Thursday (pretty much the last day of summer, 32 degrees and a warm balmy evening - absolutely awesome) but after finding out what the new contract is focussed on I'm ready for a Sunday Session even if it's just in my own back yard.

Saturday arvo elf-loader hack-a-thon

So I hacked a bit more of the elf loader today. Initially it was just documenting where I got to so I could work out where to go next. I wrote up a bit more background and detail on how it works. Then I cleared out the out of date experimental stuff and focussed on the ez_ interfaces.

But then I got a bit side-tracked ...

Compact startup

First I thought i'd see if i could replace crt0 with my own. Apart from an initialisation issue with bss and data there isn't really anything wrong with the bundled one ... apart from it dragging in a bunch of C support stuff which isn't necessary if writing small kernel code that doesn't need the full libc.

So I took the crt0.s and stripped out a bunch of stuff. The trampoline to a potentially off-core start routine (it's going to be on-core). The atexit init (not necessary). The constructors init (hopefully not necessary). The .bss clearing (it's problematic when you have possibly more than one block of bss such as shared and given that .data isn't re-initialised anyway the C language behaviour is already broken). And the argument setting (three zero's isn't useful for anything and isn't even correct).

I toyed with the idea of passing arguments to kernel but decided to just have a void main instead.

I just pass -nostartfiles e-crt0minimal.o to the link line to replace the standard start-up code.

 e-gcc -Wl,-r -o e-test-reloc.elf -nostartfiles e-crt0minimal.o e-test-reloc.o -le-lib

Worth it? Probably ...

Minimal crt0:

$ e-size e-test-workgroup-a.elf
   text    data     bss     dec     hex filename
    548      64     120     732     2dc e-test-workgroup-a.elf

Standard crt0:

$ e-size e-test-workgroup-a.elf
   text    data     bss     dec     hex filename
   1418    1196     128    2742     ab6 e-test-workgroup-a.elf

When you've only got 32K to play with saving 2K isn't to be sniffed at.

Matching addresses (aka fake hsa)

Next I looked at just using mmap to map the shared memory so pointers can be shared between the epiphany and the host arm directly. I tried to use e_get_platform_info() to get the list of memory blocks but for some odd reason that zeros out the memory array pointer? Odd. So ... I just access the struct directly via an extern instead.

This is just an implementation of stuff from a previous post but using the platform_info to find the addresses.

I have no idea whether this will work on a multi-epiphany setup but since I don't have one it's not something i'll lose sleep over :)

COM symbols

About this point I noticed that some symbols weren't being allocated any location in the output file and thus could not be resolved during the loader-linker execution. These were symbols marked with the section id of COMMON. I had hit this before but I had forgotten all about it. Last time I solved it using a linker script but I found I can just pass '-d' to the linker to achieve the same result which puts any such values into bss.

Automatic remote-core on-chip symbol resolution

Then I had a look into implementing fully automatic resolution of remote but on-chip symbols. The options are limited but the desired target core can be indicated by the symbol name.

For example:

program a:
   extern int buffer[12];

program b:
   // current cell relative +0, +1
   extern int buffer_0_1[12];

   // group relative 0,1
   extern int buffer$0$1[12];

Or some variation thereof. This is easy enough to parse and implement, and not too ugly to use.

But it would mean that the binary for every core would need to be linked individually and thus it wouldn't be possible to just copy the same code across to multiple cores when they share the implementation. For this reason I've dropped the idea for now. Having to use e_get_global_address() on a weak symbol isn't too difficult.

'easy' elf loader for parallella

I prettied up some of the stuff I did a few months ago on the parallella code loader and uploaded it to the home page.

It is still very much work in progress (just a bunch of experiments) but it is currently able to take several distinct epiphany-core programs and relocate and cross-link them to any on-core topology - at runtime. Remote addresses in other cores can be partially resolved automatically (to the local core address offset - suitable for e_get_global_address()) by using weak symbols and the host can resolve symbols by name. By default sections go on-core but .section directives can redirect individual records to specific banks or to global memory. bss/text/data are all supported for any such section using standard names (no 'code' sections!).

Linker scripts are not needed for any of this and the only 'special sauce' is that the epiphany binaries be linked with -r. I mention this because this was the primary driving factor for me to write any of this. I would probably like to replace crt0 as well but that is something for the future (basically remove the bss init stuff).

When I next poke at it I want to work toward wrapping it in an accessible Java API. There is a bit to be done before that though (I think - it's been a while since I looked at it).

thoughts on opencl + array methods

I was going to have a quick look at removing the erroneous asynchronous Get/SetPrimitiveArrayCritical() stuff from zcl this morning but I've hit a complication too far for my tired brain.

I changed to just allocating a staging buffer and using Get/Set*ArrayRegion() for the read/writeBuffer commands. I only allocate enough memory for the transfer and copy the transfer size around and so on. It's a bit bulky but it's fairly straightforward.

Then I started looking at the image interfaces and realised doing the same thing is somewhat more complex - either I have to copy the whole array to/from the staging buffer to/from each time (if the get/set updates on a portion of the image for example) or I have to flatten the transfer myself. The former pretty much makes the function pointless and the latter bulks out the binding and may require lots of jvm calls.

So now i'm deciding whether I just force synchronous transfers for all array interfaces because they probably have some use despite synchronisation being the mind-killer, or just deleting them altogether to drop a ton of code. Since i've already got all the code the former will probably be the approach I take. The event callback stuff i'm using to finish up the transaction seems pretty expensive anyway so there may not be much net difference (against a net use count of zero for the library, at that - it's just something to pass the time).

On another note I decided to use cvs as my local repository backend to store this stuff. I think the all-day-never-finished checkout of gcc finally tipped me over but I never liked subversion because it's too slow and is just shit at merging. I was surprised netbeans detected it and offered to install the cvs plugin automatically (and i was a little surprised it was already installed too). I don't need or want to use tools that weren't designed for my use-case.

Ho hum, back to work Tuesday. My boss actually apologised for taking so long to get the contract sorted but yeah, i'm not complaining! Looks like it'll mostly be a continuation of one of the projects i'm not terribly keen on too. Bummer I guess. All I really care about right now is sleep though.

Update: (I kept poking) I just removed the async handling code and force a blocking call. Get/SetPrimitiveArrayCritical() is used to access the arrays directly. I'll do a release another day though.

sumatra, graal, etc.

I didn't really wanna get stuck building stuff all day but that seems to have happened.

Sumatra uses auto* so it built easily no problem. Bummer there's no javafx but hopefully that isn't far off.

graal is a bit of a pain because it uses a completely custom build/update/everything mega-tool written in a single 4KLOC piece of python. Ok when it works but a meaningless backtrace if it doesn't. Well it is early alpha software I guess. Still ... why?

Anyway ... I tried building against 'make install' from sumatra but that doesn't work, you need to point your JAVA_HOME at the Sumatra tree as the docs tell you to. My mistake there.

So it turns out getting the hsail tools to build had some point after-all. The hsailasm downloaded by the "build" tool (in lib/okra-1.8-with-sim.jar) wont work against the libelf 0.x included in slackware (might be easier just building libelf 1.x). So I added the path to the hsailasm I built myself ... and ...

 export PATH=/home/notzed/hsa/HSAIL-Instruction-Set-Simulator/build/HSAIL-Tools:$PATH
 ./mx.sh  --vm server unittest -XX:+TraceGPUInteraction \
   -XX:+GPUOffload -G:Log=CodeGen hsail.test.IntAddTest
[HSAIL] library is libokra_x86_64.so
[HSAIL] using _OKRA_SIM_LIB_PATH_=/tmp/okraresource.dir_7081062365578722856/libokra_x86_64.so
[GPU] registered initialization of Okra (total initialized: 1)
JUnit version 4.8
.[thread:1] scope: 
  [thread:1] scope: GraalCompiler
    [thread:1] scope: GraalCompiler.CodeGen
    Nothing to do here
    Nothing to do here
    Nothing to do here
    version 0:95: $full : $large;
// static method HotSpotMethod
kernel &run (
        align 8 kernarg_u64 %_arg0,
        align 8 kernarg_u64 %_arg1,
        align 8 kernarg_u64 %_arg2
        ) {
        ld_kernarg_u64  $d0, [%_arg0];
        ld_kernarg_u64  $d1, [%_arg1];
        ld_kernarg_u64  $d2, [%_arg2];
        workitemabsid_u32 $s0, 0;
                                           
@L0:
        cmp_eq_b1_u64 $c0, $d0, 0; // null test 
        cbr $c0, @L1;
@L2:
        ld_global_s32 $s1, [$d0 + 12];
        cmp_ge_b1_u32 $c0, $s0, $s1;
        cbr $c0, @L12;
@L3:
        cmp_eq_b1_u64 $c0, $d2, 0; // null test 
        cbr $c0, @L4;
@L5:
        ld_global_s32 $s1, [$d2 + 12];
        cmp_ge_b1_u32 $c0, $s0, $s1;
        cbr $c0, @L11;
@L6:
        cmp_eq_b1_u64 $c0, $d1, 0; // null test 
        cbr $c0, @L7;
@L8:
        ld_global_s32 $s1, [$d1 + 12];
        cmp_ge_b1_u32 $c0, $s0, $s1;
        cbr $c0, @L10;
@L9:
        cvt_s64_s32 $d3, $s0;
        mul_s64 $d3, $d3, 4;
        add_u64 $d1, $d1, $d3;
        ld_global_s32 $s1, [$d1 + 16];
        cvt_s64_s32 $d1, $s0;
        mul_s64 $d1, $d1, 4;
        add_u64 $d2, $d2, $d1;
        ld_global_s32 $s2, [$d2 + 16];
        add_s32 $s2, $s2, $s1;
        cvt_s64_s32 $d1, $s0;
        mul_s64 $d1, $d1, 4;
        add_u64 $d0, $d0, $d1;
        st_global_s32 $s2, [$d0 + 16];
        ret;
@L1:
        mov_b32 $s0, -7691;
@L13:
        ret;
@L4:
        mov_b32 $s0, -6411;
        brn @L13;
@L10:
        mov_b32 $s0, -5403;
        brn @L13;
@L7:
        mov_b32 $s0, -4875;
        brn @L13;
@L12:
        mov_b32 $s0, -8219;
        brn @L13;
@L11:
        mov_b32 $s0, -6939;
        brn @L13;
};

[HSAIL] heap=0x00007f47a8017a40
[HSAIL] base=0x95400000, capacity=108527616
External method:com.oracle.graal.compiler.hsail.test.IntAddTest.run([I[I[II)V
installCode0: ExternalCompilationResult
[HSAIL] sig:([I[I[II)V  args length=3, _parameter_count=4
[HSAIL] static method
[HSAIL] HSAILKernelArguments::do_array, _index=0, 0xdd563828, is a [I
[HSAIL] HSAILKernelArguments::do_array, _index=1, 0xdd581718, is a [I
[HSAIL] HSAILKernelArguments::do_array, _index=2, 0xdd581778, is a [I
[HSAIL] HSAILKernelArguments::not pushing trailing int

Time: 0.213

OK (1 test)

Yay? I think?

Maybe not ... it seems that it's only using the simulator. I tried using LD_LIBRARY_PATH and -Djava.library.path to redirect to the libokra from the Okra-Interface-to-HSA-Device library but that just hangs after the "base=0x95..." line after dumping the hsail. strace isn't showing anything obvious so i'm not sure what's going on. Might've hit some ubuntu compatibility issue at last or just a mismatch in versions of libokra.

On the other hand ... it was noticeable that something was happening with the gpu as the mouse started to judder, yet a simple ctrl-c killed it cleanly. Just that alone once it makes it into OpenCL will be worth it's weight in cocky shit rather than just hard locking X as is does with catalyst.

Having just typed that ... one test too many and it decides to crash into an unkillable process and do weird stuff (and not long after I had to reboot the system). But at least that is to be expected for alpha software and i've been pretty surprised by the overall system stability all things considered.

I think next time I'll just have a closer look at aparapi because at least I have that working with the APU and i'm a bit sick of compiling other peoples code and their strange build systems. Sumatra and graal are very large and complex projects and a bit more involved than I'm really interested in right now. I haven't used aparapi before anyway so I should have a look.

If the slackware vs ubuntu thing becomes too much of a hassle I might just go and buy another hdd and dual-boot; I already have to multi-boot to switch between opencl+accelerated javafx vs apu.

Update: Actually there may be something more to it. I just tried creating my own aparapi thing and it crashed in the same way so maybe i was missing some env variable or it was due to a suspend/resume cycle.

So I just had another go at getting the graal test running on hsail and I think it worked:

 export JAVA_HOME=/home/notzed/hsa/sumatra-dev/build/linux-x86_64-normal-server-release/images/j2sdk-image
 export PATH=${JAVA_HOME}/bin:${PATH}
 export LD_LIBRARY_PATH=/home/notzed/hsa/Okra-Interface-to-HSA-Device/okra/dist/bin
 ./mx.sh  --vm server unittest  -XX:+TraceGPUInteraction -XX:+GPUOffload -G:Log=CodeGen hsail.test.IntAddTest

[HSAIL] library is libokra_x86_64.so
[GPU] registered initialization of Okra (total initialized: 1)
JUnit version 4.8

...

Time: 0.198

OK (1 test)

Still, i'm not sure what to do with it yet ...

I was going to have a play today but I just got the word on work starting again (I can probably push it out to Monday) so I might just go to the pub or just for a ride - way too nice to be inside getting monitor burn. I foolishly decided to walk into the city yesterday for lunch with a mate I haven't seen for years (and did a few pubs on the way home - i was in no rush!) but just ended up with a nice big blister and sore feet for my troubles. It's about a 45 minute walk into the city but I don't do much walking.

I want see if anything interesting comes out of the Sony's GDC talk first though (the vr one? - yes the VR one, it's just started).

And ... done. Interesting, but still early days. Even if they release a model for the public at a mass-market price it's still going to have to be a long term project. Likely the first 5-10 years will just be experimentation and getting the technology the point where it is good and cheap enough.

Update: Yep, i'm pretty sure it's just a problem with suspend/resume. I just tried running it after a resume and it panicked the kernel.

Building the hsail tools.

I had a lot more trouble than this post suggests getting this stuff to compile which is probably why i sound pissed off (hint: i am, actually pro tip: i often am), but this summarises the results.

I started with the instructions in README.hsa from the hsa branch of gcc. See the patches below though.

I installed libelf from slackware 14.1 but had to build libdwarf manually. I got libdwarf-20140208 from the libdwarf home page. I actually created a SlackBuild for it but i'm uncertain about providing it to slackbuilds.org right now (mostly because it seems a bit too niche).

Then this should probably work:

  mkdir hsa
  cd hsa
  git clone --depth 1 https://github.com/HSAFoundation/HSAIL-Instruction-Set-Simulator.git
  git clone --depth 1 https://github.com/HSAFoundation/HSAIL-Tools
  cd HSAIL-Instruction-Set-Simulator/src/
  ln -s ../../HSAIL-Tools
  cd ..
  mkdir build
  cd build
  cmake -DCMAKE_BUILD_TYPE=Debug ..
  make -j3 _DBG=1 VERBOSE=1

Actually I originally built it from the Okra-Interface-to-HSAIL-Simulator thing, but that checks out both of these tools as part of it's android build process together with the llvm compiler this does. Ugh. And then proceeds to compile whilst hiding the details of what it's actually doing and simultaneously ignoring-all-fatal-errors-along-the-way through the wonders of an ant script. Actually I could add more but I might offend for no purpose - afterall I did get it compiled in the end. I'm just a bit more impatient than I once was :)

Patches

It doesn't look for libelf.h anywhere other than /usr/include so I had to add this to HSAIL-Instruction-Set-Simulator. Of course if you have lib* in /usr/local you have to change it to that instead. Sigh. I thought this sort of auto-discovery was a solved problem.

diff --git a/CMakeLists.txt b/CMakeLists.txt
index 083e9c6..c86192c 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -25,8 +25,9 @@ set(CMAKE_CXX_FLAGS "-DTEST_PATH=${PROJECT_SOURCE_DIR}/test ${CMAKE_CXX_FLAGS}")
 set(CMAKE_CXX_FLAGS "-DBIN_PATH=${PROJECT_BINARY_DIR}/HSAIL-Tools ${CMAKE_CXX_FLAGS}")
 set(CMAKE_CXX_FLAGS "-DOBJ_PATH=${PROJECT_BINARY_DIR}/test ${CMAKE_CXX_FLAGS}")
 set(CMAKE_CXX_FLAGS "-I/usr/include/libdwarf ${CMAKE_CXX_FLAGS}")
+set(CMAKE_CXX_FLAGS "-I/usr/include/libelf ${CMAKE_CXX_FLAGS}")
 
-set(CMAKE_CXX_FLAGS_RELEASE "-O3 -UNDEBUG")
+set(CMAKE_CXX_FLAGS_RELEASE "-O2 -UNDEBUG")
 
 # add a target to generate API documentation with Doxygen
 find_package(Doxygen)

And because I used a newer version of libdwarf than ubuntu uses it has a world-shattering fatal error causing difference in one of the apis.

Apply this to HSAIL-Tools.

diff --git a/libHSAIL/libBRIGdwarf/BrigDwarfGenerator.cpp b/libHSAIL/libBRIGdwarf/BrigDwarfGenerator.cpp
index 7e2d28d..fe2bc51 100644
--- a/libHSAIL/libBRIGdwarf/BrigDwarfGenerator.cpp
+++ b/libHSAIL/libBRIGdwarf/BrigDwarfGenerator.cpp
@@ -261,7 +261,7 @@ public:
     //
     bool storeInBrig( HSAIL_ASM::BrigContainer & c ) const;
 
-    int DwarfProducerCallback2( char * name,
+    int DwarfProducerCallback2( const char * name,
                                 int    size,
                                 Dwarf_Unsigned type,
                                 Dwarf_Unsigned flags,
@@ -459,7 +459,7 @@ bool BrigDwarfGenerator_impl::generate( HSAIL_ASM::BrigContainer & c )
 // *sec_name_index is an OUT parameter, a pointer to the shared string table
 // (.shrstrtab) offset for thte name of the section
 //
-static int DwarfProducerCallbackFunc( char * name,
+static int DwarfProducerCallbackFunc( const char * name,
                                       int    size,
                                       Dwarf_Unsigned type,
                                       Dwarf_Unsigned flags,
@@ -476,7 +476,7 @@ static int DwarfProducerCallbackFunc( char * name,
 }
 
 int
-BrigDwarfGenerator_impl::DwarfProducerCallback2( char * name,
+BrigDwarfGenerator_impl::DwarfProducerCallback2( const char * name,
                                                  int    size,
                                                  Dwarf_Unsigned type,
                                                  Dwarf_Unsigned flags,

I had to use the 'git registry' to disable coloured diffs so I could just see what I was even doing here.

  git config --global color.ui false

Whomever thought that sort of bloat was a good idea in such a tool ... well I would only offend if i called him or her a fuckwit wouldn't i?

About Me

Tags

epiphany stack frame

compiler strangeness

Alternate runtime

Saturday evening elf-loader hackery

Saturday arvo elf-loader hack-a-thon

Compact startup

Matching addresses (aka fake hsa)

COM symbols

Automatic remote-core on-chip symbol resolution

'easy' elf loader for parallella

thoughts on opencl + array methods

sumatra, graal, etc.

Building the hsail tools.

Patches

About Me

Tags

epiphany stack frame

compiler strangeness

Alternate runtime

Saturday evening elf-loader hackery

Saturday arvo elf-loader hack-a-thon

Compact startup

Matching addresses (aka fake hsa)

*COM* symbols

Automatic remote-core on-chip symbol resolution

'easy' elf loader for parallella

thoughts on opencl + array methods

sumatra, graal, etc.

Building the hsail tools.

Patches

COM symbols