hotspot code generation, optimisation and deoptimisation
As promised, here is an article about hotspot code generation, using the disassembler plugin mentioned in the last post. I was nearly not going to do it, but I'd already done some playing with it.
Unfortunately I had to use AMD64 instructions here; I think the ISA is pretty shithouse and I've never bothered to learn it very well, so I'm doing some guessing below. I even downloaded the APMs from AMD (I find the Intel docs quite poor) to look some stuff up.
For the C code I'm using gcc 4.8.2 with -mtune=native -std=gnu99 and -Ox as indicated in the text.
The actual test calculates 1000x dot products of 2^20 elements each. For Java I'm using System.nanoTime() and printing the best result across all runs. For C I couldn't be bothered with the gettimeofday() stuff, so I'm just using the time command - over 1000 iterations the difference should be negligible, and there are some interesting results regardless.
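The Java timing harness is roughly along these lines - a sketch of the methodology just described rather than the exact code I used, with the class name made up and dot() being the function shown in the next section:

public class DotBench {
    public static void main(String[] args) {
        float[] a = new float[1 << 20], b = new float[1 << 20];
        java.util.Arrays.fill(a, 1.5f);
        java.util.Arrays.fill(b, 0.5f);

        long best = Long.MAX_VALUE;
        float sink = 0;
        for (int run = 0; run < 10; run++) {
            long start = System.nanoTime();
            for (int i = 0; i < 1000; i++)      // 1000 loops per run
                sink += dot(a, b, a.length);
            best = Math.min(best, System.nanoTime() - start);
        }
        // best time for 1000 loops, reported as ms per loop
        // (sink keeps the result live so the work isn't optimised away)
        System.out.printf("best: %.3f ms/loop (%g)%n", best / 1000 / 1e6, sink);
    }

    static float dot(float[] a, float[] b, int len) {
        float v = 0;
        for (int i = 0; i < len; i++)
            v += a[i] * b[i];
        return v;
    }
}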
Simple loop
This is the starting function; it's obvious what it does.
public float dot(float[] a, float[] b, int len) {
    float v = 0;
    for (int i = 0; i < len; i++)
        v += a[i] * b[i];
    return v;
}
A C version is identical apart from using pointers rather than arrays and some extra fluffy conventions.
float dot(const float *a, const float *b, int len) {
    float v = 0;
    for (int i = 0; i < len; i++)
        v += a[i] * b[i];
    return v;
}
First pass
After some iterations hotspot recognises that this function could benefit from optimisation, and this is what jdk8 spits out at the first compilation pass.
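(For anyone playing along at home: this output comes from hotspot's PrintAssembly support, which needs the hsdis disassembler plugin on the java library path and flags along the lines of -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly, or -XX:CompileCommand=print,... to limit it to a single method.)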
This is AT&T (gas) syntax, so instruction operands are srca[,srcb],dst rather than the more conventional dst,srca[,srcb].
.1: movslq %esi,%rdi
jae .exception0
vmovss 0x10(%rdx,%rdi,4),%xmm1
movslq %esi,%rdi
jae .exception1
vmovss 0x10(%rcx,%rdi,4),%xmm2
vmulss %xmm2,%xmm1,%xmm1
vaddss %xmm0,%xmm1,%xmm1
inc %esi
mov $0x7ffdffc00ce8,%rdi
mov 0xe0(%rdi),%ebx
add $0x8,%ebx
mov %ebx,0xe0(%rdi)
mov $0x7ffdffc00488,%rdi
and $0xfff8,%ebx
cmp $0x0,%ebx
je .2
.3: test %eax,0x15e4076a(%rip)
mov $0x7ffdffc00ce8,%rdi
incl 0x128(%rdi)
vmovaps %xmm1,%xmm0
cmp %r8d,%esi
mov $0x7ffdffc00ce8,%rdi
mov $0x108,%rbx
jge .4
mov $0x118,%rbx
.4: mov (%rdi,%rbx,1),%rax
lea 0x1(%rax),%rax
mov %rax,(%rdi,%rbx,1)
jl .1
;; clean up and exit
.2: mov %rdi,0x8(%rsp)
movq $0x1d,(%rsp)
callq some_function
jmpq .3
Of these, 11 instructions are for the loop itself; the rest seem to be for profiling the loop.
As far as it goes it looks fairly decent - pretty much gcc -O2 level of optimisation with array bounds checking performed at each array read.
Of course the profiling adds a lot of overhead here.
The following is the output for the inner loop of gcc -O2.
10: f3 0f 10 0c 87 movss (%rdi,%rax,4),%xmm1
15: f3 0f 59 0c 86 mulss (%rsi,%rax,4),%xmm1
1a: 48 ff c0 inc %rax
1d: 39 c2 cmp %eax,%edx
1f: f3 0f 58 c1 addss %xmm1,%xmm0
23: 7f eb jg 10
The only real difference, apart from having no bounds checking, is that gcc multiplies directly from memory rather than loading into a register first. The latter is how every other mainstream cpu does it, so that may have some bearing on why hotspot generates it that way.
I can't easily do comparison timing of the loops (and it isn't very meaningful), but obviously the Java will be slower here, and probably on par with -O0 output from gcc.
Final pass
After it has gained some profiling information the result will be optimised - in this case it recompiles it twice more. The inner loop of the final pass is below:
.1: vmovss 0x10(%rbx,%r14,4),%xmm0
vmulss 0x10(%rcx,%r14,4),%xmm0,%xmm1
vaddss %xmm3,%xmm1,%xmm0
movslq %r14d,%r10
vmovss 0x2c(%rbx,%r10,4),%xmm2
vmulss 0x2c(%rcx,%r10,4),%xmm2,%xmm8
vmovss 0x14(%rbx,%r10,4),%xmm1
vmulss 0x14(%rcx,%r10,4),%xmm1,%xmm2
vmovss 0x18(%rcx,%r10,4),%xmm1
vmulss 0x18(%rbx,%r10,4),%xmm1,%xmm3
vmovss 0x28(%rbx,%r10,4),%xmm1
vmulss 0x28(%rcx,%r10,4),%xmm1,%xmm4
vmovss 0x1c(%rcx,%r10,4),%xmm1
vmulss 0x1c(%rbx,%r10,4),%xmm1,%xmm5
vmovss 0x20(%rbx,%r10,4),%xmm1
vmulss 0x20(%rcx,%r10,4),%xmm1,%xmm6
vmovss 0x24(%rbx,%r10,4),%xmm1
vmulss 0x24(%rcx,%r10,4),%xmm1,%xmm7
vaddss %xmm2,%xmm0,%xmm0
vaddss %xmm0,%xmm3,%xmm1
vaddss %xmm1,%xmm5,%xmm1
vaddss %xmm1,%xmm6,%xmm0
vaddss %xmm0,%xmm7,%xmm1
vaddss %xmm1,%xmm4,%xmm0
vaddss %xmm0,%xmm8,%xmm3
add $0x8,%r14d
cmp %r8d,%r14d
jl .1
cmp %ebp,%r14d
jge .done
xchg %ax,%ax
.2: cmp %edi,%r14d
jae .stuff0
vmovss 0x10(%rcx,%r14,4),%xmm1
cmp %r9d,%r14d
jae .stuff1
vmulss 0x10(%rbx,%r14,4),%xmm1,%xmm1
vaddss %xmm1,%xmm3,%xmm3
inc %r14d
cmp %ebp,%r14d
jl .2
.done:
So this has removed all the array bounds checking from inside the loop (it's done elsewhere - too bulky/not important here). It's also unrolled the loop 8x and is using the modern 3-operand instructions to stagger most of the operations, which would give better throughput on a typical RISC-style pipeline (I have no knowledge of the AMD scheduling rules). And finally it tacks on a simple 1-element loop to finish off anything left over.
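In source terms the shape of what it generated is roughly the following - a sketch only, assuming a main-loop limit computed as shown; the real code keeps everything in registers, hoists the bounds checks ahead of the loop, and staggers the adds across registers rather than doing them strictly in this order:

float dotShape(float[] a, float[] b, int len) {
    float v = 0;
    int i = 0;
    int limit = len & ~7;       // hypothetical main-loop limit, 8 elements at a time
    for (; i < limit; i += 8) {
        v += a[i]   * b[i];
        v += a[i+1] * b[i+1];
        v += a[i+2] * b[i+2];
        v += a[i+3] * b[i+3];
        v += a[i+4] * b[i+4];
        v += a[i+5] * b[i+5];
        v += a[i+6] * b[i+6];
        v += a[i+7] * b[i+7];
    }
    for (; i < len; i++)        // 1-element cleanup loop
        v += a[i] * b[i];
    return v;
}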
Comparing this to the output of gcc -O3 ...
30: f3 0f 10 09 movss (%rcx),%xmm1
34: 41 83 c0 10 add $0x10,%r8d
38: 0f 18 49 50 prefetcht0 0x50(%rcx)
3c: 0f 18 48 50 prefetcht0 0x50(%rax)
40: 48 83 c1 40 add $0x40,%rcx
44: 48 83 c0 40 add $0x40,%rax
48: f3 0f 59 48 c0 mulss -0x40(%rax),%xmm1
4d: f3 0f 58 c1 addss %xmm1,%xmm0
51: f3 0f 10 49 c4 movss -0x3c(%rcx),%xmm1
56: f3 0f 59 48 c4 mulss -0x3c(%rax),%xmm1
5b: f3 0f 58 c1 addss %xmm1,%xmm0
5f: f3 0f 10 49 c8 movss -0x38(%rcx),%xmm1
64: f3 0f 59 48 c8 mulss -0x38(%rax),%xmm1
69: f3 0f 58 c1 addss %xmm1,%xmm0
6d: f3 0f 10 49 cc movss -0x34(%rcx),%xmm1
72: f3 0f 59 48 cc mulss -0x34(%rax),%xmm1
77: f3 0f 58 c1 addss %xmm1,%xmm0
7b: f3 0f 10 49 d0 movss -0x30(%rcx),%xmm1
80: f3 0f 59 48 d0 mulss -0x30(%rax),%xmm1
85: f3 0f 58 c1 addss %xmm1,%xmm0
89: f3 0f 10 49 d4 movss -0x2c(%rcx),%xmm1
8e: f3 0f 59 48 d4 mulss -0x2c(%rax),%xmm1
93: f3 0f 58 c1 addss %xmm1,%xmm0
97: f3 0f 10 49 d8 movss -0x28(%rcx),%xmm1
9c: f3 0f 59 48 d8 mulss -0x28(%rax),%xmm1
a1: f3 0f 58 c1 addss %xmm1,%xmm0
a5: f3 0f 10 49 dc movss -0x24(%rcx),%xmm1
aa: f3 0f 59 48 dc mulss -0x24(%rax),%xmm1
af: f3 0f 58 c1 addss %xmm1,%xmm0
b3: f3 0f 10 49 e0 movss -0x20(%rcx),%xmm1
b8: f3 0f 59 48 e0 mulss -0x20(%rax),%xmm1
bd: f3 0f 58 c1 addss %xmm1,%xmm0
c1: f3 0f 10 49 e4 movss -0x1c(%rcx),%xmm1
c6: f3 0f 59 48 e4 mulss -0x1c(%rax),%xmm1
cb: f3 0f 58 c1 addss %xmm1,%xmm0
cf: f3 0f 10 49 e8 movss -0x18(%rcx),%xmm1
d4: f3 0f 59 48 e8 mulss -0x18(%rax),%xmm1
d9: f3 0f 58 c1 addss %xmm1,%xmm0
dd: f3 0f 10 49 ec movss -0x14(%rcx),%xmm1
e2: f3 0f 59 48 ec mulss -0x14(%rax),%xmm1
e7: f3 0f 58 c1 addss %xmm1,%xmm0
eb: f3 0f 10 49 f0 movss -0x10(%rcx),%xmm1
f0: f3 0f 59 48 f0 mulss -0x10(%rax),%xmm1
f5: f3 0f 58 c1 addss %xmm1,%xmm0
f9: f3 0f 10 49 f4 movss -0xc(%rcx),%xmm1
fe: f3 0f 59 48 f4 mulss -0xc(%rax),%xmm1
103: f3 0f 58 c1 addss %xmm1,%xmm0
107: f3 0f 10 49 f8 movss -0x8(%rcx),%xmm1
10c: f3 0f 59 48 f8 mulss -0x8(%rax),%xmm1
111: f3 0f 58 c1 addss %xmm1,%xmm0
115: f3 0f 10 49 fc movss -0x4(%rcx),%xmm1
11a: f3 0f 59 48 fc mulss -0x4(%rax),%xmm1
11f: 45 39 c8 cmp %r9d,%r8d
122: f3 0f 58 c1 addss %xmm1,%xmm0
126: 0f 85 04 ff ff ff jne 30
12c: 49 63 c0 movslq %r8d,%rax
12f: 48 c1 e0 02 shl $0x2,%rax
133: 48 01 c7 add %rax,%rdi
136: 48 01 c6 add %rax,%rsi
139: 31 c0 xor %eax,%eax
13b: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
140: f3 0f 10 0c 87 movss (%rdi,%rax,4),%xmm1
145: f3 0f 59 0c 86 mulss (%rsi,%rax,4),%xmm1
14a: 48 ff c0 inc %rax
14d: 41 8d 0c 00 lea (%r8,%rax,1),%ecx
151: 39 ca cmp %ecx,%edx
153: f3 0f 58 c1 addss %xmm1,%xmm0
157: 7f e7 jg 140
The main differences here are that gcc unrolls the loop 16x, uses only the 2-operand instructions (and fewer registers), and has transformed the array indexing into pointer arithmetic incremented in batches.
Well, this definitely isn't code for a RISC cpu, as the scheduling looks pants: everything keeps writing to the same registers. But x86 being so dominant has allowed a lot of money to be spent optimising the chips to run shitty code faster to make up for the compiler.
Benchmarks
Here are some timing results. All values are in ms for the equivalent of one loop (or equivalently, seconds for 1000 loops).
what        time
gcc -O0     4.86
gcc -O2     1.44
gcc -O3     1.44
java        1.60
time java   1.7
The last row is using the 'time' command on the whole Java run, i.e. it includes the jvm startup, profiling, and compilation. That isn't too shabby.
Either way these times are pretty good for the effort involved - maybe one or the other is more tuned to the cpu I have versus Intel hardware, but it's really neither here nor there.
Unrolled loop
Actually, what prompted the idea for this article was some results I had from unrolling loops 4x in Java. I subsequently found that unrolling 2x did just as good a job in this case, so I'll do that here for simplicity. The assembly is almost identical anyway, as it just gets unrolled an additional 2x rather than 4x by the compiler.
public float dot(float[] a, float[] b, int len) {
    float v0 = 0, v1 = 0;
    int i = 0;

    for (int e = len & ~1; i < e; i += 2) {
        v0 += a[i] * b[i];
        v1 += a[i+1] * b[i+1];
    }
    for (; i < len; i++)
        v0 += a[i] * b[i];
    return v0 + v1;
}
Final pass
And here's just the inner loop of the final pass:
.1: vmovss 0x10(%rcx,%r8,4),%xmm0
vmulss 0x10(%rdx,%r8,4),%xmm0,%xmm1
vaddss %xmm3,%xmm1,%xmm0
movslq %r8d,%r11
vmovss 0x2c(%rcx,%r11,4),%xmm2
vmulss 0x2c(%rdx,%r11,4),%xmm2,%xmm9
vmovss 0x24(%rcx,%r11,4),%xmm1
vmulss 0x24(%rdx,%r11,4),%xmm1,%xmm8
vmovss 0x1c(%rcx,%r11,4),%xmm2
vmulss 0x1c(%rdx,%r11,4),%xmm2,%xmm1
vmovss 0x18(%rcx,%r11,4),%xmm3
vmulss 0x18(%rdx,%r11,4),%xmm3,%xmm2
vmovss 0x14(%rcx,%r11,4),%xmm4
vmulss 0x14(%rdx,%r11,4),%xmm4,%xmm3
vmovss 0x20(%rcx,%r11,4),%xmm5
vmulss 0x20(%rdx,%r11,4),%xmm5,%xmm4
vmovss 0x28(%rcx,%r11,4),%xmm6
vmulss 0x28(%rdx,%r11,4),%xmm6,%xmm5
vaddss %xmm3,%xmm0,%xmm3
vaddss %xmm2,%xmm3,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vaddss %xmm1,%xmm4,%xmm0
vaddss %xmm8,%xmm0,%xmm1
vaddss %xmm1,%xmm5,%xmm0
vaddss %xmm0,%xmm9,%xmm3
add $0x8,%r8d
cmp %r10d,%r8d
jl .1
So now it has unrolled the loop an additional 4x (8 elements per iteration again), so at first glance it looks the same. But now the moves have been spread across many registers rather than mostly going through xmm1. It runs quite a bit faster.
This is getting too long so I won't include it, but the same simple modification applied to the C version also makes a difference - quite a big one. The generated code is almost identical apart from every second xmm0 being replaced with xmm1 - i.e. interleaved as written.
Benchmarks
And here are some benchmarks of this version.
what        time
gcc -O0     2.76
gcc -O2     0.833
gcc -O3     0.735
java        1.00
time java   1.2
Conclusions
Well, hotspot is pretty good, but it could be a little better. It seems mostly to fall down in some seemingly simple areas like instruction scheduling (simple compared to the rest of the work it has done).
Although I don't have enough knowledge of the architecture to state that the original scheduling isn't very optimal, the benchmark results probably speak loudly enough in that absence. It is clearly not optimal, since essentially the same machine code with the output registers interleaved runs 2x faster in the C case. I don't really feel like translating this to assembly so I can see whether some simple re-arrangement would make a difference.
What is odd is that neither compiler does this on its own. One could argue (quite convincingly) that due to floating point peculiarities (addition is only weakly associative) the two loops are not actually calculating the same result. In the case of hotspot, however, this argument is weak because the optimised version is already spreading the additions across multiple registers.
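For example, regrouping float additions really can change the answer:

float a = 1e20f, b = -1e20f, c = 1f;
// (a + b) + c == 1.0f, but a + (b + c) == 0.0f - c is absorbed by b before a can cancel it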
Lambdas & de-optimisation
This is getting long, and the next part could probably go into another article, but I've spent enough of my weekend on this so I'll get it out of the way now with a quick summary.
For simplicity I created the following simple 3-parameter map/reduce operation.
public interface FloatTrinaryFunction {
    public float applyAsFloat(float a, float b, float c);
}

public float reduce(float[] a, float[] b, int len, FloatTrinaryFunction func) {
    float v = 0;
    for (int i = 0; i < len; i++)
        v = func.applyAsFloat(v, a[i], b[i]);
    return v;
}
And invoke it thus:
reduce(a, b, a.length, (float v, float x, float y)->v + x*y);
Opt and de-opt
In short, if the reduce() call only ever sees up to two different lambdas it results in code equivalent to the direct dot product loop - nice. But once you go to three or more it de-optimises the loop and reverts to a function call per item. It also spends more time in the compiler.
The following is what the deoptimised loop looks like:
.1: mov (%rsp),%rdx
mov %rdx,(%rsp)
cmp %r10d,%ebp
jae .exception0
mov 0x8(%rsp),%r10
vmovss 0x10(%rdx,%rbp,4),%xmm1
cmp %r10d,%ebp
jae .exception1
mov 0x8(%rsp),%r10
vmovss 0x10(%r10,%rbp,4),%xmm2
mov 0x18(%rsp),%rsi
xchg %ax,%ax
mov $0xffffffffffffffff,%rax
callq applyAsFloat
inc %ebp
cmp 0x10(%rsp),%ebp
jl .1
So it retains the array bounds checks inside the loop (bummer) and invokes the interface as a function call (expected), but it removes any profiling instrumentation that was present in the first pass (expected also) and generates the smallest code.
This hits around the 4.5ms mark.
Conclusions 2
It's important to note that this is just a run-time decision made by the current version of hotspot - it could be changed or tweaked in the future. And as I showed in some previous posts, it can be worked around even with the current hotspot using some bytecode foo.
Given the prevalence of lambdas in java8, I suspect this is something that will gain some tuning attention in future revisions. It's not something one would change lightly, so it will probably be based on profiling data and usage.
Faster loops
I've been playing with streams and iterators and stuff again a bit. Although I've found that having a custom calculation loop is pretty good for performance, trying to call a (lambda) function for each item can have some fairly large overheads once the jvm decides it wants to de-optimise the calling loop. But the latter is just so convenient to use that it's worth a little more effort if it will make a performance difference.
de-optimisation
This stuff is just based on observation and not on any internal knowledge of hotspot.
So I had another look at trying to optimise the case of processing per-item in a loop with minimal overheads, in a way that is practical and efficient. So far I've found the jvm will inline a call inside a loop (depending on size?) if the loop only ever calls up to two different class types. Unfortunately each new invokedynamic call site counts as a new class type, so for example the first of the following will optimise fine, but the second won't.
// these will remain optimised
IntUnaryOperator op = Integer::reverse;
A.forEach(op);
A.forEach(op);
A.forEach(op);
// the third invocation will cause a de-optimisation,
// subsequently all will run as de-optimised
A.forEach(Integer::reverse);
A.forEach(Integer::reverse);
A.forEach(Integer::reverse);
And this applies globally, so using a singleton won't get you very far.
So how to address this?
forEacherator
I found that if I used bytecode manipulation and created a new copy-class, the optimisation stayed around - because the loop only ever calls one function. So the goal then was to create the simplest class possible so that the overhead of doing this (and the subsequent code) remained small.
After a couple of iterations I settled on the following pair of interface+implementation.
public interface ByteRowFunction {
    public void forEach(byte[] data, int offset, int length);
}

public class ByteRow implements ByteRowFunction {
    private final IntUnaryOperator op;

    public ByteRow(IntUnaryOperator op) {
        this.op = op;
    }

    public void forEach(byte[] data, int offset, int length) {
        for (; length > 0; length--, offset++)
            data[offset] = (byte) op.applyAsInt(data[offset] & 0xff);
    }
}
This form of for loop is also the fastest I could find, at least with this hotspot on this platform. (I suppose it's really a map() or apply() call; just read it as whichever you prefer.)
It still has the same issue as the examples above, in that using it with 3 or more different 'op' values will de-optimise it, even though it could itself be used as a singleton (since the forEach call is pure).
Class specialisation
So this is where the bytecode manipulation comes in. I use ASM to simply create a new class for each class of op, and the jvm will worry about inlining the call if it otherwise makes sense.
public static ByteRowFunction of(IntUnaryOperator op) {
    try {
        Class cp = ASMTools.specialisedClass(ByteRow.class.getName(), op);

        return (ByteRowFunction) cp.getConstructor(IntUnaryOperator.class).newInstance(op);
    } catch (Exception ex) {
        throw new RuntimeException(ex);
    }
}
The specialisedClass() function simply takes the original class and creates an exact copy but renames it to a unique value tied to op.getClass(). There is an out-of-date example in the ASM FAQ on how to do this but it's pretty easy using ASM. And that's more or less all it takes.
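Something along these lines does the copy-and-rename - a sketch of the idea only (assuming ASM 5's ClassRemapper and a throwaway defining ClassLoader; the names are made up), not the actual specialisedClass() implementation:

import org.objectweb.asm.ClassReader;
import org.objectweb.asm.ClassWriter;
import org.objectweb.asm.commons.ClassRemapper;
import org.objectweb.asm.commons.SimpleRemapper;

public class SpecialiseSketch {
    // copy the class named 'baseName' under a new name unique to the op's class
    public static Class<?> specialisedClass(String baseName, Object op) throws Exception {
        String oldName = baseName.replace('.', '/');
        String newName = oldName + "$" + op.getClass().getName().replaceAll("\\W", "_");

        ClassReader cr = new ClassReader(baseName);
        ClassWriter cw = new ClassWriter(0);
        // rewrite every reference to the old internal name so the copy is self-consistent
        cr.accept(new ClassRemapper(cw, new SimpleRemapper(oldName, newName)), 0);

        byte[] code = cw.toByteArray();
        String binaryName = newName.replace('/', '.');
        // each copy gets its own loader so repeated calls don't collide
        return new ClassLoader(SpecialiseSketch.class.getClassLoader()) {
            Class<?> define() {
                return defineClass(binaryName, code, 0, code.length);
            }
        }.define();
    }
}

Used as ByteRow.of(Integer::reverse).forEach(data, 0, data.length), each distinct op class then gets its own copy of the loop, so every call site stays monomorphic.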
Going further: in this case the forEach() call is pure (re-entrant and side-effect free), so the of() function could return a singleton instance per op class as well. But that adds some other mess (gc and so on), so it probably isn't worth it, or is even detrimental in the long run; if necessary a caller can keep track of their own.
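If you did want of() to hand back shared instances, a per-op-class cache is only a couple of lines (a hypothetical sketch, and only reasonable because the op is stateless and forEach() is pure):

private static final java.util.concurrent.ConcurrentHashMap<Class<?>, ByteRowFunction> cache =
    new java.util.concurrent.ConcurrentHashMap<>();

public static ByteRowFunction cachedOf(IntUnaryOperator op) {
    // one specialised instance per op class
    return cache.computeIfAbsent(op.getClass(), k -> of(op));
}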
Results
I did two tests on a (2^25) byte array. The first tries to isolate just the overheads and invokes Integer.hashCode() on each item. The second calls Integer.reverse() which is somewhat slow on x86 without a bitrev instruction (ignoring that this will always result in zero when byte-integer-byte is used).
In each case I'm calling the same thing in 3 different places with 3 different lambdas ('Integer::xx' will create a new invokedynamic call site each time), which should trigger a de-optimisation if anything is going to.
what                            hashCode    reverse
new ByteRow().forEach           0.1800000   0.29
new of().forEach                0.0000700   0.146
singleton hash of().forEach     0.0000020   0.146
singleton saved of().forEach    0.0000013   0.146
for loop                        0.0000010   0.158
This is the best of 5 runs after a couple of warm-up laps. Because the operation is so short, the first column's results are a bit noisy, but the general trend is clear enough and I took some representative values from a few runs.
The first two include (one) object instantiation. The third uses a (non-synchronised) hash table lookup, the fourth just creates an instance and re-uses it. The last is a simple for-loop over the byte array.
It would be handy to see the generated object code, but one can guess that the first row vs the rest is the difference between an inline 'foo[x] = foo[x]' and a function call 'foo[x] = identity(foo[x])'.
Of course a 'memcpy' operation isn't much of a test, so with something a little more substantial like Integer.reverse() the overheads are only about 100% - which is still pretty much at the "doesn't matter" point in most cases, but it's there anyway. Oddly enough the for loop loses out here a little bit, but that probably just comes down to different micro-optimisations.
The point of this is to save having to type out yet! another! for! loop! and use lambdas but still retain the performance of using a specialised for-loop. The grand prize would be to re-compile the code for SIMD instructions or HSA or OpenCL - i.e. aparapi. But that requires more than just a bit more effort ;-).
I was hoping the same technique would be applicable to creating optimised spliterators for use with streams, but with the first approach I tried, by the time the spliterator gets to see the operator it has been wrapped in so many anonymous inner classes and/or invokedynamic handles that the compiler doesn't try, or isn't able, to inline everything. Bummer I guess.
I guess if I exposed the spliterator boundaries to the pipeline it could work. Instead of creating a stream of integers it could create a stream of batches or rows, and then some helper of() functions could wrap simple per-item calculations into optimised loop-running instances whilst still retaining most of the simplicity of use.
thing.rows().map(ByteRow.ofMap((v) -> itemcalc)). ...;
thing.rows().flatMap(ByteRow.ofFlatMap((v) -> itemcalc)). ...;
etc
But I've had enough of this for today. I dunno why I was even on this thing - I had an overly long work week and have spent too much time in front of screens as it is. But with a crappy cold day and that sore foot, options are limited. Might see if the footy is on, but that channel 7 commentary makes it unbearable.
C?
But just for a ground-check I thought I'd see how C compares. Unfortunately the compiler detects that a bit reverse of a byte will end up zero and optimises it away to a byte store of 0. Oops. Well, I mean it's great that it does that, but it's not really important here. I'm using the same code as in Integer.reverse().
So I changed it to the byte-correct reverse(i)>>24.
what                    reverse(i)>>24   ~i (x5 loops)
new ByteRow().forEach   0.151            0.037
for loop                0.147            0.035
C call of forEach       0.118            0.233
C inline                0.08             0.096
So yeah, the Java is slower, but only about 2x in the worst case; in the more realistic case where you're not going to hand-inline implementations of everything it's only ~30% slower.
I also tried a 'not' function, and here Java pounces on gcc: even the inline C case is 3x slower and the function-call version is 6x slower. This is just with -O2, where gcc does no loop unrolling or simdisation; but -O3 doesn't make much difference. Using -O3 and -mtune=native results in no real difference either, although it generates a bunch of SIMD code and unrolls the loop a few times.
The gcc-generated code looks ok at a glance - not that I care enough about x86 machine code to be able to determine the finer details. Maybe it's an alignment thing or something?
It's still a bit surprising, if not very important, but it's enough to demonstrate that C doesn't automatically mean best-of-speed.