hotspot code generation, optimisation and deoptimisation
As promised, here is an article about hotspot code generation, using the disassembler plugin mentioned in the last post. I was nearly not going to do it, but I'd already done some playing with it.
Unfortunately I had to use AMD64 instructions here; I think the ISA is pretty shithouse and I've never bothered to learn it very well, so I'm doing some guessing below. I even downloaded the APMs from AMD (I find the Intel docs quite poor) to look some stuff up.
For the C code I'm using gcc 4.8.2 with -mtune=native -std=gnu99 and -Ox as indicated in the text.
The actual test calculates 1000x dot products of 2^20 elements each. For Java I'm using System.nanoTime() and printing the best result across all runs. For C I couldn't be bothered with the gettimeofday() stuff, so I'm just using the time command - over 1000 iterations the difference should be negligible, and there are some interesting results regardless.
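The Java timing harness is roughly along these lines - a sketch of the methodology just described rather than the exact code I used, with the class name made up and dot() being the function shown in the next section:

public class DotBench {
    public static void main(String[] args) {
        float[] a = new float[1 << 20], b = new float[1 << 20];
        java.util.Arrays.fill(a, 1.5f);
        java.util.Arrays.fill(b, 0.5f);

        long best = Long.MAX_VALUE;
        float sink = 0;
        for (int run = 0; run < 10; run++) {
            long start = System.nanoTime();
            for (int i = 0; i < 1000; i++)      // 1000 loops per run
                sink += dot(a, b, a.length);
            best = Math.min(best, System.nanoTime() - start);
        }
        // best time for 1000 loops, reported as ms per loop
        // (sink keeps the result live so the work isn't optimised away)
        System.out.printf("best: %.3f ms/loop (%g)%n", best / 1000 / 1e6, sink);
    }

    static float dot(float[] a, float[] b, int len) {
        float v = 0;
        for (int i = 0; i < len; i++)
            v += a[i] * b[i];
        return v;
    }
}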
Simple loop
This is the starting function; it's obvious what it does.
public float dot(float[] a, float[] b, int len) {
    float v = 0;
    for (int i = 0; i < len; i++)
        v += a[i] * b[i];
    return v;
}
A C version is identical apart from using pointers rather than arrays and some extra fluffy conventions.
float dot(const float *a, const float *b, int len) {
    float v = 0;
    for (int i = 0; i < len; i++)
        v += a[i] * b[i];
    return v;
}
First pass
After some iterations hotspot recognises that this function could benefit from optimisation, and this is what jdk8 spits out at the first compilation pass.
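(For anyone playing along at home: this output comes from hotspot's PrintAssembly support, which needs the hsdis disassembler plugin on the java library path and flags along the lines of -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly, or -XX:CompileCommand=print,... to limit it to a single method.)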
This is AT&T (gas) syntax, so instruction operands are srca[,srcb],dst rather than the more conventional dst,srca[,srcb].
.1: movslq %esi,%rdi
jae .exception0
vmovss 0x10(%rdx,%rdi,4),%xmm1
movslq %esi,%rdi
jae .exception1
vmovss 0x10(%rcx,%rdi,4),%xmm2
vmulss %xmm2,%xmm1,%xmm1
vaddss %xmm0,%xmm1,%xmm1
inc %esi
mov $0x7ffdffc00ce8,%rdi
mov 0xe0(%rdi),%ebx
add $0x8,%ebx
mov %ebx,0xe0(%rdi)
mov $0x7ffdffc00488,%rdi
and $0xfff8,%ebx
cmp $0x0,%ebx
je .2
.3: test %eax,0x15e4076a(%rip)
mov $0x7ffdffc00ce8,%rdi
incl 0x128(%rdi)
vmovaps %xmm1,%xmm0
cmp %r8d,%esi
mov $0x7ffdffc00ce8,%rdi
mov $0x108,%rbx
jge .4
mov $0x118,%rbx
.4: mov (%rdi,%rbx,1),%rax
lea 0x1(%rax),%rax
mov %rax,(%rdi,%rbx,1)
jl .1
;; clean up and exit
.2: mov %rdi,0x8(%rsp)
movq $0x1d,(%rsp)
callq some_function
jmpq .3
Of these, 11 instructions are for the loop itself; the rest seem to be for profiling the loop.
As far as it goes it looks fairly decent - pretty much gcc -O2 level of optimisation with array bounds checking performed at each array read.
Of course the profiling adds a lot of overhead here.
The following is the output for the inner loop of gcc -O2.
10: f3 0f 10 0c 87 movss (%rdi,%rax,4),%xmm1
15: f3 0f 59 0c 86 mulss (%rsi,%rax,4),%xmm1
1a: 48 ff c0 inc %rax
1d: 39 c2 cmp %eax,%edx
1f: f3 0f 58 c1 addss %xmm1,%xmm0
23: 7f eb jg 10
The only real difference, apart from having no bounds checking, is that gcc multiplies directly from memory rather than loading into a register first. The latter is how every other mainstream cpu does it, so that may have some bearing on why hotspot generates it that way.
I can't easily do comparison timing of the loops (and it isn't very meaningful), but obviously the Java will be slower here, and probably on par with -O0 output from gcc.
Final pass
After it has gained some profiling information the result will be optimised - in this case it recompiles it twice more. The inner loop of the final pass is below:
.1: vmovss 0x10(%rbx,%r14,4),%xmm0
vmulss 0x10(%rcx,%r14,4),%xmm0,%xmm1
vaddss %xmm3,%xmm1,%xmm0
movslq %r14d,%r10
vmovss 0x2c(%rbx,%r10,4),%xmm2
vmulss 0x2c(%rcx,%r10,4),%xmm2,%xmm8
vmovss 0x14(%rbx,%r10,4),%xmm1
vmulss 0x14(%rcx,%r10,4),%xmm1,%xmm2
vmovss 0x18(%rcx,%r10,4),%xmm1
vmulss 0x18(%rbx,%r10,4),%xmm1,%xmm3
vmovss 0x28(%rbx,%r10,4),%xmm1
vmulss 0x28(%rcx,%r10,4),%xmm1,%xmm4
vmovss 0x1c(%rcx,%r10,4),%xmm1
vmulss 0x1c(%rbx,%r10,4),%xmm1,%xmm5
vmovss 0x20(%rbx,%r10,4),%xmm1
vmulss 0x20(%rcx,%r10,4),%xmm1,%xmm6
vmovss 0x24(%rbx,%r10,4),%xmm1
vmulss 0x24(%rcx,%r10,4),%xmm1,%xmm7
vaddss %xmm2,%xmm0,%xmm0
vaddss %xmm0,%xmm3,%xmm1
vaddss %xmm1,%xmm5,%xmm1
vaddss %xmm1,%xmm6,%xmm0
vaddss %xmm0,%xmm7,%xmm1
vaddss %xmm1,%xmm4,%xmm0
vaddss %xmm0,%xmm8,%xmm3
add $0x8,%r14d
cmp %r8d,%r14d
jl .1
cmp %ebp,%r14d
jge .done
xchg %ax,%ax
.2: cmp %edi,%r14d
jae .stuff0
vmovss 0x10(%rcx,%r14,4),%xmm1
cmp %r9d,%r14d
jae .stuff1
vmulss 0x10(%rbx,%r14,4),%xmm1,%xmm1
vaddss %xmm1,%xmm3,%xmm3
inc %r14d
cmp %ebp,%r14d
jl .2
.done:
So this has removed all the array bounds checking from inside the loop (it's done elsewhere - too bulky/not important here). It's also unrolled the loop 8x and is using the modern 3-operand instructions to stagger most of the operations, which would give better throughput on a typical RISC-style pipeline (I have no knowledge of the AMD scheduling rules). And finally it tacks on a simple 1-element loop to finish off anything left over.
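In source terms the shape of what it generated is roughly the following - a sketch only, assuming a main-loop limit computed as shown; the real code keeps everything in registers, hoists the bounds checks ahead of the loop, and staggers the adds across registers rather than doing them strictly in this order:

float dotShape(float[] a, float[] b, int len) {
    float v = 0;
    int i = 0;
    int limit = len & ~7;       // hypothetical main-loop limit, 8 elements at a time
    for (; i < limit; i += 8) {
        v += a[i]   * b[i];
        v += a[i+1] * b[i+1];
        v += a[i+2] * b[i+2];
        v += a[i+3] * b[i+3];
        v += a[i+4] * b[i+4];
        v += a[i+5] * b[i+5];
        v += a[i+6] * b[i+6];
        v += a[i+7] * b[i+7];
    }
    for (; i < len; i++)        // 1-element cleanup loop
        v += a[i] * b[i];
    return v;
}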
Comparing this to the output of gcc -O3 ...
30: f3 0f 10 09 movss (%rcx),%xmm1
34: 41 83 c0 10 add $0x10,%r8d
38: 0f 18 49 50 prefetcht0 0x50(%rcx)
3c: 0f 18 48 50 prefetcht0 0x50(%rax)
40: 48 83 c1 40 add $0x40,%rcx
44: 48 83 c0 40 add $0x40,%rax
48: f3 0f 59 48 c0 mulss -0x40(%rax),%xmm1
4d: f3 0f 58 c1 addss %xmm1,%xmm0
51: f3 0f 10 49 c4 movss -0x3c(%rcx),%xmm1
56: f3 0f 59 48 c4 mulss -0x3c(%rax),%xmm1
5b: f3 0f 58 c1 addss %xmm1,%xmm0
5f: f3 0f 10 49 c8 movss -0x38(%rcx),%xmm1
64: f3 0f 59 48 c8 mulss -0x38(%rax),%xmm1
69: f3 0f 58 c1 addss %xmm1,%xmm0
6d: f3 0f 10 49 cc movss -0x34(%rcx),%xmm1
72: f3 0f 59 48 cc mulss -0x34(%rax),%xmm1
77: f3 0f 58 c1 addss %xmm1,%xmm0
7b: f3 0f 10 49 d0 movss -0x30(%rcx),%xmm1
80: f3 0f 59 48 d0 mulss -0x30(%rax),%xmm1
85: f3 0f 58 c1 addss %xmm1,%xmm0
89: f3 0f 10 49 d4 movss -0x2c(%rcx),%xmm1
8e: f3 0f 59 48 d4 mulss -0x2c(%rax),%xmm1
93: f3 0f 58 c1 addss %xmm1,%xmm0
97: f3 0f 10 49 d8 movss -0x28(%rcx),%xmm1
9c: f3 0f 59 48 d8 mulss -0x28(%rax),%xmm1
a1: f3 0f 58 c1 addss %xmm1,%xmm0
a5: f3 0f 10 49 dc movss -0x24(%rcx),%xmm1
aa: f3 0f 59 48 dc mulss -0x24(%rax),%xmm1
af: f3 0f 58 c1 addss %xmm1,%xmm0
b3: f3 0f 10 49 e0 movss -0x20(%rcx),%xmm1
b8: f3 0f 59 48 e0 mulss -0x20(%rax),%xmm1
bd: f3 0f 58 c1 addss %xmm1,%xmm0
c1: f3 0f 10 49 e4 movss -0x1c(%rcx),%xmm1
c6: f3 0f 59 48 e4 mulss -0x1c(%rax),%xmm1
cb: f3 0f 58 c1 addss %xmm1,%xmm0
cf: f3 0f 10 49 e8 movss -0x18(%rcx),%xmm1
d4: f3 0f 59 48 e8 mulss -0x18(%rax),%xmm1
d9: f3 0f 58 c1 addss %xmm1,%xmm0
dd: f3 0f 10 49 ec movss -0x14(%rcx),%xmm1
e2: f3 0f 59 48 ec mulss -0x14(%rax),%xmm1
e7: f3 0f 58 c1 addss %xmm1,%xmm0
eb: f3 0f 10 49 f0 movss -0x10(%rcx),%xmm1
f0: f3 0f 59 48 f0 mulss -0x10(%rax),%xmm1
f5: f3 0f 58 c1 addss %xmm1,%xmm0
f9: f3 0f 10 49 f4 movss -0xc(%rcx),%xmm1
fe: f3 0f 59 48 f4 mulss -0xc(%rax),%xmm1
103: f3 0f 58 c1 addss %xmm1,%xmm0
107: f3 0f 10 49 f8 movss -0x8(%rcx),%xmm1
10c: f3 0f 59 48 f8 mulss -0x8(%rax),%xmm1
111: f3 0f 58 c1 addss %xmm1,%xmm0
115: f3 0f 10 49 fc movss -0x4(%rcx),%xmm1
11a: f3 0f 59 48 fc mulss -0x4(%rax),%xmm1
11f: 45 39 c8 cmp %r9d,%r8d
122: f3 0f 58 c1 addss %xmm1,%xmm0
126: 0f 85 04 ff ff ff jne 30
12c: 49 63 c0 movslq %r8d,%rax
12f: 48 c1 e0 02 shl $0x2,%rax
133: 48 01 c7 add %rax,%rdi
136: 48 01 c6 add %rax,%rsi
139: 31 c0 xor %eax,%eax
13b: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
140: f3 0f 10 0c 87 movss (%rdi,%rax,4),%xmm1
145: f3 0f 59 0c 86 mulss (%rsi,%rax,4),%xmm1
14a: 48 ff c0 inc %rax
14d: 41 8d 0c 00 lea (%r8,%rax,1),%ecx
151: 39 ca cmp %ecx,%edx
153: f3 0f 58 c1 addss %xmm1,%xmm0
157: 7f e7 jg 140
The main differences here are that gcc unrolls the loop 16x, uses only the 2-operand instructions (and fewer registers), and has transformed the array indexing into pointer arithmetic incremented in batches.
Well, this definitely isn't code for a RISC cpu, as the scheduling looks pants: everything keeps writing to the same registers. But x86 being so dominant has allowed a lot of money to be spent optimising the chips to run shitty code faster to make up for the compiler.
Benchmarks
Here are some timing results. All values are in ms for the equivalent of one loop (or equivalently, seconds for 1000 loops).
what        time
gcc -O0     4.86
gcc -O2     1.44
gcc -O3     1.44
java        1.60
time java   1.7
The last row is using the 'time' command on the whole Java run, i.e. it includes the jvm startup, profiling, and compilation. That isn't too shabby.
Either way these times are pretty good for the effort involved - maybe one or the other is more tuned to the cpu I have versus Intel hardware, but it's really neither here nor there.
Unrolled loop
Actually, what prompted the idea for this article was some results I had from unrolling loops 4x in Java. I subsequently found that unrolling 2x did just as good a job in this case, so I'll do that here for simplicity. The assembly is almost identical anyway, as it just gets unrolled an additional 2x rather than 4x by the compiler.
public float dot(float[] a, float[] b, int len) {
    float v0 = 0, v1 = 0;
    int i = 0;

    for (int e = len & ~1; i < e; i += 2) {
        v0 += a[i] * b[i];
        v1 += a[i+1] * b[i+1];
    }
    for (; i < len; i++)
        v0 += a[i] * b[i];
    return v0 + v1;
}
Final pass
And here's just the inner loop of the final pass:
.1: vmovss 0x10(%rcx,%r8,4),%xmm0
vmulss 0x10(%rdx,%r8,4),%xmm0,%xmm1
vaddss %xmm3,%xmm1,%xmm0
movslq %r8d,%r11
vmovss 0x2c(%rcx,%r11,4),%xmm2
vmulss 0x2c(%rdx,%r11,4),%xmm2,%xmm9
vmovss 0x24(%rcx,%r11,4),%xmm1
vmulss 0x24(%rdx,%r11,4),%xmm1,%xmm8
vmovss 0x1c(%rcx,%r11,4),%xmm2
vmulss 0x1c(%rdx,%r11,4),%xmm2,%xmm1
vmovss 0x18(%rcx,%r11,4),%xmm3
vmulss 0x18(%rdx,%r11,4),%xmm3,%xmm2
vmovss 0x14(%rcx,%r11,4),%xmm4
vmulss 0x14(%rdx,%r11,4),%xmm4,%xmm3
vmovss 0x20(%rcx,%r11,4),%xmm5
vmulss 0x20(%rdx,%r11,4),%xmm5,%xmm4
vmovss 0x28(%rcx,%r11,4),%xmm6
vmulss 0x28(%rdx,%r11,4),%xmm6,%xmm5
vaddss %xmm3,%xmm0,%xmm3
vaddss %xmm2,%xmm3,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vaddss %xmm1,%xmm4,%xmm0
vaddss %xmm8,%xmm0,%xmm1
vaddss %xmm1,%xmm5,%xmm0
vaddss %xmm0,%xmm9,%xmm3
add $0x8,%r8d
cmp %r10d,%r8d
jl .1
So now it has unrolled the loop an additional 4x (8 elements per iteration again), so at first glance it looks the same. But now the moves have been spread across many registers rather than mostly going through xmm1. It runs quite a bit faster.
This is getting too long so I won't include it, but the same simple modification applied to the C version also makes a difference - quite a big one. The generated code is almost identical apart from every second xmm0 being replaced with xmm1 - i.e. interleaved as written.
Benchmarks
And here are some benchmarks of this version.
what        time
gcc -O0     2.76
gcc -O2     0.833
gcc -O3     0.735
java        1.00
time java   1.2
Conclusions
Well, hotspot is pretty good, but it could be a little better. It seems mostly to fall down in some seemingly simple areas like instruction scheduling (simple compared to the rest of the work it has done).
Although I don't have enough knowledge of the architecture to state that the original scheduling isn't very optimal, the benchmark results probably speak loudly enough in that absence. It is clearly not optimal, since essentially the same machine code with the output registers interleaved runs 2x faster in the C case. I don't really feel like translating this to assembly so I can see whether some simple re-arrangement would make a difference.
What is odd is that neither compiler does this on its own. One could argue (quite convincingly) that due to floating point peculiarities (addition is only weakly associative) the two loops are not actually calculating the same result. In the case of hotspot, however, this argument is weak because the optimised version is already spreading the additions across multiple registers.
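For example, regrouping float additions really can change the answer:

float a = 1e20f, b = -1e20f, c = 1f;
// (a + b) + c == 1.0f, but a + (b + c) == 0.0f - c is absorbed by b before a can cancel it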
Lambdas & de-optimisation
This is getting long, and the next part could probably go into another article, but I've spent enough of my weekend on this so I'll get it out of the way now with a quick summary.
For simplicity I created the following simple 3-parameter map/reduce operation.
public interface FloatTrinaryFunction {
    public float applyAsFloat(float a, float b, float c);
}

public float reduce(float[] a, float[] b, int len, FloatTrinaryFunction func) {
    float v = 0;
    for (int i = 0; i < len; i++)
        v = func.applyAsFloat(v, a[i], b[i]);
    return v;
}
And invoke it thus:
reduce(a, b, a.length, (float v, float x, float y)->v + x*y);
Opt and de-opt
In short, if the reduce() call only ever sees up to two different lambdas it results in code equivalent to the direct dot product loop - nice. But once you go to three or more it de-optimises the loop and reverts to a function call per item. It also spends more time in the compiler.
The following is what the deoptimised loop looks like:
.1: mov (%rsp),%rdx
mov %rdx,(%rsp)
cmp %r10d,%ebp
jae .exception0
mov 0x8(%rsp),%r10
vmovss 0x10(%rdx,%rbp,4),%xmm1
cmp %r10d,%ebp
jae .exception1
mov 0x8(%rsp),%r10
vmovss 0x10(%r10,%rbp,4),%xmm2
mov 0x18(%rsp),%rsi
xchg %ax,%ax
mov $0xffffffffffffffff,%rax
callq applyAsFloat
inc %ebp
cmp 0x10(%rsp),%ebp
jl .1
So it retains the array bounds checks inside the loop (bummer) and invokes the interface as a function call (expected), but it removes any profiling instrumentation that was present in the first pass (expected also) and generates the smallest code.
This hits around the 4.5ms mark.
Conclusions 2
It's important to note that this is just a run-time decision made by the current version of hotspot - it could be changed or tweaked in the future. And as I showed in some previous posts, it can be worked around even with the current hotspot using some bytecode foo.
Given the prevalence of lambdas in java8, I suspect this is something that will gain some tuning attention in future revisions. It's not something one would change lightly, so it will probably be based on profiling data and usage.
Faster loops
I've been playing with streams and iterators and stuff again a bit. Although I've found that having a custom calculation loop is pretty good for performance, trying to call a (lambda) function for each item can have some fairly large overheads once the jvm decides it wants to de-optimise the calling loop. But the latter is just so convenient to use that it's worth a little more effort if it will make a performance difference.
de-optimisation
This stuff is just based on observation and not on any internal knowledge of hotspot.
So I had another look at trying to optimise the case of processing per-item in a loop with minimal overheads, in a way that is practical and efficient. So far I've found the jvm will inline a call inside a loop (depending on size?) if the loop only ever calls up to two different class types. Unfortunately each new invokedynamic call site counts as a new class type, so for example the first of the following will optimise fine, but the second won't.
// these will remain optimised
IntUnaryOperator op = Integer::reverse;
A.forEach(op);
A.forEach(op);
A.forEach(op);
// the third invocation will cause a de-optimisation,
// subsequently all will run as de-optimised
A.forEach(Integer::reverse);
A.forEach(Integer::reverse);
A.forEach(Integer::reverse);
And this applies globally, so using a singleton won't get you very far.
So how to address this?
forEacherator
I found that if I used bytecode manipulation and created a new copy-class, the optimisation stayed around - because the loop only ever calls one function. So the goal then was to create the simplest class possible so that the overhead of doing this (and the subsequent code) remained small.
After a couple of iterations I settled on the following pair of interface+implementation.
public interface ByteRowFunction {
    public void forEach(byte[] data, int offset, int length);
}

public class ByteRow implements ByteRowFunction {
    private final IntUnaryOperator op;

    public ByteRow(IntUnaryOperator op) {
        this.op = op;
    }

    public void forEach(byte[] data, int offset, int length) {
        for (; length > 0; length--, offset++)
            data[offset] = (byte) op.applyAsInt(data[offset] & 0xff);
    }
}
This form of for loop is also the fastest I could find, at least with this hotspot on this platform. (I suppose it's really a map() or apply() call; just read it as whichever you prefer.)
It still has the same issue as the examples above, in that using it with 3 or more different 'op' values will de-optimise it, even though it could itself be used as a singleton (since the forEach call is pure).
Class specialisation
So this is where the bytecode manipulation comes in. I use ASM to simply create a new class for each class of op, and the jvm will worry about inlining the call if it otherwise makes sense.
public static ByteRowFunction of(IntUnaryOperator op) {
    try {
        Class cp = ASMTools.specialisedClass(ByteRow.class.getName(), op);

        return (ByteRowFunction) cp.getConstructor(IntUnaryOperator.class).newInstance(op);
    } catch (Exception ex) {
        throw new RuntimeException(ex);
    }
}
The specialisedClass() function simply takes the original class and creates an exact copy but renames it to a unique value tied to op.getClass(). There is an out-of-date example in the ASM FAQ on how to do this but it's pretty easy using ASM. And that's more or less all it takes.
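Something along these lines does the copy-and-rename - a sketch of the idea only (assuming ASM 5's ClassRemapper and a throwaway defining ClassLoader; the names are made up), not the actual specialisedClass() implementation:

import org.objectweb.asm.ClassReader;
import org.objectweb.asm.ClassWriter;
import org.objectweb.asm.commons.ClassRemapper;
import org.objectweb.asm.commons.SimpleRemapper;

public class SpecialiseSketch {
    // copy the class named 'baseName' under a new name unique to the op's class
    public static Class<?> specialisedClass(String baseName, Object op) throws Exception {
        String oldName = baseName.replace('.', '/');
        String newName = oldName + "$" + op.getClass().getName().replaceAll("\\W", "_");

        ClassReader cr = new ClassReader(baseName);
        ClassWriter cw = new ClassWriter(0);
        // rewrite every reference to the old internal name so the copy is self-consistent
        cr.accept(new ClassRemapper(cw, new SimpleRemapper(oldName, newName)), 0);

        byte[] code = cw.toByteArray();
        String binaryName = newName.replace('/', '.');
        // each copy gets its own loader so repeated calls don't collide
        return new ClassLoader(SpecialiseSketch.class.getClassLoader()) {
            Class<?> define() {
                return defineClass(binaryName, code, 0, code.length);
            }
        }.define();
    }
}

Used as ByteRow.of(Integer::reverse).forEach(data, 0, data.length), each distinct op class then gets its own copy of the loop, so every call site stays monomorphic.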
Going further: in this case the forEach() call is pure (re-entrant and side-effect free), so the of() function could return a singleton instance per op class as well. But that adds some other mess (gc and so on), so it probably isn't worth it, or is even detrimental in the long run; if necessary a caller can keep track of their own.
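If you did want of() to hand back shared instances, a per-op-class cache is only a couple of lines (a hypothetical sketch, and only reasonable because the op is stateless and forEach() is pure):

private static final java.util.concurrent.ConcurrentHashMap<Class<?>, ByteRowFunction> cache =
    new java.util.concurrent.ConcurrentHashMap<>();

public static ByteRowFunction cachedOf(IntUnaryOperator op) {
    // one specialised instance per op class
    return cache.computeIfAbsent(op.getClass(), k -> of(op));
}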
Results
I did two tests on a (2^25) byte array. The first tries to isolate just the overheads and invokes Integer.hashCode() on each item. The second calls Integer.reverse() which is somewhat slow on x86 without a bitrev instruction (ignoring that this will always result in zero when byte-integer-byte is used).
In each case I'm calling the same thing in 3 different places with 3 different lambdas ('Integer::xx' will create a new invokedynamic call site each time), which should trigger a de-optimisation if anything is going to.
what                            hashCode    reverse
new ByteRow().forEach           0.1800000   0.29
new of().forEach                0.0000700   0.146
singleton hash of().forEach     0.0000020   0.146
singleton saved of().forEach    0.0000013   0.146
for loop                        0.0000010   0.158
This is the best of 5 runs after a couple of warm-up laps. Because the operation is so short, the first column's results are a bit noisy, but the general trend is clear enough and I took some representative values from a few runs.
The first two include (one) object instantiation. The third uses a (non-synchronised) hash table lookup, the fourth just creates an instance and re-uses it. The last is a simple for-loop over the byte array.
It would be handy to see the generated object code, but one can guess that the first row vs the rest is the difference between an inline 'foo[x] = foo[x]' and a function call 'foo[x] = identity(foo[x])'.
Of course a 'memcpy' operation isn't much of a test, so with something a little more substantial like Integer.reverse() the overheads are only about 100% - which is still pretty much at the "doesn't matter" point in most cases, but it's there anyway. Oddly enough the for loop loses out here a little bit, but that probably just comes down to different micro-optimisations.
The point of this is to save having to type out yet! another! for! loop! and use lambdas but still retain the performance of using a specialised for-loop. The grand prize would be to re-compile the code for SIMD instructions or HSA or OpenCL - i.e. aparapi. But that requires more than just a bit more effort ;-).
I was hoping the same technique would be applicable to creating optimised spliterators for use with streams, but with the first approach I tried, by the time the spliterator gets to see the operator it has been wrapped in so many anonymous inner classes and/or invokedynamic handles that the compiler doesn't try, or isn't able, to inline everything. Bummer I guess.
I guess if I exposed the spliterator boundaries to the pipeline it could work. Instead of creating a stream of integers it could create a stream of batches or rows, and then some helper of() functions could wrap simple per-item calculations into optimised loop-running instances whilst still retaining most of the simplicity of use.
thing.rows().map(ByteRow.ofMap((v) -> itemcalc)). ...;
thing.rows().flatMap(ByteRow.ofFlatMap((v) -> itemcalc)). ...;
etc
But I've had enough of this for today. I dunno why I was even on this thing - I had an overly long work week and have spent too much time in front of screens as it is. But with a crappy cold day and that sore foot, options are limited. Might see if the footy is on, but that channel 7 commentary makes it unbearable.
C?
But just for a ground-check I thought I'd see how C compares. Unfortunately the compiler detects that a bit reverse of a byte will end up zero and optimises it away to a byte store of 0. Oops. Well, I mean it's great that it does that, but it's not really important here. I'm using the same code as in Integer.reverse().
So I changed it to the byte-correct reverse(i)>>24.
what                    reverse(i)>>24   ~i (x5 loops)
new ByteRow().forEach   0.151            0.037
for loop                0.147            0.035
C call of forEach       0.118            0.233
C inline                0.08             0.096
So yeah, the Java is slower, but only about 2x in the worst case; in the more realistic case where you're not going to hand-inline implementations of everything it's only ~30% slower.
I also tried a 'not' function, and here Java pounces on gcc: even the inline C case is 3x slower and the function-call version is 6x slower. This is just with -O2, where gcc does no loop unrolling or simdisation; but -O3 doesn't make much difference. Using -O3 and -mtune=native results in no real difference either, although it generates a bunch of SIMD code and unrolls the loop a few times.
The gcc-generated code looks ok at a glance - not that I care enough about x86 machine code to be able to determine the finer details. Maybe it's an alignment thing or something?
It's still a bit surprising, if not very important, but it's enough to demonstrate that C doesn't automatically mean best-of-speed.