About Me

Michael Zucchi

 B.E. (Comp. Sys. Eng.)

  also known as Zed
  to his mates & enemies!

notzed at gmail >
fosstodon.org/@notzed >

Tags

android (44)
beagle (63)
biographical (104)
blogz (9)
business (1)
code (77)
compilerz (1)
cooking (31)
dez (7)
dusk (31)
esp32 (4)
extensionz (1)
ffts (3)
forth (3)
free software (4)
games (32)
gloat (2)
globalisation (1)
gnu (4)
graphics (16)
gsoc (4)
hacking (459)
haiku (2)
horticulture (10)
house (23)
hsa (6)
humour (7)
imagez (28)
java (231)
java ee (3)
javafx (49)
jjmpeg (81)
junk (3)
kobo (15)
libeze (7)
linux (5)
mediaz (27)
ml (15)
nativez (10)
opencl (120)
os (17)
panamaz (5)
parallella (97)
pdfz (8)
philosophy (26)
picfx (2)
players (1)
playerz (2)
politics (7)
ps3 (12)
puppybits (17)
rants (137)
readerz (8)
rez (1)
socles (36)
termz (3)
videoz (6)
vulkan (3)
wanki (3)
workshop (3)
zcl (4)
zedzone (26)
Wednesday, 14 September 2011, 23:28

Control the Roll

I think I hit a bug in the NVidia OpenCL compiler today, and lost a bit of faith in compilers in general ..

What was a fairly small routine with a simple loop - wanted to use 63 registers/work item. No matter how i tried to unroll the loop using #pragma unroll, re-arrange the work to use vectors or not, and so on.

// local group size = 16, 16, 1
kernel void
somefunc(..., constant float *f0b, ...) {
    local float *localdata[];

    ... load local data ...

    for (int i=0;i<9;i++) {
        float a0 = localdata[i*2];
        float a1 = localdata[i*2+1];
        ...
        v0 += f0a[i*2] * a0 + f1a[i*2] * a1;
        v1 += f0b[i*2] * b0 + f1b[i*2] * b1;
        v2 += f0a[i*2+1] * a0 + f1a[i*2] * a1;
        v3 += f0b[i*2+1] * a0 + f1b[i*2] * b1;
    }
}

Looking at the intermediate code it had about a thousand(!) redundant register moves to/from other registers. For the small problem I had it was taking about 100uS which probably wouldn't have bothered me apart from the weird compiler output.

So I removed the loop entirely by hand, using C macros to implement each step.

Result: 73uS & 21 registers.

And the intermediate code was much smaller and more compact.

NVidia's compiler seems to do a pretty crappy job with vectors in any event, the vector version was even worse - half the speed of a scalar version - around 200uS. It's nor normally this extreme but it seems it's almost always faster not to use vector code. It would also (only sometimes!) hang for 20 seconds or more whilst compiling this file, and these changes fixed that too.

Tagged opencl, rants.
LU Decomposition 2 | Aparapi freed
Copyright (C) 2019 Michael Zucchi, All Rights Reserved. Powered by gcc & me!