fpu mode, compiler options
Poked a bit more at the 2d scaler on the parallella yesterday. I started just working out all the edge cases for the X scaler, but then I ended up delving into compiler options and assembler optimisations.
Because the floating point unit has some behaviour defined by the CONFIG register, the compiler needs to twiddle its bits quite a bit - and by default it seems to do so more often than you'd expect. And because it supports writing interrupt handlers in C, it also requires any of these bit-twiddles to occur within an interrupt-disable block. Fun.
To cut a long story short I found that fiddling with the compiler flags makes a pretty big difference to performance.
The flags which seemed to produce the best result here were:
-std=gnu99 -O2 -ffast-math -mfp-mode=truncate -funroll-loops
Actually the option that had the biggest effect was -mfp-mode=truncate, as that removes many of the (redundant) mode switches.
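For reference the full compile line comes out something like this (using the SDK's e-gcc cross-compiler; the file name is just for illustration):

        e-gcc -std=gnu99 -O2 -ffast-math -mfp-mode=truncate -funroll-loops -c resample.c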
What I didn't expect though is that the CONFIG register bits also seem to have a big effect on the hand-written assembly code. By adding the following to the preamble of the linear interpolator function I got a significant performance boost: without it it takes about 5.5Mcycles per core, but with it about 4.8Mcycles!?
        mov     r17,#0xfff0
        movt    r17,#0xfff1
        mov     r16,#1
        movfs   r12,CONFIG
        and     r12,r12,r17     ; set fpumode = float, turn off exceptions
        orr     r12,r12,r16     ; truncate rounding
        movts   CONFIG,r12
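The same sequence can be wrapped up in C with inline assembly - just a sketch, assuming gcc's extended asm and a hypothetical force_fpu_config() name, and with the interrupt-disable around the CONFIG write as per the note above:

static inline void force_fpu_config(void) {
        unsigned int cfg;

        __asm__ volatile ("movfs %0,CONFIG" : "=r"(cfg));
        cfg &= 0xfff1fff0;      /* set fpumode = float, turn off exceptions */
        cfg |= 1;               /* truncate rounding */
        __asm__ volatile ("gid");                           /* interrupts off */
        __asm__ volatile ("movts CONFIG,%0" : : "r"(cfg));
        __asm__ volatile ("gie");                           /* interrupts back on */
}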
It didn't make any difference to the results whether I did this or not.
Not sure what's going on here.
I have a very simple routine that resamples a single line of float data using linear interpolation. I was trying to determine whether such a simple routine would compile OK, or whether I'd need to resort to assembly language for decent performance. At first it looked like assembly would be needed, until I used the compiler flags above (although later I noticed I'd also left on an option that disables function inlining, which I'd been using to investigate the compiler output - that may have contributed).
The sampler I'm using is just (see: here for a nice overview):
static inline float sample_linear(float * __restrict__ src, float sxf) {
        int sx = (int)sxf;
        float r = sxf - sx;
        float y1 = src[sx];
        float y2 = src[sx+1];

        return (y1*(1-r)+y2*r);
}
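For example, sampling at sxf = 2.25 gives sx = 2 and r = 0.25, so the result is 0.75*src[2] + 0.25*src[3].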
Called from:
static void scale_linex(float * __restrict__ src, float sxf,
                        float * __restrict__ dst, int dlen, float factor) {
        int x;

        for (x=0;x<dlen;x++) {
                dst[x] = sample_linear(src, sxf);
                sxf += factor;
        }
}
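A hypothetical driver (not part of the scaler itself) upscaling one 512-sample row by 1.7x would look something like this - factor is the source step per output pixel, i.e. 1/1.7:

        float src[512 + 1];     /* +1 guard sample, since the sampler reads src[sx+1] */
        float dst[870];         /* (int)(512 * 1.7) outputs */

        scale_linex(src, 0.0f, dst, 870, 1.0f / 1.7f);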
A straight asm implementation is reasonably simple, but there are a lot of dependency stalls.
        mov     r19,#0          ; 1.0f
        movt    r19,#0x3f80
;; linear interpolation
        fix     r16,r1          ; sx = (int)sxf
        lsl     r18,r16,#2
        float   r17,r16         ; (float)sx
        add     r18,r18,r0
        fsub    r17,r1,r17      ; r = sxf - sx
        ldr     r21,[r18,#1]    ; y2 = src[sx+1]
        ldr     r20,[r18,#0]    ; y1 = src[sx]
        fsub    r22,r19,r17     ; 1-r
        fmul    r21,r21,r17     ; y2 = y2 * r
        fmadd   r21,r20,r22     ; res = y2 * r + y1 * (1-r)
(I actually implement the whole resample-row routine, not just the sampler).
This simple loop is much faster than the default -O2 optimisation, but slower than the C version with better optimisation flags. I can beat the C compiler with an implementation which processes 4 output pixels per loop - thus allowing for better scheduling with a reduction in stalls, and dword writes to the next core in the pipeline. But the gain is only modest for the amount of effort required.
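In C the structure of that 4x version looks roughly like the following - a sketch only (it assumes dlen is a multiple of 4, and scale_linex_4x is just an illustrative name); the real gain comes from scheduling the asm by hand:

static void scale_linex_4x(float * __restrict__ src, float sxf,
                           float * __restrict__ dst, int dlen, float factor) {
        int x;

        /* batching 4 outputs exposes independent work that can be
           interleaved to hide the load/fpu dependency stalls */
        for (x=0;x<dlen;x+=4) {
                dst[x+0] = sample_linear(src, sxf);
                dst[x+1] = sample_linear(src, sxf + factor);
                dst[x+2] = sample_linear(src, sxf + 2*factor);
                dst[x+3] = sample_linear(src, sxf + 3*factor);
                sxf += 4*factor;
        }
}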
Overview of the results:
        Routine                 Mcycles per core
        C -O2                   10.3
        C flags as above         4.2
        asm 1x                   5.3
        asm 1x force CONFIG      4.7
        asm 4x                   3.9
        asm 4x force CONFIG      3.8
I'm timing the total instruction cycles on the core, which includes the synchronisation work. The image is 512x512, scaled by 1.7x horizontally and 1.0x vertically.
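The timing itself is just done with one of the core timers - something like this sketch, assuming e-lib's ctimer calls (not necessarily the exact harness I'm using, which also times the synchronisation):

#include <e_lib.h>

static unsigned int time_row(float *src, float *dst, int dlen, float factor) {
        e_ctimer_set(E_CTIMER_0, E_CTIMER_MAX);
        e_ctimer_start(E_CTIMER_0, E_CTIMER_CLK);       /* count clock cycles */

        scale_linex(src, 0.0f, dst, dlen, factor);

        e_ctimer_stop(E_CTIMER_0);
        return E_CTIMER_MAX - e_ctimer_get(E_CTIMER_0); /* timer counts down */
}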
On a semi-related note, I was playing with the VJ detector code and noticed the performance scalability isn't so good on PAL-res images because in practice the image being searched is very small - i.e. the parallelism isn't so hot. I hit this problem with the OpenCL version too, and the solution is probably the same as the one I used there: go wider. Basically, generate all probe scales at once and then process them all at once.