Converting YUV to RGB (approximately) is a fairly simple process but requires a lot of arithmetic, and is something that converts to SIMD code incredibly well.
As described in the YUV conversion article on fourcc.org, an approximate calculation for each pixel is given by:
R = Y + 1.402 (Cr-128)
G = Y - 0.34414 (Cb-128) - 0.71414 (Cr-128)
B = Y + 1.772 (Cb-128)
Note: (Y, Cb, Cr) == (Y, U, V) in our notation
Note: There is much discussion on the linked page as to which numbers are right - these numbers seem to work well enough for me for now.
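Just to illustrate the arithmetic (an example of my own, not from the linked article): for a pixel with Y=120, Cb=90, Cr=200 we get R = 120 + 1.402 * 72 ≈ 221, G = 120 - 0.34414 * (-38) - 0.71414 * 72 ≈ 82 and B = 120 + 1.772 * (-38) ≈ 53, with the results then clamped to 0..255.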
Scalar example
This is a straightforward implementation using fixed point arithmetic in C.
// Using 2.14 fixed point format
#define SHIFT (14)

// Let the compiler do the constant formation
#define YSCALE  ((int) (( 1.0     ) * (1<<SHIFT)))
#define BUSCALE ((int) (( 1.772   ) * (1<<SHIFT)))
#define GVSCALE ((int) ((-0.71414 ) * (1<<SHIFT)))
#define GUSCALE ((int) ((-0.34414 ) * (1<<SHIFT)))
#define RVSCALE ((int) (( 1.402   ) * (1<<SHIFT)))

void yuv2rgb(unsigned char *yp, unsigned char *up, unsigned char *vp, unsigned int *argbp, int count)
{
	int x;

	for (x = 0; x < count; x++) {
		int y = yp[x];
		int u = up[x/2];
		int v = vp[x/2];
		int r, g, b;

		// pre-calculate common subexpressions
		y = y * YSCALE;
		u = u - 128;
		v = v - 128;

		// perform basic calculation
		r = y + RVSCALE * v;
		g = y + GUSCALE * u + GVSCALE * v;
		b = y + BUSCALE * u;

		// fix point
		r >>= SHIFT;
		g >>= SHIFT;
		b >>= SHIFT;

		// clamp to maximum
		r = r > 255 ? 255 : r;
		g = g > 255 ? 255 : g;
		b = b > 255 ? 255 : b;

		// clamp to minimum
		r = r < 0 ? 0 : r;
		g = g < 0 ? 0 : g;
		b = b < 0 ? 0 : b;

		// pack and write argb value
		argbp[x] = (r << 16) | (g << 8) | b;
	}
}
There are probably a few rounding error/loss of precision issues above - it is an ok approximation at this point though. Also, this is uncompiled and untested.
In most (?) video formats, the U and V data is specified at 1/2 the horizontal resolution of Y, so in the above code we take that into account. Now this is one of those embarrassingly parallelisable functions, so converting it to spu SIMD code is almost trivial.
So, let's have a look at how one goes about converting it to SIMD code.
The only really tricky bit is a detail hidden in the example above - the casting of the input values to an int increases their size. When loading packed byte data this casting and size expansion must be handled manually. But apart from that, the conversion is straightforward.
First, initialise the constants - we can use the same values, but they need to be in vectors since we cannot multiply an immediate value. Similarly, we need a clamping value as a vector type so we can clamp 4 values at once.
void yuv2rgb_vec(vector unsigned char *yp, vector unsigned char *up, vector unsigned char *vp, vector unsigned int *argbp, int count)
{
	int x;
	const vec_int4 yscale  = spu_splats(YSCALE);
	const vec_int4 buscale = spu_splats(BUSCALE);
	const vec_int4 gvscale = spu_splats(GVSCALE);
	const vec_int4 guscale = spu_splats(GUSCALE);
	const vec_int4 rvscale = spu_splats(RVSCALE);
	const vec_int4 clamp   = spu_splats( 255 );
Next come some shuffle control words. We use two different ones, and they are used to convert the packed byte data into integers. One is used for the Y data, which is used as is, and the other for the U and V data, which also requires 2x expansion before we can use it.
We want to minimise our memory loads, so we load 1 full quad-word (i.e. 16 bytes) each time (not that we have a choice!) and process as much as we can. If we are using integers (32 bit), that means one memory load is expanded into 4 quad-word registers for the Y data, and 8 for the U and V data. Instead of requiring a new shuffle pattern for every set of registers we can just use the same one and increment it - adding is fast and runs on a different pipeline to the shuffle and loads, so is faster than loading a pre-calculated shuffle pattern (given that we don't need to do much other arithmetic in the meantime). Because I was lazy and didn't want to cut and paste the whole lot of code twice, let's just use some logic to make sure that the UV shuffle mask updates at the required rate.
We can also use a shuffle pattern to perform the packing back to RGB format from 3 registers of RGB values in fewer instructions than using shifts and ors. The shuffle performs the equivalent of r<<16 | g<<8 in one step.
const vec_uint4 shuffle_y_0  = (vec_uint4) { 0x80808000, 0x80808001, 0x80808002, 0x80808003 };
const vec_uint4 shuffle_uv_0 = (vec_uint4) { 0x80808000, 0x80808000, 0x80808001, 0x80808001 };
const vec_uint4 shuffle_rg   = (vec_uint4) { 0x80031380, 0x80071780, 0x800b1b80, 0x800f1f80 };
The first one just converts the first (next) 4 bytes into unsigned integers, and the second expands the first 2 bytes into 4 unsigned integers. To get to the next 4 bytes in the loaded quadword, we just add 0x04 to the first one (and 0x02 to the second one), and repeat 4 (or 8) times to cover everything we actually loaded.
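To make that concrete, here is a small sketch of my own (a hypothetical helper, assuming spu-gcc and spu_intrinsics.h) showing how the expanding shuffle works: in a shufb control byte, 0x80 produces 0x00 in the result and values 0x00-0x1f select a byte from the two concatenated source registers, so 0x80808000 builds the word 0x000000bb where bb is source byte 0 - a free uchar to uint expansion.

#include <spu_intrinsics.h>

// Hypothetical helper (sketch only): expand bytes [first, first+4) of a
// packed quadword into 4 unsigned ints using the pattern described above.
static vec_uint4 expand4(vec_uchar16 packed, unsigned int first)
{
	vec_uint4 pattern = (vec_uint4) { 0x80808000, 0x80808001, 0x80808002, 0x80808003 };

	// step the selector bytes along the quadword; for small offsets this
	// only changes the low byte of each word, the 0x80 bytes stay put
	pattern = spu_add(pattern, first);

	return (vec_uint4)spu_shuffle(packed, packed, (vec_uchar16)pattern);
}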
The loop thus becomes:
	vec_uint4 shuffle_uv = shuffle_uv_0;

	for (x = 0; x < count/16; x += 1) {
		vec_int4 y = (vec_int4)yp[x];
		vec_int4 u = (vec_int4)up[x/2];
		vec_int4 v = (vec_int4)vp[x/2];

Note the use of casts - this just means I have to do fewer casts later. A case where C code just gets in the way of the instructions - the spu doesn't give a shit about types.
		vec_int4 y0, y1, y2, y3;
		vec_int4 u0, u1, u2, u3;
		vec_int4 v0, v1, v2, v3;
		vec_int4 r0, r1, r2, r3;
		vec_int4 g0, g1, g2, g3;
		vec_int4 b0, b1, b2, b3;
		vec_uint4 shuffle_y = shuffle_y_0;

		shuffle_uv = (x & 1) == 0 ? shuffle_uv_0 : shuffle_uv;

		// convert 16 YUV values into 4x4 value vectors
		y0 = spu_shuffle(y, y, (vec_uchar16)shuffle_y);
		shuffle_y = spu_add(shuffle_y, 4);
		u0 = spu_shuffle(u, u, (vec_uchar16)shuffle_uv);
		v0 = spu_shuffle(v, v, (vec_uchar16)shuffle_uv);
		shuffle_uv = spu_add(shuffle_uv, 2);

		... repeat this 3 more times for yn, un, and vn	// 5*4 = 20
y0, u0, and v0 (and 1,2,3 of each) now contain 4 YUV values, stored as a vector. The arithmetic can just be applied as it was before, but using intrinsics (although the compiler can do some of it automatically it can be easier just doing it manually). And all we have to do is repeat it 4 times.
		// pre-calculate common subexpression
		y0 = spu_mulo((vec_short8)y0, (vec_short8)yscale);
		v0 = spu_add(v0, -128);
		u0 = spu_add(u0, -128);

		.. repeat for 1-3	// 3*4 = 12

		// perform basic calculation (note use of multiply and add)
		r0 = spu_madd((vec_short8)v0, (vec_short8)rvscale, y0);
		g0 = spu_madd((vec_short8)v0, (vec_short8)gvscale, y0);
		g0 = spu_madd((vec_short8)u0, (vec_short8)guscale, g0);
		b0 = spu_madd((vec_short8)u0, (vec_short8)buscale, y0);

		.. repeat for 1-3	// 4*4 = 16

The spu only has a 16x16 bit to 32 bit multiply - but it can also add to an existing result. This removes the need for extra adds. The parameters must be cast to shorts, although it only uses each odd short - i.e. the lower 16 bits of each integer, so it's all good.
		// fix point (rotate left with mask algebraic is the same as an arithmetic shift right)
		r0 = spu_rlmaska(r0, -SHIFT);
		g0 = spu_rlmaska(g0, -SHIFT);
		b0 = spu_rlmaska(b0, -SHIFT);

		.. repeat for 1-3	// 3*4 = 12

		// clamp to maximum (compare and select makes this very simple)
		r0 = spu_sel(r0, clamp, spu_cmpgt(r0, clamp));
		g0 = spu_sel(g0, clamp, spu_cmpgt(g0, clamp));
		b0 = spu_sel(b0, clamp, spu_cmpgt(b0, clamp));

		.. repeat for 1-3

		// clamp to minimum (we use and here since cmp already returns the right value.
		// This way we don't need a register of zeros either.)
		r0 = spu_and(r0, spu_cmpgt(r0, 0));
		g0 = spu_and(g0, spu_cmpgt(g0, 0));
		b0 = spu_and(b0, spu_cmpgt(b0, 0));

		.. repeat for 1-3	// 12*4 = 48

		// pack and write argb value
		argbp[x*4+0] = (vec_uint4)spu_or(b0, spu_shuffle(r0, g0, (vec_uchar16)shuffle_rg));

		.. repeat for 1-3	// 2*4 = 8
	}
}
The compiler has a nice lot of independent calculations to work with here - it can reschedule things nicely with few stalls. The above only uses half of the loaded u and v values each iteration - the next iteration loads them again to use the other half. This reduces code size at the expense of performance - 2 redundant loads and some unnecessary logic could be removed.
The output here is a packed ARGB value - but if we're going to be doing other processing it might make sense to store the data as unpacked planes, of various bit sizes. e.g. from 8 bits, 16/32 bits integer/fixed point, or floats. The core algorithm and unpacking would be the same.
So far we've looked at an obvious direct translation of a scalar algorithm to a SIMD/vectorised one. It will be a huge improvement over the scalar version, particularly on an spu, but we should be able to do better.
An obvious idea would be to convert this to perform the calculations using shorts. This way much of the calculation could be performed on 8 values at a time rather than 4. We can only multiply 4 values at once, so the multiply-accumulate isn't (much) different, but it halves much of the rest.
This is the sort of thing which can't very easily be represented in C, so although most of the intrinsic use in the previous example was optional, here we have no choice.
The constants can just be converted to a short type. Although the multiply only uses half of them, we need both since we can use different multiply instructions to get different intermediate results.
void yuv2rgb_vec(vector unsigned char *yp, vector unsigned char *up, vector unsigned char *vp, vector unsigned int *argbp, int count)
{
	int x;
	const vec_short8 yscale  = spu_splats((short)YSCALE);
	const vec_short8 buscale = spu_splats((short)BUSCALE);
	const vec_short8 gvscale = spu_splats((short)GVSCALE);
	const vec_short8 guscale = spu_splats((short)GUSCALE);
	const vec_short8 rvscale = spu_splats((short)RVSCALE);
	const vec_short8 clamp   = spu_splats((short)255);
Then, we need different and more shuffle patterns. The y shuffle pattern gets 8 rather than 4 at a time, and the uv shuffle pattern 4. We can also use another shuffle pattern to re-interleave the shorts after the multiplies, and we need 2 ARGB packing shuffles, since we now generate 8 pixels rather than 4. Also we cannot simply or in the blue anymore and need an additional set of shuffles for that.
const vec_ushort8 shuffle_y_0  = (vec_ushort8) { 0x8000, 0x8001, 0x8002, 0x8003, 0x8004, 0x8005, 0x8006, 0x8007 };
const vec_ushort8 shuffle_uv_0 = (vec_ushort8) { 0x8000, 0x8000, 0x8001, 0x8001, 0x8002, 0x8002, 0x8003, 0x8003 };
const vec_ushort8 shuffle_toshort = (vec_ushort8) { 0x0203, 0x1213, 0x0607, 0x1617, 0x0a0b, 0x1a1b, 0x0e0f, 0x1e1f };
const vec_uint4 shuffle_rg0  = (vec_uint4) { 0x80011180, 0x80031380, 0x80051580, 0x80071780 };
const vec_uint4 shuffle_rg1  = (vec_uint4) { 0x80091980, 0x800b1b80, 0x800d1d80, 0x800f1f80 };
const vec_uint4 shuffle_rgb0 = (vec_uint4) { 0x80010211, 0x80050613, 0x80090a15, 0x800d0e17 };
const vec_uint4 shuffle_rgb1 = (vec_uint4) { 0x80010219, 0x8005061b, 0x80090a1d, 0x800d0e1f };

And our new loop:
	vec_ushort8 shuffle_uv = shuffle_uv_0;

	for (x = 0; x < count/16; x += 1) {
		vec_short8 y = (vec_short8)yp[x];
		vec_short8 u = (vec_short8)up[x/2];
		vec_short8 v = (vec_short8)vp[x/2];
		vec_short8 y0, y1;
		vec_short8 u0, u1;
		vec_short8 v0, v1;
		vec_int4 y0odd, y0even, y1odd, y1even;
		vec_int4 u0odd, u0even, u1odd, u1even;
		vec_int4 v0odd, v0even, v1odd, v1even;
		vec_short8 r0, r1;
		vec_short8 g0, g1;
		vec_short8 b0, b1;
		vec_int4 r0odd, r0even, r1odd, r1even;
		vec_int4 g0odd, g0even, g1odd, g1even;
		vec_int4 b0odd, b0even, b1odd, b1even;
		vec_ushort8 shuffle_y = shuffle_y_0;

		shuffle_uv = (x & 1) == 0 ? shuffle_uv_0 : shuffle_uv;

		// convert 16 YUV values into 2x8 value vectors
		y0 = spu_shuffle(y, y, (vec_uchar16)shuffle_y);
		shuffle_y = spu_add(shuffle_y, 8);
		u0 = spu_shuffle(u, u, (vec_uchar16)shuffle_uv);
		v0 = spu_shuffle(v, v, (vec_uchar16)shuffle_uv);
		shuffle_uv = spu_add(shuffle_uv, 4);

		// ... repeat again for y1, u1, v1
y0, u0, and v0 (and y1, etc) now contain 8 YUV values stored as a vector. Now we have to adjust the arithmetic and use intermediate 32 bit integers - because we need the bits and because we can only do 4x16 bit multiplies.
		// pre-calculate common subexpressions
		y0odd  = spu_mulo(y0, yscale);
		y0even = spu_mule(y0, yscale);
		v0 = spu_add(v0, -128);
		u0 = spu_add(u0, -128);

		.. repeat again

		// perform basic calculation
		r0odd = spu_madd(v0, rvscale, y0odd);
		g0odd = spu_madd(v0, gvscale, y0odd);
		g0odd = spu_madd(u0, guscale, g0odd);
		b0odd = spu_madd(u0, buscale, y0odd);

		r0even = spu_mhhadd(v0, rvscale, y0even);
		g0even = spu_mhhadd(v0, gvscale, y0even);
		g0even = spu_mhhadd(u0, guscale, g0even);
		b0even = spu_mhhadd(u0, buscale, y0even);

		.. repeat again

But we don't need to move the data around to get it in the right format, as there are 2 different multiply add instructions we can use. The first we used before, but there is also a version which multiplies the high 16 bits of each word instead. Unfortunately, we still need to fix the point on all 12 values before we continue.
		// fix point
		r0odd  = spu_rlmaska(r0odd, -SHIFT);
		r0even = spu_rlmaska(r0even, -SHIFT);
		g0odd  = spu_rlmaska(g0odd, -SHIFT);
		g0even = spu_rlmaska(g0even, -SHIFT);
		b0odd  = spu_rlmaska(b0odd, -SHIFT);
		b0even = spu_rlmaska(b0even, -SHIFT);

		.. repeat

And we need an extra shuffle step - but this halves the clamping code required (this could also be done in the previous code as well).
		// convert back to shorts
		r0 = (vec_short8)spu_shuffle(r0even, r0odd, (vec_uchar16)shuffle_toshort);
		g0 = (vec_short8)spu_shuffle(g0even, g0odd, (vec_uchar16)shuffle_toshort);
		b0 = (vec_short8)spu_shuffle(b0even, b0odd, (vec_uchar16)shuffle_toshort);

		.. repeat

		// clamp to maximum
		r0 = spu_sel(r0, clamp, spu_cmpgt(r0, clamp));
		g0 = spu_sel(g0, clamp, spu_cmpgt(g0, clamp));
		b0 = spu_sel(b0, clamp, spu_cmpgt(b0, clamp));

		.. repeat

		// clamp to minimum
		r0 = spu_and(r0, spu_cmpgt(r0, 0));
		g0 = spu_and(g0, spu_cmpgt(g0, 0));
		b0 = spu_and(b0, spu_cmpgt(b0, 0));

		.. repeat

		// pack and write argb values
		argbp[x*4+0] = (vec_uint4)spu_shuffle(spu_shuffle(r0, g0, (vec_uchar16)shuffle_rg0), b0, (vec_uchar16)shuffle_rgb0);
		argbp[x*4+1] = (vec_uint4)spu_shuffle(spu_shuffle(r0, g0, (vec_uchar16)shuffle_rg1), b0, (vec_uchar16)shuffle_rgb1);

		.. repeat
	}
}

Ignoring the loads and stores which are the same, this takes significantly fewer instructions - 84 vs 116. And there should still be enough instructions for the compiler scheduler to do a decent job. Given the better mix of pipeline 0 and 1 instructions, it may even reduce the cycle count further.
So what about floats? The SPU is bloody fast at doing floating point calculations, will it actually be faster than using fixed point? We could potentially save some fixed-point rotates, but unless we change the api, we're just wasting lots of time doing numerical conversions.
#define YSCALEF  ( 1.0f    )
#define BUSCALEF ( 1.772f  )
#define GVSCALEF (-0.71414f)
#define GUSCALEF (-0.34414f)
#define RVSCALEF ( 1.402f  )

void yuv2rgb_vec(vector unsigned char *yp, vector unsigned char *up, vector unsigned char *vp, vector unsigned int *argbp, int count)
{
	int x;
	const vec_float4 yscale  = spu_splats(YSCALEF);
	const vec_float4 buscale = spu_splats(BUSCALEF);
	const vec_float4 gvscale = spu_splats(GVSCALEF);
	const vec_float4 guscale = spu_splats(GUSCALEF);
	const vec_float4 rvscale = spu_splats(RVSCALEF);
	const vec_int4 clamp = spu_splats( 255 );
	const vec_float4 half = spu_splats( 0.5f );
We just need to convert the scale factors to floats. We cannot do an immediate add for floats so we also need 0.5 as a vector. We could change the clamp factor to a float too, but see below for details. All of the YUV unpacking is the same as for the integer version, as is the RGB packing.
const vec_uint4 shuffle_y_0  = (vec_uint4) { 0x80808000, 0x80808001, 0x80808002, 0x80808003 };
const vec_uint4 shuffle_uv_0 = (vec_uint4) { 0x80808000, 0x80808000, 0x80808001, 0x80808001 };
const vec_uint4 shuffle_rg   = (vec_uint4) { 0x80031380, 0x80071780, 0x800b1b80, 0x800f1f80 };

Now the loop is:
	for (x = 0; x < count/16; x += 1) {
		vec_int4 y = (vec_int4)yp[x];
		vec_int4 u = (vec_int4)up[x/2];
		vec_int4 v = (vec_int4)vp[x/2];
		vec_float4 y0, y1, y2, y3;
		vec_float4 u0, u1, u2, u3;
		vec_float4 v0, v1, v2, v3;
		vec_float4 r0, r1, r2, r3;
		vec_float4 g0, g1, g2, g3;
		vec_float4 b0, b1, b2, b3;
		vec_uint4 shuffle_y = shuffle_y_0;
		vec_uint4 shuffle_uv = shuffle_uv_0;

		// convert 16 YUV values into 4x4 value vectors
		y0 = spu_convtf(spu_shuffle(y, y, (vec_uchar16)shuffle_y), 8);
		shuffle_y = spu_add(shuffle_y, 4);
		u0 = spu_convtf(spu_shuffle(u, u, (vec_uchar16)shuffle_uv), 8);
		v0 = spu_convtf(spu_shuffle(v, v, (vec_uchar16)shuffle_uv), 8);
		shuffle_uv = spu_add(shuffle_uv, 2);

		... repeat this 3 more times for yn, un, and vn

We need to convert the integers to floats before we can use them - a relatively expensive extra step, even if it saves us some shifting later on.
		// pre-calculate common subexpression
		y0 = spu_mul(y0, yscale);
		v0 = spu_sub(v0, half);
		u0 = spu_sub(u0, half);

		.. repeat for 1-3

		// perform basic calculation
		r0 = spu_madd(v0, rvscale, y0);
		g0 = spu_madd(v0, gvscale, y0);
		g0 = spu_madd(u0, guscale, g0);
		b0 = spu_madd(u0, buscale, y0);

		.. repeat for 1-3

Now comes a choice - do we clamp the floats, or clamp integers afterwards? If we're storing the RGB values as floats then there's no question: clamp and we're done. Maybe we could pack them into an ARGB quadword, but storing them as planes would be more efficient.
If we're going to pack them back into integers, then we could go either way. Clamping the integers saves us an instruction - otherwise we would need to subtract 1, since when we scale 1.0f back to an 8 bit integer we get 256, not 255. Although I can't really see any point to it, I'll re-pack the data anyway, just for comparison's sake. So we'll clamp at the ints.
		vec_int4 ir0, ir1, ir2, ir3;
		vec_int4 ig0, ig1, ig2, ig3;
		vec_int4 ib0, ib1, ib2, ib3;

		// back to integers
		ir0 = (vec_int4)spu_convtu(r0, 8);
		ig0 = (vec_int4)spu_convtu(g0, 8);
		ib0 = (vec_int4)spu_convtu(b0, 8);

		.. repeat for 1-3

I haven't actually bothered to compile this version, so it's probably buggy. And now it's the same as the integer version.

		// clamp to maximum
		ir0 = spu_sel(ir0, clamp, spu_cmpgt(ir0, clamp));
		ig0 = spu_sel(ig0, clamp, spu_cmpgt(ig0, clamp));
		ib0 = spu_sel(ib0, clamp, spu_cmpgt(ib0, clamp));

		.. repeat for 1-3

		// clamp to minimum
		ir0 = spu_and(ir0, spu_cmpgt(ir0, 0));
		ig0 = spu_and(ig0, spu_cmpgt(ig0, 0));
		ib0 = spu_and(ib0, spu_cmpgt(ib0, 0));

		.. repeat for 1-3

		// pack and write argb value
		argbp[x*4+0] = (vec_uint4)spu_or(ib0, spu_shuffle(ir0, ig0, (vec_uchar16)shuffle_rg));

		.. repeat for 1-3
	}
}
Clearly, with all of the extra conversions, this is just going to be slower - we not only use more instructions, some of them are slower (e.g. float converts rather than nothing or shifts).
However in reality we might want to do multiple passes on the data while it is in the spu - local accesses are very cheap, so let's make hay whilst the sun is shining. Since working with floats may simplify some of the other calculations we might want to do, we could store the values as floats here (or even fixed point integers), and only perform the byte packing once we've finished. Maybe the YUV unpacking could be done separately too, although that might just overload the load/store pipeline with all of the loads/stores and shuffles going on.
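As a rough sketch of that idea (my own illustration, not one of the versions timed below; the planar layout and function name are assumptions), the byte packing could become a separate final pass over float planes, reusing the same clamp/shuffle/or sequence as the integer version:

#include <spu_intrinsics.h>

// Sketch only: pack planar float RGB (nominally 0.0-1.0) back to ARGB in a
// separate, final pass over local store.
void pack_float_planes(vec_float4 *rp, vec_float4 *gp, vec_float4 *bp,
                       vec_uint4 *argbp, int count)
{
	const vec_uint4 shuffle_rg = (vec_uint4) { 0x80031380, 0x80071780, 0x800b1b80, 0x800f1f80 };
	const vec_int4 clamp = spu_splats(255);
	int x;

	for (x = 0; x < count/4; x++) {
		// scale 0.0-1.0 floats back to 0-256 integers
		vec_int4 r = (vec_int4)spu_convtu(rp[x], 8);
		vec_int4 g = (vec_int4)spu_convtu(gp[x], 8);
		vec_int4 b = (vec_int4)spu_convtu(bp[x], 8);

		// clamp to 0-255, as in the integer version
		r = spu_sel(r, clamp, spu_cmpgt(r, clamp));
		g = spu_sel(g, clamp, spu_cmpgt(g, clamp));
		b = spu_sel(b, clamp, spu_cmpgt(b, clamp));
		r = spu_and(r, (vec_int4)spu_cmpgt(r, 0));
		g = spu_and(g, (vec_int4)spu_cmpgt(g, 0));
		b = spu_and(b, (vec_int4)spu_cmpgt(b, 0));

		// same shuffle/or packing as before
		argbp[x] = spu_or((vec_uint4)b, (vec_uint4)spu_shuffle(r, g, (vec_uchar16)shuffle_rg));
	}
}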
I used the -S flag to gcc to dump the intermediate assembly language generated by the compiler.
spu-gcc -S -O2 -fno-inline test-yuv.c

This generates a file test-yuv.s, and then executing

<path-to-ibm-sdk>/spu_timing test-yuv.c (TO BE CHECKED)

generates a file test-yuv.s.timing, which includes static timing and pipeline results for the generated assembly language. From this file you can easily see how instructions are scheduled and executed, any data dependency stalls, and so forth.
I use -fno-inline with gcc just to help analyse the basic code - if the code is inlined things could change significantly, but we'd hope only for the better, so this should show the worst case.
yuv-int
If we just consider the main loop, we see that the scheduler in gcc has done a reasonable job with ordering, so we only get 3 data dependency stalls. I didn't bother analyzing the instructions to see if it could've done better - it often can.

So let's look at the timing and try to break it down roughly where it corresponds to the C source.
; loop book keeping/address calculation/start some char->int convertion .L4: 0D 0123 rotmi $21,$27,-31 1D 012345 lqr $19,.LC0 0D 12 andi $39,$27,1 1D 1 hbrp # 1 0D 2 nop 127 1D 234567 lqd $18,0($26) 0D 34 ai $25,$25,-1 1D 345678901234567 hbrr .L11,.L4 0 45 ceqi $20,$39,0 0 56 a $9,$21,$27 0D 67 ai $27,$27,1 1D 6789 fsm $4,$20 0d 7890 rotmai $12,$9,-1 1d -8901 shufb $11,$18,$18,$36 0D 9 nop 127 1D 9012 shufb $13,$18,$18,$37 0D 01 selb $75,$7,$19,$4 1D 0123 shufb $10,$18,$18,$38 0D 1 nop 127 1D 1 hbrp # 2 0D 2345 shli $22,$12,4 ; get stuck into the char-> int conversion, start pipelining the initial calculations 1D 2345 shufb $6,$18,$18,$35 0 34 ai $73,$75,2 0 4567890 mpy $57,$13,$32 0D 56 ai $71,$73,2 1D 5 lnop 0D 6789012 mpy $55,$10,$32 1D 678901 lqx $79,$22,$33 0D 7890123 mpy $53,$6,$32 1D 789012 lqx $70,$22,$34 0 89 ai $22,$71,2 0 9012345 mpy $51,$11,$32 1 --2345 shufb $78,$79,$79,$75 1 3456 shufb $74,$79,$79,$73 1 4567 shufb $72,$79,$79,$71 1 5678 shufb $76,$79,$79,$22 0D 67 ai $61,$78,-128 1D 6789 shufb $69,$70,$70,$75 0D 78 ai $60,$74,-128 1D 0 789 shufb $68,$70,$70,$73 0D 89 ai $59,$72,-128 1D 01 89 shufb $67,$70,$70,$71 0D 0 9 ai $58,$76,-128 1D 012 9 shufb $66,$70,$70,$22 ; finishes the conversions, now its the arithmetic of the algorithm 0 0123456 mpya $65,$61,$30,$57 0 1234567 mpya $64,$60,$30,$55 0 2345678 mpya $63,$59,$30,$53 0 3456789 mpya $62,$58,$30,$51 0 45 ai $56,$69,-128 0 56 ai $54,$68,-128 0 67 ai $52,$67,-128 0 78 ai $50,$66,-128 0 8901234 mpya $49,$56,$29,$65 0 9012345 mpya $48,$54,$29,$64 0 0123456 mpya $47,$52,$29,$63 0 1234567 mpya $17,$50,$29,$62 0 2345678 mpya $46,$61,$28,$57 0 3456789 mpya $8,$60,$28,$55 0 4567890 mpya $45,$59,$28,$53 0 5678901 mpya $44,$58,$28,$51 0 6789012 mpya $43,$56,$31,$57 0 7890123 mpya $5,$54,$31,$55 0 8901234 mpya $2,$52,$31,$53 0 9012345 mpya $3,$50,$31,$51 ; fixed-point reset (most of it) 0 0123 rotmai $14,$49,-14 0 1234 rotmai $15,$48,-14 0 2345 rotmai $16,$47,-14 0 3456 rotmai $21,$17,-14 0 4567 rotmai $18,$8,-14 0 5678 rotmai $10,$44,-14 0 6789 rotmai $9,$46,-14 0 7890 rotmai $4,$45,-14 0 8901 rotmai $7,$43,-14 0 9012 rotmai $79,$5,-14 ; start clamping ... 
0 01 cgti $42,$14,255 0 12 cgti $41,$15,255 0 23 cgti $40,$16,255 0 34 cgti $39,$21,255 0 4567 rotmai $77,$2,-14 0 5678 rotmai $75,$3,-14 0 67 selb $73,$14,$23,$42 0 78 selb $71,$15,$23,$41 0 89 selb $69,$16,$23,$40 0 90 selb $67,$21,$23,$39 0 01 cgti $19,$18,255 0 12 cgti $13,$10,255 0 23 cgti $20,$9,255 0 34 cgti $12,$4,255 0 45 selb $63,$18,$23,$19 0 56 selb $59,$10,$23,$13 0 67 selb $65,$9,$23,$20 0 78 selb $61,$4,$23,$12 0 89 cgti $6,$7,255 0 0 9 cgti $11,$79,255 0 01 cgti $74,$73,0 0 12 cgti $72,$71,0 0 23 cgti $70,$69,0 0 34 cgti $68,$67,0 0 45 cgti $78,$77,255 0 56 cgti $76,$75,255 0 67 selb $53,$7,$23,$6 0 78 selb $51,$79,$23,$11 0 89 selb $49,$77,$23,$78 0 90 selb $47,$75,$23,$76 0 01 and $58,$73,$74 0 12 and $57,$71,$72 0 23 and $56,$69,$70 0 34 and $55,$67,$68 0 45 cgti $66,$65,0 0 56 cgti $64,$63,0 0 67 cgti $62,$61,0 0 78 cgti $60,$59,0 0 89 and $46,$65,$66 0 90 and $45,$63,$64 0 01 and $44,$61,$62 0 12 and $43,$59,$60 0 2345 shli $42,$58,8 0 3456 shli $15,$57,8 0 4567 shli $16,$56,8 0 5678 shli $17,$55,8 0 67 cgti $54,$53,0 0 78 cgti $52,$51,0 0 89 cgti $50,$49,0 0 90 cgti $48,$47,0 0 01 and $41,$53,$54 0 12 and $40,$51,$52 0 23 and $39,$49,$50 0 34 and $21,$47,$48 ; form result 0 4567 shli $20,$46,16 0 5678 shli $8,$45,16 0 6789 shli $13,$44,16 0 7890 shli $10,$43,16 0 89 or $19,$41,$42 0 90 or $18,$40,$15 0 01 or $12,$39,$16 0 12 or $11,$21,$17 0 23 or $7,$19,$20 0 34 or $14,$18,$8 0D 45 or $6,$12,$13 ; write results 1D 456789 stqd $7,0($24) 0D 56 or $5,$11,$10 1D 0 56789 stqd $14,16($24) 0D 67 ai $7,$22,2 1D 01 6789 stqd $6,32($24) 1 012 789 stqd $5,48($24) 1 01 89 biz $25,$lr 0 0 9 ai $26,$26,16 0 01 ai $24,$24,64 .L11: 1 1234 br .L4 .size yuv2rgb, .-yuv2rgb(131 cycles to the loop branch)
Just for interest's sake, let's break it out. The first (and last) part is the loop book keeping and address calculation. The yp[x] addressing is calculated using a simple add, but up[x/2] requires a bunch of arithmetic - it could definitely be simplified. Strangely, the compiler is also including the branch hint inside the loop, which isn't required.
0D 0123 rotmi $21,$27,-31
0D 34 ai $25,$25,-1
1D 345678901234567 hbrr .L11,.L4
0 56 a $9,$21,$27
0D 67 ai $27,$27,1
0d 7890 rotmai $12,$9,-1
1D 1 hbrp # 2
0D 2345 shli $22,$12,4	; x/2 ready for indexing up and vp

-- rest of code --

0 0 9 ai $26,$26,16	; yp
0 01 ai $24,$24,64	; output pointer
.L11:
1 1234 br .L4

The trinary operator also frobs around a bit - it isn't bad, and is branchless, but this was a known size/speed trade-off anyway.
0D 12 andi $39,$27,1
0 45 ceqi $20,$39,0
1D 6789 fsm $4,$20
0D 01 selb $75,$7,$19,$4
The data unpacking is straightforward - it can dual issue with some of the initial arithmetic as well. But storing only 4 results per register it just has to do quite a few operations.
Now I'm going to have a look at the short version. This should be noticeably better because it halves the number of unpacking and clamping instructions required.
And not surprisingly, it is.
; loop book keeping/address calculation/start some char->int convertion .L4: 0D 3456 rotmi $56,$22,-31 1D 345678 lqd $53,0($20) 0D 45 andi $57,$22,1 1D 4 hbrp # 1 0D 5 nop 127 1D 567890 lqr $52,.LC0 0D 67 ai $18,$18,-1 1D 678901234567890 hbrr .L11,.L4 0 78 ceqi $55,$57,0 0 89 a $54,$56,$22 0D 90 ai $22,$22,1 1D 9012 fsm $51,$55 0D 0123 rotmai $50,$54,-1 1D 0123 shufb $47,$53,$53,$34 1 1234 shufb $48,$53,$53,$35 0 -34 selb $21,$21,$52,$51 0D 4567 shli $49,$50,4 1D 4 hbrp # 2 ; already getting stuck into some of the maths while still loading/format conversion 0D 5678901 mpyhh $74,$48,$27 1D 5678 shlqbyi $43,$21,0 0 6789012 mpyhh $70,$47,$27 0 78 ahi $46,$21,4 0D 8901234 mpy $11,$48,$27 1D 890123 lqx $45,$49,$32 0D 9012345 mpy $15,$47,$27 1D 901234 lqx $42,$49,$33 0D 01 ahi $21,$46,4 1D 0 lnop 0D -23 ori $10,$74,0 1D 2345 shlqbyi $5,$74,0 0D 34 ori $78,$70,0 1D 3456 shlqbyi $3,$70,0 1 4567 shufb $41,$45,$45,$46 1 5678 shufb $40,$45,$45,$43 1 6789 shufb $38,$42,$42,$46 1 0 789 shufb $39,$42,$42,$43 ; finish the maths proper 0 89 ahi $36,$41,-128 0 0 9 ahi $37,$40,-128 0 0123456 mpya $14,$36,$25,$15 0 1234567 mpya $13,$37,$25,$11 0 2345678 mpyhha $78,$36,$25 0 3456789 mpyhha $10,$37,$25 0 45 ahi $9,$38,-128 0 56 ahi $4,$39,-128 0 6789012 mpya $7,$37,$23,$11 0 7890123 mpya $6,$36,$23,$15 0 8901234 mpyhha $5,$37,$23 0 9012345 mpya $12,$4,$24,$13 0 0123456 mpya $79,$9,$24,$14 0 1234567 mpyhha $78,$9,$24 0 2345678 mpyhha $10,$4,$24 0 3456789 mpyhha $3,$36,$23 0 4567890 mpya $71,$9,$26,$15 0 5678901 mpyhha $70,$9,$26 0 6789012 mpya $75,$4,$26,$11 0 7890123 mpyhha $74,$4,$26 ; fix the point and start converting back to shorts 0 8901 rotmai $8,$5,-14 0 9012 rotmai $2,$7,-14 0 0123 rotmai $77,$12,-14 0 1234 rotmai $76,$10,-14 0D 2345 rotmai $73,$6,-14 1D 2 lnop 0D 3456 rotmai $72,$3,-14 1D 3456 shufb $62,$8,$2,$17 0D 4567 rotmai $69,$79,-14 1D 4 lnop 0D 5678 rotmai $68,$78,-14 1D 5678 shufb $60,$76,$77,$17 0D 6789 rotmai $67,$75,-14 1D 6 lnop 0D 7890 rotmai $66,$74,-14 1D 7890 shufb $58,$72,$73,$17 0D 8901 rotmai $65,$71,-14 1D 8 lnop 0D 9012 rotmai $64,$70,-14 1D 9012 shufb $56,$68,$69,$17 ; ... and start on the clamping 0D 01 cgthi $63,$62,255 1D 0 lnop 0D 12 cgthi $61,$60,255 1D 1234 shufb $54,$66,$67,$17 0D 23 cgthi $59,$58,255 1D 2 lnop 0D 34 cgthi $57,$56,255 1D 3456 shufb $52,$64,$65,$17 0 45 selb $50,$62,$19,$63 0 56 selb $48,$60,$19,$61 0 67 selb $46,$58,$19,$59 0 78 selb $44,$56,$19,$57 0 89 cgthi $55,$54,255 0 90 cgthi $53,$52,255 0 01 cgthi $51,$50,0 0 12 cgthi $49,$48,0 0 23 cgthi $47,$46,0 0 34 cgthi $45,$44,0 0 45 selb $40,$54,$19,$55 0 56 selb $36,$52,$19,$53 0 67 and $42,$50,$51 0 78 and $43,$48,$49 0D 89 and $38,$46,$47 1D 8 lnop 0D 0 9 and $39,$44,$45 ; start packing result 1D 012 9 shufb $15,$42,$43,$29 0D 01 cgthi $41,$40,0 1D 0123 shufb $13,$42,$43,$31 0D 12 cgthi $37,$36,0 1D 1234 shufb $14,$38,$39,$29 0D 23 and $7,$40,$41 1D 2345 shufb $11,$38,$39,$31 0d 34 and $12,$36,$37 1d -4567 shufb $10,$15,$7,$28 1 5678 shufb $9,$14,$12,$28 1 6789 shufb $5,$13,$7,$30 1 7890 shufb $6,$11,$12,$30 1 890123 stqd $10,16($16) 1 901234 stqd $9,48($16) 1 012345 stqd $5,0($16) 1 123456 stqd $6,32($16) 1 2345 biz $18,$lr 0 34 ai $20,$20,16 0 45 ai $16,$16,64 .L11: 1 5678 br .L4This is just 92 cycles to the loop branch, and it does the same thing.
Because we're only unpacking to half as many results, this is only half the size. There still isn't a great deal of overlapping with arithmetic though - mainly because it's waiting for the results of the address calculation/uv load before it can get stuck into it.
So there are a couple of opportunities for improvement in the code. One is the data loads - it currently loads the u and v data twice, and although the actual load isn't bad, it also has to do the address calculation twice, which increases the latency a lot. The other is the trinary operator. So let's see what happens if I just get rid of those by copying the loop contents once and fixing things up.
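Structurally, the unrolled version ends up looking something like this (a sketch only - the arithmetic body is unchanged and omitted):

	// Sketch of the unrolled structure: one u/v quadword and two y quadwords
	// per iteration, so the duplicated u/v loads, their address calculations
	// and the (x & 1) select all disappear.
	for (x = 0; x < count/32; x += 1) {
		vec_short8 u = (vec_short8)up[x];
		vec_short8 v = (vec_short8)vp[x];
		vec_short8 y = (vec_short8)yp[x*2+0];
		vec_ushort8 shuffle_uv = shuffle_uv_0;

		// .. unpack and convert the first 16 pixels from y and the first
		// half of u/v, exactly as in the loop body above ..

		y = (vec_short8)yp[x*2+1];

		// .. then the second 16 pixels from the new y and the second half
		// of u/v, continuing to step shuffle_uv rather than resetting it ..
	}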
It's an improvement - 12 cycles per 16 pixels. 238 cycles for twice the work vs 131.
If I just move the next load of y to just after the last usage of it, I get another improvement - 226 cycles total now (or 18 per 16 pixels).
I then remembered that this is the sort of thing the restrict keyword was for. So I tried just adding restrict to the definitions of all the pointers in the original version. But it made no difference to the compiled output. Either I'm using it wrong, gcc isn't taking advantage of it, or I'm misunderstanding its use.
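For reference, the change amounts to nothing more than qualifying the pointer arguments (C99 restrict, or __restrict with spu-gcc in other modes):

	// The only change tried: tell the compiler the buffers don't alias.
	void yuv2rgb_vec(vector unsigned char * restrict yp,
			 vector unsigned char * restrict up,
			 vector unsigned char * restrict vp,
			 vector unsigned int * restrict argbp,
			 int count);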
Actually I just had another thought - in my simple example I'm passing the same value to all pointers, maybe gcc is taking that into account - although afaik it shouldn't make any difference - I will have to verify.
yuv-short-2, yuv-short-2-a
And I did the same treatment for the short versions. Again, with about the same fixed amount of improvement (which is a larger relative improvement).
yuv-short-2 takes 168 cycles, and with the re-ordering of the second y load, it's down to 150.
The last thing to look at is the nasty address calculations. There's no reason the compiler should be performing (x/2<<4) on every loop iteration - it knows the full state of x throughout the loop. It has already done it for y, so why not u and v too? Well, let's see what can be done about it.
We have 4 basic housekeeping calculations required per loop (of the -2 version).
But we have these being calculated:
We can fix the y issue easily, argbp is already fine, and we can totally remove the uv address calculation as well.
The y-doing-silly-things issue should be easy to fix. The following should change it to a single add, with the array addressing changed to use immediate offsets.
	y = yp[0];
	...
	y = yp[1];
	yp += 2;
We could just do the same for u and v, but since the calculation is the same, we can use the lqx instruction to help us. That will load a value relative to the sum of two registers. We could try using array indexing and incrementing a normal int by 1 every time, but the compiler might still do some silly things, so rather than wasting time, lets just tell it what we really want.
	vec_uint4 uvoffset = spu_promote((unsigned int)0, 0);

	loop {
		u = si_lqx((qword)up, (qword)uvoffset);
		v = si_lqx((qword)vp, (qword)uvoffset);
		...
		uvoffset = spu_add(uvoffset, 16);
	}
And finally we could use any of the incrementing values for a loop condition. uvoffset is handy and requires less calculation, so use that. To make it explicitly clear to the compiler what we want to do, do it all manually directly on vector types again.
	vec_uint4 loopend = spu_rlmask(spu_promote(count, 0), -1);

	while (spu_extract(spu_cmpgt(loopend, uvoffset), 0)) {
		...
	}
Of course none of this should be necessary, but the compiler is being stubborn. So what did the compiler do with it?
Fucked it up, at least partially. It's still using 2 pointers and thus 2 sets of pointer arithmetic for the y accesses. Sigh.
0D 45 ai $38,$38,32
1D 456789 lqx $61,$53,$44
1 567890 lqd $63,0($39)
1 6 hbrp # 1
0D 78 ai $39,$39,32
1D 789012 lqd $73,-32($38)
1 890123 lqx $70,$52,$44
0 90 ai $44,$44,16
It's also done something silly with the loop - moved the loop test to the end of the loop - ok, not great but ok - but then that forces the compare and branch to run consecutively and adds another stall. Ok, it's only 1 cycle, but ...
At least the pre-loop setup is nice and compact, it's dual-issued almost everything.
0D 0123 rotmi $45,$7,-1
1D 012345678901234 hbrr .L8,.L2
0D 12 ori $39,$3,0
1D 123456 lqr $51,.LC0
0D 23 ori $52,$4,0
1D 234567 lqr $50,.LC1
0D 34 ori $53,$5,0
1D 345678 lqr $48,.LC2
0D 45 ori $30,$6,0
1D 456789 lqr $49,.LC3
0D 56 ilh $37,16384
1D 567890 lqr $31,.LC4
0D 67 ilh $36,29032
1D 678901 lqr $43,.LC5
0D 78 ilh $35,-11700
1D 789012 lqr $42,.LC6
0D 89 ilh $34,-5638
1D 890123 lqr $41,.LC7
0D 90 ilh $33,22970
1D 9 hbrp # 3
0D 01 ilh $32,255
1D 012345 lqr $40,.LC8
0D 12 il $44,0
1D 123456 lqr $47,.LC9
0D 23 ai $38,$3,16
1D 234567 lqr $46,.LC10
0D 3 nop 127
.L8:
1D 3456 br .L2
So at this point I tried a few other things and eventually had enough, so I had a guess at what would work best: use direct loads for all the loads, and let the compiler do some of the loop stuff just using ints - using a vector for the loop counter is just too unreadable anyway.
The promote/extract stuff really does nothing but keep the compiler happy, which means it just gets in the way of the user ... but it's easier than working on vectors directly the whole time.
	for (x = 0; x < count>>1; x += 16) {
		...
		y = (vec_short8)si_lqd((qword)spu_promote((unsigned int)yp, 0), 0);
		u = (vec_short8)si_lqx((qword)spu_promote((unsigned int)up, 0), (qword)spu_promote(x, 0));
		v = (vec_short8)si_lqx((qword)spu_promote((unsigned int)vp, 0), (qword)spu_promote(x, 0));
		...
		y = (vec_short8)si_lqd((qword)spu_promote((unsigned int)yp, 0), 16);
		yp += 2;
		...
	}
And hurrah - it just about did the right thing. Now it's got a nice tight load section, so that's all good - although it didn't make any difference to the execution time. And it got rid of the test-branch, or at least it tried - it moved the loop test to well before it's needed. But it moved it too early - so it still stalls 1 cycle. Oh well. By this stage I've had enough, so it'll have to do (actually I did try some manual tweaking of the assembly, but that is difficult as re-ordering or inserting instructions messes up all of the dual-issue stuff).
1 123456 lqd $62,0($39)
1 234567 lqd $59,16($39)
0D 34 ai $39,$39,32
1D 3 hbrp # 1
1 456789 lqx $76,$48,$40
1 567890 lqx $72,$50,$40
0D 67 ai $40,$40,16
1D 6 lnop
0D -89 cgt $30,$45,$40
It also made a mess of the pre-loop setup. Before it interleaved large constant loads and small ones - so they executed in half the time. Now it's got a short-circuit 'quit quickly' test (i.e. a useless one), that messes all of that up. Maybe I need to use __builtin_expect() or something - but really, the compiler should be doing this sort of shit for me. I could always make the api demand a count of > 0 and just use a do {} while loop, although who knows what the compiler will do with that.
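For what it's worth, the do {} while variant would just be (an untested sketch, assuming the API guarantees count > 0):

	// Sketch only: with count > 0 guaranteed, the loop test moves to the
	// bottom and the up-front 'quit quickly' test disappears.
	x = 0;
	do {
		// .. loop body as above ..
		x += 16;
	} while (x < count>>1);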
yuv2rgb_short:
0D 0123 rotmai $45,$7,-1
1D 0123 shlqbyi $39,$3,0
0D 12 ori $31,$6,0
1D 1 hbrp # 1
0 23 ilh $38,16384
0 34 ilh $37,29032
0 45 cgti $2,$45,0
0 56 ilh $36,-11700
0 67 ilh $35,-5638
0 78 ilh $34,22970
0D 89 ilh $33,255
1D 8901 biz $2,$lr
0D 90 ori $50,$4,0
1D 901234 lqr $52,.LC0
0D 01 ori $48,$5,0
1D 012345 lqr $51,.LC1
0D 1 nop 127
1D 1 hbrp # 2
0D 23 il $40,0
1D 234567 lqr $53,.LC2
1 345678 lqr $49,.LC3
1 456789 lqr $32,.LC4
1 567890 lqr $44,.LC5
1 678901 lqr $43,.LC6
1 789012 lqr $42,.LC7
1 890123 lqr $41,.LC8
1 901234 lqr $46,.LC9
1 012345 lqr $47,.LC10
It's not in the loop at least - so it's not the end of the world. A bit silly though. If the biz just happened near the end of all the loads it could halve this time and still avoid the unconditional branch to the loop test.
static vec_uchar16 buffer[2048];

int main()
{
	int i;

	for (i = 0; i < 100000; i++)
		yuv2rgb(buffer, buffer, buffer, buffer, 2048);
}

And I timed completed versions of the integer code above using the following command:
time elfspe ./yuv2rgb

And here are the results.
version            description                           time (s)
yuv2rgb *          4xint based throughout                0.535
yuv2rgb_2          4xint based with 1 loop unroll        0.489
yuv2rgb_2-a        yuv2rgb_2 with manual reorder         0.467
yuv2rgb_short      8xshort based where possible          0.382
yuv2rgb_short_2    8xshort based with 1 loop unroll      0.347
yuv2rgb_short_2-a  yuv2rgb_short_2 with manual reorder   0.313
yuv2rgb_short_2-b  yuv2rgb_short_2 with restrict         0.347
yuv2rgb_short-3    different loop/index logic            0.314
yuv2rgb_short-4    all manual loads                      0.304

* - this is actually a slightly older version that does shift/or instead of shuffle for forming the ARGB result - I'm too lazy to rerun the tests/compiles.

Conclusions:
All of the C source and annotated assembly listings. These are all in the public domain.
C source              Annotated Assembly
test-yuv.c            test-yuv.s.timing
test-yuv-2.c          test-yuv-2.s.timing
test-yuv-2-a.c        test-yuv-2-a.s.timing
test-yuv-short.c      test-yuv-short.s.timing
test-yuv-short-2.c    test-yuv-short-2.s.timing
test-yuv-short-2-a.c  test-yuv-short-2-a.s.timing
test-yuv-short-2-b.c  test-yuv-short-2-b.s.timing
test-yuv-short-3.c    test-yuv-short-3.s.timing
test-yuv-short-4.c    test-yuv-short-4.s.timing
Cell Broadband Engine master index page at IBM.
Note that these documents update fairly frequently, but the latest should be available at the above index if these links are stale.
C/C++ Language Extensions for Cell Broadband Architecture

Copyright 2008 Michael Zucchi, All Rights Reserved.