asm vs c II
I dunno, I'm almost lost for words on this one.
typedef float float4 __attribute__((vector_size(16))) __attribute__((aligned(16)));

void mult4(float *mat, float4 * src, float4 * dst) {
    dst[0] = src[0] + mat[0];
}
notzed@minized:src$ make simd.o
arm-linux-gnueabihf-gcc -c -o simd.o simd.c -O3 -mcpu=cortex-a9 -marm -mfpu=neon
notzed@minized:$ arm-linux-gnueabihf-objdump -dr simd.o

simd.o:     file format elf32-littlearm

Disassembly of section .text:

00000000:
   0:   f4610aef    vld1.64   {d16-d17}, [r1 :128]
   4:   ee103b90    vmov.32   r3, d16[0]
   8:   edd07a00    vldr      s15, [r0]
   c:   e24dd010    sub       sp, sp, #16
  10:   ee063a10    vmov      s12, r3
  14:   ee303b90    vmov.32   r3, d16[1]
  18:   ee063a90    vmov      s13, r3
  1c:   ee113b90    vmov.32   r3, d17[0]
  20:   ee366a27    vadd.f32  s12, s12, s15
  24:   ee073a10    vmov      s14, r3
  28:   ee313b90    vmov.32   r3, d17[1]
  2c:   ee766aa7    vadd.f32  s13, s13, s15
  30:   ee053a90    vmov      s11, r3
  34:   ee377a27    vadd.f32  s14, s14, s15
  38:   ee757aa7    vadd.f32  s15, s11, s15
  3c:   ed8d6a00    vstr      s12, [sp]
  40:   edcd6a01    vstr      s13, [sp, #4]
  44:   ed8d7a02    vstr      s14, [sp, #8]
  48:   edcd7a03    vstr      s15, [sp, #12]
  4c:   f46d0adf    vld1.64   {d16-d17}, [sp :64]
  50:   f4420aef    vst1.64   {d16-d17}, [r2 :128]
  54:   e28dd010    add       sp, sp, #16
  58:   e12fff1e    bx        lr
notzed@minized:/export/notzed/src/raster/gl/src$
I thought that the store/load/store via the stack was a particularly cute bit of work, especially given the results were already in the right order and in adequately aligned registers. r3 also seems a little too popular.
I guess the vector extensions to gcc just aren't finished - or just don't work. Maybe I used the wrong flags or my build is broken. It produces similar junk code for the epiphany, mind you. I've never really tried using them, but after a bunch of OpenCL in the past I thought it might be worth a shot to access SIMD without machine code.
My NEON is very rusty but I think it could be something like this:
notzed@minized:src$ arm-linux-gnueabihf-objdump -dr neon-mat4.o

neon-mat4.o:     file format elf32-littlearm

Disassembly of section .text:

00000000:
   0:   f4a02caf    vld1.32   {d2[]-d3[]}, [r0]
   4:   f4210a8f    vld1.32   {d0-d1}, [r1]
   8:   f2000d42    vadd.f32  q0, q0, q1
   c:   f4020a8f    vst1.32   {d0-d1}, [r2]
  10:   e12fff1e    bx        lr
As can be seen from the names I started with a "simple" matrix multiply but whittled it down to something I thought the compiler could manage after seeing what it did to it - this is just a meaningless snippet.
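For comparison, here is a rough sketch of the same snippet written with the NEON intrinsics from arm_neon.h, a middle ground between the vector extensions and hand-written assembly. The name mult4_neon is made up for illustration, and I've used plain float pointers rather than the float4 typedef. The comments show which instruction each intrinsic is intended to correspond to; whether a given compiler actually emits exactly that sequence is an assumption I haven't checked. The scalar fallback is only there so the file still builds on a machine without NEON.

```c
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>

/* Hypothetical name: adds mat[0] to each of the four floats at src. */
void mult4_neon(const float *mat, const float *src, float *dst)
{
    float32x4_t s = vld1q_dup_f32(mat);  /* load mat[0] into all 4 lanes */
    float32x4_t v = vld1q_f32(src);      /* load 4 floats from src       */
    vst1q_f32(dst, vaddq_f32(v, s));     /* lane-wise add, then store    */
}
#else
/* Scalar fallback with the same behaviour, for non-NEON builds. */
void mult4_neon(const float *mat, const float *src, float *dst)
{
    for (int i = 0; i < 4; i++)
        dst[i] = src[i] + mat[0];
}
#endif
```

On the ARM side this would be built with the same flags as above (-O3 -mcpu=cortex-a9 -marm -mfpu=neon) and checked with objdump in the same way.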
After a pretty long day at work I was just half-heartedly poking at filling out the frontend to the epiphany gpu but got distracted by whining at the compiler again. I should've just started with NEON; after a little poking I remembered how nice it was.