not extending bitonic sort to non-powers of 2
Update: Turns out I was a bit over-optimistic here - this stuff below doesn't work, it just happened to work for a couple of cases with the data I had. Bummer. Maybe it can be made to work but maybe not, but I didn't have much luck so I gave up for now.
Original post ...
This morning I started having a look at combining the OpenCL code I have into a GA solver. Which mostly involved lots of bug fixing in the classifier code and extending it to process all classifiers and images in a single run.
All went fine till I hit the sort routine for the population fitness when I found out that the bitonic sort I wrote doesn't handle totals which are non-powers of two.
e.g. with 11 elements:
input : b a 9 8 7 6 5 4 3 2 1 output: 1 2 3 7 8 9 a b 4 5 6 --------------- +++++
Each sub-list is sorted, there is just a problem with the largest-stride merge. I noticed that this merge processes indices:
0 -> 8 1 -> 9 2 -> 10 data : 1 2 3 7 8 9 a b 4 5 6 a a a b b b
Which misses out on comparing 3-7, which are already going to be greater than 0-2.
So a simple solution is just to offset the first comparison index so that it compares indices 5-7 to indices 8-10 instead. The calculation is simply:
off = max(0, (1 << (b + 1)) - N); // (here, b=3, N=11, poff = 5)
Which results in the correct merge indices for the largest offset:
data : 1 2 3 7 8 9 a b 4 5 6 a a a b b b
So an updated version of the sort becomes:
lid = local work item id; array = base address of array; for (bit=0; (1<<bit) < N; bit++) { for (b=bit; b >= 0; b--) { // insertion masks for bits int upper = ~0 << (b+1); int lower = (1 << b)-1; // extract 'flip' indicator - 1 means use opposite sense int flip = (lid >> bit) & 1; // insert a 0 bit at bit position 'b' into the index int p0 = ((lid << 1) & upper) | (lid & lower); // insert a 1 bit at the same place int p1 = p0 | (1 << b); p0 += max(0, (1 << (b+1)) - N); swap_if(array, p0, p1, compare(p0, p1) ^ flip); } }
I haven't extensively tested it but I think it should work. It doesn't!
I'm probably not going to investigate it but I also realised that a serial-cpu version of a bitonic sort wouldn't need to frob around with bit insertions in the inner loop; it could just nest a couple of loops instead. Actually I may well end up investigating it because a similar approach would work on epiphany.