More Maths

It wasn't a problem with the fixed divisor of 10 but I missed that the additions in the mulh() routine used in the previous post can overflow which I discovered while trying to implement a 32 bit version of division using Newton's Method to find the reciprocal.

Here's 3 different versions which address it.

Using a 64-bit addition, which isn't very fast:

uint32_t mulh(uint32_t a, uint32_t b) {
        uint32_t ah = a >> 16;
        uint32_t al = a & 0xffff;
        uint32_t bh = b >> 16;
        uint32_t bl = b & 0xffff;

        uint32_t c = ah * bh;
        uint32_t d = ah * bl;
        uint32_t e = al * bh;
        uint32_t f = al * bl;
        uint32_t g = f >> 16;

        // calculate ((d + e + g) >> 16) without overflow
        uint32_t h = ((uint64_t)d + e + g) >> 16;

        return c + h;
}

Breaking the addition into 16 bits and using 32-bit addition seems to be the best idea:

uint32_t mulh(uint32_t a, uint32_t b) {
        uint32_t ah = a >> 16;
        uint32_t al = a & 0xffff;
        uint32_t bh = b >> 16;
        uint32_t bl = b & 0xffff;

        uint32_t c = ah * bh;
        uint32_t d = ah * bl;
        uint32_t e = al * bh;
        uint32_t f = al * bl;
        uint16_t g = f >> 16;

        // calculate ((d + e + g) >> 16) without overflow
        uint32_t h = (d >> 16) + (e >> 16);
        uint32_t l = (d & 0xffff) + (e & 0xffff) + g;

        return c + h + (l >> 16);
}

Detecting overflow and manually calculating the carry bits afterwards:

uint32_t mulh(uint32_t a, uint32_t b) {
        uint32_t ah = a >> 16;
        uint32_t al = a & 0xffff;
        uint32_t bh = b >> 16;
        uint32_t bl = b & 0xffff;

        uint32_t c = ah * bh;
        uint32_t d = ah * bl;
        uint32_t e = al * bh;
        uint32_t f = al * bl;
        uint16_t g = f >> 16;

        // calculate d + e + g (may overflow)
        uint32_t x = d + e;
        uint32_t y = x + g;

        // calculate carrys
        uint32_t p = ((d | e) & ~x) | (d & e);
        // g can be ignored because the high bits are 0
        uint32_t q = x ^ y;
        uint32_t o = ((p >> 31) + (q >> 31));

        return c + (y >> 16) + (o << 16);
}

I think it should work anyway. It requires more bit manipulation than I'd expected.

C sucks

The previous version also included a subtle bug - one which only shows up in specific circumstances. It has to do with c promotion rules and what I think is pretty inconsistent results but it probably falls into the 'undefined behviour' of c.

        uint16_t ah = ...;
        uint16_t al = ...;
        uint16_t bh = ...;
        uint16_t bl = ...;

        uint32_t c = ah * bh;

To calculate c, ah and bh are promoted to int per the c specification, not unsigned int as you'd expect. Thus if the msb of c is set it's actually an overflow because the multiply is considered signed. This can lead to some pretty weird results if the code is optimised by the compiler:

        uint16_t ah = 0xffff;
	uint16_t bh = 0xffff;
        uint32_t c = ah * bh;

        printf(" c       = %08x\n", c);
        printf(" c >> 30 = %08x\n", c >> 30);
        printf(" c >> 31 = %08x\n", c >> 31);
        printf(" c < 0   = %d\n", (int32_t)c < 0);
 -->
 c       = fffe0001
 c >> 30 = 00000003
 c >> 31 = 00000000
 c < 0   = 0

I think that result is a bit inconsistent - if it's it's going to enforce an unsigned overflow limit (i.e. 2 unsigned multiply's can't be negative) it should do it everywhere (i.e. the result should be truncated). But that's c I suppose. I actually filed a bug with gcc because it's a pretty naff result but it was pointed out that it's just a part of the promotion rules and so now I feel a bit embarassed about doing so; particularly since only came about because I was using uint16_t to save some typing.

I also have to remember it's never a bug in the compiler. The last time I filed a bug for a compiler was using the Solaris cc around 26 years ago and I still feel shame when I think about it. Also just don't bother with small integers aside from storage, you never know what you're gonna get.

Division by Multiplication

Anyway this is the result of the conversion of the routine from the other blog mentioned above to 32 bits.

// Derived from 16-bit version explained properly here:
// <https://blog.segger.com/algorithms-for-division-part-4-using-newtons-method/>

// re16[0] should be 0xffffffff but it can lead to overflow
static const uint32_t re16[16] = {
        0xfffffffe, 0xf0f0f0f0, 0xe38e38e3, 0xd79435e5,
        0xcccccccc, 0xc30c30c3, 0xba2e8ba2, 0xb21642c8,
        0xaaaaaaaa, 0xa3d70a3d, 0x9d89d89d, 0x97b425ed,
        0x92492492, 0x8d3dcb08, 0x88888888, 0x84210842,
};

// esp32c3   96 cycles using mulh instruction
// esp8266  216 cycles using second mulh routine above
udiv_t udiv(uint32_t u, uint32_t v) {
        int n = __builtin_clz(v);

        // normalised inverse estimate
        uint32_t i = v << n >> 27;
        uint32_t r = re16[i  - 16];

        v <<= n;

        // refine via Newton's Method
        r = mulh(r, -mulh(r, v)) << 1;
        r = mulh(r, -mulh(r, v)) << 1;

        // quotient, denormalise, remainder
        uint32_t q = mulh(u, r);

        q >>= 31 - n;
        v >>= n;
        r = u - q * v;

        // check remainder overflow, 2 steps might be needed when v=1
        if (r >= v) {
                q += 1;
                r -= v;
                if (r >= v) {
                        q += 1;
                        r -= v;
                }
        }

        return (udiv_t){ q, r };
}

Depending on the mulh() routine used this is nearly the same speed as div() on an esp32c3 (which is interesting given it has a divu instruction), and 2.25x fater than the div() library call on esp8266.

About Me

Tags

More Maths

C sucks

Division by Multiplication