More Maths
It wasn't a problem with the fixed divisor of 10 but I missed that
the additions in the mulh()
routine used in the
previous post can overflow which I discovered while trying to
implement a 32 bit version
of division
using Newton's Method to find the reciprocal.
Here's 3 different versions which address it.
Using a 64-bit addition, which isn't very fast:
uint32_t mulh(uint32_t a, uint32_t b) { uint32_t ah = a >> 16; uint32_t al = a & 0xffff; uint32_t bh = b >> 16; uint32_t bl = b & 0xffff; uint32_t c = ah * bh; uint32_t d = ah * bl; uint32_t e = al * bh; uint32_t f = al * bl; uint32_t g = f >> 16; // calculate ((d + e + g) >> 16) without overflow uint32_t h = ((uint64_t)d + e + g) >> 16; return c + h; }
Breaking the addition into 16 bits and using 32-bit addition seems to be the best idea:
uint32_t mulh(uint32_t a, uint32_t b) { uint32_t ah = a >> 16; uint32_t al = a & 0xffff; uint32_t bh = b >> 16; uint32_t bl = b & 0xffff; uint32_t c = ah * bh; uint32_t d = ah * bl; uint32_t e = al * bh; uint32_t f = al * bl; uint16_t g = f >> 16; // calculate ((d + e + g) >> 16) without overflow uint32_t h = (d >> 16) + (e >> 16); uint32_t l = (d & 0xffff) + (e & 0xffff) + g; return c + h + (l >> 16); }
Detecting overflow and manually calculating the carry bits afterwards:
uint32_t mulh(uint32_t a, uint32_t b) { uint32_t ah = a >> 16; uint32_t al = a & 0xffff; uint32_t bh = b >> 16; uint32_t bl = b & 0xffff; uint32_t c = ah * bh; uint32_t d = ah * bl; uint32_t e = al * bh; uint32_t f = al * bl; uint16_t g = f >> 16; // calculate d + e + g (may overflow) uint32_t x = d + e; uint32_t y = x + g; // calculate carrys uint32_t p = ((d | e) & ~x) | (d & e); // g can be ignored because the high bits are 0 uint32_t q = x ^ y; uint32_t o = ((p >> 31) + (q >> 31)); return c + (y >> 16) + (o << 16); }
I think it should work anyway. It requires more bit manipulation than I'd expected.
C sucks
The previous version also included a subtle bug - one which only shows up in specific circumstances. It has to do with c promotion rules and what I think is pretty inconsistent results but it probably falls into the 'undefined behviour' of c.
uint16_t ah = ...; uint16_t al = ...; uint16_t bh = ...; uint16_t bl = ...; uint32_t c = ah * bh;
To calculate c
, ah
and bh
are
promoted to int
per the c specification,
not unsigned int
as you'd expect. Thus if the msb of c
is
set it's actually an overflow because the multiply is considered
signed. This can lead to some pretty weird results if the code is
optimised by the compiler:
uint16_t ah = 0xffff; uint16_t bh = 0xffff; uint32_t c = ah * bh; printf(" c = %08x\n", c); printf(" c >> 30 = %08x\n", c >> 30); printf(" c >> 31 = %08x\n", c >> 31); printf(" c < 0 = %d\n", (int32_t)c < 0); --> c = fffe0001 c >> 30 = 00000003 c >> 31 = 00000000 c < 0 = 0
I think that result is a bit inconsistent - if it's it's going to
enforce an unsigned overflow limit (i.e. 2 unsigned multiply's can't
be negative) it should do it everywhere (i.e. the result should be
truncated). But that's c I suppose. I actually filed a bug with
gcc because it's a pretty naff result but it was pointed out that
it's just a part of the promotion rules and so now I feel a bit
embarassed about doing so; particularly since only came about
because I was using uint16_t
to save some typing.
I also have to remember it's never a bug in the compiler. The last time I filed a bug for a compiler was using the Solaris cc around 26 years ago and I still feel shame when I think about it. Also just don't bother with small integers aside from storage, you never know what you're gonna get.
Division by Multiplication
Anyway this is the result of the conversion of the routine from the other blog mentioned above to 32 bits.
// Derived from 16-bit version explained properly here: // <https://blog.segger.com/algorithms-for-division-part-4-using-newtons-method/> // re16[0] should be 0xffffffff but it can lead to overflow static const uint32_t re16[16] = { 0xfffffffe, 0xf0f0f0f0, 0xe38e38e3, 0xd79435e5, 0xcccccccc, 0xc30c30c3, 0xba2e8ba2, 0xb21642c8, 0xaaaaaaaa, 0xa3d70a3d, 0x9d89d89d, 0x97b425ed, 0x92492492, 0x8d3dcb08, 0x88888888, 0x84210842, }; // esp32c3 96 cycles using mulh instruction // esp8266 216 cycles using second mulh routine above udiv_t udiv(uint32_t u, uint32_t v) { int n = __builtin_clz(v); // normalised inverse estimate uint32_t i = v << n >> 27; uint32_t r = re16[i - 16]; v <<= n; // refine via Newton's Method r = mulh(r, -mulh(r, v)) << 1; r = mulh(r, -mulh(r, v)) << 1; // quotient, denormalise, remainder uint32_t q = mulh(u, r); q >>= 31 - n; v >>= n; r = u - q * v; // check remainder overflow, 2 steps might be needed when v=1 if (r >= v) { q += 1; r -= v; if (r >= v) { q += 1; r -= v; } } return (udiv_t){ q, r }; }
Depending on the mulh()
routine used this is nearly the
same speed as div()
on an esp32c3 (which is interesting
given it has a divu
instruction), and 2.25x fater
than the div()
library call on esp8266.