Quicker divmod on esp32/esp8266
The reason I looked into this is a bit silly, but in short I've come up with some code that does a divide/modulo by 10 somewhat faster than what the C compiler produces. A fairly straightforward bit-shifting implementation improves on the compiler-supplied routine on the esp8266, but that's only the start. I eventually came across some xtensa assembly language code in Linux for div/mod but I haven't investigated it; this is just looking at C.
Slow Divide
The esp8266 doesn't have a divide instruction, so division is implemented in software using a standard bit-based long division algorithm. The RISC-V cpu in the esp32c3 does have one, but it's pretty slow. Below is an example of a straightforward divmod implementation with a few basic improvements.
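The udiv32_t result type isn't shown anywhere in the post; a minimal definition consistent with how it's used below (the field names, and the quotient-then-remainder ordering implied by the positional initialisers, are my assumption) would be:

    #include <stdint.h>

    typedef struct {
        uint32_t quot;   // quotient
        uint32_t rem;    // remainder
    } udiv32_t;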
    //            udiv32  libc.div()  C / and %
    // esp32c3       290          87         80   cycles
    // esp8266       320         490        470

    udiv32_t udiv32(uint32_t num, uint32_t den) {
        uint32_t r = 0, q = 0;
        uint32_t bit = 1u << 31;

        // shortcut low numbers (and avoid 0)
        if (num < den)
            return (udiv32_t){ 0, num };
        // shortcut den == 1: the top bit taken below goes straight into r
        // without being compared against den, so it must be handled here
        if (den == 1)
            return (udiv32_t){ num, 0 };

        // skip the leading zero bits of the numerator
        int f = __builtin_clz(num);
        // we know this bit is set so take it
        bit = bit >> (f + 1);
        num = num << (f + 1);
        r = 1;

        while (bit) {
            r = (r << 1) | (num >> 31);
            num <<= 1;
            if (r >= den) {
                r -= den;
                q |= bit;
            }
            bit >>= 1;
        }
        return (udiv32_t){ q, r };
    }
Its running time depends on the inputs, but on an esp8266 it's around 50% faster than div() or simply using { num / den, num % den }. On the other hand it's about 1/4 of the speed on an esp32c3, which has divu and remu instructions (of unknown implementation and timings).
There are a few (micro) optimisations which help on most cpus:
- Using clz() to skip to the first set bit in the numerator; even if there's no clz instruction there's usually something better than testing each bit one by one;
- Moving the first iteration outside of the loop, since we're already at a set bit;
- Avoiding a separate loop counter by using the bit itself;
- Avoiding masks in the bit copy by taking the top bit from the numerator.
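For comparison, a completely plain version of the same restoring long division - no clz skip, a separate loop counter, and a mask to extract each numerator bit - would look something like this (my sketch, not code from the post):

    udiv32_t udiv32_plain(uint32_t num, uint32_t den) {
        uint32_t q = 0, r = 0;

        // walk all 32 bits of the numerator, most significant first
        for (int i = 31; i >= 0; i--) {
            r = (r << 1) | ((num >> i) & 1);   // bring down the next bit
            if (r >= den) {
                r -= den;
                q |= 1u << i;                  // this quotient bit is set
            }
        }
        return (udiv32_t){ q, r };
    }

It always runs all 32 iterations regardless of the inputs, which is exactly what the shortcuts above avoid.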
One odd thing I noticed when verifying it is that intel cpus can't shift a 32-bit value by 32 bits - shifting by the full width is undefined behaviour in C, and x86 masks the shift count to 5 bits so the shift becomes a no-op - so without the shortcut exit the result will be incorrect if num is 1. But only on intel.
Fast Multiply
I suppose that was a bit of a win for fairly simple C, but given that multiply isn't slow on these cpus it's possible to improve the results for specific denominators by pre-calculating a multiplication constant.
A fixed point division can be implemented by a multiply by the inverse followed by a shift. The implied 32-bit shift you get from a mulh instruction is the obvious choice and provides the maximum precision possible. The inverse constant is easy to calculate (using 33-bit arithmetic):
SCALE = (1 << SHIFT) / den = (1 << 32) / 10 = 429496729 = 0x19999999
Calculating the division then only requires the high 32 bits of a 32x32 multiply.
    uint32_t div10(uint32_t num) {
        return (uint64_t)num * 0x19999999 >> 32;
    }
Unfortunately if you try using this directly you find it is incorrect quite often - all non-zero multiples of 10 are out by 1 for instance, and it gets worse once num gets very large - larger than 0x40000000. This might not be a problem for some applications but I needed a correct result. I tried adding some magic bits to the multiplicand - which helped - but it can't cover every possible input and requires feedback to correct.
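A quick host-side check (mine, not from the post) makes the problem easy to see; every multiple of 10 in the loop below comes back one short:

    #include <stdint.h>
    #include <stdio.h>

    static uint32_t div10_raw(uint32_t num) {
        return (uint64_t)num * 0x19999999 >> 32;
    }

    int main(void) {
        for (uint32_t n = 10; n <= 100; n += 10)
            printf("%u / 10 -> %u (expected %u)\n",
                   (unsigned)n, (unsigned)div10_raw(n), (unsigned)(n / 10));
        return 0;
    }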
Fortunately, over the full range of inputs the quotient is at most 1 too small - a remainder overflow of no more than +5, i.e. a computed remainder of at most 15 - and since this code is also interested in the remainder, that can be used to correct the error. Thus:
    // esp32c3  33 cycles
    // esp8266 124 cycles
    udiv32_t udiv10(uint32_t num) {
        uint32_t q = (uint64_t)num * 0x19999999 >> 32;
        uint32_t r = num - q * 10;

        if (r >= 10) {
            r -= 10;
            q += 1;
        }
        return (udiv32_t){ q, r };
    }
On an esp32c3 this is nearly 3x faster than what the compiler generates using div/rem instructions, but obviously it is hardcoded to the single denominator of 10.
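The same trick works for any other fixed denominator by recomputing the constant the same way. As a sketch (my example, not from the post), dividing by 100 uses floor((1 << 32) / 100) = 0x028F5C28; since 100 * 0x028F5C28 = (1 << 32) - 96, the estimate is again short by at most 1 and the same single fix-up still works:

    // sketch: divide/modulo by 100 with the same reciprocal-multiply pattern
    udiv32_t udiv100(uint32_t num) {
        uint32_t q = (uint64_t)num * 0x028F5C28 >> 32;
        uint32_t r = num - q * 100;

        if (r >= 100) {    // quotient estimate was one short
            r -= 100;
            q += 1;
        }
        return (udiv32_t){ q, r };
    }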
An esp8266 doesn't have a mulh instruction and the compiler generates a full 64-bit intermediate result via a library call - however this is still 5x faster than div(). So the next obvious step is to try to calculate just the high bits of the multiply without retaining the rest. It turns out you can't save many operations, but the operations you can save are fairly expensive on these cpus - 64-bit addition is a bit messy without a carry flag.
Using long multiplication provides the solution:
    /*
     *                ah    al
     *                bh    bl
     *  ----------------------
     *             ah.bl al.bl
     *     + ah.bh al.bh
     *  ======================
     *        <<32  <<16
     */
    // esp32c3 40 cycles
    // esp8266 58 cycles
    udiv32_t udiv10(uint32_t num) {
        uint16_t ah = num >> 16;
        uint16_t al = num;
        uint16_t bh = 0x1999;
        uint16_t bl = 0x9999;

        // 16x16 -> 32 partial products; the casts keep the multiplies unsigned
        // (ah * bl and al * bl can overflow a signed int after promotion)
        uint32_t c = (uint32_t)ah * bh;
        uint32_t d = (uint32_t)ah * bl;
        uint32_t e = (uint32_t)al * bh;
        uint32_t f = (uint32_t)al * bl;

        // sum the middle column and carry it into the top one
        uint32_t t = d + e + (f >> 16);
        uint32_t q = c + (t >> 16);

        uint32_t r = num - q * 10;
        if (r >= 10) {
            r -= 10;
            q += 1;
        }
        return (udiv32_t){ q, r };
    }
Each 'column' in the diagram is 16 bits wide but the partial products are 32 bits, so they overlap and a few shifts are required to line everything up without losing any bits.
On an esp8266 this uses the mul16u instruction directly and ends up about twice as fast again, or around 10x faster in total than using div().
This isn't much slower on an esp32c3 either; possibly the mulhu instruction is slower than mul, but its cycle timing doesn't appear to be publicly documented and I haven't tried timing it as yet. I'm also timing call overheads, which are significant at this range of cycle times (17-22 cycles).
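As a usage example (mine, not from the post), a combined quotient/remainder by 10 is exactly what decimal formatting wants; this assumes the udiv32_t field names from the definition above:

    // convert num to decimal text in buf (which needs at least 11 bytes),
    // returning the number of digits written
    int format_u32(uint32_t num, char *buf) {
        char tmp[10];
        int n = 0;

        do {
            udiv32_t d = udiv10(num);
            tmp[n++] = '0' + d.rem;    // least significant digit first
            num = d.quot;
        } while (num != 0);

        for (int i = 0; i < n; i++)    // reverse into the output buffer
            buf[i] = tmp[n - 1 - i];
        buf[n] = 0;
        return n;
    }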
A bit more juice
This can be improved a tiny bit more. It turns out, for this case, that a slight adjustment to the multiplier improves the accuracy of the initial result significantly. Adding 1 to the multiplier provides an exact result for every num from 0 to a bit over 0x40000000. This is quite a bit better but it still needs correction, and the error is now in the other direction - the remainder can go negative.
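In the notation used earlier, this is just the inverse rounded up instead of down:

SCALE = ((1 << 32) + 10 - 1) / 10 = 429496730 = 0x1999999a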
The mulhu friendly version:
    // esp32c3  30 cycles
    // esp8266 123 cycles
    udiv32_t udiv10(uint32_t num) {
        uint32_t q = (uint64_t)num * 0x1999999a >> 32;
        uint32_t r = num - q * 10;

        if ((int32_t)r < 0) {
            r += 10;
            q -= 1;
        }
        return (udiv32_t){ q, r };
    }
Apart from being correct more often, the negative test is cheaper on many cpus since it doesn't require a comparison against a constant first. Two's complement arithmetic means the sign bit still works as a test on an ``unsigned'' number.
The mul16u friendly version:
    // esp32c3 37 cycles
    // esp8266 56 cycles
    udiv32_t udiv10(uint32_t num) {
        uint16_t ah = num >> 16;
        uint16_t al = num;
        uint16_t bh = 0x1999;
        uint16_t bl = 0x999a;

        // casts as before, to keep the 16x16 multiplies unsigned
        uint32_t c = (uint32_t)ah * bh;
        uint32_t d = (uint32_t)ah * bl;
        uint32_t e = (uint32_t)al * bh;
        uint32_t f = (uint32_t)al * bl;

        uint32_t t = d + e + (f >> 16);
        uint32_t q = c + (t >> 16);

        uint32_t r = num - q * 10;
        if ((int32_t)r < 0) {
            r += 10;
            q -= 1;
        }
        return (udiv32_t){ q, r };
    }
The timings were taken from the cpu cycle counters but are approximate, as they depend either a lot (for the shift implementation) or a tiny bit on the inputs. They also include the overhead of a function call; on these systems code size is important so it would probably be called as a function anyway, but for comparison this is the timing of a do-nothing function tested the same way.
    // esp32c3 17 cycles
    // esp8266 22 cycles
    udiv32_t udiv_null(uint32_t num) {
        return (udiv32_t){ num, num };
    }
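For reference, this is roughly how such timings can be taken (my sketch, not the author's harness); read_ccount() here reads the xtensa CCOUNT cycle register, and on other targets you would substitute whatever cycle counter the platform provides:

    // placeholder cycle-counter read: xtensa only, substitute your own elsewhere
    static inline uint32_t read_ccount(void) {
    #if defined(__xtensa__)
        uint32_t c;
        __asm__ __volatile__("rsr %0, ccount" : "=r"(c));
        return c;
    #else
        return 0;   // TODO: platform-specific cycle counter goes here
    #endif
    }

    uint32_t time_udiv10(uint32_t num) {
        uint32_t start = read_ccount();
        volatile udiv32_t d = udiv10(num);   // volatile so the call isn't optimised out
        uint32_t end = read_ccount();
        (void)d;
        return end - start;                  // includes the call overhead, cf. udiv_null
    }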
Not sure it was the best use of a whole day but there you have it.
Well I did mow the lawn.