Quicker divmod on esp32/esp8266
The reason I looked into this is a bit silly, but in short I've come up with some code that does a divide/modulo by 10 somewhat faster than what the C compiler produces. A fairly straightforward bit-shifting implementation improves on the compiler-supplied routine on the esp8266, but that's only the start. I eventually came across some xtensa assembly language code in Linux for div/mod but I haven't investigated it; this is just looking at C.
Slow Divide
The esp8266 doesn't have a divide instruction, so division is implemented in software using a standard bit-based long division algorithm. The RISC-V cpu in the esp32c3 does have one, but it's pretty slow. Below is an example of a straightforward divmod implementation with a few basic improvements.
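The udiv32_t result type isn't shown anywhere in the post; a minimal definition consistent with how it's used below (the field names, and the quotient-then-remainder ordering implied by the positional initialisers, are my assumption) would be:

    #include <stdint.h>

    typedef struct {
        uint32_t quot;   // quotient
        uint32_t rem;    // remainder
    } udiv32_t;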
    //            udiv32  libc.div()  C / and %
    // esp32c3       290          87         80   cycles
    // esp8266       320         490        470

    udiv32_t udiv32(uint32_t num, uint32_t den) {
        uint32_t r = 0, q = 0;
        uint32_t bit = 1u << 31;

        // shortcut low numbers (and avoid 0)
        if (num < den)
            return (udiv32_t){ 0, num };
        // shortcut den == 1: the top bit taken below goes straight into r
        // without being compared against den, so it must be handled here
        if (den == 1)
            return (udiv32_t){ num, 0 };

        // skip the leading zero bits of the numerator
        int f = __builtin_clz(num);
        // we know this bit is set so take it
        bit = bit >> (f + 1);
        num = num << (f + 1);
        r = 1;

        while (bit) {
            r = (r << 1) | (num >> 31);
            num <<= 1;
            if (r >= den) {
                r -= den;
                q |= bit;
            }
            bit >>= 1;
        }
        return (udiv32_t){ q, r };
    }
Its running time depends on the inputs, but on an esp8266 it's around 50% faster than div() or simply using { num / den, num % den }. On the other hand it's about 1/4 of the speed on an esp32c3, which has divu and remu instructions (of unknown implementation and timings).
There are a few (micro) optimisations which help on most cpus:
- Using clz() to skip to the first set bit in the numerator; even if there's no clz instruction there's usually something better than testing each bit one by one;
- Moving the first iteration outside of the loop, since we're already at a set bit;
- Avoiding a separate loop counter by using the bit itself;
- Avoiding masks in the bit copy by taking the top bit from the numerator.
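For comparison, a completely plain version of the same restoring long division - no clz skip, a separate loop counter, and a mask to extract each numerator bit - would look something like this (my sketch, not code from the post):

    udiv32_t udiv32_plain(uint32_t num, uint32_t den) {
        uint32_t q = 0, r = 0;

        // walk all 32 bits of the numerator, most significant first
        for (int i = 31; i >= 0; i--) {
            r = (r << 1) | ((num >> i) & 1);   // bring down the next bit
            if (r >= den) {
                r -= den;
                q |= 1u << i;                  // this quotient bit is set
            }
        }
        return (udiv32_t){ q, r };
    }

It always runs all 32 iterations regardless of the inputs, which is exactly what the shortcuts above avoid.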
One odd thing I noticed when verifying it is that intel cpus can't shift a 32-bit value by 32 bits - shifting by the full width is undefined behaviour in C, and x86 masks the shift count to 5 bits so the shift becomes a no-op - so without the shortcut exit the result will be incorrect if num is 1. But only on intel.
Fast Multiply
I suppose that was a bit of a win for fairly simple C, but given that multiply isn't slow on these cpus it's possible to improve the results for specific denominators by pre-calculating a multiplication constant.
A fixed point division can be implemented by a multiply by the inverse followed by a shift. The implied 32-bit shift you get from a mulh instruction is the obvious choice and provides the maximum precision possible. The inverse constant is easy to calculate (using 33-bit arithmetic):
SCALE = (1 << SHIFT) / den = (1 << 32) / 10 = 429496729 = 0x19999999
Calculating the division then only requires the high 32 bits of a 32x32 multiply.
    uint32_t div10(uint32_t num) {
        return (uint64_t)num * 0x19999999 >> 32;
    }
Unfortunately if you try using this directly you find it is incorrect quite often - all non-zero multiples of 10 are out by 1 for instance, and it gets worse once num gets very large - larger than 0x40000000. This might not be a problem for some applications but I needed a correct result. I tried adding some magic bits to the multiplicand - which helped - but it can't cover every possible input and requires feedback to correct.
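A quick host-side check (mine, not from the post) makes the problem easy to see; every multiple of 10 in the loop below comes back one short:

    #include <stdint.h>
    #include <stdio.h>

    static uint32_t div10_raw(uint32_t num) {
        return (uint64_t)num * 0x19999999 >> 32;
    }

    int main(void) {
        for (uint32_t n = 10; n <= 100; n += 10)
            printf("%u / 10 -> %u (expected %u)\n",
                   (unsigned)n, (unsigned)div10_raw(n), (unsigned)(n / 10));
        return 0;
    }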
Fortunately, over the full range of inputs the quotient is at most 1 too small - a remainder overflow of no more than +5, i.e. a computed remainder of at most 15 - and since this code is also interested in the remainder, that can be used to correct the error. Thus:
    // esp32c3  33 cycles
    // esp8266 124 cycles
    udiv32_t udiv10(uint32_t num) {
        uint32_t q = (uint64_t)num * 0x19999999 >> 32;
        uint32_t r = num - q * 10;

        if (r >= 10) {
            r -= 10;
            q += 1;
        }
        return (udiv32_t){ q, r };
    }
On an esp32c3 this is nearly 3x faster than what the compiler generates using div/rem instructions, but obviously it is hardcoded to the single denominator of 10.
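The same trick works for any other fixed denominator by recomputing the constant the same way. As a sketch (my example, not from the post), dividing by 100 uses floor((1 << 32) / 100) = 0x028F5C28; since 100 * 0x028F5C28 = (1 << 32) - 96, the estimate is again short by at most 1 and the same single fix-up still works:

    // sketch: divide/modulo by 100 with the same reciprocal-multiply pattern
    udiv32_t udiv100(uint32_t num) {
        uint32_t q = (uint64_t)num * 0x028F5C28 >> 32;
        uint32_t r = num - q * 100;

        if (r >= 100) {    // quotient estimate was one short
            r -= 100;
            q += 1;
        }
        return (udiv32_t){ q, r };
    }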
An esp8266 doesn't have a mulh instruction and the compiler generates a full 64-bit intermediate result via a library call - however this is still 5x faster than div(). So the next obvious step is to try to calculate just the high bits of the multiply without retaining the rest. It turns out you can't save many operations, but the operations you can save are fairly expensive on these cpus - 64-bit addition is a bit messy without a carry flag.
Using long multiplication provides the solution:
    /*
     *                ah    al
     *                bh    bl
     *  ----------------------
     *             ah.bl al.bl
     *     + ah.bh al.bh
     *  ======================
     *        <<32  <<16
     */
    // esp32c3 40 cycles
    // esp8266 58 cycles
    udiv32_t udiv10(uint32_t num) {
        uint16_t ah = num >> 16;
        uint16_t al = num;
        uint16_t bh = 0x1999;
        uint16_t bl = 0x9999;

        // 16x16 -> 32 partial products; the casts keep the multiplies unsigned
        // (ah * bl and al * bl can overflow a signed int after promotion)
        uint32_t c = (uint32_t)ah * bh;
        uint32_t d = (uint32_t)ah * bl;
        uint32_t e = (uint32_t)al * bh;
        uint32_t f = (uint32_t)al * bl;

        // sum the middle column and carry it into the top one
        uint32_t t = d + e + (f >> 16);
        uint32_t q = c + (t >> 16);

        uint32_t r = num - q * 10;
        if (r >= 10) {
            r -= 10;
            q += 1;
        }
        return (udiv32_t){ q, r };
    }
Each 'column' in the diagram is 16 bits wide but the partial products are 32 bits, so they overlap and a few shifts are required to line everything up without losing any bits.
On an esp8266 this uses the mul16u instruction directly and ends up about twice as fast again, or around 10x faster in total than using div().
This isn't much slower on an esp32c3 either; possibly the mulhu instruction is slower than mul, but its cycle timing doesn't appear to be publicly documented and I haven't tried timing it as yet. I'm also timing call overheads, which are significant at this range of cycle times (17-22 cycles).
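As a usage example (mine, not from the post), a combined quotient/remainder by 10 is exactly what decimal formatting wants; this assumes the udiv32_t field names from the definition above:

    // convert num to decimal text in buf (which needs at least 11 bytes),
    // returning the number of digits written
    int format_u32(uint32_t num, char *buf) {
        char tmp[10];
        int n = 0;

        do {
            udiv32_t d = udiv10(num);
            tmp[n++] = '0' + d.rem;    // least significant digit first
            num = d.quot;
        } while (num != 0);

        for (int i = 0; i < n; i++)    // reverse into the output buffer
            buf[i] = tmp[n - 1 - i];
        buf[n] = 0;
        return n;
    }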
A bit more juice
This can be improved a tiny bit more. It turns out, for this case, that a slight adjustment to the multiplier improves the accuracy of the initial result significantly. Adding 1 to the multiplier provides an exact result for every num from 0 to a bit over 0x40000000. This is quite a bit better but it still needs correction, and the error is now in the other direction - the remainder can go negative.
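In the notation used earlier, this is just the inverse rounded up instead of down:

SCALE = ((1 << 32) + 10 - 1) / 10 = 429496730 = 0x1999999a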
The mulhu friendly version:
    // esp32c3  30 cycles
    // esp8266 123 cycles
    udiv32_t udiv10(uint32_t num) {
        uint32_t q = (uint64_t)num * 0x1999999a >> 32;
        uint32_t r = num - q * 10;

        if ((int32_t)r < 0) {
            r += 10;
            q -= 1;
        }
        return (udiv32_t){ q, r };
    }
Apart from being correct more often, the negative test is cheaper on many cpus since it doesn't require a comparison against a constant first. Two's complement arithmetic means the sign bit still works as a test on an ``unsigned'' number.
The mul16u friendly version:
    // esp32c3 37 cycles
    // esp8266 56 cycles
    udiv32_t udiv10(uint32_t num) {
        uint16_t ah = num >> 16;
        uint16_t al = num;
        uint16_t bh = 0x1999;
        uint16_t bl = 0x999a;

        // casts as before, to keep the 16x16 multiplies unsigned
        uint32_t c = (uint32_t)ah * bh;
        uint32_t d = (uint32_t)ah * bl;
        uint32_t e = (uint32_t)al * bh;
        uint32_t f = (uint32_t)al * bl;

        uint32_t t = d + e + (f >> 16);
        uint32_t q = c + (t >> 16);

        uint32_t r = num - q * 10;
        if ((int32_t)r < 0) {
            r += 10;
            q -= 1;
        }
        return (udiv32_t){ q, r };
    }
The timings were taken from the cpu cycle counters but are approximate, as they depend either a lot (for the shift implementation) or a tiny bit on the inputs. They also include the overhead of a function call; on these systems code size is important so it would probably be called as a function anyway, but for comparison this is the timing of a do-nothing function tested the same way.
    // esp32c3 17 cycles
    // esp8266 22 cycles
    udiv32_t udiv_null(uint32_t num) {
        return (udiv32_t){ num, num };
    }
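For reference, this is roughly how such timings can be taken (my sketch, not the author's harness); read_ccount() here reads the xtensa CCOUNT cycle register, and on other targets you would substitute whatever cycle counter the platform provides:

    // placeholder cycle-counter read: xtensa only, substitute your own elsewhere
    static inline uint32_t read_ccount(void) {
    #if defined(__xtensa__)
        uint32_t c;
        __asm__ __volatile__("rsr %0, ccount" : "=r"(c));
        return c;
    #else
        return 0;   // TODO: platform-specific cycle counter goes here
    #endif
    }

    uint32_t time_udiv10(uint32_t num) {
        uint32_t start = read_ccount();
        volatile udiv32_t d = udiv10(num);   // volatile so the call isn't optimised out
        uint32_t end = read_ccount();
        (void)d;
        return end - start;                  // includes the call overhead, cf. udiv_null
    }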
Not sure it was the best use of a whole day but there you have it.
Well I did mow the lawn.