Floating-point face off, part 3: How we do it

This posting continues to explore the performance of floating point and how microcontrollers can efficiently execute basic floating-point operations.

So, how much do you know?

When you are using a high-level language, such as C, it’s hard to know, for sure, what the overhead of using some particular language feature will be in terms of code size or execution speed. The machine instructions that implement language features are mostly hidden from you as a programmer, buried in a library or constructed by a compiler. And for some, being blind to such detail is simply not an issue they need to be aware of.

Microcontroller applications have a space and speed budget that engineers must manage, constrained by the devices available on the market and the price you’re willing to pay. When you want things to go faster, you usually have to pay with additional code (in flash) or additional working storage (in RAM). When you want things to be smaller, you usually pay in additional processor cycles with less-efficient, but more compact, algorithms.

As an expert embedded engineer, can you estimate how many instructions it would take, on average, to add two floating-point values? What about the maximum or minimum number of cycles? Did you take into account exceptional conditions such as infinite and not-a-number inputs, division by zero, overflow, and underflow? What about the particular instruction set of the target, whether it has a barrel shifter, a multiplier, or even a divider?

When implementing floating point support functions, each of these considerations must be taken into account. And the SEGGER Runtime Library does because it is customizable, so you can balance execution speed and code size.

Add, Subtract, Multiply, Divide

From early schooldays, we are taught how to add and subtract first, then multiply, and then how to divide. To be sure, you would never be taught how to divide before being taught how to add, subtract, and multiply, right? For long division you need to understand how to subtract and multiply, and to multiply you need to know how to add.

It’s clear that division must be the most complex operation for a computer, correct? Well, you would be dead wrong. Floating-point addition and subtraction are far trickier to code than floating-point division and, should you ever try to implement these fundamental operations, you would understand why.

Floating-point division is actually very easy to implement. Here’s the kernel of a 64-bit floating divide:

uint64_t a, b, c;
//
a = 0;
for (i = 0; i < 55; ++i) {
  a <<= 1; if (b >= c) {
    b -= c;
    a |= 1;
  }
  b <<= 1;
}

With some boilerplate code before and code after to extract and pack, that’s all there is to a floating divide. Remember those trade offs? This algorithm is small, straightforward, but also slow: it needs 55 iterations to provide a quotient and rounding bit. But a variant of this is in the SEGGER Runtime Library so that a highly compact runtime is possible.

Dividing faster

Let’s consider a processor that has a division capability. To implement a single-precision division, it’s possible to use a division instruction when calculating the significand part of the floating-point quotient, something like this:

  //
  // Divide in extended precision.
  //
  q = (((uint64_t)b) << 32) / c;

This uses a 64-by-32 division, which is usually very fast, to calculate the floating quotient. Again, the SEGGER Runtime Library uses a variation of this when beneficial on the architecture.

But there are even faster ways to divide b by c. If you can quickly compute the reciprocal of c and multiply by b, then you have an almost-correct quotient. Why almost-correct? Well, consider dividing 6 by 3 using this method. The fraction 1/3 is a recurring decimal (0.33333…) and it’s also a recurring binary fraction (0.010101…) which cannot be represented exactly. When this inexact approximation is multiplied by 6, the calculated quotient is not exactly two, it’s slightly less. But for a problem there is usually a solution, and we can use some simple mathematics to provide a correction factor that is applied to the quotient and always returns an IEEE-correct answer.

So the problem now is how to quickly calculate a reciprocal? This is a problem with a wealth of supporting literature. But, in essence, it reduces to looking up an estimate of the reciprocal of c in a table using the leading bits of c‘s significand, and refining it using the remaining bits. Clearly, more entries in the table improve the accuracy of the estimate, with each additional bit taken from c doubling the size of the table but potentially reducing the time needed to correct the quotient.

So it’s a trade off again, between table size and computation. And SEGGER has more than one implementation of floating division using such an algorithm.

The SEGGER Library provides a number of ways of calculating quotients, coded in C, Arm, and RISC-V assembly language, to suit code size and execution speed.

Some closing remarks

It’s interesting to compare the size of the source code for the floating-point library. For Arm, it’s 11,500 source lines with multiple variations for each of the architectures and architectural features. The same for RISC-V RV32 is just 5,000 lines as there are far fewer architectural variations.