Floating Point Rounding

When a floating point computation is performed, the computed result will often not be exactly equal to the 'true' result. For example, the result of multiplying the two binary numbers .1001 and .1101 together is .01110101, but if we are using floating point arithmetic with only 4 bit precision then the result must be either .01110 or .01111. The choice between these two representable results is called "rounding".

At first sight this might look pretty easy, and it is - except for a few special cases. The obvious thing to do is to choose the result which gives the least error, which we can call round to nearest. In our example we would choose .01111, because the size of the error between this and the 'true' result is .00000011, whereas the error for the other choice is .00000101.

What do we do if the error is the same for both choices? There are lots of possibilities here, including:
- always choose the larger result (round half up)
- always choose the smaller result (round half down)
- choose the result whose last bit is 0 (round to even)
- choose the result whose last bit is 1 (round to odd)

Most modern FPUs (including those in Intel 80x86 processors) will choose the even result, because this is the behaviour recommended by the IEEE standards.

Round to nearest or even

Round to nearest, along with the choice of the even result when neither choice is nearer, is called round to nearest or even. Features of this method of rounding include:

- the rounding error is never more than half a unit in the last place
- halfway cases are split evenly between rounding up and rounding down, so rounding errors tend to cancel out rather than accumulate
- it is the default rounding mode required by the IEEE standards

A property of round to nearest or even (which is not shared by some other rounding methods) is that rounding performed in two or more stages may give a different result from rounding in a single stage. Consider our example again, and imagine that we have our FPU running in a mode where it produces 5 bit precision results; in this case the correctly rounded result is .011101. Now consider what happens if we store this result as a 4 bit precision number. This requires another rounding, and the correct result (round to even) is .01110, which is different from the result (.01111) we obtained by performing the rounding to 4 bit precision in one step. In general, the result of rounding in several stages can be different from the result of rounding in one step unless either:

- every rounding after the first is exact (i.e. only zero bits are discarded), or
- the intermediate precision is sufficiently higher than the final precision that the earlier roundings cannot create a new halfway case (for the result of a single arithmetic operation, a little more than twice the final precision is enough).

The proof is left as an exercise for the reader ;-). This property has implications for architectures such as the Intel 80x86 processors. In the 'C' language on such machines, a 'double' has 53 bit precision and a 'long double' has 64 bit precision. Unlike on some other architectures, the precision of FPU results is not encoded into the arithmetic instructions themselves, but is set by separate instructions which load the FPU control word. For efficiency it is therefore normal to run the FPU at the highest required precision, which defaults to that of a 'long double'. The result of multiplying two 'double' operands (53 bit precision), for example, is therefore first rounded to the precision of a 'long double' (64 bit precision), and then rounded again to 'double' precision when it is subsequently stored in RAM. This two stage rounding means that the results of a computation on Intel machines can differ from the results produced on other architectures. Except in rare cases, the differences will only be significant for poorly designed programs.

Other rounding methods

In addition to rounding to the nearest result, there is a need for other rounding modes such as:
- round toward zero (also called truncation or chopping), where the discarded bits are simply thrown away
- round up (toward plus infinity), which always chooses the larger result
- round down (toward minus infinity), which always chooses the smaller result

These three rounding modes are provided on the Intel 80x86 architecture in addition to round to nearest or even.