A Tour of Floating Point

Basic Introduction

Skip this section if you are already familiar with the basic concepts. This section covers fixed point numbers, the significand and exponent, the sign bit and exponent bias, normalisation, and precision.

To understand this section you will need to be familiar with binary numbers.

The binary integers used in computers are an example of fixed point numbers. For the purpose of our examples in this section, we will use positive (unsigned) four-bit binary integers. Thus we can write the decimal number 5 as a binary integer:

           0101

or, with the binary point, as:

           0101.

This is a fixed point binary number because the binary point is always at the same position (at the right).

Note that there are implied zeroes to the right of the binary point, e.g.:

           0101. = 0101.000

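The fixed point reading above can be checked with a short sketch (Python here, chosen purely for illustration):

```python
# Interpret the four-bit pattern 0101 as a fixed point integer:
# the binary point sits implicitly at the right, so this is just
# an ordinary binary integer.
bits = "0101"
value = int(bits, 2)
print(value)  # -> 5

# The implied zeroes after the binary point (0101.000) add nothing:
assert value == 5
```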

To make a floating point number, more information is needed in order to specify the position of the binary point. One way to do this is to place this information in another number, for example, the decimal number 10 could be represented by the two numbers:

           0001    0101
where the first number is called the exponent and the second is called the significand. The exponent tells us where to place the binary point in the significand. In the above example we have chosen to have it mean to shift the binary point in the significand one bit to the right of its nominal position. Thus we have our significand:

           0101.

which is decimal 5, and the exponent says to shift the binary point one bit to the right, i.e. our number is:

           01010.

which is of course decimal 10.

We can write the meaning of our floating point number as:

      x = m * 2^E
where x is the value of the floating point number, m is the value of the significand, and E is the value of the exponent. The notation 2^E means "two to the power E".
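This formula translates directly into code. The following Python sketch (the name `fp_value` is ours, not from the text) evaluates x = m * 2^E for the example above:

```python
def fp_value(E: int, m: int) -> int:
    """Toy floating point value: x = m * 2**E."""
    return m * 2 ** E

# Exponent 0001 (decimal 1) and significand 0101 (decimal 5):
# the binary point shifts one place right, giving decimal 10.
print(fp_value(E=1, m=0b0101))  # -> 10
```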

To make our floating point numbers really useful, we need two modifications: we add a sign bit, so we can represent negative as well as positive numbers; and we modify the exponent so that we can shift the binary point to the left as well as to the right. Thus we arrive at

      x = (-1)^S * m * 2^(E-B)
where the new parts are the sign bit S, which can have a value 0 or 1, and the exponent bias B. Thus, for example, we might choose to reserve the first bit of our old exponent for use as the sign bit, and choose a bias equal to 3.
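With the sign bit and bias added, the decoding becomes (again a sketch; the function name is ours, and the bias constant mirrors the choice made in the text):

```python
BIAS = 3  # the exponent bias B chosen above

def fp_value(S: int, m: int, E: int, B: int = BIAS) -> float:
    """x = (-1)**S * m * 2**(E - B)."""
    return (-1) ** S * m * 2 ** (E - B)

# With S=0, m=0101 (decimal 5) and a stored exponent E=4, the bias
# leaves an effective shift of one bit right: x = 5 * 2**1 = 10.
print(fp_value(S=0, m=0b0101, E=4))  # -> 10
```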

Let's look at what the decimal number -1.625 would look like as a floating point number with our scheme. First, S has a value of 1. Then note that our decimal number written in binary notation is 1.101, therefore we have:

        S = 1
        m = 1101.
        E = 0
or, as two 4 bit words (remember that we have chosen to put the sign in the first bit of the exponent):
           1000    1101
which represents the value (using decimal numbers):
        x = (-1)^S * m * 2^(E-B)
          = (-1)^1 * 13 * 2^(0-3)
          = -13/8
          = -1.625
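The decoding of the two words can be checked mechanically. A small Python sketch (the field widths and variable names are just our reading of the scheme described in the text):

```python
# The two four-bit words 1000 1101: the first bit of the first word
# is the sign S, its remaining three bits are the exponent E, and
# the second word is the significand m. The bias B is 3.
word1, word2 = 0b1000, 0b1101

S = word1 >> 3      # sign bit: 1
E = word1 & 0b111   # exponent field: 0
m = word2           # significand: 1101 binary = 13 decimal

x = (-1) ** S * m * 2 ** (E - 3)
print(x)  # -> -1.625
```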

It is usual with floating point numbers to ensure that the most significant bit of the significand contains a '1'. A floating point number where this is true is called a normalised floating point number. Note that if floating point numbers are always normalised then it is not necessary to actually store the most significant bit; the msb is implied and we always know its value without looking at it.
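Normalisation in the toy format amounts to shifting the significand left until its most significant bit is 1, adjusting the exponent to keep the value fixed. The helper below is illustrative, not from the text:

```python
def normalise(m: int, E: int, width: int = 4):
    """Shift a non-zero significand left until its most significant
    bit is 1, decrementing the exponent so the value is unchanged."""
    msb = 1 << (width - 1)
    while not m & msb:
        m <<= 1
        E -= 1
    return m, E

# Significand 0101 with exponent 3 has the same value as 1010 with
# exponent 2 (with bias 3): 5 * 2**0 == 10 * 2**-1 == 5.
print(normalise(0b0101, 3))  # -> (10, 2)
```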

The number of bits taken by the significand (also called the mantissa), including any implied bit, is called the precision of the floating point number. Thus our simple example has a precision of 4 bits. If we modified it to be always normalised and use an implied most significant bit then it would have a precision of 5 bits.

Floating Point Numbers

A good example of floating point numbers is provided by those available on the Intel 80x86 architecture.

Floating Point Arithmetic