Introduction

This article gives the basic definitions and conventions needed for the subsequent description of binary numbers in the IEEE-754 floating-point format.

Many applied scientific calculations tolerate some inaccuracy, so exact values are usually unnecessary. For example, a builder hardly ever checks that the diagonal of a square stone foundation of side $a$ equals $a\sqrt2$; it is enough to make sure that the value is close to $1.4a$. Some areas require higher accuracy than others, but the general idea is that we need to know only the first few significant digits of a number and its order of magnitude. For instance, 5 km is $5.0\times10^3$ m. The order 3 indicates the scale of the number (thousands of meters). Distances in molecular physics have orders of $-9$, $-10$ and smaller, whereas in interplanetary calculations the orders reach $+9$, $+10$ and larger.

The accuracy of our calculations depends on the number of leading significant digits we keep. Some calculations require high precision (ten or more decimal digits), whereas for others rough rounding to two or three digits is enough.

The floating-point format is widely used to store approximate values of this kind. In this format each number is given by

$$z=m\times \beta^e,$$

where $m$ is the mantissa (significand), $\beta$ is the base, and $e$ is the exponent (order).

The mantissa determines the accuracy of a number, while the exponent shows how large or how small the number is. As a rule, $\beta \geq 2$ is an integer, and the mantissa and exponent are written in base $\beta$, although this is not always the case. For more information about this notation, see “Scientific Notation”. Before reading further, please make sure that you fully understand that article.

Since most modern computers are binary, it is convenient to take the base $\beta$ equal to 2. This is not the only reason for the choice, however: it has been proved that the base $\beta=2$ gives a smaller average rounding error than any other base. Nevertheless, $\beta=10$ is used in some cases, but these cases are not covered in our series of articles, which is dedicated to binary numbers. From now on we assume $\beta=2$ unless otherwise specified.

Binary floating point numbers

So, any number $z$ can be represented in binary floating-point notation:

$$z=m\times 2^e.$$

To avoid non-uniqueness, let us assume that our numbers are written in scientific notation, that is, $1\leq|m|<2$. As mentioned above, the article “Scientific Notation” contains the details and important remarks on this kind of representation.
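As an illustration of the normalization condition $1\leq|m|<2$, here is a minimal Python sketch. The function name `normalize` is hypothetical and chosen only for this example; it relies on the standard `math.frexp`, which returns a fraction in $[0.5, 1)$, so the result is shifted by one binary place. It operates on ordinary machine doubles, not on exact real numbers.

```python
import math


def normalize(z: float) -> tuple[float, int]:
    """Return (m, e) such that z == m * 2**e and 1 <= |m| < 2.

    Illustrative helper: zero and non-finite values have no such form.
    """
    if z == 0 or not math.isfinite(z):
        raise ValueError("only nonzero finite numbers can be normalized")
    f, k = math.frexp(z)      # z == f * 2**k with 0.5 <= |f| < 1
    return f * 2.0, k - 1     # move the binary point one place to the right


m, e = normalize(3.14)
print(m, e)                   # 1.57 1
print(normalize(-6.0))        # (-1.5, 2)
```

Multiplying the fraction by 2 is exact in binary arithmetic, so no accuracy is lost in the adjustment.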

Hence all numbers except zero have the form $1.xxx\ldots\times2^e$. This “rule” will be broken when we come to “denormal” numbers, but for now you may forget this word.

The number 0 cannot be represented this way, which is why we put it into a separate category and consider 0 a special number. Later we will learn about other special numbers.

For now we will not discuss the algorithm for converting a decimal number, for example 3.14, into the binary scientific notation $1.100100011110\ldots\times2^1$; let us simply suppose that we are able to do it somehow. A detailed description of these algorithms will be given in further articles. This example also shows why such numbers are called floating-point numbers: the point really “floats” between the digits until it settles between the first significant digit and the rest.
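Without anticipating the conversion algorithms of the later articles, the digits quoted for 3.14 can at least be checked with a naive sketch. It uses exact rational arithmetic from the standard `fractions` module and repeated doubling; the function name `binary_mantissa_digits` is hypothetical, and this is not meant as the conversion method the series will present.

```python
from fractions import Fraction


def binary_mantissa_digits(z: Fraction, n: int) -> tuple[str, int]:
    """Return the normalized mantissa of z > 0 with n fractional binary
    digits (truncated, not rounded) and the binary exponent e."""
    if z <= 0:
        raise ValueError("z must be positive")
    e = 0
    while z >= 2:             # scale down until z < 2
        z /= 2
        e += 1
    while z < 1:              # scale up until z >= 1
        z *= 2
        e -= 1
    frac = z - 1              # fractional part of the mantissa
    digits = []
    for _ in range(n):
        frac *= 2             # the next binary digit is the integer part
        bit = int(frac >= 1)
        digits.append(str(bit))
        frac -= bit
    return "1." + "".join(digits), e


print(binary_mantissa_digits(Fraction("3.14"), 12))
# ('1.100100011110', 1)
```

Because 3.14 = 157/50 and 50 is not a power of two, the fractional part never becomes zero, which is why the binary expansion does not terminate.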

In this example we would need an infinite number of significant digits to write the full binary representation of 3.14. In many applied calculations the number of significant digits can also be very large, though not infinite. However, the memory of any computer is limited, so we cannot store as many significant digits as we would like. Likewise, we cannot allow the exponent to take arbitrary values. That is why we have to impose some constraints. Let us assume that the mantissa has up to $p$ binary digits and that the exponent lies in the range $e_{min}\leq e \leq e_{max}$.

Let us construct an illustrative example, which will be used in this article and in some later ones. Let $p=3$, $e_{min}=-1$, $e_{max}=2$. All the numbers that we can represent under these constraints can be marked on a number axis (the figure shows only the nonnegative half-axis).

The first positive number that can be represented in scientific notation under these constraints is $1.00\times2^{-1}=0.5$. It is shown as a long mark labeled “0.5” at the left of the figure. It is followed by three short marks, which correspond to the numbers $1.01\times2^{-1}=0.625$, $1.10\times2^{-1}=0.75$ and $1.11\times2^{-1}=0.875$. These numbers divide the range between 0.5 and 1 into 4 equal parts with step 0.125.

Next, for $e=0$, the smallest number with this exponent is $1.00\times2^0=1$. The three numbers that follow are spaced 0.25 apart and divide the range between 1 and 2 into 4 equal parts. Then, in the same manner, come the numbers 2 ($e=1$) and 4 ($e=2$), each followed by intermediate numbers with a uniform step (0.5 and 1 respectively). The number 0, as was said, is a special number.

If we widen the range $[e_{min}, e_{max}]$, the accuracy within the existing intervals does not change, but the number of intervals increases. For example, if we keep $p=3$ but set $e_{max}=3$, the half-axis is extended by the new numbers from 8 to 14 with step 2: 8, 10, 12 and 14.
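The toy system can also be enumerated directly, which makes the uniform step inside each interval and its doubling from one interval to the next easy to see. The following Python sketch is only an illustration; the function name `toy_floats` and its parameters are assumptions made for this example, and zero and denormal numbers are deliberately left out.

```python
from fractions import Fraction


def toy_floats(p: int, e_min: int, e_max: int) -> list[Fraction]:
    """List all positive normalized numbers m * 2**e, where m = 1.xx...x
    has p binary digits and e_min <= e <= e_max."""
    values = []
    for e in range(e_min, e_max + 1):
        for k in range(2 ** (p - 1)):            # fractional mantissa bits
            m = 1 + Fraction(k, 2 ** (p - 1))    # 1.00, 1.01, 1.10, 1.11 for p = 3
            values.append(m * Fraction(2) ** e)
    return values


print([float(v) for v in toy_floats(3, -1, 2)])
# [0.5, 0.625, 0.75, 0.875, 1.0, 1.25, 1.5, 1.75,
#  2.0, 2.5, 3.0, 3.5, 4.0, 5.0, 6.0, 7.0]

print([float(v) for v in toy_floats(3, 3, 3)])
# [8.0, 10.0, 12.0, 14.0]
```

The second call lists only the numbers with $e=3$, that is, the ones gained by raising $e_{max}$ from 2 to 3.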