Computer Representation of Floating Point Numbers

Using a simple example, we illustrate the logic of the IEEE-754 Standard, which is used for the binary representation of floating point numbers.

In a computer, all data are stored as sequences of bits, which yield the intended information when interpreted according to pre-specified rules. To represent a floating point number as a sequence of bits, we must fix the rules of that representation. The most widely used method is the one defined in the IEEE-754 Standard. In this article we return to our example with $p=3$, $e_{min}=-1$ and $e_{max}=2$ and consider how to store such floating point numbers following the IEEE-754 logic. The Standard itself does not define a format with these parameters, but a small example demonstrates the encoding logic and makes it easier to move on to the data types that the Standard does describe.

First, a number may be positive or negative, so we need a single bit to store the sign. Let us call this bit $s$. According to the Standard, $s=0$ means the number is positive and $s=1$ means it is negative. This is convenient, since the sign of the number is given by the formula $(-1)^s$.

Second, we need to reserve a few bits for the mantissa. In our example the mantissa consists of three bits, but normalized numbers always begin with 1, so this implicit leading bit does not need to be stored; only the 2 bits of the fractional part of the mantissa are stored explicitly. Later in this article we will see how denormal numbers are stored.

Third, we need to store the exponent of the number. The situation here is more involved, so read carefully; some of what is said now will only become clear later. Let us reserve 3 bits for the exponent in our example. The exponent can be positive or negative, so a natural question arises: how do we encode negative values? We could write them in two's complement, as is done for integers, but there is a better solution, and the reason for it will be explained later.

The Standard stores the so-called biased exponent $e_b$, defined by the formula $e_b=e+\mathrm{bias}$. The value of $\mathrm{bias}$ is chosen so that the biased exponent $e_b$ of a normal number is always strictly positive. Further, as mentioned before, the Standard defines its data types so that the exponent range satisfies the symmetry condition $e_{min}=1-e_{max}$. In this case, if the exponent field has $E$ bits, the exponent limits prescribed by the Standard are $e_{min}=-(2^{E-1}-2)$ and $e_{max}=2^{E-1}-1$, with $\mathrm{bias}=e_{max}$.
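
As a rough illustration of these formulas, the following Python sketch computes the exponent limits and the bias for a given exponent field width. The function name `exponent_range` is our own choice and not something defined by the Standard.

```python
def exponent_range(E):
    """Return (e_min, e_max, bias) for an exponent field of E bits."""
    e_max = 2 ** (E - 1) - 1       # largest exponent of a normal number
    e_min = 1 - e_max              # symmetry condition: e_min = 1 - e_max
    bias = e_max                   # the bias equals e_max
    return e_min, e_max, bias


# Field widths: 3 bits (toy example), 8 bits (binary32), 11 bits (binary64).
for E in (3, 8, 11):
    print(E, exponent_range(E))
```

For $E=3$ this gives the range from −2 to 3 with a bias of 3; for the 8-bit and 11-bit fields of the single and double precision formats it gives the familiar biases 127 and 1023.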

Let us illustrate this with our example. We have $E=3$, so $e_{min}=-(4-2)=-2$, $e_{max}=4-1=3$ and $\mathrm{bias}=3$. It turns out that we can extend our example by two exponents without going beyond the given 3 bits, so the exponent range now spreads from −2 to 3.

[Figure: the number axis of the extended example]

So, we have allocated 1 bit for the sign, 3 bits for the exponent and 2 bits for the fractional part of the mantissa. Altogether we have 6 bits, divided into three fields. The Standard places these fields one after another:

[Figure: layout of the 6-bit format]

Bit number 5 (blue) holds the sign, bits 2 to 4 (green) hold the biased exponent, and bits 0 to 1 (red) hold the fractional part of the mantissa.

Let us consider a few examples of representable numbers.

The minimum positive normal number is $1.00_{(2)}\times 2^{-2}=0.25$. It is positive, so the sign bit is 0. The biased exponent is $e_b=-2+3=1$, and the fractional part of the mantissa is 00. Thus the number 0.25 is encoded as

0 001 00 = 0.25

The maximum number is $1.11_{(2)}\times 2^{3}=14$. The biased exponent is $e_b=3+3=6$, and the fractional part of the mantissa is 11. Thus we get the bit pattern:

0 110 11 = 14

The number $-1.25=-1.01_{(2)}\times 2^{0}$ is written as

1 011 01 = −1.25
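
The three encodings above can be checked with a short Python sketch. It assumes the 6-bit toy layout of this article (1 sign bit, 3 exponent bits, 2 fraction bits, bias 3) and handles normal numbers only; the helper name `decode_normal` is hypothetical.

```python
BIAS = 3          # bias of the toy format (E = 3 exponent bits)
FRAC_BITS = 2     # explicitly stored bits of the mantissa


def decode_normal(bits):
    """Decode a 6-bit pattern of the toy format, normal numbers only."""
    s = (bits >> 5) & 0b1          # sign bit (bit 5)
    e_b = (bits >> 2) & 0b111      # biased exponent (bits 2 to 4)
    f = bits & 0b11                # fractional part of the mantissa (bits 0 to 1)
    if e_b in (0, 7):
        raise ValueError("e_b = 0 and e_b = 7 do not encode normal numbers")
    mantissa = 1 + f / 2 ** FRAC_BITS   # restore the implicit leading 1
    return (-1) ** s * mantissa * 2.0 ** (e_b - BIAS)


print(decode_normal(0b0_001_00))   # 0.25
print(decode_normal(0b0_110_11))   # 14.0
print(decode_normal(0b1_011_01))   # -1.25
```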

An attentive reader will have noticed that we use only 6 different exponents, although 3 bits can store 8 values. A natural question arises: what happened to the two remaining values, $e_b=000_{(2)}$ and $e_b=111_{(2)}$?

The two extreme values $e_b=0$ and $e_b=7$ are not used to encode the exponent of a number. According to the Standard, these extreme values have a different meaning: they are needed to encode special numbers.

When $e_b=0$, the floating point number is interpreted as denormal: the implicit leading bit (which is not stored) is taken to be 0, and the exponent is taken to be $e_{min}$. For example, the number $-0.10_{(2)}\times 2^{-2}=-0.125$ is denormal, so it is encoded as

1 000 10 = −0.125
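
A minimal Python sketch of this rule, again assuming the 6-bit toy format with $e_{min}=-2$ and two stored mantissa bits; the name `decode_denormal` is our own.

```python
E_MIN = -2        # smallest exponent of the toy format
FRAC_BITS = 2     # explicitly stored bits of the mantissa


def decode_denormal(bits):
    """Decode a 6-bit pattern whose biased exponent field is 000."""
    s = (bits >> 5) & 0b1
    e_b = (bits >> 2) & 0b111
    f = bits & 0b11
    if e_b != 0:
        raise ValueError("not a denormal pattern: e_b must be 0")
    mantissa = f / 2 ** FRAC_BITS       # implicit leading bit is 0, not 1
    return (-1) ** s * mantissa * 2.0 ** E_MIN


print(decode_denormal(0b1_000_10))   # -0.125
print(decode_denormal(0b0_000_01))   # 0.0625
print(decode_denormal(0b0_000_11))   # 0.1875
```

The largest denormal, 0.1875, lies just below the smallest normal number 0.25, so the denormals fill the gap between zero and the normal range with evenly spaced values.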

When $e_b=7$, the bit pattern is interpreted as an infinity or as “not a number” (NaN). These special numbers, as well as zero (which is also a special number), are discussed below.

Zeroes, infinities and NaNs

It is easy to notice that in the proposed logic of coding floating point numbers we inevitably get two zeroes: positive and negative.

• 0 000 00 = +0
• 1 000 00 = −0

At the same time the Standard requires the equality +0=−0.
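
The same behaviour can be observed with ordinary Python floats. This sketch assumes that Python's `float` is an IEEE-754 double precision number, which is the case on mainstream platforms.

```python
import math
import struct

pos, neg = 0.0, -0.0

print(pos == neg)                    # True: the two zeroes compare equal
print(math.copysign(1.0, pos))       # 1.0
print(math.copysign(1.0, neg))       # -1.0

# The underlying bit patterns differ only in the sign bit.
print(struct.pack(">d", pos).hex())  # 0000000000000000
print(struct.pack(">d", neg).hex())  # 8000000000000000
```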

Positive and negative zeroes are quite useful. Think of mathematical analysis: if we interpret 0 as the limit of a sequence, its sign can show from which side the limit is approached. Thus a negative zero may be interpreted as a very small negative number that was rounded to zero in the course of a calculation. We will see another use of the sign of zero when we discuss infinity.

Contrary to popular belief, floating point arithmetic is a rather complex and non-obvious tool, and it takes considerable experience for a programmer to make sure that the result produced by a program is really close to the correct one. It would be wrong to burden the programmer with overflow checks as well, especially since overflow is not always easy to handle (or to remember to handle) even when working with integers. That is why the Standard provides a special number called “infinity”.

Infinity (positive or negative) is encoded as follows: the exponent field consists entirely of ones ($e_b=7$), and the mantissa field consists entirely of zeroes. The sign bit still determines the sign.

• 0 111 00 = +∞
• 1 111 00 = −∞

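Infinity has a number of intuitive properties. For example, +∞±a=+∞ and −∞<a if a is a finite number. The interaction between infinity and zero obeys simple rules: for a positive finite a we have a/+0=+∞, a/−0=−∞, a/+∞=+0 and a/−∞=−0 (for a negative a the signs of the results are reversed).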

Now we see another reason to have positive and negative zeroes: when the denominator in some complex calculation is rounded to zero, the result is an infinity with the correct sign, provided that the zero denominator retains its sign. Whether certain important conditions in the program are satisfied may depend on that sign. Thus the presence of signed infinities and zeroes simplifies the logic of some complex scientific calculations.
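
Some of these rules can be tried out in Python, assuming its `float` is an IEEE-754 double. Note one caveat of the language itself: Python does not return an infinity when a float is divided by zero but raises `ZeroDivisionError`, so only the rules that do not divide by zero are shown as computations here.

```python
import math

inf = math.inf
a = 5.0   # any positive finite number

print(inf + a == inf)                  # True
print(inf - a == inf)                  # True
print(-inf < a)                        # True

# Dividing by an infinity gives a zero whose sign follows the usual sign rule.
print(math.copysign(1.0, a / inf))     # 1.0
print(math.copysign(1.0, a / -inf))    # -1.0

# Python deviates from the IEEE-754 default here: dividing a float by zero
# raises ZeroDivisionError instead of returning an infinity.
try:
    a / 0.0
except ZeroDivisionError:
    print("division by zero raises in Python")
```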

What happens if you try to calculate, say, +∞/+∞ or −0/−0? For such cases the Standard provides special values called NaNs (“Not a Number”). A NaN indicates that an invalid operation has been performed (say, $\sqrt{-1}$ or +∞/+∞) or that one of the operands was itself a NaN (say, NaN+1.0). A NaN is encoded as follows: the exponent field consists of ones ($e_b=7$), and the mantissa field contains at least one non-zero bit. The sign bit does not matter.

• 0 111 10 = NaN
• 1 111 01 = NaN
• 1 111 11 = NaN
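
The encoding rules for zeroes, denormals, infinities and NaNs can be summarized in a small Python sketch that classifies a bit pattern by its exponent and mantissa fields. It assumes the 6-bit toy format of this article; the function name `classify` is hypothetical.

```python
def classify(bits):
    """Classify a 6-bit pattern of the toy format by its fields."""
    e_b = (bits >> 2) & 0b111      # biased exponent (bits 2 to 4)
    f = bits & 0b11                # stored mantissa bits (bits 0 to 1)
    if e_b == 0:
        return "zero" if f == 0 else "denormal"
    if e_b == 7:
        return "infinity" if f == 0 else "NaN"
    return "normal"


for pattern in (0b0_000_00, 0b1_000_10, 0b0_011_01,
                0b1_111_00, 0b0_111_10, 0b1_111_11):
    print(format(pattern, "06b"), classify(pattern))
```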

An important property that lets you “catch” a NaN is that x≠x if x=NaN. More specifically, the operator ≠ returns true when either of its operands is NaN, and all other comparison operators return false. So x≤x is false, which is a bit strange. This produces an unexpected effect related to negation: we cannot replace a comparison with the negation of the opposite comparison. That is, a≤b is not the same as NOT(a>b), because when an operand is NaN the first expression is false and the second is true. Keep this peculiarity of NaN in mind when comparing floating point numbers.
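
These comparison rules are easy to observe in Python, whose floats follow the IEEE-754 comparison semantics for NaN:

```python
import math

x = math.nan
a, b = math.nan, 1.0

print(x != x)          # True: the only comparison that succeeds for NaN
print(x == x)          # False
print(x <= x)          # False

# Negating a comparison is not the same as using the opposite comparison.
print(a <= b)          # False
print(not (a > b))     # True

# math.isnan gives the same answer as the x != x test.
print(math.isnan(x))   # True
```

In practice a library function such as `math.isnan` is usually preferable to writing `x != x` by hand, since it states the intent explicitly.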

The Standard does not specify what meaning should be attributed to the mantissa bits of a NaN, so manufacturers of computing devices have freedom of choice here. There is, however, a widely followed convention under which the highest bit of the mantissa field is 1 for a so-called quiet NaN (qNaN) and 0 for a so-called signaling NaN (sNaN). The difference between them is that a qNaN is processed in the normal mode without raising exceptions, whereas an sNaN signals an exceptional condition (for example, by raising an exception).
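
Applied to the 6-bit toy format, this convention means that bit 1 of the pattern (the highest of the two mantissa bits) separates the two kinds of NaN. The sketch below only illustrates that convention; the function name `nan_kind` is our own.

```python
def nan_kind(bits):
    """Return 'qNaN' or 'sNaN' for a NaN pattern of the 6-bit toy format."""
    e_b = (bits >> 2) & 0b111
    f = bits & 0b11
    if e_b != 7 or f == 0:
        raise ValueError("pattern does not encode a NaN")
    # Highest bit of the 2-bit mantissa field is bit 1 of the whole pattern.
    return "qNaN" if f & 0b10 else "sNaN"


print(nan_kind(0b0_111_10))   # qNaN
print(nan_kind(0b1_111_01))   # sNaN
print(nan_kind(0b1_111_11))   # qNaN
```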

Dividing NaNs into sNaNs and qNaNs is convenient, because in some cases the appearance of a NaN is a critical error that should be signaled, while in others it does not contradict the intent of the program developer.

For example, if the memory allocated to floating point values is initialized with sNaNs at compile time, then any operation on such a value raises an exception, and the programmer can quickly tell that a variable was used before being initialized.

One can give other examples of using NaNs for different purposes, but there are no well-established traditions in this area. Moreover, for a typical professional programmer, encountering a NaN will almost always mean that an invalid operation (like those given above) has been carried out somewhere in the program, and further knowledge of NaNs will rarely be necessary.

Comparison and order of numbers

The reason why the exponent is stored biased rather than in the usual two's complement will become clear once we learn more about comparing floating point numbers and the way they are ordered.

To begin with, assume that the sign bit $s$ is zero, that is, we work with non-negative numbers. Let us interpret the bits of a floating point number as an integer. For example, the number 0 001 00 = 0.25 can be read as the integer 4, and 0 010 11 = 0.875 as the integer 11. Floating point numbers obey an important rule: their order on the real axis matches the order of the corresponding integers.

In other words, if we take two non-negative floating point numbers a and b and their corresponding integers A and B, then a<b if and only if A<B. This rule holds even when b=+∞. Moreover, if the floating point number a corresponds to the integer A, then the next representable number after a (or infinity) corresponds to A+1. For example, after 0 010 11 = 0.875 comes 0 011 00 = 1.
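
This rule can be verified exhaustively for the 6-bit toy format. The sketch below decodes every non-negative pattern from +0 up to +∞ and checks that the decoded values increase together with the integers; the helper `decode` is hypothetical and hard-codes the toy parameters (bias 3, $e_{min}=-2$, two mantissa bits).

```python
def decode(bits):
    """Decode a non-negative 6-bit pattern of the toy format (no NaNs)."""
    e_b = (bits >> 2) & 0b111
    f = bits & 0b11
    if e_b == 7 and f == 0:
        return float("inf")
    if e_b == 7:
        raise ValueError("NaN patterns are not ordered")
    if e_b == 0:
        return (f / 4) * 2.0 ** -2          # denormal: leading bit 0, e = e_min
    return (1 + f / 4) * 2.0 ** (e_b - 3)   # normal: leading bit 1, bias 3


# Patterns 0 (that is +0) up to 0b011100 (that is +inf), taken as integers.
patterns = range(0b0_000_00, 0b0_111_00 + 1)
values = [decode(p) for p in patterns]

# The values are strictly increasing, so a < b exactly when A < B.
print(all(x < y for x, y in zip(values, values[1:])))   # True
print(decode(0b0_010_11), decode(0b0_010_11 + 1))       # 0.875 1.0
```

The check also passes across the boundary between denormal and normal numbers and at the step from the largest finite number to infinity, which is what the biased exponent makes possible.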

For negative numbers the order is reversed: the larger the corresponding integer, the smaller the number.
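
The same effect can be seen with real double precision numbers. This sketch assumes that Python's `float` is an IEEE-754 double and uses the standard `struct` module to reinterpret its 64 bits as an unsigned integer; the helper name `float_to_int` is our own.

```python
import struct


def float_to_int(x):
    """Reinterpret the 64 bits of a Python float as an unsigned integer."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]


# Non-negative numbers: the integer order matches the numeric order.
print(float_to_int(0.25) < float_to_int(0.875) < float_to_int(14.0))      # True

# Negative numbers: the integer order is the reverse of the numeric order.
print(-14.0 < -0.875 < -0.25)                                              # True
print(float_to_int(-14.0) > float_to_int(-0.875) > float_to_int(-0.25))   # True
```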

If one of the numbers is NaN, the rule does not work, because these “numbers” do not obey any order and must be handled separately. As already mentioned, the only comparison operator that returns true when NaN is involved is ≠.