Unexpected effects that occur when using floating point numbers

Category: Floating Point Arithmetic, IEEE-754, Basics

We show three main problems that occur when working with floating point numbers.

The representation described above has many advantages over fixed-point numbers and is convenient when working with a very wide range of values, but it also gives rise to unexpected effects that can surprise a programmer with a mathematical background.

The first problem is that not every real number in the feasible range has a place on our number axis. Let us return to our example, where $p=3$, $e_{min}=-1$ and $e_{max}=2$. These restrictions cover the range [0.5, 7], plus zero, which we have agreed to treat as a special value. Where is the number 3.75 on our axis? It lies exactly halfway between two representable numbers, 3.5 and 4.
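As a quick check of the range quoted above, the normalized values of this toy format can be enumerated directly. This is a minimal sketch that is not part of the original text; the function name `normal_values` is made up for illustration, and exact fractions are used so that no rounding of our own creeps in.

```python
from fractions import Fraction

def normal_values(p=3, e_min=-1, e_max=2):
    """All positive normalized values 1.xx * 2**e of the toy format."""
    values = []
    for e in range(e_min, e_max + 1):
        for frac in range(2 ** (p - 1)):              # the p-1 fraction bits
            mantissa = 1 + Fraction(frac, 2 ** (p - 1))
            values.append(mantissa * Fraction(2) ** e)
    return values

values = normal_values()
print([float(v) for v in values])
# [0.5, 0.625, 0.75, 0.875, 1.0, 1.25, 1.5, 1.75,
#  2.0, 2.5, 3.0, 3.5, 4.0, 5.0, 6.0, 7.0]
print(len(values) + 1)           # 17, counting the special value zero
print(Fraction(15, 4) in values) # False: 3.75 has no place on the axis
```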

In such cases the number is rounded according to one of four rules, which we will discuss in another article. For now, please take it for granted that by default it is rounded to 4, that is, 3.75≈4. So the entire continuum of real numbers has to fit into only seventeen values. This introduces an error into calculations and makes many numbers that are close to the same representable value indistinguishable. For example, all the real numbers in the range [3.75, 4.5] are rounded to 4 by default. Moreover, our entire range can be divided into such zones, where every number in a zone is rounded to the mark that falls within it.

Once again, remember that there are four rounding modes, which we will discuss later. The figure above is drawn for the mode that is used by default. Note that the big marks are not at the centers of their zones. This is because increasing the exponent by one doubles the interval between neighbouring numbers.
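The rounding zones can be explored with a small model. The sketch below assumes that the default mode is "round to nearest, ties to the value whose last mantissa bit is even", which is consistent with the examples 3.75≈4 and 4.5≈4 in the text. The names `toy_values` and `round_toy` are hypothetical, only non-negative inputs up to 7 are handled, and overflow is not modelled.

```python
from fractions import Fraction

def toy_values():
    """(value, last_mantissa_bit_is_even) for zero and every 1.xx * 2**e."""
    out = [(Fraction(0), True)]
    for e in range(-1, 3):                  # e_min = -1 ... e_max = 2
        for frac in range(4):               # two fraction bits, p = 3
            value = (1 + Fraction(frac, 4)) * Fraction(2) ** e
            out.append((value, frac % 2 == 0))
    return out

def round_toy(x):
    """Nearest toy value to x; an exact tie goes to the even mantissa."""
    x = Fraction(x)
    nearest = min(toy_values(), key=lambda v: (abs(v[0] - x), not v[1]))
    return nearest[0]

print(float(round_toy(Fraction(15, 4))))   # 4.0  (3.75 is a tie, goes to 4)
print(float(round_toy(Fraction(9, 2))))    # 4.0  (4.5 is a tie, goes to 4)
print(float(round_toy(Fraction(37, 10))))  # 3.5  (3.7 is closer to 3.5)
print(float(round_toy(Fraction(23, 5))))   # 5.0  (4.6 is closer to 5)
```

The zone around 4 therefore extends 0.25 to the left but 0.5 to the right, which is why the mark is not in the center of its zone.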

The second problem appears when an arithmetic operation produces a result that cannot be represented. On the one hand, this is obvious: the sum or product of two very large numbers is an even larger number that does not fit in the selected range. However, there is a less obvious effect: try to calculate 4+0.5. You will get 4 again, since 4.5 is rounded down to 4. Thus, adding or subtracting numbers that differ too much in magnitude can be pointless: the result may simply equal the operand with the larger absolute value.

This problem produces a curious effect connected with the order of evaluation. Let's evaluate 4+0.5−4 from left to right. We get 0, because first 4+0.5≈4, and then 4−4=0. However, if we reorder the terms as 4−4+0.5, we get 0.5. Thus, in floating point arithmetic the result can depend on the order of the terms, even though in mathematics it does not. Strictly speaking, a single addition is still commutative (a+b always equals b+a); what is lost is associativity, so regrouping or reordering a chain of operations can change the answer: (4+0.5)+0.5=4 ≠ 5=4+(0.5+0.5). For these reasons, you need to plan the order of calculations carefully to minimize the error caused by such effects.
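The same effect can be reproduced with ordinary double-precision numbers. The sketch below assumes that Python floats are IEEE-754 doubles with the default rounding mode, which holds for CPython on common platforms. The value 2**53 plays the role of 4 and the value 1 plays the role of 0.5, because neighbouring doubles near 2**53 are 2 apart.

```python
big = 2.0 ** 53      # neighbouring doubles are 2.0 apart at this magnitude
small = 1.0          # exactly half of that spacing

left_to_right = (big + small) - big   # big + small rounds back to big
reordered = (big - big) + small       # nothing is lost in this order

print(left_to_right)   # 0.0
print(reordered)       # 1.0

# Regrouping the same three terms changes the result as well.
print((big + small) + small == big + (small + small))   # False
```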

The third problem is associated with subtracting distinct numbers that are close to each other. Take the two numbers 1 and 0.875 (the number that precedes 1.0 on our number line). Their difference is 0.125. However, 0.125 lies in the so-called “underflow gap around zero”, between zero and the minimum positive number 0.5. Thus, it is rounded down to zero.

The same thing happens if we calculate the difference 1.75−1.5=0.25≈0: we obtain 0 again. What does this mean? The difference of two distinct numbers can be zero! This is extremely inconvenient and can lead to serious errors, for example an unexpected division by zero or a severe loss of precision in the intermediate steps of complex expressions.
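To see how common this situation is, one can list every pair of distinct normalized numbers of the toy format whose exact difference falls inside the underflow gap. This is only an illustrative sketch with made-up variable names; it does not model rounding, it just finds the differences that have no representation between 0 and 0.5.

```python
from fractions import Fraction

# Normalized values 1.xx * 2**e of the toy format (p = 3, e from -1 to 2).
values = [(1 + Fraction(f, 4)) * Fraction(2) ** e
          for e in range(-1, 3) for f in range(4)]
smallest_normal = min(values)          # 0.5

# Pairs of distinct representable numbers whose exact difference
# lies strictly inside the underflow gap (0, 0.5).
lost = [(a, b, a - b) for a in values for b in values
        if 0 < a - b < smallest_normal]

for a, b, d in lost:
    print(f"{float(a)} - {float(b)} = {float(d)}  (falls into the gap)")
```

Both examples from the text, 1−0.875 and 1.75−1.5, appear in this list.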

It is not possible to escape the first two problems when using a floating point format; we can only live with them. The third problem, however, is solved by the so-called “denormal” numbers.

Denormal Numbers

Normalized numbers are numbers that can be represented in scientific notation of the form $1.xxx\ldots\times2^{e}$, but the constraints on the number of mantissa digits and on the exponent range always produce an underflow gap around zero. Thus, such a representation makes it impossible to work with very small numbers close to zero. To solve this problem, we must drop the normalization condition 1≤|m|<2 for those numbers whose exponent has the minimum value $e=e_{min}$. So we come to the definition of denormal numbers.

Denormal numbers are the numbers with the minimum exponent for which 0≤|m|<1, so they have the form $0.xxx\ldots\times2^{e_{min}}$.

Let us add the following numbers to our example:

  • $0.00\times2^{-1}=0$,
  • $0.01\times2^{-1}=0.125$,
  • $0.10\times2^{-1}=0.25$,
  • $0.11\times2^{-1}=0.375$.

Strictly speaking, it is wrong to call zero a denormal number, because it is a special value, but zero fits well into the concept of denormal numbers, since it is covered by the general formula.

Denormal numbers uniformly fill the underflow gap between zero and the minimum normalized number.

Note that the step that separates denormal numbers from each other is equal to the step between normal numbers with the same exponent −1. In our example, this step is equal to 0.125. This allows us to avoid the third problem. Now it is guaranteed that the difference of any two exactly representable numbers is nonzero unless the numbers are equal. Let us return to our examples of the difference calculation: $1-0.875=0.125=0.01\times2^{-1}$ and $1.75-1.5=0.25=0.10\times2^{-1}$. Denormal numbers can also be subtracted from each other, and the above rule will not be violated.
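The guarantee can be observed in double precision as well. The sketch below assumes that Python floats are IEEE-754 doubles and that the platform has not enabled a flush-to-zero mode, which is the usual situation for CPython. The two operands are chosen just above the smallest normalized double, so their difference lands where the underflow gap would otherwise be.

```python
import sys

smallest_normal = sys.float_info.min     # 2**-1022 for IEEE-754 doubles

x = 1.5 * smallest_normal
y = 1.25 * smallest_normal
diff = x - y                             # exactly 0.25 * 2**-1022

print(x != y)                    # True
print(diff != 0.0)               # True: the result is a denormal, not zero
print(diff < smallest_normal)    # True: it lies below every normalized number
```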

If we look at the last picture, it is easy to see that there are as many numbers with a negative exponent (8 numbers) as with a positive exponent (also 8 numbers), and the numbers with zero exponent (4 numbers) lie in the middle between them. This symmetry is achieved whenever we choose $e_{min} = 1-e_{max}$. It has been suggested that the choice of the exponent range for the IEEE-754 format was based on this argument. Later we will see that the exponent range in this format obeys the above formula.
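The counting argument can be checked for other parameters with a few lines of code. In this sketch the function `count_by_exponent_sign` is a hypothetical helper; it counts denormals (including zero) under the exponent $e_{min}$ they share, and it assumes $e_{max}\ge 2$ so that $e_{min}$ is negative. The last lines compare the rule with the double-precision limits reported by Python, where `sys.float_info.min_exp` and `sys.float_info.max_exp` are each one larger than the exponents used in this article because Python normalizes the mantissa differently.

```python
import sys
from collections import Counter

def count_by_exponent_sign(p, e_max):
    """(negative, zero, positive) exponent counts when e_min = 1 - e_max."""
    e_min = 1 - e_max
    block = 2 ** (p - 1)             # mantissas available per exponent
    counts = Counter()
    counts[e_min] += block           # denormals 0.xx * 2**e_min, zero included
    for e in range(e_min, e_max + 1):
        counts[e] += block           # normalized 1.xx * 2**e
    negative = sum(n for e, n in counts.items() if e < 0)
    positive = sum(n for e, n in counts.items() if e > 0)
    return negative, counts[0], positive

print(count_by_exponent_sign(p=3, e_max=2))   # (8, 4, 8)
print(count_by_exponent_sign(p=3, e_max=3))   # (12, 4, 12)

# IEEE-754 double precision, in this article's convention for exponents.
e_min_double = sys.float_info.min_exp - 1     # -1022
e_max_double = sys.float_info.max_exp - 1     # 1023
print(e_min_double == 1 - e_max_double)       # True
```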