Floating Point Data Types in the IEEE-754 Standard

Category: Floating Point Arithmetic, IEEE-754, Basics

This article covers the data types supported by the Standard, with the binary32 type discussed in detail.

The IEEE-754 standard defines floating-point formats ranging in size from 16 to 128 bits, each with its own precision. Let's describe these data types:

  • binary16 (half precision),
  • binary32 (single precision),
  • binary64 (double precision) and
  • binary128 (quadruple precision).

The standard also mentions a binary256 (octuple precision) format and gives recommendations on extended precision formats. For example, an 80-bit extended precision format is used by the x87 FPU, though quite rarely nowadays. Some compilers (for example, recent versions of Visual C++) do not even support such a data type, which is commonly exposed as "long double"; it is not recommended to rely on it. The table below shows the values of $p$, $E$, $e_{min}$ and $e_{max}$ for these formats.

The number in the name of each format is the number of bits used to encode the numbers it represents.

Binary32

Let's start learning IEEE-754 with the single precision numbers, binary32. This data type is supported by many languages, including C and C++ (the "float" data type), so we can test our knowledge immediately. In the binary32 format, 8 bits are reserved for the exponent (E=8) and 23 bits for the fractional part of the mantissa (p=24); one more bit remains for the sign.

So, the bits are numbered from 0 to 31, and the entire space of 32 bits is divided into three fields, as shown in the picture:

The blue field, which occupies the single bit number 31, holds the sign. The second, green field occupies bits 23 through 30 and holds the exponent. Following the logic of IEEE-754, this field stores the biased exponent $e_b=e+\mathrm{bias}$, where $\mathrm{bias}$ equals 127. The last, red field occupies the 23 bits from 0 to 22. It holds the mantissa, or rather its fractional part, because the most significant bit (the implicit leading bit, i.e. the bit of the integer part) is not stored explicitly.

Consider the following example. The number 3.14 is written as $1.100\,100\,011\,110\,101\,110\,000\,11_{(2)}\times2^1$ (after correct rounding we keep exactly 23 digits after the binary point, which is all we can store). Our 23 digits of the fractional part are 10010001111010111000011. The biased exponent is $e_b=1+127=128=10000000_{(2)}$. Thus, the binary floating-point number 3.14 becomes:

0 10000000 10010001111010111000011

Here is where something strange begins, something everyone who works with floating-point arithmetic will face. Watch carefully! Let's try to convert our number back to decimal:

$$\displaylines{ \biggl(1 + \frac1{2^1} + \frac1{2^4} + \frac1{2^8}+ \frac1{2^9}+ \frac1{2^{10}}+ \frac1{2^{11}}+ \frac1{2^{13}}+ \frac1{2^{15}}+ {}\cr {} + \frac1{2^{16}}+ \frac1{2^{17}}+ \frac1{2^{22}}+ \frac1{2^{23}}\biggr)\cdot2^1 = 3.140\,000\,104\,904\,174\,804\,687\,5.}$$

If you do not understand why the denominators are powers of two, look at the positions of the mantissa that contain ones (from left to right, counting from one). The ones are in the first position, then the fourth, then the eighth, and so on up to positions 22 and 23. The implicit leading bit corresponds to $1/2^0=1$.

So, we cannot represent the number 3.14 exactly using 23 bits after the binary point; we have to round it to the nearest number representable in this format. Note this feature of floating-point arithmetic and keep it in mind. It is rare for applied calculations to involve only numbers that are exactly representable in binary floating-point; there is almost always some error. In our example the error is below $10^{-6}$ (the difference between the original and the approximated numbers is less than one millionth).

Now we can put our knowledge into practice. Take the number 25. In binary it is 11001, and its normalized scientific notation is $1.1001\times2^4$. Thus the biased exponent is $e_b=4+127=131=10000011_{(2)}$, and the mantissa (after removing the most significant bit and padding with zeroes to 23 bits) is 10010000000000000000000. Thus, the number 25 in binary32 format has the following form:

0 10000011 10010000000000000000000

The next example is the number −0.3515625. Its normalized scientific notation is $-1.01101\times2^{-2}$. The biased exponent is $e_b=-2+127=125=01111101_{(2)}$. The mantissa is 01101000000000000000000, and since the number is negative, the sign bit is 1. As a result, we get this:

1 01111101 01101000000000000000000

Another interesting example is the number 1. Its exponent is 0, so the biased exponent is $e_b=127=01111111_{(2)}$, and the mantissa consists of zeroes:

0 01111111 00000000000000000000000

Being able to read floating point numbers in hexadecimal is a very useful skill. The reader can practice here by translating the three preceding examples into hexadecimal:

  • 0 10000011 10010000000000000000000 = 0x41C80000 = 25.0
  • 1 01111101 01101000000000000000000 = 0xBEB40000 = −0.3515625
  • 0 01111111 00000000000000000000000 = 0x3F800000 = 1.0

Later we will deal with this kind of notation quite often, as it is the most compact and convenient one.

Let's try the reverse operation. Say we have the number 0x449A4000 in binary32 format and need to convert it to ordinary decimal notation. We divide it into the three bit fields:

0 10001001 00110100100000000000000 = 0x449A4000

The biased exponent is $e_b=137$, so the exponent of our number is $e=10$. Thus, the number has a value of $$ \bigl( 1+2^{-3}+2^{-4}+2^{-6}+2^{-9} \bigr) \times 2^{10} = 1\,234. $$

All three examples above of converting from decimal to binary32 format share one feature, thanks to which the operation was so easy: all of these numbers are exactly representable in binary32 format, so no rounding was needed. When a number cannot be represented exactly in a given format, it must be rounded.

As has been mentioned, the IEEE-754 standard provides four rounding directions: towards zero, towards minus infinity, towards plus infinity and "half to even" (to the nearest, with ties going to the even mantissa).

Take the number $1+2^{-23}$. In scientific notation it is written as $1.000\,000\,000\,000\,000\,000\,000\,01 \times 2^0$. This number is representable in our format, so no rounding is needed: the mantissa field is just large enough to fit the single 23rd bit of the fractional part:

0 01111111 00000000000000000000001

Now take the number $1+2^{-23}+2^{-24}$: $$\require{color}1.000\,000\,000\,000\,000\,000\,000\,01\colorbox{gray}{1} \times 2^0.$$

We lack one bit in the mantissa field to fit the last, 24th, bit of the fractional part, so we have to round. But in which direction? The number lies exactly halfway between its two representable neighbours, and in such an ambiguous case the "half to even" rule requires rounding towards the even mantissa. So we get $1+2^{-23}+2^{-24}\approx1+2^{-22}$:

0 01111111 00000000000000000000010

Now let's work with the number $1+2^{-23}+2^{-25}$: $$\require{color}1.000\,000\,000\,000\,000\,000\,000\,01\colorbox{gray}{0}1 \times 2^0.$$ Here there is no ambiguity, so the number is simply rounded down: $1+2^{-23}+2^{-25}\approx1+2^{-23}$.

0 01111111 00000000000000000000001

The smallest number that can be represented in normalized scientific notation is $1.000\,000\,000\,000\,000\,000\,000\,00 \times 2^{-126} = 2^{-126}\approx1.18\times 10^{-38}$. It has the smallest possible exponent, −126, and in binary it is written as

0 00000001 00000000000000000000000 = 0x00800000

The maximum number has the maximum possible exponent, 127, and equals $1.111\,111\,111\,111\,111\,111\,111\,11 \times 2^{127} = (2-2^{-23})\times 2^{127}\approx3.4\times 10^{38}$.

In binary it would look like

0 11111110 11111111111111111111111 = 0x7F7FFFFF

If the biased exponent is zero, we are dealing with denormal numbers. The exponent is then taken to be $e=-126$, so denormal numbers are written as $0.xxx\ldots\times2^{-126}$. It follows that the smallest number representable in binary32 format is $$0.000\,000\,000\,000\,000\,000\,000\,01 \times 2^{-126} = 2^{-126-23}=2^{-149}\approx1.4\times10^{-45}.$$

In binary:

0 00000000 00000000000000000000001 = 0x00000001

If $e_b = 255$, i.e. the exponent field is filled with ones, we are dealing with infinities or NaNs. The sign bit gives the sign of the infinity; for a NaN it carries no particular meaning.

  • 0 11111111 00000000000000000000000 = 0x7F800000 = +oo
  • 1 11111111 00000000000000000000000 = 0xFF800000 = -oo

“Not a number” NaN is encoded with all the ones in the exponent field and a non-zero mantissa:

  • 0 11111111 10000000000000000000000 = 0x7FC00000 = NaN
  • 1 11111111 00000000000000000000010 = 0xFF800002 = NaN

The reader can compile and run the following program to see how numbers are encoded in binary32 format: it prints several values together with their hexadecimal representation.

#include <cstdio>
#include <cstring>

typedef unsigned int u32;
typedef float fp32;

// Print the raw binary32 representation of a float in hex,
// together with its decimal value.
void output (fp32 a) {
  u32 bits;
  memcpy (&bits, &a, sizeof bits);  // well-defined type punning
  printf ("%08X (%.10g)\n", bits, a);
}

int main() {
  fp32 a=1.0f, b=-2.0f, c=0.1f, d=0.333333333f,
       p_inf, n_inf, nan1, nan2, max, min, dmin, p_zero, n_zero;

  max = 3.402823466e38f;    // Max value (approximately).
  min = 1.175494351e-38f;   // Min normal value (approximately).
  dmin = 1.401298464e-45f;  // Min denormal value (approximately).

  p_inf = max*2;         // Plus infinity (max float * 2).
  n_inf = -p_inf;        // Minus infinity.
  nan1 = p_inf + n_inf;  // NaN as +oo + (-oo).
  nan2 = p_inf / p_inf;  // NaN as +oo / +oo.
  p_zero = 1.0f/p_inf;   // +0 = 1/+oo.
  n_zero = 1.0f/n_inf;   // -0 = 1/-oo.

  output (a);
  output (b);
  output (c);
  output (d);
  output (max);
  output (min);
  output (dmin);
  output (p_inf);
  output (n_inf);
  output (p_zero);
  output (n_zero);
  output (nan1);
  output (nan2);

  return 0;
}

For example, with VC++ 2015 the output of this program is:

3F800000 (1)
C0000000 (-2)
3DCCCCCD (0.1000000015)
3EAAAAAB (0.3333333433)
7F7FFFFF (3.402823466e+38)
00800000 (1.175494351e-38)
00000001 (1.401298464e-45)
7F800000 (inf)
FF800000 (-inf)
00000000 (0)
80000000 (-0)
FFC00000 (-nan(ind))
FFC00000 (-nan(ind))

Please note that some values are not represented exactly. For example, the number $0.1_{(10)}$ cannot be represented in binary exactly: in binary scientific notation it looks like $1.(1001)\times2^{-4}$. Thus the nearest number representable in binary32 format (rounding up applied) is $1.100\,110\,011\,001\,100\,110\,011\,01 \times 2^{-4} = 0.100\,000\,001\,490\,116\,119\,384\,765\,625$.

So in this format the number 0.1 is encoded with an accuracy of at least $10^{-8}$. The same is true for the number 1/3. In binary scientific notation it is written as $1.(01)\times2^{-2}$, so the nearest representable value (after rounding up) is $1.010\,101\,010\,101\,010\,101\,010\,11 \times 2^{-2} = 0.333\,333\,343\,267\,440\,795\,898\,437\,5$.

It is convenient to keep a table of some values in binary32 format at hand. In the rightmost column of the table below, the numbers are given as decimal approximations with the minimum number of decimal digits sufficient to restore the binary32 value exactly (in other words, these approximations carry the minimum inaccuracy allowed by the format).

Looking at the table, we can conclude that the binary32 format "keeps" about 7 significant decimal digits. For denormal numbers the accuracy can be lower, and each arithmetic operation may reduce it further, sometimes very significantly.