IEEE 754 Standard - Overview

Before discussing the WB_FPU (Wishbone Floating Point Unit) peripheral in detail, it is worth spending some time on the standard to which its floating-point numbers adhere: the IEEE Standard for Binary Floating-Point Arithmetic (IEEE 754). This standard not only specifies how floating-point numbers are to be represented, but also how arithmetic calculations on these numbers should be performed.

Single Precision Format

The WB_FPU supports single precision (32-bit) binary floating-point numbers, formatted in accordance with the IEEE 754 Standard. Figure 1 shows the composition of a binary floating-point number under this standard.


Figure 1. IEEE 754 floating-point number format.

Sign

The Sign consists of a single bit. If this bit is '1', then the number is negative. If this bit is '0', then the number is positive.

Exponent

The Exponent consists of 8 bits. The base system used is binary (base 2). This base is implicit and is therefore not stored as part of the 32-bit format.

To facilitate both positive and negative exponent values, a bias value is added to the actual exponent to arrive at the 8-bit value that gets stored. For IEEE 754 single precision floats, the bias value used is 127.

Significand

The significand represents the precision bits of the number. The WB_FPU supports normalized floating-point numbers. Normalization leads to two important points:

  • The radix point is always placed after the first non-zero digit (e.g. 1101 becomes 1.101 x 2^3).
  • The leading bit – to the left of the radix point – is required to be non-zero. As we are using the binary system, this bit can only be 1. As such, the leading bit is implicit and not stored as part of the significand.

The significand therefore effectively has 24-bit resolution, but only 23 bits are stored, representing the fractional part of the number – the digits to the right of the radix point. If the fractional part of the number (written in binary notation) is less than 23 bits, the remaining right-hand bits are padded with zeroes. For example, 1.101 x 2^3 would yield a significand of 10100000000000000000000.

Floating-Point Encoding Example

A simple example demonstrates how an integer is encoded into this floating-point format. Consider the integer value 276. We need to obtain the Sign, Exponent and Significand in order to fully represent this integer in binary floating-point format.

  • This is a positive integer, so the sign bit of the binary floating-point representation will be 0.
  • Convert the integer into binary notation – giving us 100010100.
  • If we normalize this value, moving the radix point to the right of the leading 1, our binary value 100010100 becomes 1.00010100 x 2^8.
  • The actual exponent for this value is 8. The exponent that gets stored is 8 + 127 (the bias value). Thus, 135 will be stored as the 8-bit exponent value 10000111.
  • The entry for the significand is the fractional part of the number (the digits to the right of the radix point, 00010100), padded with zeroes to obtain the required 23 bits. In this example, the significand is 00010100000000000000000.

This is all the information we need to fully represent the integer in floating-point form: Sign = 0, Exponent = 10000111 and Significand = 00010100000000000000000, giving the 32-bit pattern 0x438A0000.
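The encoding can be checked on a host processor. The following C sketch (hypothetical host-side code, not part of the WB_FPU core) reinterprets the bits of 276.0f and prints the three fields derived above:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Extract the Sign, Exponent and Significand fields of an IEEE 754
   single precision value, reproducing the 276 example. */
int main(void)
{
    float value = 276.0f;
    uint32_t bits;
    memcpy(&bits, &value, sizeof bits);              /* reinterpret the 32 bits */

    uint32_t sign        = (bits >> 31) & 0x1;       /* 1 bit                   */
    uint32_t exponent    = (bits >> 23) & 0xFF;      /* 8 bits (biased)         */
    uint32_t significand =  bits        & 0x7FFFFF;  /* 23 bits (fraction)      */

    printf("sign        = %u\n", (unsigned)sign);                          /* 0   */
    printf("exponent    = %u (actual %d)\n", (unsigned)exponent, (int)exponent - 127);
    printf("significand = 0x%06X\n", (unsigned)significand);               /* 0x0A0000 */
    return 0;
}

The fractional field prints as 0x0A0000 – the bit pattern 00010100 followed by fifteen zeroes.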

Valid Range for Floating-Point Numbers

The valid range for normalized (non-zero) numbers in single precision floating-point format can be defined as:

2^minexp ≤ FloatValue ≤ (2 - 2^(1-precisionbits)) x 2^maxexp ...Positive values

-2^minexp ≥ FloatValue ≥ -(2 - 2^(1-precisionbits)) x 2^maxexp ...Negative values

where,

minexp is the minimum legal exponent value for normalized numbers being encoded. For single precision floating-point format this value is -126 which, when the bias value (127) is added, will give a biased exponent of 1 (00000001b).

maxexp is the maximum legal exponent value for normalized numbers being encoded. For single precision floating-point format this value is 127 which, when the bias value (127) is added, will give a biased exponent of 254 (11111110b).

precisionbits is the number of bits of significand precision for the floating-point format you are using. For single precision, this value is 24 – remembering that the MSB is always 1 and so is hidden (or implicit).

Feeding the above values back into the expressions gives the following ranges:

2^-126 ≤ Value ≤ (2 - 2^-23) x 2^127 ...Positive values

-2^-126 ≥ Value ≥ -(2 - 2^-23) x 2^127 ...Negative values

These ranges can be reduced to a single range governing all values, positive and negative:

+/-(2 - 2^-23) x 2^127
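As a quick cross-check, these limits can be evaluated in C; the constants FLT_MAX and FLT_MIN from <float.h> should match the largest and smallest normalized magnitudes. This is an illustrative host-side sketch, not WB_FPU code:

#include <float.h>
#include <math.h>
#include <stdio.h>

/* Evaluate (2 - 2^-23) x 2^127 and 2^-126 and compare them with the
   single precision limits defined in <float.h>. */
int main(void)
{
    double max_normal = (2.0 - ldexp(1.0, -23)) * ldexp(1.0, 127);
    double min_normal = ldexp(1.0, -126);            /* smallest normalized value */

    printf("largest  normalized = %.9e (FLT_MAX = %.9e)\n", max_normal, FLT_MAX);
    printf("smallest normalized = %.9e (FLT_MIN = %.9e)\n", min_normal, FLT_MIN);
    return 0;
}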

The value 0 (or more specifically +0 and -0) is treated as a special case and is discussed in the next section.

Special Values

The IEEE 754 standard reserves the biased exponent values 0 (00h) and 255 (FFh), which are used in conjunction with particular significand values to denote special binary floating-point values. Table 1 lists the special values supported by the WB_FPU.

Table 1. Special values.

Sign   Exponent   Significand   Describes
0      00h        000000h       Positive zero
1      00h        000000h       Negative zero
0      FFh        000000h       Positive infinity
1      FFh        000000h       Negative infinity
0/1    FFh        > 000000h     NaN - Not a Number
Note: -0 and +0 are distinct values, but are equal for comparison purposes.
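The special values in Table 1 can be recognized purely from the bit fields. The following C sketch applies the table directly (illustrative only; the classify function name is arbitrary):

#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Classify a single precision value by inspecting its exponent and
   significand fields, following Table 1. */
static const char *classify(float value)
{
    uint32_t bits;
    memcpy(&bits, &value, sizeof bits);

    uint32_t sign        = bits >> 31;
    uint32_t exponent    = (bits >> 23) & 0xFF;
    uint32_t significand = bits & 0x7FFFFF;

    if (exponent == 0x00 && significand == 0)
        return sign ? "negative zero" : "positive zero";
    if (exponent == 0xFF && significand == 0)
        return sign ? "negative infinity" : "positive infinity";
    if (exponent == 0xFF)                   /* exponent FFh, non-zero significand */
        return "NaN";
    return "ordinary (normalized or denormalized) number";
}

int main(void)
{
    printf("%s\n", classify(-0.0f));        /* negative zero     */
    printf("%s\n", classify(INFINITY));     /* positive infinity */
    printf("%s\n", classify(NAN));          /* NaN               */
    return 0;
}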

A Word about NaNs

NaN (Not a Number) is used to represent a result that is not a real number. As illustrated in Table 1, NaNs are represented by a bit pattern with an exponent of all ones (FFh) and a non-zero significand. The sign bit can be 0 or 1 – it has no bearing.

There are two categories of NaNs:

  • QNaN (Quiet NaN) – arising when the result of an arithmetic operation is mathematically undefined. The MSB of the significand is '1' for this type of NaN.
  • SNaN (Signaling NaN) – used to signal an exception when an invalid operation is performed. The MSB of the significand is '0' for this type of NaN.


The difference between QNaN and SNaN is not implemented in the WB_FPU.
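For reference, host software that does need the distinction can test the MSB of the significand (bit 22 of the 32-bit word). A minimal, hypothetical C helper might look like this:

#include <stdint.h>
#include <string.h>

/* Distinguish quiet from signaling NaNs by testing the MSB (bit 22) of the
   23-bit significand; assumes the value is already known to be a NaN. */
int is_quiet_nan(float value)
{
    uint32_t bits;
    memcpy(&bits, &value, sizeof bits);
    return (bits & 0x00400000u) != 0;   /* MSB of the significand set => QNaN */
}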

Normalized Numbers

The majority of numbers in the IEEE 754 floating-point format are normalized. Such a value has an implicit '1' as the hidden 24th bit of the significand, followed directly by the radix point. A normalized floating-point value can be summarized using the following expression:

NormalizedFloatValue = s x 2^(e-b) x 1.f

where,

s is defined by the sign bit and is +1 for sign=0 and -1 for sign=1.

e is the biased exponent.

b is the bias value of 127.

f is the 23-bit 'fractional value' of the significand, to the right of the radix point.

The 24-bit significand (1.f) of a normalized binary floating-point number will therefore always be in the range:

1 ≤ significand < 2
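The expression above can be verified numerically. This hypothetical C sketch extracts the fields of a float and rebuilds the value as s x 2^(e-b) x 1.f:

#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Rebuild a normalized value from its fields using s x 2^(e-b) x 1.f and
   check it against the original float. */
int main(void)
{
    float original = 276.0f;
    uint32_t bits;
    memcpy(&bits, &original, sizeof bits);

    double s = (bits >> 31) ? -1.0 : 1.0;
    int    e = (int)((bits >> 23) & 0xFF);               /* biased exponent    */
    double f = (double)(bits & 0x7FFFFF) / (1 << 23);    /* fractional part .f */

    double rebuilt = s * ldexp(1.0 + f, e - 127);        /* s x 2^(e-127) x 1.f */
    printf("rebuilt = %g (original = %g)\n", rebuilt, original);
    return 0;
}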

Denormalized Numbers

When a calculation involving two floating-point values results in an exponent that is too small to be properly represented – that is, a magnitude below 1.0 x 2^-126, the smallest normalized value – an underflow event occurs. The IEEE 754 standard includes the provision for 'gradual underflow', through the use of denormalized (or subnormal) numbers.

A denormalized number has a biased exponent of zero. When a number becomes denormalized, the significand is shifted one bit to the right so that the previously hidden 24th bit is included explicitly; for a denormalized number this leading bit is 0. To compensate, the un-biased exponent is incremented by 1 (from -127 to -126).

A denormalized floating-point value can be summarized using the following expression:

DenormalizedFloatValue = s x 2^-126 x 0.f

The actual term for the un-biased exponent is 2^(e-b+1), but as the biased exponent (e) is 0 for denormalized numbers and the bias (b) is 127, this reduces to 2^-126.
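For completeness, this is how a host that does support denormalized numbers would decode one using s x 2^-126 x 0.f (an illustrative C sketch; the constant 1.0e-41f is simply a value chosen from the subnormal range):

#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Decode a denormalized value as s x 2^-126 x 0.f; the hidden bit is 0
   and the stored (biased) exponent is 0. */
int main(void)
{
    float tiny = 1.0e-41f;                   /* falls in the subnormal range */
    uint32_t bits;
    memcpy(&bits, &tiny, sizeof bits);

    double s = (bits >> 31) ? -1.0 : 1.0;
    double f = (double)(bits & 0x7FFFFF) / (1 << 23);    /* 0.f */

    printf("biased exponent = %u\n", (unsigned)((bits >> 23) & 0xFF));   /* 0 */
    printf("decoded value   = %g (original %g)\n", s * ldexp(f, -126), tiny);
    return 0;
}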
 

Denormalized numbers are not implemented in the WB_FPU. Instead, numbers with an exponent overflow will be rounded to infinity and numbers with an exponent underflow will be rounded to zero.

Floating-Point Algorithms

The following sections look at the algorithms defined by the IEEE 754 standard for conversion and basic arithmetic, as implemented in the WB_FPU.

Conversions

The WB_FPU provides conversion from integer to the IEEE 754 single precision floating-point format and vice-versa.
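The integer-to-float direction can be illustrated with a simplified, truncating C sketch (hypothetical code handling only non-negative inputs, with no rounding): locate the leading 1, derive the biased exponent from its position, and keep the next 23 bits as the significand.

#include <stdint.h>
#include <stdio.h>

/* Simplified conversion of a non-negative 32-bit integer to the single
   precision bit pattern: find the leading 1, bias its position to form
   the exponent and keep the next 23 bits (truncating, no rounding). */
uint32_t int_to_float_bits(uint32_t n)
{
    if (n == 0)
        return 0;                                /* +0 is the all-zero pattern */

    int msb = 31;
    while (!(n & (1u << msb)))                   /* position of the leading 1  */
        msb--;

    uint32_t exponent = (uint32_t)(msb + 127);   /* biased exponent            */
    uint32_t frac;
    if (msb > 23)
        frac = (n >> (msb - 23)) & 0x7FFFFF;     /* drop bits that do not fit  */
    else
        frac = (n << (23 - msb)) & 0x7FFFFF;     /* pad with zeroes on the right */

    return (exponent << 23) | frac;
}

int main(void)
{
    printf("0x%08X\n", (unsigned)int_to_float_bits(276));   /* prints 0x438A0000 */
    return 0;
}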

Calculations

The WB_FPU supports the following calculations involving two floating-point values – Addition, Subtraction, Multiplication and Division.
 
Addition

Addition of two floating-point values requires that the exponents of both operands first be made equal. This involves shifting the significand of the operand with the smaller exponent to the right until both exponents are equal. The significands are then added.

When adding significands, an overflow may occur. This overflow can be corrected by shifting the resulting significand one bit position to the right and adjusting the exponent accordingly.

The addition process takes a single clock cycle.
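The alignment and carry-correction steps can be sketched in C for the simplest case of two positive, normalized operands (a hypothetical illustration only; rounding, signs, special values and large exponent differences are ignored):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Simplified addition of two positive, normalized operands: align the
   smaller exponent, add the significands and renormalize on carry. */
static float add_positive(float a, float b)
{
    uint32_t ba, bb;
    memcpy(&ba, &a, sizeof ba);
    memcpy(&bb, &b, sizeof bb);

    uint32_t ea = (ba >> 23) & 0xFF,  eb = (bb >> 23) & 0xFF;
    uint32_t ma = (ba & 0x7FFFFF) | 0x800000;   /* restore the hidden 1 */
    uint32_t mb = (bb & 0x7FFFFF) | 0x800000;

    if (ea < eb) {                              /* make operand a the larger one */
        uint32_t t;
        t = ea; ea = eb; eb = t;
        t = ma; ma = mb; mb = t;
    }
    mb >>= (ea - eb);                           /* align to the larger exponent  */

    uint32_t sum = ma + mb;                     /* 24 + 24 bits -> up to 25 bits */
    if (sum & 0x1000000) {                      /* carry out: shift right, bump exponent */
        sum >>= 1;
        ea++;
    }

    uint32_t result = (ea << 23) | (sum & 0x7FFFFF);
    float r;
    memcpy(&r, &result, sizeof r);
    return r;
}

int main(void)
{
    printf("%g\n", add_positive(276.0f, 12.5f));   /* 288.5 */
    return 0;
}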

Table 2 summarizes the possible resulting values based on additions involving the various combinations of floating-point operands, including special values.

Table 2. Addition of operands.


Subtraction

Subtraction is similar to addition. It involves making the exponents of both operands equal and then subtracting the aligned significands. To keep things simple, care is taken to ensure that the smaller operand is always subtracted from the larger one.

In order to keep the WB_FPU synthesizable at reasonable clock frequencies (e.g. 50 MHz on a Spartan Virtex-II device), the subtraction process takes two clock cycles instead of one.
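One way to picture the operand-ordering step is the following hypothetical C helper, which exploits the fact that for non-negative IEEE 754 values the bit patterns order the same way as unsigned integers; when the operands are swapped, the sign of the result must be flipped:

#include <stdint.h>
#include <string.h>

/* Order two non-negative operands so the smaller magnitude is subtracted
   from the larger one; 'negate' records whether a - b must be negated. */
void order_operands(float a, float b, float *larger, float *smaller, int *negate)
{
    uint32_t ba, bb;
    memcpy(&ba, &a, sizeof ba);
    memcpy(&bb, &b, sizeof bb);

    if (ba >= bb) { *larger = a; *smaller = b; *negate = 0; }
    else          { *larger = b; *smaller = a; *negate = 1; }  /* a - b is negative */
}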

Table 3 summarizes the possible resulting values based on subtractions involving the various combinations of floating-point operands, including special values.

Table 3. Subtraction of operands.


Multiplication

Multiplication of two floating-point values involves addition of their exponents and multiplication of their significands. The resulting sign is an XOR function of the signs of both operands.
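A simplified C sketch of this data path (hypothetical, with rounding and special values omitted) shows the sign XOR, the exponent addition with one bias of 127 removed, and the 24 x 24-bit significand product being renormalized:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Simplified multiplication of two normalized operands; assumes the
   result stays within the normalized range. */
static float mul_sketch(float a, float b)
{
    uint32_t ba, bb;
    memcpy(&ba, &a, sizeof ba);
    memcpy(&bb, &b, sizeof bb);

    uint32_t sign = (ba ^ bb) & 0x80000000u;                 /* XOR of the signs */
    int32_t  exp  = (int32_t)((ba >> 23) & 0xFF)
                  + (int32_t)((bb >> 23) & 0xFF) - 127;      /* remove one bias  */

    uint64_t prod = (uint64_t)((ba & 0x7FFFFF) | 0x800000)
                  * (uint64_t)((bb & 0x7FFFFF) | 0x800000);  /* 24 x 24 = 48 bits */

    if (prod & (1ull << 47)) {        /* product in [2,4): shift right, bump exponent */
        prod >>= 1;
        exp++;
    }
    uint32_t frac = (uint32_t)(prod >> 23) & 0x7FFFFF;       /* keep the top 23 fraction bits */

    uint32_t result = sign | ((uint32_t)exp << 23) | frac;
    float r;
    memcpy(&r, &result, sizeof r);
    return r;
}

int main(void)
{
    printf("%g\n", mul_sketch(276.0f, -0.5f));   /* -138 */
    return 0;
}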

Table 4 summarizes the possible resulting values based on multiplications involving the various combinations of floating-point operands, including special values.

Table 4. Multiplication of operands.


Division

Division of two floating-point numbers is similar to multiplication. It involves subtraction of their exponents and division of their significands. The resulting sign is an XOR function of the signs of both operands.
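The corresponding division sketch (again hypothetical, truncating rather than rounding) subtracts the exponents, re-applies the bias, and divides the significands in fixed point:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Simplified division of two normalized operands; assumes the result
   stays within the normalized range. */
static float div_sketch(float a, float b)
{
    uint32_t ba, bb;
    memcpy(&ba, &a, sizeof ba);
    memcpy(&bb, &b, sizeof bb);

    uint32_t sign = (ba ^ bb) & 0x80000000u;                 /* XOR of the signs  */
    int32_t  exp  = (int32_t)((ba >> 23) & 0xFF)
                  - (int32_t)((bb >> 23) & 0xFF) + 127;      /* re-apply the bias */

    uint64_t ma = (uint64_t)((ba & 0x7FFFFF) | 0x800000);
    uint64_t mb = (uint64_t)((bb & 0x7FFFFF) | 0x800000);
    uint64_t q  = (ma << 23) / mb;            /* 1.23 fixed-point quotient, range (0.5, 2) */

    if (!(q & 0x800000)) {                    /* quotient below 1: shift left, drop exponent */
        q <<= 1;
        exp--;
    }

    uint32_t result = sign | ((uint32_t)exp << 23) | ((uint32_t)q & 0x7FFFFF);
    float r;
    memcpy(&r, &result, sizeof r);
    return r;
}

int main(void)
{
    printf("%g\n", div_sketch(276.0f, 2.0f));   /* 138 */
    return 0;
}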

Table 5 summarizes the possible resulting values based on divisions involving the various combinations of floating-point operands, including special values.

Table 5. Division of operands.

Rounding

The IEEE 754 standard describes the following four rounding methods:

  • Truncate (round towards zero)
  • Round down (towards -infinity)
  • Round up (towards +infinity)
  • Round to nearest even.

The WB_FPU implements the last of these – round to nearest even – since this is the method used in standard C programming. This method rounds to the nearest representable value, except where the initial value lies exactly mid-way between the two nearest representable values. In such cases, the result is rounded to the candidate whose least significant bit is 0 – the 'even' value.

Rounding in the WB_FPU is performed as follows:
 
if guard and (LSB or sticky) then
   round_up();
else
   round_down();
end if;
 
LSB is the least significant bit of the significand resulting from a calculation. The Guard and Sticky bits are used as an aid in rounding and can be summarized as follows:

  • Guard bit – this is the first bit that does not fit into a significand (i.e. the bit to the right of the significand's LSB)
  • Sticky bit – this is an OR-reduced term for the complete set of bits that do not fit in the significand (i.e. all bits to the right of the guard bit).
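Putting the two together, a hypothetical C version of the test above, applied to a significand that still carries 'shift' extra low-order bits (shift >= 1 assumed), could look like this:

#include <stdint.h>

/* Round-to-nearest-even applied to a significand that has 'shift' extra
   low-order bits to discard; mirrors the guard/LSB/sticky test above. */
uint32_t round_nearest_even(uint64_t significand, unsigned shift)
{
    uint32_t result = (uint32_t)(significand >> shift);
    uint64_t guard  = (significand >> (shift - 1)) & 1u;                  /* first discarded bit */
    uint64_t sticky = (significand & ((1ull << (shift - 1)) - 1)) != 0;   /* OR of the rest      */
    uint64_t lsb    = result & 1u;                                        /* LSB of the kept bits */

    if (guard && (lsb || sticky))
        result += 1;                  /* round up; the caller renormalizes on carry */
    return result;
}

For example, discarding two bits from the pattern 1010b (a mid-way case) returns 10b, the even candidate, while 1011b rounds up to 11b.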