Floating Point

Number Representation Revisited

  • Given 32 bits (a word), what can we represent so far?
    • Signed and Unsigned Integers
    • 4 Characters (ASCII)
    • Instructions & Addresses
  • How do we encode the following: —— Floating Point
    • Real numbers (e.g. 3.14159)
    • Very large numbers (e.g. 6.02×1023)
    • Very small numbers (e.g. 6.626×10-34)–Special numbers (e.g. ∞, NaN)

Reasoning about Fractions

Big Idea: Why can’t we represent fractions? Because our bits all represent nonnegative powers of 2.
1

Representation of Fractions

Look at decimal(base 10) first: Decimal “point” signifies boundary between integer and fraction parts.
2

New Idea: Introduce a fixed “Binary Point” that signifies boundary between negative & nonnegative powers:
3

Questions:

How can we increase the range that we can represent while maintaining the precision using limited significant figures?

Now, we’re setting the binary point fixed but we know that the decimal point is never fixed.

Scientific Notation(Decimal)

4

  • Using scientific notation we can utilize the limited significant figures the most, because we only allow exactly one digit to the left of the decimal point.
  • To have a standardized binary point location is also important for consistency
    • e.g. 1 over 1 billion
    • there are a lot of ways to represent it if it’s not normalized
    • but if we use normalized notation, there is only one way to represent

Scientific Notation(Binary)

5

The notation is called floating-point because although the binary point seems like is at a standardized fixed position we can move the binary points around by changing exponent so the point is actually floating

summary

In order to come up a number representation they can represent very small and very large numbers, potentially some special numbers, we introduced an antique notation for binary which combines a fixed-point model and the exponent field for base2. This representation enables numbers to have a good range while maintaining a good precision under limited significant figures.

Translating To and From Scientific Notation

  • Consider the number $1.011_{two}×2^4$
  • To convert to ordinary number, shift the decimal to the right by 4
    • Result: $10110_{two} = 22_{ten}$
  • For negative exponents, shift decimal to the left
    • $1.011_{two}×2^{-2} => 0.01011_{two} = 0.34375_{ten}$
  • Go from ordinary number to scientific notation by shifting until in normalized form
    • $1101.001_{two} => 1.101001_{two}×2^3$

Goals for IEEE 754 Floating Point Standard

Let’s think about the elements that are needed for scientific notation. They are binary point, significand, base, exponent and the sign of the number. With these elements we can directly construct the scientific notation and get the decimal value so we need somehow find a way to store this important information into 32 bits floating point format. And that is where the standard comes to play and save the world. Because there are so many different ways to allocate bits for each element.

  • Standard arithmetic for reals for all computers
    • Important because computer representation of real numbers is approximate. Want same results on all computers
  • Keep as much precision as possible
  • Help programmer with errors in real arithmetic
    • +∞, -∞, Not-A-Number (NaN), exponent overflow, exponent underflow, +/- zero
  • Keep encoding that is somewhat compatible with two’s complement
    • E.g., +0 in Fl. Pt. is 0 in two’s complement
    • Make it possible to sort without needing to do floating-point comparisons

Floating Point Encoding: Single Precision

6

The Exponent Field

7

  • Use biased notation but with bias of -127
    • Read exponent field as unsigned, add the bias (+ (-127) = -127) to get the actual exponent
    • Exponent Field: 0 ($00000000_{two}$) to 255 ($11111111_{two}$)
    • Actual exponent: -127 ($00000000_{two}$) to 128 ($11111111_{two}$)
  • To encode in biased notation, subtract the bias (-(-127)=+127) then encode in unsigned:
    • If we had $2^1$, exp = 1 => 128 => $10000000_{two}$
    • $2^{127}$: exp = 127 => 254 => $11111110_{two}$

8

The Significand Field

9

Floating Point Encoding

10

Double Precision FP Encoding

11

Floaing Point Special Cases

Representing Zero

12

Representing ± ∞

13

Representing NaN

14

Dnorm Numbers

Representing Very Small Numbers

15

Denorm Numbers

16

Floating Point Numbers Summary

Exponent Significand Meaning
0 0 ±0
0 non-zero ±Denorm fl pt.
1-254 anything ±Norm fl.pt
255 0 ±∞
255 non-zero NaN

Floating Point Limitations

Limitation 1

17

Floating Point Gaps

18

Limitation 2

19

Questions

20

21

Summary

22

Bonus: FP Conversion Practice

Example: Convert FP to Decimal

23

Example: Scientific Notation to FP

24

Steps of Floating Point conversions

Converting From Hex and Decimal

  1. Convert to Binary
  2. Split Bits up
  3. Check If Norm/Denorm
  4. Evaluate

Converting From Decimal to Binary

  1. Convert Left Side of Decimal
  2. Convert Right Side of Decimal
  3. Combine Both Results and Normalize
  4. Convert to Binary