Floating Point

Number Representation Revisited

Given 32 bits (a word), what can we represent so far?
- Signed and Unsigned Integers
- 4 Characters (ASCII)
- Instructions & Addresses
How do we encode the following? With floating point:
- Real numbers (e.g. 3.14159)
- Very large numbers (e.g. $6.02×10^{23}$)
- Very small numbers (e.g. $6.626×10^{-34}$)
- Special numbers (e.g. ∞, NaN)

Reasoning about Fractions

Big Idea: Why can’t we represent fractions? Because our bits all represent nonnegative powers of 2.

Representation of Fractions

Look at decimal (base 10) first: the decimal “point” marks the boundary between the integer and fraction parts.

New Idea: Introduce a fixed “Binary Point” that signifies boundary between negative & nonnegative powers:
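As a rough illustration of the fixed binary point idea, the sketch below evaluates a bit string by giving each bit left of the point a nonnegative power of 2 and each bit right of the point a negative power of 2. The function name `fixed_point_value` and the string input format are assumptions made for this example, not part of the notes.

```python
def fixed_point_value(bits):
    """Interpret a string such as '10.1010' as an unsigned binary fixed-point number."""
    int_part, _, frac_part = bits.partition(".")
    value = 0.0
    # Bits left of the point weigh 2^0, 2^1, ... reading right to left.
    for i, b in enumerate(reversed(int_part)):
        value += int(b) * 2 ** i
    # Bits right of the point weigh 2^-1, 2^-2, ... reading left to right.
    for i, b in enumerate(frac_part, start=1):
        value += int(b) * 2 ** -i
    return value


print(fixed_point_value("10.1010"))  # 2.625  (2 + 1/2 + 1/8)
```

With the point pinned in one place, the number of bits on each side fixes both the largest value and the finest step that can be represented.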

Questions:

How can we increase the range that we can represent while maintaining the precision using limited significant figures?

So far we have pinned the binary point at a fixed position, but in scientific notation the point is not fixed: the exponent says where it belongs.

Scientific Notation (Decimal)

Scientific notation makes the most of a limited number of significant figures, because normalized form allows exactly one nonzero digit to the left of the decimal point.
A standardized point location also matters for consistency:
- e.g. one billionth ($1/10^9$)
- Without normalization there are many ways to write it: $0.1×10^{-8}$, $10.0×10^{-10}$, ...
- With normalized notation there is only one: $1.0×10^{-9}$

Scientific Notation (Binary)

The notation is called floating point because, although the binary point appears to sit at a standardized position (right after the leading digit), changing the exponent moves it, so the point effectively floats.
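One way to see the significand and exponent of a number is Python's standard `math.frexp`, which splits a float into a fraction in [0.5, 1) and a power of 2. The sketch below rescales that result to the normalized form used in these notes, with one leading 1 before the binary point. The helper name `normalize` is an assumption for this example, and it only handles positive inputs.

```python
import math


def normalize(x):
    """Return (significand, exponent) with 1 <= significand < 2
    and x == significand * 2**exponent. Assumes x > 0."""
    m, e = math.frexp(x)   # x == m * 2**e with 0.5 <= m < 1
    return m * 2, e - 1    # rescale so the significand lies in [1, 2)


print(normalize(22.0))     # (1.375, 4)   i.e. 1.011_two x 2^4
print(normalize(0.34375))  # (1.375, -2)  i.e. 1.011_two x 2^-2
```

Both values share the significand 1.375 ($1.011_{two}$); only the exponent differs, which is the sense in which the point floats.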

Summary

To come up with a number representation that can represent very small and very large numbers, and potentially some special numbers, we introduced a scientific notation for binary, which combines a fixed-point significand with an exponent field for base 2. This representation gives numbers a wide range while keeping good precision with a limited number of significant figures.

Translating To and From Scientific Notation

Consider the number $1.011_{two}×2^4$
To convert to an ordinary number, shift the binary point to the right by 4
- Result: $10110_{two} = 22_{ten}$
For negative exponents, shift the binary point to the left
- $1.011_{two}×2^{-2} => 0.01011_{two} = 0.34375_{ten}$
Go from ordinary number to scientific notation by shifting until in normalized form
- $1101.001_{two} => 1.101001_{two}×2^3$
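The conversions above can be checked with a small sketch. It treats the significand as an integer and folds the position of the binary point into the exponent, which is equivalent to shifting the point. The function name `from_binary_scientific` and its string-plus-integer interface are assumptions for this example.

```python
def from_binary_scientific(significand, exponent):
    """Value of significand_two x 2^exponent, e.g. ('1.011', 4) -> 22.0."""
    digits = significand.replace(".", "")
    # Number of bits to the right of the binary point.
    frac_len = len(significand.partition(".")[2])
    # Dropping the point multiplies by 2^frac_len, so compensate in the exponent.
    return int(digits, 2) * 2.0 ** (exponent - frac_len)


print(from_binary_scientific("1.011", 4))     # 22.0
print(from_binary_scientific("1.011", -2))    # 0.34375
print(from_binary_scientific("1101.001", 0))  # 13.125
print(from_binary_scientific("1.101001", 3))  # 13.125
```

The last two lines agree, confirming that normalizing $1101.001_{two}$ to $1.101001_{two}×2^3$ leaves the value unchanged.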

Goals for IEEE 754 Floating Point Standard

Let’s think about the elements that scientific notation needs: the binary point, the significand, the base, the exponent, and the sign of the number. With these elements we can construct the number directly, so we need a way to store this information in a 32-bit floating-point format. There are many different ways to allocate bits to each element, and that is where the standard comes in: it fixes one allocation for everyone.

Standard arithmetic for reals on all computers
- Important because the computer representation of real numbers is approximate, and we want the same results on all computers
Keep as much precision as possible
Help the programmer with errors in real arithmetic
- +∞, -∞, Not-A-Number (NaN), exponent overflow, exponent underflow, +/- zero
Keep an encoding that is somewhat compatible with two’s complement
- E.g., +0 in floating point is 0 in two’s complement
- Make it possible to sort without needing to do floating-point comparisons
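The compatibility goals can be observed with Python's standard `struct` module, which exposes the 32-bit pattern of a single-precision value. The sketch below is only an illustration: the helper name `float_bits` and the sample values are assumptions, and the ordering check is limited to nonnegative values because negative floats use a sign bit rather than two's complement.

```python
import struct


def float_bits(x):
    """Return the 32-bit single-precision pattern of x as an unsigned integer."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


# +0 in floating point has the same bit pattern as integer 0.
print(float_bits(0.0))        # 0
print(hex(float_bits(1.0)))   # 0x3f800000

# For nonnegative values, sorting by bit pattern as an integer
# gives the same order as sorting by floating-point value.
values = [3.0, 0.25, 1e10, 0.0, 1.5]
print(sorted(values, key=float_bits) == sorted(values))  # True
```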

Floating Point Encoding: Single Precision

The Exponent Field

Use biased notation with a bias of -127
- Read the exponent field as unsigned, then add the bias to get the actual exponent: actual = field + (-127) = field - 127
- Exponent Field: 0 ($00000000_{two}$) to 255 ($11111111_{two}$)
- Actual exponent: -127 ($00000000_{two}$) to 128 ($11111111_{two}$)
To encode in biased notation, subtract the bias (that is, add 127: exp - (-127) = exp + 127), then encode the result as unsigned:
- If we had $2^1$, exp = 1 => 128 => $10000000_{two}$
- $2^{127}$: exp = 127 => 254 => $11111110_{two}$
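The encode and decode rules above can be sketched as follows, keeping the notes' convention that the bias is -127 and is added when reading the field. The names `BIAS`, `encode_exponent`, and `decode_exponent` are assumptions for this example. The final check additionally assumes the standard single-precision layout, in which the exponent field occupies bits 23 to 30.

```python
import struct

BIAS = -127  # convention used in these notes: actual = field + BIAS


def encode_exponent(actual):
    """Return the 8-bit exponent field (as a bit string) for an actual exponent."""
    field = actual - BIAS  # subtracting the bias adds 127
    if not 0 <= field <= 255:
        raise ValueError("exponent does not fit in 8 bits")
    return format(field, "08b")


def decode_exponent(field_bits):
    """Return the actual exponent for an 8-bit exponent field."""
    return int(field_bits, 2) + BIAS


print(encode_exponent(1))           # 10000000
print(encode_exponent(127))         # 11111110
print(decode_exponent("00000000"))  # -127
print(decode_exponent("11111111"))  # 128

# Cross-check against a real float: 2.0 is 1.0 x 2^1, so its field should be 128.
bits = struct.unpack(">I", struct.pack(">f", 2.0))[0]
print((bits >> 23) & 0xFF)          # 128
```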