Floating Point
Number Representation Revisited
- Given 32 bits (a word), what can we represent so far?
- Signed and Unsigned Integers
- 4 Characters (ASCII)
- Instructions & Addresses
- How do we encode the following? This is what floating point is for:
- Real numbers (e.g. 3.14159)
- Very large numbers (e.g. $6.02×10^{23}$)
- Very small numbers (e.g. $6.626×10^{-34}$)
- Special numbers (e.g. ∞, NaN)
Reasoning about Fractions
Big Idea: Why can’t we represent fractions? Because our bits all represent nonnegative powers of 2.
Representation of Fractions
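Look at decimal (base 10) first: the decimal point marks the boundary between the integer and fraction parts.
New Idea: Introduce a fixed “binary point” that marks the boundary between negative and nonnegative powers of 2. For example, with the point after the second bit, $10.1010_{two} = 2 + 0.5 + 0.125 = 2.625_{ten}$.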
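As a minimal sketch of the fixed binary point idea, the hypothetical helper `fixed_point_value` (a name chosen here for illustration, not taken from the lecture) evaluates a bit string that contains a binary point. Bits to the left of the point carry nonnegative powers of 2, and bits to the right carry negative powers.

```python
def fixed_point_value(bits: str) -> float:
    """Evaluate a binary string such as '10.1010' that has a fixed binary point."""
    int_part, _, frac_part = bits.partition(".")
    value = 0.0
    # Bits left of the point: weights 2^0, 2^1, ... reading right to left.
    for power, bit in enumerate(reversed(int_part)):
        value += int(bit) * 2 ** power
    # Bits right of the point: weights 2^-1, 2^-2, ... reading left to right.
    for power, bit in enumerate(frac_part, start=1):
        value += int(bit) * 2 ** -power
    return value


print(fixed_point_value("10.1010"))  # 2 + 0.5 + 0.125 = 2.625
```

With the point fixed, adding fraction bits takes bits away from the integer part, which is the range-versus-precision tension raised in the questions that follow.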
Questions:
How can we increase the range that we can represent while maintaining the precision using limited significant figures?
Right now we fix the binary point in one position, but the point in an ordinary decimal number is not tied to one position.
Scientific Notation (Decimal)
- Scientific notation makes the most of a limited number of significant figures, because the normalized form allows exactly one nonzero digit to the left of the decimal point.
- A standardized location for the point also matters for consistency.
- e.g. 1 over 1 billion
- Without normalization, there are many ways to write it.
- With normalized notation, there is only one way to write it.
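To make the 1 over 1 billion example concrete, here are three ways of writing the same value. Only the first is normalized, with exactly one nonzero digit to the left of the decimal point:

$$\frac{1}{1{,}000{,}000{,}000} = 1.0\times10^{-9} = 0.1\times10^{-8} = 10.0\times10^{-10}$$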
Scientific Notation (Binary)
The notation is called floating point because, although the binary point appears to sit at a standardized fixed position, changing the exponent moves it around, so the point actually floats.
Summary
To get a number representation that covers very small numbers, very large numbers, and potentially some special numbers, we introduced scientific notation for binary, which combines a fixed binary point with an exponent for base 2. This representation gives numbers a good range while keeping good precision with a limited number of significant figures.
Translating To and From Scientific Notation
- Consider the number $1.011_{two}×2^4$
- To convert to an ordinary number, shift the binary point to the right by 4
- Result: $10110_{two} = 22_{ten}$
- For negative exponents, shift the binary point to the left
- $1.011_{two}×2^{-2} => 0.01011_{two} = 0.34375_{ten}$
- Go from ordinary number to scientific notation by shifting until in normalized form
- $1101.001_{two} => 1.101001_{two}×2^3$
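The shifting rules above can be sketched in Python. The helper names `from_scientific` and `to_scientific` are hypothetical, and the sketch assumes nonzero inputs written as strings of binary digits.

```python
def from_scientific(significand: str, exponent: int) -> str:
    """Shift the binary point of e.g. '1.011' by `exponent` places."""
    digits = significand.replace(".", "")
    point = significand.index(".") + exponent  # new position of the binary point
    if point <= 0:
        # The point moved past the left end: pad with leading zeros.
        return "0." + "0" * (-point) + digits
    if point >= len(digits):
        # The point moved past the right end: pad with trailing zeros.
        return digits + "0" * (point - len(digits))
    return digits[:point] + "." + digits[point:]


def to_scientific(number: str) -> tuple:
    """Normalize e.g. '1101.001' so exactly one 1 sits left of the point."""
    if "." not in number:
        number += "."
    point = number.index(".")
    digits = number.replace(".", "")
    first_one = digits.index("1")  # assumes the input is nonzero
    exponent = point - first_one - 1
    rest = digits[first_one + 1:].rstrip("0")
    return "1." + (rest or "0"), exponent


print(from_scientific("1.011", 4))   # 10110
print(from_scientific("1.011", -2))  # 0.01011
print(to_scientific("1101.001"))     # ('1.101001', 3)
```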
Goals for IEEE 754 Floating Point Standard
Consider the elements that scientific notation needs: the binary point, the significand, the base, the exponent, and the sign of the number. With these we can construct the scientific notation and recover the decimal value, so we need a way to store this information in a 32-bit floating point format. There are many different ways to allocate bits to each element, which is why a standard is needed.
- Standard arithmetic for reals for all computers
- Important because computer representation of real numbers is approximate. Want same results on all computers
- Keep as much precision as possible
- Help programmer with errors in real arithmetic
- +∞, -∞, Not-A-Number (NaN), exponent overflow, exponent underflow, +/- zero
- Keep encoding that is somewhat compatible with two’s complement
- E.g., +0 in Fl. Pt. is 0 in two’s complement
- Make it possible to sort without needing to do floating-point comparisons
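The sorting goal can be checked with a small experiment. The sketch below uses Python's standard `struct` module to read the 32-bit pattern of a single-precision value as an unsigned integer (the helper name `float_bits` is made up for this example). It is restricted to non-negative values, since negative numbers would need extra handling of the sign bit.

```python
import struct


def float_bits(x: float) -> int:
    """Return the single-precision bit pattern of x as an unsigned 32-bit integer."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


values = [3.5, 0.0, 1.0e-3, 2.0, 6.02e23, 0.75]

# Sort using only integer comparisons on the bit patterns.
by_bits = sorted(values, key=float_bits)

print(by_bits == sorted(values))  # True: same order as floating-point comparison
print(float_bits(0.0))            # 0: +0 has the all-zero bit pattern
```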
Floating Point Encoding: Single Precision
The Exponent Field
- Use biased notation with a bias of -127
- Read the exponent field as unsigned, then add the bias (+ (-127) = -127) to get the actual exponent
- Exponent Field: 0 ($00000000_{two}$) to 255 ($11111111_{two}$)
- Actual exponent: -127 ($00000000_{two}$) to 128 ($11111111_{two}$)
- To encode in biased notation, subtract the bias (-(-127)=+127) then encode in unsigned:
- If we had $2^1$, exp = 1 => 128 => $10000000_{two}$
- $2^{127}$: exp = 127 => 254 => $11111110_{two}$
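The two directions of biased notation can be sketched as follows, keeping the sign convention used above (bias of -127, added when reading and subtracted when encoding). The function names are hypothetical.

```python
BIAS = -127  # single-precision bias, in the sign convention of these notes


def decode_exponent(field: int) -> int:
    """Read the 8-bit field as unsigned, then add the bias."""
    return field + BIAS


def encode_exponent(exponent: int) -> str:
    """Subtract the bias, then encode the result as 8 unsigned bits."""
    field = exponent - BIAS
    if not 0 <= field <= 255:
        raise ValueError("exponent does not fit in the 8-bit field")
    return format(field, "08b")


print(encode_exponent(1))    # 10000000
print(encode_exponent(127))  # 11111110
print(decode_exponent(0), decode_exponent(255))  # -127 128
```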
The Significand Field
Floating Point Encoding
Double Precision FP Encoding
Floating Point Special Cases
Representing Zero
Representing ± ∞
Representing NaN
Denorm Numbers
Representing Very Small Numbers
Denorm Numbers
Floating Point Numbers Summary
| Exponent | Significand | Meaning |
|---|---|---|
| 0 | 0 | ±0 |
| 0 | non-zero | ±Denorm floating point |
| 1-254 | anything | ±Norm floating point |
| 255 | 0 | ±∞ |
| 255 | non-zero | NaN |
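The table translates directly into a classifier for 32-bit patterns. This is a minimal sketch that assumes the standard IEEE 754 single-precision layout of 1 sign bit, 8 exponent bits, and 23 significand bits. The function name `classify` and its return labels are chosen here for illustration.

```python
def classify(bits: int) -> str:
    """Classify a 32-bit single-precision pattern using the summary table."""
    sign = "-" if bits >> 31 else "+"
    exponent = (bits >> 23) & 0xFF  # 8-bit exponent field
    significand = bits & 0x7FFFFF   # 23-bit significand field
    if exponent == 0:
        return sign + ("0" if significand == 0 else "denorm")
    if exponent == 255:
        return sign + "inf" if significand == 0 else "NaN"
    return sign + "norm"


print(classify(0x00000000))  # +0
print(classify(0x80000000))  # -0
print(classify(0x00000001))  # +denorm
print(classify(0x3F800000))  # +norm
print(classify(0xFF800000))  # -inf
print(classify(0x7FC00000))  # NaN
```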
Floating Point Limitations
Limitation 1
Floating Point Gaps
Limitation 2
Questions
Summary
Bonus: FP Conversion Practice
Example: Convert FP to Decimal
Example: Scientific Notation to FP
Steps of Floating Point conversions
Converting From Hex to Decimal
- Convert to Binary
- Split Bits up
- Check If Norm/Denorm
- Evaluate
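The four steps above can be sketched as a small decoder. The function name `decode_single` is hypothetical. The details the outline does not spell out (the implicit leading 1 for normalized numbers, and the fixed exponent of -126 with no implicit 1 for denorms) are standard IEEE 754 single-precision facts.

```python
def decode_single(hex_word: str) -> float:
    """Decode a 32-bit hex word such as 'C0A00000' as a single-precision value."""
    # Step 1: convert to binary (a 32-character bit string).
    bits = format(int(hex_word, 16), "032b")
    # Step 2: split the bits into sign (1), exponent (8), significand (23).
    sign = -1.0 if bits[0] == "1" else 1.0
    exp_field = int(bits[1:9], 2)
    fraction = int(bits[9:], 2) / 2 ** 23
    # Step 3: check for special cases, denorm, or norm.
    if exp_field == 255:
        return float("nan") if fraction else sign * float("inf")
    if exp_field == 0:
        # Denorm (or zero): no implicit leading 1, exponent fixed at -126.
        return sign * fraction * 2.0 ** -126
    # Step 4: evaluate a normalized number with its implicit leading 1.
    return sign * (1 + fraction) * 2.0 ** (exp_field - 127)


print(decode_single("C0A00000"))  # -5.0
print(decode_single("3F800000"))  # 1.0
```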
Converting From Decimal to Binary
- Convert Left Side of Decimal
- Convert Right Side of Decimal
- Combine Both Results and Normalize
- Convert to Binary
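The reverse direction can be sketched in the same way. The hypothetical helper `encode_single` follows the four steps above for nonzero values in the normalized range and returns the 32-bit pattern as a bit string. As a simplifying assumption, it truncates significand bits that do not fit, whereas real hardware rounds, so results may differ in the last bit for values that are not exactly representable.

```python
def encode_single(value: float) -> str:
    """Encode a nonzero, normalized-range value as a 32-character bit string."""
    sign = 1 if value < 0 else 0
    value = abs(value)
    # Step 1: convert the left side of the decimal point to binary.
    whole = int(value)
    left = format(whole, "b") if whole else ""
    # Step 2: convert the right side by repeated doubling.
    frac = value - whole
    right = ""
    while frac and len(right) < 160:
        frac *= 2
        bit = int(frac)
        right += str(bit)
        frac -= bit
    # Step 3: combine both results and normalize to 1.xxx * 2^exponent.
    digits = left + right
    first_one = digits.index("1")  # fails for zero, which this sketch excludes
    exponent = len(left) - first_one - 1
    if not -126 <= exponent <= 127:
        raise ValueError("outside the normalized single-precision range")
    significand = (digits[first_one + 1:] + "0" * 23)[:23]  # truncate, no rounding
    # Step 4: pack sign, biased exponent field, and significand into 32 bits.
    word = (sign << 31) | ((exponent + 127) << 23) | int(significand, 2)
    return format(word, "032b")


print(format(int(encode_single(-5.0), 2), "08X"))     # C0A00000
print(format(int(encode_single(0.15625), 2), "08X"))  # 3E200000
```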