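Floating-point numbers are binary representations of numbers used by computer systems to approximate real numbers. Every value that can be stored is a rational number. The position of the binary point (the base-2 version of the decimal point) can vary as needed, hence the name floating point. In many programming languages, such as C and Java, the float type is a 32-bit representation (with about 6 to 7 decimal digits of precision), while the double type is a 64-bit representation (with about 15 to 17 decimal digits of precision).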
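The difference in precision can be seen with a short Python sketch. Python's own float is a 64-bit value, so the example uses the standard struct module to round a number to 32 bits and back. The helper name to_float32 is made up for this example.

```python
import struct

def to_float32(x: float) -> float:
    """Round a 64-bit Python float to the nearest 32-bit float and back."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

value = 0.1
as_single = to_float32(value)

# Print both with 17 significant digits to expose the stored value.
print(f"{value:.17g}")      # 0.10000000000000001  (64-bit)
print(f"{as_single:.17g}")  # 0.10000000149011612  (32-bit)
```

The 32-bit value agrees with 0.1 for about 8 digits, and the 64-bit value for about 17.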
The standard for floating-point numbers is IEEE 754, published by the IEEE. A number has three components. Their sizes depend on the format; the sizes given here are for the 32-bit format:
- Sign of the number. 0 for a positive number and 1 for a negative number.
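- Exponent, stored as an unsigned 8-bit number in excess-127 form (the stored value is the true exponent plus 127). The end values (0 and 255) are used for special values, so the stored exponent values for ordinary numbers are between 1 and 254, which gives true exponents from -126 to 127.
  - An exponent of 0 with a mantissa of 0 represents the value 0.
  - An exponent of 255 with a mantissa of 0 represents infinity (positive or negative, depending on the sign bit).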
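- Mantissa (also called the significand), a 24-bit binary string that always has a leading 1 for ordinary numbers. Because the leading 1 is always there, it is not stored, and only the other 23 bits are kept. The mantissa holds only 24 binary digits, which means that some values can be represented exactly and others only approximately.
  - For example, 0.5 can be represented exactly, while 1/3 can only be approximated.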
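The three components can be pulled out of a 32-bit number with a few bit operations. The following Python sketch assumes the 32-bit layout of 1 sign bit, 8 exponent bits and 23 stored mantissa bits. The function names decode_float32 and rebuild are made up for this example, and rebuild only handles ordinary numbers (stored exponents 1 to 254).

```python
import struct

def decode_float32(x: float):
    """Return (sign, stored exponent, stored mantissa bits) of x as a 32-bit float."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                 # top bit
    exponent = (bits >> 23) & 0xFF    # next 8 bits, excess-127
    fraction = bits & 0x7FFFFF        # low 23 bits, leading 1 not stored
    return sign, exponent, fraction

def rebuild(sign: int, exponent: int, fraction: int) -> float:
    """Rebuild an ordinary (non-special) value from its three components."""
    return (-1) ** sign * (1 + fraction / 2**23) * 2.0 ** (exponent - 127)

for x in (0.5, 1 / 3, 0.0, float("inf")):
    print(x, decode_float32(x))

print(rebuild(*decode_float32(0.5)))    # exactly 0.5
print(rebuild(*decode_float32(1 / 3)))  # close to, but not equal to, 1/3
```

For 0.5 the stored mantissa bits are all 0, so the value is exact. For 1/3 the repeating binary pattern is cut off after 23 stored bits, so the value is only an approximation.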
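In the 32-bit standard there is 1 sign bit, 8 exponent bits and 23 stored mantissa bits. In the 64-bit standard there is 1 sign bit, 11 exponent bits and 52 stored mantissa bits. Some extended standards are used to provide more precision, including for calculations on 64-bit values. This permits higher precision for functions that are calculated from a series representation (such as sine and cosine).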
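The effect of extra precision on a series can be sketched in Python. Python has no portable extended type, so this example compares 32-bit rounding with 64-bit arithmetic as a stand-in for the same idea. Rounding is applied once after each update, which is only a rough imitation of real 32-bit arithmetic. The names round_to_single and sine_series are made up for this example.

```python
import math
import struct

def round_to_single(x: float) -> float:
    """Round a 64-bit float to the nearest 32-bit float and back."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

def sine_series(x: float, terms: int, rnd=lambda v: v) -> float:
    """Sum x - x^3/3! + x^5/5! - ..., applying rnd after every update."""
    x = rnd(x)
    term = x
    total = x
    for k in range(1, terms):
        # Each term is the previous one times -x^2 / ((2k)(2k+1)).
        term = rnd(-term * x * x / ((2 * k) * (2 * k + 1)))
        total = rnd(total + term)
    return total

# Compare both sums against the library value of sin(1).
err_double = abs(sine_series(1.0, 10) - math.sin(1.0))
err_single = abs(sine_series(1.0, 10, round_to_single) - math.sin(1.0))
print(err_double, err_single)
```

The sum kept in 64 bits ends up much closer to the library value than the sum rounded to 32 bits at every step.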
Related pages
- Floating-point unit, a hardware unit for operating on floating-point numbers