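Floating-point numbers are binary representations of real numbers within computer systems; only a finite subset of the rational numbers can be represented exactly. The position of the radix point (the binary analogue of the decimal point) varies as needed, hence "floating point". In software, the float type is typically a 32-bit representation (with 6-7 significant decimal digits of precision), while the double type is typically a 64-bit representation (with 15-17 significant decimal digits).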
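The difference in precision between the two widths can be sketched as follows. This is a minimal illustration, assuming Python, whose built-in float is a 64-bit double; the helper to_float32 is a hypothetical name introduced here, and it uses the standard struct module to round a value to 32 bits and back.

```python
import struct


def to_float32(x: float) -> float:
    """Round a 64-bit Python float to the nearest 32-bit float, then widen it again."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


value = 0.1                      # not exactly representable in binary
as_single = to_float32(value)    # nearest 32-bit value
print(f"64-bit: {value:.17g}")
print(f"32-bit: {as_single:.17g}")
print(f"difference: {abs(value - as_single):.3g}")
```

A value such as 0.5 survives the round trip unchanged, while 0.1 does not, because 0.1 has no finite binary expansion.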

The standard for floating-point numbers is IEEE 754. A number has three components, whose widths depend on the format:

  • Sign of the number. 0 for a positive number and 1 for a negative number.
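  • Exponent, stored as an unsigned excess (biased) number. In the 32-bit format it has 8 bits with a bias of 127. The end values (0, 255) are used for special values, so the stored exponent values of ordinary numbers are between 1 and 254, corresponding to actual exponents between -126 and 127.
    • For exponent 0 and mantissa 0, the value 0 is represented.
    • For exponent 255 and mantissa 0, the value ∞ (infinity) is represented.
  • Mantissa. In the 32-bit format, 23 bits are stored, and ordinary (normalized) numbers have an implicit leading 1, giving 24 binary digits of precision. Since the mantissa holds only 24 digits, we can represent some values exactly and not others.
    • e.g., 0.5 can be represented exactly, whereas 1/3 can only be approximated.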
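The three components can be pulled out of a 32-bit value with a few bit operations. The following is a sketch, assuming Python and the standard struct module; the function names decode_float32 and rebuild_normal are hypothetical, and rebuild_normal only handles ordinary (normalized) numbers, not the special exponent values 0 and 255.

```python
import struct


def decode_float32(x: float) -> tuple[int, int, int]:
    """Return (sign, stored exponent, stored mantissa) of x rounded to 32 bits."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31                 # 1 bit
    exponent = (bits >> 23) & 0xFF    # 8 bits, biased by 127
    mantissa = bits & 0x7FFFFF        # 23 stored bits (leading 1 is implicit)
    return sign, exponent, mantissa


def rebuild_normal(sign: int, exponent: int, mantissa: int) -> float:
    """Recompute the value of a normalized number from its three fields."""
    return (-1) ** sign * (1 + mantissa / 2**23) * 2.0 ** (exponent - 127)


for v in (0.5, 1 / 3, -1.0, 0.0, float("inf")):
    s, e, m = decode_float32(v)
    print(f"{v!r}: sign={s} exponent={e} mantissa=0x{m:06X}")
```

For 0.5 the stored mantissa is zero, so the value is exact; for 1/3 the mantissa is a repeating bit pattern cut off (and rounded) at 23 bits, which is why it can only be approximated.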

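The 32-bit and 64-bit formats are laid out as follows:

  • 32-bit (single precision): 1 sign bit, 8 exponent bits, 23 mantissa bits.
  • 64-bit (double precision): 1 sign bit, 11 exponent bits, 52 mantissa bits.

Some extended formats are used to provide more precision than the 64-bit format. This permits higher accuracy for functions evaluated from a series representation (sine, cosine).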
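The effect of working precision on a series evaluation can be illustrated by summing the Taylor series of sine at two widths. This is only a sketch under stated assumptions: Python has no built-in extended format, so the comparison here is 32-bit against 64-bit rather than 64-bit against an extended format, and the 32-bit arithmetic is simulated by rounding after each update. The names to_float32 and sin_series are hypothetical.

```python
import math
import struct


def to_float32(x: float) -> float:
    """Round a 64-bit Python float to the nearest 32-bit float."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def sin_series(x: float, terms: int = 12, round_fn=lambda v: v) -> float:
    """Sum the Taylor series of sin(x), applying round_fn after each update."""
    x = round_fn(x)
    term = x       # first term of the series: x
    total = x
    for n in range(1, terms):
        # Next term: previous term times -x^2 / ((2n)(2n+1)).
        term = round_fn(-term * x * x / ((2 * n) * (2 * n + 1)))
        total = round_fn(total + term)
    return total


x = 1.0
err64 = abs(sin_series(x) - math.sin(x))
err32 = abs(sin_series(x, round_fn=to_float32) - math.sin(x))
print(f"64-bit error: {err64:.3g}")
print(f"32-bit error: {err32:.3g}")
```

The same reasoning applies one step up: carrying the intermediate sums in a wider format than the result leaves fewer rounding errors in the final digits.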