Floating-point number

Floating-point numbers are binary representations of a finite subset of the rational numbers, used within computer systems to approximate real numbers. The position of the radix point varies with the exponent, hence the name floating point. In software, the float type is typically a 32-bit representation (with about 6-7 significant decimal digits of precision), while the double type is typically a 64-bit representation (with about 15-17 significant decimal digits).
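As a rough illustration of the precision gap between the two widths, the sketch below rounds a 64-bit value to 32 bits and back. It assumes Python, whose built-in `float` is a 64-bit double on common platforms; `to_float32` is a hypothetical helper name, built on the standard `struct` module.

```python
import struct

def to_float32(x: float) -> float:
    """Round a 64-bit Python float to the nearest 32-bit float and widen it back."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

x64 = 0.1                # nearest 64-bit value to 0.1
x32 = to_float32(0.1)    # nearest 32-bit value to 0.1

print(f"64-bit: {x64:.20f}")
print(f"32-bit: {x32:.20f}")
print(f"difference: {abs(x64 - x32):.3e}")
```

The two printed values agree only in their first several significant digits, which is where the 6-7 digit figure for the 32-bit type comes from.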

The standard for floating-point numbers is IEEE 754. Each number has three components, whose widths depend on the format (the values below are for the 32-bit format):

Sign bit. 0 for a positive number and 1 for a negative number.
Exponent, an 8-bit unsigned number stored in excess-127 (biased) form. The end values (0, 255) are reserved for special values, so the actual exponent values lie in $[-126, 127]$.
- For exponent 0 and mantissa 0, the value 0 is represented.
- For exponent 255 and mantissa 0, the value $\pm\infty$ is represented, with the sign bit selecting which.
Mantissa, 23 stored bits with an implicit leading 1, giving 24 bits of precision. Because only 24 binary digits are available, some values are represented exactly and others only approximately.
- e.g., 0.5 is represented exactly, while 1/3 is rounded to the nearest representable value.
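The three components can be inspected directly. The following is a minimal sketch, assuming the 32-bit layout of 1 sign bit, 8 exponent bits, and 23 stored mantissa bits; `decode_float32` and `describe` are hypothetical helper names, and only the standard `struct` module is used.

```python
import struct

def decode_float32(x: float) -> tuple:
    """Return (sign, stored_exponent, mantissa_field) of x rounded to 32 bits."""
    # Reinterpret the 4 bytes of the 32-bit float as an unsigned integer.
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    sign = bits >> 31                 # top bit
    exponent = (bits >> 23) & 0xFF    # next 8 bits, stored in excess-127 form
    mantissa = bits & 0x7FFFFF        # low 23 bits (the leading 1 is implicit)
    return sign, exponent, mantissa

def describe(x: float) -> None:
    sign, exponent, mantissa = decode_float32(x)
    print(f"{x!r:>20}: sign={sign} exponent={exponent:3d} mantissa={mantissa:023b}")

for value in (0.5, -0.5, 1 / 3, 0.0, float("inf")):
    describe(value)
```

For 0.5 the mantissa field is all zeros, since 0.5 is exactly $1.0 \times 2^{-1}$ (stored exponent 126). For 1/3 the field is a repeating bit pattern cut off and rounded at 23 bits, which is why that value is only approximated.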

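The 32-bit format uses 1 sign bit, 8 exponent bits, and 23 mantissa bits; the 64-bit format uses 1 sign bit, 11 exponent bits, and 52 mantissa bits. Some extended formats provide more precision still, including beyond the 64-bit case. This permits higher precision for functions evaluated through a series representation (sine, cosine), where rounding errors accumulate across the terms.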
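To give a feel for why extra precision helps with series evaluation, the sketch below sums the Taylor series for sine twice: once in ordinary 64-bit arithmetic, and once with the running term and running sum rounded to 32 bits after every step. This is only a rough emulation of narrower arithmetic, not how any particular math library computes sine; `f32` and `sine_series` are hypothetical helper names.

```python
import math
import struct

def f32(x: float) -> float:
    """Round a 64-bit float to the nearest 32-bit float."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def sine_series(x: float, terms: int = 10, rnd=lambda v: v) -> float:
    """Sum x - x^3/3! + x^5/5! - ..., applying `rnd` to the term and the sum."""
    x = rnd(x)
    term = x
    total = x
    for n in range(1, terms):
        # Each term is the previous one times -x^2 / ((2n)(2n + 1)).
        term = rnd(-term * x * x / ((2 * n) * (2 * n + 1)))
        total = rnd(total + term)
    return total

x = 1.0
err64 = abs(sine_series(x) - math.sin(x))
err32 = abs(sine_series(x, rnd=f32) - math.sin(x))
print(f"64-bit accumulation error: {err64:.3e}")
print(f"32-bit accumulation error: {err32:.3e}")
```

The same reasoning applies one level up: accumulating in a format wider than 64 bits leaves a 64-bit result with less rounding error than accumulating in 64 bits throughout.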

Floating-point unit, a hardware unit for operating on floating-point numbers

jszhn
