## Floating-Point Numbers

“Floating point” refers to a set of data types that encode real numbers, including fractions and decimals. Floating-point data types allow for a varying number of digits after the decimal point, while fixed-point data types have a specific number of digits reserved before and after the decimal point. So, floating-point data types can represent a wider range of numbers than fixed-point data types.

Due to limited memory for number representation and storage, computers can represent a finite set of floating-point numbers that have finite precision. This finite precision can limit accuracy for floating-point computations that require exact values or high precision, as some numbers are not represented exactly. Despite their limitations, floating-point numbers are widely used due to their fast calculations and sufficient precision and range for solving real-world problems.

### Floating-Point Numbers in MATLAB

MATLAB® has data types for double-precision (`double`) and single-precision (`single`) floating-point numbers following IEEE® Standard 754. By default, MATLAB represents floating-point numbers in double precision. Double precision allows you to represent numbers to greater precision but requires more memory than single precision. To conserve memory, you can convert a number to single precision by using the `single` function.

You can store numbers between approximately –3.4 × 10^{38} and 3.4 × 10^{38} using either double or single precision. To store numbers of larger magnitude, up to approximately ±1.8 × 10^{308}, you must use double precision.
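For readers outside MATLAB, the single-precision range limit can be checked with a short Python sketch. The `to_single` helper is a hypothetical stand-in for MATLAB's `single` conversion, built on the standard `struct` module (Python floats are IEEE 754 doubles):

```python
import struct

def to_single(x):
    """Round a double-precision value to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

print(to_single(3.4e38))   # within the single-precision range, stored fine
try:
    to_single(3.5e38)      # beyond the single-precision range
except OverflowError:
    print("too large for single precision; keep it in double precision")
```

Packing a value beyond the single-precision range raises `OverflowError`, mirroring the limit described above.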

#### Create Double-Precision Data

Because the default numeric type for MATLAB is type `double`, you can create a double-precision floating-point number with a simple assignment statement.

```
x = 10; c = class(x)
```

c = 'double'

You can convert numeric data, characters or strings, and logical data to double precision by using the `double` function. For example, convert a signed integer to a double-precision floating-point number.

```
x = int8(-113); y = double(x)
```

y = -113

#### Create Single-Precision Data

To create a single-precision number, use the `single` function.

```
x = single(25.783);
```

You can also convert numeric data, characters or strings, and logical data to single precision by using the `single` function. For example, convert a signed integer to a single-precision floating-point number.

```
x = int8(-113); y = single(x)
```

y = single -113

#### How MATLAB Stores Floating-Point Numbers

MATLAB constructs its `double` and `single` floating-point data types according to IEEE format and follows the *round to nearest, ties to even* rounding mode by default.
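The ties-to-even rule can be observed directly in any IEEE 754 environment; here is a Python sketch (Python floats are IEEE doubles, and `2^-53` is exactly half the gap above `1`):

```python
# Round to nearest, ties to even: when a result falls exactly halfway between
# two representable doubles, the one whose last mantissa bit is 0 is chosen.
print(1 + 2**-53 == 1)                    # tie rounds down: 1 has an even mantissa
print(1 + 2**-52 + 2**-53 == 1 + 2**-51)  # tie rounds up to the even neighbor
```

Both additions land exactly midway between representable values, so the rounding direction is decided purely by the ties-to-even rule.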

A floating-point number *x* has the form:

$$x={(-1)}^{s}\cdot (1+f)\cdot {2}^{e}$$

where:

- *s* determines the sign.
- *f* is the fraction, or *mantissa*, which satisfies 0 ≤ *f* < 1.
- *e* is the exponent.

*s*, *f*, and *e* are each determined by a finite number of bits in memory, with *f* and *e* depending on the precision of the data type.

Storage of a `double` number requires 64 bits, as shown in this table.

| Bits | Width | Usage |
|---|---|---|
| `63` | `1` | Stores the sign, where `0` is positive and `1` is negative |
| `62` to `52` | `11` | Stores the exponent, biased by `1023` |
| `51` to `0` | `52` | Stores the mantissa |

Storage of a `single` number requires 32 bits, as shown in this table.

| Bits | Width | Usage |
|---|---|---|
| `31` | `1` | Stores the sign, where `0` is positive and `1` is negative |
| `30` to `23` | `8` | Stores the exponent, biased by `127` |
| `22` to `0` | `23` | Stores the mantissa |
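The bit fields in these tables can be extracted programmatically. Here is a Python sketch using the standard `struct` module; the `decompose_double` helper is illustrative, not a standard API:

```python
import struct

def decompose_double(x):
    """Split an IEEE 754 double into its sign, biased exponent, and fraction bits."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    sign     = bits >> 63              # bit 63
    exponent = (bits >> 52) & 0x7FF    # bits 62 to 52, biased by 1023
    fraction = bits & ((1 << 52) - 1)  # bits 51 to 0
    return sign, exponent, fraction

# -6.0 = (-1)^1 * (1 + 0.5) * 2^2, so the biased exponent is 1023 + 2 = 1025
s, e, f = decompose_double(-6.0)
print(s, e - 1023, f / 2**52)   # sign 1, unbiased exponent 2, fraction 0.5
```

Recombining the fields with the formula above reconstructs the original value exactly.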

### Largest and Smallest Values for Floating-Point Data Types

The double- and single-precision data types have a largest and smallest value that you can represent. Numbers outside of the representable range are assigned positive or negative infinity. However, some numbers within the representable range cannot be stored exactly due to the gaps between consecutive floating-point numbers, and these numbers can have round-off errors.

#### Largest and Smallest Double-Precision Values

Find the largest and smallest positive values that can be represented with the `double` data type by using the `realmax` and `realmin` functions, respectively.

```
m = realmax
```

m = 1.7977e+308

```
n = realmin
```

n = 2.2251e-308

`realmax` and `realmin` return normalized IEEE values. You can find the largest and smallest negative values by multiplying `realmax` and `realmin` by `-1`. Numbers greater than `realmax` or less than `-realmax` are assigned the values of positive or negative infinity, respectively.
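Python exposes the same IEEE double limits through the standard `sys.float_info`, which can serve as a cross-check on the `realmax` and `realmin` values above:

```python
import sys

print(sys.float_info.max)       # 1.7976931348623157e+308, matches realmax
print(sys.float_info.min)       # 2.2250738585072014e-308, matches realmin
print(sys.float_info.max * 2)   # past the representable range: inf
```

The smallest positive normalized double is exactly `2**-1022`, and doubling the maximum overflows to infinity, as described above.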

#### Largest and Smallest Single-Precision Values

Find the largest and smallest positive values that can be represented with the `single` data type by calling the `realmax` and `realmin` functions with the argument `"single"`.

```
m = realmax("single")
```

m = single 3.4028e+38

```
n = realmin("single")
```

n = single 1.1755e-38

You can find the largest and smallest negative values by multiplying `realmax("single")` and `realmin("single")` by `-1`. Numbers greater than `realmax("single")` or less than `-realmax("single")` are assigned the values of positive or negative infinity, respectively.

#### Largest Consecutive Floating-Point Integers

Not all integers are representable using floating-point data types. The *largest consecutive integer*, *x*, is the greatest integer for which all integers less than or equal to *x* can be exactly represented, but *x* + 1 cannot. The `flintmax` function returns the largest consecutive integer. For example, find the largest consecutive integer in double-precision floating-point format, which is 2^{53}, by using the `flintmax` function.

```
x = flintmax
```

x = 9.0072e+15

Find the largest consecutive integer in single-precision floating-point format, which is 2^{24}.

```
y = flintmax("single")
```

y = single 16777216

When you convert an integer data type to a floating-point data type, integers that are not exactly representable in floating-point format lose accuracy. `flintmax`, which is a floating-point number, is less than the greatest integer representable by integer data types using the same number of bits. For example, `flintmax` for double precision is 2^{53}, while the maximum value of type `int64` is 2^{63} – 1. Therefore, converting an integer greater than 2^{53} to double precision can result in a loss of accuracy.
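This behavior is easy to reproduce in Python, where integers are exact but `float` is an IEEE double:

```python
FLINTMAX = float(2**53)              # largest consecutive double-precision integer

print(FLINTMAX + 1 == FLINTMAX)      # True: 2**53 + 1 is not representable
print(FLINTMAX + 2 == FLINTMAX)      # False: 2**53 + 2 is representable
print(float(2**53 + 1) == FLINTMAX)  # True: converting 2**53 + 1 loses the 1
```

Above `2**53`, representable doubles are at least 2 apart, so converting the exact integer `2**53 + 1` rounds it to `2**53`.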

### Accuracy of Floating-Point Data

The accuracy of floating-point data can be affected by several factors:

- Limitations of your computer hardware — For example, hardware with insufficient memory truncates the results of floating-point calculations.
- Gaps between each floating-point number and the next larger floating-point number — These gaps are present on any computer and limit precision.

#### Gaps Between Floating-Point Numbers

You can determine the size of a gap between consecutive floating-point numbers by using the `eps` function. For example, find the distance between `5` and the next larger double-precision number.

```
e = eps(5)
```

e = 8.8818e-16

You cannot represent numbers between `5` and `5 + eps(5)` in double-precision format. If a double-precision computation returns the answer `5`, the result is accurate to within `eps(5)`. The analogous gap at `1`, `eps(1)`, is often called *machine epsilon*.

The gaps between floating-point numbers are not equal. For example, the gap between `1e10` and the next larger double-precision number is larger than the gap between `5` and the next larger double-precision number.

```
e = eps(1e10)
```

e = 1.9073e-06

Similarly, find the distance between `5` and the next larger single-precision number.

```
x = single(5); e = eps(x)
```

e = single 4.7684e-07

Gaps between single-precision numbers are larger than the gaps between double-precision numbers because there are fewer single-precision numbers. So, results of single-precision calculations are less precise than results of double-precision calculations.

When you convert a double-precision number to a single-precision number, you can determine the upper bound on the amount the number is rounded by using the `eps` function. For example, when you convert the double-precision number `3.14` to single precision, the number is rounded by at most `eps(single(3.14))`.
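Python 3.9 and later expose the same gap sizes through the standard `math.ulp`, which for positive arguments corresponds to MATLAB's `eps`:

```python
import math

print(math.ulp(5.0))    # 8.881784197001252e-16, the gap above 5 (eps(5))
print(math.ulp(1e10))   # 1.9073486328125e-06, a much larger gap
print(5.0 + math.ulp(5.0) / 4 == 5.0)   # True: values inside the gap round to 5
```

The gap above `5` is exactly `2**-50`, and the gap above `1e10` is exactly `2**-19`, since gap size scales with the magnitude of the number.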

#### Gaps Between Consecutive Floating-Point Integers

The `flintmax` function returns the largest consecutive integer in floating-point format. Above this value, consecutive floating-point integers have gaps greater than `1`.

Find the gap between `flintmax` and the next floating-point number by using `eps`:

```
format long
x = flintmax
```

x = 9.007199254740992e+15

```
e = eps(x)
```

e = 2

Because `eps(x)` is `2`, the next larger floating-point number that can be represented exactly is `x + 2`.

```
y = x + e
```

y = 9.007199254740994e+15

If you add `1` to `x`, the result is rounded to `x`.

```
z = x + 1
```

z = 9.007199254740992e+15

### Arithmetic Operations on Floating-Point Numbers

You can use a range of data types in arithmetic operations with floating-point numbers, and the data type of the result depends on the input types. However, when you perform operations with different data types, some calculations may not be exact due to approximations or intermediate conversions.

#### Double-Precision Operands

You can perform basic arithmetic operations with `double` and any of the following data types. If one or more operands are an integer scalar or array, the `double` operand must be a scalar. The result is of type `double`, except where noted otherwise.

- `single` — The result is of type `single`.
- `double`
- `int8`, `int16`, `int32`, `int64` — The result is of the same data type as the integer operand.
- `uint8`, `uint16`, `uint32`, `uint64` — The result is of the same data type as the integer operand.
- `char`
- `logical`

#### Single-Precision Operands

You can perform basic arithmetic operations with `single` and any of the following data types. The result is of type `single`.

- `single`
- `double`
- `char`
- `logical`

### Unexpected Results with Floating-Point Arithmetic

Almost all operations in MATLAB are performed in double-precision arithmetic conforming to IEEE Standard 754. Because computers represent numbers to a finite precision, some computations can yield mathematically nonintuitive results. Some common issues that can arise while computing with floating-point numbers are round-off error, cancellation, swamping, and intermediate conversions. The unexpected results are not bugs in MATLAB and occur in any software that uses floating-point numbers. For exact rational representations of numbers, consider using the Symbolic Math Toolbox™.

#### Round-Off Error

Round-off error can occur due to the finite-precision representation of floating-point numbers. For example, the number `4/3` cannot be represented exactly as a binary fraction. As a result, this calculation returns the quantity `eps(1)` rather than `0`.

```
e = 1 - 3*(4/3 - 1)
```

e = 2.2204e-16

Similarly, because `pi` is not an exact representation of π, `sin(pi)` is not exactly zero.

```
x = sin(pi)
```

x = 1.2246e-16

Round-off error is most noticeable when many operations are performed on floating-point numbers, allowing errors to accumulate and compound. A best practice is to minimize the number of operations whenever possible.
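The same round-off effects appear in any IEEE 754 environment. Here is a Python sketch reproducing both examples (Python's `float` is an IEEE double):

```python
import math

e = 1 - 3*(4/3 - 1)       # 4/3 has no exact binary representation
print(e)                  # 2.220446049250313e-16, i.e. eps(1), not 0
print(math.sin(math.pi))  # 1.2246467991473532e-16, since pi itself is rounded
```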

#### Cancellation

Cancellation can occur when you subtract a number from another number of roughly the same magnitude, as measured by `eps`. For example, `eps(2^53)` is `2`, so the numbers `2^53 + 1` and `2^53` have the same floating-point representation.

```
x = (2^53 + 1) - 2^53
```

x = 0

When possible, try rewriting computations in an equivalent form that avoids cancellations.
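A Python sketch makes the rewriting advice concrete: performing the addition in floating point loses the `1`, while an equivalent form that stays in exact integer arithmetic does not (Python integers are arbitrary precision):

```python
print((2**53 + 1.0) - 2**53)   # 0.0: the 1 is lost when 2**53 + 1.0 rounds
print((2**53 + 1) - 2**53)     # 1: exact integer arithmetic avoids the loss
```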

#### Swamping

Swamping can occur when you perform operations on floating-point numbers that differ by many orders of magnitude. For example, this calculation shows a loss of precision that makes the addition insignificant.

```
x = 1 + 1e-16
```

x = 1
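Swamping also motivates grouping small terms before adding them to a large one. A Python sketch (each `1e-16` alone is below half of `eps(1)`, but their sum is not):

```python
print(1 + 1e-16 == 1)      # True: 1e-16 is swamped by the 1

tiny = [1e-16] * 10
total = 1.0
for t in tiny:
    total += t             # each tiny term is swamped individually
print(total == 1)          # True: nothing accumulated

print(1 + sum(tiny) == 1)  # False: summing the tiny terms first preserves them
```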

#### Intermediate Conversions

When you perform arithmetic with different data types, intermediate calculations and conversions can yield unexpected results. For example, although `x` and `y` are both `0.2`, subtracting them yields a nonzero result. The reason is that `y` is first converted to `double` before the subtraction is performed, and the result of the subtraction is then converted to `single` and stored in `z`.

```
format long
x = 0.2
```

x = 0.200000000000000

```
y = single(0.2)
```

y = single 0.2000000

```
z = x - y
```

z = single -2.9802323e-09
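The same effect can be sketched in Python; the `to_single` helper below is a hypothetical stand-in for MATLAB's `single` conversion, built on the standard `struct` module:

```python
import struct

def to_single(v):
    """Round a double-precision value to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", v))[0]

x = 0.2              # double-precision 0.2
y = to_single(0.2)   # single-precision 0.2, a slightly different number
print(x - y)         # about -2.98e-09, not 0
```

Neither representation of `0.2` is exact, and they disagree at the single-precision rounding level, so the difference is nonzero.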

#### Linear Algebra

Common issues in floating-point arithmetic, such as the ones described above, can compound when applied to linear algebra problems because the related calculations typically consist of multiple steps. For example, when solving the system of linear equations `Ax = b`, MATLAB warns that the results may be inaccurate because the operand matrix `A` is ill conditioned.

```
A = diag([2 eps]); b = [2; eps]; x = A\b;
```

Warning: Matrix is close to singular or badly scaled. Results may be inaccurate. RCOND = 1.110223e-16.
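For a diagonal matrix, the condition number is simply the ratio of the largest to the smallest diagonal entry, which explains the warning. A Python sketch of that calculation, using the standard `sys.float_info.epsilon` in place of MATLAB's `eps`:

```python
import sys

eps = sys.float_info.epsilon      # eps(1) in MATLAB, about 2.22e-16
diag_A = [2.0, eps]               # the diagonal of A = diag([2 eps])
cond = max(diag_A) / min(diag_A)  # condition number of a diagonal matrix
print(cond)                       # ~9.0e+15: roughly 16 digits can be lost
print(1 / cond)                   # ~1.11e-16, comparable to the reported RCOND
```

A condition number near `1/eps` means the solution can lose essentially all of its significant digits, which is why MATLAB flags the matrix as close to singular.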
