I believe I should add a hardware designer’s perspective to this since I design and build floating point hardware. Knowing the origin of the error may help in understanding what is happening in the software, and ultimately, I hope this helps explain the reasons for why floating point errors happen, and seem to accumulate over time.

Most processors follow the IEEE-754 standard but some use denormalized, or different standards . For example, there is a denormalized mode in IEEE-754 which allows representation of very small floating point numbers at the expense of precision. The following, however, will cover the normalized mode of IEEE-754 which is the typical mode of operation.

The main cause of the error in floating point division, are is the division algorithms used to calculate the quotient. Most computer systems calculate division using multiplication by an inverse, mainly in Z=X/Y, Z = X * (1/Y). Division A division is computed iteratively i.e. each cycle computes some bits of the quotient until the desired precision is reached, which for IEEE-754 is anything with an error of less than one unit in the last place. The table of reciprocals of Y (1/Y) is known as the quotient selection table (QST) in the slow division, and the size in bits of the quotient selection table is usually the width of the radix, or a number of bits of the quotient computed in each iteration, plus a few guard bits. For the IEEE-754 standard, double precision (64-bit), it would be the size of the radix of the divider, plus a few guard bits k, where k>=2. So for example, a typical Quotient Selection Table for a divider that computes 2 bits of the quotient at a time (radix 4) would be 2+2= 4 bits (plus a few optional bits).

What reciprocals are in the quotient selection table depend on the division method: slow division such as SRT division, or fast division such as Goldschmidt division; each entry is modified according to the division algorithm in an attempt to yield the lowest possible error. In any case, though, all reciprocals are approximations of the actual reciprocal, and introduce some element of error. Both slow division and fast division methods calculate the quotient iteratively, i.e. some number of bits of the quotient are calculated each step, then the result is subtracted from the dividend, and the divider repeats the steps until the error is less than one half of one unit in the last place. Slow division methods calculate a fixed number of digits of the quotient in each step and are usually less expensive to build, and fast division methods calculate a variable number of digits per step and are usually more expensive to build. The most important part of the division methods is that most of them rely upon repeated multiplication by an approximation of a reciprocal, so they are prone to error.

Another cause of the rounding errors in all operations are the different modes of truncation of the final answer that IEEE-754 allows. There's truncate, round-towards-zero, round-to-nearest (default), round-down, and round-up. All methods introduce an element of error of less than one unit in the last place for a single operation. Over time and repeated operations, truncation also adds cumulatively to the resultant error. This truncation error, is especially problematic in exponentiation, which involves some form of repeated multiplication.

Since the hardware that does the floating point calculations only needs to yield a result with an error of less than one half of one unit in the last place for a single operation, the error will grow over repeated operations if not watched. This is the reason that in computations that require a bounded error, mathematicians use methods such as using the round-to-nearest even digit in the last place of IEEE-754, because, over time, the errors are more likely to cancel each other out, and Interval Arithmetic combined with variations of the IEEE 754 rounding modes to predict rounding errors, and correct them. Because of its low relative error compared to other rounding modes, round to nearest even digit (in the last place), is the default rounding mode of IEEE-754.

I believe I should add a hardware designer’s perspective to this since I design and build floating point hardware. Knowing the origin of the error may help in understanding what is happening in the software, and ultimately, I hope this helps explain the reasons for why floating point errors happen, and seem to accumulate over time.

Most processors follow the IEEE-754 standard but some use denormalized, or different standards . For example, there is a denormalized mode in IEEE-754 which allows representation of very small floating point numbers at the expense of precision. The following however, will cover the normalized mode of IEEE-754 which is the typical mode of operation.

The main cause of the error in floating point division, are the division algorithms used to calculate the quotient. Most computer systems calculate division using multiplication by an inverse, mainly in Z=X/Y, Z = X * (1/Y). Division is computed iteratively i.e. each cycle computes some bits of the quotient until the desired precision is reached, which for IEEE-754 is anything with an error of less than one unit in the last place. The table of reciprocals of Y (1/Y) is known as the quotient selection table (QST) in slow division, and the size in bits of the quotient selection table is usually the width of the radix, or number of bits of the quotient computed in each iteration, plus a few guard bits. For the IEEE-754 standard, double precision (64-bit), it would be the size of the radix of the divider, plus a few guard bits k, where k>=2. So for example, a typical Quotient Selection Table for a divider that computes 2 bits of the quotient at a time (radix 4) would be 2+2= 4 bits (plus a few optional bits).

What reciprocals are in the quotient selection table depend on the division method: slow division such as SRT division, or fast division such as Goldschmidt division; each entry is modified according to the division algorithm in an attempt to yield the lowest possible error. In any case though, all reciprocals are approximations of the actual reciprocal, and introduce some element of error. Both slow division and fast division methods calculate the quotient iteratively, i.e. some number of bits of the quotient are calculated each step, then the result is subtracted from the dividend, and the divider repeats the steps until the error is less than one half of one unit in the last place. Slow division methods calculate a fixed number of digits of the quotient in each step and are usually less expensive to build, and fast division methods calculate a variable number of digits per step and are usually more expensive to build. The most important part of the division methods is that most of them rely upon repeated multiplication by an approximation of a reciprocal, so they are prone to error.

Another cause of the rounding errors in all operations are the different modes of truncation of the final answer that IEEE-754 allows. There's truncate, round-towards-zero, round-to-nearest (default), round-down, and round-up. All methods introduce an element of error of less than one unit in the last place for a single operation. Over time and repeated operations, truncation also adds cumulatively to the resultant error. This truncation error, is especially problematic in exponentiation, which involves some form of repeated multiplication.

Since the hardware that does the floating point calculations only needs to yield a result with an error of less than one half of one unit in the last place for a single operation, the error will grow over repeated operations if not watched. This is the reason that in computations that require a bounded error, mathematicians use methods such as using the round-to-nearest even digit in the last place of IEEE-754, because over time, the errors are more likely to cancel each other out, and Interval Arithmetic combined with variations of the IEEE 754 rounding modes to predict rounding errors, and correct them. Because of its low relative error compared to other rounding modes, round to nearest even digit (in the last place), is the default rounding mode of IEEE-754.

I believe I should add a hardware designer’s perspective to this since I design and build floating point hardware. Knowing the origin of the error may help in understanding what is happening in the software, and ultimately, I hope this helps explain the reasons for why floating point errors happen and seem to accumulate over time.

Most processors follow the IEEE-754 standard but some use denormalized, or different standards . For example, there is a denormalized mode in IEEE-754 which allows representation of very small floating point numbers at the expense of precision. The following, however, will cover the normalized mode of IEEE-754 which is the typical mode of operation.

The main cause of the error in floating point division is the division algorithms used to calculate the quotient. Most computer systems calculate division using multiplication by an inverse, mainly in Z=X/Y, Z = X * (1/Y). A division is computed iteratively i.e. each cycle computes some bits of the quotient until the desired precision is reached, which for IEEE-754 is anything with an error of less than one unit in the last place. The table of reciprocals of Y (1/Y) is known as the quotient selection table (QST) in the slow division, and the size in bits of the quotient selection table is usually the width of the radix, or a number of bits of the quotient computed in each iteration, plus a few guard bits. For the IEEE-754 standard, double precision (64-bit), it would be the size of the radix of the divider, plus a few guard bits k, where k>=2. So for example, a typical Quotient Selection Table for a divider that computes 2 bits of the quotient at a time (radix 4) would be 2+2= 4 bits (plus a few optional bits).

What reciprocals are in the quotient selection table depend on the division method: slow division such as SRT division, or fast division such as Goldschmidt division; each entry is modified according to the division algorithm in an attempt to yield the lowest possible error. In any case, though, all reciprocals are approximations of the actual reciprocal and introduce some element of error. Both slow division and fast division methods calculate the quotient iteratively, i.e. some number of bits of the quotient are calculated each step, then the result is subtracted from the dividend, and the divider repeats the steps until the error is less than one half of one unit in the last place. Slow division methods calculate a fixed number of digits of the quotient in each step and are usually less expensive to build, and fast division methods calculate a variable number of digits per step and are usually more expensive to build. The most important part of the division methods is that most of them rely upon repeated multiplication by an approximation of a reciprocal, so they are prone to error.

Another cause of the rounding errors in all operations are the different modes of truncation of the final answer that IEEE-754 allows. There's truncate, round-towards-zero, round-to-nearest (default), round-down, and round-up. All methods introduce an element of error of less than one unit in the last place for a single operation. Over time and repeated operations, truncation also adds cumulatively to the resultant error. This truncation error is especially problematic in exponentiation, which involves some form of repeated multiplication.

Since the hardware that does the floating point calculations only needs to yield a result with an error of less than one half of one unit in the last place for a single operation, the error will grow over repeated operations if not watched. This is the reason that in computations that require a bounded error, mathematicians use methods such as using the round-to-nearest even digit in the last place of IEEE-754, because, over time, the errors are more likely to cancel each other out, and Interval Arithmetic combined with variations of the IEEE 754 rounding modes to predict rounding errors, and correct them. Because of its low relative error compared to other rounding modes, round to nearest even digit (in the last place), is the default rounding mode of IEEE-754.

Bring section 4 in line with section 5
Source Link
Sneftel
  • 37.3k
  • 11
  • 64
  • 98

Another cause of the rounding errors in all operations are the different modes of truncation of the final answer that IEEE-754 allows. There's truncate, round-towards-zero, round-to-nearest (default), round-down, and round-up. All methods introduce an element of error of less than one half of one unit in the last place for a single operation. Over time and repeated operations, truncation also adds cumulatively to the resultant error. This truncation error, is especially problematic in exponentiation, which involves some form of repeated multiplication.

Another cause of the rounding errors in all operations are the different modes of truncation of the final answer that IEEE-754 allows. There's truncate, round-towards-zero, round-to-nearest (default), round-down, and round-up. All methods introduce an element of error of less than one half of one unit in the last place for a single operation. Over time and repeated operations, truncation also adds cumulatively to the resultant error. This truncation error, is especially problematic in exponentiation, which involves some form of repeated multiplication.

Another cause of the rounding errors in all operations are the different modes of truncation of the final answer that IEEE-754 allows. There's truncate, round-towards-zero, round-to-nearest (default), round-down, and round-up. All methods introduce an element of error of less than one unit in the last place for a single operation. Over time and repeated operations, truncation also adds cumulatively to the resultant error. This truncation error, is especially problematic in exponentiation, which involves some form of repeated multiplication.

The main cause of the error in floating point division, are the division algorithms used to calculate the quotient. Most computer systems calculate division using multiplication by an inverse, mainly in Z=X/YZ=X/Y, Z = X * (1/Y)Z = X * (1/Y). Division is computed iteratively i.e. each cycle computes some bits of the quotient until the desired precision is reached, which for IEEE-754 is anything with an error of less than one unit in the last place. The table of reciprocals of Y (1/Y) is known as the quotient selection table (QST) in slow division, and the size in bits of the quotient selection table is usually the width of the radix, or number of bits of the quotient computed in each iteration, plus a few guard bits. For the IEEE-754 standard, double precision (64-bit), it would be the size of the radix of the divider, plus a few guard bits k, where k>=2k>=2. So for example, a typical Quotient Selection Table for a divider that computes 2 bits of the quotient at a time (radix 4) would be 2+2= 42+2= 4 bits (plus a few optional bits).

The main cause of the error in floating point division, are the division algorithms used to calculate the quotient. Most computer systems calculate division using multiplication by an inverse, mainly in Z=X/Y, Z = X * (1/Y). Division is computed iteratively i.e. each cycle computes some bits of the quotient until the desired precision is reached, which for IEEE-754 is anything with an error of less than one unit in the last place. The table of reciprocals of Y (1/Y) is known as the quotient selection table (QST) in slow division, and the size in bits of the quotient selection table is usually the width of the radix, or number of bits of the quotient computed in each iteration, plus a few guard bits. For the IEEE-754 standard, double precision (64-bit), it would be the size of the radix of the divider, plus a few guard bits k, where k>=2. So for example, a typical Quotient Selection Table for a divider that computes 2 bits of the quotient at a time (radix 4) would be 2+2= 4 bits (plus a few optional bits).

The main cause of the error in floating point division, are the division algorithms used to calculate the quotient. Most computer systems calculate division using multiplication by an inverse, mainly in Z=X/Y, Z = X * (1/Y). Division is computed iteratively i.e. each cycle computes some bits of the quotient until the desired precision is reached, which for IEEE-754 is anything with an error of less than one unit in the last place. The table of reciprocals of Y (1/Y) is known as the quotient selection table (QST) in slow division, and the size in bits of the quotient selection table is usually the width of the radix, or number of bits of the quotient computed in each iteration, plus a few guard bits. For the IEEE-754 standard, double precision (64-bit), it would be the size of the radix of the divider, plus a few guard bits k, where k>=2. So for example, a typical Quotient Selection Table for a divider that computes 2 bits of the quotient at a time (radix 4) would be 2+2= 4 bits (plus a few optional bits).

Edited to clarify errors occur in most operations but not all
Source Link
KernelPanik
  • 7.8k
  • 1
  • 14
  • 14
Loading
deleted 38 characters in body
Source Link
KernelPanik
  • 7.8k
  • 1
  • 14
  • 14
Loading
Edited to correct for 1/2 of one ulp in most operations
Source Link
KernelPanik
  • 7.8k
  • 1
  • 14
  • 14
Loading
deleted 3 characters in body
Source Link
KernelPanik
  • 7.8k
  • 1
  • 14
  • 14
Loading
Clarified Radicies in Relation to Floating Point Division
Source Link
KernelPanik
  • 7.8k
  • 1
  • 14
  • 14
Loading
spelling
Source Link
KernelPanik
  • 7.8k
  • 1
  • 14
  • 14
Loading
spelling
Source Link
KernelPanik
  • 7.8k
  • 1
  • 14
  • 14
Loading
spelling
Source Link
KernelPanik
  • 7.8k
  • 1
  • 14
  • 14
Loading
deleted 16 characters in body
Source Link
KernelPanik
  • 7.8k
  • 1
  • 14
  • 14
Loading
added 768 characters in body
Source Link
KernelPanik
  • 7.8k
  • 1
  • 14
  • 14
Loading
Source Link
KernelPanik
  • 7.8k
  • 1
  • 14
  • 14
Loading