If you’ve taken a course in numerical computing you’ll be familiar with the various sources of numerical error. First there is representation error, which arises when you have to represent a number with a finite number of digits. A common example: the decimal expansion 1/3 = 0.3333… has a never-ending string of 3’s, and at some point you just have to cut it off. The same happens with 1/10 in binary. When you start adding (or subtracting) you may fall victim to loss of significance: when the two numbers have drastically different magnitudes, you lose the least significant digits of the smaller one. There is also cancellation: when two numbers have almost equal magnitude but opposite signs, you lose the most significant digits and produce a large relative error. Finally, you may suffer from overflow and underflow when the magnitude of an intermediate or final result falls outside the representable range.
There’s also a more surprising source of error: performing intermediate calculations with higher precision than you asked for. This can happen when programming in C or C++ on the x86 architecture, where the x87 floating point unit works internally with 80-bit extended-precision numbers instead of the 64-bit double precision values we’re used to.
You are probably asking: how can higher precision make things worse? Let me explain by calculating 30% of 36500 in C:
double percentage = 0.3;
int val = 36500 * percentage; // val should be 10950, but the computer says 10949!
The problem is that percentage is slightly less than 3/10 to begin with, because of representation error. Converting it to the x87’s 80-bit format is exact, so the extended-precision value is still the 64-bit approximation of 0.3, not the much closer value the x87 itself would compute for 3/10. The 80-bit multiplication then has enough precision to preserve that error: the result comes out just slightly less than 10950, and truncating it to an int gives 10949. Had the product instead been rounded to 64-bit double precision, the rounding would have absorbed the error and given exactly 10950. And this of course leads to fun Stack Overflow questions.
How can this be fixed? Use the round function to round the result to the nearest integer. Or, if you really do want truncation, don’t use floating point numbers at all:
int val = 36500 * 3/10;