Simple, Efficient, and Wrong.

I’m finally giving this thing a name of its own, paraphrasing a quote by H.L. Menken:

For every problem there is a solution that is simple, efficient, and wrong.

Reality has a nasty habit of getting in our way. Like icebergs, real problems have edge cases and complexities that mean most of the problem is underwater, and it’s hard to tell where their bounds really lie until you are on top of them.

Take for example the problem of adding up all the numbers in a list. Hardly anything could be simpler, perhaps even tedious. Just make a loop that goes over the list, adding the numbers into an accumulator, right?? Well, God is in the details.

Since computers store floating point numbers with a limited precision many rules of arithmetic no longer hold. For example, digits get lost when adding a small number to a larger one. To demonstrate this, let’s try computing the 1000,000th harmonic number, 1/1 + 1/2 + 1/3 + ... + 1/1000000 going from big to small and from small to big:

>>> sum([ 1.0/x for x in range(1, 1000001) ])
>>> sum([ 1.0/x for x in range(1000000, 0, -1) ])

Conclusion: an ideal summation algorithm should first put similarly sized numbers into buckets and then add numbers in the buckets, starting with the smallest numbers. Case closed.

Or is it? What if all of the numbers in the list are in a similar range, but the list is really long? Now the accumulator gets big and you start losing precision just adding to it. No worries, we can fix this by adding first each adjacent pair together, then add each adjacent pair, and so on, until you get the sum for the entire bucket.

It turns out that this approach isn’t ideal either: for a list of n numbers its error is bounded by O(log n). The Kahan summation algorithm in comparison has an error bound of O(1).

Fortunately most real problems permit solutions that are good enough. Most datasets are not so pathological, or requirements on the results so great, so that the naive algorithm couldn’t be used.