Myths About Floating-Point Numbers (2021)

Summary

Wed 17 Mar 2021

Floating-point numbers are a great invention in computer science, but they can also be tricky and troublesome to use correctly. I’ve written about them before in the Floating-Point Formats Cheatsheet and the presentation “Pitfalls of floating-point numbers” (“Pułapki liczb zmiennoprzecinkowych”; the slides are in Polish). Last year I was preparing a more extensive talk on this topic, but it got cancelled, like pretty much everything in these hard times of the COVID-19 pandemic. So in this post, I would like to approach the topic from a different angle.

A programmer can use floating-point numbers at different levels of understanding. A beginner uses them trusting that they are infinitely capable and precise, which can lead to problems. An intermediate programmer knows that they have some limitations and avoids the problems by following good practices. An advanced programmer understands what is really going on inside these numbers and can use them with full awareness of what to expect from them. This post may help you jump from step 2 to step 3. Commonly adopted good practices are called “myths” here, but they are actually just generalizations and simplifications. They can be useful for avoiding errors until you understand, on a deeper level, what is true and what is false about them.

1. They are not exact

It is not true that 2.0 + 2.0 can give 3.99999. It will always be 4.0. Floating-point numbers are exact within their limited range and precision. If you assign a floating-point variable some constant value, you can safely compare it with the same value later, even using the discouraged operator ==, as long as it is not the result of some calculation. Imprecisions don't come out of nowhere.
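The claims above can be checked with a few lines of C++. This is only a minimal sketch, assuming IEEE 754 single-precision float; the function names are hypothetical and not taken from the original post. The last function adds one caveat: “the same value” has to mean a literal of the same type, because 0.1f and 0.1 round to different numbers.

```cpp
#include <cassert>

// 2.0 and 4.0 are both exactly representable, so the sum is exactly 4.0.
bool sum_of_twos_is_four() {
    float a = 2.0f;
    float b = 2.0f;
    return a + b == 4.0f;
}

// 0.1 is not exactly representable, but the literal 0.1f always rounds to
// the same float, so comparing a stored constant with == is safe.
bool constant_round_trips() {
    float stored = 0.1f;
    return stored == 0.1f;
}

// Caveat: comparing against the double literal 0.1 promotes 'stored' to
// double, and float(0.1) and double(0.1) are different numbers.
bool matches_double_literal() {
    float stored = 0.1f;
    return stored == 0.1;
}

// Hypothetical driver that bundles the checks; call it from main().
void run_exactness_checks() {
    assert(sum_of_twos_is_four());
    assert(constant_round_trips());
    assert(!matches_double_literal());
}
```

Calling run_exactness_checks() from main() should finish without tripping any assertion on a platform with IEEE 754 floats.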
Instead of using an integer loop iterator and converting it to float every time:

    for(size_t i = 0; i < count; ++i)
    {
        float f = (float)i;
        // Use f
    }

you can do this, which results in much more efficient code:

    for(float f = 0.f; f < (float)count; f...
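A float loop counter stays exact only while its values are integers that float can represent. The sketch below assumes IEEE 754 single precision and a step of 1.0f; the step is an assumption here, since the text above is cut off before the increment. The function names are hypothetical. It compares a float counter against the converted integer index, and shows where the approach stops working: at 2^24 = 16777216, adding 1.0f no longer changes the value.

```cpp
#include <cassert>
#include <cstddef>

// Runs a float counter next to an integer index and counts how often the
// two disagree. For counts well below 2^24 the result is expected to be 0.
std::size_t count_mismatches(std::size_t count) {
    std::size_t mismatches = 0;
    float f = 0.0f;
    for (std::size_t i = 0; i < count; ++i) {
        if (f != static_cast<float>(i)) {
            ++mismatches;
        }
        f += 1.0f;  // assumed step; exact while f stays below 2^24
    }
    return mismatches;
}

// At 2^24 the spacing between adjacent floats becomes 2, so 2^24 + 1 is not
// representable and rounds back to 2^24 (round-to-nearest, ties-to-even).
bool increment_stalls_at_2_pow_24() {
    float f = 16777216.0f;  // 2^24
    float next = f + 1.0f;
    return next == f;
}

// Hypothetical driver that bundles the checks; call it from main().
void run_counter_checks() {
    assert(count_mismatches(1000) == 0);
    assert(increment_stalls_at_2_pow_24());
}
```

In practice this means a float counter is a safe replacement for the integer one only when count is known to stay below 2^24; beyond that the loop would never advance.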

First seen: 2025-08-13 13:58

Last seen: 2025-08-14 02:09