Thursday, January 13, 2011

Freaky behavior of a small C/C++ bug

Oh boy.
Read through to the end for the event and its root cause. The important morals are at the bottom.





We run systems that, during high-capacity events, handle thousands of transactions per second. One of the heaviest-traffic periods is New Year's Eve, the 31st of December, when most of our systems around the world are under heavy stress, stress that tends to diffuse to our support teams. Structured and strict preparations usually get us through this heavy-traffic day properly at most, if not all, sites. Happily, that was also the case this year.

Shockingly, on January 2nd we had a crash at two sites.

Analyzing the crash led us to a timer that, instead of rescheduling itself every 5 seconds, kept firing abruptly at intervals of milliseconds.

While we were still analyzing the case and reproducing it in our labs, the problem vanished as suddenly as it had appeared, at the end of the same day. On January 3rd at 00:00, the systems went back to behaving nicely.

That's really odd. How does the bug relate to the date? Is it a coincidence? It doesn't look like one, as the problem disappeared a second after midnight. Trying to reproduce it in the lab, we got the same behavior: it is the bug of January 2nd, 2011. (By the way, when running the system in our labs in debug mode, the problem didn't reproduce! The bug appears only when running without debug! That's common for memory-related bugs, memory smears, etc.)

To some of us, it sounded like the iPhone alarm bug, which was reported to stop alarms from working properly at the start of 2011 and to fix itself by January 3rd.

http://www.tipb.com/2010/12/31/iphone-bugs-alarms-working-2011/


Maybe it's the same bug?

The iPhone runs iOS, which is Unix-like (strictly speaking Darwin-based, not Linux-based), and we run on Linux. Maybe there is something wrong with timers on such systems at the beginning of 2011?
Looking in this direction led to nothing.

On the other hand, analytical investigation led to the following:

  1. When the timer fires, it calls our callback function. The callback function must return an int value. Any value except 1 says "OK"; 1 says "please call me again".
  2. Our callback function didn't return a value at all.
Wait... is it legal not to return a value from a non-void function?
Unfortunately, C and C++ compilers accept it (usually with only a warning), and the behavior is undefined. In practice the caller still reads a return value; in some environments it will be whatever was last left in the return register. And, well, occasionally it can be 1.
See:
http://stackoverflow.com/questions/1610030/why-can-you-return-from-a-non-void-function-without-returning-a-value-without-pro/1610454#1610454
http://stackoverflow.com/questions/2598084/function-with-missing-return-value-behavior-at-runtime
What should be done?
Read:
http://gcc.gnu.org/onlinedocs/gcc/Warning-Options.html
-Wreturn-type
Warn whenever a function is defined with a return-type that defaults to int. Also warn about any return statement with no return-value in a function whose return-type is not void (falling off the end of the function body is considered returning without a value), and about a return statement with an expression in a function whose return-type is void. For C++, a function without return type always produces a diagnostic message, even when -Wno-return-type is specified. The only exceptions are `main' and functions defined in system headers. This warning is enabled by -Wall.
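
To make the callback contract and the fix concrete, here is a minimal sketch in C. The names (on_timer, fire_timer, TIMER_OK, TIMER_CALL_AGAIN) are hypothetical and are not taken from our real timer library; the toy dispatcher only imitates the "1 means call me again" rule described above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical contract modeled on the description above:
 * 1 asks the timer to call again, any other value means "OK". */
enum { TIMER_OK = 0, TIMER_CALL_AGAIN = 1 };

typedef int (*timer_callback)(void *ctx);

/* Buggy shape, shown in a comment only:
 *
 *   int on_timer(void *ctx) {
 *       do_periodic_work(ctx);
 *   }      <- falls off the end: the caller reads an undefined value
 */

/* Fixed shape: every path ends in an explicit return. */
int on_timer(void *ctx) {
    int *ticks = (int *)ctx;
    ++*ticks;                 /* stand-in for the real periodic work */
    return TIMER_OK;
}

/* Toy dispatcher: invokes the callback, and keeps invoking it for as
 * long as it returns TIMER_CALL_AGAIN, up to max_calls invocations.
 * Returns the number of invocations made. */
int fire_timer(timer_callback cb, void *ctx, int max_calls) {
    int calls = 0;
    int rc;
    assert(cb != NULL);
    do {
        rc = cb(ctx);
        ++calls;
    } while (rc == TIMER_CALL_AGAIN && calls < max_calls);
    return calls;
}
```

With the explicit return, fire_timer calls on_timer exactly once per firing. If the buggy shape is compiled with gcc -Wall, GCC typically reports a -Wreturn-type warning along the lines of "control reaches end of non-void function"; with -Werror=return-type the build fails instead.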

Morals
  • Listen to compiler warnings!
    Fix all warnings; you should have a zero-warnings policy.
    The problem above could have been caught and fixed as a warning (-Wreturn-type).
  • If you don't keep a zero-warnings policy, which you should, turn dangerous warnings such as the one above into errors with a compilation flag, e.g.: -Werror=return-type
  • You may want to test your software in future time. For example, have a test system that always runs 30 days ahead; if there is a time-related bug, it may help catch it in time. It probably won't catch everything, but it could have caught the problem we had above!
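
One way to approximate the "30 days ahead" test system without changing the machine clock is to have the application read time through a single wrapper. The sketch below is only an illustration under assumptions: the environment variable TEST_CLOCK_OFFSET_DAYS and the functions shift_days, clock_offset_days and app_now are hypothetical names, and it assumes a POSIX-style time_t that counts seconds. It only shifts the time the application itself asks for, so it is weaker than really running a machine in the future.

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical environment variable, invented for this sketch. */
#define CLOCK_OFFSET_ENV "TEST_CLOCK_OFFSET_DAYS"

/* Shift a timestamp by a whole number of days.
 * Assumes time_t counts seconds, as on POSIX systems. */
time_t shift_days(time_t t, long days) {
    return t + (time_t)days * 24 * 60 * 60;
}

/* Read the offset from the environment; 0 when unset or unparsable. */
long clock_offset_days(void) {
    const char *s = getenv(CLOCK_OFFSET_ENV);
    char *end;
    long days;
    if (s == NULL) {
        return 0;
    }
    days = strtol(s, &end, 10);
    return (end == s) ? 0 : days;
}

/* Use this instead of time(NULL) everywhere in application code. */
time_t app_now(void) {
    time_t now = time(NULL);
    assert(now != (time_t)-1);   /* time() reports failure as (time_t)-1 */
    return shift_days(now, clock_offset_days());
}
```

A test environment would then start the system with TEST_CLOCK_OFFSET_DAYS=30, while production leaves the variable unset and sees the real time.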
