Monday, October 05, 2009

Another note on hard-to-find-bugs...

Before I began to write this blog post, I realized how long I had not written anything! I promise to begin to write in a more timely manner, but the past weeks were merely a consolidation phase, ironing out bugs from the new releases.

I'd like to elaborate on one of these, one that really drove me crazy the past days. The problem was that omfile's directory creation mode was sometimes set to zero (well, almost always in some configurations). What began as a minor nit turned into a real nightmare ;)

The issue was that the variable fDirCreateMode was always set to zero, except if it was written to at the start of module initialization or when it was simply displayed at start of module initialization. That sounded strange, but even stranger seemed that by moving around the variable definition in the sources code (and thus assumingly changing its memory location), nothing changed. So I came to a point where I used this code as a patch:

omfile.c from rsyslog git

Look at line 769. With that seemingly unrelated write, the variable stayed as expected. However, if I moved the write to a function, nothing worked again. Strange... After committing the patch, testing showed that the directory permissions now worked well BUT the file create mode now behaved wrong in the same way.

I was stunned - and speechless. What followed, were intense debugging sessions. I ended up finding the commit that introduced the problem, but still could not see why that commit would affect anything. After hours of debugging, I ended up with a stripped-down and almost codeless omfile, which still had the same problem. And it appeared and disappeared almost at random as code lines were moved in and out.

I once again checked the git history and then I noticed that a few commits down the line, I had introduced another config variable for the io buffer size. Now I finally had the idea. The size-type config directives were introduced for file size restrictions. Thus, the regular 32 bit integer is not sufficiently large for them. Consequently, they needed 64 bit integers as pointers! But, of course, I had provided only a pointer to a 32 bit int, thus the config handler overwrote another 32 bits that happened to be close to the address I provided.

This was clearly an error. But could it explain the behavior I saw? Not really... But the problem went away once I had corrected the issue. So I began to suspect the that compiler hard re-ordered variable memory assignment in order to optimize access to them (maybe to get a better cache hit rate or whatever else). But hadn't I compiled with -O0 and as such no optimization should take place? I checked, and I realized that due to a glitch in lab setup, optimization actually was on, and not turned off! So now I think I can explain the behavior and theory as well as practice go hand in hand.

Really? What about the write and the debug print that made everything work? I guess these changes triggered some compiler optimization and thus the memory assignment was changed and so the "extra 32" bit pointed to some other variable. What also explains why the file creation mode was affected by my change. As well as why the bug reacted quite random to my testing code changes.

So it looks like I finally resolved this issue.

Lessens learned? Re-check your lab environment, even if it always worked well before. Be careful with assumption about memory layout, as the optimizer seems to heavily reorder variables, and even single statements and statement sequences seem to make a big difference. I knew the compiler reorders things, but I did not have this clear enough on my mind to become skeptic about my lab setup.

And, as always, some assumption limit your ability to really diagnose what goes on... May this be a reminder not only for me (I wonder how long it will last) but for others as well (thus I thought a blog post makes sense ;)).

No comments: