Wednesday, January 28, 2009

rsyslog data race analysis

The good news first: it looks like I have found at least one cause of the race condition that plagued rsyslog for a while. On some fast machines, we saw aborts in certain configurations. Usually, this happened to occur on machines with 4+ cores only, and the best reproduction could be achieved on a "two four-core" system (that's an 8-way system, but you'll later see why I emphasize the fact that it is not 8 cores on a single chip but two chips with 4 cores each). Special thanks go to Lorenzo Catucci of the University of Rome "Tor Vergata" - he contributed not only an excellent test environment, but offered good ideas in our discussions.

I can not say if what I found was the only problem, but for sure I can say it was one problem. Testing also seems to show that the bug has disappeared. The good thing is that what happened matches exactly what can theoretically happen with this kind of bug.

Of course, it all boils down to thread synchronization and the way memory is accessed in a modern processor. In order to explain, I need to set the stage first.

In rsyslog, syslog messages are kept inside so-called message "objects". As rsyslog is coded in C (not C++), we do not have real objects, but the hand-written runtime brings much of the benefits of real objects. So let's stick with the term "object" for the rest of this post. Rsyslog is designed to work well on tomorrow's massively-parallel machines. As such, it not only runs on multiple threads, but has abilities to shuffle messages to various queues for execution. Most importantly, actions (like writing to a file or forwarding to a central server) can be enqueued in asynchronously executed queues. When doing so, a single message object is potentially "stored" in multiple queues at the same time. In order to gain performance, rsyslog does not actually copy message objects (as long as they are not written to), but rather keeps references to the very same object. However, we also need to know when we can cleanup messages, so we need to know when the last owner has ceased control of it (no, no Java-like garbage collection...). So what we do is reference-count message objects. Whenever a new "owner" receives a reference to the object, a reference counter is incremented (done via MsgAddRef()). Whenever the owner no longer needs the object, it is decremented (done via MsgDestruct()). If MsgDestruct() detects that the reference counter dropped to zero, nobody is using the message object any longer. MsgDestruct then immediately destructs the object (hence the name). This is done as part of the MsgDestruct call and invisible to the caller.

So far, so good. No issue at all without threading. With multiple threads, we need to make sure that there is no race for the variable that holds the reference counter. The variable itself is a C "int", thus usually a 32 bit value.

To simplify things, let's for now assume we execute on Intel processors. This is a good assumption in many cases. Also, it is the environment in which we found the bug, so I have details for it. Note that in-depth documentation on Intel's processors is available via the Intel web site[1].

According to Intel's System Programming Guide[3], section 7.1.1, memory access to words aligned on 16 bit boundary will always be atomic. I had this on my mind when I wrote the initial reference counting code - and made a few (now educating) mistakes.

First of all, I had the somewhat old-time thinking of a single CPU without caches on my mind. That read: the increment and decrement operations work on properly aligned words, so they are atomic. Consequently, I do not need to guard them against races. I should have paid more attention to the advances in processor design. Modern CPUs have caches. Depending on how data is accessed, the cache may be used - or not. Even more important, with multiple cache levels and multiple execution units (multi-core CPUs), cache coherency becomes an issue. One must ensure that if one thread, executing on Unit A updates a value, and another one executing concurrently on Unit B updates the exact same value, nothing will be overwritten. This becomes even more problematic if the two execution units do not share the same cache subsystem (think of the two quad-core system). Consequently, Intel specifies in 7.1.1:

Accesses to cacheable memory that are split across bus widths, cache lines, and page boundaries are not guaranteed to be atomic...
So here we go, my basic assumption may be right on some classes of machines, but it is definitely wrong on others (newer ones). So aligned word-access does not guarantee (not even enhance the chance) of atomicity. [It is an interesting question whether I made invalid assumptions about the Intel microarchitecture or if Intel slightly changed the instruction set architecture at some time - I wouldn't outrule the later, but having a wrong assumption is probably the real issue; especially as I knew I should not rely on anything in regard to synchronized memory access if not explicitly specified).

Let me add that true atomicity (probably at a high performance penalty) can be achieved by specific instrucion prefixes, the "bus lock" instruction. This is detailed in section 7.1.2. of Intel's guide. Note that atomicity is only guaranteed if the design of the memory subsystem pays proper attention to the LOCK# signal, which we hopefully can assume on decent hardware. The GNU C compiler supports generating the necessary prefixes for those instructions that support tehm (few) as part of its atomic operations support.

So error#1 was to assume that access to properly-aligned word-data is atomic. It is not. I knew it, you know it; abide to it or you'll run into the same troubles.

As I said, I knew it. The bad thing is that I partly fixed the situation in a later version, after I had introduced support for GCC's atomic operations. However, another problem prevented that from working. But more on this later. I am not yet done with the initial problem.

Another, really dumb, error was that I assumed the following sequence to be a single memory operation:
currRefCount = --pThis->iRefCount;
This is really a beginner's error - I should be over that for a couple of years now... What I do is increment a word (OK, not looking at error #1 I thought this is "sufficiently atomic") and then store the result to another variable. Probably my fault was caused by an atomic instruction, supported by Intel processors, which actually does this in an atomic way. But that is far different from the code I wrote.

Let's have a look at which code GCC generates for this instruction (sequence):
3894: 8b 43 48              mov    0x48(%rbx),%eax
3897: 83 e8 01 sub $0x1,%eax
389a: 89 43 48 mov %eax,0x48(%rbx)
389d: 89 44 24 08 mov %eax,0x8(%rsp)
At address 3894 loads the value of pThis->iRefCount into register EAX. Then, it decrements the register (in 3897) and does two stores: one to save it back into the original variable (at 389a) and one to save it to currRefCount (in 389d). As this is a 4-instruction code sequence, there is no way this can be an atomic operation.

If, for example, a second CPU modifies pThis->iRefCount after the mov instruction at 3894 has been carried out, the results will be unpredictable at best. The following may now happen (and I think it has happened):

  1. thread A enters the above code in MsgDestruct() and executes mov at 3894
  2. thread B (on another CPU) increments pThis->iRefCount at the same time that thread A executes instruction 3897
  3. thread A continues execution with instruction 389a. Note that the increment done by thread B is simply overwritten - the reference counter will be one less than it should be
At the time this instruction sequence is executed, no problem appears. But it sets stage for a problem that may follow much later. At some time later, MsgDestruct() is called and the reference counter drops to zero, so the message object is destructed. However, as the counter was one off, the message object is still being used. Still, no harm is done unless either rsyslogd reduces its process space (very unlikely) or the memory is immediately overwritten (unlikely). If none of this happens (likely), the message will be processed, and not be freeed, by virtue of 0 -1 being not equal 0 - a double free will only happen if it equals zero! So in many cases, not even a double free happens! It looks like everything works perfect.

This is probably one reason why the bug was so hard to reproduce: there are a couple of things that need to come together to make it visible. Thankfully, I was smart enough to put a large number of runtime checks into the code and thankfully the bug appeared on some systems even with these runtime checks enabled. So they could detect at least some types of errors and that finally lead us onto the right route.

The other segfaults most probably happened by one of the unlikely or very unlikely cases. But if you process millions and millions of messages, the probability of such an incident increases. No wonder we did never see it on a low-volume system.

In any case, error #2 was to assume too much about code generation without checking (you do not need to check every code generated, but if you intend to do "dirty tricks", you better check at least the primary targets, knowing that this doesn't cover you in all cases).

And now to my code changes that should handle these situations. Please note that at this time I thought I was addressing a hypothetical issue which would never occur in practice (but I was at least brave enough to try to do "the right thing").

I used GCC's atomic operations to replace both the increment and the above decrement construct. Note that with these changes, the code is supposed to work. To understand this, we need to look a bit at sequence of code flow. Reference counting can only work if the reference count is never allowed to drop to zero as long as someone still needs an object. A typical programming error, if the reference count is at one, is code things like this:
msgDestruct(pMsg);

AddToQueue(msgAddRef(pMsg));
That code would first drop the reference counter to zero and thus destruct the object. However, the msgAddRef would then increment it, but work on the now-unallocated memory (which of course still holds all previous values in almost all cases. The proper sequence to avoid this kind of bug is
AddToQueue(msgAddRef(pMsg));

msgDestruct(pMsg);
which first increments the reference counter and then frees the object as far as the current activity is concerned. Thankfully, this coding error I did not commit.

Instead, a made another really dumb one: I crafted perfectly well code, and I even included conditional compilation to use locking in case atomic operations are not available. I used the variable DO_HAVE_ATOMICS to differentiate between the two cases. If only I had used the same variable inside my ./configure check... But there I used the name HAVE_ATOMIC_BUILTINS! Generally, that was a good idea, because it is following the naming convention all other such switches have. If just I had changed the .c files... And finally, I did not remove the interim code, which undid some of the atomics because they were not portable. In the end result, I had what looks like perfectly valid code, but it was never actually compiled. In other words: I totally screwed up.

The result was no improvement after all. I should have noted that testing did not reveal any of these issues, as I thought I address an issue which does not really exist. Most importantly, I had not test case to verify the code. I now know that I should at least have made sure that the conditional compilation worked properly.

So error #3 probably was not to put enough care into debugging conditional compilation.

Then, few isolated reports came in. Far too few to see any trend. At some point, I fixed the switch name error, but not consistently. For some reason (I do no longer remember), I thought that there may be a problem with the atomics and I also saw that there were probably an issue with msgDestruct(). I then commented out the atomic calls and used mutex calls instead. Around the same time, I fixed the handling of atomics in msgAddRef() - at least it looks so [I have not tried to extract the full history]. Had I just done one of these two things, everything would have started to work. But now these two actions were contradictory, and I didn't notice that.

What now happend was the msgAddRef was indeed using atomics. However, msgDestruct used the mutex. As a result, msgDestruct used the non-atomic instruction sequence shown at the beginning of this post to handle the reference count. If now msgAddRef incremented the counter after the variable was loaded into the register but before the register was written back, the reference count would be one too low - exactly as described in the failure scenario above. And msgAddRef could jump right into the middle of the sequence in msgDestruct, because msgAddRef did not try to lock the mutex. So there was nothing that prevented it from executing.

When I first began to realize this, I thought that there should be no issue because the mutex call, after all, is a full memory barrier and as such any update done by msgAddRef should be visible after the mutex lock. That's right, but it doesn't cover the failure scenario I described and I now know of.

What lead to this final problem, I think, was inconsistent analysis. The bug appeared every now and then and was not reproducable. So I was unable to do a good analysis of it. I tried this and that, all going into a good direction, but without any way to verify it. Somehow, also, I did not do a good-enough analysis of the algorithms used - I overlooked those bombs that were in the code for quite some while.

Once I was able to reproduce the issue at least a bit consistently, I could begin to see in more detail where things went wrong. I still didn't see the obvious. I even did an in-depth analysis of the code pathes in question - and have to admit that the obvious things still didn't occur to me.

What I saw, with the help of valgrind's DRD[2] tool, was an issue with a not correctly initialized message mutex. That was quite of a subtle error where a debug system statement caused a too-early aquire of the message mutex, which at that time was not initialized. That could lead to all sorts of interesting effects (including valgrind aborts) and I had high hopes this was the problem source. At first it looked promising, but then I noticed that the problem was introduced in recent builds and could not be the root cause - builds without that issue also crashed. So while it was definitely something we should fix, it was not the sole solution.

We were back to nowhere land... Finally, we begun to come closer to the bug after we were able to generate some error reports from valgrind's DRD tool on Lorenzo's machine (kudos to the valgrind team!).

I should note that I use valgrind on a regular basis and tried a couple of its tools (among others, helgrind and DRD) during my debugging effort. All of them never reported any issue other than mentioned above. I think none of the machines was fast enough to experience that kind of situation so that valgrind could generate an error report (as a side-note, this is good indication that valgrind's log being error-free doesn't mean the program is error-free).

After I got Lorenzo's DRD reports, I had hard evidence that the reference counter was incorrectly handled, something I suspected most of the time, but had outruled numerous times because I did not find any issue with it. Now I knew for sure and could begin to think very hard of why this could happen. I thought even about the most extreme cases - cases I had never thought about wouldn't have been there evidence that the issue was real. This brought me back to the atomic operations. Interestingly, I still didn't see the real issue. But I was smart enough to replace the atomics in msgAddRef by mutex calls. That, of course, cleared the issue, because now the failure scenario (the one described above with the assembly code sample) could no longer happen. Quick tests have proven that no rapid abort happened in the lab environments. This is where we are today. Of course, this does not mean that the bug is actually fixed, but the situation has improved. Neither Lorenzo nor my lab could endure more than 30 to 60 minutes without an abort with the previous code.

After I got this confirmation, I begin to dug into the atomic operations. Lorenzo suggested to look at the assembly code and that somehow brought me to question things like cache coherency and all that. It still took me a while to detect the real cause which turned out to be a rather simple coding error.

So error #4 (this time maybe better to be called learning #4) was: I was so blindfolded by my knowledge of the project, that I was unable to see the hidden bombs in the source code. My reception was filtered by my thoughts. Even though I analyzed things, I did not do a proper analysis, as was unable to correctly reproduce what was written in code. Such situations happen [at least to me ;)]. What is very useful is hard evidence, or a persistent person, that makes you overcome your invalid perception. Unfortunately, I do not know how to overcome this problem in a general way.

After I came to this conclusion, I did further analysis and finally ended up with the failure scenario described at the top of this post (including everything else I wrote). As soon as this was clear, I started another lab which replaced the version with mutex calls by one that relies on atomic operations exclusively. And, as it looks, this seems to perform equally well (but we have not yet changed Lorenzo's lab environment, he still runs the "mutex version"). So this somewhat backs my theoretical findings on the bug sources.

Again, it is far too early to claim victory over the recent race problem. It was far to hard to reproduce, so a few hours in a few labs without problems are not a definite answer - but a trend. Also, as I wrote, what we have found definitely is a bug. It may not be the only one, but it is one for sure. So in any case, the stability of code is better with these fixes than without them.

I will monitor how the fixes perform in practice. As next steps, I'll also update all branches of rsyslog. I have already done this to the branch used by Debian Lenny (we have a separate branch due to their release freeze). That will add another environment to the lab list where we had frequent reproductions (but it is not as useful, as this branch has some potential other threading issues, which were already sorted out in the past). I hope to be able to do clean fixes for the other branches by the end of the week.

As a sum-up, it is important to keep instruction sequences, cache coherency, execution order and all the fine details of a highly parallel system in mind when designing multithreaded code. A very special challenge is getting the right test environment, as race conditions can occur only under rare occasions. A good understanding of the machine's internals is definitely useful, otherwise you may not even be able to think about failure scenarios. Thoroughly reading code and specs should be self-evident, but obviously it pays to remind oneself to it. Getting help from someone who is not familiar with code/spec is definitely useful.

I have written this analysis in the hopes that it is useful as a record for me - and also for others. I think it outlines rather well which programming errors can be made in general and with multi-threaded code in specific. I also hope that it aids understanding of the strange things one need to look at with highly concurrent systems. I think it also proves the point that any (system-level) programmer should know about what exactly the machine does, and what not.

That's basically what I wanted to communicate. But... there is a bit more. Read on if you'd like to get down to some details that may also affect your lab environment and may puzzle you...

Let's have a quick look at optimized vs. non-optimized code. I am using GCC in this example:

Some sample code:
void tester() {
msg_t *pThis;
int currRefCount;
currRefCount = --pThis->iRefCount;
}
with default gcc settings (without -O) generates

311: 55 push %rbp
312: 48 89 e5 mov %rsp,%rbp
315: 48 8b 45 f0 mov -0x10(%rbp),%rax
319: 8b 40 48 mov 0x48(%rax),%eax
31c: 8d 50 ff lea -0x1(%rax),%edx
31f: 48 8b 45 f0 mov -0x10(%rbp),%rax
323: 89 50 48 mov %edx,0x48(%rax)
326: 48 8b 45 f0 mov -0x10(%rbp),%rax
32a: 8b 40 48 mov 0x48(%rax),%eax
32d: 89 45 fc mov %eax,-0x4(%rbp)
330: c9 leaveq
331: c3 retq
and with gcc -O3 it leads to
   10: 83 2c 25 48 00 00 00  subl   $0x1,0x48
17: 01
18: c3 retq
19: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
And, just to get the whole picture, this is a simple assignment (without optimization):

source:
void tester() {
msg_t *pThis;
int currRefCount;
currRefCount = pThis->iRefCount;
}
The unoptimized version is what I described in the failure scenario above. With the optimized version, we may be puzzled again. Here, the optimized version works with a single subl and the operand is word-aligned. So, from a naive view, this should be an atomic operation (and, in fact, it is on many systems, most importantly on single-CPU, single-core ones). However, careful reading Intel's architecture manual[3] brings up that it is not an atomic operation on newer, highly parallel, hardware. In any case, even on that hardware the probablity of a collision is far lower than in a 4-instruction sequence.

In theory, this indicates that the bug is more likely to appear in debug mode than in production mode (where the compiler is set to full optimization). In practice, it was just the opposite: in debug mode, it almost never was seen, while in production it occurred every now and then. The explanation, I think, is that you need a certain machine speed to generate the error condition. Rsyslog, when compiled for debug, runs considerably slower compared to when compiled for production. In other words, you need a considerably faster machine to experience the same probability for the race condition to manifest in an error. I think most of the systems were simply too slow to hit this barrier, and so could produce the problem only in release builds. I agree that I have no data backing this claim, but it would at least be a good explanation.

I hope this post is useful. Any comments are appreciated.
Rainer Gerhards

Update (2009-01-29): If you are interested in more details, be sure to read the discussion I had with Lorenzo after finishing this paper.

[1] Intel documentation home - http://www.intel.com/products/processor/index.htm?iid=subhdr+prod_proc

[2] DRD- a thread error detector, http://valgrind.org/docs/manual/drd-manual.html

[3] Intel® 64 and IA-32 Architectures Software Developer's Manual
Volume 3A: System Programming Guide, http://download.intel.com/design/processor/manuals/253668.pdf

2 comments:

Christoph Bartoschek said...

What bothers me is the following question: Are the atomic operations worth the whole hassle compared to locking with mutexes?

Rainer said...

Well, that's actually an excellent question. I can't elaborate in full right now (need to do the patches), but let me try: A pthreads-mutex call is quite efficient if it can be handled in user space, what means it can be acquired. As long as the probability that no fallback to kernel space is necessary, one may argue if the performance difference really matters. Note that there still is a big difference in the number of cpu cycles used. At the bottom of user-space mutexes are atomic operations, so the performance penalty from "going atomic" is there in any case. But the mutex needs to execute a lot of additional instructions, whereas the atomic operations can be executed in a few clock cycles - after all, they are single instructions. And when the mutex must go to kernel space, that is extremely slow compared to atomics. In the time required for such a context switch you can completely process a couple of messages (really!).

Introducing new team member

Good news: we have some new folks working on the rsyslog project. In a small mini-series of two blog postings I'd like to introduce the...