Thursday, January 29, 2009

Some more on the rsyslog data race...

After I had written my write-up on the rsyslog data race, I had another very useful discussion with Lorenzo Catucci. We touched on some points that are not yet included in the write-up. I also explained some rsyslog internals that you probably won't find in any other source yet. In short: I think this discussion is useful for anyone interested in more details. I am not sure when (or if) I will be able to include it in the original post. So rather than lose it, I asked Lorenzo if he agreed to making it public, and thankfully he did.

So here it is - very authentic, with only the most glaring typos fixed and all the non-sequiturs of a chat discussion left in, as we all like it. Have fun! Comments are appreciated.

Rainer: write-up finally finished ;)
catucci: very nice wrap-up.
Rainer: the really good thing is that it actually sounds like we found the bug :)
now let's hope for the best...
catucci: I think you should emphasise a little bit more the need for coherent handling of condition variables or the like: now I understand in what sense you talked about an existing memory barrier this morning; the problem with that "barrier" was that it wasn't enabled on both sides... something like a water valve...
Rainer: good point, will see that I edit it in
but with the barrier, I actually meant the real thing:
it forces the CPU to do memory operations in order
I thought it was sufficient... but obviously not
I didn't have the exact failure scenario on my mind this morning
that evolved during the analysis
and was clarified by the write-up
in fact I started a new lab in the midst of the writeup
which again proves the point that it pays to write things down
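The "real thing" Rainer refers to is a hardware memory barrier. A minimal illustration of why one is needed (this is an invented example, not rsyslog code; `publish`, `data` and `ready` are made-up names):

```c
#include <assert.h>

/* Illustrative sketch: publishing data to another thread through a
 * flag. Without the barrier, the CPU may reorder the two stores, so a
 * reader on another core could observe ready == 1 while data is still
 * stale. The full barrier forces the stores to become visible in
 * program order. */
static int data;
static int ready;

void publish(int v)
{
    data = v;
    __sync_synchronize();  /* GCC full memory barrier */
    ready = 1;
}
```

As the chat notes, a barrier on one side alone is not enough: the reader needs matching ordering, or the "valve" is only half closed.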
catucci: Yes, but what I was thinking was the constructor side missing the ordering.
Rainer: it's even more complicated
I didn't explain that all
rsyslog shall run great on uniprocessor systems
catucci: will wait for the next version of the write-up!!! .-)
Rainer: under some constraints, it is possible to run without any sync at all
catucci: :-)
Rainer: (I wont add this, it is too much in depth)
so the message mutex is enabled depending on whether it is needed or not
similarly, when the message is constructed, it is done in a single thread
the mutex is disabled while doing so
only after it is constructed, the object is finalized, which then (and only then) enables the mutex
so it is OK that for part of the object's lifetime it is not synchronized
maybe never, depending on the config
that gains quite some performance
even on MP, because of the construction phase
of course, it doesn't ease troubleshooting...
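The construction pattern Rainer describes could be sketched roughly like this in C (a hypothetical sketch only; `msg_t`, `MsgConstruct`, `MsgFinalize` and the field names are invented for illustration and are not rsyslog's actual API):

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical sketch of the pattern described above. The object is
 * built by a single thread with locking disabled; MsgFinalize() turns
 * the mutex on only once the message may become shared. */
typedef struct {
    pthread_mutex_t mut;
    int             mutEnabled;  /* 0 while under construction */
    int             refCount;
} msg_t;

msg_t *MsgConstruct(void)
{
    msg_t *m = calloc(1, sizeof(msg_t));
    if (m == NULL)
        return NULL;
    pthread_mutex_init(&m->mut, NULL);
    m->mutEnabled = 0;  /* single-threaded phase: no locking needed */
    m->refCount   = 1;
    return m;
}

/* Called exactly once, by the constructing thread, before the message
 * is enqueued. Only from this point on is locking required. */
void MsgFinalize(msg_t *m)
{
    m->mutEnabled = 1;
}

void MsgLock(msg_t *m)
{
    if (m->mutEnabled)
        pthread_mutex_lock(&m->mut);
}

void MsgUnlock(msg_t *m)
{
    if (m->mutEnabled)
        pthread_mutex_unlock(&m->mut);
}
```

The performance win comes from skipping lock operations entirely during the single-threaded construction phase, and, depending on the config, for the whole lifetime of the message.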
catucci: Yes, as usual, I've been too fast in both thinking and writing... what I was talking about was the enqueueing and not the real building.
Rainer: yeah, that's right
the message is first finalized, then it is enqueued
from then on, the sequence is always the same (there is only a single call to do so)
well... is it? good point
will check reality vs. theory tomorrow ;)
that's why I love discussion :)
catucci: good, see you.
Rainer: cu - have a great evening!
Sent at 6:49 PM on Wednesday
rgerhards: good morning - what's your lab doing? ;)
Sent at 9:47 AM on Thursday
catucci: still running well.
remember: with -d option active.
rgerhards: sounds good. mine was running well, too
yeah... verification will be a long road
at some point we should switch back to production code
catucci: Just another small note on your weblog entry:
rgerhards: be happy to hear it
catucci: you seem to have missed a very important ingredient of my mix: a very high load on the system; sar -q gives me these figures:
rgerhards: oh yeah
I forgot to write that down
should add it
catucci: 07:15:01 runq-sz plist-sz ldavg-1 ldavg-5 ldavg-15
09:45:01 0 216 0.70 0.59 0.49
rgerhards: all in all, now that I (think) I know the failure scenario
I know that it is only probable under high load
catucci: this is important, since with big plists, it's much more probable for the threads to be preempted by the scheduler, and to jump to other packages
rgerhards: and even then it is so unlikely to happen that you need an awful lot of iterations, which also requires high load
catucci: you know, I'm a bit old school, and I'm used to talk about packages for multi core...
rgerhards: so the ingredient is to have a fast machine, many cores, keep them utilized, but in a way that as little preemption of the rsyslog threads as possible happens
do you concur with this POV?
Sent at 9:54 AM on Thursday
catucci: Almost completely: a high context switch rate is needed even between the syslog threads IMHO: if the threads just sleep undisturbed on waits/polls/selects, the probability of the race happening goes infinitesimal. to get to the race you (almost) need a forced context switch (I think)
rgerhards: now I agree mostly... I think the context switch itself is not necessary. But it is necessary that access to the reference counter happens concurrently.
catucci: I no longer own the old two-Pentium-III system that was my workstation for the last ten years, but I'd bet you could see the races there just as well; maybe in a shorter time than on my mx-01
rgerhards: from my POV, this can happen either due to context switches OR due to two threads being executed on two different execution units at the same time. Actually, I think the latter more easily leads to the failure scenario.
catucci: yes, that's very true; the high context switch rate does more to probabilistically spread the syslog run between different packages
rgerhards: I think the key is to understand the scenario. I completely agree that now that we have details, we can craft lab environments that will reproduce the issue on not-so-powerful hardware. One needs to know the ingredients.
catucci: remember that multi-core packages usually have a shared cache between the cores in the same package
rgerhards: For example, I did one lab where I increased the concurrency of rsyslog actions. That led to more context switches and more mutex calls, and led to no repro at all.
This was overdone. When I reduced the concurrency, I was again able to reproduce. For every machine, there seems to be a mix that probably triggers the failure scenario. I think this was one of the reasons why it was so hard to find. Not knowing the root cause, the bug appeared very random and was extremely hard to track down to a config/environment.
BTW: do you mind if I post a copy of this conversation on the blog? I will probably not be able to update the analysis today, but our conversation should be very helpful to everyone interested.
catucci: In that lab, did the queuer and the dequeuer share the same mutex opinion? The key ingredient seems to be using two different synchronization primitives on the two sides
rgerhards: yes and no - in action execution there are two mutex types involved, each with multiple instances
catucci: No problem at all for publishing
rgerhards: publishing: thx
there is one mutex that guards each individual action. No action (output module) may be called concurrently - that's part of the module spec (for good reason with many modules)
so there is one mutex for each action, and it is acquired before an action is entered and released when it is done.
The other mutex class guards each individual message. that is the one described in the analysis.
if two different action invocations process a message, they may concurrently try to modify the message (that depends on the actual config, most importantly on the templates used and order of action execution)
So any such modification - AFTER the message has been finalized (remember yesterday evening) - is guarded by the message mutex.
catucci: right; I was thinking about the "unlocked" case in MsgAddRef's #ifdef xHAVE_ATOMIC_BUILTINS case
rgerhards: if you have n actions, a single message may theoretically be in use n times
(side-note: if n is 1, it can not be used multiple times, and this is - I think - one of the cases where the message mutex is "turned off")
I needed to set stage ;)
with that design, all actions use exactly the same sequence to increment and decrement the reference counter.
so this will always be used in a consistent manner. That part of the logic was absolutely right. BUT.... as I wrote, the msgAddRef used atomics, while msgDestruct used the message mutex.
catucci: Right: this is the problem (IMHO): the mutex and the atomics were not coherent.
rgerhards: so while the sequence of calls to msgAddRef and msgDestruct was right and consistent, the operations carried out by these calls were inconsistent
yes, exactly
catucci: there should be only one kind of sync primitive in use, be it atomics or mutexes
rgerhards: thus there are two fixes: either use the mutex exclusively (that's running in your lab) or use the atomics exclusively (I've run that in my lab since around noon yesterday)
and this is what the final fix will look like: it'll check if atomics are available and, if so, use them. If they are unavailable, we will use the mutex (which is much slower)
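The fix described here boils down to making both sides of the reference count use the same primitive. A sketch of that idea (hypothetical names and structure; rsyslog's real code differs, and `HAVE_ATOMIC_BUILTINS` stands in for whatever the build system detects):

```c
#include <pthread.h>

/* Illustrative sketch of the fix: the increment (AddRef) and the
 * decrement (Release) MUST use the SAME synchronization primitive.
 * The original bug was exactly this split: atomics on one side,
 * mutex on the other, so the two operations did not exclude each
 * other and the counter could be corrupted under concurrency. */
typedef struct {
    pthread_mutex_t mut;
    int             refCount;
} obj_t;

#ifdef HAVE_ATOMIC_BUILTINS
/* Fast path: GCC atomic builtins on BOTH sides. */
static void objAddRef(obj_t *o)  { __sync_add_and_fetch(&o->refCount, 1); }
static int  objRelease(obj_t *o) { return __sync_sub_and_fetch(&o->refCount, 1); }
#else
/* Fallback: the mutex on BOTH sides (slower, but coherent). */
static void objAddRef(obj_t *o)
{
    pthread_mutex_lock(&o->mut);
    o->refCount++;
    pthread_mutex_unlock(&o->mut);
}
static int objRelease(obj_t *o)
{
    pthread_mutex_lock(&o->mut);
    int remaining = --o->refCount;
    pthread_mutex_unlock(&o->mut);
    return remaining;  /* caller destructs the object when this hits 0 */
}
#endif
```

Either branch is correct on its own; mixing them, as the buggy version effectively did, is what opened the race window.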
catucci: yes, this is what I think too. and I'm happy we have an experimental confirmation!
rgerhards: indeed
The problem is that for now we have just an indication; I wouldn't say it's a confirmation yet.
catucci: what branch should I download to test the atomics only?
rgerhards: But it is a pretty good indication.
catucci: you know... science is made of falsifications...
rgerhards: I have no branch yet to test with. What runs in my lab is a heavily hacked version, quite ugly
my point is we do not have enough proof yet
but... we have a sound theory
plus we know for sure (as we can show a failure case) that this was a bug
catucci: right.
rgerhards: we do not know for sure if it was the only one. But we know it was one
so code has improved
catucci: right too.
rgerhards: and both of our lab results show stability has increased, at worst only in some scenarios
in that sense, however, theory and experiment go together
and this indeed makes me feel good :)
I'll see that I get a new master branch together today that includes everything needed to address the failure scenario AND enabling atomics again.
I need to be careful as the atomics version is ugly and I need to verify I do everything needed in a clean version, too.
It would be excellent if you could then run that new master branch. And then we see what happens. I think a debug build would be good at first (as this was the repro environment) and once this works out well (a day or two without aborts should be sufficient), we can switch to the production build.
In parallel, I hope that others who experienced the problem will also try the new code.
so over the next couple of days, maybe weeks, we should get a good feeling if there still is an issue.
I have to admit it is quite complex code, so I wouldn't rule out that there is something else...
catucci: You are right, especially about the others' systems. Maybe we could try to get in touch with the other Debian users experiencing the crash
rgerhards: definitely. There is also an entry in Debian's bug tracking system. I already did a patch for the version used by Debian lenny, and Michael Biebl, the package maintainer, built an updated package this night.
so I think we should get some feedback from there, too. But the problem is that the problem occurred very sporadically in that case. I'd say anything less than two weeks of undisturbed operation is not even an indication
I will also ask the folks on the rsyslog mailing list to apply patches .. as soon as new versions are out.
catucci: good. see you.