Wednesday, November 28, 2007
UNAWE ("Universe Awareness for Young Children") is an internationally-recognized organization trying to educate young children on astronomy. The target age is rougly between 4 and 10 years. Besides astronomy, UNAWE is also about people (children) from different cultures talking to each others and sharing their experiences. This is a fantastic idea and I like it very much. There are already a number of UNAWE committees all around the world and I am eager to help grow that network.
If you are into astronomy, work with children, or both: consider contributing!
May it be that we "just" have a compiler option problem? So far, we only call it with "-lpthreads", which only effects the linker. -pthread, if I understood it right, defines _REENTRANT, which in turn causes some reentrant versions of standard library functions to be used. If that isn't outdated information, it could pretty much be our problem.
I am not sure which functions of the run time library are affected by the _REENTRANT macro. So I do not know if I call them. However, non-reentrant run time library functions would be a good explanation for very seldom segfaults that only occur under heavy load and when running in multithreading mode. It would also explain why so many in-depth code reviews did not find anything...
This compiler option finding looks very promising. I'll probably do a special 1.19.11 release with just that patch and then we'll see what happens. In the mean time, I wait if somebody comes up with some more diagnostic information. But I finally have found a good explanation for what I see - at least in theory...
Tuesday, November 27, 2007
I will now see what is required to get rid of the alarm(). Actually, this is harder than it initially looks. For alarm() to go away, I need to set up a real background thread that does the timer ticks and that causes the mark messages. That, in turn, means that I have two concurrent messages sources, which involves quite a bit of synchronization. All of that is currently not needed as the alarm signal simply interrupts the select, which in turn leads to execution of nice sequential code. Oh, and yes: the alarm signal handler of course does (contrary to sysklogd) nothing then to set a global variable flag. So I'll have a look at all that...
> *** glibc detected *** rsyslogd: double free or corruption (!prev):
> (gdb) print (uchar *) 0x09cbd588
> $49 = (unsigned char *) 0x9cbd588 ""
> is it possible to call free on pszRcvFrom when it points at "" ?
> because I can see:
> if(pM->pszRcvFrom == NULL)
> return "";
> return (char*) pM->pszRcvFrom;
> same with getRawMsg/getUxTradMsg/..., places where mudflap
> screams in MsgDestruct().
Sorry - if I have not totally screwed up, it's impossible that this is the problem. I've done (yet another) thorough review. All of these functions are either called to supply another Set...() function (which simply copies over the "", but does not free it) or from MsgGetProp(). MsgGetProp() uses a variable (pbMustBeFreed) to track whether or not a buffer must be freed. In general, unmodified properties are never freed - this happens only at message destruction (MsgDestruct()). MsgDestruct() however, just checks the pointers and frees if they are non-NULL. So MsgDestruct() is not interested in getRawMsg() or another get() function returning an empty string except a NULL pointer.
I have also checked if MsgGetProp() and its helpers correctly handle pbMustBeFreed - and they do it right. So a buffer is only freed when it was dynamically allocated. Message properties are only created during message construction and are free()ed when the message is destroyed.
So I guess we have found another part of the code that does not cause the problem.
What I would like to re-iterate, though, is that the segfault problem seems to disappear under all circumstances if compiled with --disable-pthreads. IF SO (and this is my current assumption), it can not be a general logic error like a double free but must be a synchronization problem. Let me re-iterate: I have yet to find a single installation that has a segfault when running on a single thread! All folks with segfaults that compiled in single threading mode did no longer experience any problems. This is strong evidence to me.
But while I almost think I know it is threading related, I can not find anything wrong in that area, too. The threading model is very simple and construction/destruction of elements is spread in an easy way over the two threads. Basically, the listener thread creates the message and its properties, while the action thread destructs them. There are a few exceptions where properties are derived, but nothing really complex. Yet, I still think it is in the threading area - why does it work in single thread mode otherwise?
This is why we are asking for the help of our users (that means: you!). If you run rsyslog and experience a segfault, we would very much appreciate if you could run a specially instrumented version. Peter Vrabec created it and it contains debugging support as well mudflap support, which is a tool to track down nasty memory management errors.
The RPM can be found at:
Please install it. You will notice that CPU usage is higher than normal, but in most cases this is harmless. If you are concerned, drop me a line with specifics and I'll happily address them.
If a segfault happens when you run this version, please send us
- coredump and
- hardware and OS information (which distro? multiprocessor? ...?)
Our sincere hope is that we receive enough reports to find something in common between them. So, please contribute your segfault info if you happen to have it. This is a great way to contribute to the project!
Wednesday, November 21, 2007
So on to the promised change log (it also covers some past days where I had not reported):
- applied gssapi patch from varmojfekoj - gss-api is now supported
- added some debug message to ommysql
- added user doc for gssapi patch from varmojfekoj - thanks!
- bumped version number to 1.20.0, because of new gss api functionality
- begun to look at dynamic module unloading - this is currently a hack
and works with the mysql module only (which is the only one, so there
is no problem in practice. But it would be good to begin to do it right ;)
- added new modExit() entry point to loadable module interface
- added an identifier to command handler table - need to identify which
command handler entries need to be removed when module is unloaded
- added support so that linkedlist key can be used for owner handle
- enhanced llExecFunc to support deletion of list elements (on behalf of
user function being called, slight interface change)
- enhanced linkedlist class so that list elements can now be deleted based
on the key value they have
- created entry point so that CfSysLine handlers are removed on modExit()
- some cleanup
- modules are now correctly unloaded and de-initialized
So what I am set now for is doing it right and making sure that a loadable module can be cleanly unloaded under all circumstances. That, of course, requires some interface changes, but nothing major (keep in mind the interface is not yet finalized!). This work provides the basis for upcoming work, which will utilize many more loadable modules for other functionality, too. So it is a critical task.
I have to admit, though, that I think I need another day or so to get fully re-acquainted to the code. There was really a toll from my absence and I begin to notice it ;)
Tuesday, November 20, 2007
I have thought about setting up a full lab for GSS-API before carrying on. For now, I have decided to NOT do that. I am sure that the contributors have tested it quite well and the code that I have reviewed looks excellent.
So I will pull it in as is and wait for some feedback from the field (with the assumption "no feedback" equals "OK").
I will then begin to look at the loadable module de-initialization. This is not really clean in the current release, but that's no problem because modules never get unloaded. However, in the long term we need this to be clean.
The mysterious segfault issue is still dangling. I was hesitant to do any larger-scale new development without fixing it. But given the fact that it is extremely hard to find, and obviously happens very seldom, I'll continue developing. I am right now looking into upgrading the dev machine to an x64 OS, where most of the problems happened. My hope is that I will see a segfault during further development work and then hopefully be able to tackle it. I still think that the segfault must be well understood and fixed before I go into some serious multithreading redesign. As such, unfortunately, this issue still holds some of the work scheduled for the next *major* version.
I thought I give you an update here in my end (will also post this to the blog for the others). Any feedback/suggestion is highly welcome.
Monday, November 19, 2007
It adds a new commandline option '-g
' to listen for a connection wrapped with gss-api and few new configuration directives:
$gsslistenservicename <service name>
$gssforwardservicename <service name>
With gssmode set to "encryption" or "integrity" all tcp selectors will be forwarding messages via gss-api.
That's probably useful while I am getting up some real documentation.
I also have not yet evaluated how this patch relates to the syslog-sec IETF syslog security effort, namely syslog-transport-tls. I guess they are related and the patch probably not only provided non-standard functionality but may even make it harder to implement the standard. HOWEVER, if we look at how slow moving that IETF WG is, I do not bother about any compliance problems. They can be dealt with later. What I find much more important is that we have a real-world answer to real-world security question and we do have this now. So what could be more important? ;)
I keep you updated on the progress.
Thursday, November 08, 2007
Those of you interested in rsyslog might like to hear that I will be back in the office next Monday. I'll then review what has happened and go for another round of enhancements.