Friday, July 25, 2008

rsyslog error reporting - how to do it well...

Rsyslog obviously gains momentum. Not only is it becoming the default syslogd on Debian, a very important distribution, I am also seeing an increasing amount of questions inside the rsyslog forums and mailing list. The later, I think, is a good indication that people begin to care about rsyslog and also begin to explore rsyslog's enhanced features.

I am very happy with this development. However, it also shows a downside. Rsyslog, for obvious reasons, "offers" much more chance for misconfiguration than its feature-bare ancestor sysklogd. While I tried hard to make configuration simple and intuitive, I managed to succeed only in the simple cases. If you configure rsyslog for complex needs, I have provided ample ways to screw up ;) That situation will hopefully get better with the new scripting engine, but even that will not be able to totally resolve the issue. To help people getting over this phase, rsyslog offers a myriad of diagnostics. Whenever something is wrong, it logs at least one message telling you (and often a couple of them, helping to identify the culprit).

With some frustration, however, I begin to see that many people never see these diagnostics. Many people try a configuration and just notice that it does not work. They never look at rsyslog's own detailed error message (and some even do not write them anywhere, so they actually have nothing to look at ;)). So it has become quite common that for questions raised on the forum, we go lengths through the process of obtaining (large) debug logs, just to see that the debug log contains the otherwise-ignored error message that explains it all.

A (too) simple approach would be to blame users: why, the heck, don't they pay attention to their system setup. Beside being lame to blame users, it doesn't help solve the issue. The root cause seems to be that people tend not to know where to look for help. One approach is to improve documentation. I've gone some length on that, but what comes into our way is that people (including me ;)) tend to ignore documentation until there is no way around it (at this point, almost always too late, the foolishness of that approach is proven ;)).

An additional complexity is that people need to have at least some working configuration, including permissions, to make rsyslog record its error messages. A while ago, I thought about an internal message buffer, for error messages and maybe others, that can be viewed via a web browsers (this implies a simple http server inside rsyslogd). So folks with problems could simply point their web browser at rsyslogd and see which diagnostics exists. This solves at least the problem of recording and finding the messages. It does not solve the issue of people not knowing to do that, but it improves the situation: it is easier to tell "just point your web browser to" then to instruct people on how to review log files or create and send debug logs.

Up until now, I have stayed back from this approach because of security concerns I have. Such a http server, especially if it would also enable to view some of the live log data and live state information, could be an ideal tool for attackers. Of course, I can disable it by default and limit its features. But that is counter-productive in regard to the simple troubleshooting case. So I stayed back from it.

Now, seeing that the support problem (and pain people experience) is becoming worse, I have taken another look at this approach. I now think there is a compromise: I can create a separate plugin, one that can be loaded via a single "$ModLoad", but is not by default. So the system is secure by default but it is easy to instruct people to activate advanced diagnostics capability. It is obviously also easy to disable them, but what if people forget (and we tend to, don't we?). That would leave the attack vector intact once it was enabled. I think I now have a cure: What if the plugin automatically disables itself after a given period of time. Looking at the use case, it would be sufficient for most cases if diagnostics are available for 10 minutes after rsyslogd restart, but not any longer. I think I will now take this route. While it leaves the attack vector open, it mitigates the risk to a time frame, which usually is very short compared to overall runtime. And, after all, this is just a second safety measure. In the first instance, people should disable diagnostics once they no longer need it. Enabled diagnostics probably warrant a (startup) log message in themselves, so someone who cares to get an error-free startup should not forget about that (and those who don't care, don't deserve any better but still have a safety belt...).

I think I will try to implement a testing plugin in the current devel branch. This was unplanned work, and obviously pushes away other work. In the long term, however, I think it is very important to help folks getting up and running easily, so changing the schedule to solve that need is justified. For my testing purposes, though, I will not start with a http server but with something much more simplistic - just to grasp the essence of this approach.

4 comments:

Julian said...

One thing to suggest would be a '--lint' option would check through the syntax of a configuration file.

SpamAssassin has this option.

I imagine that this would be helpful for Rsyslog as one of the main concepts is the configuration files... Similar to SpamAssassin.

Rainer said...

Hi, thanks for the comment. Actually, this was also brought up on the mailing list and has been implemented as part of 3.21.1 (released yesterday). The option is -N1 (I need to switch to longer options some time later ;)).

Rainer

Frode said...

Why don't you just provide a cmdline option to not fork, and print everything to stdout/err. Many daemons do this. (-f --foreground -d --debug etc.) Then you just instruct users to run it on their shell instead of from initscripts. The webserver is a horrible idea imo. Maybe the debugflag is in there allready? I'm running Debian stable's version.

-Frode-

Rainer said...

Well, as of now all of this is already there, the stderr part thanks to the discussion this entry raised. However, the problem is still that many users do not read the doc and do never start it in that mode ;)

Rainer