root cause of security issue in rsyslog

If you have followed the rsyslog mailing list, you have noticed that we had a small, but still noteworthy, security issue in rsyslog recently. In short, the $AllowedSender directive was accepted but no longer honored, giving potentially any remote system a chance to send messages to the instance in question (it’s a minor issue because most people rightfully tend to use firewalls to carry out that kind of access control).

Now that this is settled, I sat back, relaxed and meditated a bit about the root cause of the issue. Actually, I didn’t need to think very hard. The problem was introduced when I implemented the netstream driver class. During that implementation, I moved a lot of code to the now-modular interfaces. Among that code were the access control lists, whose roots were kept in global variables at that time.

I screwed up the first time when I allowed them to remain global variables. We all know that global variables are evil, especially when made publicly accessible. Once we moved to a proper interface, I should have replaced them with a function call. Doing that in the first place would have prevented the problem. Why? Because I just initialized the now interface-specific “representative” of the global variable with the value at the time of interface creation, which meant NULL in all cases. So whoever used the interface always got an empty list, which meant no access control was configured.

Any user configuration still hit the global variable, which caused the ACLs to be created, but no part of the code ever accessed it any longer. One may argue whether that is a simple coding error, and there is some truth in it, but I’d still say it’s primarily a design issue (bad design promises to provide the quick solution, but it seldom does…).
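
The failure mode described above can be reproduced in a few lines. The following is a minimal Python sketch, not rsyslog’s actual C code; the names allowed_senders, get_allowed_senders, SnapshotInterface and AccessorInterface are hypothetical and exist only for illustration. It contrasts an interface that copies the global’s value once at creation time with one that asks through a function call on every check.

```python
# Illustrative sketch only; rsyslog itself is written in C and its real
# data structures differ. All names here are hypothetical.

allowed_senders = None  # the "global variable" holding the ACL root


def get_allowed_senders():
    """Accessor: always returns the current ACL, never a stale copy."""
    return allowed_senders


class SnapshotInterface:
    """Buggy design: copies the global's value when the interface is created."""

    def __init__(self):
        self.acl = allowed_senders  # still None at this point

    def is_allowed(self, sender):
        # An empty ACL means "no access control configured".
        return self.acl is None or sender in self.acl


class AccessorInterface:
    """Fixed design: looks the ACL up through a function on every check."""

    def is_allowed(self, sender):
        acl = get_allowed_senders()
        return acl is None or sender in acl


# Interfaces are created first, the user configuration is processed later.
snapshot = SnapshotInterface()
accessor = AccessorInterface()
allowed_senders = {"192.0.2.10"}  # config processing creates the ACL

print(snapshot.is_allowed("198.51.100.7"))  # True: ACL silently ignored
print(accessor.is_allowed("198.51.100.7"))  # False: ACL honored
```

The snapshot variant never notices that an ACL was configured after it was created, which is the same shape of error as the one that disabled $AllowedSender.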

And as it always takes at least two faults to really screw up, the next major issue wasn’t far away. Rsyslog had not – and still has not! – a formal test suite that you can simply run each time code changes. I have begun to employ some limited test cases via “make check”, but they cover primarily exotic aspects and do not yet contain any serious test case that involves actually running rsyslogd against any serious number of messages. One of the reasons is that I had no good tool for doing so, or that I considered building the test suite to be too expensive (in comparison to what else needed to be done). As a small excuse I would like to mention that some others have encouraged this view. But I always knew it is a lame excuse…

So exactly what usually happens in such cases happened: the test case vital to discovering this problem was not present in the series of tests I ran against the new code. As usual, the programmer himself tests whatever he thinks needs testing. And, also as usual, this means that the programmer doesn’t test those things that he cannot imagine being wrong. Usually, these are the real problems, because if the programmer did not think of a potential problem, he did not implement, or at least carefully check, for it. This is just another example of why external testers are needed.

In open source, users adopting the devel and beta releases are often considered to be these testers. Quite frankly, I could not afford a full testing lab and continue developing the project. I think this is true for most open source projects. “Free testing” by early adopters is a major advantage over closed source. But this time, this failed, too. Probably the (small) club of early adopters also did not think about this issue. Maybe that’s because the more knowledgeable folks prefer to solve this problem with a firewall, which is the better approach to use for various reasons (not to be outlined here, see security advisory for details).

Finally, the issue came up in the form of a bug report. Unfortunately, that was quite late, months after the initial release. But it was reported, and so I could fix it as quickly as possible once I knew.

The important lesson to learn is that it usually takes more than one error to cause real problems. But these things happen!

I think the case also strengthens the need for good, systematic testing. Some time ago, I began to look into the DejaGnu testing suite and asked the mailing list if somebody had some experience with it. Unfortunately, nobody showed up. I’ll now give it another shot. Too often, there have been small problems that were rooted in things not being consistently tested. Most often, these were only really small issues, like missing files, or some variables not defined in some conditional path. Since I improved my “make distcheck” settings, many of these small items no longer appear. Even the small set of current exotic tests reveals a problem from time to time.

So I think it would be wise to try to expand the test cases that rsyslog runs on a regular basis. Frankly, I will not be able to create a full suite from the ground up. But the idea is that once I manage to get DejaGnu – or something similar – up and running, and acquire the necessary knowledge, I could gradually add tests as I go along. So over time, the number of tests would increase and we could finally verify much better, and automatically, that existing functionality is not broken by new features.

I will try to focus my next release steps on DejaGnu. Obviously, any help in doing so is appreciated.

security…

No system is totally secure. Few systems are totally insecure. Most systems are between these two extremes. But what does “more secure” mean? We had an interesting discussion on the rsyslog mailing list on the use of root jails. I’d like to reproduce one of my posts here, not only because it is mine, but because it can guide us a bit towards the security goals for rsyslog.

Let me think of security as a probability of security breach. S_curr is the security of the reference system without a root jail. S_total is the security of a hypothetical system that is “totally secure” (knowing well that no such system exists). In other words, the probability S_total equals 0.

I think the common ground is that a root jail does not worsen security. Note that I do not say it improves security, only that it does not reduce a system’s security. S_jail is the security of a system that is otherwise identical to the reference system, but with a root jail. Then S_jail <= S_curr, because we assume that the security of the system is not reduced.

I think it is also common ground that the probability of a security breach is reduced if the number of attack vectors is reduced, without any new attack vectors being added. [There is one generic “attack vector”, the “thought of being secure and thus becoming careless” which always increases as risk is reduced – I will not include that vector in my thoughts]

We seem to be in agreement that a root jail is able to prevent some attacks from being successful. I can’t enumerate them and it is probably useless to try to do so (because attackers invent new attacks each day), but there exist some attacks which can be prevented by a root jail. I do not try to weigh them by their importance.

For obvious reasons, there exist other attacks which are not affected by the root jail. Some of them have been mentioned, like the class of in-memory based attacks, code injection and many more.

I tend to think that the set of attack vectors that can be prevented by a root jail is much smaller than the set of those which cannot. I also tend to think that the latter class contains the more serious attack vectors.

But even then, a root jail seems to remove a subset of the attack vectors that otherwise exist and so it reduces the probability of a security breach. So it benefits security. We can only argue that it does not benefit security if we can show that in all cases we can think of (and those we cannot), security is not improved. However, some cases have been shown where it improves, so it cannot be that security is not improved in all cases. As such, a root jail improves security, or more precisely the probability of a security breach is

0 < S_jail < S_curr

The benefit we gain is the difference between the reference system’s probability of a security breach and that of the system with the jail. Let S_impr be this improvement, then

S_impr = S_curr - S_jail

Now the root jail is just one potential security measure. We could now try to calculate S_impr for all kinds of security measures, for example a privilege drop. I find it hard to do the actual probability calculations, but I would guess that S_impr_privdrop > S_impr_jail.

Based on the improvements, one may finally decide what to implement first (either at the code or admin level), all of this of course weighted with the importance of the numbers.

In any case, I think I have shown that both statements are correct:

  • the root jail is a security improvement
  • there exist numerous other improvements, many of them probably more efficient than the jail
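
To make the argument above concrete, here is a small Python sketch. It assumes, purely for illustration, that attack vectors succeed independently of each other, and the per-vector probabilities are made-up numbers, not measurements of any real system. Under that assumption the breach probability is one minus the product of all (1 - p_i), so removing any vector with p_i > 0, without adding a new one, lowers it.

```python
from math import prod


def breach_probability(vectors):
    """Probability that at least one independent attack vector succeeds."""
    return 1 - prod(1 - p for p in vectors.values())


# Hypothetical per-vector success probabilities (illustrative only).
reference = {
    "filesystem_escape": 0.02,  # assumed to be blocked by a root jail
    "memory_corruption": 0.05,  # assumed to be unaffected by the jail
    "code_injection": 0.03,     # assumed to be unaffected by the jail
}

# Same system with a root jail: one vector is removed, none are added.
jailed = {k: p for k, p in reference.items() if k != "filesystem_escape"}

s_curr = breach_probability(reference)
s_jail = breach_probability(jailed)
s_impr = s_curr - s_jail

print(f"S_curr = {s_curr:.4f}")
print(f"S_jail = {s_jail:.4f}")
print(f"S_impr = {s_impr:.4f}")
```

The same function could be used to compare measures, for example by modeling a privilege drop as a different set of removed or weakened vectors, but the outcome depends entirely on the numbers one is willing to assume.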

starting with rsyslog v4…

Finally, rsyslog v4 is materializing. Yesterday, I released the first devel version that is named 4.1.0. This starts a totally new branch. I decided to finally move on to v4 because I am enhancing performance quite a bit and this causes a number of big changes to the core engine as well as many modules. So rather than doing all of this in v3, I thought it is a good time to move to a new major version.

I expect that the new code will de-stabilize the project for some time, and so I now have a feature-rich v3-stable release which will be available to everyone, at the price of less performance. That doesn’t mean v3 is slow, but v4 is much faster still. So I can finally begin to experiment a bit more with the new v4 branch and don’t need to worry too much that I may be introducing changes that are hard to roll into a stable within a reasonable time frame.

As far as the first v4-stable is concerned, I do not expect one to surface before February 2009 and, obviously, it will not be as stable as v3-stable is.

So the game is starting once again, and I hope you enjoy it ;)

Windows 2008 Event Log…

Did you know – Windows 2008 has a much changed event log format. There are new APIs and formats all over that operating system. The first incarnation of the new logging system was seen in Windows Vista, but, being a workstation OS, it did not receive much attention from the corporate world.

When Vista came out, we at Adiscon immediately introduced the new event log monitor V2 service in MonitorWare Agent and EventReporter. That service worked well, but not many customers ever used it (who really monitors the workstation event logs…).

With the rise of Windows 2008 Server, we saw a notable increase in interest. And we finally got questions! While we always supported all properties, some of the former event log monitor properties are not available under the Windows 2008 event system. A number of customers asked if we could map them. Makes an awful lot of sense, especially if you have log analysis scripts that expect those fields. So I went to development (no longer working on Windows myself these days) and asked what we could do. I thought it would be trivial. But it wasn’t. Some mappings turned out to be really hard, plus we got the impression (from lack of discussion and coverage) that we are probably among the first to ever work in this area. But nothing can stop a good programmer ;)

So I was quite happy to learn today that we have finally managed to include a full emulation of pre-Windows 2008 event properties. The code currently is in beta, and is available both for MonitorWare Agent and EventReporter.

I am especially happy as the new emulation also makes it far easier for phpLogCon to work with Windows events in a consistent way. Not to mention that I like happy customers ;)

regular expression tool for rsyslog

Regular expressions are quite powerful, but the syntax in rsyslog is, well, not easy to use. Also, as we have seen, the usual regex check tools don’t always work well with rsyslog’s POSIX expressions. I have created a web-based regular expression checker/generator today. It is more or less finished, but of course needs fine-tuning.

If this tool turns out to be useful (judging by comments and access count over the next weeks), I will probably do some other online tools aiding in config generation. This will be part of the overall effort to make it easier to unleash rsyslog’s full potential (all too often people simply do not know what magic they could do ;)).

New SyslogAppliance Web Site Design…

The joy of being part of a smaller team is that you get exposed to things that you otherwise would never be. Today, this hit me in the form of the redesign of the SyslogAppliance web site. We set up a very basic site when we announced the initial beta a few days ago and now we thought to upgrade it at least a bit. I ended up writing the new web site copy. I personally like the new site. It still is a very small one, with not yet that much information available about the syslog appliance project. But I think it has evolved rather quickly. Anyhow, I’d appreciate feedback from my readers. If you have a few minutes, I’d appreciate it if you could have a glance at www.syslogappliance.de and let me know what you think.

virtual appliance for disaster recovery?

I was asked what role virtual appliances play in disaster recovery planning. I thought I’d share my view here. Speaking for ourselves as a smaller company: we are moving towards virtual environments not only in order to consolidate systems, but also because it is much easier to move over functionality from a failed system to another. Some of the functions (like mail gateway, firewall etc) do not even require state data, so they can simply be restored by using a generic template virtual machine.

Instantiating this is much quicker than building a machine with scripts from scratch, not to mention that we do not need to have the hardware in stock. In fact, we think about moving such functionality even to data center servers and thus being able to quickly switch between them if need be.

My syslog appliance could play a similar role in disaster recovery. While it probably is not appropriate to lose data (depending on use case), it may make sense to set up a new temporary appliance, just to continue gathering data and provide analysis while the rest of the system is restored. Instant log analysis is probably a key thing you would like to have in your early recovery stages.

Doing an appliance right…

Why do people turn to (virtual) software appliances? I think the number one reason is ease of installation. If an appliance has one benefit, then it is that the system was put together by someone who really knows what he does. So the end-user can simply “plug it in” into the local network, do a few configuration steps and enjoy the software.

While we worked on the virtual syslog appliance, we have checked out various other appliances. They live up to this promise in very different ways. Some are really plug and play, while others are more a demo-type of a complicated system, where the user does not know what to do with the appliance unless he reads through a big manual. This is definitely not what people are after if they look for appliances.

With SyslogAppliance, I try hard to make things as simple as possible. I learned that I probably need to add some nice HTML start page, not only the plain phpLogCon log analysis display. So I have now begun to do this appliance home page, only to see that displaying information is probably not sufficient.

I will need to do some basic configuration of the appliance, too. I was (and am) tempted to use something like webmin. But on the other hand, there are so many settings. I think most appliance users will never want to touch them. So a full config front-end is probably good for those in the know. But for the rest, a software appliance should come with the bare minimum of config options that are absolutely essential to do the job. For me, the “make everything configurable expert”, this is a hard lesson to learn. Usability is top priority with appliances and usability means presenting only those options that are useful to most folks (the rest will probably not use an appliance, at least not for anything but demo).

I thought I’d share this interesting thought on my way to creating great virtual software appliances. Besides logging, I have some other ideas (and all benefit from a great logging interface), but it is too early to talk about these now.

New rsyslog HUP processing

There have been some discussions about rsyslog HUP processing. Traditionally, SIGHUP is used to signal the syslogd to a) close its files and b) reload its config. Rsyslog carried over this behavior from sysklogd.

However, rsyslog is much more capable than sysklogd. Among other things, it is able to buffer messages that were received, but could not yet be processed. To remain compatible with sysklogd’s way of doing HUP, rsyslogd does a full daemon restart when it is HUPed. Among other things, that means that messages from the queue are discarded, at least if the queue is configured with default settings. David Lang correctly stated that this may surprise some, if not most users. While I am still of the view that discarding the queue, under these circumstances, is the right thing to do, I agree it may be surprising (I added a hint to the man pages recently to reduce the level of surprise).

Still, there is no real need to do a full daemon restart in most cases. The typical HUP case is when log rotation wants to rotate files away and needs to tell rsyslogd to close them. Actually, I asked if anybody knew any script that HUPs rsyslog to do a full config reload. The outcome was that nobody knew of one. However, some people would like to stick with the old semantics, and there may be reason to do so.

I have now implemented a lightweight HUP to address this issue. It is triggered via a new configuration directive, $HUPisRestart. If set to “on”, rsyslogd will work as usual and do a (very, very expensive) full restart. This is the default to keep folks happy that want to keep things as backwards-compatible as possible. Still, I guess most folks will set it to “off”, which is the new non-restart mode. In it, only output files are closed. Actually, the output plugins receive a HUP notification and can do whatever they like. Currently, only omfile acts on that and closes any open files. I can envision that other outputs, e.g. omfwd, can also be configured to do some light HUP action (for example close outbound connections).

The administrator needs to select one of the two modes for the system. I think this is no issue at all and it saves me the trouble of defining multiple signals just to do different types of HUP. My suggestion obviously is to use the new lightweight HUP for file closing, which means you need not change anything for logrotate et al. Then, when you need to do a config reload, do a “real” restart by issuing a command like “/etc/init.d/rsyslogd restart”. And if there really exists a script that requires a config-reload HUP, that should be changed accordingly.
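
The difference between the two modes can be sketched as follows. This is a Python illustration of the idea only, not rsyslog’s implementation; the class FileOutput and the function handle_hup are hypothetical names, and the full restart is modeled merely as dropping an in-memory queue. On a lightweight HUP every output is just told to close its file, which is then reopened on the next write, so a rotated file is released while queued messages stay in memory.

```python
import os
import tempfile
from collections import deque


class FileOutput:
    """Hypothetical file output that reopens its file lazily."""

    def __init__(self, path):
        self.path = path
        self._fh = None

    def write(self, msg):
        if self._fh is None:  # (re)open after startup or after a HUP
            self._fh = open(self.path, "a")
        self._fh.write(msg + "\n")
        self._fh.flush()

    def on_hup(self):
        """Lightweight HUP action: just close the file."""
        if self._fh is not None:
            self._fh.close()
            self._fh = None


def handle_hup(outputs, queue, hup_is_restart):
    """Dispatch a HUP according to the (modeled) $HUPisRestart setting."""
    if hup_is_restart:
        queue.clear()  # models a full restart dropping an in-memory queue
    for out in outputs:
        out.on_hup()


# Usage: simulate what a log rotation tool does, then deliver a light HUP.
logdir = tempfile.mkdtemp()
logfile = os.path.join(logdir, "messages")
out = FileOutput(logfile)
queue = deque(["queued message"])

out.write("before rotation")
os.rename(logfile, logfile + ".1")  # rotate the file away
handle_hup([out], queue, hup_is_restart=False)
out.write(queue.popleft())  # lands in a fresh file
out.on_hup()
```

In a real daemon, something like handle_hup would be called from the main loop after a SIGHUP handler has set a flag; the sketch calls it directly to stay portable and easy to follow.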

rsyslog v4

Finally, it is time to think about the next major rsyslog release! I have done many enhancements in v3, but the latest performance optimization work leads to a couple of significant changes in the core engine. I think it makes sense to roll these into a new major release. That leaves folks with the option to stay with the feature-rich v3-stable branch, while avoiding some of the potentially unavoidable bugs in the upcoming v4 branch.

From a feature point of view, version 3 would have been good for at least three to four major releases, which I did not do just to prevent you from becoming scared by the pace with which we are moving ;) So I think now is a perfect spot to begin developing v4. I hope that we will see a first beta of that branch around xmas, which, I think, is a nice gift.