Tuesday, February 09, 2010

Some thoughts on reliability...

When talking syslog, we often talk about audit or other important data. A frequent question I get is if syslog (and rsyslog in specific) can provide a reliable transport.

When this happens, I need to first ask what level of reliability is needed? There are several flavors of reliability and usually loss of message is acceptable at some level.

For example, let's assume the process writes out log messages to a text file. Under (allmost?) all modern operating systems and by default, this means the OS accepts the information to write, acks it, does NOT persist it to storage and lets the application continue. The actual data block is usually written a short while later. Obviously, this is not reliable: you can lose log data if an unrecoverable i/o error happens or something else goes fatally wrong.

This can be solved by instructing the operating system to actually persist the information to durable store before returning back from the API. You have to pay a big performance toll for that. This is also a frequent question for syslog data, and many operators do NOT sync and accept a small message loss risk to save themselves from requiring a factor of 10 servers of what they now need.

But even if writes are synchronous, how does the application react? For example: what shall the application do if log data cannot be written? If one really needs reliable logging, the only choice is to shutdown the application when it can no longer log. I know of very few systems that actually do that, even though "reliability" is highly demanded. Here, the cost of shutting down the application may be so high (or even fatal), that the limited risk of log data loss is accepted.

There are a myriad of things when thinking about reliability. So I think it is important to define the level of reliability that is required by the solution and do that in detail. To the best of my knowledge, this is also important for operators who are required by law to do "reliable" logging. If they have a risk matrix, they can define where it is "impossible" (for technical or financial reasons) to achieve full reliability and as of my understanding this is information auditors are looking for.

So for all cases, I strongly recommend to think about which level of reliability is needed. But to provide an answer for the rsyslog case: it can provide very high reliability and will most probably fulfil all needs you may have. But there is a toll in both performance and system uptime (as said above) to go to "full" reliability.

No comments:

Introducing new team member

Good news: we have some new folks working on the rsyslog project. In a small mini-series of two blog postings I'd like to introduce the...