On the (un)reliability of plain tcp syslog…

Did you ever use TCP to transfer syslog reliably? And do you think that makes you immune against message loss? If so, it’s time to think again.

There is a subtle, yet important problem with plain tcp syslog. It is a very simple protocol and it works without any app-level acknowledgment. Well, “why care”, you may say “we have the TCP ack”. That, of course is right, but that low-level ack won’t help you in all cases. The TCP ack is fine while everything goes well. But if the connection breaks, there is no mechanism in TCP that tells the sender immediately. In fact, “not immediately” actually is “after [2] hours” (and only if you have keep alive active, else its even later). This is not a flaw in TCP but desired behavior. To be honest, when the server aborts and is restarted, the client will see an error as soon as the server is restarted. On a busy system, this may still be many messages later. So when the client receives the error status, it has no clue at all which messages were lost and which finally made it to the server.

Why that? The problem here is that (by design!), the TCP send() API always returns success but buffers the message locally until it can be transmitted to the remote host. This behavior is fine – it enables TCP to be used reliably even when a network path is temporarily down. The downside is that the sender never knows if and when the data was received by the remote peer. Of course, there is an upper limit on the buffer size. It’s depending on TCP window negotiations, and if I remember correctly the ultimate upper limit is around 240K. I haven’t checked, but the lower limit is probably around 1.5K, the size of an Ethernet packet.

If we now assume a generous syslog message size of 150 bytes (many are much smaller), we can have between 10 and 1,600 messages sitting in this buffer! So we may lose up to 1,600 messages if the server goes down – without even noticing it. Frightening, isn’t it?

I was aware of this problem since the early days of plain tcp syslog. I have even documented it as part of the SELP (simple event logging protocol) discussion on the loganalysis mailing list: see the selp spec, section 2.4. I have to admit that I never took it too seriously. After all, it only happens when the server goes down … and there were so many other occasions on which syslog messages could get lost. The most prominent one probably running out of buffer space on the local syslogd or simply restarting it.

Then, I begun to implement more and more reliability into rsyslog (my *nix syslogd implementation for those not in the know). And the more reliable that engine got, the more the reliability problems of plain tcp syslog surfaced. A typical sample of TCP unreliability is included in this post, which also was the thing that finally reminded me on the root cause of all that (if you search the forum, you’ll find some other example of “the tcp problem”).

Once the core engine was reliable, the unreliability of TCP, so far well-hidden by the unreliability of the rest of the system, quickly surfaced. No advanced flow control, no tricks, no nothing helped. We can not build a reliable solution out of plain tcp syslog. It’s simply a no-go. At this point, it is probably good to add that this is not a rsyslog problem. It’s a protocol issue and as such all softwares implementing plain TCP syslog have the same shortcoming!

My quest for a highly reliable and tamper-proof logging system got shaky at that point. So what to do? Of course, a different, more reliable protocol was needed. It must support app-level acks, so that the client (sender) knows what was processed by the server and what not. If so, it can resend the right messages in case of a connection failure. My initial thoughts were to use liblogging and thus RFC 3195, which bases on BEEP (RFC 3080/3081). However, after some thinking, I came to the conclusion that RFC 3195 is far too less accepted and quite a bit to complicated for the job. I ended up designing a new, reliable and extensible protocol which I called RELP in honor of the initial SELP work. RELP stands for reliable event logging protocol. That implies two things: it’s reliable and it’s NOT only about syslog. RELP can carry a variety of payloads and will probably do so in the future. [You can find some more of my thought process in a previous blog post.]

I have now finished the initial alpha implementation of librelp, the RELP core library. I have also released the first version of rsyslog (3.15.0) with RELP support. There are still a few nits with RELP. Most importantly, it may duplicate some messages in extreme network failure cases (documented inside rsyslog’s imrelp module). But, except for a bug, RELP will never again lose any message … nearly. Now the ball is back to rsyslog. If the network connection breaks and rsyslog sent some yet-unacknowledged messages just before the network connection broke and rsyslog is terminated in that state … then we may indeed still lose a handful of messages. The bad thing is that it happens, the good thing is that there is some cure, it “just” needs to be implemented. So the reliability quest is not over. It just went into the next round. But, again, the probability of message loss is now very much lower than with plain tcp syslog.

So just let me repeat: plain tcp syslog is *not* a reliable solution. It works well as long as everything is working well, but it screws up when the network or the server breaks… And, yes, you don’t really need to care about that until you try to make the rest of the system fully reliable.

UPDATE:

I have received a good comment via mail and provided an in-depth answer, probably interesting to others, too. So I here reproduced it (but snip much of the original mail):

> There is very expensive software out there that can do guarenteed
> packet
> deliver over tcp…

Well… RELP can *really* do the trick. It needs some help in the rsyslog engine to work under all cases, that is when the relp stack is terminated (when the client rsyslog is shut down). But that isn’t rocket science and has just been pushed back by more urgent work (securing the channel).

So far, relp does single ack, that is the server ack’s every packet back to the client. Client discards packets only after ack. So when a session breaks, we know what the server has processed. If, however, some of the acks get lost, we resend packets that the server already processed, resulting in some mild message *duplication* (we currently have a 128 packet app-layer max window and typically acks come in rather quickly). To work around this, the client must ack the server’s acks (double-acking). That sounds scary, but is not. It’s also not performance intense and the protocol is modeled that those with ultra-slim bandwidth can turn it off and live with the message duplication scenario. If double-acks are active, relp client and server will go into a recovery phase after reconnect negotiation, in which mutally acked package numbers are exchanged. To do so, the client must provide a session cookie back to the server. This is a potential attack vector and thus I’d like to have secure transport in place before I do it. Once the recovery negotiation is done, client and server know, and have discarded, what each other peer has processed. The remaining packets are re-sent, but under the same reliability settings. So if the connection breaks at this point again (or during the renegotiation), we’ll simply go into another recovery phase until we finally succeed. It is of course important that session caches be persisted to disk when an engine stops – that will be part of rsyslog. The recovery procedure itself is in librelp. rsyslog needs to add just a slim persistence capability.

Once this is done, and with proper rsyslog queue settings, sufficiently stable hardware and sufficient disk space, I guarantee (not in a lawyers sense, though ;)) that you’ll never lose a message nor get a duplicate.

This is when I think we have achieved our reliability goal. If all goes well… summer? ;)

ANOTHER UPDATE (2008-05-29)

Some idea was brought up about one may be able to circumvent this problem. I tried it out, saw it failed and could also prove that it is simply impossible to have a reliable plain tcp syslog protocol without application-level acknowledgment.