why you can’t build a reliable TCP protocol without app-level acks…

Again, there we are in the area of a reliable plain TCP transport. Some weeks ago, I noted down that plain tcp syslog is unreliable by design. Yesterday, Martin Schütte came up with his blog post “Reliable TCP Reconect made Easy“. In it, he describes how he thinks one can get around the limitations.

I was more than interested. Should there be a solution I had overlooked? Martin’s idea is that he queries the TCP stack if the connection is alive before sending any data. He outlined two approaches with the later being a non-blocking recv() right in front of the send(). The idea is that the recv() should detect a broken connection.

After I thought a bit about this approach I had a concern that it may be a bit racy. But in any case, new thoughts are worth evaluating. And a solution would be most welcome. So I quickly implemented some logic in my existing gtls stream driver. To make matter simple, I just did the recv() and sent the return status and errno to debug output (but did not try any reconnects based on it). And then the ugly happened: I always got the same EAGAIN return state (which is not an error), no matter in what state the connection was. I was even able to pull the receiver’s network cable and the sender didn’t care.

So, this approach doesn’t work. And, if you think a bit more about it, it comes at no surprise.

Consider the case with the pulled network cable. When I plugged it in again a quarter hour later, TCP happily delivered the “in-transit” messages (that were sitting in the client buffer) after a short while. This is how TCP is supposed to work! The whole point is that it is designed to survive even serious network failures. This is why the client buffers messages in the first place.

What should the poor client do in the “pulled network cable” case. Assume the data is lost just because it can not immediately send it? To make it worse, let’s assume the data had already left the system and successfully reached the destination machine. Now it is sitting in the destination’s receive buffer. What now if the server application (for example due to a bug) does not pull this data but also does not abort or close the file descriptor? How at all should TCP detect these failures? The simple truth is it isn’t and it is not supposed to be.

The real problem is the missing application level acknowledgment. The transport level ACK is for use by TCP. It shall not imply anything for the app layer. So if we depend on the TCP ack for an app-level protocol, we abuse TCP IMHO. Of course, abusing something may be OK, but we shouldn’t wonder if it doesn’t work as we expect.

Back to the proposed solution: the problem here is not that the send call does not fail, even though the stack knows the connection is broken. The real problem is that the TCP stack does not know it is broken! Thus, it permits us to send data on a broken connection – it still assumes the connection is OK (and, as can be seen in the plugged cable case, this assumption often is true).

As such, we can NOT cure the problem by querying the TCP stack if the connection is broke before we send. Because the stack will tell us it is fine in all those cases where the actual problem occurs. So we do not gain anything from the additional system call. It just reflects the same unreliable status information that the send() call is also working on.

And now let’s dig a bit deeper. Let’s assume we had a magic system call that told us “yes, this connection is fine and alive”. Let’s call it isalive() for easy reference. So we would be tempted to use this logic:

if(isalive()) then
send()
else
recover()

But, as thread programmers know, this code is racy. What, if the connection breaks after the isalive() call but before the send()? Of course, the same problem happens! And, believe me, this would happen often enough ;).

What we would need is an atomic send_if_alive() call which checks if the connection is alive and only then submits data. Of course, this atomicity must be preserved over the network. This is exactly why databases invented two-phase commits. It requires an awful lot of well thought-out code … and a network protocol that supports it. To ensure this, you need to shuffle at least two network packets between the peers. To handle it correctly (the case where the client dies), you need four, or a protocol that works with delayed commits (as a side-note, RELP works along those lines).

Coming back to our syslog problem, the only solution to solve our unreliability problem without specifying app-layer acks inside syslog, is to define a whole new protocol that does these acks (quite costly and complex) out of band. Of course, this is not acceptable.

Looking at all of this evidence, I come to the conclusion that my former statement unfortunately is still correct: one can not implement a reliable syslog transport without introducing app-level acks. It simply is impossible. The only cure, in syslog terms, is to use a transport with acks, like RFC 3195 (the unbeloved standard) or RELP (my non-standard logging protocol). There are no other choices.

The discussion, however, was good insofar that we now have generally proven that it is impossible to implement a reliable TCP based protocol without application layer acknowledgment at all. So we do not need to waste any more effort on trying that.

As a funny side-note, I just discovered that I described the problem we currently see in IETF’s transport-tls document back than on June, 16, 2006, nearly two years ago:

http://www.ietf.org/mail-archive/web/syslog/current/msg00994.html

Tom Petch also voiced some other concerns, which still exist in the current draft:

http://www.ietf.org/mail-archive/web/syslog/current/msg00989.html

As you can see, he also mentions the problem of TCP failures. The idea of introducing some indication of a successful connection was quickly abandoned, as it was considered too complex for the time being. But as it looks, plain technical fact does not go away by simply ignoring it ;)

UPDATE, half an hour later…

People tend to be lazy and so am I. So I postponed to do “the right thing” until now: read RFC 793, the core TCP RFC that still is not obsoleted (but updated by RFC3168, which I only had a quick look at because it seems not to be relevant to our case). In 793, read at least section 3.4 and 3.7. In there, you will see that the local TCP buffer is not permitted to consider a connection to be broken until it receives a reset from the remote end. This is the ultimate evidence that you can not build a reliable syslog infrastructure just on top of TCP (without app-layer acks).