Wednesday, May 13, 2009

ultra-reliable queueing in rsyslog

As part of the ongoing mailing list discussion on ultra-reliable queueing in rsyslog, I'd like to create another blogpost from discussion content (again, I hope this reference is handy in the future).

The key point with ultra-reliable queues is that no message can be lost once it has been enqueued. In the current (v2,v3,v4 <= 4.1.2) releases of rsyslog, this is ensured as long a the system is guarded against a sudden loss of power (or similar disaster) and even then all but the last messages dequeued are save.

To make queue operations ultra-reliable in that case, the queue needs to be run as a pure disk queue and a checkpoint interval of one needs to be used. This makes the queue reliable at the expense of performance. Note also that with a disk queue only a single queue worker is permitted.

Now let's look at a simplified scenario:

input -> queue -> output

This is not correct in that inputs never connect directly to outputs, but this detail is irrelevant for what I intend to say (replace "input" by "producer" and "output" by "consumer" if you'd prefer to have a fully consistent version).

Let's say the processing time is the cost we incur. If we look at it, the queue's cost dominates by far the combined cost of input and output. In most cases, it dominates input+output cost so much, that you can express the total cost as just the cost of the queue operation, without looking at anything else.

So the input needs to wait until the queue is ready to accept a new message. Once it has done so, the output is notified and immediately acquires the queue lock and begins the dequeue operation. At the same time, the input has already finished input processing (as I said, this happens in virtually "no time" compared to the queue operation). So it needs to wait for the queue lock. Once the dequeue operation is finished, the output releases the lock, and processes the message in virtually no time, too. The input acquired the queue lock, and the whole story begins right from the start.

A small queue may build up depending on the OS scheduler, but I think most often, input and output will just wait for the queue to complete. In that sense, this mode is similar to DIRECT mode, except that a queue can build up when the action needs to be retried.

So to optimize such a scenario, the best thing to do is a totally new queue storage driver for such cases. Sequential files do not really work well if we have multiple producers running.

This is a major effort and even then we need to think about the implications I raised in regard to processing cost above.

First of all, rsyslog was never designed for this use case (preserve every message EVEN in case of sudden power fail). When I introduced purely disk-based queues, this was done to support disk-assisted mode. I needed a queue type to permit me store things on disk, if we run out of memory. As a "side-effect", a pure disk mode was available also (I'd never implemented it for the sake of itself). As it was there, I decided to expose this mode and made it user-configurable. I thought (probably correct) that it could solve some need - a need that I'd consider "very exotic" (think about the reliance on a audit-grade protocol for this to really make sense). And I added the checkpoint capability because it seemed useful, even with disk-based queues, which could be guarded from total loss of messages by using a reasonable checkpoint interval. Again, a checkpoint interval of one is permitted just because this capability came "for free" and could be handy in some use cases.

The kiosk example we discussed 2008 (?) on the mailing list looked like a good match for such an exotic environment. Sudden power loss was an option, and we had low traffic volume. Bingo, perfect match.

However, I'd never thought about a reasonable high-volume system using disk-only queues. Think about the cost functions, such a system boils down to a DIRECT mode queue which just takes an exceptional lot of time for processing messages.

So probably the best approach for this situation would be to run the queue actually in direct mode. That removes the overwhelming cost of queue operations. Direct mode also ensures that the input receives an ack from the output [but there may be subtle issues which I need to check to make sure this is always the case, so do not take this for granted - but if it is not yet so, this should not be too complex to change]. With this approach, we have two issues left:

a) the output action may be so slow, that it actually is the dominating cost factor and not disk queue operation

b) the output action may block for an extended period of time (e.g. during a retry)

In case a), a disk-queue makes sense, because it's cost is irrelevant in this scenario. Indeed, it is irrelevant under all circumstances. As such, we can configure a disk-only action queue in that case. Note that this implies a *very* slow output.

Case b) is more complicated. We do NOT have any proper way to address it with current code. The solution IMHO is to introduce a new queue mode "Disk Queue on Delay" which starts an ultra-reliable disk queue (preferably with a faster queue store driver) if and only if the action indicates that it will need extended processing time. This requires some changes to action processing, but the action state machine should be capable to handle that with relatively slight modification [again, an educated guess, not a guarantee]).

In that scenario, we run the action immediately whenever possible. Only if that take the (considerable) extra effort of buffering messages into a much-slower on disk queue. Note that such a mode makes only sense with audit-grade protocols and senders (which hold processing until the ACK has been received). As such, a busy system automatically slows down to the rate that the queue writer can handle. In this sense, the overall system (e.g. a financial trading system!) may be slowed down by the unavailability of a failing output (which in turn causes the extra and very high cost of disk queue operations). It needs to be considered if that is an acceptable price.

The faster an ultra-reliable queue disk store driver performs, the more cases we can handle in the spirit of a) above. In theory, this can lead to elimination of b) cases.

Nevertheless, I hope I have shown that re-designing the queue (drivers) to support high throughput AND ultra-reliable operations AT THE SAME TIME is far from being a trivial task. To do it right, it involves some other changes too.
Post a Comment