Wednesday, October 31, 2012

rsyslog disk queues: refactor or redesign? - part I

I am currently thinking about the design of rsyslog's disk queues. There are some things that I really like about them, but there are also lots of things I dislike. This blog post is meant to serve as a basis for discussing whether the disk queue system should be refactored or completely redesigned. Please keep the actual discussion on the mailing list; I am blogging only because this makes it easier to have a stable reference for where the whole thing started. This is the first in a series of ? blog posts. It describes the current state (rsyslog 7.3.1 and below).

First of all, it must be noted that both serious refactoring and a redesign are rather costly. So I don't know when I can actually do that; it probably depends on whether sponsors are available for at least parts of the work. With that said, let's get down to technology:

The disk queue system is designed to be used primarily in exceptional situations (like a remote destination being down for a longer period of time) or in high-reliability environments. A main design goal was that the queues should not take too much disk space, at least not more than necessary at any point in time. Also, human operators should be able to fix things (or at least understand them) when things go really wild.

To meet these needs, queue data is stored in pure US-ASCII text form, without any kind of compression or binary storage. The encoding used is rather simple: it's basically name, type and value of any property, one per line (the details can be found in the relevant source files). The "not too much space" requirement is solved by using a set of rotating sequential files, where each file has a configured maximum size (10MB by default) [to be technically precise, the maximum is not strictly enforced: a complete message is always persisted to the file, even if that goes slightly above the maximum].

There is a kind of head and tail pointer into that rotating file set. The head is used by new messages coming in; they are always appended to the most recent file. The tail pointer is where messages are dequeued from; it always points to the oldest file inside the file set. Originally, there were just these two pointers, and a file was deleted as soon as the tail pointer moved past the end of the current file. However, this was modified in v5 to cover the ultra-reliable use case. The problem with two pointers was that when messages were dequeued at the end of the tail file, that file was immediately deleted. If something then went wrong, the potentially unprocessed messages were lost. To prevent this, v5+ actually has three pointers: the head pointer (as before), a read pointer (the former tail) and a delete pointer (kind of the "real tail"). The read pointer is now used to dequeue messages. The delete pointer is used to re-read messages from the file after they have been processed and to delete queue files if they drop out of the current file set while doing so (this is exactly the processing that the former tail pointer did). The bottom line is that each message is written once but read twice during queue processing. The queue also maintains a state (or "queue info", .qi) file, which among other things contains these pointers.

The user can configure how often data shall be synced to disk. The more often this happens, the more robust the queue is, but the slower it gets. To flush, both the current write and the state file need to be flushed. For the state file, this currently means it is opened, written and closed again. At the extreme, this happens for each record being written (but keep in mind that the queue file itself is always kept open). It is also worth mentioning that all records are of variable size, so no random access is possible to messages that reside inside the queue file.
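
To make the pointer handling easier to discuss, here is a minimal Python sketch of a rotating file set with head, read and delete pointers. It is a toy model only, not rsyslog's code: the class name DiskQueueSketch, the file names (queue.NNNNNNNN, queue.qi), the one-line-per-record format and the layout of the state file are all assumptions made for illustration. It rewrites the state file on every operation, which corresponds to the "extreme" sync setting described above, and unlike rsyslog it reopens the queue file for each access.

```python
import os
import tempfile


class DiskQueueSketch:
    """Toy model of a rotating file set with head, read and delete pointers."""

    def __init__(self, directory, max_file_size=64):
        self.dir = directory
        self.max_file_size = max_file_size
        self.head_file = 0      # file number that new records are appended to
        self.read_file = 0      # file holding the next record to dequeue
        self.read_offset = 0    # byte offset of the next record to dequeue
        self.delete_file = 0    # oldest file still on disk
        self.delete_offset = 0  # byte offset of the oldest unconfirmed record

    def _path(self, number):
        return os.path.join(self.dir, f"queue.{number:08d}")

    def _save_state(self):
        # Open, write and close the state file each time (hypothetical layout).
        with open(os.path.join(self.dir, "queue.qi"), "w", encoding="ascii") as f:
            f.write(f"{self.head_file} {self.read_file} {self.read_offset} "
                    f"{self.delete_file} {self.delete_offset}\n")

    def enqueue(self, record):
        path = self._path(self.head_file)
        # Rotate only between records: a record is never split, so a file
        # may end up slightly larger than max_file_size.
        if os.path.exists(path) and os.path.getsize(path) >= self.max_file_size:
            self.head_file += 1
            path = self._path(self.head_file)
        with open(path, "ab") as f:
            f.write(record.encode("ascii") + b"\n")
        self._save_state()

    def _next_record(self, number, offset):
        """Return (record, file number, offset after it), or None if caught up."""
        while True:
            path = self._path(number)
            if not os.path.exists(path):
                return None  # nothing has been written yet
            with open(path, "rb") as f:
                f.seek(offset)
                line = f.readline()
            if line:
                return line.decode("ascii").rstrip("\n"), number, offset + len(line)
            if number >= self.head_file:
                return None  # caught up with the writer
            number, offset = number + 1, 0  # end of file: continue in the next one

    def dequeue(self):
        """Hand out the next record via the read pointer; nothing is deleted."""
        found = self._next_record(self.read_file, self.read_offset)
        if found is None:
            return None
        record, self.read_file, self.read_offset = found
        self._save_state()
        return record

    def confirm_processed(self):
        """Move the delete pointer over one processed record (second read)."""
        if (self.delete_file, self.delete_offset) >= (self.read_file, self.read_offset):
            return  # no dequeued record is waiting for confirmation
        _, number, offset = self._next_record(self.delete_file, self.delete_offset)
        while self.delete_file < number:
            os.remove(self._path(self.delete_file))  # file left the current set
            self.delete_file += 1
        self.delete_offset = offset
        self._save_state()
```

In this model, a record that has been dequeued but not yet confirmed still sits in a file on disk, so a crash between the two steps does not lose it. The price is the second pass over the data in confirm_processed(), which is the "written once but read twice" behavior.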

From the threading point of view, queue operations are naturally serial (as sequential files are involved). As such, a disk queue uses exactly one worker, no matter how many worker threads the user has configured. In the case of a disk-assisted (DA) queue, there are as many workers as configured for the in-memory part of the queue and one worker for the disk part of the queue. Note that all of these workers run in parallel. That also means that the message sequence will change if disk and in-memory mode run at the same time. But this comes as no surprise: for obvious reasons, this is always the case when multiple workers process queue messages in parallel (otherwise, the parallel workers would need to serialize themselves, effectively running a single worker thread...).

To keep the code simple, later versions of rsyslog (late v5+) keep the disk queue open once it has been started. First of all, this removes some complex synchronization issues inside the rsyslog core, which would otherwise not only complicate the code but also slow things down, even when no disk queue is active. Secondly, there is a good chance that disk mode is needed again if it was initiated once, so it is probably even very smart to make this as painless as possible. Note that during rsyslog shutdown, the disk queue files are of course fully deleted.
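
The reordering in DA mode can be illustrated with a small deterministic Python sketch. It replaces real threads with a strict turn-taking loop between one in-memory worker and one disk worker, and it assumes the first three messages sit in memory while the last three were spilled to disk. The function name drain_in_parallel and the message names are made up for the example; real scheduling is of course not this regular.

```python
from itertools import zip_longest


def drain_in_parallel(memory_part, disk_part):
    """Simulate one in-memory worker and one disk worker taking turns."""
    processed = []
    for mem_msg, disk_msg in zip_longest(memory_part, disk_part):
        if mem_msg is not None:
            processed.append(mem_msg)   # in-memory worker handles its next message
        if disk_msg is not None:
            processed.append(disk_msg)  # disk worker handles its next message
    return processed


arrival_order = ["m1", "m2", "m3", "m4", "m5", "m6"]
memory_part, disk_part = arrival_order[:3], arrival_order[3:]
print(drain_in_parallel(memory_part, disk_part))
# prints ['m1', 'm4', 'm2', 'm5', 'm3', 'm6']
```

Each part is drained in its own order, but the combined output no longer matches the arrival order.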
