Thursday, April 30, 2009

A batch output handling algorithm

With this post, I'd like to reproduce a posting from David Lang on the rsyslog mailing list. I consider this to be important information and would like to have it available for easy reference.

Here we go:

the company that I work for has decided to sponser multi-message queue
output capability, they have chosen to remain anonomous (I am posting from
my personal account)

there are two parts to this.

1. the interaction between the output module and the queue

2. the configuration of the output module for it's interaction with the

On for the first part (how the output module interacts with the queue), the
criteria are that

1. it needs to be able to maintain guarenteed delivery (even in the face
of crashes, assuming rsyslog is configured appropriately)

2. at low-volume times it must not wait for 'enough' messages to
accumulate, messages should be processed with as little latency as

to meet these criteria, what is being proposed is the following

a configuration option to define the max number of messages to be
processed at once.

the output module goes through the following loop


if (messages in queue)
mark that it is going to process the next X messages
grab the messages
format them for output
attempt to deliver the messages
if (message delived sucessfully)
mark messages in the queue as delivered
X=max_messages (reset X in case it was reduced due to delivery errors)
else (delivering this batch failed, reset and try to deliver the first half)
unmark the messages that it tried to deliver (putting them back into the status where no delivery has been attempted)
X=int(# messages attempted / 2)
if (X=0)
unable to deliver a single message, do existing message error

this approach is more complex than a simple 'wait for X messages, then
insert them all', but it has some significant advantages

1. no waiting for 'enough' things to happen before something gets written

2. if you have one bad message, it will transmit all the good messages
before the bad one, then error out only on the bad one before picking up
with the ones after the bad one.

3. nothing is marked as delivered before delivery is confirmed.

an example of how this would work


messages arrive 1/sec

it takes 2+(# messages/2) seconds to process each message (in reality the
time to insert things into a database is more like 10 + (# messages / 100)
or even more drastic)

with the traditional rsyslog output, this would require multiple output
threads to keep up (processing a single message takes 1.5 seconds with
messages arriving 1/sec)

with the new approach and a cold start you would see

message arrives (Q=1) at T=0
om starts processing message a T=0 (expected to take 2.5)
message arrives (Q=2) at T=1
message arrives (Q=3) at T=2
om finishes processing message (Q=2) at T=2.5
om starts processing 2 messages at T=2.5 (expected to take 3)
message arrives (Q=4) at T=3
message arrives (Q=5) at T=4
message arrives (Q=6) at T=5
om finishes processing 2 messages (Q=4) at T=5.5
om starts processing 4 messages at T=5.5 (expected to take 4)
message arrives (Q=5) at T=6
message arrives (Q=6) at T=7
message arrives (Q=7) at T=8
message arrives (Q=8) at T=9
om finishes processing 4 messages (Q=4) at T=9.5
om starts processing 4 messages at T=9.5 (expected to take 4)

the system is now in a steady state

message arrives (Q=5) at T=10
message arrives (Q=6) at T=11
message arrives (Q=7) at T=12
message arrives (Q=8) at T=13
om finishes processing 4 messages (Q=4) at T=13.5
om starts processing 4 messages at T=13.5 (expected to take 4)

if a burst of 10 extra messages arrived at time 13.5 this last item would

11 messages arrive at (Q=14) at T=13.5
om starts processing 14 messages at T=13.5 (expected to take 9)
message arrives (Q=15) at T=14
message arrives (Q=16) at T=15
message arrives (Q=17) at T=16
message arrives (Q=18) at T=17
message arrives (Q=19) at T=18
message arrives (Q=20) at T=19
message arrives (Q=21) at T=20
message arrives (Q=22) at T=21
message arrives (Q=23) at T=22
om finishes processing 14 messages (Q=9) at T=22.5
om starts processing 9 messages at T=22.5 (expected to take 6.5)

Monday, April 27, 2009

Levels of reliabilty

We had a good discussion about reliability in rsyslog this morning. On the mailing list, it started with a question about the dynafile cache, but quickly morphed into something else. As the mailing list thread is rather long, I'll try to do a quick excerpt of those things that I consider vital.

First a note on RELP, which is a reliable transport protocol. This was the relevant thought from the discussion:

I've got relp set up for transfer - but apparently I discovered
that relp doesnt take care of a "disk full" situation on the receiver
end? I would have expected my old entries to come in once I had cleared the disk space, but no... I'm not complaining btw - just remarking that this was an unexpected behaviour for me.

That has nothing to do with RELP. The issue here is that the file output writer (in v3) uses the sysklogd concept of "if I can't write it, I'll throw it away". This is another issue that was "fixed" in v4 (not really a fix, but a conceptual change).

If RELP gets an ack from the receiver, the message is delivered from the RELP POV. The receiving end acks, so everything is done for RELP. Some thing if you queue at the receiver and for some reason lose the queue.

RELP is reliable transport, but not more than that. However, if you need reliable end-to-end, you can do that by running the receiver totally synchronous, that is all queues (including the main message queue!) in direct mode. You'll have awful performance and will lose messages if you use anything other than RELP for message reception (well, plain tcp works mostly correct, too), but you'll have synchronous end-to-end. Usually, reliable queuing is sufficient, but then the sender does NOT know when the message was actually processed (just that the receiver enqueued it, think about the difference!).

This explanation triggered further questions about the difference in end-to-end reliability between direct queue mode versus disk based queues:

The core idea is that a disk-based queue should provide sufficient reliability for most use cases. One may even question if there is a reliability difference at all. However, there is a subtle difference:

If you don't use direct mode, than processing is no longer synchronous. Think about the street analogy:

For synchronous, you need the u-turn like structure.

If you use a disk-based queue, I'd say it is sufficiently reliable, but it is no longer an end-to-end acknowledgement. If I had this scenario, I'd go for the disk queue, but it is not the same level of reliability.

Wild sample: sender and receiver at two different geographical locations. Receiver writes to database, database is down.

Direct queue case: sender blocks because it does not receive ultimate ack (until database is back online and records are committed).

Disk queue case: sender spools to receiver disk, then considers records committed. Receiver ensures that records are actually committed when database is back up again. You use ultra-reliable hardware for the disk queues.

Level of reliability is the same under almost all circumstances (and I'd expect "good enough" for almost all cases). But now consider we have a disaster at the receiver's side (let's say a flood) that causes physical loss of receiver.

Now, in the disk queue case, messages are lost without the sender knowing. In direct queue case we have no message loss.

And then David Lang provided a perfect explanation (to which I fully agree) why in practice a disk-based queue can be considered mostly as reliable as direct mode:

> Level of reliability is the same under almost all circumstances (and I'd
> expect "good enough" for almost all cases). But now consider we have a
> disaster at the receiver's side (let's say a flood) that causes physical loss
> of reciver.

no worse than a disaster on the sender side that causes physical loss of the sender.

you are just picking which end to have the vunerability on, not picking if you will have the vunerability or not (although it's probably cheaper to put reliable hardware on the upstream reciever than it is to do so on all senders)

> Now, in the disk queue case, messages are lost without sender knowing. In
> direct queue case we have no message loss.

true, but you then also need to have the sender wait until all hops have been completed. that can add a _lot_ of delay without nessasarily adding noticably to the reliability. the difference between getting the message stored in a disk-based queue (assuming it's on redundant disks with fsync) one hop away vs the message going a couple more hops and then being stored in it's final destination (again assuming it's on redundant disks with fsync) is really not much in terms of reliability, but it can be a huge difference in terms of latency (and unless you have configured many worker threads to allow you to have the messages in flight at the same time, throughput also drops)

besides which, this would also assume that the ultimate destination is somehow less likely to be affected by the disaster on the recieving side than the rsyslog box. this can be the case, but usually isn't.

That leaves me with nothing more to say ;)

Wednesday, April 08, 2009

what is "nextmaster" good for?

People that looked at rsyslog's git may have wondered what the branch "nextmaster" is good for. This actually is an indication that the next rsyslog stable/beta/devel rollover will happen soon. With it, the current beta becomes the next v3-stable. At the same time, the current (v4) devel becomes the next beta (which means there won't be any beta any longer in v3). In order to facilitate this, I have branched of "nextmaster", which I will currently work on. The "master" branch will no longer be touched and soon become beta. Then, I will merge "nextmaster" back into the "master" branch and continue to work with it.

The bottom line is that you currently need to pull nextmaster if you would like to keep current on the edge of development. Sorry for any inconvenience this causes, but this is the best approach I see to go through the migration (and I've done the same in the past with good success, just that then nobody noticed it ;)).

Wednesday, April 01, 2009

rsyslog going to outer space

Rsyslog was designed to be a flexible and ultra-reliable platform for demanding applications. Among others, it is designed to work very well in occasionally connected systems.

There are some systems that are inherently occasionally connected - space ships. And while we are still a bit away from the Star Trek way of doing things, current space technology needs a "captain's star log". Even for spacecraft, it is important when and why systems were powered up, over- or under-utilized or malfunction (for example, due to "attack" not of a Klingon, but a cosmic ray). And all of this information needs to be communicated back to earth, where it can be reviewed and analyzed. For all of this, systems capable of reliable transmission in a disconnected environment are needed.

Inspired by NASA's needs, the Internet Resarch Task Force (the research branch of the IETF) is working on a protocol named DTN, usually called the interplanetary Internet.

As we probably all know, Microsoft Windows flies on the Space Shuttle. And, more importantly, Linux also did. With the growing robustness of Open Source, future space missions will most probably contain more Linux components.

This overall trend will also be present in NASA's and ESA's future Jupiter mission. There is a lot of information technology on the upcoming spacecraft, and so there is a lot of things worth logging. While specialized software is usually required for spacecraft operations, it is considered the rsyslog as the leading provider of reliable occasionally connected logging infrastructures may extend its range into the solar system. It only sounds logical to use all the technology we already have in place for reliable logging even under strange conditions (see "reliable forwarding"). Of importance is also rsyslog's speed and robustness.

As a consequence, we have today begun to implement the DTN protocol for the interplanetary Internet. That will be "omdtn" and is available as part of the rsyslog spaceship git branch. This branch is available as of now from the public git repository.

We could also envision that mission controllers will utilize phpLogCon to help analyze space craft malfunction. A very interesting feature is also rsyslog's modular architecture, which could be used to radiate a new communication plugin up to the space ship, in case this is required to support some alien format. This also enables the rsyslog team to provide an upgrade to the Interstellar Internet, should this finally be standardized in the IETF. If so, and provided the probe has enough consumables, it may be in the best spot to work as a stellar relay between us and whoever else.