Tuesday, August 07, 2007

next generation rsyslog design, worker pools, parallelism and future hardware

Based on my last post on rsyslog multithreading, an email conversation arose. When I noticed that I had elaborated on a lot of things possibly of general interest, I decided to turn this into a blog post. Here it is.

The question that started all this was whether a worker thread pool is good or evil for a syslogd.

Yes, the worker-thread pool has a lot of pros and cons with syslog. I know this quite well, because I am also the main design lead behind WinSyslog, which was the first syslogd available natively on Windows (a commercial app). It is heavily multi-threaded. I opposed a worker pool for a long time but have accepted it lately. In some high-volume scenarios (unusual cases) it is quite valuable. But of course you lose the order of messages. For WinSyslog, we made the compromise of making the worker pool configurable, so it can be set to 1 worker if the order of events is very important. I designed WinSyslog in 1995 and released the first version in 1996 - so I know quite well what I am talking about (but to be honest, the initial multi-threading engine got in somewhat later ;)).

HOWEVER, especially in high-volume environments and with network senders, you are playing Russian roulette if you strongly believe that the order in which events come in is exactly the order in which they were sent. Just think of routers, switches, congestion, etc. For small volumes, that assumption is fair enough. But once the volume goes up, you do not get it 100% right. This is one of the reasons I am working quite hard in the IETF to get a better timestamp into syslog (rsyslog already has it, as an option of course). The right way to do message sequencing is to look at a high-precision timestamp, not at the time of reception.
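To make the sequencing point concrete, here is a minimal sketch in Python. It is not rsyslog code. The sample messages are made up, and the timestamp format (ISO 8601 with microseconds and a UTC offset) is an assumption. It also assumes the senders' clocks are reasonably synchronized, otherwise sender timestamps are no better than reception order.

```python
from datetime import datetime

# Hypothetical messages in the order they were received: (sender timestamp, text).
# The timestamp format (ISO 8601 with microseconds and a UTC offset) is an assumption.
received = [
    ("2007-08-07T10:15:30.000250+00:00", "router: link up"),
    ("2007-08-07T12:15:30.000100+02:00", "switch: port flap"),
    ("2007-08-07T10:15:29.999900+00:00", "host: login failed"),
]


def by_send_time(messages):
    """Order messages by the sender's timestamp instead of by arrival."""
    # Parsing to timezone-aware datetimes makes different UTC offsets comparable.
    return sorted(messages, key=lambda m: datetime.fromisoformat(m[0]))


for stamp, text in by_send_time(received):
    print(stamp, text)
```

Note that the second message carries a different UTC offset, so sorting the raw strings would put it in the wrong place. Parsing the timestamp first avoids that.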

For rsyslog, I am not yet sure if it will ever receive a worker pool. From today's view, it does not look necessary. But if I think about future developments in hardware, the only key to getting more performance is using the enhanced parallelism the hardware will provide. The time of fast single cores is over. We will have relatively slow (as fast as today ;-]) massively parallel hardware. This is a challenge for all of software engineering. Many folks have not yet realized it. I think it has more problem potential than the last "big software crisis". As you can see in the sequencing discussion, parallelism doesn't only mean mutexes and threads - it means you need to rethink how you do things (e.g. use timestamps for correlation instead of reception/processing order, because the latter is not a stable concept in a massively parallel program).

With the design I am now doing, I will definitely have separate threads for each output action (like a file or database writer). I need this because rsyslog will provide a way to queue messages on disk when a destination is not available. You could also call this "store-and-forward syslog", just like SMTP mail is queued while in transit. You cannot implement this concept with a single thread (at least not with reasonable complexity). I also do not think that multiple processes would be the right solution. First off, they are too slow. Secondly, multiple processes for this use are much more complicated than threads (I know, I've written similar multi-process based things back in the 80s and 90s).
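The store-and-forward idea can be sketched roughly as follows. This is an illustration in Python, not the planned rsyslog implementation: the class names, the one-message-per-line spool file and the use of OSError to signal "destination unavailable" are all assumptions made for the example.

```python
import os
import queue
import tempfile
import threading


class StoreAndForwardOutput:
    """One output action with its own thread and a disk spool (all names hypothetical)."""

    def __init__(self, send, spool_path):
        self.send = send              # callable; raises OSError while the destination is down
        self.spool_path = spool_path  # assumed format: one message per line
        self.inbox = queue.Queue()    # decouples the producer from this output
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            msg = self.inbox.get()
            try:
                self._deliver(msg)
            finally:
                self.inbox.task_done()

    def _deliver(self, msg):
        # Spooled messages are older, so they go out before the new one.
        pending = self._read_spool() + [msg]
        while pending:
            try:
                self.send(pending[0])
            except OSError:
                break                 # still down: keep the remainder on disk
            pending.pop(0)
        self._write_spool(pending)

    def _read_spool(self):
        if not os.path.exists(self.spool_path):
            return []
        with open(self.spool_path, encoding="utf-8") as f:
            return f.read().splitlines()

    def _write_spool(self, pending):
        # Rewriting the whole file is simple but slow; fine for a sketch only.
        if pending:
            with open(self.spool_path, "w", encoding="utf-8") as f:
                f.write("".join(line + "\n" for line in pending))
        elif os.path.exists(self.spool_path):
            os.remove(self.spool_path)


class FakeDestination:
    """Stand-in for a file or database writer that can be switched off."""

    def __init__(self):
        self.available = False
        self.received = []

    def __call__(self, msg):
        if not self.available:
            raise OSError("destination unavailable")
        self.received.append(msg)


dest = FakeDestination()
spool = os.path.join(tempfile.mkdtemp(), "action.spool")
out = StoreAndForwardOutput(dest, spool)

out.inbox.put("first")        # destination is down: both messages are spooled
out.inbox.put("second")
out.inbox.join()
spooled = out._read_spool()

dest.available = True         # destination comes back
out.inbox.put("third")        # triggers delivery of the spool, then the new message
out.inbox.join()
print(dest.received)
```

The output has its own thread and its own queue, so a slow or unavailable destination blocks only this action, not the code that submits messages.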

Rsyslog will also most probably have more than one thread for the input. Input will become modular, too. If I use one thread per input, each input can use whatever threading model it likes. That makes writing input plugins easy (one of my goals). Also, in high-volume environments, having the ability to run inputs on multiple CPUs *is* *very* beneficial. One of the first plugins will be RFC 3195 syslog, which is currently run as an external process (communicating via a local Unix domain socket, which is quite a hack). At the core of 3195 is liblogging, a slim and fast select-server I have written. However, integrating it into a single-threaded application is still a challenge (you need to merge the select() calls and provide an API for that). With multiple threads, you can run that select server on its own thread, which is quite reasonable (after all, why should both UDP and 3195 reception run on the same thread?). The same is true for TCP-based processing. Then think about the native SSL support that is coming. Without hardware crypto acceleration, doesn't it make much sense to run the receiver on its own thread? Doesn't it even make sense to run the sender on its own thread?

So the threading model of the next major release of rsyslog (called 3.x for reasons not yet elaborated on) will most probably be:


  1. multiple input threads, at least one for each input module (a module may decide to use more threads if it likes)
  2. all those input threads serialize messages to a single incoming queue
  3. the queue will most probably be processed by a single worker thread working on filter conditions and passing messages to the relevant outputs. There may be a worker pool, but if so, its size can be configured (and set to 1, if needed)
  4. multiple output threads, at least one for each output. Again, it is the output's decision if it runs on more than one thread
  5. there may be a number of housekeeping threads, e.g. for DNS cache maintenance
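
The steps above can be sketched as a small pipeline. This is a rough Python illustration under my own assumptions, not rsyslog code: the function names, the STOP sentinel and the use of plain lists as output sinks are all made up, and the housekeeping threads from step 5 are left out.

```python
import queue
import threading

STOP = object()  # sentinel that tells a consumer thread to exit


def input_thread(name, lines, main_queue):
    """Steps 1 and 2: each input runs on its own thread and feeds the one main queue."""
    for line in lines:
        main_queue.put((name, line))


def worker(main_queue, rules):
    """Step 3: a single worker evaluates the filters and routes to the output queues."""
    while True:
        item = main_queue.get()
        if item is STOP:
            for _, out_queue in rules:
                out_queue.put(STOP)
            return
        for matches, out_queue in rules:
            if matches(item):
                out_queue.put(item)


def output_thread(out_queue, sink):
    """Step 4: each output drains its own queue on its own thread."""
    while True:
        item = out_queue.get()
        if item is STOP:
            return
        sink.append(item)


def run_pipeline(inputs, filters):
    """Run all threads to completion and return one list of messages per filter."""
    main_queue = queue.Queue()
    sinks = [[] for _ in filters]
    out_queues = [queue.Queue() for _ in filters]
    rules = list(zip(filters, out_queues))

    outputs = [threading.Thread(target=output_thread, args=(q, s))
               for q, s in zip(out_queues, sinks)]
    work = threading.Thread(target=worker, args=(main_queue, rules))
    ins = [threading.Thread(target=input_thread, args=(name, lines, main_queue))
           for name, lines in inputs.items()]

    for t in outputs + [work] + ins:
        t.start()
    for t in ins:
        t.join()
    main_queue.put(STOP)      # all inputs are done, so the worker may finish
    work.join()
    for t in outputs:
        t.join()
    return sinks


everything, errors = run_pipeline(
    {"udp": ["disk error", "login ok"], "tcp": ["link error"]},
    [lambda m: True, lambda m: "error" in m[1]],
)
print(sorted(errors))
```

With a single worker, messages from one input keep their relative order, but the interleaving between inputs depends on thread scheduling, which is why the example sorts before printing.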


This design will provide superb performance, is oriented toward logical needs, allows for easy integration of plugins AND will be reasonably easy to manage - at least I hope so.

But, wait, why have I written all this? OK, one reason is that I have wanted to document the upcoming threading model for quite a while, and now was a good time for doing so. But I think it also shows that you cannot say "multithreading is good/bad" without thinking about the rest of the system design. Multi-threading is inherently complex. Humans think sequentially, at least when we think about something consciously. So parallel programming doesn't match human nature. On the other hand, nature itself does things in a massively parallel way. Just think about our human vision: in tech terms, it is massively parallel signal processing. This is where technology is heading, too. Just as we learned to program in the first place, we will also learn to deal with parallel programming. A key thing is that we should never think about parallelism as a feature - it's actually a tool that can be used right or wrong. So let's hope the rsyslog design will do it right. Of course, comments are very much appreciated!

4 comments:

lebarbu said...

Rainer,

in my opinion multiple worker threads only add 1 thing: scheduling decisions. In other words your multi-threaded architecture with 1 thread per input module and 1 thread per output module is just fine; I suppose you have carefully crafted what each module does to avoid race conditions and lock conflicts on the central queue.

The only "worker" I can think of consists of separating the computing necessary for deciding which message needs to be processed by what output module from the receiver -input- modules. The obvious alternative is to place that code in the output modules.

Whether this is beneficial to CPU usage or not depends on whether you can roll up decision criteria into a single decision tree with multiple match points. I have seen this done, but with simple integer matches rather than regular expressions.

Dirk

Rainer said...

Dirk,

as I said, I am not sure we will have a worker pool - but there is an option to do that if need arises.

With the current *idea* (it's not finished, this is why discussion is so important), we have very little possibility for race conditions. Primarily, the entry into the main queue needs to be mutex-guarded, as do output-module *instances* which are used by multiple filters. Other than that, there is nothing where things can really mess up.

Part of the beauty is that input modules create the message; thereafter it is read-only. A single message cannot be created by more than one thread. So there can't be a conflict there. Outputs don't receive messages but requested strings. No chance for a problem. Once we are inside the output instance, there is nothing that can go wrong (assuming the output module is well-written and does not use its own multi-threading). Output instances need to be guarded by mutexes. The same is true for the per-action-instance queue (because we need decoupling of producer and consumer). So all in all, the design requires very few sync objects, which not only means few deadlock possibilities, but also means few waits and thus good performance.
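
A minimal sketch of these two ideas, in Python and with made-up names (Message, render and OutputInstance are hypothetical, not part of any rsyslog API): the message is frozen once it has been created, and the only lock guards the output instance that several threads write to.

```python
import threading
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class Message:
    """Built once by an input thread; frozen, so later threads can only read it."""
    timestamp: str
    host: str
    text: str


def render(msg):
    """Produce the string an output asked for (the format here is made up)."""
    return f"{msg.timestamp} {msg.host} {msg.text}"


class OutputInstance:
    """An output instance that several filters may feed concurrently."""

    def __init__(self):
        self._lock = threading.Lock()
        self.lines = []
        self.count = 0

    def write(self, rendered):
        # The mutex keeps the append and the counter update together.
        with self._lock:
            self.lines.append(rendered)
            self.count += 1


def feed(output, msg, times):
    """Simulate one filter passing the same rendered message many times."""
    for _ in range(times):
        output.write(render(msg))


msg = Message("2007-08-07T10:15:30.000100+00:00", "host1", "link up")
out = OutputInstance()
threads = [threading.Thread(target=feed, args=(out, msg, 1000)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(out.count)
```

The message object itself needs no lock at all, because nothing can modify it after creation. Trying to assign to one of its fields raises FrozenInstanceError.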

And you got the point. The main queue will do the decision process. This is the only reason why a worker pool may make sense. Decision making does not belong in the outputs, because it should be output-agnostic. Also, I would like to see output (and input) modules that are as slim as possible, to make creating plug-ins a very, very easy task.

I will develop and publish an object model in the near future (no promise when exactly). Then it will be much easier to discuss the threaded entities and why they are threaded.

Thanks again for your comments!

Rainer

JFermenich said...

Since Sun is arguably leading the field in multicore/multithread hardware right now, are there any plans to make Rsyslog more compatible with Solaris? We have been trying to compile & implement Rsyslog on a Sol10 SunFire T2000 to no avail. We have had to switch to implementing Rsyslog on a dual core Opteron with Linux instead of our original Sun plan. Considering the high utilization this box will have (it will be collecting logs from 30+ servers), the T2000 would have been ideal, especially with the added multithreading capabilities you have planned.

Rainer said...

Yes, Solaris is on the target list. Actually, it should already compile and run (but you proved that's probably wrong), BUT it doesn't handle local logging right.

The current plan is to postpone this into the fall/winter time frame. The reason is that we need to implement a different input module for Solaris, but we do currently NOT have input modules in the design ;). So we'd like to implement input modules first, and then create a plugin for Solaris local input. As a pure network logger, however, we should be able to get it to work relatively quickly (I remember reports of folks who did that successfully).