rsyslog performance improvement rather impressive

I (think I ;)) have finished this round of rsyslog performance tuning. The result is rather impressive:

On my virtualized 4-core development environment (not exactly a high-end environment), I went from approx. 70,000 messages per second (mps) to approx. 280,000 mps. Note that these numbers do not necessarily represent a practical configuration, but I think they show the magnitude of the optimization. Also note that some complex configurations see a far lower gain, because some things (like executing an action only n times within m seconds, or “message repeated n times” processing) require serialization of the message flow, and there is little we can gain in this case.

I plan to do an “official release” in the not so distant future. First, I will see which patches I have in my queue, and then I’ll focus on the config language enhancement. That’s much more complex than just the format; I’ll blog the details hopefully either later today or tomorrow morning.

Getting Top Performance out of rsyslog

Rsyslog is lightning fast. However, the configuration has a large influence on speed. This blog post describes what offers optimal performance with the most recent v5 version.

I will update this blog post whenever there is news to share (at least this is the plan). This information will also hopefully flow into the rsyslog doc at some time.

  • do not use more than one ruleset within a single queue
  • do not use rate limiting unless absolutely necessary
  • use array-based queue modes
  • do not use
  • send data from different inputs to multiple queues
  • use “$ActionWriteAllMarkMessages on” for all actions where you can afford it (it really makes a difference!); a configuration sketch follows this list
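
To make these recommendations more concrete, here is a minimal configuration sketch in the v5 legacy directive format. The listener port, ruleset name and file paths are just placeholders, and which of these directives you actually need (and whether they are available) depends on your version and setup; treat this as an illustration of the points above, not as a drop-in configuration.

# array-based mode for the main message queue
$MainMsgQueueType FixedArray

# give the tcp input its own ruleset with a dedicated queue
$ModLoad imtcp
$RuleSet tcpinput
$RulesetCreateMainQueue on
*.* /var/log/tcp-messages

# bind the listener to that ruleset and start it
$InputTCPServerBindRuleset tcpinput
$InputTCPServerRun 10514

# back to the default ruleset for everything else
$RuleSet RSYSLOG_DefaultRuleset
$ActionWriteAllMarkMessages on
*.* /var/log/messages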

The following blog post also has some solid information on performance-influencing parameters: rsyslog evaluation. Note that it covers a somewhat older rsyslog release. While that release was already quoted at 250,000 messages per second, rsyslog 5.5.6 is quite a bit faster.

what are actions and action instance data?

On the rsyslog mailing list, the question of what actions are and in which way they are kept single-threaded from the point of view of the output module came up again. I will try to summarize the most important points and terms here.

David Lang gave the following example configuration:

*.* file1
*.* file2
*.* @ip1
*.* @ip2
*.* @@ip3
*.* @@ip4

and asked how many different actions/entities these were. Here is my answer:

An *action* is a specific instance of some desired output. The actual processing carried out is NOT termed “action”, even though one could easily do so. I have to admit I have not defined any term for that, so let’s simply call it “processing”. That actual processing is carried out by the output module (and the really bad thing is that the entry point is named “doAction”, which somewhat implies that the output module is the action, which is not the case).

Each action can use the service of exactly one output module. Each output module can provide services to many actions. So we have a N:1 relationship between actions and output modules.

In the above sample, three output modules are involved (file writing, UDP forwarding and plain TCP forwarding), and each output module is used by two actions. We have 6 actions, and so we have 6 action locks.

So the output module interface does not serialize access to the output module, but rather to the action instance. All action-specific data is kept in a separate, per-action data structure and passed into the output module at the time the doAction call is made. The output module can modify all of this instance data as if it were running on a single thread. HOWEVER, any global data items (in short: everything not inside the action instance data) are *not* synchronized by the rsyslog core. The output module must itself take care of synchronization if it needs concurrent access to such data items. All current output modules do NOT access global data other than for config parsing (which is serial and single-threaded by nature).

Note that the consistency of the action instance data is guarded by the rsyslog core by actually running the output module processing on a single thread *for that action*. But the output module code itself may be called concurrently if more than one action uses the same output module. That is a typical case. If so, each of the concurrently running instances receives its private instance data pointer but shares everything else.
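
To illustrate this model in code, here is a simplified, hypothetical sketch of the relationship between actions, output modules and action locks. It is not the actual rsyslog plugin interface (the real entry points use a different calling convention, and the names action_t and processAction are invented for illustration); it only shows where the per-action lock sits and what it does and does not protect.

#include <pthread.h>

/* hypothetical, simplified model of the relationship described above;
 * NOT the real rsyslog plugin interface */
struct outputModule;

typedef struct action {
    struct outputModule *mod;      /* N:1 -- many actions may share one output module */
    void *instanceData;            /* private per-action data (file name, target IP, ...) */
    pthread_mutex_t lock;          /* one lock per action, NOT per output module */
} action_t;

struct outputModule {
    /* the module's processing entry point; it may run concurrently for
     * different actions, but never concurrently for the same action */
    int (*doAction)(void *instanceData, const char *msg);
};

/* how the core conceptually invokes processing of one message for one action */
static int processAction(action_t *a, const char *msg)
{
    pthread_mutex_lock(&a->lock);                    /* serialize per action ...            */
    int r = a->mod->doAction(a->instanceData, msg);  /* ... so instanceData needs no locking
                                                        inside the module                   */
    pthread_mutex_unlock(&a->lock);
    /* anything global to the module is NOT protected here; the module must
     * synchronize such access itself */
    return r;
}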

further improving tcp input performance

As one of the next things, I will be further improving rsyslog’s tcp syslog input performance. As you know, rsyslog already has excellent performance (some sources, for example, quote 250,000 msgs per second). But, of course, there is room for improvement.

One such area is imtcp, the tcp syslog input module. It uses a single polling loop to obtain data from all senders. It is worth noting that the actual input module does NOT do very much, but hands the majority of work off to queue worker threads. However, it pulls the data from operating system buffers into our user space and also fills in some basic properties (like time of reception, remote peer and so on). Then, the message is pushed to the message queue, and at the other side of the queue the majority of processing happens (including things like parsing the message, which some would assume happens inside the receiving thread).
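
As a rough sketch of this single-loop design (hypothetical code, not the actual imtcp implementation; receiveLoop and submitToQueue are invented stand-ins, and TCP frame extraction is omitted):

#include <sys/epoll.h>
#include <time.h>
#include <unistd.h>

/* stand-in for handing a received chunk to the main message queue;
 * in the real code, message construction and the queue are provided by the core */
static void submitToQueue(const char *data, ssize_t len, time_t rcvTime, int peerFd)
{
    (void)data; (void)len; (void)rcvTime; (void)peerFd;
}

/* one thread, one epoll loop, many sender connections */
static void receiveLoop(int epfd)
{
    struct epoll_event events[64];
    char buf[64 * 1024];

    for (;;) {
        int n = epoll_wait(epfd, events, 64, -1);
        for (int i = 0; i < n; ++i) {
            int fd = events[i].data.fd;
            ssize_t len = read(fd, buf, sizeof(buf)); /* pull data off the wire */
            if (len <= 0)
                continue;                             /* EOF/error handling omitted */
            /* only basic properties are filled in here (reception time, peer);
             * extracting individual syslog frames and parsing them happens
             * later, on the queue worker threads */
            submitToQueue(buf, len, time(NULL), fd);
        }
    }
}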

As can be seen in practice, this design scales pretty well in most cases. However, on a highly parallel system, it obviously limits the process of pulling data “off the wire” to a single CPU. If the rule set is then not very complex (and thus fast to process), the single-threadedness of the initial receiver becomes a bottleneck. On a couple of high performance systems, we have seen this to be the bottleneck, and I am now trying to address it.

Right now, I am looking for a good solution. There are two obvious ones:

a) start up a single thread for each connection
b) do a hybrid approach of what we currently do and a)

Even with 64bit machines and NPTL, approach a) probably does not work well for a very large number of active sessions. Even worse, receiving messages from two different hosts would then require at least one context switch, and do so repeatedly. Context switches are quite expensive in terms of performance, and so are better avoided. Note that the current approach needs no context switch at all (for the part it does). On a system with many connections, I would be close to betting that the runtime required by the context switching of approach a) alone is probably more than what we need to do the processing with our current approach. So that seems to be a dead end.

So it looks like b) is the route to take, combining a (rather limited) number of threads with a reception-event driven loop. But how to best do that? A naive approach is to have one thread running the epoll() loop and a pool of worker threads that actually pull the data off the wire. The epoll loop would essentially just dispense to-be-processed file descriptors to the workers. HOWEVER, that also implies one context switch during processing, namely when the epoll loop thread activates a worker. Note that this situation is by far not as bad as in a): since we have a limited number of workers, which are activated by the epoll thread, and that thread blocks when no workers are available, we have limited the level of concurrency. Note that limiting the concurrency level roughly to the number of CPUs available makes a lot of sense from a performance point of view (but not necessarily from a program simplicity and starvation-avoidance point of view – these concerns will be looked at, but for now I have a focused problem to solve).
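
As a rough illustration of this dispatcher/worker idea (again a hypothetical sketch, not imtcp code; NWORKERS, dispatch and the single-slot hand-off are assumptions made for brevity), the epoll thread hands ready file descriptors to a small, fixed pool of workers and blocks while all of them are busy:

#include <pthread.h>
#include <stddef.h>

#define NWORKERS 4   /* roughly the number of CPU cores we want to keep busy with reception */

static int ready_fd = -1;                      /* single hand-off slot between epoll thread and workers */
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t slot_free = PTHREAD_COND_INITIALIZER;
static pthread_cond_t slot_filled = PTHREAD_COND_INITIALIZER;

/* NWORKERS of these threads are started at init time */
static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&m);
        while (ready_fd == -1)
            pthread_cond_wait(&slot_filled, &m);
        int fd = ready_fd;
        ready_fd = -1;
        pthread_cond_signal(&slot_free);       /* the epoll thread may dispatch the next fd */
        pthread_mutex_unlock(&m);
        /* pull data off the wire for fd, build message objects, enqueue them ...
         * (a real implementation would use EPOLLONESHOT so the same fd is not
         * handed to two workers at once) */
        (void)fd;
    }
    return NULL;
}

/* called by the epoll loop for each ready file descriptor */
static void dispatch(int fd)
{
    pthread_mutex_lock(&m);
    while (ready_fd != -1)                     /* previous fd not yet taken: all workers busy, so block */
        pthread_cond_wait(&slot_free, &m);
    ready_fd = fd;
    pthread_cond_signal(&slot_filled);
    pthread_mutex_unlock(&m);
}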

One approach to this problem could be to further reduce the amount of work done in imtcp: if it no longer pulls data off the wire, but just places the file descriptor into a “message” object and submits that to the overall queue, a modified queue processing could then take care of the rest. However, there are many subtle issues, including how to handle system shutdown and restart as well as disk queues. In short: that probably requires a full redesign, or at least considerable change. Anything less than that would probably result in another processing stage in front of the rule engine, as outlined initially (and thus require additional context switches).

So I turned my focus back to the optimal way of partitioning this problem. One (simple) approach is to partition it by tcp listeners. It would be fairly easy to run multiple listeners concurrently, where each listener has its own (epoll/take data off the wire) loop running on the listener’s single thread. So in essence, it would be much like running two or more rsyslog instances, using the current code, concurrently. That approach obviously causes no additional context switches. But it has a major drawback: if the workload is spread unevenly between listeners, it may not provide sufficient parallelism to keep all CPU cores busy. However, if the workload is spread evenly enough, the approach can prevent starvation between listeners – but not between sessions of one listener. That problem is also not addressed by the current code, and there has never been any user complaint about it (or its potential effects). So one may conclude that starvation is not an issue.

It looks like the usefulness of this approach depends strongly on the spread of workload between different listeners. Looking at a busy system, we need to focus on the number of highly active listeners in relation to the number i of CPU cores we expect to be otherwise idle. That number i obviously must take into consideration any other processing requirements, both from rsyslog (parsing, rule processing, …) as well as from all other processes the system is intended to run. So, in general, the number i is probably (much) lower than the total number of cores inside the system. If we now have a number l of listeners, we must look closely: if, among all listeners, l_h is the number of high activity listeners, then it is sufficient to have i equal l_h: a few occasional wakeups from low activity listeners do not really matter. However, if l_h is lower than i, or even just one, then we can not fully utilize the system hardware. For example, with i = 4 cores available for reception but only l_h = 2 highly active listeners, two of those cores would sit idle during reception. In that case, we would need to provide partitioning based on sessions, and there we see a similar scheme based on the view of low- and high-activity sessions.

But the real question is whether we can assume that most busy systems have a sufficient number of high activity listeners, so that per-listener concurrency is sufficient to fully utilize the hardware. If that is the case, we can drastically improve potential message processing rates and still keep the code simple. Even more concretely, the question is whether we are sufficiently sure this approach works well enough to justify implementing it. Doing so could save considerable development effort, which could be put to better uses (like speeding up queue processing). BUT that development effort is wasted if, for a large enough number of systems, we see no benefit. And note that single-listener systems are not uncommon, a case where we would gain NO benefit at all.

I am actually somewhat undecided and would appreciate feedback on that matter.

Thanks in advance to all who provide it.
Rainer

Update: there is a long and very insightful discussion about this post on the rsyslog mailing list. All interested parties are strongly advised to read through it; it will definitely enhance your understanding. Please also note that, based on that discussion, the development focus has shifted a bit.

rsyslog string generators … done :)

An rsyslog string generator is what I had previously called a “template module” – in essence, a facility to generate a template string with some custom native C code. I have decided to name it a bit differently, because at some later stage there may be other uses for these types of modules as well. Specifically, I am thinking about adding custom name-value pairs to the message object, and then a string generator (or strgen for short) could be used to generate such a value as well.

Implementation went smoothly. I implemented both the interface as well as a small set of important core strgens, those that are frequently used to write files or forward messages to remote machines. I did not touch any others, as that is probably not really necessary — and could easily be done at any time if the need arises.

The new interface also provides a capability to third-parties that enables them to create their own high speed parsers. The performance impact can be dramatic, just think about cases where multiple regular expression calls can be replaced by a single call and some C logic.

Finally, these modules may even provide a way to fund rsyslog development. Adiscon can probably sell them for some small amount (I guess around $500 based on what needs to be done, in some cases maybe less, in some maybe a bit more). I guess that would be attractive for anyone who needs both high speed and a custom format and runs rsyslog for profit. Getting into all the details to develop such a thing oneself probably costs more than our whole implementation effort. I hope we will get some orders for these, and I hope that folks will contribute the strgens back to the project. Their plus is then that we maintain them for free, and the plus for the community is that, in the long term, we will hopefully get a large directory of ready-to-use custom strgens (OK, that sidesteps the funding process a bit, but… ;)).

I have also managed to write some basic doc on the new capability, to be seen here:

What now is missing is some feedback from the field, including from someone who actually uses this to create a custom format.

The code has been merged into v5-devel (the master branch) and will most probably be released early next week. Then, it will undergo the usual devel/beta cycle, so that availability in a stable v5 release can be expected towards the end of summer 2010. Special thanks go to David Lang, who provided good advice that helped me create the new functionality.

rsyslog template plugins

As I have written yesterday, I am evaluating the use of “template modules” in rsyslog.

In that post, I mentioned that I’d expect a 5% speedup as proof that the new plugin type was worth considering. As it turns out, this method seems to provide a speedup of 5 to 6 percent, so it seems to be useful in its own right.

After I had written yesterday’s post, I checked what it would take to create a test environment. It turned out that it was not too hard to change the engine so that I could hardcode one of the default templates AND provide a vehicle to activate that code via the configuration file. Of course, we do not yet have full loadable modules, but I was able to create a proof of concept in a couple of hours and do some (mild) performance testing on it. The current code provides a vehicle to use a C-function-based template generator. It is activated by saying

$template tpl,=generator

where the equal sign indicates that a C generator is to be used instead of the usual template string. The name that follows the equal sign will probably later become the actual module name, but it is irrelevant right now. I then implemented a generator for the default file format in a very crude way, but I would expect that a real loadable module will not take considerably more processing time (just a very small amount of calling overhead after the initial config parsing stage). So with that experimental code, I could switch between the template-based default file format and the generator-based format, with the outcome being exactly the same.
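
To give an idea of what such a generator might look like, here is a purely hypothetical sketch. The message structure msg_t and the function generateDfltFileFormat are invented for illustration and are not the actual rsyslog strgen interface; the point is simply that the fixed property order of the template is expressed directly in C instead of being interpreted from a template string at runtime:

#include <stdio.h>

/* hypothetical message representation; the real rsyslog message object and
 * strgen calling convention differ */
typedef struct msg {
    const char *timestamp;
    const char *hostname;
    const char *tag;
    const char *rawmsg;   /* message text, usually starting with a space */
} msg_t;

/* builds something like the traditional file format: "TIMESTAMP HOST TAG MSG\n"
 * (spacing details are simplified) */
static int generateDfltFileFormat(const msg_t *m, char *buf, size_t len)
{
    int n = snprintf(buf, len, "%s %s %s%s\n",
                     m->timestamp, m->hostname, m->tag, m->rawmsg);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;   /* -1 on error or truncation */
}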

Having that capability, I ran a couple of performance tests. I have to admit I did not go to a real test environment, but rather used my (virtualized) standard development machine. Also, I ran the load generator inside the same box. So there were a lot of factors that influenced the performance, and this for sure was no totally valid test. To make up for that, I ran several incarnations of the same test, with 1 to 10 million test messages. The results quite consistently reported a speedup between 5 and 6 percent achieved by the C template generator. Even though the test was crude, this consistently observed speedup is sufficient proof for me that native template generators actually have value in them. I have to admit that I had expected improvements in the 1 to 2 percent area, so 5 and more percent is considerable.

I committed the experimental branch to git, so everyone is free to review and test it for themselves.

Now that I am convinced this is a useful addition, my next step will be to add proper code for template plugins (and, along the way, decide if they will actually be called template plugins — I guess library plugins could be used as well, with somewhat less effort and greater flexibility). Then, I will convert the canned templates into such generators and include them statically inside rsyslog (just like omfile and a couple of other modules are statically included inside rsyslog). I hope that in practice we will also see this potential speedup.

Another benefit is that any third party can write new generator functions. Of course, there is some code duplication inside such functions. But that should not be a big issue, especially as generator functions are usually expected to be rather small (but of course need not be so). If someone intends to write a set of complex generator functions, these can be written with a common core module whose utility functions are accessed by each of the generators. But this is not my concern as of now.

Note that I will probably use very simple list data structures to keep track of the available generators. The reason is that after the initial config file parsing, access to these structures is no longer required and so there is no point in using a more advanced method.
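
A sketch of what such a simple structure could look like (hypothetical code; strgenRegister, strgenLookup and the generator signature are invented for illustration): the generators register themselves in a plain singly-linked list, and lookups happen only while the configuration file is parsed, so nothing fancier is needed.

#include <stdlib.h>
#include <string.h>

/* hypothetical generator signature, for illustration only */
typedef int (*strgen_t)(const void *msg, char *buf, size_t len);

/* one list node per registered generator */
struct strgenEntry {
    const char *name;             /* the name used after '=' in $template */
    strgen_t generator;
    struct strgenEntry *next;
};

static struct strgenEntry *strgenList = NULL;

/* called by each (statically linked or loadable) generator at startup */
static int strgenRegister(const char *name, strgen_t gen)
{
    struct strgenEntry *e = malloc(sizeof(*e));
    if (e == NULL)
        return -1;
    e->name = name;
    e->generator = gen;
    e->next = strgenList;         /* push onto the head of the list */
    strgenList = e;
    return 0;
}

/* called only during config parsing, when a "$template name,=generator" is seen */
static strgen_t strgenLookup(const char *name)
{
    for (struct strgenEntry *e = strgenList; e != NULL; e = e->next)
        if (strcmp(e->name, name) == 0)
            return e->generator;
    return NULL;
}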

I expect my effort to take a couple of days at most, but beware that Thursday is a public holiday over here in Germany and I may not work on the project on Thursday and Friday (depending, I have to admit, a little bit on the weather ;)).