Monday, June 07, 2010

further improving tcp input performance

As one of the next things, I will be further improving rsyslog's tcp syslog input performance. As you know, rsyslog already has excellent performance (some sources, for example, quote 250,000 msgs per second). But, of course, there is room for improvement.

One such area is imtcp, the tcp syslog input module. It uses a single polling loop to obtain data from all senders. It is worth noting that the actual input module does NOT do very much, but hands the majority of work off to queue worker threads. However, it pulls the data from operating system buffers to our user space and also fills some basic properties (like time of reception, remote peer and so on). Then, the message is pushed to the message queue and at the other side of the queue the majority of processing happens (including such things like parsing the message, which some would assume to happen inside the receiving thread).

As can be seen in practice, this design scales pretty well in most cases. However, on a highly parallel system, it obviously limits the process of pulling data "off the wire" to be done on a single CPU. If then the rule set is not very complex (and thus fast to process), the single-threadedness off the initial receiver becomes a bottleneck. On a couple of high performance systems, we have seen this to be the bottleneck, and I am now trying to address it.

Right now, I am looking for a good solution. There are two obvious ones:

a) start up a single thread for each connection
b) do a hybrid approach of what we currently do and a)

Even with 64bit machines and NPTL, approach a) does probably not work well for a very large number of active sessions. Even worse, receiving messages from two different hosts would then require at least one context switch, and do so repeatedly. Context switches are quite expensive in terms of performance, and so better to avoid. Note that the current approach needs no context switch at all (for the part it does). On a system with many connections, I would be close to betting that the runtime required by the a)-approch context switching alone is probably more than what we need to do the processing with our current approach. So that seems to be a dead end.

So it looks like b) is a route to take, combining a (rather limited) number of threads with an reception-even driven loop. But how to best do that? A naive approach is to have one thread running the epoll() loop and have a pool of worker threads that actually pull the data off the wire. So the epoll loop would essentially just dispense to-be processed file descriptors to the workers. HOWEVER, that also implies one context switch during processing, that is when the epoll loop thread activates a worker. Note that this situation is by far not as bad as in a): as we have limited number of workers, and they are activated by the epoll thread, and that thread blocks when no workers are available, we have limited the level of concurrency. Note that limiting the concurrency level roughly to the number of CPUs available makes a lot of sense from a performance point of view (but not necessarily from a program simplicity and starvation-avoidance point of view - these concerns will be locked at, but now I have a focused problem to solve).

One approach to this problem could be that I further reduce the amount of work done in imtcp: if it no longer pulls data off the wire, but just places the file descriptor into a "message" object and submit that to the overall queue, modified queue processing could then take care of the rest. However, there are many subtle issues, including how to handle system shutdown and restart as well as disk queues. In short: that probably requires a full redesign, or at least considerable change. Anything less than that would probably result in another processing stage in front of the rule engine, as outlined initially (and thus require additional context changes).

So I focused back to the optimal way to partition this problem. One (simple) approach is to partition the problem by tcp listeners. It would be fairly easy to run multiple listeners concurrently, but each of the listeners would have its own (epoll/take data off the wire)-loop that runs on the listener's single thread. So in essence, it would be much like running two or more rsyslog instances, using the current code, concurrently. That approach obviously causes no additional context switches. But it has a major drawback: if the workload is spread unevenly between listeners, it may not provide sufficient parallelism to busy all CPU cores. However, if the workload is spread evenly enough, the approach can prevent starvation between listeners - but not between sessions of one listener. This problem is also not addressed by the current code, and there has never been any user complaint about that (or it's potential effects). So one may conclude starvation is not an issue.

It looks like the usefulness of this approach is strongly depending on the spread of workload between different listeners. Looking at a busy system, we need focus on the number of highly active listeners in relation to the number of expectedly idle CPU cores i. That number i obviously must take into consideration any other processing requirements, both from rsyslog (parsing, rule processing, ...) as well as all other processes the system is intended to run. So, in general, the number i is probably (much) lower than the total number of cores inside the system. If we now have a number l of listeners, we must look closely: if among all listeners, l_h is the number of high activity listeners, than it is sufficient to have i equals l_h: few occasional wakeups from low activity listeners do not really matter. However, if l_a is lower than i, or even just one, then we can not fully utilize the system hardware. In that case, we would need to provide partitioning based on sessions, and there we see a similar scheme based on the view of low- and high-activity sessions.

But the real questions is if we can assume that most busy systems have a sufficient number of high activity listeners, so that per-listener concurrency is sufficient to fully utilize the hardware. If that is the case, we can drastically improve potential message processing rates and still be able to keep the code simple. Even more concrete, the question is if we re sufficiently sure this approach works well enough so that we implement it. Doing so, could save considerable development effort, which could be put to better uses (like speeding up queue processing). BUT that development effort is wasted time if for a large enough number of systems we can not see benefit. And note that single-listener systems are not uncommon, a case where we would gain NO benefit at all..

I am actually somewhat undecided and would appreciate feedback on that matter.

Thanks in advance to all who provide it.

Update: there is a long and very insightful discussion about this post on the rsyslog mailing list. All interested parties are strongly advised to read through it, it will definitely enhance your understanding. Please also note that based on that discussion the development focus shifted a bit.

No comments: