Monday, March 28, 2011

why does the rsyslog testbench sometimes fail?

Rsyslog contains a set of automated tests, the so-called "testbench". It is invoked via the standard "make check" and "make distcheck" targets. Since its introduction in version 3, the testbench has been continuously enhanced and extended. It now contains around 150 individual tests, which amount to around 80 tests from the autoconf point of view (some autoconf tests run a couple of subtests, hence the difference in number). The testbench has proven very useful and has caught numerous problems before new code was released.

But the testbench is not perfect, and it may sometimes fail without any actual problem. There are two reasons for this. One is that the tests require a very specific environment. For example, some parser-based tests assume that the system the test is run on is configured to be named "localhost.localdomain" (the default for many test deployments). This needs to be the case because there currently is no way in rsyslog to override the local hostname. Some parser tests use malformed messages, in which case (as per the RFC) the local system name must be used. As such, we need to have a specific system name set in order to verify the results. In the long term, I'll add the capability to override the system name inside rsyslog, but it does not make sense to create a dirty trick just for testbench use. So this needs to wait until we get to it as part of regular development. Note that similar issues may exist in other places. An obvious one is the database tests, where we need pre-created users, databases, tables etc. in order to run the tests.
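Before blaming rsyslog for a failed parser test, it can help to check the environment first. The following is only a minimal sketch, not part of the testbench: the helper name `hostname_matches` is made up for illustration, and it assumes that Python's `socket.gethostname()` reports the same name that rsyslog sees, which may not hold on every system.

```python
import socket

# Hostname that, per the text above, some parser tests expect.
EXPECTED_HOSTNAME = "localhost.localdomain"


def hostname_matches(expected=EXPECTED_HOSTNAME, actual=None):
    """Return True when the local system name equals the expected one.

    `actual` can be passed explicitly (e.g. for testing); by default the
    name is taken from socket.gethostname(). This helper is hypothetical
    and not part of rsyslog.
    """
    if actual is None:
        actual = socket.gethostname()
    return actual == expected


if __name__ == "__main__":
    if hostname_matches():
        print("hostname matches, parser test results should be meaningful")
    else:
        print("hostname is %r, expected %r: parser test failures may be "
              "environmental" % (socket.gethostname(), EXPECTED_HOSTNAME))
```

If the check reports a mismatch, a failing parser test says more about the machine than about the code under test.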

The other issue is a bit more subtle. The syslog protocol is simplex, without app-layer acknowledgments. This makes it hard to know when rsyslog has received a whole bunch of test data. That in turn makes it hard to say definitively when all test data has arrived and an instance can be shut down. So the whole process is a bit racy. To "solve" this, I use some wait periods in the tests affected by this problem. However, longer wait periods mean longer testbench runtime, and this reduces my development productivity. So I use wait times that usually do the job, but may be too short under some circumstances (most notably when --enable-debug is set). This can affect a couple of TCP-based tests (like imtcp_conndrop.sh and similar ones). I do not yet have a good idea of what a clean solution to this problem would be, where "clean" means that it a) always works and b) does not introduce unnecessary code complexity in non-testbench runs.
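To illustrate the trade-off, here is a rough sketch of polling with an upper time limit instead of sleeping for a fixed period. It is not what the testbench does. It assumes that a test knows how many messages it sent and that rsyslog writes one line per message to an output file; the names `count_lines` and `wait_for_lines` are invented for this example. Note that it does not meet the "always works" criterion either: if the timeout is too short for a slow system, the test still fails.

```python
import time


def count_lines(path):
    """Return the number of lines in `path`, or 0 if it does not exist yet."""
    try:
        with open(path, "rb") as f:
            return sum(1 for _ in f)
    except FileNotFoundError:
        return 0


def wait_for_lines(path, expected, timeout=30.0, interval=0.1):
    """Poll until `path` holds at least `expected` lines.

    Returns True as soon as the count is reached, False if `timeout`
    seconds pass first. The caller decides what a timeout means.
    """
    deadline = time.monotonic() + timeout
    while True:
        if count_lines(path) >= expected:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

The benefit over a fixed sleep is that the fast case returns early, so a generous timeout costs time only when something is actually slow or broken.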

Given these problems, some care must be taken when interpreting testbench results. Most importantly, a failure does not necessarily mean that things are actually broken. It merely means that one needs to look at the actual test and check a) why it fails and b) if it fails repeatedly. The "racy" tests in particular tend to fail occasionally without any real problem. I've also seen them fail consistently on some platforms, simply because my timing assumptions are not valid there (Solaris was one example where I needed to adjust my overall wait periods).
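Checking whether a test fails repeatedly can be scripted. The sketch below reruns a command several times and sorts the outcome into three buckets. The function names are made up, and the assumption that a single test can be run as a standalone command that exits with status 0 on success is mine; depending on the setup, a test script may need additional environment prepared by "make check".

```python
import subprocess


def classify(results):
    """Summarize a list of booleans (True = run passed)."""
    if not results:
        raise ValueError("no test runs to classify")
    if all(results):
        return "pass"
    if not any(results):
        return "consistent failure"
    return "intermittent failure"


def rerun(command, runs=5):
    """Run `command` (a list of arguments) `runs` times.

    Returns one boolean per run: True when the exit status was 0.
    """
    results = []
    for _ in range(runs):
        proc = subprocess.run(command,
                              stdout=subprocess.DEVNULL,
                              stderr=subprocess.DEVNULL)
        results.append(proc.returncode == 0)
    return results
```

For example, `classify(rerun(["sh", "imtcp_conndrop.sh"]))` would report whether that test fails every time or only now and then. An intermittent failure points at the timing issue; a consistent one still needs a look at the test itself, because it can also be caused by wait periods that do not suit the platform.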

So testbench results need to be taken with a grain of salt, and they require interpretation. I know this is inconvenient for occasional users, but it is the best compromise I can currently offer.