Wednesday, April 27, 2011

rsyslog config reload - random thoughts

This blog post is more or less a think tank, maybe even an utility to clear my mind. Please note that I am not talking about anything that is right now present in rsyslog. I am not even saying that it will be in the near future. But I'd like to think a bit about alternatives on the route to there.

Let's assume rsyslog shall have two abilities:
  • use different config languages
  • dynamically reload a config without a full restart (thus applying a delta between new and old config)
In any case, the usual approach is to have an object representing a full configuration. This is an in-memory object. Usually, it is created during parsing the configuration file(s). During that parsing, nothing of the new config is actually carried out, just the in-memory presentation build. In that model, it would also be possible to have several fully populated config objects in core at the same time. The important thing to note is that none of them actually affects the current system - they are just loaded and ready to use.

Usually, in such a design, there is one thing that is called the "running conf". This configuration is the one the system actually uses for processing.

So how is a new configuration activated? In a first step, the config parsers create an in-memory object. Once this is done, that object is a candidate config, one that could be activated. To actually activate it, the candidate config is loaded as running config. During that process, all the settings are applied and services are started. Please note that a dynamic config reload can be done by first creating a delta between the candidate config and the current running config. This delta can then be used to keep the currently existing config running, but just modify it so that it is equivalent to the candidate config. This process can be less intrusive than shutting down the running config and restarting it based on the candidate config. As an example, a rsyslog system may have several hundered incoming TCP connections open. If the delta is just the addition of a new output file, there is no need to shut down these TCP connections as part of the (delta-driven) candidate config activation. Whereas, if no delta were to be used, all connections would need to be shut down and re-established after restart. There is obviously benefit in delta-based configuration apply. However, it should be noted that there are many subtle issues associated with creating and applying the delta.

Thinking about such a loaded/candidate/running config system for rsyslog, there is one overall issue that complicates things: loadable modules! Each module not only provides extra functionality, it also provides a set of configuration settings to modify its functionality. As such, we have a problem with config parsing: in order to fully parse a config file, we actually need to load the module as part of config processing. Or more precisely, we need to load its configuration file processor. However, it does not make sense to split each module into a config file processor and the actual module, at least I think so. Splitting them up would make things over-complex IMHO. However, we must demand that a module, in its config processing code, must not do anything other then creating configuration objects. Most importantly, it must not start any service or act out any non-config related activity. If it would do so, it would possibly affect the current running configuration. Also, config processing does not necessarily mean that the config will actually be activated, so the module must not assume that its processing will ever be called. As a side-note, this is one of the issues with the current legacy configuration system (as seen in all versions prior to v6 and early v6 versions).

Rsyslog traditionally keeps a list of loaded modules. Modules are added to that list when they are loaded inside the configuration system. During system startup, that very same list is also used to activate services inside the module. So that single list serves both as
  • a registry of loaded modules (e.g. to know what is already available and what needs to be unloaded)
  • a registry of modules required to for configuration
Both functions are tied together into a single list because rsyslog currently has only the concept of a single configuration, and not a candidate/running (multi-)config system. For the latter, it is necessary to differentiate between the two cases:

As we need to load modules during config parsing, we still need a single global list that keeps track of modules already present inside the system. Please note that with a multi-config system, a module that is first "loaded" inside a currently being parsed configuration file may actually already be loaded inside the system. In that case, a duplicate load must be avoided and the already existing module be used. The global, config-independent list is required to support this functionality.

On the other hand, such a global list can no longer be used to activate services for a specific config. This is easy to see when we have a config A which uses e.g. a TCP listener and we have a config B which does not. If B is activated, the TCP listener shall obviously not be activated. As such we need a dedicated, config-specific list of modules that are part of the current configuration. Let's call this one the "config module list" and the other one the "loaded module list".

The loaded module list is than just use to keep track of which modules are loaded. Also, it will/can be used to locate a module, so that global functions (like the config parser) can be found and carried out. Note that I call the config parser global, not config-specific. The reason is that the config parser does not take a config instance as input, but rather has no input (other than the config language) but has the config instance as output. As such, it emits config specific data, but does not require it for processing. So it is global.

The config module list in contrast must hold all config specific data elements for the module (most importantly the module specific config instance itself). The config module list is to be used for all config-specific actions. For example, it will be used to activate a module's services when the candidate config becomes a running config (maybe via a delta-apply process).

Note that on-demand module unload can be done via reference counting, which is already implemented in rsyslog. When a module is put onto a config module list, the count is incremented. If it is removed from such a list (usually because the in-memory config is destroyed), the count is decremented. An unload happens if the reference count reaches zero. If the module would be required by later processing another config, that would trigger a reload just as if the module had never before been loaded.

Note that a clearly defined and implemented split in global vs. configuration specific functionality is of vital importance for a multi-config system. This probably has some subtle issues as well. Right out of my head, I can think of the problem of some potentially global configuration settings (a term that seems to contradict itself in this context - just think a bit about it...). For example, we have the module search path, which tells us from where to load modules. With different configs, we can potentially have different module search pathes. That, in turn, can load to modules with the same name being loaded from different locations. That means we could potentially have different functionality, including different sets of config parameters (!) in the system at the same time. This could lead to some hard to diagnose issues. So it looks necessary to have the ability to load the same module via different pathes concurrently, and apply only the "right" module to the config in question. Looking at the current code base, implementing this would be even harder than just splitting out the global/config-specific lists. Maybe this would be something that shall be added at a later stage, *if* at all we take the path down that road. Also, there may be other issues along the way that I do not currently envision...

Monday, April 11, 2011

message classification with liblognorm sample code

I have just enhanced liblognorm's normalizer tool to support the -t option. If it is given, only messages with the specified tag will be output. Currently, only a single tag can be specified. The main purpose of this change is to provide some example code on how to use the message classification API, so that other developers can include it into their solutions more easily.

In essence, the whole logic is contained in normalizer.c, line 122 and 123. The application needs to keep the "wanted" tags inside an es_str_t type. Then, it needs to call the ee_getEventField() API to find out if the normalizer (better said: its rule) associated the tag with a given message. That's it...

Please note that we may implement a more powerful API in the future -- if this makes sense. If you think API additions would be useful, please suggest them together with a description of the benefits.

Wednesday, April 06, 2011

log classification with liblognorm

Today, I have added support for so-called "tags" to liblognorm (and it's base library libee). This new capabilities permits very easy classification of syslog message and log records in general. So you can not only extract data from your various log source, you can also classify events, for example, as being a "login", a "logout" or a firewall "denied access". This makes it very easy to look at specific subsets of messages and process them in ways specific to the information being conveyed.

To see how it works, let's first define what a tag is: A tag is a simple alphanumeric string that identifies a specific type of object, action, status, etc. For example, we can have object tags for firewalls and servers. For simplicity, let's call them "firewall" and "server". Then, we can have action tags like "login", "logout" and "connectionOpen". Status tags could include "success" or "fail", among others. The idea of tags is based on early CEE concepts. I will try to keep consistent with whatever CEE heads to. Tags form a flat space, there is no inherent relationship between then (but this may be added later on top of the current implementation). Think of tags like the tag cloud in a blogging system. Tags can be defined for any reason and need (though obviously we must strive to get to a standard set, something I hope CEE will provide in the not too distant future). A single event can be associated with as many tags as required.

Assigning tags to messages is simple. A rule contains both the sample of the message (including the extracted fields) as well as -now- the tags. Have a look at this sample, taken from liblognorm 0.2.0:

rule=:sshd[%pid:number%]: Invalid user %user:word% from %src-ip:ipv4%

Here, we have a rule that shows an invalid ssh login request. The various field are used to extract information into a well-defined structure. Have you ever wondered why every rule starts with a colon? Now, here is the answer: the colon separates the tag part from the actual sample part. Starting with liblognorm 0.3.0, you can create a rule like this:

rule=ssh,user,login,fail:sshd[%pid:number%]: Invalid user %user:word% from %src-ip:ipv4%

Note the "ssh,user,login,fail" part in front of the colon. These are the four tags the user has decided to assign to this event. What now happens is that the normalizer does not only extract the information from the message if it finds a match, but it also adds the tags as metadata. Once normalization is done, one can not only query the individual fields, but also query if a specific tag is associated with this event. For example, to find all ssh-related events (provided the rules are built that way), you can normalize a large log and select only that subset of the normalized log that contains the tag "ssh".

Note that versions of liblognorm 0.2.0 simply ignore the tag part, so old versions of the library are capable of working with new rule bases.

This is pretty cool and has ample potential. Just think about creating firewall reports: if you have different firewalls, you only need to have different rule bases to normalize these events all into the same format. Even more now, you can process the logs based on the classification assigned during the normalization process. For example, a "failed connection request" report may ignore everything that is not tagged as "connection, fail".

That probably sounds pretty good to you, but how to actually use it? Right now, the core functionality is available inside the libraries (more precisely in the git version, I will do an official release very soon but wanted to spread word). That means developers have the necessary API to integrate with their programs. End user tools do not yet exist (what is not too surprisingly for a library). Integration of the new functionality is very easy. Classification is available without need to change anything in existing applications. A single new simple API ee_EventHasTag() has been added, which needs to be called to see if an event is associated with the given tag. [side-note: the current API is NOT guaranteed to be stable, even though I try not to break things without need]

In hope that developers will play with the new functionality, so that it will be available in end-user tools soon as well. I myself plan to enhance the normalizer tool very soon to support selecting subsets based on tags (this can also serve as an example for other developers). Also, I plan to add classification support to rsyslog very, very soon. So stay tuned to what's coming up -- it's exciting ;)

Monday, April 04, 2011

What is rsyslog auto-backgrounding?

Rsyslog, by default, auto-backgrounds itself after startup. That simply means that the instance that is started by the user (or script) more or less does nothing but fork a new instance detached from the current terminal session and execute it. The originally started instance exits after a short timeout. This behavior was carried over from sysklogd.

Note that auto-backgrounding is problematic (aka "makes things more complicated than the need to be") in debug sessions, lab environments and so on. So command line switch "-n" can be used to turn off auto-backgrounding. In that case, the first instance started will actually carried out the work to be done (as most would expect in the first place).

It is strongly recommended to use "-n" option for lab testing.

log normalization: how to share rulebases?

Rulebases play a crucial role in log normalization. While the log normalizer itself needs to be of high quality and speed, it is the rulebase that really helps to detect which message the one in question is. I myself have so far concentrated on the code and not created any larger rulebase. Champ Clarck III has created many more for his use inside Sagan. But this means everything is in its infancy. What we really need is community involvement to create a large number of easy to access rulebases for almost all devices.

This brings up the question of how to manage and share such a repository. One method may be to place it on a web site, together with some submission tool. An alternate approach would be to put everything into a public git. This latter approach has some beauty, because git is universally available and well know. Even if a user does not know git, only a minimal set of commands is required to pull the rulebase. So maybe this is the way to go?

I would be very interested in suggestions on how we shall manage rulebases and spread the word. What do we need to support a great community? Whom can we talk to? If you have any ideas, concerns, questions or even an idle rant, be sure to let me know. At best, send mail to the lognorm mailing list, so we can broadcast this to other folks interested.