Monday, October 11, 2010

splitting up the normalization library

I have dug into the design of my upcoming event/log normalization library. As it will be based on CEE, I intend to pull in CEE definitions for the types defined there, like tags or field types. I also thought about what the library should output. An obvious choice for many use cases is an in-memory object model describing the normalized form of the event that was passed in. This is probably most convenient for applications that want to do further processing on the event.

However, it also seems useful to be able to serialize this data as a text string. That string could be stored in a file for later reference or forensics, or fed to some other tool capable of understanding the file format. And as the in-memory object model will be CEE based, and CEE defines such serialization formats, it seems obvious that the library should be able to generate serializations in the CEE-defined and supported formats (note that this does not necessarily mean XML; it may be JSON or syslog structured data as well).
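To make the idea of "object model plus text serialization" a bit more concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `Event` class, the `to_json`/`from_json` helpers and the field names are hypothetical, and the JSON layout is just a plain rendering, not the layout CEE defines.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Event:
    """Hypothetical in-memory model of one normalized event."""
    fields: dict = field(default_factory=dict)  # field name -> value
    tags: list = field(default_factory=list)    # classification tags


def to_json(event: Event) -> str:
    """Serialize the event to a single-line JSON string (illustrative layout)."""
    return json.dumps({"fields": event.fields, "tags": event.tags}, sort_keys=True)


def from_json(text: str) -> Event:
    """Rebuild the in-memory model from a string produced by to_json()."""
    data = json.loads(text)
    return Event(fields=data["fields"], tags=data["tags"])


# Made-up sample data; a real event would come out of the parser engine.
event = Event(fields={"src_ip": "192.0.2.1", "user": "alice"}, tags=["login"])
line = to_json(event)  # one line of text, suitable for appending to a file
```

The point of the sketch is only that the text form is derived from the object model, so other output formats (XML, syslog structured data) could be added as further serializer functions without touching the model itself.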

Looking at all this, the normalization library seems to consist of two largely independent (but co-operating) parts:
  • the parser engine itself, the part that actually normalizes the input string according to the provided sample base and CEE definitions
  • a CEE support library, which provides the plumbing for everything that is defined in CEE (like tags, field types and serialization formats)
Now consider that I intended to create the normalization feature as a separate library instead of an rsyslog plugin because I hope that other projects can reuse it. Looking at the above bullet points, it also seems natural to split the core parser from the CEE functionality. Again, there seems to be a broader potential user base for generic CEE functionality than for normalization. For example, a CEE support library could also be used by projects that natively support CEE. It would hopefully save them the hassle of coding all CEE functionality just to do some basic things. Think, for example, of an application that would "just" like to emit a well-formed CEE log record (a very common case, I guess). It could simply build a proper in-memory representation of the event via the library and then have the library process it. The library could then also check whether the event is syntactically correct and contains all the fields necessary to conform to a specific CEE profile.
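The profile check mentioned above could look roughly like the following Python sketch. All names here are hypothetical: `Event`, `Profile` and `missing_fields` are invented for this example, and the "demo" profile with its three required fields is made up rather than taken from CEE.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    """Hypothetical in-memory representation built by the application."""
    fields: dict = field(default_factory=dict)


@dataclass(frozen=True)
class Profile:
    """Hypothetical profile: a name plus the set of fields it requires."""
    name: str
    required: frozenset


def missing_fields(event: Event, profile: Profile) -> list:
    """Return the required field names that the event lacks, sorted."""
    return sorted(profile.required - set(event.fields))


# Made-up profile and event, for illustration only.
profile = Profile("demo", frozenset({"time", "host", "msg"}))
event = Event({"time": "2010-10-11T12:00:00Z", "msg": "user logged in"})

print(missing_fields(event, profile))  # ['host']
```

An application would call such a check before handing the event to the serializer, so that a non-conforming record is caught before it is written out.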

The more I think about it, the more useful it seems. So I'll probably split the core normalization library from the CEE part. This is not much effort, but it opens up additional uses. I'll then call the normalization part liblognorm (or libeventnorm) and the CEE part libcee, or something along these lines. In this light, liblognorm may actually be the better name, because the parser part is more concerned with logs and log files than with generic events (which often come in other formats).

Again, feedback is appreciated!