Thursday, October 07, 2010

Introducing liblognorm

Hi folks,

With this posting, I introduce a new project of mine: liblognorm. This library is meant to help make sense of syslog data, or, actually, any event data that is present in text form.

In short, one will be able to throw arbitrary log messages at liblognorm, one at a time, and for each message it will output well-defined name-value pairs and a set of tags describing the message.

So, for example, if you have traffic logs from three different firewalls, liblognorm will be able to "normalize" the events into generic ones. Among other things, it will extract source and destination IP addresses and ports and make them available via well-defined fields. As the end result, a common log analysis application will be able to work on that common set, and so this backend will be independent of the actual firewalls feeding it. Even better, once we have a well-understood interim format, it is also easy to convert that into any other vendor-specific format, so that you can use that vendor's analysis tool.

So liblognorm will be able to provide interoperability between systems that were never designed to be interoperable.

This sounds bold, so why am I thinking I can do this?

Well, I have been working in this field for quite some years and have accumulated a lot of experience, including a sufficient number of implementations where I failed in one way or another. You can read about this in my previous blog post on "syslog normalization". So know-how is available. The core technology is actually not that complex. I hope to code the core parts of the lib within one month (but it may take some more real-world time as I cannot yet work on it full time). Also, very importantly, there is the Common Event Expression (CEE) standard coming up. Its aim is nothing less than to provide the plumbing that is needed for log normalization. CEE was initiated by the same folks who brought such successful standards as CVE to life -- so we can hope it will succeed. Thankfully, I have recently become a member of the editorial board, so I have good insight into what it takes to write a library that utilizes CEE.

This all sounds good, but there is one ingredient missing: liblognorm will not be able to magically know the format of each message. We must teach it the formats. This involves some community effort, as a single person cannot compile all the message formats this IT world has to offer. Not even a small group can do that. I hope that we can get sufficient log samples from the rsyslog community and hopefully later from a growing liblognorm community. I have thought long about how this can be done. A core finding is that it must be dead easy to do. This is how it is intended to work:

When you feed liblognorm a message that it can not recognize, it will spit out that message. Let's assume we have the following message:

AAA credentials rejected: reason = reason: server = server_IP_address: user = user

Then the user just needs to tell liblognorm which fields the message contains, using a simple syntax. A hypothetical sample could be:

reject,AAA:AAA credentials rejected: reason = %reason_msg%: server = %sourceIP%: user = %user%

The strings "reject" and "AAA" are tags. Tags are placed in comma-delimited format in front of the actual message sample and are terminated by a colon. Everything after the colon is actual message text. The fields between percent signs reflect well-known properties (which are taken from the CEE base def and/or are custom-defined). The syntax of each field will be taken from a data dictionary, so the user does not need to bother about that in most cases. So creating a message sample out of an unknown message type should be fairly easy.
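To make the sample syntax a bit more concrete, here is a minimal Python sketch of how such a one-line sample could be split into tags and fields and then applied to a message. Everything in it is an assumption for illustration only: the function names `parse_sample` and `normalize` are hypothetical, the values in the test message are made up, and a regular expression is used merely as a stand-in matcher. It says nothing about how liblognorm itself will do the parsing, and it ignores the data dictionary entirely (every field simply captures text up to the next literal).

```python
import re

# The hypothetical sample from the post: tags, a colon, then the message text.
SAMPLE = ("reject,AAA:AAA credentials rejected: reason = %reason_msg%: "
          "server = %sourceIP%: user = %user%")

# A made-up message instance; the concrete values are illustrative only.
MESSAGE = ("AAA credentials rejected: reason = bad password: "
           "server = 192.0.2.10: user = alice")


def parse_sample(line):
    """Split a sample line into (list of tags, compiled regex)."""
    # Tags are comma-delimited and terminated by the first colon.
    tag_part, _, text = line.partition(":")
    tags = [t.strip() for t in tag_part.split(",") if t.strip()]
    # re.split with one capture group alternates literal text and field names.
    pieces = re.split(r"%(\w+)%", text)
    pattern = ""
    for i, piece in enumerate(pieces):
        if i % 2 == 0:
            pattern += re.escape(piece)        # literal text must match exactly
        else:
            pattern += "(?P<%s>.+?)" % piece   # field: lazy named capture
    return tags, re.compile("^" + pattern + "$")


def normalize(message, sample_line):
    """Return tags and name-value pairs, or None if the sample does not match."""
    tags, regex = parse_sample(sample_line)
    match = regex.match(message)
    if match is None:
        return None
    return {"tags": tags, "fields": match.groupdict()}


if __name__ == "__main__":
    print(normalize(MESSAGE, SAMPLE))
```

The point of the sketch is only that the sample stays almost identical to the log line itself, which is what should make writing new samples easy.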

The idea now is that we gather these one-line message samples and compile them into a central repository. Then, all users can download fresh versions of the "sample base" and use them to normalize their tags -- much like a virus scanner works. Of course, there are a number of questions regarding how trustworthy samples are. But this is definitely something we can solve. To get started, we could simply do some manual review first, much like code integration. At later stages, some trust level policy could be applied. Well-known technology...
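The workflow around such a sample base could look roughly like the following Python sketch: load a list of one-line samples, try each incoming message against them, and spit out whatever is not recognized so that somebody can write a sample for it. Again, this is only a sketch under assumptions. The names `compile_sample`, `load_sample_base` and `normalize_all` are hypothetical, the second sample and all messages are invented, and the first-match-wins regex lookup is a placeholder, not the intended liblognorm algorithm.

```python
import re


def compile_sample(line):
    """Turn 'tag1,tag2:text with %field% markers' into (tags, regex)."""
    tag_part, _, text = line.partition(":")
    parts = re.split(r"%(\w+)%", text)
    pattern = "".join(
        re.escape(p) if i % 2 == 0 else "(?P<%s>.+?)" % p
        for i, p in enumerate(parts)
    )
    return tag_part.split(","), re.compile(pattern + r"\Z")


def load_sample_base(lines):
    """Compile every non-empty line of a downloaded sample base."""
    return [compile_sample(l.strip()) for l in lines if l.strip()]


def normalize_all(messages, sample_base):
    """Return (normalized events, unrecognized messages)."""
    known, unknown = [], []
    for msg in messages:
        for tags, regex in sample_base:
            match = regex.match(msg)
            if match:
                known.append((tags, match.groupdict()))
                break                      # first matching sample wins
        else:
            unknown.append(msg)            # candidate for a new sample
    return known, unknown


# First sample is from the post; the second one is made up for illustration.
SAMPLE_BASE = load_sample_base([
    "reject,AAA:AAA credentials rejected: reason = %reason_msg%: "
    "server = %sourceIP%: user = %user%",
    "login,success:user %user% logged in from %sourceIP%",
])

MESSAGES = [
    "AAA credentials rejected: reason = bad password: "
    "server = 192.0.2.10: user = alice",
    "user bob logged in from 198.51.100.7",
    "disk almost full on /var",
]

if __name__ == "__main__":
    known, unknown = normalize_all(MESSAGES, SAMPLE_BASE)
    for tags, fields in known:
        print(tags, fields)
    for msg in unknown:
        print("unknown:", msg)
```

The list of unknown messages is exactly what a user would turn into new samples and contribute back to the repository.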

None of this is cast in stone, as I am at a very early stage of this project. Your feedback definitely makes a difference, so be sure to provide ample of it ;)

That's it for today, but I plan to do a series of blog posts on the new system within the next couple of days. Please provide feedback whenever you have it. It's definitely something very useful to have. I'll also make sure that I set up a new list for this effort, but initially I will be abusing the rsyslog mailing list, as I assume folks on that list will definitely be interested in liblognorm as well. And, of course, once the library is ready I'll utilize it in various ways inside rsyslog.


Kamil Kisiel said...

This sounds like a great idea. A common library with a large sample base is essential for good log analytics. Some tools such as Zenoss have already made some attempts at this, but of course their efforts are limited in scope to their particular toolset. The default included library in Zenoss was also quite limited, and with the thousands of types of common and uncommon messages it's basically an impossible undertaking for an administrator to attempt to create classifiers for all the different messages they may encounter.

I think if this project picks up momentum, it would be instrumental to provide a way for application authors to include some kind of definition of all the log messages the application can output. If this could somehow be linked with the application's internationalization or strings facility, it would be even more convenient. I think this would help decrease the burden of maintaining the library.

Rainer said...

Yeah, I initially was tempted to implement this as part of rsyslog, but then it would have been tied to its toolset. Not so smart...

As for the apps, CEE is the real solution here. If I can manage to make using CEE very simple, we can probably get enough momentum to help us skip the parsing step for new log messages being emitted, because they are already in a well-defined format. CEE does all the hard work of coming up with standard field syntax and semantics as well as with recommendations on what a particular log record should contain.

Raffy said...

Rainer, as I communicated before: Love the effort and I will help wherever I can! Quick question that might be of interest to others as well: How are you going to specify the parsing? A set of regexes? Or how will that work?

Rainer said...

The hypothetical sample that you see is not the output format. That format will be provided by CEE (or be an in-memory representation for a library user app).

It is the input format. The core idea is to use CEE fields (and thus field types, ...) inside the "sample". I try to keep the sample as similar to a log message as possible, because I think that this makes it easiest for folks to generate new samples.

The idea is that when the library spits out an unknown log line, the user can simply take that line, look at its format and replace actual values with the relevant CEE field name.

Sorry if that was not 100% clear. The actual format for the sample, of course, needs to be found. But I insist it must be very simple -- and accept that it may look a bit ugly for that reason. Most specifically, I do not think that XML is right for this task.
