I have been working on syslog normalization for quite a few years now. A couple of days ago, David Lang talked to me about syslog-ng's patterndb, an approach to classifying log messages and extracting properties from them.
I have looked at this approach, and it is indeed promising. One ingredient, though, is missing: a directory of standard properties (like bytes sent and received in traffic logs). I know this missing ingredient very well, because we also forgot it until recently.
The aim of normalizing log data is far from new. Actually, I think it is one of the main concerns in log analysis. Probably one of the first people to think seriously about it was Marcus Ranum, who coined the concept of "artificial ignorance", meaning that we can remove from a big pile of logs those messages that we know to be uninteresting. But in order to do that correctly, you need to know exactly what they look like. And this is where log normalization comes in. I wrote an in-depth paper in 2004, titled "On the nature of syslog data". The officially published version claims to be "work in progress", but it still has all the juicy details.
Internally, we implemented this approach in our MonitorWare products a little later. For example, it is used inside the "Post Process Action" in WinSyslog (Michael also wrote a nice article on how to parse log messages with this action). While this was a great addition (and is used with great success), I failed to get enough community momentum to build a larger database of log messages that could serve as a basis for large-scale log normalization. One such approach, which largely failed for syslog, is the event knowledge base.
However, I did not give up on the general idea and proposed it wherever appropriate. The latest outcome of this effort is the soon-to-be-released Adiscon LogAnalyzer v3, which uses so-called message parsers to obtain useful information from log entries. Here, I hope we will be able to gain more community involvement. We have already had two message parsers contributed. Granted, that's not much, but the ability to write them is so far little known. With the release of v3, I hope we will gain more and more momentum.
The syslog-ng patterndb approach brings an interesting idea to this space: as far as I have heard (I generally do NOT look at competing code to prevent polluting my code with things that I should not use), they use radix trees to parse the log messages. That is a clever approach, as it permits much quicker matching of a message against a large number of parse templates. This makes the approach suitable for real-time normalization of an incoming stream of syslog data.
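To illustrate why a tree helps, here is a minimal sketch of the general idea. It is not syslog-ng's implementation (which I have not looked at); it uses a token-level prefix tree rather than a true character-level radix tree, and the `<name>` template syntax, the rule IDs and the class names are all made up for this example. The point is only that a message is matched by walking down the tree once, so the cost depends mainly on the message length and not on how many templates are loaded.

```python
from typing import Dict, Optional, Tuple


class Node:
    """One tree node: literal-token children plus at most one capture child."""

    def __init__(self) -> None:
        self.literal: Dict[str, "Node"] = {}
        self.capture: Optional[Tuple[str, "Node"]] = None  # (field name, child)
        self.rule_id: Optional[str] = None  # set if a template ends here


class TemplateTree:
    """Hypothetical token-level prefix tree over whitespace-separated templates."""

    def __init__(self) -> None:
        self.root = Node()

    def add(self, rule_id: str, template: str) -> None:
        node = self.root
        for token in template.split():
            if token.startswith("<") and token.endswith(">"):
                name = token[1:-1]
                if node.capture is None:
                    node.capture = (name, Node())
                elif node.capture[0] != name:
                    # Simplification: one field name per tree position.
                    raise ValueError(f"conflicting field name at {token!r}")
                node = node.capture[1]
            else:
                node = node.literal.setdefault(token, Node())
        node.rule_id = rule_id

    def match(self, message: str) -> Optional[Tuple[str, Dict[str, str]]]:
        return self._walk(self.root, message.split(), 0, {})

    def _walk(self, node, tokens, i, fields):
        if i == len(tokens):
            return (node.rule_id, fields) if node.rule_id is not None else None
        token = tokens[i]
        child = node.literal.get(token)
        if child is not None:  # prefer an exact literal match
            result = self._walk(child, tokens, i + 1, fields)
            if result is not None:
                return result
        if node.capture is not None:  # otherwise treat the token as a field value
            name, child = node.capture
            return self._walk(child, tokens, i + 1, {**fields, name: token})
        return None


tree = TemplateTree()
tree.add("ssh-accept", "Accepted password for <user> from <src_ip> port <src_port> ssh2")
tree.add("ssh-fail", "Failed password for <user> from <src_ip> port <src_port> ssh2")

print(tree.match("Failed password for alice from 192.0.2.10 port 5022 ssh2"))
```

A regex-per-template approach has to try templates one after another until one fits; the tree shares common prefixes between templates, which is what makes it attractive for a real-time stream.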
Adiscon LogAnalyzer, by contrast, uses a regex-based approach, primarily for simplicity and in an effort to invite more contributions (WinSyslog has a far more sophisticated approach). In Adiscon LogAnalyzer we began to get serious about identifying what a property actually means. While we have a fixed set of properties, with fixed semantics, in WinSyslog, MonitorWare Agent, and rsyslog, this set is rather limited. The Windows product line makes it easy to extend the set of properties, but does not provide standard IDs for them.
In Adiscon LogAnalyzer, we have fixed IDs for a larger set of properties, now about 50 or so. Still, that set is very small. But we created it with the intention of being able to map various "semantic objects" from different log entries to a single identity. For example, most firewall logs will contain a source and destination IP address, but almost all firewalls use different log message formats to convey them. Ordinarily, this means we need different analyzers to support these native formats, for example in reports. In Adiscon LogAnalyzer, we can now have a message parser "normalize" these syslog entries and map the vendor-specific format to the generic "semantic object". Thus the upper layers (like views and reports) work on these normalized semantic objects and do not need to be adapted to each firewall. The adaptation needs to be done only at the parser level.
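The firewall example can be sketched in a few lines. This is only an illustration under stated assumptions: the two vendor message formats, the parser names and the property names `src_ip` and `dst_ip` are invented here and are not the actual Adiscon LogAnalyzer parsers or property IDs. What it shows is the layering: each regex-based parser knows one native format, and the report code only ever sees the normalized properties.

```python
import re
from collections import Counter
from typing import Dict, Iterable, Optional

# Hypothetical message parsers: one regex per (made-up) vendor format.
# The named groups are the shared "semantic objects".
PARSERS = [
    ("vendor_a", re.compile(r"DENY src=(?P<src_ip>\S+) dst=(?P<dst_ip>\S+)")),
    ("vendor_b", re.compile(r"blocked connection from (?P<src_ip>\S+) to (?P<dst_ip>\S+)")),
]


def normalize(message: str) -> Optional[Dict[str, str]]:
    """Return the normalized properties of a message, or None if no parser fits."""
    for parser_name, pattern in PARSERS:
        found = pattern.search(message)
        if found:
            return {"parser": parser_name, **found.groupdict()}
    return None


def blocked_by_source(messages: Iterable[str]) -> Counter:
    """A 'report' that is written once and works for every supported firewall."""
    counts: Counter = Counter()
    for message in messages:
        event = normalize(message)
        if event is not None:
            counts[event["src_ip"]] += 1
    return counts


logs = [
    "DENY src=192.0.2.7 dst=198.51.100.9 proto=tcp",
    "blocked connection from 192.0.2.7 to 198.51.100.20",
    "link up on eth0",
]
print(blocked_by_source(logs))
```

Supporting a third firewall would mean adding one entry to `PARSERS`; `blocked_by_source` would stay untouched.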
Such a directory of semantic objects would be very useful, in my humble opinion. We are currently working on making it publicly available, all in the hope that a community will get involved ;) If we manage to get a large enough number of log and/or parser contributions, we may potentially be able to make Adiscon LogAnalyzer an even better free tool for system administrators.
And as there is hope that this will finally succeed, I have begun to think about a potential implementation inside rsyslog. It doesn't sound very hard, but still requires careful thinking. One thing I would like to see is a unified approach that covers at least rsyslog and Adiscon LogAnalyzer, and hopefully the Windows tools as well.
Another very good thing is that there already is a standard for conveying standard semantic objects: during the IETF syslog standardization effort, I pressed hard for so-called structured data elements. I managed to get them into the final RFC. These structured data elements are now the key to conveying the log information once it is normalized: the corresponding name-value pairs can easily be encoded with them.
I hope we will finally be able to succeed on this road, because I think this would be of tremendous benefit for the syslog community.
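As a rough sketch of what that encoding could look like, the following formats a set of normalized name-value pairs as a single structured data element in the RFC 5424 syntax, `[SD-ID name="value" ...]`, escaping `"`, `\` and `]` inside values as the RFC requires. The SD-ID `fw@32473` and the parameter names are placeholders chosen for this example (32473 is the enterprise number reserved for documentation); no such element has been standardized, and the function names are mine.

```python
from typing import Mapping


def escape_sd_value(value: str) -> str:
    """Escape the three characters RFC 5424 requires to be escaped in a value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("]", "\\]")


def to_sd_element(sd_id: str, params: Mapping[str, object]) -> str:
    """Format one SD-ELEMENT. Names are assumed valid and are not checked here."""
    parts = [sd_id]
    for name, value in params.items():
        parts.append(f'{name}="{escape_sd_value(str(value))}"')
    return "[" + " ".join(parts) + "]"


# Hypothetical SD-ID and property names, for illustration only.
print(to_sd_element("fw@32473", {"src_ip": "192.0.2.7", "dst_ip": "198.51.100.9"}))
# prints: [fw@32473 src_ip="192.0.2.7" dst_ip="198.51.100.9"]
```

A real implementation would also have to validate the names (the RFC limits them to 32 printable ASCII characters and excludes `=`, space, `]` and `"`), which this sketch leaves out.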