calling for log samples!

Now I join those mass of people who are asking for log samples. But I do for a good reason :) Also, I do not need a lot, a single log message works well for my needs. I need them to improve rsyslog so that the parser can even better handle exotic message formats. So the short story is if you have a syslog message, please provide it to me.

And here is the long story:

One of the strength of rsyslog is that it is very much focused on standards. That also means it tries to parse syslog messages according to the relevant RFCs. Unfortunately, syslog has been standardized only recently and so there is no real standard for what to expect inside the header. So rsyslog strength is also its weakness: if messages are ill-formed, results are often suboptimal.

I am working around this by doing smart guesswork inside the legacy syslog parser. However, every now and then some folks pop up with problems. And, more importantly, some others do not even ask. On my twitter account, I recently saw one such frustration. In that case, timestamps were duplicated. I guess that was caused by something unexpected inside the timestamp. However, I was not able to get down to the real problem, because I did not have access to the raw message. That’s an important point: I need the raw message content, not what happens to usually be in the logfile. The later is already parsed, processed and recombined, so it does not tell me what the actual message is. But I need the actual message to improve the parser.

What I would like to do is create a very broad test suite with a vast amount of real-life syslog formats. The message text itself is actually not so important to me at this stage. It is the header format. If I get this, I’d like to analyze the different ways in which the format is malformed and then try to find ways to implement it inside the parser. If I find out that I can not detect the right format in all cases automatically, I may find ways to configure the different formats. The end result, I hope, will be far more plug-and-play message detection, something that should be of great benefit for all users.

Please contribute your logs! I need logs from many different devices, with many different versions. But I need only a few lines from each one. For each individual contributor, there is not a lot of effort required. Even a single log line would be great (ten or so be even greater). Just please don’t mangle the logs and provide me with raw log messages. That’s probably the hardest part. One way to do it is to sniff them off the wire, for example with WireShark. Another way is to use rsyslog itself. All you need is a special template and an output file using it:

$template rawmsg,”%rawmsg%n”
*.* /path/to/raw-file.log

Add this to your rsyslog.conf, restart rsyslog, make the device emit a few lines and mail me the result to rgerhards@gmail.com. You may also simply post the log sample to the sample log thread on the rsyslog forum – whatever you prefer. After you have done that, you can remove the lines from rsyslog.conf again. Before you mail me, it is a good idea to check if there is any sensitive information inside the log file. Feel free to delete any lines you have, but I would appreciate if you do not modify line contents. Also, it would be useful for me if you let me know which device, vendor and version produced the log.

I hope that you can help me improve the rsyslog parser even more. Besides, it will probably be a very interesting experiment to see how different syslog messages really are.

Thanks in advance for all contributions. Please let them flow!

Rainer

normalizing Apache Access logs to JSON, XML and syslog

I like to make my mind up based on examples, especially for complex issues. For a discussion we had on the CEE editorial board, I’d like to have some real-world example of a log file with many empty fields. An easy to grasp, well understood and easy to parse example of such is the Apache access log. Thanks to Anton Chuvakin and his Public Security Log Sharing Site I also had a few research samples at hand.

Apache common log format is structured data. So there is no point in running it through a free-text normalization engine like liblognorm. Of course it could process the data, but why use that complex technology. Instead, the decoder is now part of libee and receives a simple string describing which fields are present. It’s called like this:

$ ./convert  -exml -dapache -D “host identity user date request status size f1 useragent” < apache.org > apache.xml

Options specify encoder and decoder, and the string after -D tells the convert field names and order. But now let’s speak the input and output for itself:

The converter works by calling the decoder, which creates an in-memory representation of the log format in a CEE-like object model. Then, that object model and the called-for encoder is used to write the actual output format. That way, the conversion tool can convert between any structured log format, as long as the necessary encoders and decoders are available. This greatly facilitates log processing in heterogeneous environments.

Note that liblognorm works similar and, from libee’s point of view, can be viewed at as an decoder for unstructured text data (or, more precisely, for text data where the structure is well-hidden ;)).

log normalization – first results

At the beginning of this week I was pretty confident, that I would not make my self-set deadline of one month to implement a first rough proof of concept of liblognorm, a log normalizing library. Fortunately, I made extremely good progress the past two days and I am now happy to say that I have such a proof of concept available. All of this can be seen by pulling from Adiscon’s public git server: you need libestr, libee and liblognorm to make it work.

Right now, I’d like to provide a glimpse at how things work. Thanks to Anton Chuvakin and his Public Security Log Sharing Site I got a couple of examples to play with (but I am still interested in more lag samples, especially from Cisco devices). Out of the many, I took a random messages.log file written by sysklogd. This is my input file and can be seen here.

To normalize events, liblognorm needs to know which fields are present at which positions of the input file. It learns this via so-called “samples”. Samples are very similar to the patterns used by virus scanners: like virus patterns describe how a specific virus looks, log samples describe how a specific log line looks. Other than virus patters, I have crafted a format hopefully easy (enough) to understand by sysadmins, so that everyone can add relevant samples himself. To support this, samples look relatively similar to actual log lines, and this is the reason I have termed them “log samples”. Like log files, samples are stored in simple text files. For the initial test, I used a a very small set of samples, available here. A production system will have many more samples, and I envision systems that have many (ten?-) thousand of samples loaded at the same time. If you look at the samples, take special care about entities enclosed in ‘%’ – these are field definitions, the rest is literal text.

The actual normalization is performed by the libraries engine, which parses log lines, based on the samples, into fields. This creates an in-memory representation of the event, which can than be processed by the driving application or be written to some other media or the network.

Liblognorm will come with a small toll called “the normalizer”. It is a minimal library user: it loads a sample database and reads log lines from standard input, creates the event in-memory representation and then writes this representation to standard output in a standardized format. So far, it supports formats as they are expected for the upcoming CEE standard.

The result of a normalizer run on my test input file based on the provided sample base can be seen here. The output is actually a bit more verbose than described above, because it lists the to-be-normalized line as well. If you look at the properties I extracted, you’ll probably notice that some do not make too much sense (maybe…). Also, a classification of the message is missing. Don’t care about these aspects right now: it’s a proof of concept and these things will be addressed by future development (the classification, for example, will be based on CEE taxonomy via tags).

I hope I was able to convey some of the power that is available with liblognorm. Of course, a “little bit” of more work and time will be required to get it production-ready. Unfortunately, I will be unavailable for larger parts of the next two weeks (other work now pressing plus a long-awaited seminar ;)), but I will try to get liblognorm as quickly as possible into the best shape possible. In the meantime, if you like, feel free to have a look at its code or play with it. All of what I wrote can actually be done with the versions available in git.

Call for Log Samples

My log normalization effort made good progress and I have a very rough first proof of concept available. It will take a log sample database, and transform input log files to a CEE-like output format.
Now I am looking at ways to practice-test it. So I’d appreciate if you could point me to some sources of log files. It mustn’t be terabytes, but they should be anonymized and be usable in the public Internet. For obvious reasons, it would be good if they are from widely deployed devices.
I would use a subset of these samples to extract usable sample database entries and see how the run through the normalizer.
Thanks,
Rainer

libee status update

Yesterday I reached some important milestones:
  • defined an internal simple event representation format “int” to create test cases
     http://doc.libee.org/classee__int.html
  • wrote a decoder “int” -> CEE
  • wrote a syslog encoder CEE->syslog
    (with enough escapes to be testable, but without all details)
  • wrote a small program (int2syslog) that ties these pieces togetherand can be used to visualize test cases in syslog representation
 All is available from the libee git at http://git.adiscon.com/?p=libee.git;a=summary

introducing libestr

I tried to avoid this, but it looks like I need to start a new project. I need a simple counted-byte string “class” for the CEE effort. All of that is already present in rsyslog, but using it’s runtime library is overkill. I could have copied the string code, but I really don’t like to have multiple copies of the same code around. So I’ll now create a new small library just for that purpose. Well, while I am at it, I’ll probably also add basic hashtable support, as this can help keep libee small for cases when no XML-specific functionality is desired. And I don’t want to start yet another library for that (we already have lib inflation… ;)).

I named it libestr, bare of a better name. I thought a while about the name, but could not find anything really decent. “estr” means “essentials for string handling” and is probably descriptive enough. Quite honestly, I really like to gain some speed in coding again instead of just creating skeletons and thinking about names…

logging and the C NUL problem

Again, I ran into the “C NUL Problem”, that is the way C strings are terminated. Unfortunately, the creators of C have represented strings as variable arrays of char without an explicitely-stated size. Instead of a size property, a C string is terminated by an US-ASCII NUL character (”). This works well enough in most cases, but has one serious drawback: the NUL character is a reserved character that cannot be part of any C string. So, for example strlen(“AB”) equals one and not three as one would usually expect.

CERT has a good presentation of some of the more important problems associated with the standard C string handling functions. I do not intend to reproduce this here or elaborate on any further details except that we get into big trouble if NUL characters are used e.g. in logging data sets. We had this problem in the IETF syslog WG, where we permited NUL to be part of the syslog message text, but permitted a receiver to escape it. This is far from being an ideal solution, but we considered it good enough, especially as it permits to keep compatible with existing toolset libraries.

Now, in CEE, we face the same challenge: the problem is if the in-memory representation of event fields should permit NUL characters. The correct technical answer to this question is “yes, of course”, but unfortunately it has a serious drawback that can affect adoption: if NULs are permited, none of the string handling functions of the C runtime library can be used. This is, as said above, because the C runtime library is not able to handle NULs inside “standard” C strings. A potential solution would be to escape NULs when they enter the system. However, that would require an additional reserved character, to do the escaping. So in any case, we’ll end up with a string that is different from what the “usual” runtime library routines expect.

Of course, this problem is not new, and many solutions already have been proved. The obvious solution is to replace the standard C string handling functions with octet-counting functions that do not require any reserved characters.

A short, non- weighted list of string replacement string libraries is:

Note that some of them try to mimic standard C strings as part of their API. I consider this highly dangerous, because it provides a false sense of security. While the library now can handle strings with included NUL characters (like “AB”), all parts of the string after the first NUL will be discarded if passed to a “regular” C runtime library string function (like printf!). So IMO this is a mis-feature. A replacement library must explicitely try to avoid compatibility to the C runtime library in order to safe the user from subtle issues, many of them resulting in security problems (think: information hiding).

Unfortunately, I could not identify any globally-accepted string replacement library that is in widespread use.. Despite its deficits, C programmers’  tend to use the plain old string functions present in the standard C runtime library.

So we are back to the original issue:

If CEE supports NUL characters inside strings, the C standard string library can not be used, and there are also problems with a potentially large number of other toolsets. This can lead to low acceptance rate.

But if CEE forbids NUL characters, data must be carefully asserted when it enters the system. Most importantly, a string value like “AB” must NOT be accepted when it is put in via an API. Experience tells that implementors sometimes simply overlook such restrictions. So this mode opens up a number of very subtle bug (security) issues.

I am very undicided which route is best. Obviously, a sound technical solution is what we want. However, the best technical solution is irrelevant if nobody actually uses it. In that light, the second best solution might be better. Comments, anyone?