Tuesday, October 05, 2010

rsyslog now supports Hadoop's HDFS

I will be releasing rsyslog 5.7.1 today, a member of the v5-devel branch. With this version, omhdfs debuts. This is a specially-crafted output module to support Hadoop's HDFS file system. The new module was a sponsored project and is useful for folks expecting enormous amounts of data or having high processing time needs for analysis of the logs.

The module has undergone basic testing and is considered (almost) production-ready. However, I myself have limited test equipment and limited needs for and know-how of Hadoop, so it is probably be interesting to see how real-world users will perceive this module. I am looking forward to any experiences, be it good or bad!

One thing that is a bit bad at the moment is the way omhdfs is build: Hadoop is Java-based, and so is HDFS. There is a C library, libhdfs, available to handle the details, but it uses JNI. That seems to result in a lot of dependencies on environment variables. Not knowing better, I request the user to set these before ./configure is called. If someone has a better way, I would really appreciate to hear about that.

Please also note that it was almost impossible to check omhdfs under valgrind: the Java VM created an enormous amount of memory violations under the debugger and made the output unusable. So far I have not bothered to create a suppression file, but may try this in the future.

All in all, I am very happy we now have native output capability for HDFS as well. Adding the module also proved how useful the idea of a rsyslog core paired up with relatively light-weight output/action modules is.

No comments:

Busy at the moment...

Some might have noticed that I am not as active as usual on the rsyslog project . As this seems to turn out to keep at least for the upcomi...