Centralized system and LSF logging on a Turing Pi system

April 8, 2024

I love high performance computers, and some of my best friends work in high performance computing (HPC). Obviously, sometimes we also talk about logging. Recently we not just talked, but I also helped Gábor in his first steps with syslog-ng. He summarized his experiences in a blog:

Logs are one of those indispensable things in IT when things go wrong. Having worked in technical support for software products in a past life, I’ve likely looked at hundreds (or more) logs over the years, helping to identify issues. So, I really appreciate the importance of logs, but I can honestly say that I never really thought about a logging strategy for the systems on my home network - primarily those running Linux.

One of my longtime friends, Peter Czanik, who also works in IT, happens to be a logging guru as well as an IBM Champion for Power Systems (yeah!). So it’s only natural that we get to talking about logging. He is often complaining that even at IT security conferences people are unaware of the importance of central logging. So, why is it so important? For security it’s obvious: logs are stored independently from the compromised system, so they cannot be modified or deleted by the attacker. But central logging is beneficial for the HPC operator as well. First of all, it’s availability. You can read the logs even if one of your nodes becomes unreachable. Instead of trying to breath life into the failed node, you can just take a look at the logs and see a broken hard drive, or a similar deadly problem. And it is also convenience, as all logs are available at a single location. Logging into each node on the 3 node cluster to check locally saved logs is inconvenient but doable. On a 10 node cluster it takes a long time. On a 100 node cluster a couple of working days. While, if your logs are collected to a central location, maybe a single grep command, or search in a Kibana or similar web interface.

Those who follow my blog will know that I’ve been tinkering with a Turing Pi V1 system lately. You can read my latest post here. For me, the Turing Pi has always been a cluster in a box. My Turing Pi is fully populated with 7 compute modules. I’ve designed Node 1 to be the NFS server and LSF manager for the cluster. LSF is a workload scheduler for high-performance computing (HPC) from IBM. Naturally I turned to Peter for his guidance on this, and the result is this blog. Peter recommended that I use syslog-ng for log aggregation and also helped me through some of my first steps with syslog-ng. And the goal was to aggregate both the system (syslog) as well as LSF logs on Node 1. TL;DR it was easy to get it all working. But I encourage you to read on to better understand the nuances and necessary configuration both syslog-ng and LSF that was needed.

Read the rest at: https://www.gaborsamu.com/blog/turingpi_syslog-ng_lsf/

syslog-ng logo