Hi,
I hope this reply doesn't confuse threaded readers; I only get the digest.
Message: 2
Date: Wed, 13 May 2020 18:00:39 +1000
From: Russell Coker <russell@coker.com.au>
To: luv-main@luv.asn.au
Subject: server stats collection
Message-ID: <14453232.tgyrW5GtJc@liv>
Content-Type: text/plain; charset="us-ascii"
https://www.datadoghq.com/
I want to do something like what DataDog does, but with free software. The
aim is to address the LUV server load average issue as well as other similar
things. Below is a bunch of links to things I'm considering. I welcome
comments about any of the below or general comments about the issue that don't
reference the below stuff. So if you have some experience to report and don't
want to bother reading the below then please let me know.
<snip>
In $LIFE-1 we used Ganglia for this type of monitoring on our HPC cluster (CentOS). You install the monitor daemon on your nodes, VMs, etc., and Ganglia creates a series of graphs, usually with load-1 (the one-minute load average) as the top-level view across all monitored machines. You can then drill down to each node to see more detail and other metrics. It monitors a bunch of stuff out of the box and is fairly extensible if you want metrics that aren't among the defaults. It uses RRDtool to store the data and create the graphs.
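
For what it's worth, here is a rough sketch of what adding a non-default metric can look like. It assumes the gmetric command-line tool that ships with Ganglia is installed on the node and that its long options (--name, --value, --type, --units) behave as I remember them. The metric name "custom_load_one" and the helper function names are made up for illustration, so treat this as a starting point rather than a tested recipe.

```python
import os
import shutil
import subprocess


def build_gmetric_command(name, value, metric_type="float", units=""):
    """Build the argument list for one gmetric call (nothing is executed here)."""
    cmd = ["gmetric", "--name", name, "--value", str(value), "--type", metric_type]
    if units:
        cmd += ["--units", units]
    return cmd


def publish_load1():
    """Send the 1-minute load average as a custom metric.

    Returns (command, sent). If gmetric is not on PATH, nothing is sent.
    """
    load1, _, _ = os.getloadavg()  # Unix only
    # "custom_load_one" is a hypothetical metric name, not a Ganglia default.
    cmd = build_gmetric_command("custom_load_one", round(load1, 2))
    if shutil.which("gmetric") is None:
        return cmd, False
    subprocess.run(cmd, check=True)
    return cmd, True


if __name__ == "__main__":
    command, sent = publish_load1()
    print("sent" if sent else "gmetric not found, would have run", " ".join(command))
```

Run from cron every minute or so, something like this should make the new metric show up alongside the defaults for that host.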
The first few of the links you included looked like slightly more modern variations on the same idea.
Ganglia may or may not be useful for this, but it seemed worth mentioning.
--