Roll out disk drive health monitoring and make it part of SOE
-
ACTION: test smartctl drive monitoring with... grafana? http://uccmonitor.ucc.asn.au:3000/d/PkPI4xGWz/s-m-a-r-t-dashboard - proof-of-concept tested 2021-09-27: @nick @mtearle
- https://matrix.to/#/!zAfheZzGazlYUQqAeJ:ucc.asn.au/$HuKyvV8eVoTXKah1Ua3hwR9jWyodlIt2P1iO4upAPmE
-
/etc/cron.d/node_prometheus-SMART-export:
*/5 * * * * root /usr/local/bin/smartmon.sh > /var/lib/node_exporter/smart_metrics.prom
-rwxr-xr-x 1 nick wheel 11287 Sep 27 19:49 /usr/local/bin/smartmon.sh
-
ACTION: make it standard/SOE config in https://gitlab.ucc.asn.au/ucc-systems/ansiblemonitoring
- and/or do some rewriting/tweaking of the metric collector and/or dashboard
- central temperature summary graph, good for summer
- alerts
-
ACTION: roll out more: @jimbo volunteered 2021-10-10 to try the ansible monitoring rollouts
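
Possible tweak for the SOE version of the cron job: the proof-of-concept entry redirects smartmon.sh straight into smart_metrics.prom, and the shell truncates that file before the script has written anything, so a scrape that lands mid-run could see an empty or partial file. A minimal sketch of the usual workaround (write to a temp file in the same directory, then rename) is below. The function name write_prom_atomically and the idea of wrapping it in a small script are assumptions for illustration, not something already deployed.

```shell
#!/bin/sh
# Hypothetical helper (not part of the current setup): run a metrics command
# and publish its output with a rename, so readers never see a partial file.
# Usage: write_prom_atomically DEST COMMAND [ARGS...]
write_prom_atomically() {
    dest=$1
    shift
    # Temp file lives next to DEST so mv is a rename on the same filesystem.
    # Its name does not end in .prom, so the textfile collector skips it.
    tmp=$(mktemp "${dest}.XXXXXX") || return 1
    if "$@" > "$tmp"; then
        # mktemp creates the file mode 0600; make it readable for the exporter.
        chmod 644 "$tmp"
        mv "$tmp" "$dest"
    else
        # Command failed: keep the previous metrics file, drop the temp file.
        rm -f "$tmp"
        return 1
    fi
}

# Example call, using the paths from the cron entry:
# write_prom_atomically /var/lib/node_exporter/smart_metrics.prom /usr/local/bin/smartmon.sh
```

If smartmon.sh fails, the previous metrics file stays in place rather than being emptied. That also means stale values linger until the next good run, which may matter when alerts are added.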