Wondering what people are using for automated management by exception of their systems? At the moment I have 14 different Tasmota/Sonoff devices scattered around doing various tasks. Some of them only do something once a day (or once a night, etc.). It would be nice to have a system that monitors whether their tasks have completed and/or whether the device is alive, and reports any failures or deviations.
Are people using off-board tools for this (MRTG, Nagios, etc.) or specialised flows?
My "main rpi" is responsible for monitoring and checking all the other distributed units (RPis, ESPs, etc.). All those distributed units provide various services to my main home automation. Some services are expected to provide data at regular intervals; others run on demand or just execute commands at certain dates and times or in certain situations.
To be sure they are all alive and healthy, I communicate with them using MQTT, either by sending out commands and expecting valid answers back or by just awaiting expected data at regular intervals. A service may be a Node-RED flow, a Python script, an application, etc.
All services will also try to "heal themselves" by restarting if something goes wrong and, if that does not succeed, finally reboot the device itself, hopefully resolving the issue. So far this has worked as I expected.
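For anyone wanting to try the same restart-then-reboot escalation in a Python script, here is a minimal sketch. It is not the actual code running on these devices: `heal`, `is_healthy`, `restart_service`, `reboot_device` and `max_restarts` are all hypothetical names, and how you really check health or trigger a reboot depends on your own setup.

```python
def heal(is_healthy, restart_service, reboot_device, max_restarts=3):
    """Escalating self-heal: restart the service a few times, then reboot.

    All three callables are hypothetical hooks supplied by the caller.
    Returns "healthy", "restarted" or "rebooted".
    """
    if is_healthy():
        return "healthy"

    # First line of defence: restart the service and re-check its health.
    for _ in range(max_restarts):
        restart_service()
        if is_healthy():
            return "restarted"

    # Restarting did not help, so fall back to rebooting the device.
    reboot_device()
    return "rebooted"
```

In practice `restart_service` might call something like `systemctl restart ...` and `reboot_device` a system reboot, but that part is left to you.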
To monitor all this, my "main rpi" has a flow as below. Every service out there has a dedicated trigger node that is "energized" regularly by the responses coming from that service. If the responses stop, the trigger node fires an event message that is sent out via Telegram.
An overall system status is captured using the "status node". I use this in my GUIs, where it provides a visual indicator of the overall system status.
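If you would rather do this outside Node-RED, the same idea (one watchdog timer per service plus an overall status) can be sketched in plain Python. This is only a rough equivalent of the trigger and status nodes, not a copy of the flow: the class name `ServiceWatchdog`, its methods and the timeout values are made up for illustration. You would call `feed()` from your MQTT message handler and `check()` on a timer, sending whatever it returns to Telegram or similar.

```python
import time


class ServiceWatchdog:
    """Tracks when each service was last heard from (hypothetical helper)."""

    def __init__(self, timeouts, clock=time.monotonic):
        # timeouts: service name -> maximum seconds of silence allowed.
        self.timeouts = dict(timeouts)
        self.clock = clock
        now = clock()
        self.last_seen = {name: now for name in self.timeouts}
        self.alerted = set()

    def feed(self, name):
        """Record a response from a service, re-arming its timer."""
        if name not in self.timeouts:
            return  # ignore services we are not monitoring
        self.last_seen[name] = self.clock()
        self.alerted.discard(name)

    def check(self):
        """Return the services that have gone silent since the last check.

        Each silent service is reported once, until it is fed again.
        """
        now = self.clock()
        newly_silent = []
        for name, timeout in self.timeouts.items():
            silent = now - self.last_seen[name] > timeout
            if silent and name not in self.alerted:
                self.alerted.add(name)
                newly_silent.append(name)
        return newly_silent

    def all_ok(self):
        """Overall status: True only if every service is within its timeout."""
        now = self.clock()
        return all(now - self.last_seen[n] <= t for n, t in self.timeouts.items())
```

The `clock` argument is only there so the timing can be tested without actually waiting; in normal use you can leave it at the default.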
To monitor that the "main rpi" Node-RED flow itself is working, it simply sends out a heartbeat that is monitored and reported on by one of the other distributed RPis, closing the loop, so to speak.
Some time ago I posted a flow to create a network map. It does not meet your requirements, but it is an idea. If it is only the Sonoffs you are interested in, you could use the LWT for device availability.
MQTT already has Last Will and Testament (LWT) baked into the protocol for exactly this purpose. The sensor lodges a message with the broker, which the broker then sends if the sensor goes offline unexpectedly.
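As a rough illustration of using LWT for availability, here is a small Python sketch that keeps a table of which devices are online. It assumes the default Tasmota topic layout, where each device publishes `Online` or `Offline` to `tele/<device>/LWT`; if you have changed the FullTopic you will need to adapt the parsing. The function names are hypothetical, and the MQTT client itself is left out: you would subscribe to `tele/+/LWT` with whatever client you use and pass each message's topic and payload in.

```python
def update_availability(state, topic, payload):
    """Update the device -> online map from one LWT message.

    Assumes Tasmota's default layout: tele/<device>/LWT with a payload
    of "Online" or "Offline". Returns the device name, or None if the
    topic is not an LWT topic in that layout.
    """
    parts = topic.split("/")
    if len(parts) != 3 or parts[0] != "tele" or parts[2] != "LWT":
        return None

    # MQTT clients often hand over the payload as bytes.
    if isinstance(payload, bytes):
        payload = payload.decode("utf-8", "replace")

    device = parts[1]
    state[device] = payload.strip().lower() == "online"
    return device


def offline_devices(state):
    """Return the names of all devices currently marked offline."""
    return sorted(name for name, online in state.items() if not online)
```

Because the broker publishes the will on the device's behalf, the "Offline" message arrives even when the device loses power and cannot say anything itself.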