Planning disaster recovery

ianmacs · 12 October 2019 23:39

I find that I am slowly automating more and more things in my home with node-red. I use a node-red project (i.e. git), commit and push the repo to my home server (albeit only every few weeks), and thought that by doing this I was sufficiently prepared for hardware failure.

Until today I have not been using node-red for anything overly critical. Node-red controls my kitchen radio, my TV and video recorder, reads and stores all data from my central heating, monitors my freezer, displays the outdoor temperature, detects presence, and can notify me by SMS when certain conditions occur. Besides node-red, my system uses mosquitto, influx, grafana (but also dashboard), multiple Tasmota devices, and other ESP8266 devices with self-written firmware.

The "nothing overly critical" part changed today, when I replaced the analogue room thermostats for some rooms with self-made hardware controlled by node-red, mqtt, and ESPs. Since today, if node-red or mosquitto should fail, then some rooms cannot be heated anymore until corrective measures are taken.

As if to remind me of the importance, when I rebooted the Raspberry today that runs node-red and mosquitto, this Raspberry failed to boot back up. That was when I realized that my system requires a bit more than just a git clone and an npm install. E.g. I need to enable 1wire on the raspberry and install/configure the gammu sms daemon as well as mosquitto. Probably some more steps that I have not thought of yet.

Furthermore, since all ESP devices around the house want to connect to the mosquitto by either hostname or IP, the replacement server needs to respond to the same hostname and IP as the previous server.

I avoided having to do a full recovery by running fsck on the Raspberry's SD card and the attached SSD. It booted again after that. But now I know that I should have a plan, i.e. a comprehensive backup and instructions for doing a restore. Ideally I should have a replacement Raspberry that automatically takes over when the original Raspberry is unreachable for some time.

If you have precautions like that already in place, please share. I've seen Node Redundancy but this is not where I want to go. I plan to share here what I will eventually use. I imagine it will involve Docker, and project and backup monitoring.

miker256 · 13 October 2019 00:49

I found this video to be a good resource for creating a recovery plan.

Paul-Reed · 13 October 2019 09:08

Only if you are running HASS.IO ...
@ianmacs makes no mention of it, so doubtful that it would be any use.

TotallyInformation · 13 October 2019 10:49

I'm not going to talk about Docker; I've made my personal feelings on the issues it brings known previously. You need to balance those against any benefits.

What I would say is that, if you need resilience - and I suggest that you really do - you need to look at more than a single Pi.

I think that you have two basic approaches. I'm sure other people can chip in with other approaches.

One would be to run a cluster - which would be fine if everything you were running were cluster aware and so would continue to work if one of the Pi's in the cluster fails. However, I don't believe this is very feasible with the collection of software that you are using (which, incidentally is very similar to mine).

The alternative is to run two independent instances of everything on two separate Pi's and then get them to work in concert. This is not hard to do and doesn't require any fancy OS level software but does require some careful planning out of how things communicate.

With this approach, you need the Pi's to have their own identities/IP addresses and to set up Node-RED, Mosquitto and InfluxDB on both.

Then you need flows on both that detect whether Node-RED is still able to talk to the outside world. You can use these to auto-reboot the device which may fix whatever the problem was.
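
As a rough illustration of the "can I still talk to the outside world" check, here is a minimal Python sketch rather than an actual Node-RED flow. The function names, the timeout, and the idea of probing a list of host/port pairs are all assumptions made for the example; which targets to probe, and whether a failure should really trigger a reboot, is up to you.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution errors.
        return False


def outside_world_ok(targets, min_ok: int = 1) -> bool:
    # Require at least `min_ok` reachable targets before calling the link
    # healthy, so one unreachable host alone does not trigger a reboot.
    return sum(can_reach(host, port) for host, port in targets) >= min_ok
```

Probing more than one target (for example the router and the other Pi's broker port) helps distinguish "my network is down" from "that one host is down".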

You should also have flows from each NR instance that give an MQTT heartbeat on both instances of Mosquitto, along with a last will and testament so that the broker itself will mark Node-RED as being offline. This is important as it means the other instance of Node-RED can detect when the first goes offline reliably without any other software or hardware. This lets you send out your alerts - personally, I use Telegram rather than SMS. If you want to get clever, you could use one of the circuits you can find online that will let you cut power and forcibly restart the other Pi though there are, of course, some risks with that.
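
The detection side of this can be sketched in a few lines of Python. This only models the bookkeeping (no broker is attached), and the names `PeerStatus`, `on_heartbeat`, `on_last_will` and the 30-second timeout are illustrative assumptions, not part of any library. In a real setup the two callbacks would be driven by MQTT messages on the heartbeat and last-will topics.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PeerStatus:
    """Tracks whether the other Node-RED instance should be considered online."""
    timeout_s: float = 30.0            # assumed: e.g. three missed 10 s heartbeats
    last_seen: Optional[float] = None  # time of the last heartbeat, if any
    will_received: bool = False        # True once the last-will message arrived

    def on_heartbeat(self, now: float) -> None:
        # A fresh heartbeat clears any earlier "offline" last-will message.
        self.last_seen = now
        self.will_received = False

    def on_last_will(self) -> None:
        # The broker publishes the will when the peer's connection drops uncleanly.
        self.will_received = True

    def is_online(self, now: float) -> bool:
        if self.will_received or self.last_seen is None:
            return False
        return (now - self.last_seen) <= self.timeout_s
```

Combining both signals means a crashed Node-RED is noticed quickly via the will, while a hung one that keeps its connection open is still caught by the heartbeat timeout.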

Then you need to decide how far you want to go with resilience. The above is fairly easy to set up and work with and keeps things simple. It is easy to extend such that both Pi's will normally be recording data should you wish that. But all it really does is give you notifications and lets you reboot the offline device.

The next step would be to go for a simple failover approach. In this scenario, you designate a "main" Pi. The alternate monitors the main Pi and, if it goes offline, probably tries to restart it first. If that fails, it runs a script that makes it the "main" device and swaps from running the monitoring flow to the live flow, along with firing up the supporting services. This is probably where Docker can help. You will lose some data with this approach since you haven't got a sync'd copy of your InfluxDB data. It is probable that you could fix that, but I'm not sure if a normal Pi would have the resources to run a suitable Influx cluster. I also don't know how difficult it is to set up. In any case, such data is generally a nice-to-have rather than a critical element. Doing a periodic copy (nightly maybe) of the Influx data to the backup Pi may help with that.

Other than getting the failover script right, this again is fairly straightforward to achieve and again requires no particularly special software or knowledge.
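
The decision the backup Pi has to make on each monitoring cycle can be written down as a small function. This is only a sketch of the "restart first, then take over" logic described above; the `Action` names and the `max_restarts` parameter are made up for the example, and the actual restart and promotion steps are left to your failover script.

```python
from enum import Enum


class Action(Enum):
    NONE = "none"             # main Pi is healthy, keep monitoring
    RESTART_MAIN = "restart"  # try to recover the main Pi first
    PROMOTE_SELF = "promote"  # give up on the main Pi and take over


def next_action(main_online: bool, restarts_tried: int, max_restarts: int = 1) -> Action:
    """Decide what the backup Pi should do on one monitoring cycle."""
    if main_online:
        return Action.NONE
    if restarts_tried < max_restarts:
        return Action.RESTART_MAIN
    return Action.PROMOTE_SELF
```

The caller would increment `restarts_tried` after each restart attempt and reset it to zero once the main Pi is seen online again.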

On the ESP hardware side of things, you will want to set them up so that they talk to both MQTT brokers and have flows that detect when they go offline so that you can take action. You should also make them restart themselves once if they can't reach the broker, but don't let them go into a reboot loop as that will kill them quite quickly.
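
The "restart once, but never loop" rule is easy to get wrong, so here is the logic as a Python sketch. Real ESP firmware would be written in C/C++ or configured in Tasmota; the class name, the `max_failures` threshold and the return strings are assumptions for illustration only. The important detail is that the `restarted_once` flag has to be stored somewhere that survives the reboot.

```python
from dataclasses import dataclass


@dataclass
class ReconnectPolicy:
    max_failures: int = 5         # assumed: failed attempts before one restart
    failures: int = 0
    restarted_once: bool = False  # must survive the reboot (e.g. flash or RTC memory)

    def on_connect_result(self, ok: bool) -> str:
        """Return 'ok', 'retry' or 'restart' for one connection attempt."""
        if ok:
            # A successful connection re-arms the single restart.
            self.failures = 0
            self.restarted_once = False
            return "ok"
        self.failures += 1
        if self.failures >= self.max_failures and not self.restarted_once:
            self.restarted_once = True
            self.failures = 0
            return "restart"
        # Either still below the threshold, or the one restart is already used up.
        return "retry"
```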

There are lots more things that could be done but this post is already quite long.

You mention a single hostname and IP, though. To touch on that: I don't think that will help you massively here, since you would really need a reverse proxy to handle proper failover, and that becomes another element that you need to make resilient - it has to run somewhere. You would need to proxy at least Node-RED (http(s) and websockets) and MQTT. Proxying the InfluxDB wouldn't help unless both instances have the same data.

One other thing. Don't forget that it isn't just the Pi and its services that might fail. WiFi is another element that can commonly fail. Your Pi's should be hard wired (switches rarely fail) and should monitor WiFi availability, reporting on failure. I would also recommend using a separate access point rather than one built into a router.

So to summarise:

- You need 2 Pi's or at least something else to run at least Node-RED and MQTT.
- A minimum setup would simply report when the main Pi goes offline. It might trigger a restart of that Pi (e.g. via a SONOFF switch).
- An extension to that would be a failover script that turns the backup into the primary (some Influx data will be lost).
- You can go further but it gets very complex, very quickly.
- You need to also report on WiFi, not just the Pi and its services. Again, you could have a SONOFF or similar switch on your WiFi AP to restart it. Truly resilient WiFi is very hard.

And finally, many Pi related issues are either caused by poor power supplies or by overloading the Pi. Get a really good power supply and put your router, switches, AP and Pi's on a filtered power supply - a PC UPS is ideal, but at least a reputable make of a protected extension board. On the Pi, disable the GUI desktop and remove any software that isn't really needed.

kuema · 14 October 2019 05:40

~~Some interesting thoughts related to this topic were also discussed here:~~

https://discourse.nodered.org/t/thoughts-about-continuity/15441?u=kuema

The topic I linked to is available in the Lounge area only. I'll keep this post for continuity.

ianmacs · 14 October 2019 11:13

@kuema The link produces "Oops! That page doesn’t exist or is private."

@TotallyInformation Thanks, very useful thoughts.

I think that additionally I want this type of redundancy mainly for the critical parts of the home automation, and have ordered 2 new Raspberries to extract the critical parts of the setup to a redundant setup, while keeping the non-critical parts on the existing setup.

Will extend this topic with more details when I set up the new Raspberries.

kuema · 14 October 2019 11:18

Oops, indeed. I just saw this topic has been posted in the Lounge area. Sorry for that.

ianmacs · 22 October 2019 18:49

My new Raspberries are here. I have set them up with static IP addresses to remove the DHCP server as a possible single point of failure. They are connected with Ethernet cable to the central home switch. I'm still in the process of setting up the failsafe system. The following is what I plan to implement next:

Both Raspberry Pis will receive the same node-red + influxdb + mosquitto installation:

- node-red flows are shared via project (git).
- influxdb data is not shared; the flows must be able to cope with no history present.
- mosquitto state is also not shared; processes must be able to function or start without persistent commands present.

Linux allows assigning multiple IP addresses to a single network device. The devices will have static IP addresses x.x.x.254 and x.x.x.253, but other devices will refer to them by IP address x.x.x.252. The active device adds this IP address to its network card.
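
For reference, adding and removing such an extra address is done with the `ip addr add` and `ip addr del` commands from iproute2. The Python wrapper below is only a sketch: the function names are made up, and the `/24` prefix length and the `eth0` device name are assumptions that need to match the actual network.

```python
import subprocess
from typing import List


def floating_ip_command(action: str, address: str,
                        prefix_len: int = 24, device: str = "eth0") -> List[str]:
    """Build the iproute2 command that adds or removes the shared address."""
    if action not in ("add", "del"):
        raise ValueError("action must be 'add' or 'del'")
    return ["ip", "addr", action, f"{address}/{prefix_len}", "dev", device]


def apply_floating_ip(action: str, address: str) -> None:
    # Needs root privileges; raises CalledProcessError if the ip command fails.
    subprocess.run(floating_ip_command(action, address), check=True)
```

With a placeholder address, `floating_ip_command("add", "192.0.2.252")` yields the equivalent of `ip addr add 192.0.2.252/24 dev eth0`.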

One of the devices, x.x.x.254, is the designated active device. On this device, node-red starts on boot, while on x.x.x.253, node-red will be started only when it takes over because the .254 device has failed. A watchdog for this condition will be implemented with shell script periodically executed by cron on .253.

.254 will be powered through a chain of two Tasmota-enabled power plugs:

Mains -> Tasmota1 -> Tasmota2 -> Raspberry .254 Power supply.

Both Raspberry Pis will also span a WiFi network of their own with hostapd. Raspberry Pi .254 will provide SSID 254, Tasmota2 connects to this WiFi. Raspberry Pi .253 will provide SSID 253, to which Tasmota1 connects.

Switching off any one of the Tasmotas ensures that Pi .254 is switched off and does not interfere when .253 takes over.

Tasmota2 shall be configured to switch off if it does not receive a steady stream of keep-alive messages from Pi .254, with only enough leeway to allow performing a reboot. Pi .254 will send these keep-alive messages as long as it thinks it is healthy. I hope this can be realized with off-the-shelf Tasmota images; otherwise I will have to write my own firmware for this device.

Pi .253, in its cron-driven watchdog process, also periodically checks if .254 can be reached and if .254 thinks it is healthy. If it does not receive positive responses for some period of time (enough to allow for a reboot), it will deactivate Tasmota1, assign the additional .252 IP address to its own Ethernet connection, and start node-red, thereby taking over control of the central heating and other critical stuff. Pi .253 is directly connected to mains without Tasmota power plugs in between.
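
Because cron starts a fresh process for every check, the watchdog has to remember the number of consecutive failures between runs. The sketch below shows one way to do that with a small state file. It uses Python instead of the planned shell script, and the function names, the JSON file format and the threshold of 5 are assumptions for illustration. The actual health check of .254 and the takeover steps (switching off Tasmota1, adding the .252 address, starting node-red) are not shown.

```python
import json
from pathlib import Path


def update_failure_count(state_file: Path, peer_healthy: bool) -> int:
    """Record one check result and return the consecutive-failure count."""
    count = 0
    if state_file.exists():
        count = json.loads(state_file.read_text()).get("failures", 0)
    # Any healthy response resets the counter; a failure extends the streak.
    count = 0 if peer_healthy else count + 1
    state_file.write_text(json.dumps({"failures": count}))
    return count


def should_take_over(failures: int, threshold: int = 5) -> bool:
    # assumed: 5 checks at one per minute leaves .254 enough time to reboot
    return failures >= threshold
```

The threshold multiplied by the cron interval should be comfortably longer than a normal reboot of .254, otherwise a planned restart would trigger a takeover.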

Warnings about any one of these Pis not being reachable anymore can be triggered by ping presence detection on non-critical, separate node-red installations in the home dedicated to sending warnings by email, SMS, phone or loudspeaker announcement or status displays in the house.

Regarding the WiFi reliability (for all the Tasmota and other ESP devices), I already have 2 separate access points in the house connecting wireless devices to the home network independent of the router. These access points serve the same SSID and connect to the same Ethernet switch which means that devices should connect to the other access point if one of them fails. I plan to add a third and maybe fourth access point with the same SSID for better area coverage; and to reconfigure all Tasmotas and ESPs with static IP addresses to remove the DHCP single point of failure.

If there are changes in plans or I discover something interesting I will share it in this topic. Working on realization of the remaining parts now.
