Auto-Load "Last working version" when NR start is crashing

Facts:

  • Many users struggle to start NR when a sub-module crashes, because most modules are not written defensively (without a proper try/catch). Even NR's built-in modules, like 35-arduino.
  • If NR runs as a service, many users do not know how to start it manually.
  • Most users do not know about the node-red --safe switch.
  • Even when started in --safe mode, it won't help if the user does not know what was (accidentally?) changed before pressing the DEPLOY button.

That's why I strongly suggest making the NR start process safer by

Auto-Load last stable Flow!

Let's call this new built-in feature: autoLoadLastFlow or aulolast

To disable it: node-red --unsafe (for developers, to test crash scenarios.)

How would it work?

  1. If the user deploys a flow, a backup would first be made, before saving/overwriting the current flow: [{flowname}.stable]. (Except if that file already exists.)

  2. While starting NR, each flow would be loaded inside a try..catch block (see the code sketch after this list), and if any unhandled error appears (which would normally crash NR):

    • The loading process would rename the current flow to [{flowname}.unsafe].
    • If an .unsafe file is already present, it would disable the current flow and not load it, setting the flag unsafeAlreadyExists = true.
    • Rename [{flowname}.stable] back to be the "normal" flow file.
    • Set a flag restoredLastStable = true, to show a popup to the user once NR has started.
    • Restart the whole loading process from the beginning.
  3. Otherwise, if loading was successful, it would delete the [{flowname}.stable] file.

  4. After NR starts, the user would be informed with a popup warning and could choose to:

    • Keep the old = restored = currently loaded flow,
      and delete the newer [{flowname}.unsafe] = unstable one?
      OR
    • Load the new [{flowname}.unsafe] flow for further editing?
      (This would keep the old, stable file.)
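
A rough sketch of steps 1-3 in plain Node.js. All names here (loadFlow, the literal .stable/.unsafe file suffixes) are assumptions for illustration, not an existing Node-RED API:

const fs = require('fs');

// Step 1: on deploy, keep a known-good copy before overwriting.
function backupBeforeDeploy(flowFile) {
  const stable = flowFile + '.stable';
  if (!fs.existsSync(stable)) {                    // except if it already exists
    fs.copyFileSync(flowFile, stable);
  }
}

// Steps 2-3: on startup, fall back to the stable copy if loading throws.
function loadWithFallback(flowFile, loadFlow) {
  try {
    loadFlow(flowFile);                            // hypothetical loader
    if (fs.existsSync(flowFile + '.stable')) {
      fs.unlinkSync(flowFile + '.stable');         // step 3: success, drop backup
    }
    return { restoredLastStable: false };
  } catch (err) {
    const unsafe = flowFile + '.unsafe';
    if (fs.existsSync(unsafe)) {
      return { unsafeAlreadyExists: true };        // already failed once: disable flow
    }
    fs.renameSync(flowFile, unsafe);               // keep the failing version around
    fs.renameSync(flowFile + '.stable', flowFile); // restore the stable one
    loadFlow(flowFile);                            // restart the loading process
    return { restoredLastStable: true };           // step 4: prompt the user later
  }
}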


Have you raised this as an issue against the Arduino node on GitHub? Please do so, and if possible share the flow that causes this error...
Thanks

Not yet. I'm working on multiple problems that I would like to share later, if possible with ready-to-use commits/solutions.

(But for now: please do not spam this topic with specific problems, like the many ways NR can crash.) :wink:

How long would this starting phase last?

Sorry, I do not know how the whole startup procedure loads the modules, so I do not fully understand the question. So:

Probably it would "last" until all modules and flows are loaded. In theory something like:

for (let i = 0; i < flows.length; i++) {
  try {
    flows[i].load();
  } catch (err) {
    // ... here comes the part I wrote down above ...
  }
}

I don't think such problems generally occur when the nodes are loaded, but when they receive a message, which might not happen until minutes or hours after Node-RED is started. In fact, they are often the result of invalid data, which might only appear under unusual conditions.

I think we are talking about 2 different things.

Yes, you are right about:

  • Some nodes or data can crash NR completely while running,
  • and that problem should be addressed too, ideally avoided completely.

NR should never crash just because one node crashes.
NR should be like a safe OS.
If that could be solved, it would be the ultimate goal.
I already asked about this here on the forum while I was developing my own node, but nothing happened, so I gave up.


My current idea is simpler: just make the initial loading of the flows safe, so that we have a backup.

So we can revert / re-edit things after a successful load/start.

Seems to me a better solution would be to have an external monitoring process that watches the Node-RED runtime with heartbeats and other signals (even health-control signals from your own dev nodes), to ensure NR is working properly. If things go wrong, the external process could kill NR if necessary, restart it a number of times, notify you via some service, and eventually make a last restart attempt with the --safe switch.
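
For illustration, a minimal sketch of such a supervisor in Node.js. The restart limit, delay, and --safe fallback policy are assumptions; in practice systemd or pm2 would usually do this job:

const { spawn } = require('child_process');

const MAX_RESTARTS = 5;   // assumed limit before falling back to --safe
let restarts = 0;

function start(args) {
  const child = spawn('node-red', args, { stdio: 'inherit' });
  child.on('exit', (code) => {
    if (code === 0) return;                  // clean shutdown: leave it stopped
    restarts += 1;
    const nextArgs = restarts >= MAX_RESTARTS ? ['--safe'] : [];
    console.error(`node-red died (exit ${code}), restart #${restarts}`);
    setTimeout(() => start(nextArgs), 5000); // back off, then restart
  });
}

start([]);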

Node-RED has to shut down on an uncaught exception, because it has no idea what state anything is now left in.

Re your original post, step 1 - Node-RED already does exactly that: it saves the existing flow file to (typically) .flows.json.backup in your .node-red directory. That can be manually restored in case of failure. But of course, if you hit deploy again, you will overwrite it.
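
(Restoring it by hand is just a file copy. A one-off Node.js snippet; the paths are assumptions to adapt to your own userDir and flow file name:)

const fs = require('fs');
const path = require('path');
const os = require('os');

const dir = path.join(os.homedir(), '.node-red');  // assumed userDir
fs.copyFileSync(path.join(dir, '.flows.json.backup'),
                path.join(dir, 'flows.json'));     // then restart Node-RED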

Could have a whole collection (configurable, of course) of backups, perhaps the last ten deploys. When or if NR crashes and can't recover using the existing flows.json, it drops to backup.01; if that fails, it tries backup.02, and if that fails... I think you get the picture.
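
A sketch of that fallback loop, assuming numbered copies named flows.json.backup.01, .02, ... sit next to the flow file, and with tryLoad as a hypothetical stand-in for whatever the runtime actually does:

const fs = require('fs');

// Try flows.json first, then backup.01, backup.02, ... until one loads.
function loadNewestWorking(flowFile, tryLoad, maxBackups = 10) {
  const candidates = [flowFile];
  for (let i = 1; i <= maxBackups; i++) {
    candidates.push(`${flowFile}.backup.${String(i).padStart(2, '0')}`);
  }
  for (const file of candidates) {
    if (!fs.existsSync(file)) continue;
    try {
      tryLoad(file);       // hypothetical loader, throws on failure
      return file;         // report which backup actually worked
    } catch (err) {
      console.error(`failed to load ${file}: ${err.message}`);
    }
  }
  throw new Error('no loadable flow file found');
}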

That could be automated, but it would cause much confusion, i.e., which backup is actually the working one :wink:

EDIT: of course, if NR fails when writing a new backup (for example because the hard drive is full), then you'll need an NR-watchdog-daemon to empty out the hard drive :wink: /s

EDIT2: Hang on, I could do this from outside of NR by simply watching the flows.json file (with something like the watch command): when it changes, I create a new backup (unbeknownst to NR). At the same time, this external process could monitor the NR process - if it has not been running for X seconds/minutes, restart it and check whether it starts or not. If not, replace flows.json. That could all be done outside of NR - it seems like a complicated start/service script. Hang on, doesn't the systemd daemon already provide for this... :thinking:

(sorry for thinking aloud here)
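
The watch-and-copy part of that idea is simple enough to sketch in Node.js. fs.watch can fire several times per save, so the snippet debounces; the paths are assumptions:

const fs = require('fs');
const path = require('path');
const os = require('os');

const flowFile = path.join(os.homedir(), '.node-red', 'flows.json');
const backupDir = '/var/backups/node-red';   // assumed destination, must exist
let timer = null;

fs.watch(flowFile, () => {
  clearTimeout(timer);                       // debounce rapid change events
  timer = setTimeout(() => {
    const stamp = new Date().toISOString().replace(/[:.]/g, '-');
    fs.copyFileSync(flowFile, path.join(backupDir, `flows-${stamp}.json`));
    // pruning old copies (keep the newest N) is left out for brevity
  }, 1000);
});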

I would personally like to see a node-red setting that allowed control over the number of backups being kept. I think that would be a really useful addition.

Of course, the more backups, the longer the startup time. But having it configurable would mean that people could tune it to their own needs.

Certainly there have been times when I dearly wished for more backup files.


I'm in two minds about the original request here. I'm not convinced that auto-loading the previous version is entirely useful. However, I recognise that it could be. I think it would need some additional testing though to look for possible edge-cases.

If it was adopted, as a minimum, I think that the failed version should be separately moved to a different file name so that it can be later examined without danger of it being overwritten.

Of course you can use Projects - then you can go back to any previous commit you like. (you did remember to commit the changes didn't you :wink: )

Note that there may not even be a user interface connected when Node-RED crashes, so this could happen at any time on an unattended system. The changes made in the most recent edit might, for example, be adjustments for changes in the hardware, so running the old flow could be disastrous, as that flow would not be compatible with the hardware. Restoring a backup therefore cannot be an automatic operation; it would have to be OK'd by a user.

Exactly, far too complex to have a one-size-fits-all solution. It's a rabbit warren.

In this case, something changed on the underlying machine; the flow didn't change, but when NR now does an automatic restart (e.g. via a systemd script), it ends up in an endless start/fail loop.

Hence the external monitoring process suggested above would be a better approach.

All my Node-RED instances (I currently have 11 running on 11 different computers) have a backup solution included in the flow, saving backups to my network every night in a round-robin scheme (I think that is what it is called). This means I always have fresh backups from the last week. It has sometimes been very useful: it happened that I made some changes and added some new ideas that did not work perfectly well, so it was very easy to revert.
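
One way to do such a round-robin scheme from inside a flow: a nightly Inject node wired into a Function node along these lines. The paths are assumptions, and it assumes fs and path were exposed via functionGlobalContext in settings.js:

// Function node body
const fs = global.get('fs');      // from functionGlobalContext: { fs: require('fs') }
const path = global.get('path');  // and { path: require('path') }

const src = '/home/pi/.node-red/flows.json';                 // assumed source
const day = new Date().toLocaleDateString('en-US', { weekday: 'short' });
const dest = path.join('/mnt/nas/nr-backups', `flows-${day}.json`);

fs.copyFileSync(src, dest);  // Mon..Sun slots: overwrites last week's copy
msg.payload = `backed up to ${dest}`;
return msg;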

In terms of monitoring "from outside", I run Python scripts in parallel to NR. They monitor the "behaviour" of NR and inform me if there are problems. Some of them are capable of performing both partial restarts and complete reboots.

Would be cool if you could have flows that are self-monitoring, but I'm not sure how reliable that would be. NR is single-threaded, so if it stops... it stops, and can't restart itself... I assume... unless you could use timers of some kind in some smart, unexplored way.

But then you also have a whole bunch of other complexity.

Not a good experience for people who are not familiar with Git and not used to having to constantly commit changes.

Indeed - unintended consequences and edge-cases.

And indeed, my live instances certainly back up: 7 dailies, 4 weeklies and 12 monthlies, each rotating. Not controlled by a flow either (since that would be too easy to fail); a separate set of scripts controlled by CRON at the OS level.

This is much less useful for development and test instances though where the pace of change might be measured in minutes.

I use Telegraf for independent monitoring. Though I admit, I don't have a separate alerting system right now because my Node-RED live instances are so stable! :smile:

I'm sure there are indeed some tweaks that could be made to the Linux node-red.service file that would help a bit - for example, if the service fails to start (say, 5 times), then try to start in safe mode... but I can't work out all the options required to do that.

If anyone is an expert on systemd service files - please step forwards...
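
An untested sketch of roughly what that could look like (the unit names and paths are assumptions): with Restart=on-failure plus a start limit, the unit enters the failed state after five failed attempts within the interval, and OnFailure= can then fire a fallback unit that launches safe mode.

# /etc/systemd/system/nodered.service (excerpt)
[Unit]
OnFailure=nodered-safe.service
StartLimitIntervalSec=300
StartLimitBurst=5

[Service]
ExecStart=/usr/bin/env node-red
Restart=on-failure
RestartSec=10

# /etc/systemd/system/nodered-safe.service
[Unit]
Description=Fallback: Node-RED in safe mode

[Service]
ExecStart=/usr/bin/env node-red --safe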

This became an interesting topic. :slight_smile: Thanks for all the comments!

But :worried:

You guys are all pros! It is clear to me that none of you can think any longer with the head of a student, a beginner, or a "simple user".
If NR crashed (especially as a service), they do not know why it did, or how to restore backup files, etc.
They just want to undo the last thing they (accidentally?) did, and keep working on the flow after pressing a simple "Restore the last flow that worked?" >> YES button on the screen!

So, here are my conclusions:

I agree, this is a cool idea. The install scripts should ask interactively how many restart attempts should be made (default = -1 for endless retries).

I agree! That's why I wrote:

  • "NR should start in an "empty flow" state
  • and ask the user in a popup window, what to do next!"

Similar to --safe mode.

@gregorius: I think creating 5-10-... backups is overkill.
(If anyone wants to create a specific backup system (like Git, SyncThing, a daemon, a service, etc.), feel free to do so!)

Making backups from the flow sounds great (drop me a link to the node/script/function!), and this technique should be included in the main-page documentation. (We should have +1 tab for "Security" and +1 for "Safety and Backups".)

But this will not work if NR has already crashed.
A simple user or beginner will not be able to restore backups so easily. (They cannot find files in a complicated hierarchy, nor do they know how to rename them in an emergency.)


Most users do not know about the current .backup files. (I didn't either, and I'm not a beginner any more. Since which NR version has this existed?)
Whenever one is created, it should be shown in the logs + in the debug sidebar!

Solution:

Two auto-backups of the flow are enough:

  1. one [...backup] for every deploy (as it works now)
  2. one [...stable] if the flow starts successfully (deleting the previous stable one)

Everything else could work as I have already described at the top.

There have been several variants discussed earlier; I followed this approach, except that I use my own networked SSD instead of Dropbox:
