Persistence after crashing

Hi there,

I have been looking around but have been unable to find an answer to this. Is there a way for the state of a flow to be persisted?

For example, I have created a flow that accepts an HTTP request and returns a response. At the same time, it then passes the payload to different nodes which do a variety of different processes etc.

However, if the payload is (say, for example) 40% through the flow and the Node process crashes, is there a way for it to continue where it left off when I restart Node-RED? Or would it be lost forever, so that a new HTTP request would have to be made?

Thanks,
George

Hi @georgehanson.

There is no built-in way to resume a flow following a crash.

In the case of an HTTP request, if NR crashes then the node process dies and the OS will close the socket. There's simply no way for NR to be able to restart and respond to that same request.

Thanks for the quick response. I suppose that makes sense when you think about it.

Nevertheless, Node-RED is great so thanks for making it open source :slight_smile:

Crashes should be extremely rare. I can't remember the last time that my 'production' node-red crashed. Are you getting regular crashes?

No, but I haven't yet tried sending hundreds of requests to it. I was just curious how it would handle the Node process crashing.

Do you necessarily have to send HTTP requests? If you could instead send requests via an MQTT broker with the retain flag set, you could design it so that the latest request is executed when NR reconnects to the broker/topic after a crash.

It's mainly going to be a flow which runs upon a web hook from a third-party source.

Yes, understood, but if you could capture those HTTP requests and "convert" them into request messages sent to an MQTT broker, the flow would only react to requests originating from MQTT. You might need to write some "frontend" that handles this conversion. I do not know, but maybe you would also need to stop sending HTTP requests when NR is down? Or is it OK to lose such requests?

http requests -> "frontend" -> mqtt "request" message with retain flag set -> broker -> NR mqtt subscription -> NR flow etc

You could then let the "frontend" and NR monitor each other to check that both are running. Each could restart the other if necessary (it is probably not very likely that they would crash at the same time).

As a possibly easier solution, you could let the "frontend" hold requests in a FIFO buffer while NR is down and send them as soon as NR is running again.

http requests -> "frontend" with FIFO buffer -> NR flow

I would write the "frontend" in Python (but that is just my personal opinion and choice)
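
The buffering "frontend" can be sketched in any language. Below is a minimal JavaScript version, chosen only because Node-RED itself runs on Node.js. The class name `FifoForwarder` and the `deliver` callback are hypothetical; `deliver` stands in for whatever call actually sends a request to NR, and it is synchronous here purely to keep the sketch short.

```javascript
// Hypothetical sketch: hold requests in FIFO order while Node-RED is down.
class FifoForwarder {
  // `deliver` is an assumed callback that sends one request to Node-RED and
  // returns true on success, or false if Node-RED cannot be reached.
  constructor(deliver) {
    this.deliver = deliver;
    this.queue = [];
  }

  // Accept a new request, then try to forward everything that is waiting.
  accept(request) {
    this.queue.push(request);
    return this.flush();
  }

  // Forward queued requests oldest-first and stop at the first failure, so
  // the original order is preserved. Returns how many were delivered.
  flush() {
    let delivered = 0;
    while (this.queue.length > 0) {
      if (!this.deliver(this.queue[0])) break;
      this.queue.shift();
      delivered += 1;
    }
    return delivered;
  }
}

// Usage with a fake downstream that can be switched on and off.
let nodeRedUp = false;
const received = [];
const forwarder = new FifoForwarder((req) => {
  if (!nodeRedUp) return false;
  received.push(req);
  return true;
});

forwarder.accept({ id: 1 }); // buffered, NR is "down"
forwarder.accept({ id: 2 }); // buffered
nodeRedUp = true;
forwarder.accept({ id: 3 }); // flushes 1, 2 and 3 in order
console.log(received.map((r) => r.id));
```

Note that this buffer lives in memory, so it only protects against NR being down, not against the "frontend" itself crashing. Surviving that as well would require writing the queue to disk or to a broker.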

Not quite sure I have explained myself properly.

So the flow we are building has around 10 nodes in it (each is a custom node). The first node accepts an HTTP request and returns a response to the user to say that it has been accepted. After that, it sends the payload through the flow to the other nodes. So by the time the second node starts, the HTTP response has already been sent back to the user.

So what I was looking for was a way of solving the issue where, if it gets to say the 5th node in the flow and Node crashes, the flow could possibly be restored (or any changes rolled back).

There is no 'automatic' built in way to recover from a crash. A crash could occur from a power failure - for example.

You could add a node between each of your current nodes that stores the msg.payload and the current location to a file. Then if there is a crash, at startup you would check the file, restore the msg, and start at that location.

So an HTTP request comes in and is passed to the first node 'A': put a node between HTTP-in and 'A' to write out the msg and the location 'A-in'. Node 'A' finishes and passes on to node 'B': put a node between 'A' and 'B' to write out the msg and the location 'B-in', and so on.

Whenever NR starts, you have a node that checks the file, and if it sees a msg and 'B-in' it passes the msg to 'B'.

But what happens if the system crashes while the 'backup' node is running or while it is writing out the file? Or what happens if it crashes while processing in the first node, before sending out the HTTP response? Or what if it is still in the first node and has already sent the HTTP response: should that node be restarted, and is it OK to send out the HTTP response again? Or what happens if it crashes again after restarting, while reading the file? Or the file becomes corrupted, or...

There are probably many other conditions that all depend on what you are doing, how critical the data is, and other factors that only you know and can determine, but I hope I've given you some ideas and things to think about.
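
As a rough illustration of the checkpoint idea, here is a minimal Node.js sketch. The function names `saveCheckpoint`, `loadCheckpoint` and `clearCheckpoint` are hypothetical and not part of Node-RED. It assumes a single in-flight message and a checkpoint file on a local POSIX filesystem, and it writes to a temporary file before renaming, which addresses only the "crash while writing the file" case from the questions above.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Write the checkpoint to a temporary file, then rename it over the real one.
// On POSIX filesystems a rename within one filesystem replaces the target
// atomically, so a reader sees the old checkpoint or the new one, not a
// half-written file. (No fsync is done here, so a power failure may still
// lose the most recent write.)
function saveCheckpoint(file, location, msg) {
  const tmp = file + '.tmp';
  fs.writeFileSync(tmp, JSON.stringify({ location, msg }));
  fs.renameSync(tmp, file);
}

// Return the last checkpoint, or null if it is missing or cannot be parsed.
function loadCheckpoint(file) {
  try {
    return JSON.parse(fs.readFileSync(file, 'utf8'));
  } catch (err) {
    return null;
  }
}

// Remove the checkpoint once the flow has finished with the message.
function clearCheckpoint(file) {
  if (fs.existsSync(file)) fs.unlinkSync(file);
}

// Usage: record progress before 'A' and before 'B', then "restart".
const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'nr-checkpoint-'));
const file = path.join(dir, 'checkpoint.json');

saveCheckpoint(file, 'A-in', { payload: { order: 42 } });
saveCheckpoint(file, 'B-in', { payload: { order: 42, checked: true } });

const restored = loadCheckpoint(file);
console.log(restored.location); // the node to resume at
```

With several messages in flight at once, one file per message (keyed by some message id) would be needed instead of a single shared file. Whether it is safe to re-run the node named in `location` still depends on whether that node's work can be repeated.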

Would it help to delay the response to the user until all the necessary actions have been performed?

Probably not. It could go down any number of routes and have to wait for a variety of processes to finish. A flow could last anything up to a few minutes.

If there were a failure, would it be sufficient to repeat the tasks that should have been performed, or must each task be performed once and only once?

If you need atomic operations and transaction support, you should build your process on a database, where all that is available...

As a suggestion, maybe try to tag each HTTP request/response with a unique identifier (in the first node), set "context.flow.unique_id = unique_tag", and then use flow.set("last_id_ack", context.flow.unique_id) to store it. Then use flow.get("last_id_ack") in your first node to see whether it had been acknowledged previously. There are a number of ways to use this to do even more elaborate checking.
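
A minimal sketch of that tagging idea follows. It runs outside Node-RED, so the `flow` object here is only a stand-in that mimics the get/set calls of flow context. The names `tag`, `acknowledgeOnce`, `requestId` and `acked_ids` are hypothetical, and this variation remembers every acknowledged id rather than only the last one.

```javascript
const crypto = require('crypto');

// Stand-in for Node-RED's flow context (get/set by key), used here so the
// sketch can run on its own.
const flow = {
  store: new Map(),
  get(key) { return this.store.get(key); },
  set(key, value) { this.store.set(key, value); },
};

// Hypothetical first step: give the message a unique id unless the sender
// (for example, the webhook source) already supplied one.
function tag(msg) {
  if (!msg.requestId) msg.requestId = crypto.randomBytes(16).toString('hex');
  return msg;
}

// Returns true the first time an id is seen and false for any repeat, so a
// replayed request can be skipped instead of processed twice.
function acknowledgeOnce(msg) {
  const seen = flow.get('acked_ids') || {};
  if (seen[msg.requestId]) return false;
  seen[msg.requestId] = true;
  flow.set('acked_ids', seen);
  return true;
}

const first = tag({ payload: 'webhook body' });
console.log(acknowledgeOnce(first)); // true: new request
console.log(acknowledgeOnce(first)); // false: already acknowledged
```

In Node-RED itself, context is held in memory by default, so for the acknowledged ids to survive a crash the context store would need to be a persistent one.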

I just started reading through this thread because I am trying to build a similar approach towards flow state recovery.

This is what I am looking for -

  1. A running flow crashed in between due to some event that ultimately killed the process (pod) running on our cluster for a particular user.
  2. The Kubernetes autoscaler brings back a new instance for the same user.
  3. The new Node-RED instance resumes its flow execution from the last saved SAFE state.

So it is now September 2020, and I am just wondering if someone has already figured out something that can be followed or reused :face_with_monocle:
