Catch node not catching

This is still the ongoing issue I raised here: Memory leak, what am I doing wrong?

Basically my node-red flow uses all the memory and my instance is terminated by the OOM killer.

  • The "will" message to the MQTT server triggers an SMS txt to me so I know its happened within a few seconds.
  • Systemd restarts node-red and my flow restarts everything as if a cold boot and I get a message telling me its up in about 3 minutes from the MQTT server "birth" message.
  • All is well for ~20-36 hours and the cycle repeats.

Here is the syslog message:

Sep 24 11:09:25 AlarmPi systemd[1]: Started Clean php session files.
Sep 24 11:16:32 AlarmPi Node-RED[23852]: 24 Sep 11:16:32 - [red] Uncaught Exception:
Sep 24 11:16:32 AlarmPi Node-RED[23852]: 24 Sep 11:16:32 - RangeError: Array buffer allocation failed
Sep 24 11:16:32 AlarmPi Node-RED[23852]: at new ArrayBuffer (<anonymous>)
Sep 24 11:16:32 AlarmPi Node-RED[23852]: at typedArrayConstructByLength (<anonymous>)
Sep 24 11:16:32 AlarmPi Node-RED[23852]: at new Uint8Array (native)
Sep 24 11:16:32 AlarmPi Node-RED[23852]: at new FastBuffer (buffer.js:38:5)
Sep 24 11:16:32 AlarmPi Node-RED[23852]: at Function.Buffer.alloc (buffer.js:245:10)
Sep 24 11:16:32 AlarmPi Node-RED[23852]: at new Buffer (buffer.js:156:19)
Sep 24 11:16:32 AlarmPi Node-RED[23852]: at Socket.dataHandler (/home/pi/.node-red/node_modules/ftpd/lib/FtpConnection.js:1255:27)
Sep 24 11:16:32 AlarmPi Node-RED[23852]: at emitOne (events.js:116:13)
Sep 24 11:16:32 AlarmPi Node-RED[23852]: at Socket.emit (events.js:211:7)
Sep 24 11:16:32 AlarmPi Node-RED[23852]: at addChunk (_stream_readable.js:263:12)
Sep 24 11:16:32 AlarmPi Node-RED[23852]: at readableAddChunk (_stream_readable.js:250:11)
Sep 24 11:16:32 AlarmPi Node-RED[23852]: at Socket.Readable.push (_stream_readable.js:208:10)
Sep 24 11:16:32 AlarmPi Node-RED[23852]: at TCP.onread (net.js:597:20)
Sep 24 11:16:32 AlarmPi systemd[1]: nodered.service: Main process exited, code=exited, status=1/FAILURE

I've added a catch node set to "catch all", wired to a debug node with console log output, but it never seems to print anything I can find.

I'm just wondering what I'm doing wrong with the catch node:
[{"id":"f3ad7d96.e91a5","type":"catch","z":"b3610d6f.9372","name":"","scope":null,"x":120,"y":40,"wires":[["93dd1cef.a9c34"]]},{"id":"93dd1cef.a9c34","type":"debug","z":"b3610d6f.9372","name":"","active":false,"tosidebar":true,"console":true,"tostatus":false,"complete":"true","x":410,"y":100,"wires":[]}]

It's not a show-stopper issue, since losing 3-4 minutes of 24/7 monitoring is not the end of the world, but it does offend my programming sensibilities.

As a matter of fact, the catch node will capture only errors generated by nodes that are on the very same tab as the catch node.

OK, it's certainly possible that the error comes from a node on the other tab; I'm fuzzy on the relationship among tabs and multiple flows on the same tab.

I'll add one to the other tab as well and see if it catches anything. Should I "debug" the complete message object or just the msg? When it's set at the default of msg, there is an extra "node status" checkbox. Which is the best debug option for the catch node?

I separated the two main flows onto separate tabs because one was reusable elsewhere with only some parameter changes.

Given it takes 20-36 hours or so for the failure to repeat, it's pretty slow going.

Looking the system over, it seems I have at least two flows on one tab and three on the other, depending on the exact definition of a flow.

The "main" flows on each tab bifurcate depending on the values of various QOS 2 retained messages (state variables) Other minor flows are watchdog timers and monitors for will messages should one of the subsystems fail the MQTT broker keepalive.

I really need to find some good docs about how node-red is doing its event loop and synchronization.

Well, the actual error reported is /home/pi/.node-red/node_modules/ftpd/lib/FtpConnection.js:1255:27
failing to allocate a buffer - caused by it running out of RAM, no doubt.
Are you using the nodered.service systemd file? (That tries to set the total memory used to 256M, at which point the garbage collector should kick in... but only if there is garbage to collect.) Of course, it may not be that node that filled up the memory - it may just be the one tipping it over, or it could be that it isn't releasing old buffers. How big are the objects it's trying to ftp?
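If you want to watch the memory climb before the crash, one rough option (just a sketch, assuming process is exposed in your function-node sandbox; if not, you can expose it via functionGlobalContext in settings.js) is a repeating inject node feeding a function node that logs process.memoryUsage(). Note that Buffer allocations show up under "external", not "heapUsed":

// Function node fed by a repeating inject node (say, once a minute).
// Logs RSS and heap figures so growth shows up in the log over time.
const mu = process.memoryUsage();
node.log("rss=" + Math.round(mu.rss / 1048576) + "MB"
       + " heapUsed=" + Math.round(mu.heapUsed / 1048576) + "MB"
       + " external=" + Math.round(mu.external / 1048576) + "MB");
return msg;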

PS - which ftp node is it? (the full name)

Dave's post is a good place to start certainly. Another option after that might be to create a second instance of Node-RED and start moving flows over until you find the one causing the issue. My first guess would be that FTP node though.

It's node-red-contrib-ftp-server v0.30.

I'm not sure I understand the bit about the nodered.service systemd file, but it is automatically started on boot-up and automatically restarts after it dies. My /lib/systemd/system/nodered.service contains: Environment="PI_NODE_OPTIONS=--max_old_space_size=256" Would making this smaller be expected to make the situation better or worse? It seems easy to try.

I've suspected the ftp server node from the beginning, as a lot of data flows through it; the JPEG image files typically run 110K to 180K each, depending on the lighting and camera view. That was the focus of my original thread, asking whether there was something I needed to do in node-red to dispose of the image buffers when finished with them.

The image buffers from the ftp-server are either discarded by returning null in the function node that receives them, or passed to my Python AI script as an MQTT message, depending on the values of the context state variables. The Python process and the MQTT broker both show perfectly reasonable CPU and memory usage, which doesn't change very much.
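For reference, the discard-or-forward step is essentially this sketch (the context variable name "aiEnabled" and the topic are made up for illustration):

// One-output function node receiving the JPEG Buffer in msg.payload
// from the ftp-server node. Returning null ends the flow for that message.
if (!flow.get("aiEnabled")) {
    return null;            // discard the image buffer
}
msg.topic = "ai/image";     // hypothetical topic for the Python AI script
return msg;                 // forwarded on to an MQTT-out node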

Node-red has died and restarted three times since I started this thread, and still the catch nodes don't catch anything. I suspect the issue is in the underlying nodejs library used by node-red-contrib-ftp-server.

My MQTT "will" messages notify me that node-red has died and restarted with SMS messages. Looks like I'm just going to have to live with it crashing and restarting every ~30 hours or so. Down time 1s usually less than two minutes.

Check the log files... it sounds like you are running out of memory. In that case, there's not much the runtime can do to "catch" a low-level system exception.
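As a rough illustration (plain Node.js, not Node-RED's actual source): even a process-level uncaughtException handler that sees the RangeError can't recover once the heap is exhausted; the only safe move is to log and exit so systemd can restart the service. A Catch node never gets a look in.

// Plain Node.js sketch, not Node-RED's actual source.
process.on("uncaughtException", (err) => {
    console.error("Uncaught Exception:", err);
    process.exit(1);  // no flow is running any more; a Catch node never fires
});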

We've given a number of recommendations and, as Steve says, this is something beneath anything that Node.js can capture, let alone Node-RED.

There are other ways to use Node-RED to do what you want; time to change.

yes.... but maybe there is a bug in that node that is not releasing buffers etc...

Quite possibly and some info has been given on how to begin to prove that. But we know that these issues are hard to track down, especially if you don't have the requisite skills (that's not a criticism, many of us don't have those skills which is a major reason for using NR of course).

So if we need to move on and can't track down the problem, time to look at alternatives.

...or just accept 20 secs of downtime now and again...

Haha :slight_smile: - you may have a point - as long as things gracefully recover themselves.

Should always design things that way - https://en.wikipedia.org/wiki/Chaos_Monkey


I've quite a lot of code to detect whether various subsystems are working and to restart them should they die or hang. I do detect node-red being OOM-killed, but in this case systemd restarts it with no effort on my part.

I'm not willing to take apart and rebuild a working system in an effort to isolate things any further, as I've set up a similar system using ONVIF network cameras, which doesn't need the node-red-contrib-ftp-server node, and it has never run out of memory.

Eventually I'll replace all the cameras with ONVIF models, but for now I'm phasing them in slowly, in the most important locations first.

OK, well you might want to raise a GitHub issue with the author at least. As you can see from the comments, this was written by someone for his father. I don't think the actual author uses Node-RED directly, so he won't be reading this forum.

Generally, I use MQTT with the last-will-and-testament feature to track whether services and systems are online or offline. Then I can use a simple flow to trap anything going offline and either raise an alert for manual intervention or do something automatic.
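If any of your subsystems are scripted directly in Node.js, the same idea with the npm mqtt package looks roughly like this (the broker URL and topic are placeholders):

// Sketch using the npm "mqtt" package; broker URL and topic are placeholders.
// The broker publishes the retained "offline" will on our behalf if this
// client ever fails its keepalive, so a simple flow can trap it.
const mqtt = require("mqtt");
const client = mqtt.connect("mqtt://broker.local", {
    will: { topic: "status/subsystem1", payload: "offline", qos: 2, retain: true }
});
client.on("connect", function () {
    // "birth" message: overwrite the retained status with "online"
    client.publish("status/subsystem1", "online", { qos: 2, retain: true });
});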

I think the issue is with the underlying nodejs ftpd library that it is built on top of. I think the node-red-contrib-ftp-server author has raised the issue.

Interesting comment in that node's code: // TODO connection.on('close' ...) doesn't work
So while the workaround may work, it does possibly point to a deeper problem where close isn't doing the right thing and may not be cleaning up fully. But yes, it really needs an FTP expert and some digging.

Though I also see that the node itself tries to clean up by deleting the temporary files 5 secs after they are created (and have been sent onwards).
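The pattern described there is roughly a deferred delete, something like this sketch (not the node's actual source):

// Sketch of the deferred-cleanup pattern described above, not the
// node's actual source: delete the temp file a few seconds after the
// message carrying it has been sent onwards.
const fs = require("fs");
function scheduleCleanup(tmpPath) {
    setTimeout(function () {
        fs.unlink(tmpPath, function (err) {
            if (err) { console.error("cleanup failed: " + err.message); }
        });
    }, 5000);
}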