Node Red consuming 100% CPU power

Hey guys,

i am using Node-RED in my home automation scenario for different purposes. I have about 10 flows with each about 5 up to 50 nodes. Everything is working fine, but i have a big problem with the CPU usage that sometimes occur.

From time to time - mostly every one to three days - CPU usage jumps to 100% (one CPU core) and stucks there forever since i restart Node-RED. I am running Node-RED on Docker on a quad core with 16 GB RAM on Ubuntu Server.

This is a screenshot from my munin server monitoring from the last week:

The flows are usual working normal, when this happens, but i don't like the extra power the CPU is converting to heat.

What can i do to find and debug the cause of this? The log output does not show anything at the time the CPU usage occurs.

The error occurred lately at about 22 o'clock on 1st of january:

1 Jan 18:45:44 - [warn] [modbus-read:Modbus TCP SE17K SunSpec] Modbus Failure On State sending Get More About It By Logging
1 Jan 18:45:44 - [warn] [modbus-read:Modbus TCP SE17K Power Control] Modbus Failure On State sending Get More About It By Logging
1 Jan 18:45:44 - [warn] [modbus-write:Write Power Limit] Modbus Failure On State sending Get More About It By Logging
1 Jan 18:45:44 - [warn] [modbus-write:Enable Advanced Power Control] Modbus Failure On State sending Get More About It By Logging
1 Jan 18:45:44 - [warn] [modbus-write:Enable Reactive Power Config] Modbus Failure On State sending Get More About It By Logging
1 Jan 18:45:44 - [warn] [modbus-write:Commit Power Settings] Modbus Failure On State sending Get More About It By Logging
1 Jan 18:45:44 - [error] [modbus-read:Modbus TCP SE17K SunSpec] Error: Port Not Open
1 Jan 18:45:54 - [error] [modbus-write:Write Power Limit] Error: Client Not Ready To Write At State init
1 Jan 18:46:04 - [error] [modbus-write:Write Power Limit] Error: Client Not Ready To Write At State init
1 Jan 18:46:14 - [error] [modbus-write:Write Power Limit] Error: Client Not Ready To Write At State reconnecting
1 Jan 23:24:07 - [info] [alphaess-monitoring:Alpha ESS Monitoring] Renewing token...
2 Jan 02:00:00 - [warn] [change:reset counter] Cannot set property of non-object type: topic.reset
2 Jan 04:46:18 - [warn] [modbus-read:Modbus TCP SE17K SunSpec] Modbus Failure On State sending Get More About It By Logging
2 Jan 04:46:18 - [warn] [modbus-read:Modbus TCP SE17K Power Control] Modbus Failure On State sending Get More About It By Logging
2 Jan 04:46:18 - [warn] [modbus-write:Write Power Limit] Modbus Failure On State sending Get More About It By Logging
2 Jan 04:46:18 - [warn] [modbus-write:Enable Advanced Power Control] Modbus Failure On State sending Get More About It By Logging
2 Jan 04:46:18 - [warn] [modbus-write:Enable Reactive Power Config] Modbus Failure On State sending Get More About It By Logging
2 Jan 04:46:18 - [warn] [modbus-write:Commit Power Settings] Modbus Failure On State sending Get More About It By Logging

Any ideas? I am a software developer, but not very experienced with JavaScript, Node.js, etc.

Thank you in advance! :slight_smile:

Best regards,
Konni

Looking at your logs it seems whatever modbus device you are polling is dropping off air. What modbus device are you communicating to? What modbus nodes are you using? How often are you polling? Are you reading lots of individual modbus values at the same time?

PS: I am consuming a lot of external APIs (mostly HTTP/REST). Could it be that some of them don't respond and my request does not time out? I tried to put a request timeout to all my http request nodes, but I have some proprietary nodes that encapuslate internally the http requests and so i can't put timeouts to them.

I am using node-red-contrib-modbus for both reading and writing values from my solar inverter as well as my heat pump.

I'm polling every 10 seconds from the SolarEdge solar inverter (~100 bytes in two separated commands) as well as writing 4 single values (i think one or two bytes each command) every 10 seconds.

I read from the heat pump only every minute (~70 bytes) in a single command.

Command queuing with a delay of 250 ms is enabled.

One more thing: both modbus connections are modbus TCP. And sadly SolarEdge has a very restricted implementation. It's very nervous about multiple connections. It's not allowed to poll data from two different IPs and even two connections from the same hosts are not possible. Because of that there are often errors on reconnection, until the inverter accepts a reconnect.

It's hard to know at this point. You could try disabling modbus & rule it out?

You could also try logging pings to your router and the modbus devices - perhaps your network is glitching?

Then there is your custom nodes. Do that have synchronous code that might block the event queue?

Hi @IAmKonni,

In case you don't manage to find the root cause (based on the extensive info that @Steve-Mcl already has given), could this be of any help to you perhaps:

Bart

2 Likes

Maybe...

Some of your flows have feedback.

Or it happens at weird times because of context, values, etc.

Thank you - I will try it today or tomorrow evening.

I am very experience with Java and I like the possibilities there to inspect and profile the Java Virtual Machine. I can very easily detect which code "burns" the CPU usage.

It would help me, if i find out, which node causes the problem.

Regarding the feedback loops or synchronizing problems: Most flows are very linear in means of reading data from a source (e.g. modbus or an external API) and writing it to an InfluxDB v2. I am using 3 or4 global context variables to combine and calculate values based from two different flows. But those are triggered by timer and not events. I don't think I have feedback loops in the manner of MQTT or similar.

BTW: I have these problems now since months, even the upgrade to Node-RED 2.x did not resolve it (currently using latest-12).

Ok. Good to hear.

Mind if I ask what you are running it on and which O/S?

It's running on a low power quad core Athlon 5350 (4x 2.05 GHz) with 16 GB RAM DDR3. Host O/S is Ubuntu Server 20.04 LTS.

Ok, sorry. I'm out of ideas.

Good luck.

You could try setting logging level to "debug" and metrics "true". There is an audit setting too but I don't know what it does.

That will put more info in the log, including memory use (but not CPU, unfortunately).

You could also have a script running that logs CPU use every few seconds. Something like this:

#! /bin/bash
while true
do
NOW=$(date +%H:%M:%S)
CPU=$(uptime | cut -d ',' -f 4-6)
DATA=$NOW$CPU

echo $DATA >> /tmp/noderedlog
sleep 10
done

(It could output to the same file as Node-Red logging but I don't have Ubuntu, nor can I recall how to append to syslog)
Oh, the uptime command gives average CPU usage on Debian, not sure about Ubuntu.

Thanks to the efforts of @BartButenaers i could find at least one problem in a function node after a catch node. But I am wondering why nowhere a problem was logged/recognized.

This node is worth your efforts, thank you! :+1::slightly_smiling_face:

1 Like

What exactly was the problem?

Was this function after the catch node causing an exception that triggered the catch node (thus creating a loop?)

1 Like

You are welcome!
If you have any time left, it would be nice if you could share a screenshot of your flame graph and explain in some steps how you have interpreted it, to find the root cause. That might be VERY useful for me and others to solve similar issues.
Thanks !!!

1 Like

That was probably the problem, I will show later my flow and the code in my function flow. I need to play with my baby girl first. :stuck_out_tongue_winking_eye:

Here the steps that i took to find the cause of the problem:

  1. I used the profiler node of @BartButenaers to create the .cpuprofile file and loaded it like described into the chrome developer tools (my CPU usage was at time at 100%).

  2. Then you get this view:

I sorted by "Heavy (Bottom Up)" and Self-Time and checked then line by line the corresponding source code that can be seen in the right column. I tried to find the first occurrence of code that didn't look like Node-RED internals and seem to be caused by my own flows.
In my case this was the second line. It told me, that there is CPU time wasting in node with id "blabla". With this id i could check in my flows which node caused the problem.

  1. Try to find the cause:
    In my case it was a function node behind a catch node that i used to create messages that will be sent via telegram to my smart phone. This function was intended to report me fast, if anything in my smart home and automation system fails.
    This catch node belongs to a weather cloud API that reads several thermal sensors in my house. Obvisiously this API is sometimes not reachable, what leads to errors and triggers the catch node.

The flow:

Have a look into my simple code within the function node:

It seems that the HTTP status code is not piped to the catch clause and cannot been integrated therefore in the telegram message. I removed the appending of the msg.statusCode and now everything runs fine. Sadly Node-RED didn't get me something like a null pointer or similar. Or it was my bad that it was catched by catch node too and so i got the loop.

I hope you understood what I did and what steps i took to get the problem solved. Feel free to ask for more details.

Thank you all, especially @BartButenaers and have a nice day! :slight_smile:

3 Likes

Thanks for sharing this!!!!!!!! I will add some of your info tonight in the wiki page.