I am using Node-RED in my home automation setup for various purposes. I have about 10 flows, each with roughly 5 to 50 nodes. Everything is working fine, but I have a big problem with CPU usage that sometimes occurs.
From time to time - mostly every one to three days - CPU usage jumps to 100% (one CPU core) and stays there until I restart Node-RED. I am running Node-RED in Docker on a quad-core machine with 16 GB RAM on Ubuntu Server.
This is a screenshot from my Munin server monitoring from the last week:
The flows usually keep working normally when this happens, but I don't like the extra power the CPU is converting into heat.
What can I do to find and debug the cause of this? The log output does not show anything at the time the high CPU usage starts.
The error most recently occurred at about 22:00 on the 1st of January:
1 Jan 18:45:44 - [warn] [modbus-read:Modbus TCP SE17K SunSpec] Modbus Failure On State sending Get More About It By Logging
1 Jan 18:45:44 - [warn] [modbus-read:Modbus TCP SE17K Power Control] Modbus Failure On State sending Get More About It By Logging
1 Jan 18:45:44 - [warn] [modbus-write:Write Power Limit] Modbus Failure On State sending Get More About It By Logging
1 Jan 18:45:44 - [warn] [modbus-write:Enable Advanced Power Control] Modbus Failure On State sending Get More About It By Logging
1 Jan 18:45:44 - [warn] [modbus-write:Enable Reactive Power Config] Modbus Failure On State sending Get More About It By Logging
1 Jan 18:45:44 - [warn] [modbus-write:Commit Power Settings] Modbus Failure On State sending Get More About It By Logging
1 Jan 18:45:44 - [error] [modbus-read:Modbus TCP SE17K SunSpec] Error: Port Not Open
1 Jan 18:45:54 - [error] [modbus-write:Write Power Limit] Error: Client Not Ready To Write At State init
1 Jan 18:46:04 - [error] [modbus-write:Write Power Limit] Error: Client Not Ready To Write At State init
1 Jan 18:46:14 - [error] [modbus-write:Write Power Limit] Error: Client Not Ready To Write At State reconnecting
1 Jan 23:24:07 - [info] [alphaess-monitoring:Alpha ESS Monitoring] Renewing token...
2 Jan 02:00:00 - [warn] [change:reset counter] Cannot set property of non-object type: topic.reset
2 Jan 04:46:18 - [warn] [modbus-read:Modbus TCP SE17K SunSpec] Modbus Failure On State sending Get More About It By Logging
2 Jan 04:46:18 - [warn] [modbus-read:Modbus TCP SE17K Power Control] Modbus Failure On State sending Get More About It By Logging
2 Jan 04:46:18 - [warn] [modbus-write:Write Power Limit] Modbus Failure On State sending Get More About It By Logging
2 Jan 04:46:18 - [warn] [modbus-write:Enable Advanced Power Control] Modbus Failure On State sending Get More About It By Logging
2 Jan 04:46:18 - [warn] [modbus-write:Enable Reactive Power Config] Modbus Failure On State sending Get More About It By Logging
2 Jan 04:46:18 - [warn] [modbus-write:Commit Power Settings] Modbus Failure On State sending Get More About It By Logging
Any ideas? I am a software developer, but not very experienced with JavaScript, Node.js, etc.
Looking at your logs, it seems whatever Modbus device you are polling is dropping offline. What Modbus device are you communicating with? What Modbus nodes are you using? How often are you polling? Are you reading lots of individual Modbus values at the same time?
PS: I am consuming a lot of external APIs (mostly HTTP/REST). Could it be that some of them don't respond and my request never times out? I tried to add a request timeout to all my http request nodes, but I have some proprietary nodes that internally encapsulate the HTTP requests, so I can't set timeouts on them.
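For the core http request nodes there is also a global default in Node-RED's `settings.js` (120000 ms is the documented default). Note this is not a fix for the proprietary nodes: contrib nodes that make their own HTTP calls will not honor it.

```javascript
// Fragment of Node-RED's settings.js: the default timeout (in ms) applied
// to core http request nodes that have no explicit timeout configured.
// Contrib nodes doing their own HTTP calls are NOT affected by this value.
module.exports = {
    // ...the rest of your settings...
    httpRequestTimeout: 120000
};
```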
I am using node-red-contrib-modbus for both reading values from and writing values to my solar inverter as well as my heat pump.
I'm polling the SolarEdge solar inverter every 10 seconds (~100 bytes in two separate commands), as well as writing 4 single values (I think one or two bytes per command) every 10 seconds.
I read from the heat pump only every minute (~70 bytes) in a single command.
Command queuing with a delay of 250 ms is enabled.
One more thing: both Modbus connections are Modbus TCP. Sadly, SolarEdge has a very restrictive implementation: it is very touchy about multiple connections. Polling data from two different IPs is not allowed, and even two connections from the same host are not possible. Because of that, there are often errors on reconnection until the inverter accepts the reconnect.
In case you don't manage to find the root cause (based on the extensive info that @Steve-Mcl already has given), could this be of any help to you perhaps:
Thank you - I will try it today or tomorrow evening.
I am very experienced with Java, and I like the possibilities there to inspect and profile the Java Virtual Machine. I can very easily detect which code "burns" CPU.
It would help me if I could find out which node causes the problem.
Regarding the feedback loops or synchronization problems: most flows are very linear, in the sense of reading data from a source (e.g. Modbus or an external API) and writing it to an InfluxDB v2. I am using 3 or 4 global context variables to combine and calculate values coming from two different flows. But those are triggered by timers, not events. I don't think I have feedback loops in the manner of MQTT or similar.
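The pattern described above can be sketched as follows. This is a self-contained plain-Node.js sketch, not the actual flows: `globalCtx` stands in for Node-RED's global context, and all the variable names are made up for illustration.

```javascript
// Two timer-driven flows store their latest readings in global context,
// and a third function combines them. In a real function node you would
// call global.set(...) / global.get(...) instead of this mock.
const store = new Map();
const globalCtx = {
  set: (key, value) => store.set(key, value),
  get: (key) => store.get(key),
};

// Flow 1 (e.g. the inverter poll) stores its latest reading:
globalCtx.set("solarPowerW", 4200);
// Flow 2 (e.g. the heat pump poll) stores its latest reading:
globalCtx.set("heatPumpPowerW", 1500);

// The combining function, triggered by a timer rather than by events.
// Defaulting to 0 avoids NaN results before the first poll has completed.
function combine() {
  const solar = globalCtx.get("solarPowerW") ?? 0;
  const pump = globalCtx.get("heatPumpPowerW") ?? 0;
  return { payload: solar - pump }; // e.g. net surplus power
}

console.log(combine().payload); // 2700
```

Because each flow only ever writes its own key and the combiner only reads, this stays loop-free even though the flows share state.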
BTW: I have had these problems for months now; even the upgrade to Node-RED 2.x did not resolve them (currently using latest-12).
(It could output to the same file as the Node-RED logging, but I don't have Ubuntu, nor can I recall how to append to syslog.)
Oh, the uptime command gives the load averages on Debian; not sure about Ubuntu.
Thanks to the efforts of @BartButenaers I could find at least one problem in a function node after a catch node. But I am wondering why no problem was ever logged or recognized.
You are welcome!
If you have any time left, it would be nice if you could share a screenshot of your flame graph and explain in some steps how you have interpreted it, to find the root cause. That might be VERY useful for me and others to solve similar issues.
Thanks !!!
Here are the steps I took to find the cause of the problem:
I used the profiler node of @BartButenaers to create the .cpuprofile file and loaded it into the Chrome developer tools as described (my CPU usage was at 100% at the time).
I sorted by "Heavy (Bottom Up)" and self time, then checked line by line the corresponding source code shown in the right column. I tried to find the first occurrence of code that didn't look like Node-RED internals and seemed to be caused by my own flows.
In my case this was the second line. It told me that CPU time was being wasted in the node with id "blabla". With this id I could check in my flows which node caused the problem.
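As a side note for anyone who can change how their Node-RED process is started: Node.js itself (v12+) can also write a .cpuprofile via the `--cpu-prof` flag, as an alternative to the profiler node. A minimal sketch, profiling a throwaway busy loop just to show the mechanics:

```shell
# Run a Node.js process with CPU profiling enabled; on exit it writes a
# CPU.<timestamp>.<pid>...cpuprofile file into the working directory,
# which can be loaded into Chrome DevTools exactly like the one above.
cd "$(mktemp -d)"
node --cpu-prof -e 'let s = 0; for (let i = 0; i < 1e7; i++) s += i;'
ls *.cpuprofile
```

For Node-RED this would mean launching it as `node --cpu-prof /path/to/node-red` (path depends on your install) and letting it exit cleanly, which is less convenient in Docker than triggering the profiler node on demand.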
Try to find the cause:
In my case it was a function node behind a catch node, which I used to create messages to be sent via Telegram to my smartphone. This function was intended to notify me quickly if anything in my smart home and automation system fails.
This catch node belongs to a weather cloud API that reads several thermal sensors in my house. Apparently this API is sometimes unreachable, which leads to errors and triggers the catch node.
It seems the HTTP status code is not passed along to the catch node and therefore could not be included in the Telegram message. I removed the appending of msg.statusCode and now everything runs fine. Sadly, Node-RED didn't give me something like a null pointer error or similar. Or maybe that secondary error was caught by the catch node too, and that's how I got the loop.
I hope you can follow what I did and which steps I took to solve the problem. Feel free to ask for more details.
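A defensive version of such a function node can be sketched as below. This is a self-contained illustration with made-up node and field names, not the actual flow; the key idea is that if the function behind a catch node throws, the catch node fires again and you can end up in exactly this kind of tight error loop.

```javascript
// Build the alert text while guarding every optional property, since a
// message routed through a catch node is not guaranteed to carry them.
function buildAlert(msg) {
  const source = msg.error?.source?.name ?? "unknown node";
  const reason = msg.error?.message ?? "unknown error";
  // statusCode is not always present on error messages, so never assume it:
  const status = msg.statusCode !== undefined ? ` (HTTP ${msg.statusCode})` : "";
  return { payload: `Failure in ${source}: ${reason}${status}` };
}

// A message roughly as a catch node might deliver it, without any statusCode:
const out = buildAlert({
  error: { message: "ECONNREFUSED", source: { name: "weather api" } },
});
console.log(out.payload); // Failure in weather api: ECONNREFUSED
```

Wrapping the whole function body in a try/catch that falls back to a fixed "alerting itself failed" payload is another way to make sure the alerting path can never re-trigger the catch node.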
Thank you all, especially @BartButenaers and have a nice day!