As to the MQTT Loop regarding a dash switch, I really don't think so... I have been particularly careful to avoid that, having fallen into that type of trap way back....
Swap is also a no go... Swap space used was zero while this whole lot was happening....
@Steve-Mcl
Agreed re the subtlety of the loops - not easy to track, but nothing that I can think of actually comes from one sensor or MQTT sub and goes back to the origin per se... particularly at night when solar is, well, zero.... The vast majority of switching and checking is solar-event driven; the midnight CPU rev-up is when the system is at its quietest processing-wise... About the only things happening then are, well, pretty much nothing...
As an aside: I wanted to test a single CPU core getting busy... An inject node triggering a "Command node" with [ free -m | grep Swap | awk '{print ($3/$2)*100}' ] at roughly 0.1-second intervals.... This pushed all the CPU cores up dramatically, not just a single one...
Does this sound right? I would expect it to push up only one....
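If it helps anyone reproduce it, the shell equivalent of what that inject/exec pairing was doing is roughly this (same command, same interval):
# rough shell equivalent of the inject -> command-node test, run from a terminal
while :; do
  free -m | grep Swap | awk '{print ($3/$2)*100}'
  sleep 0.1
done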
It is easy enough to rule out an MQTT loop by subscribing to "#" in an external client and seeing what is going on. Even if it isn't a loop, this may give you valuable information on what is happening. It is perhaps worth checking even when there doesn't seem to be a problem, in case you see something unexpected.
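For example, assuming the mosquitto command line clients are installed and the broker is on the same machine (adjust -h, and add -u/-P if the broker needs credentials):
# print every message on the broker, with its topic (-v)
mosquitto_sub -h localhost -v -t '#'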
Regarding the CPU use, when I inject at 0.1-second intervals to force some load, the CPU hovers around the 50% mark, but intermittently goes as high as 160%
I ran the subscribe to # but there is just too much traffic to make much head or tail of it... visually, that is... There are around 30 to 40 clients publishing to the broker at 5 or 10 second telemetry intervals...
Regds
Ed
Edit: Mosquitto
Edit II : Sorry, typo - CPU usage of node red process intermittently goes as high as 160%
You will know if there is an MQTT loop, as it will be handling thousands of messages a second, mostly on just a small number of topics.
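If the raw stream is too busy to eyeball, a rough sketch like this counts messages per topic over a 30 second window, which makes a flooding topic stand out (again assuming the mosquitto clients and a local broker):
# count messages per topic for 30 seconds; a loop shows up as a huge count on a handful of topics
timeout 30 mosquitto_sub -h localhost -v -t '#' | awk '{print $1}' | sort | uniq -c | sort -rn | head -20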
It does seem that spawned processes can appear under the node-red value in some situations, leading to values > 100%. I am not sure under what circumstances that occurs. When it was in the fault condition, what was top showing for node-red?
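Incidentally, your free/grep/awk test spawns several child processes each time it fires, so the load spreading across cores is what I would expect there. When it next misbehaves, something along these lines (assuming the command line contains "node-red") will show the node-red process plus anything it has spawned:
# watch the node-red process(es) by pid
top -c -p "$(pgrep -d, -f node-red)"
# list any children spawned by exec nodes etc.
ps -o pid,ppid,pcpu,pmem,etime,cmd --ppid "$(pgrep -f node-red | head -n1)"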
Re loop... I will keep an eye out for that when it happens again...
Re the spawned processes: can't remember the exact figure, but it was well above 100% if memory serves me right. But it was midnight, I was tired, and I did stumble out of bed when the system woke me to check on it... I didn't look much further down the list to see if there were any additional NR-spawned processes taking more CPU oomph...
Incidentally... currently there are 11 PIDs showing up in htop, with one showing the bulk of the action and another 2 or 3 showing minimal CPU use intermittently...
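(I gather htop lists each thread as a separate row by default, so those extra PIDs are probably just node-red's worker threads rather than separate processes; something like this should confirm it:)
# list the threads of the main node-red process (SPID column = thread id)
ps -T -p "$(pgrep -f node-red | head -n1)"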
Did the subscribe to # before it bogged down almost completely; there was a fair amount of message traffic, but not unduly so.... Certainly no topics that seemed to be flooding the market per se... (no duplicate sequences that I could pick up off the cuff...)
Did an NR restart and all is quiet again, back to around 20% CPU occupancy...
Unfortunately I couldn't do a syslog check, as I had clients waiting... Will have to try and see if I can catch it next time... So far it looks like it's a varying time period, from 3 to 5 days or so... (the Pi proper was restarted about a week and a bit ago, as there were a few updates that I ran, one of them recommending a system restart... Prior to that it had a good few days running on the O/S, with an NR restart as required.)
Quite intriguing (and perplexing) to say the least...
Yep.... Memory stays around the 25 to 35% mark.... It doesn't seem to spike during the CPU grunt period either... My gut feel is that a memory leak or recursive program looping can be ruled out...
I have a feeling, still, that it's a fundamental "setup"-type parameter that I am missing... The settings.js file I am running on is pretty skimpy, to say the least...
Any idea where I can find a "fully populated" or documented settings file to take a read through?
I would like to see if there are any settings available to modify the "aggression factor" for closing/clearing/flushing unused/dormant memory and/or buffers in particular...
Somewhere, a long while back, I am sure I saw a settings parameter that was particular to the Pi regarding flushing/clearing of something... Grey matter ain't what it used to be!!
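Thinking about it, it might have been the V8 heap cap that the Pi install script puts into the systemd unit - not sure, the path and value here are from memory, so treat this as a guess:
# look for the Node.js heap limit the Pi install normally sets
grep -i max_old_space /lib/systemd/system/nodered.service
# it usually looks something like:
#   Environment="NODE_OPTIONS=--max_old_space_size=256"
# (caps the V8 heap in MB, which in turn affects how hard the garbage collector works before hitting it)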
One node I am using, Telegram, warns (to quote): "Tip: Polling mode is very robust and easy to set up. Nevertheless it creates more traffic on the network over time." There are 3x Telegram bots that I am using, in an Info/Important/Urgent sort of structure. I have disabled 2 of the bots (only leaving Urgent going), but there does not seem to be any real difference in how long NR runs before the CPU occupancy problem appears...
Regds
Ed
Edit: Spotted this in the syslog at last hogging session:
08:02:23 solpiplog kernel: [589340.746683] TCP: out of memory -- consider tuning tcp_mem
and also:
Apr 3 23:42:18 solpiplog kernel: [213726.990422] TCP: out of memory -- consider tuning tcp_mem
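I will check the kernel's TCP memory figures when it next happens; for reference, the limits the error refers to can be read from proc (as I understand it, the numbers are counted in memory pages, not bytes):
# min / pressure / max TCP memory limits, in pages
cat /proc/sys/net/ipv4/tcp_mem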
I don't know about that error, but Google reveals this, which might or might not be helpful.
You have already determined that it is not a normal out of memory situation as you have checked the memory usage is ok when it is happening.
If you install node-red onto a new system you will get the latest version, with the latest stuff in the settings file. But basically you only need whatever you need; the rest doesn't need to be there. Leaving stuff out is not going to cause the problem.
It still looks like some sort of loop somewhere that gets triggered by an unknown situation - either a loop involving a number of nodes or a loop within a function node. Perhaps all you can do is to drop debug nodes in at what might be appropriate places, set them to log to the console (so the output goes to syslog), disable their browser output so they don't flood the debug sidebar, and wait till it happens again in the hope of getting some illumination from the log.
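On the standard Pi/systemd install the node-red output can be followed live while you wait, e.g. (assuming the service is the usual nodered.service):
# follow the node-red log (the same content that ends up in syslog/journal)
node-red-log
# or directly via the journal
sudo journalctl -f -u nodered.service -o cat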
See my previous post... Found a "default settings file" ... finished an edit as you finished your reply!
dzone.com - interesting read.... Right at the bottom I spotted "net.core.netdev_max_backlog=30000", where it had initially defaulted to "net.core.netdev_max_backlog = 1000" for their situation. This kind of rings a bell from way back.... Under certain conditions the CPU might get "busy" servicing other requests, causing a TCP backlog to develop, and that causing a snowball somewhere... Whilst the CPU (in my case) doesn't seem to be overly busy on the whole, who's to say that I am actually graphing the full occupancy correctly - multithreading.... Aaaargh.... I'm way out of my depth here... A fossil in a silicon age... Lol...
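For my own notes, checking (and temporarily changing) that value would be something along these lines - just the mechanics, not a recommendation:
# current value (defaults to 1000 on most systems)
sysctl net.core.netdev_max_backlog
# temporary change, using the article's figure:
#   sudo sysctl -w net.core.netdev_max_backlog=30000
# to persist, add net.core.netdev_max_backlog=30000 to /etc/sysctl.conf and run: sudo sysctl -p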
Found This:
And the long and short of it is that I am going to look at this when it next hogs:
cat /proc/net/sockstat
Currently, it's running like this:
pi@solpiplog:~ $ cat /proc/net/sockstat
sockets: used 8633
TCP: inuse 80 orphan 0 tw 3062 alloc 8262 mem 8225
UDP: inuse 4 mem 2
UDPLITE: inuse 0
RAW: inuse 0
FRAG: inuse 0 memory 0
The "sockets: used 8633" hovers around the 8.6k mark.....