Advice needed on Pi CPU usage

Good day peeps!!

First off... I am at kindergarten level... Small words only!!

I have a fairly large flow set that runs my home automation system... There is a fair amount of to and fro on MQTT from the inverter and Sonoff/Tasmota switches that I control and monitor with the Pi (EmonCMS).

My CPU usage on the Pi (Pi 4 / 4 GB RAM / 16 GB SD) hovers around the 15 to 25% mark as displayed on a management dash. When live-imaging the SD card for backup it can go as high as 80 to 90%, but it drops back to the "normal" CPU occupancy once the image process finishes.

At a random point, though, the CPU will suddenly jump to 100% and everything comes to a grinding halt. This can be anything from 24 hrs to 120 hrs of operation, sometimes longer, making it difficult, to say the least, to track... (regardless of time of day or solar activity).

Optimising node traffic as far as possible (with my limited knowledge) has not been successful...

As to when it started, it's a while back - bear in mind this is a 3-year-old install, plus a bit... The bulk of the problems seem to have come about when migrating from a Pi 3 to a Pi 4; that's about all I can surmise... Unfortunately, with the Pi 4, additional modules were written in soon after, as there was extra horsepower to use... possibly compounding a dormant problem...

Furthermore, this being an "old" install of Node.js and Node-RED (and everything else), settings and original parameters have been "carried forward" - possibly being the actual "root" of the problem...

From my dinosaur-based experience and knowledge, the problem seems to be of an "open files" nature... almost as if a setting has been missed that should "time out and close" used files... These accumulate over a number of hours, eventually causing mayhem...
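For anyone wanting to test the same "open files" theory, here is a rough sketch of how to check it - this assumes lsof is installed and that node-red is the first process worth looking at:

```
# Per-process soft limit on open file descriptors (often 1024 by default):
ulimit -n

# Top 10 processes by number of open file descriptors (needs lsof: sudo apt install lsof):
sudo lsof 2>/dev/null | awk '{print $1, $2}' | sort | uniq -c | sort -rn | head -10

# Or count the descriptors of a single process, e.g. Node-RED:
pid=$(pgrep -f node-red | head -1)
[ -n "$pid" ] && sudo ls /proc/"$pid"/fd | wc -l
```

If those counts keep climbing over hours and never come back down, that points at a descriptor leak rather than a plain CPU problem.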

Quite possibly too (in fact more than likely), there might be a bit of "ignorant bad programming practice" involved as well... I am using this platform to learn as I go...

Any advice on where and what to check would be of great assistance!

Regds
Ed

What process is it that is hogging the processor? If you run top in a terminal and leave it running then you will be able to see when it clogs up.
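If sitting at a terminal isn't practical, one option is to run top in batch mode so the snapshots survive a crash (the path and interval below are only examples):

```
# Append a `top` snapshot every 60 seconds to a file you can read after a reboot:
nohup top -b -d 60 -n 1440 >> /home/pi/toplog.txt 2>&1 &

# After the next lockup, look at the last snapshots taken before it died:
tail -n 200 /home/pi/toplog.txt
```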

Hi @Colin

This is where it gets interesting... I have, on the odd occasion, managed to fire up top and have a look-see... I shut down the "hogging" process, and the next one steps into its place... A real funny... (this was on one or two occasions).

Unfortunately, by the time I can get to a terminal and try and take a peek at the processes, the snowball has grown to an extent that the unit is pretty much unresponsive... Only a hard power off and repower gets it rolling again...

Not being clued up enough to even scratch the surface, I feel a bit like the average monkey trying to understand Shakespeare... Lol

Somewhere, some time ago, I saw a mention of start-up parameters specific to the Pi, and I am wondering whether this is what has got screwed up in the trail of hardware/software updates over the years somehow...

Unfortunately, I don't even know where to track down a "fully populated" and commented setup file for this beastie to compare the current one to...

Regds
Ed

You say you shut down the hogging process - what was it?

It was a program called "solpiplog"... an MQTT "interface" to a solar inverter... Incidentally, this software has given little to no problems in the past... except once, where an update of an update came out in short order to solve a similar problem...

Here it's running away quietly, as it should:
[screenshot: Node-RED flow editor at 192.168.0.118:1880]

And this is the front end that Node-RED et al. feeds:
[screenshot: dashboard front end at 192.168.0.118]

I wonder whether you have an MQTT loop somewhere.

Do you have switches, sliders, buttons, text inputs etc. where MQTT is fed in at the front and the output sends off to MQTT again?

Not to my knowledge... That is a trap I fell into early on, spotted after deploying, and sorted out almost immediately... Let me go through my flows again and confirm...

Edit: All looks clear, nothing on that side as such... I am thinking it's more to do with my infantile programming attempts in some function nodes... mutter... mutter...

Ok, so here's an "aside" question:

In my flows I have a standard "Control Node" that I put together... Essentially, it looks at a few flow/global variables and then watches an input - the input is 0/1/2 - 0 being manual off, 1 being manual on, 2 being "auto" mode...

This control node is duplicated on most of my controlling flows... From a horsepower/resources point of view, would it be better to convert this to a subflow, or should I just leave it as is?

Ed

It is up to you; it is not going to make any difference to anything significant.

What makes you think the processor hogging is anything to do with node-red rather than solpiplog?

Yeh...figured as much re the subflow...just wanted to confirm...

Solpiplog hasn't featured high on the htop list since then... Even when I did shut it down, another process stepped up in the HP department and hogged the CPU straight after... It was as if the processes had each queued themselves up to hog the CPU when they got a chance... kinda like a file-handle shortage or similar... difficult to explain... I can't, and I was watching it...

The randomness of the fall-over is what gets me... irritatingly frequent, but not frequent enough to wait for, and seemingly random enough not to tie it to a particular situation...

E

Edit: Is there a parameter for Node-RED that can be set to more aggressively flush static or superfluous memory contents, rather than leave them cached?
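(Aside: as far as I know Node-RED has no explicit "flush" setting, but the Node.js heap can be capped so the garbage collector kicks in earlier rather than letting memory grow until the Pi swaps. The 256 MB below is only an example value.)

```
# If Node-RED is started manually on a Pi:
node-red-pi --max-old-space-size=256

# If it runs as the usual systemd service, check whether the service file
# already sets the same option (the path may differ on older installs):
grep -n "max.old.space" /lib/systemd/system/nodered.service
```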

So, to ask the question again, what makes you think it is anything to do with Node-RED? If it were Node-RED then either that or node would be at the top of the list.
Open a terminal to the Pi from your PC and leave top running in it. When it happens, take a screenshot of the terminal.
Are you running an MQTT broker on the Pi? If so, is it Mosquitto?
Next time it happens, kill the broker.
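Assuming it is the stock Mosquitto service, something along these lines would confirm whether it is the culprit and then stop it:

```
# Is mosquitto near the top of the CPU list?
top -b -n 1 | head -15

# Stop the broker (restart later with: sudo systemctl start mosquitto):
sudo systemctl stop mosquitto
```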

You speak about CPU usage; you even mention 100% CPU. How did you come to this conclusion? How did you measure it? Which processes were using most of the CPU?

Reading the rest of the thread, I would certainly also check whether you have a memory issue. If you run out of memory then your Pi becomes very unresponsive due to extensive paging. You won't get this behaviour with a CPU bottleneck in Node-RED: in that case Node-RED usually claims about 25% of total CPU power on a Pi with 4 cores, as the underlying Node.js main loop is single-threaded.
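A quick way to tell the two apart with the standard tools (nothing extra to install):

```
free -h      # how much RAM is free and how much swap is in use
vmstat 5     # watch the si/so columns: sustained swap-in/out means the Pi is paging
```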


@Colin

I am not certain that it's Node-RED that is causing it; I am trying to track down whether it possibly could be.

As to leaving an open terminal on the PC while it's running, that's a no-go... We have just lost a fair amount of electronics to lightning, and they get shut down and unplugged while not in use.

I am running MQTT, but by the time I can get to a terminal, the Pi is DOA - no access, and no broker running either that I can see on the network at that point...

@janvda

I have a logging system on Emon that records Pi stats to help with finding this problem, trying to see whether it coincides with any other particular event... As to which processes are the cause, that is where I am having difficulty in the post-mortems...

Regarding memory usage, CPU and swap, here is an excerpt from the recording, showing the "normal" CPU usage etc.:

Yellow is temperature, red is CPU, purple is memory use, blue is swap space (all in % of available)... The spiky bits at the start of the graph are from deploys and system updates that I triggered manually...

I have cleared all historical data from the system prior to the last 3 days or so, in the unlikely event that a corrupted data file was causing random mayhem (around 3 years' worth)... There was one lock-up since then, around 48 hrs ago, that was so fast and severe the info was lost, as the cache did not get a chance to flush before the crash... I am waiting for the next one to see if I can record and diagnose it...

Noted regarding the process! Good info!!

Hopefully the last lock-up was a random one and nothing will be repeated - hold thumbs!

Regds
Ed

How do you know something is hogging the processor then?

After you reboot it, have a look in /var/log/syslog and see what it shows before the reboot. Post a chunk of it here if necessary.
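Something along these lines would get at the entries from just before the crash (note the systemd journal only keeps previous boots if it is configured to be persistent):

```
# Plain syslog: the previous rotation holds the pre-crash entries if the log has rotated
less /var/log/syslog.1

# Or jump straight to the end of the previous boot in the journal, if kept:
journalctl -b -1 -e
```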

@Colin

Hmmm, interesting syslog... Unfortunately, I am way out of my depth here... There is a continued reference to "one wire slave"... (Yes, I do know what I2C comms etc. are, but have never used them directly on the Pi.) I will disable them under raspi-config and see if there is any improvement...
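(A rough way to see how much of the noise is the 1-Wire driver and whether the overlay is actually enabled - paths are the Raspberry Pi OS defaults:)

```
grep -c -i "w1\|one wire" /var/log/syslog    # how many syslog lines mention it
grep -i "dtoverlay" /boot/config.txt         # is a w1-gpio (or other) overlay configured?
```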

Tnx
Ed

Post the log from a bit before the failure to the start of the boot. You can find that by looking at the timestamps. That may well tell us a lot. If the Pi is really DOA then you may have a hardware issue - a failing PSU, for example. It may be possible to learn a lot from the log.

@Colin

Yeh, realising that... Unfortunately the log has rotated out already... I will keep an eye on it and see what crops up...

As to hardware - it's a bit unlikely that 3x Pi 4s, 2x SD cards and 2x PSUs would all cause the same fault... I swapped them out systematically with no real discernible improvement - except going from a Pi 4 1 GB to a Pi 4 4 GB... That just stretched the gaps between crashes to a more manageable level... hence my initial thinking of "open files" or the like... Everything I am doing is improving the situation (so far) but not solving it...

Let's see what transpires... Originally I was getting uptimes of 90 days plus on the Pi 3 (the longest was 270 or so days), only taking it down so I could clean out fans and heatsinks etc. (it's not in the best spot environmentally)...

Something is nagging at the back of my skull though... (nube-boolean...?) Problems with that node set and an update... It seems to coincide chronologically, but I was going through a bit of a rough time and didn't make notes...

E

Edit: Incidentally... dropping the unused comms protocols through raspi-config seems to have lowered CPU use by about 4% or so since the reboot... Maybe I'm just hopeful... LOL
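For anyone following along, the same interface toggles can be scripted non-interactively (the do_* names below are from the stock raspi-config, 1 = disable; only turn off what you genuinely don't use):

```
sudo raspi-config nonint do_i2c 1
sudo raspi-config nonint do_spi 1
sudo raspi-config nonint do_onewire 1
```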