Machine times between reboots/lockups reducing

Thanks both.

But I do agree with Julian in a lot of ways. Better to give the O/S a bit more memory to use than not.

Yes, things have been a lot more stable since doing it.

However, I think it will need a few more weeks to be sure.

All this is good stuff, but I have to remind you that the reboot you showed at the start of this thread was not caused by out of memory. It was a sudden restart such as would be caused by a PSU problem or a power fail.

Ok, today I noticed the RPZ (W) had locked up.

Luckily I could SSH to it.

So looking in the syslog around 17:09 I see this:

Jun 14 14:16:04 TelePi Node-RED[208]: 14 Jun 14:16:04 - [info] [cronplus:Monthly] Refreshing running schedules
Jun 14 14:16:04 TelePi Node-RED[208]: 14 Jun 14:16:04 - [info] [cronplus:Monthly] System Time Change Detected!
Jun 14 14:16:04 TelePi Node-RED[208]: 14 Jun 14:16:04 - [info] [cronplus:Monthly] Refreshing running schedules
Jun 14 14:16:04 TelePi Node-RED[208]: 14 Jun 14:16:04 - [info] [cronplus:Monthly] System Time Change Detected!
Jun 14 14:16:04 TelePi Node-RED[208]: 14 Jun 14:16:04 - [info] [cronplus:Monthly] Refreshing running schedules
Jun 14 14:16:10 TelePi Node-RED[208]: 14 Jun 14:16:10 - [error] [function:MQTT Decoder] Function tried to send a message of type string
Jun 14 14:17:02 TelePi CRON[9327]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Jun 14 14:48:27 TelePi wpa_supplicant[298]: wlan0: WPA: Group rekeying completed with 24:f5:a2:b2:2a:07 [GTK=CCMP]
Jun 14 15:17:01 TelePi CRON[13720]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Jun 14 15:48:27 TelePi wpa_supplicant[298]: wlan0: WPA: Group rekeying completed with 24:f5:a2:b2:2a:07 [GTK=CCMP]
Jun 14 16:17:01 TelePi CRON[18110]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Jun 14 16:48:27 TelePi wpa_supplicant[298]: wlan0: WPA: Group rekeying completed with 24:f5:a2:b2:2a:07 [GTK=CCMP]
Jun 14 17:09:30 TelePi Node-RED[208]: 14 Jun 17:09:29 - [info] [cronplus:Monthly] System Time Change Detected!
Jun 14 17:09:30 TelePi Node-RED[208]: 14 Jun 17:09:29 - [info] [cronplus:Monthly] Refreshing running schedules
Jun 14 17:09:32 TelePi Node-RED[208]: 14 Jun 17:09:31 - [info] [cronplus:Monthly] System Time Change Detected!
Jun 14 17:09:32 TelePi Node-RED[208]: 14 Jun 17:09:31 - [info] [cronplus:Monthly] Refreshing running schedules
Jun 14 17:09:33 TelePi Node-RED[208]: 14 Jun 17:09:33 - [info] [cronplus:Monthly] System Time Change Detected!
Jun 14 17:09:33 TelePi Node-RED[208]: 14 Jun 17:09:33 - [info] [cronplus:Monthly] Refreshing running schedules
Jun 14 17:10:16 TelePi Node-RED[208]: 14 Jun 17:10:15 - [info] [cronplus:Monthly] System Time Change Detected!
Jun 14 17:10:16 TelePi Node-RED[208]: 14 Jun 17:10:16 - [info] [cronplus:Monthly] Refreshing running schedules
Jun 14 17:10:18 TelePi Node-RED[208]: 14 Jun 17:10:18 - [info] [cronplus:Monthly] System Time Change Detected!
Jun 14 17:10:18 TelePi Node-RED[208]: 14 Jun 17:10:18 - [info] [cronplus:Monthly] Refreshing running schedules
Jun 14 17:10:21 TelePi Node-RED[208]: 14 Jun 17:10:21 - [info] [cronplus:Monthly] System Time Change Detected!
Jun 14 17:10:22 TelePi Node-RED[208]: 14 Jun 17:10:21 - [info] [cronplus:Monthly] Refreshing running schedules
Jun 14 17:10:32 TelePi Node-RED[208]: 14 Jun 17:10:31 - [info] [cronplus:Monthly] System Time Change Detected!

Does that help with what is going on?

Yes, I agree.

But there seem to be a few problems working together to annoy me.

I can't ask one lot to behave while I look at the other.

Many of the problems have gone away on some of the machines.

All I can do is process what I see and try to remove problems one at a time.

Also as you can see in the screenshot something crashed at 17:25. It looks like it is something to do with accessing the SD card. You have a hardware issue of some sort. SD Card or power supply or Pi hardware or something similar. Nothing to do with node-red apparently.

This suggests you have some serious blocking in your flows, enough for nodejs to not service the cron plus node within 5 whole seconds.

Are you running some heavy process or do you have a loop that is causing the event queue to be blocked?
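
To illustrate the idea, here is a minimal Python sketch (not cronplus's actual code, and Python rather than Node.js). It assumes the check works by comparing the gap between timer ticks with the expected interval: if a tick arrives far too late, that looks the same as the clock having jumped. The name `find_stalls`, the 1-second interval and the sample tick times are made up for the example; the 5-second threshold comes from the figure mentioned above.

```python
from typing import List, Tuple


def find_stalls(tick_times: List[float],
                interval: float = 1.0,
                threshold: float = 5.0) -> List[Tuple[float, float]]:
    """Return (tick_time, lateness) for every tick that arrived more than
    `threshold` seconds later than expected.

    tick_times: timestamps (in seconds) at which a periodic timer fired.
    interval:   how often the timer was supposed to fire (assumed value).
    threshold:  how late a tick may be before it is flagged.
    """
    stalls = []
    for previous, current in zip(tick_times, tick_times[1:]):
        lateness = (current - previous) - interval
        if lateness > threshold:
            stalls.append((current, lateness))
    return stalls


# Hypothetical tick times: the timer fires every second, then nothing
# services it for 9 seconds (as a blocked event loop would cause).
ticks = [0.0, 1.0, 2.0, 11.0, 12.0]
print(find_stalls(ticks))
```

Only the tick at 11.0 is flagged, because it is the only one that arrived more than 5 seconds late. A real clock change and a blocked event loop are indistinguishable to a check like this, which is why repeated "System Time Change Detected!" messages can point to blocking.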

A file is written to every.... 30 seconds.

It time stamps when the flow is last active.

It is/was about 17:25 and I noticed things weren't working.

When I looked at the last_alive file 17:09 was the last time stamp.

So - correct me if I am wrong - the flow/script/what ever kind of stopped then.
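
A check like that can be automated from outside Node-RED. The following is a rough Python sketch, assuming the file's modification time reflects the last write. The function names and the 90-second limit (three missed 30-second writes) are assumptions for the example, not something taken from the flow.

```python
import os
import time
from typing import Optional


def seconds_since_update(path: str, now: Optional[float] = None) -> float:
    """Seconds elapsed since `path` was last modified."""
    if now is None:
        now = time.time()
    return now - os.path.getmtime(path)


def is_stale(path: str, max_age: float = 90.0,
             now: Optional[float] = None) -> bool:
    """True if the heartbeat file has not been written for `max_age` seconds.

    max_age defaults to 90 s, i.e. three missed 30-second writes
    (an assumed tolerance, adjust to taste).
    """
    return seconds_since_update(path, now) > max_age
```

Run from a separate cron job or script, `is_stale(...)` with the path to the last_alive file would report a stall even when Node-RED itself has stopped responding. Note that it cannot tell a stopped flow apart from a USB stick that has stopped accepting writes.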

As I could SSH to it - or I wouldn't be posting what I see/saw - I stopped NR and restarted it.
Probably around 17:25.

So if it isn't Node-red, why hasn't the last_alive file (on a USB stick) been written to since 17:09?

Shrug.

Every month (on the first) it backs up the logs. Well, I'll call it backing up: it copies the existing log to a date-stamped one, which shows me which month it is for.

Today is 14. No node AFAIK (or cronplus schedule) should be active.

Just wondering if this is also not helping - happening daily:

Starting Daily apt upgrade and clean activities

Would it hurt to turn it off?

My best guess is that at 17:09 there is a problem accessing the SD card to write the file, so the cron task clogs up. Eventually then you get the kernel crashes with the error shown. Did it reboot after the kernel crash? I am deducing that it is a card related problem because the crash is in mmc_get_card. That is mostly a guess though. Note that a card related problem could be the card itself, the pi hardware or the power supply, or possibly other hardware related issues.

No, I noticed it had stopped saying "I am alive" and could (luckily) SSH into it. (As stated.)

I stopped NR and restarted it: node-red-stop, then node-red-start.
I don't like node-red-restart only in that I don't get to see where it is at.

It restarted and all is/was good.

Power supply is the next thing to be checked.

But it has been about 6 days since it was last kicked. I think.
That is pretty good all things considered.

I think it was rebooted when I activated the swap file.

So 2 of the three are looking a lot better.
1 is still a bit.... vague. It is behaving itself too as it was also included in the swap file activation.

But other gremlins are still hanging around. I've mentioned them in another post but never got any replies to that. Oh well.

So: I'll give it another week to see what happens.

If it is still problematic, I may have to change the power supply. Or check it.
(I will need arm extensions as it is in a difficult place to reach.)
But isn't that part and parcel with all/most problems? :wink:

What followed after the kernel crash?

Well, good news/bad news. (Mostly bad)

Screenshot from 2021-06-15 18-48-40

It locked up again today.

I could get to it still (again?) but the CPU was at 100% and I couldn't really get it back under control.

I forced a power-down (not the preferred way) and, while at it, moved it to a different supply source.

Shall see what happens in the next 24 hours.

Doubt that is caused by a bad PSU

Agreed, but I am reporting what I see so no one can say "You didn't tell us that at the start!"

All I can do is try things.

Alas, with 3 machines it will take time to find the problem/s on them all.

But! 2 of them are playing the game quite well just now.
Though 1 of those 2 is showing another problem which has existed for a long time. (mentioned a couple of posts back and in a whole other thread)

Oh, sorry.

I changed the power supply only because: If I am going to pull the plug on it and power it down, I may as well use that as the time to then connect it to a new/other supply and see if the problem goes away or remains.

No sorry, worth trying anyway

No problems. I just thought I had better explain why I did that.

Again: it is so that all the cards are on the table about what has happened.

No it isn't. You can analyse and record the symptoms each time you get a problem. You haven't told us what happened after the kernel crash on the previous one. Have you looked in syslog to see what symptoms there are on this one?
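
As a starting point for that, here is a small Python sketch that pulls out the syslog lines close to the time a problem was noticed. It assumes the traditional syslog prefix ("Jun 14 17:09:30 ..."), which carries no year, so the year is taken from the time you pass in. The function name `lines_near` is made up for the example, and the year 2021 is an assumption based on the screenshot date.

```python
from datetime import datetime, timedelta
from typing import Iterable, List


def lines_near(lines: Iterable[str], when: datetime,
               minutes: int = 5) -> List[str]:
    """Return syslog lines whose timestamp is within `minutes` of `when`.

    Assumes the traditional syslog prefix "Mon DD HH:MM:SS", which has no
    year, so the year is borrowed from `when`.
    """
    window = timedelta(minutes=minutes)
    selected = []
    for line in lines:
        try:
            stamp = datetime.strptime(f"{when.year} {line[:15]}",
                                      "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # not a timestamped line (e.g. a wrapped message)
        if abs(stamp - when) <= window:
            selected.append(line.rstrip("\n"))
    return selected


# Three lines copied from the excerpt earlier in the thread.
sample = [
    "Jun 14 16:48:27 TelePi wpa_supplicant[298]: wlan0: WPA: Group rekeying completed with 24:f5:a2:b2:2a:07 [GTK=CCMP]",
    "Jun 14 17:09:30 TelePi Node-RED[208]: 14 Jun 17:09:29 - [info] [cronplus:Monthly] System Time Change Detected!",
    "Jun 14 17:10:32 TelePi Node-RED[208]: 14 Jun 17:10:31 - [info] [cronplus:Monthly] System Time Change Detected!",
]
for line in lines_near(sample, datetime(2021, 6, 14, 17, 9), minutes=2):
    print(line)
```

With a 2-minute window around 17:09 this prints the two Node-RED lines and skips the 16:48 one. To use it on the real log, pass an open file instead of `sample`; on a Pi the log is typically /var/log/syslog, and reading it may need sudo.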

A lot of that assumes I am sitting here waiting for things to happen so I can catch them.
Yeah, log files......
That also implies I know where they are and how to read them.
Yeah, it is part of the deal learning to read them.

In answer to your last question/s. No and No.
Not yet.

There is only one of me and other things need doing ITMT.
Yeah, so: why complain if you can't put in the time to fix them?
Kind of difficult to know how much time they need if I don't understand the problem.