Machine times between reboots/lockups reducing

Trying_to_learn · 12 June 2021 00:42

Thanks both.

But I do agree with Julian in a lot of ways. Better to give the O/S a bit more memory to use than not.

Yes, things have been a lot more stable since doing it.

However, I think it will need a few more weeks to be sure.

Colin · 13 June 2021 19:57

All this is good stuff, but I have to remind you that the reboot you showed at the start of this thread was not caused by out of memory. It was a sudden restart such as would be caused by a PSU problem or a power fail.

Trying_to_learn · 14 June 2021 07:29

Ok, today I noticed the RPZ (W) had locked up.

Luckily I could SSH to it.

So looking in the syslog around 17:09 I see this:

Jun 14 14:16:04 TelePi Node-RED[208]: 14 Jun 14:16:04 - [info] [cronplus:Monthly] Refreshing running schedules
Jun 14 14:16:04 TelePi Node-RED[208]: 14 Jun 14:16:04 - [info] [cronplus:Monthly] System Time Change Detected!
Jun 14 14:16:04 TelePi Node-RED[208]: 14 Jun 14:16:04 - [info] [cronplus:Monthly] Refreshing running schedules
Jun 14 14:16:04 TelePi Node-RED[208]: 14 Jun 14:16:04 - [info] [cronplus:Monthly] System Time Change Detected!
Jun 14 14:16:04 TelePi Node-RED[208]: 14 Jun 14:16:04 - [info] [cronplus:Monthly] Refreshing running schedules
Jun 14 14:16:10 TelePi Node-RED[208]: 14 Jun 14:16:10 - [error] [function:MQTT Decoder] Function tried to send a message of type string
Jun 14 14:17:02 TelePi CRON[9327]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Jun 14 14:48:27 TelePi wpa_supplicant[298]: wlan0: WPA: Group rekeying completed with 24:f5:a2:b2:2a:07 [GTK=CCMP]
Jun 14 15:17:01 TelePi CRON[13720]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Jun 14 15:48:27 TelePi wpa_supplicant[298]: wlan0: WPA: Group rekeying completed with 24:f5:a2:b2:2a:07 [GTK=CCMP]
Jun 14 16:17:01 TelePi CRON[18110]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Jun 14 16:48:27 TelePi wpa_supplicant[298]: wlan0: WPA: Group rekeying completed with 24:f5:a2:b2:2a:07 [GTK=CCMP]
Jun 14 17:09:30 TelePi Node-RED[208]: 14 Jun 17:09:29 - [info] [cronplus:Monthly] System Time Change Detected!
Jun 14 17:09:30 TelePi Node-RED[208]: 14 Jun 17:09:29 - [info] [cronplus:Monthly] Refreshing running schedules
Jun 14 17:09:32 TelePi Node-RED[208]: 14 Jun 17:09:31 - [info] [cronplus:Monthly] System Time Change Detected!
Jun 14 17:09:32 TelePi Node-RED[208]: 14 Jun 17:09:31 - [info] [cronplus:Monthly] Refreshing running schedules
Jun 14 17:09:33 TelePi Node-RED[208]: 14 Jun 17:09:33 - [info] [cronplus:Monthly] System Time Change Detected!
Jun 14 17:09:33 TelePi Node-RED[208]: 14 Jun 17:09:33 - [info] [cronplus:Monthly] Refreshing running schedules
Jun 14 17:10:16 TelePi Node-RED[208]: 14 Jun 17:10:15 - [info] [cronplus:Monthly] System Time Change Detected!
Jun 14 17:10:16 TelePi Node-RED[208]: 14 Jun 17:10:16 - [info] [cronplus:Monthly] Refreshing running schedules
Jun 14 17:10:18 TelePi Node-RED[208]: 14 Jun 17:10:18 - [info] [cronplus:Monthly] System Time Change Detected!
Jun 14 17:10:18 TelePi Node-RED[208]: 14 Jun 17:10:18 - [info] [cronplus:Monthly] Refreshing running schedules
Jun 14 17:10:21 TelePi Node-RED[208]: 14 Jun 17:10:21 - [info] [cronplus:Monthly] System Time Change Detected!
Jun 14 17:10:22 TelePi Node-RED[208]: 14 Jun 17:10:21 - [info] [cronplus:Monthly] Refreshing running schedules
Jun 14 17:10:32 TelePi Node-RED[208]: 14 Jun 17:10:31 - [info] [cronplus:Monthly] System Time Change Detected!

Does that help with what is going on?
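
One way to summarise an excerpt like the one above is to count how often a given message appears per minute. This is only a minimal Python sketch, not something Node-RED or cronplus provides. The function name `count_per_minute` and the `year` parameter are made up for illustration, and it assumes the classic syslog timestamp layout (`Jun 14 17:09:30`) with English month names. The sample lines are copied from the log above.

```python
from collections import Counter
from datetime import datetime


def count_per_minute(lines, needle, year=2021):
    """Count lines containing `needle`, grouped by HH:MM of the syslog stamp."""
    counts = Counter()
    for line in lines:
        if needle not in line:
            continue
        # Classic syslog lines carry no year, so one is supplied (assumption).
        stamp = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        counts[stamp.strftime("%H:%M")] += 1
    return counts


sample = [
    "Jun 14 16:48:27 TelePi wpa_supplicant[298]: wlan0: WPA: Group rekeying completed with 24:f5:a2:b2:2a:07 [GTK=CCMP]",
    "Jun 14 17:09:30 TelePi Node-RED[208]: 14 Jun 17:09:29 - [info] [cronplus:Monthly] System Time Change Detected!",
    "Jun 14 17:09:30 TelePi Node-RED[208]: 14 Jun 17:09:29 - [info] [cronplus:Monthly] Refreshing running schedules",
    "Jun 14 17:09:32 TelePi Node-RED[208]: 14 Jun 17:09:31 - [info] [cronplus:Monthly] System Time Change Detected!",
    "Jun 14 17:10:16 TelePi Node-RED[208]: 14 Jun 17:10:15 - [info] [cronplus:Monthly] System Time Change Detected!",
]

counts = count_per_minute(sample, "System Time Change Detected!")
for minute, n in sorted(counts.items()):
    print(minute, n)
```

On a real machine the lines would come from the syslog file itself rather than a hard-coded list.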

Trying_to_learn · 14 June 2021 07:31

Yes, I agree.

But there seem to be a few problems working together to annoy me.

I can't ask one lot to behave while I look at the other.

Many of the problems have gone away on some of the machines.

All I can do is process what I see and try to remove problems one at a time.

Colin · 14 June 2021 07:53

Also as you can see in the screenshot something crashed at 17:25. It looks like it is something to do with accessing the SD card. You have a hardware issue of some sort. SD Card or power supply or Pi hardware or something similar. Nothing to do with node-red apparently.

Steve-Mcl · 14 June 2021 07:53

Trying_to_learn:

Jun 14 17:09:30 TelePi Node-RED[208]: 14 Jun 17:09:29 - [info] [cronplus:Monthly] System Time Change Detected!
Jun 14 17:09:30 TelePi Node-RED[208]: 14 Jun 17:09:29 - [info] [cronplus:Monthly] Refreshing running schedules

This suggests you have some serious blocking in your flows, enough for nodejs to not service the cron plus node within 5 whole seconds.

Are you running some heavy process or do you have a loop that is causing the event queue to be blocked?
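
To make the idea concrete: a timer that should fire at a steady interval will look "late" if something else hogs the event loop. The following is a rough Python sketch of that check, not cronplus's actual code. The 5 second tolerance is taken from the explanation above; the 1 second interval and the name `find_late_ticks` are assumptions for illustration only.

```python
def find_late_ticks(tick_times, interval_s=1.0, tolerance_s=5.0):
    """Return (previous, current, lateness) for every tick that arrived
    more than `tolerance_s` seconds later than the expected interval."""
    late = []
    for prev, cur in zip(tick_times, tick_times[1:]):
        lateness = (cur - prev) - interval_s
        if lateness > tolerance_s:
            late.append((prev, cur, lateness))
    return late


# Hypothetical tick times in seconds: the timer stalls between 2.0 and 9.5.
ticks = [0.0, 1.0, 2.0, 9.5, 10.5]
late = find_late_ticks(ticks)
print(late)
```

A stall like this would show up whether the cause is a tight loop in a function node, a blocking file write, or the whole machine struggling, so on its own it does not say which one is to blame.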

Trying_to_learn · 14 June 2021 07:56

A file is written to every.... 30 seconds.

It timestamps when the flow was last active.

It is/was about 17:25 and I noticed things weren't working.

When I looked at the last_alive file 17:09 was the last time stamp.

So - correct me if I am wrong - the flow/script/whatever kind of stopped then.

As I could SSH to it - or I wouldn't be posting what I see/saw - I stopped NR and restarted it.
Probably around 17:25.

So if it isn't Node-red, why wasn't the last_alive file (on a USB stick) written to after 17:09?
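
A heartbeat file like this can also be checked from outside Node-RED, so a stall gets noticed without someone watching. This is a minimal Python sketch under stated assumptions: the 30 second interval comes from the post above, while the `missed_beats` threshold, the function names and any path passed in are hypothetical. It only reads the file's modification time and does not touch Node-RED.

```python
import os
import time


def heartbeat_age_seconds(path, now=None):
    """Seconds since the heartbeat file was last modified."""
    now = time.time() if now is None else now
    return now - os.path.getmtime(path)


def is_stale(path, interval_s=30.0, missed_beats=3, now=None):
    """True if the file has not been updated for more than
    `missed_beats` heartbeat intervals (threshold is an assumption)."""
    return heartbeat_age_seconds(path, now) > interval_s * missed_beats
```

Calling `is_stale()` with the real location of the last_alive file from a separate cron job would report a stalled flow even when Node-RED itself is too busy to log anything. Note that it cannot tell a stalled flow apart from a write that is blocked on the storage device.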

Trying_to_learn · 14 June 2021 07:59

Shrug.

Every month (on the first) it backs up logs. Well, I'll call it backing up: it copies the existing log to a date-stamped one, which shows me which month it is for.

Today is the 14th. No node AFAIK (or cronplus schedule) should be active.

Trying_to_learn · 14 June 2021 08:49

Just wondering if this is also not helping - happening daily:

Starting Daily apt upgrade and clean activities

Would it hurt to turn it off?

Colin · 14 June 2021 09:09

My best guess is that at 17:09 there is a problem accessing the SD card to write the file, so the cron task clogs up. Eventually then you get the kernel crashes with the error shown. Did it reboot after the kernel crash? I am deducing that it is a card related problem because the crash is in mmc_get_card. That is mostly a guess though. Note that a card related problem could be the card itself, the pi hardware or the power supply, or possibly other hardware related issues.

Trying_to_learn · 14 June 2021 09:15

No, I noticed it had stopped saying "I am alive" and could (luckily) SSH into it. (As stated)

I stopped NR and restarted it: node-red-stop, then node-red-start.
I don't like node-red-restart, only because I don't get to see where it is at.

It restarted and all is/was good.

Power supply is the next thing to be checked.

But it has been about 6 days since it was last kicked. I think.
That is pretty good all things considered.

I think it was rebooted when I activated the swap file.

So 2 of the three are looking a lot better.
1 is still a bit.... vague. It is behaving itself too as it was also included in the swap file activation.

But other gremlins are still hanging around. I've mentioned them in another post but never got any replies to that. Oh well.

So: I'll give it another week to see what happens.

If it is still problematic, I may have to change the power supply. Or check it.
(I will need arm extensions as it is in a difficult place to reach.)
But isn't that part and parcel with all/most problems?

Colin · 14 June 2021 12:52

What followed after the kernel crash?

Trying_to_learn · 15 June 2021 08:50

Well, good news/bad news. (Mostly bad)

Screenshot from 2021-06-15 18-48-40

It locked up again today.

I could get to it still (again?) but the CPU was at 100% and I couldn't really get it back into control.

I forced a power-down (not the preferred way) and, while at it, moved it to a different supply source.

Shall see what happens in the next 24 hours.

krambriw · 15 June 2021 08:56

Doubt that is caused by a bad PSU

Trying_to_learn · 15 June 2021 08:59

Agreed, but I am reporting what I see so no one can say "You didn't tell us that at the start!"

All I can do is try things.

Alas, with 3 machines it will take time to find the problem/s on them all.

But! 2 of them are playing the game quite well just now.
Though 1 of those 2 is showing another problem which has existed for a long time. (mentioned a couple of posts back and in a whole other thread)

Trying_to_learn · 15 June 2021 09:20

Oh, sorry.

I changed the power supply only because: If I am going to pull the plug on it and power it down, I may as well use that as the time to then connect it to a new/other supply and see if the problem goes away or remains.

krambriw · 15 June 2021 09:37

No sorry, worth trying anyway

Trying_to_learn · 15 June 2021 09:39

No problems. I just thought I had better explain why I did that.

Again: it is so all the cards are on the table about what has happened.

Colin · 15 June 2021 09:42

No it isn't. You can analyse and record the symptoms each time you get a problem. You haven't told us what happened after the kernel crash on the previous one. Have you looked in syslog to see what symptoms there are on this one?
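
For recording the symptoms, it can help to pull out just the syslog lines around the time of the problem. This is a small Python sketch, assuming the classic syslog timestamp layout (`Jun 14 17:09:30`) and English month names. The names `parse_stamp` and `lines_in_window` and the `year` parameter are invented for illustration. The sample lines are copied from the excerpt posted earlier in the thread.

```python
from datetime import datetime


def parse_stamp(line, year=2021):
    """Parse the leading syslog timestamp; the year is supplied by the caller."""
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")


def lines_in_window(lines, start, end, year=2021):
    """Keep only lines whose timestamp falls within [start, end]."""
    selected = []
    for line in lines:
        try:
            stamp = parse_stamp(line, year)
        except ValueError:
            continue  # skip lines that do not start with a timestamp
        if start <= stamp <= end:
            selected.append(line)
    return selected


sample = [
    "Jun 14 16:17:01 TelePi CRON[18110]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)",
    "Jun 14 16:48:27 TelePi wpa_supplicant[298]: wlan0: WPA: Group rekeying completed with 24:f5:a2:b2:2a:07 [GTK=CCMP]",
    "Jun 14 17:09:30 TelePi Node-RED[208]: 14 Jun 17:09:29 - [info] [cronplus:Monthly] System Time Change Detected!",
]

window = lines_in_window(
    sample,
    datetime(2021, 6, 14, 17, 0, 0),
    datetime(2021, 6, 14, 17, 30, 0),
)
for line in window:
    print(line)
```

Saving the selected lines alongside a note of what was observed gives a record that can be compared from one lockup to the next.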

Trying_to_learn · 15 June 2021 09:49

A lot of that assumes I am sitting here waiting for things to happen so I can catch them.
Yeah, log files......
That also implies I know where they are and how to read them.
Yeah, it is part of the deal learning to read them.

In answer to your last question/s. No and No.
Not yet.

There is only one of me and other things need doing ITMT.
Yeah so: why complain if you can't put in the time to fix them?
Kind of difficult to know how much time they need if I don't understand the problem.
