Machine times between reboots/lockups reducing

Yes, the flows are getting slightly more complicated/better in the mean time, but I am not sure how much of a cause to the problem that is.

I have 3 RasPis running 24/7.

Way back when they would run happily for weeks/months between locking up and being needed to be power reset.

That wasn't too long ago. Maybe a few months.

Now, it seems I am lucky to get a week between the machine rebooting or locking up.
I have a watchdog running that resets the machine if it gets stuck. That works sometimes. Sometimes not. Then I have to power cycle it.

Background on the machines:
RasPi 2 (early one 2 USB ports and minimal GPIO pins) Stretch
RPZ (W) Stretch
RasPi 3b Rev 1.1 Buster.

Not accusing anyone of anything, but some things which I have noticed:
The amount of memory used is increasing - a lot.

I am suspicious there may be a memory leak (I think that is what they are called) which is filling up the usable memory and causing the machine to get locked up.

Weirdly enough the Early machine is the most stable.
There are other things which are being looked into as well. Alas I recently lost a machine twice in a short time. (Another story)

But the frequency that the machines are locking up....

Slightly worrying.

I am sure you already know this, but the first thing to do is to look in /var/log/syslog at what happened before the reboot.

First thing to do is to redirect the syslog files to another machine - so when they do lock up you will have the most up to date file to enable troubleshooting.

If you think something is chewing up memory then it may be worthwhile getting one of the monitoring systems up and running to remotely monitor these machines - Zabbix is a good free one that you can get as Virtual Machines or docker images.

Here are some options

Craig

I have generally found that not necessary. Usually the information you need is in in syslog.

I've seen Craig's reply too.

Colin, you have me at a disadvantage - I believe.
(Though I do have a bad dose of CRAFT just now)

I have (in the past) posted syslogs and the first thing you do is shoot a huge hole in them with the time is wrong - therefore: the log can't be trusted.

I don't know how to reply to that.

To further complicate things:
It isn't like I am sitting here waiting for this to happen and look at the log there and then.

I could be a couple of days - or hours - from when it happened to when I know about it.

A machine is question - which prompted me to ask this question - has been up now for 1 day 5 hours 44 minutes (at time of posting).

I may have to look at Craig's suggestion of getting the log file remotely saved.
Though I fear that is going to be another rabbit hole for me.

I'll have to research more what I need to do.

That only applies from the time of reboot to the time the clock syncs. The log can be trusted, you just have to understand the clock issue for that short sequence. The clock from before the reboot should be correct so it shouldn't be an issue in this case. It is more tricky when you are trying to understand the startup sequence. There should be about 5 days of syslog there. Earlier ones may be gzipped. To find them run
ls -l /var/log/syslog*
You can tell from the modified date which one to look at. The modified date is that of the end of the logfile.

In reply to the command you showed me.....

(Not exactly the same as what you said, but keep reading)

pi@TelePi:/var/log $ lf
apt/                   alternatives.log.5.gz  daemon.log.3.gz  dpkg.log.6.gz   messages.2.gz             syslog.6.gz
cups/                  auth.log               daemon.log.4.gz  dpkg.log.7.gz   messages.3.gz             syslog.7.gz
letsencrypt/           auth.log.1             debug            dpkg.log.8.gz   messages.4.gz             user.log
lightdm/               auth.log.2.gz          debug.1          faillog         nodered-install.log       user.log.1
mosquitto/             auth.log.3.gz          debug.2.gz       fontconfig.log  nodered-install.log.1     user.log.2.gz
samba/                 auth.log.4.gz          debug.3.gz       kern.log        nodered-install.log.2.gz  user.log.3.gz
watchdog/              boot.log               debug.4.gz       kern.log.1      nodered-install.log.3.gz  user.log.4.gz
wicd/                  bootstrap.log          dpkg.log         kern.log.2.gz   syslog                    wtmp
alternatives.log       btmp                   dpkg.log.1       kern.log.3.gz   syslog.1                  wtmp.1
alternatives.log.1     btmp.1                 dpkg.log.2.gz    kern.log.4.gz   syslog.2.gz               Xorg.0.log
alternatives.log.2.gz  daemon.log             dpkg.log.3.gz    lastlog         syslog.3.gz               Xorg.0.log.old
alternatives.log.3.gz  daemon.log.1           dpkg.log.4.gz    messages        syslog.4.gz
alternatives.log.4.gz  daemon.log.2.gz        dpkg.log.5.gz    messages.1      syslog.5.gz
pi@TelePi:/var/log $ exit

syslog only goes back to 06:xx today.

(I won't paste it)

This is from what you asked for:

-rw-r----- 1 root      adm     5634 May 29 19:42 syslog
-rw-r----- 1 root      adm    68383 May 29 06:25 syslog.1
-rw-r----- 1 root      adm     1158 May 28 06:25 syslog.2.gz
-rw-r----- 1 root      adm     1362 May 27 06:25 syslog.3.gz
-rw-r----- 1 root      adm     4387 May 26 06:25 syslog.4.gz
-rw-r----- 1 root      adm    18804 May 25 06:25 syslog.5.gz
-rw-r----- 1 root      adm     1420 May 24 06:25 syslog.6.gz
-rw-r----- 1 root      adm     1362 May 23 06:25 syslog.7.gz
-rw-r----- 1 root      adm      

These .gz files I know are a zip file. How to access it..... Beyond me.

syslog.2.gz?

I always have to look it up, so I am sure you can google just as well as I can.

Top tip

  • install midnight commander (sudo apt install mc)
  • starting by entering mc

This is what I found.

It is all confusing to me, but I opened the one which covers the time in question.

Though there are a couple of indications that it had rebooted more than once.

So, to (maybe) help, I copied from the start to just after the first suspected reboot.

This is what I see:

May 29 06:25:05 TelePi systemd[1]: Started Make remote CUPS printers available locally.
May 29 06:25:07 TelePi liblogging-stdlog:  [origin software="rsyslogd" swVersion="8.24.0" x-pid="218" x-info="http://www.rsyslog.com"] rsyslogd was H
UPed
May 28 06:48:25 TelePi wpa_supplicant[310]: wlan0: WPA: Group rekeying completed with 24:f5:a2:b2:2a:07 [GTK=CCMP]
May 28 06:52:24 TelePi systemd[1]: Starting Daily apt upgrade and clean activities...
May 28 06:52:39 TelePi systemd[1]: Started Daily apt upgrade and clean activities.
May 28 06:52:39 TelePi systemd[1]: apt-daily-upgrade.timer: Adding 32min 30.773660s random time.
May 28 06:52:39 TelePi systemd[1]: apt-daily-upgrade.timer: Adding 37min 29.340678s random time.
May 28 07:15:24 TelePi systemd[1]: Starting Certbot...
May 28 07:16:08 TelePi systemd[1]: Started Certbot.
May 28 07:16:09 TelePi systemd[1]: certbot.timer: Adding 5min 2.345063s random time.
May 28 07:16:09 TelePi systemd[1]: certbot.timer: Adding 5h 50min 9.784448s random time.
May 28 07:17:01 TelePi CRON[6678]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
May 28 07:42:24 TelePi systemd[1]: Starting Daily apt download activities...
May 28 07:42:35 TelePi systemd[1]: Started Daily apt download activities.
May 28 07:42:35 TelePi systemd[1]: apt-daily.timer: Adding 1h 3min 12.366145s random time.
May 28 07:42:35 TelePi systemd[1]: apt-daily.timer: Adding 3h 7min 20.619760s random time.
May 28 07:48:25 TelePi wpa_supplicant[310]: wlan0: WPA: Group rekeying completed with 24:f5:a2:b2:2a:07 [GTK=CCMP]
May 28 08:17:01 TelePi CRON[11898]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
May 28 08:48:25 TelePi wpa_supplicant[310]: wlan0: WPA: Group rekeying completed with 24:f5:a2:b2:2a:07 [GTK=CCMP]
May 28 09:17:01 TelePi CRON[17057]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
May 28 09:48:26 TelePi wpa_supplicant[310]: wlan0: WPA: Group rekeying completed with 24:f5:a2:b2:2a:07 [GTK=CCMP]
May 28 10:17:01 TelePi CRON[22233]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
May 28 10:48:25 TelePi wpa_supplicant[310]: wlan0: WPA: Group rekeying completed with 24:f5:a2:b2:2a:07 [GTK=CCMP]
May 28 11:17:01 TelePi CRON[27413]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
May 28 11:48:25 TelePi wpa_supplicant[310]: wlan0: WPA: Group rekeying completed with 24:f5:a2:b2:2a:07 [GTK=CCMP]
May 28 12:00:01 TelePi CRON[31120]: (root) CMD (test -x /usr/bin/certbot -a \! -d /run/systemd/system && perl -e 'sleep int(rand(43200))' && certbot 
-q renew)
May 28 12:17:01 TelePi CRON[32595]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
May 28 12:48:25 TelePi wpa_supplicant[310]: wlan0: WPA: Group rekeying completed with 24:f5:a2:b2:2a:07 [GTK=CCMP]
May 28 13:17:01 TelePi CRON[5431]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
May 28 13:17:14 TelePi systemd-modules-load[90]: Inserted module 'i2c_dev'
May 28 13:17:14 TelePi systemd-modules-load[90]: Module 'bcm2835_wdt' is builtin
May 28 13:17:14 TelePi fake-hwclock[79]: Fri 28 May 03:17:01 UTC 2021
May 28 13:17:14 TelePi systemd[1]: Time has been changed
May 28 13:17:14 TelePi systemd[1]: Mounting Configuration File System...
May 28 13:17:14 TelePi systemd[1]: Starting Apply Kernel Variables...
May 28 13:17:14 TelePi systemd[1]: Starting File System Check on Root Device...
May 28 13:17:14 TelePi systemd[1]: Starting Create Static Device Nodes in /dev...
May 28 13:17:14 TelePi systemd[1]: Starting pNFS block layout mapping daemon...
May 28 13:17:14 TelePi systemd[1]: Mounted Configuration File System.
May 28 13:17:14 TelePi kernel: [    0.000000] Booting Linux on physical CPU 0x0
May 28 13:17:14 TelePi kernel: [    0.000000] Linux version 4.19.66+ (dom@buildbot) (gcc version 4.9.3 (crosstool-NG crosstool-ng-1.22.0-88-g8460611)
) #1253 Thu Aug 15 11:37:30 BST 2019

I'm guessing it was commanded to reboot by watchdog.

I don't know if this is (now) more under the RasPi per vue (spelling?) than Node-Red's.

Thanks.

I had it installed and did it the way you showed.

That does not show a normal shutdown, and nothing to show there was a problem. Did you power cycle it at some time after 12:17? You would have to look later to where it sets the clock when the internet is established to see exactly what time it was.

This is what I see for reboot times:

(What's going on?)

Ok, normally the machine writes to last_alive.db.
On reboot it renames last_alive.db to Rebooted at ....... with the date/time.
So I know it isn't the best method, but it helps me get an idea of what is happening.

Of course it isn't too good if the machine is locked up for an hour before it reboots.

But since you mentioned 12:17 this is the first block of text from the log:

Scroll down a bit.

May 28 06:48:25 TelePi wpa_supplicant[310]: wlan0: WPA: Group rekeying completed with 24:f5:a2:b2:2a:07 [GTK=CCMP]
May 28 06:52:24 TelePi systemd[1]: Starting Daily apt upgrade and clean activities...
May 28 06:52:39 TelePi systemd[1]: Started Daily apt upgrade and clean activities.
May 28 06:52:39 TelePi systemd[1]: apt-daily-upgrade.timer: Adding 32min 30.773660s random time.
May 28 06:52:39 TelePi systemd[1]: apt-daily-upgrade.timer: Adding 37min 29.340678s random time.
May 28 07:15:24 TelePi systemd[1]: Starting Certbot...
May 28 07:16:08 TelePi systemd[1]: Started Certbot.
May 28 07:16:09 TelePi systemd[1]: certbot.timer: Adding 5min 2.345063s random time.
May 28 07:16:09 TelePi systemd[1]: certbot.timer: Adding 5h 50min 9.784448s random time.
May 28 07:17:01 TelePi CRON[6678]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
May 28 07:42:24 TelePi systemd[1]: Starting Daily apt download activities...
May 28 07:42:35 TelePi systemd[1]: Started Daily apt download activities.
May 28 07:42:35 TelePi systemd[1]: apt-daily.timer: Adding 1h 3min 12.366145s random time.
May 28 07:42:35 TelePi systemd[1]: apt-daily.timer: Adding 3h 7min 20.619760s random time.
May 28 07:48:25 TelePi wpa_supplicant[310]: wlan0: WPA: Group rekeying completed with 24:f5:a2:b2:2a:07 [GTK=CCMP]
May 28 08:17:01 TelePi CRON[11898]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
May 28 08:48:25 TelePi wpa_supplicant[310]: wlan0: WPA: Group rekeying completed with 24:f5:a2:b2:2a:07 [GTK=CCMP]
May 28 09:17:01 TelePi CRON[17057]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
May 28 09:48:26 TelePi wpa_supplicant[310]: wlan0: WPA: Group rekeying completed with 24:f5:a2:b2:2a:07 [GTK=CCMP]
May 28 10:17:01 TelePi CRON[22233]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
May 28 10:48:25 TelePi wpa_supplicant[310]: wlan0: WPA: Group rekeying completed with 24:f5:a2:b2:2a:07 [GTK=CCMP]
May 28 11:17:01 TelePi CRON[27413]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
May 28 11:48:25 TelePi wpa_supplicant[310]: wlan0: WPA: Group rekeying completed with 24:f5:a2:b2:2a:07 [GTK=CCMP]
May 28 12:00:01 TelePi CRON[31120]: (root) CMD (test -x /usr/bin/certbot -a \! -d /run/systemd/system && perl -e 'sleep int(rand(43200))' && certbot 
-q renew)
May 28 12:17:01 TelePi CRON[32595]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
May 28 12:48:25 TelePi wpa_supplicant[310]: wlan0: WPA: Group rekeying completed with 24:f5:a2:b2:2a:07 [GTK=CCMP]
May 28 13:17:01 TelePi CRON[5431]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
May 28 13:17:14 TelePi systemd-modules-load[90]: Inserted module 'i2c_dev'
May 28 13:17:14 TelePi systemd-modules-load[90]: Module 'bcm2835_wdt' is builtin
May 28 13:17:14 TelePi fake-hwclock[79]: Fri 28 May 03:17:01 UTC 2021
May 28 13:17:14 TelePi systemd[1]: Time has been changed
May 28 13:17:14 TelePi systemd[1]: Mounting Configuration File System...
May 28 13:17:14 TelePi systemd[1]: Starting Apply Kernel Variables...
May 28 13:17:14 TelePi systemd[1]: Starting File System Check on Root Device...
May 28 13:17:14 TelePi systemd[1]: Starting Create Static Device Nodes in /dev...

Does that help?

We went over all this last time I helped you to sort out a problem looking at the log. The text fake-hwclock is a clue. There isn't a h/w clock so it uses a fake one to guess what the time is. You have look much further down the log until after the network has connected, at which point it should sync the clock from the internet. However the time of 13:30 from your screenshot looks entirely reasonable.

As to why it rebooted, as I said earlier, it didn't do a normal shutdown so either there was a power fail or something caused it to reset without warning. If there wasn't a power fail then I would be suspecting a hardware issue of some sort. Possibly the PSU or if it is connected to hardware then a spike coming in on one of the wires. Possibly a faulty Pi but that is less likely.

Yeah, ok.

The machine doesn't have a HWC. It usually gets its time from the NTP server.
That is evident in the log.
And as you say 13:17 is about the time of the reset - according to what I can determine.

So you want me to look further down the log to see what caused the reboot?

Ok.

But for what I am looking?
I don't want to dump the whole log here and expect you to go through it.
(That isn't fair to you)

But if I don't know for what I am looking, it makes it difficult for me to know what to post/show/share.

Would/could this be it?

May 28 13:17:14 TelePi kernel: [    2.821749] usb-storage 1-1:1.0: USB Mass Storage device detected
May 28 13:17:14 TelePi kernel: [    2.836066] scsi host0: usb-storage 1-1:1.0
May 28 13:17:14 TelePi kernel: [    3.410834] EXT4-fs (mmcblk0p2): orphan cleanup on readonly fs
May 28 13:17:14 TelePi kernel: [    3.447763] EXT4-fs (mmcblk0p2): 7 orphan inodes deleted
May 28 13:17:14 TelePi kernel: [    3.459120] EXT4-fs (mmcblk0p2): recovery complete
May 28 13:17:14 TelePi kernel: [    3.497915] EXT4-fs (mmcblk0p2): mounted filesystem with ordered data mode. Opts: (null)
May 28 13:17:14 TelePi kernel: [    3.512317] VFS: Mounted root (ext4 filesystem) readonly on device 179:2.
May 28 13:17:14 TelePi kernel: [    3.529617] devtmpfs: mounted

No, further down the log you should find it connecting to the network and getting the actual time, if you want to know that.

There is no information about what caused the reboot, that would be before the start of boot. There is nothing there, so it just suddenly rebooted for some reason. It did not shut down normally, just restarted as you would get from a power cycle or hitting the reset button.

Ok, thanks.

So that wasn't caused by the watchdog?

He is watching the system too.

The time is not of importance. I am trying to work out why:
Originally I would/could get a couple of weeks between resets on the machine/s.

Now I am lucky to get a week.

AFAIK only the software has been slowly edited and nodes updated.

If any of the updated nodes have memory leaks...... It would be problematic.
(No accusations meant by that. Just saying.)

If the watchdog works by commanding a reboot using the shutdown command, or similar, then no it wasn't caused by that because then you would see dozens, if not hundreds, of messages about processes shutting down before the messages showing it rebooting.

A pi should run for years without restarting, if there isn't a power fail or an intentional reboot.

That isn't the problem here as you would see the out of memory messages and node-red would restart, but it wouldn't cause a reboot, and certainly not a sudden reboot without a normal shutdown.

Almost certainly it is a hardware issue, as I said before.

Well, I hope you understand I have to ask the questions to be sure.

I can't just accept what is said with no...... backing.

Seems other things need to be looked at / into.

I agree with Colin here, this feels like a hardware issue.

Do you have a spare Pi you can try?

If not, at least test your power supply. Try using a USB power tester to make sure that the Pi is actually getting 5.1v and that the PSU can deliver enough Amps.
Power Supply - Raspberry Pi Documentation