Yes, the flows are getting slightly more complicated/better in the mean time, but I am not sure how much of a cause to the problem that is.
I have 3 RasPis running 24/7.
Way back when they would run happily for weeks/months between locking up and being needed to be power reset.
That wasn't too long ago. Maybe a few months.
Now, it seems I am lucky to get a week between the machine rebooting or locking up.
I have a watchdog running that resets the machine if it gets stuck. That works sometimes. Sometimes not. Then I have to power cycle it.
Background on the machines:
RasPi 2 (early one 2 USB ports and minimal GPIO pins) Stretch
RPZ (W) Stretch
RasPi 3b Rev 1.1 Buster.
Not accusing anyone of anything, but some things which I have noticed:
The amount of memory used is increasing - a lot.
I am suspicious there may be a memory leak (I think that is what they are called) which is filling up the usable memory and causing the machine to get locked up.
Weirdly enough the Early machine is the most stable.
There are other things which are being looked into as well. Alas I recently lost a machine twice in a short time. (Another story)
But the frequency that the machines are locking up....
First thing to do is to redirect the syslog files to another machine - so when they do lock up you will have the most up to date file to enable troubleshooting.
If you think something is chewing up memory then it may be worthwhile getting one of the monitoring systems up and running to remotely monitor these machines - Zabbix is a good free one that you can get as Virtual Machines or docker images.
Colin, you have me at a disadvantage - I believe.
(Though I do have a bad dose of CRAFT just now)
I have (in the past) posted syslogs and the first thing you do is shoot a huge hole in them with the time is wrong - therefore: the log can't be trusted.
I don't know how to reply to that.
To further complicate things:
It isn't like I am sitting here waiting for this to happen and look at the log there and then.
I could be a couple of days - or hours - from when it happened to when I know about it.
A machine is question - which prompted me to ask this question - has been up now for 1 day 5 hours 44 minutes (at time of posting).
I may have to look at Craig's suggestion of getting the log file remotely saved.
Though I fear that is going to be another rabbit hole for me.
That only applies from the time of reboot to the time the clock syncs. The log can be trusted, you just have to understand the clock issue for that short sequence. The clock from before the reboot should be correct so it shouldn't be an issue in this case. It is more tricky when you are trying to understand the startup sequence. There should be about 5 days of syslog there. Earlier ones may be gzipped. To find them run ls -l /var/log/syslog*
You can tell from the modified date which one to look at. The modified date is that of the end of the logfile.
That does not show a normal shutdown, and nothing to show there was a problem. Did you power cycle it at some time after 12:17? You would have to look later to where it sets the clock when the internet is established to see exactly what time it was.
Ok, normally the machine writes to last_alive.db.
On reboot it renames last_alive.db to Rebooted at ....... with the date/time.
So I know it isn't the best method, but it helps me get an idea of what is happening.
Of course it isn't too good if the machine is locked up for an hour before it reboots.
But since you mentioned 12:17 this is the first block of text from the log:
Scroll down a bit.
May 28 06:48:25 TelePi wpa_supplicant[310]: wlan0: WPA: Group rekeying completed with 24:f5:a2:b2:2a:07 [GTK=CCMP]
May 28 06:52:24 TelePi systemd[1]: Starting Daily apt upgrade and clean activities...
May 28 06:52:39 TelePi systemd[1]: Started Daily apt upgrade and clean activities.
May 28 06:52:39 TelePi systemd[1]: apt-daily-upgrade.timer: Adding 32min 30.773660s random time.
May 28 06:52:39 TelePi systemd[1]: apt-daily-upgrade.timer: Adding 37min 29.340678s random time.
May 28 07:15:24 TelePi systemd[1]: Starting Certbot...
May 28 07:16:08 TelePi systemd[1]: Started Certbot.
May 28 07:16:09 TelePi systemd[1]: certbot.timer: Adding 5min 2.345063s random time.
May 28 07:16:09 TelePi systemd[1]: certbot.timer: Adding 5h 50min 9.784448s random time.
May 28 07:17:01 TelePi CRON[6678]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
May 28 07:42:24 TelePi systemd[1]: Starting Daily apt download activities...
May 28 07:42:35 TelePi systemd[1]: Started Daily apt download activities.
May 28 07:42:35 TelePi systemd[1]: apt-daily.timer: Adding 1h 3min 12.366145s random time.
May 28 07:42:35 TelePi systemd[1]: apt-daily.timer: Adding 3h 7min 20.619760s random time.
May 28 07:48:25 TelePi wpa_supplicant[310]: wlan0: WPA: Group rekeying completed with 24:f5:a2:b2:2a:07 [GTK=CCMP]
May 28 08:17:01 TelePi CRON[11898]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
May 28 08:48:25 TelePi wpa_supplicant[310]: wlan0: WPA: Group rekeying completed with 24:f5:a2:b2:2a:07 [GTK=CCMP]
May 28 09:17:01 TelePi CRON[17057]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
May 28 09:48:26 TelePi wpa_supplicant[310]: wlan0: WPA: Group rekeying completed with 24:f5:a2:b2:2a:07 [GTK=CCMP]
May 28 10:17:01 TelePi CRON[22233]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
May 28 10:48:25 TelePi wpa_supplicant[310]: wlan0: WPA: Group rekeying completed with 24:f5:a2:b2:2a:07 [GTK=CCMP]
May 28 11:17:01 TelePi CRON[27413]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
May 28 11:48:25 TelePi wpa_supplicant[310]: wlan0: WPA: Group rekeying completed with 24:f5:a2:b2:2a:07 [GTK=CCMP]
May 28 12:00:01 TelePi CRON[31120]: (root) CMD (test -x /usr/bin/certbot -a \! -d /run/systemd/system && perl -e 'sleep int(rand(43200))' && certbot
-q renew)
May 28 12:17:01 TelePi CRON[32595]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
May 28 12:48:25 TelePi wpa_supplicant[310]: wlan0: WPA: Group rekeying completed with 24:f5:a2:b2:2a:07 [GTK=CCMP]
May 28 13:17:01 TelePi CRON[5431]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
May 28 13:17:14 TelePi systemd-modules-load[90]: Inserted module 'i2c_dev'
May 28 13:17:14 TelePi systemd-modules-load[90]: Module 'bcm2835_wdt' is builtin
May 28 13:17:14 TelePi fake-hwclock[79]: Fri 28 May 03:17:01 UTC 2021
May 28 13:17:14 TelePi systemd[1]: Time has been changed
May 28 13:17:14 TelePi systemd[1]: Mounting Configuration File System...
May 28 13:17:14 TelePi systemd[1]: Starting Apply Kernel Variables...
May 28 13:17:14 TelePi systemd[1]: Starting File System Check on Root Device...
May 28 13:17:14 TelePi systemd[1]: Starting Create Static Device Nodes in /dev...
We went over all this last time I helped you to sort out a problem looking at the log. The text fake-hwclock is a clue. There isn't a h/w clock so it uses a fake one to guess what the time is. You have look much further down the log until after the network has connected, at which point it should sync the clock from the internet. However the time of 13:30 from your screenshot looks entirely reasonable.
As to why it rebooted, as I said earlier, it didn't do a normal shutdown so either there was a power fail or something caused it to reset without warning. If there wasn't a power fail then I would be suspecting a hardware issue of some sort. Possibly the PSU or if it is connected to hardware then a spike coming in on one of the wires. Possibly a faulty Pi but that is less likely.
The machine doesn't have a HWC. It usually gets its time from the NTP server.
That is evident in the log.
And as you say 13:17 is about the time of the reset - according to what I can determine.
So you want me to look further down the log to see what caused the reboot?
Ok.
But for what I am looking?
I don't want to dump the whole log here and expect you to go through it.
(That isn't fair to you)
But if I don't know for what I am looking, it makes it difficult for me to know what to post/show/share.
Would/could this be it?
May 28 13:17:14 TelePi kernel: [ 2.821749] usb-storage 1-1:1.0: USB Mass Storage device detected
May 28 13:17:14 TelePi kernel: [ 2.836066] scsi host0: usb-storage 1-1:1.0
May 28 13:17:14 TelePi kernel: [ 3.410834] EXT4-fs (mmcblk0p2): orphan cleanup on readonly fs
May 28 13:17:14 TelePi kernel: [ 3.447763] EXT4-fs (mmcblk0p2): 7 orphan inodes deleted
May 28 13:17:14 TelePi kernel: [ 3.459120] EXT4-fs (mmcblk0p2): recovery complete
May 28 13:17:14 TelePi kernel: [ 3.497915] EXT4-fs (mmcblk0p2): mounted filesystem with ordered data mode. Opts: (null)
May 28 13:17:14 TelePi kernel: [ 3.512317] VFS: Mounted root (ext4 filesystem) readonly on device 179:2.
May 28 13:17:14 TelePi kernel: [ 3.529617] devtmpfs: mounted
No, further down the log you should find it connecting to the network and getting the actual time, if you want to know that.
There is no information about what caused the reboot, that would be before the start of boot. There is nothing there, so it just suddenly rebooted for some reason. It did not shut down normally, just restarted as you would get from a power cycle or hitting the reset button.
If the watchdog works by commanding a reboot using the shutdown command, or similar, then no it wasn't caused by that because then you would see dozens, if not hundreds, of messages about processes shutting down before the messages showing it rebooting.
A pi should run for years without restarting, if there isn't a power fail or an intentional reboot.
That isn't the problem here as you would see the out of memory messages and node-red would restart, but it wouldn't cause a reboot, and certainly not a sudden reboot without a normal shutdown.
Almost certainly it is a hardware issue, as I said before.
I agree with Colin here, this feels like a hardware issue.
Do you have a spare Pi you can try?
If not, at least test your power supply. Try using a USB power tester to make sure that the Pi is actually getting 5.1v and that the PSU can deliver enough Amps. Power Supply - Raspberry Pi Documentation