I know I have mentioned it before, but I now have new data and it has been a while since that thread was used.
Here is a clean start on the problem with more up to date information.
I have an RPz(W) connected to my router's WAP.
Also connected (but by Cat 5) is another RPI.
The RPz pings the router and a real external IP address to determine if the link is there or not.
Said machine has a low load. It is basically doing this:
- Every 18 seconds pinging the router and the external IP address and reporting back.
- Receiving MQTT messages from the other RPI.
- Logging selected stuff to a USB stick.
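To make the ping check concrete, here is a rough Python sketch of the idea. It is not the actual flow (that lives in Node-RED); the function names and the three status strings are made up for illustration, and the `ping` flags assume the Linux iputils version.

```python
import subprocess


def ping_once(host, timeout_s=2):
    """Return True if a single ping to host gets a reply.

    Assumes Linux iputils ping: -c is the packet count, -W the reply timeout in seconds.
    """
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def check_link(router, external, ping=ping_once):
    """Classify the link from two pings: the router first, then the outside world.

    The status names are hypothetical labels, not what the real flow reports.
    """
    if not ping(router):
        return "wap_down"      # can't even reach the router over WiFi
    if not ping(external):
        return "uplink_down"   # router answers, but the external address does not
    return "online"
```

Calling `check_link(router_ip, external_ip)` every 18 seconds (with your own two addresses) gives the same kind of Online/Offline signal that ends up in the logs. The `ping` argument is only there so the logic can be tried without a network.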
I noticed every now and then the RPz would not be visible to (or pingable from) other machines.
I attributed this to lock-ups, so I forced a power cycle.
It would reboot and "all be sweet".
Since then I have added better/more checking and there is something weird happening.
It is still running when I am not seeing it on my network. I know this because I added a little bit of code which, every 18 seconds, writes the time to a file called "last alive".
So if it is running, that file is constantly being updated with the latest time.
When it re/boots, it copies that file to another file whose name is "Rebooted at" followed by the current time.
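Roughly, the "last alive" idea looks like this in Python. This is only a sketch under assumptions: the real thing is a Node-RED flow, the function names are invented, and I have used a rename where a copy would do just as well. The file name and line format are modelled on the listings further down.

```python
import os
from datetime import datetime


def stamp(now):
    """Format a time the way the log files do, e.g. 2019-7-22 06:23:06."""
    return f"{now.year}-{now.month}-{now.day} {now:%H:%M:%S}"


def write_last_alive(path, now, uptime_text):
    """Overwrite the heartbeat file with the latest time and uptime."""
    with open(path, "w") as fh:
        fh.write(f"{stamp(now)} {uptime_text}\n")


def archive_on_boot(path, now):
    """At boot, rename the old heartbeat file to 'Rebooted at <date>.<time>.db'.

    Returns the new path, or None if there was no heartbeat file to archive.
    """
    if not os.path.exists(path):
        return None
    name = f"Rebooted at {now.year}-{now.month}-{now.day}.{now:%H%M%S}.db"
    target = os.path.join(os.path.dirname(path), name)
    os.rename(path, target)
    return target
```

The archived file then holds the last time the machine was known to be running, and its name holds the time it came back up.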
So:
This morning I got up and looked. RPz was "MIA"/"AWOL"/not responding to pings.
I rebooted it and looked. It was running.
Here are some log files to help with the time line.
pi@TelePi:/media/pi/9020-9C27/events $ cat Main
2019-7-17 20:43:36 Main__ - Online
2019-7-20 07:43:33 Main__ - Online
2019-7-21 16:32:13 Main__ - Online
2019-7-22 03:57:21 Main__ - Offline
2019-7-22 03:57:21 Main__ - Online
2019-7-22 04:05:22 Main__ - Offline
2019-7-22 04:09:22 Main__ - Online
2019-7-22 06:26:12 Main__ - Online
pi@TelePi:/media/pi/9020-9C27/events $
You can see clearly that at 04:05:22 the Main WAP was Offline.
It came back Online at 04:09:22
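For anyone who wants to pull the outage lengths out of that log rather than eyeball them, here is a small Python sketch. The parsing assumes every line has exactly the layout shown above (date, time, name, dash, state); the function names are mine.

```python
from datetime import datetime

# The four lines from 2019-7-22 that matter, copied from the log above.
LOG = """\
2019-7-22 03:57:21 Main__ - Offline
2019-7-22 03:57:21 Main__ - Online
2019-7-22 04:05:22 Main__ - Offline
2019-7-22 04:09:22 Main__ - Online
"""


def parse_line(line):
    """Split one log line into (datetime, state)."""
    date_part, time_part, _name, _dash, state = line.split()
    year, month, day = (int(x) for x in date_part.split("-"))
    hour, minute, second = (int(x) for x in time_part.split(":"))
    return datetime(year, month, day, hour, minute, second), state


def outages(text):
    """Return a list of (went_offline, came_online, seconds), one per Offline/Online pair."""
    offline_since = None
    result = []
    for line in text.splitlines():
        if not line.strip():
            continue
        when, state = parse_line(line)
        if state == "Offline":
            offline_since = when
        elif state == "Online" and offline_since is not None:
            seconds = int((when - offline_since).total_seconds())
            result.append((offline_since, when, seconds))
            offline_since = None
    return result


for start, end, seconds in outages(LOG):
    print(f"{start:%H:%M:%S} -> {end:%H:%M:%S}: {seconds} s")
```

On those lines it reports a zero-length blip at 03:57:21 and a 240 second outage from 04:05:22 to 04:09:22.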
Alas the other end (the other RPI) isn't logging what it sees for other reasons.
But, when I look at the reboot log:
pi@TelePi:/media/pi/9020-9C27/logs $ lf
last_alive.db                    Rebooted at 2019-7-21.163212.db
Rebooted at 2019-7-20.074332.db  Rebooted at 2019-7-22.062612.db
pi@TelePi:/media/pi/9020-9C27/logs $ cat Rebooted\ at\ 2019-7-22.062612.db
2019-7-22 06:23:06 up 13 hours, 53 minutes
pi@TelePi:/media/pi/9020-9C27/logs $
I rebooted it at 06:22:06
.
.
Oh. Drats. Ok, maybe not.
OK, but why, when the Main WAP comes back online at 04:09:22, isn't this machine reconnecting?
It would seem that if/when the WAP goes down, the machine locks up.
Yes, it loses the MQTT inputs. So what?
Yes, it can't ping the Router or external IP address. So what? It just can't ping them.
Yet it would seem that this happening is what causes the machine to lock up.
The same thing happened yesterday during the day when the uplink went down. When it came back up I saw that the RPz was not responding to pings.
(The killer there was that when the main link goes down the ROUTER kills its WAP.)
Anyway, anyone have any ideas on what is going on?
This is the flow which I use for boot detection and time stamping, etc.
The top part:
The exec node is polled by the link node. It writes the uptime to a text node.
Below that the date is created and written to a file (Alive) via a gate node.
This node is default closed - it doesn't let data through.
(So how does it work? I'm getting there.)
Below is the boot detector.
It is triggered on boot and injects one message.
That is split and a "Boot detected" text node is activated. That is the function node and text node.
Below that is another button (Ack boot) which when pressed resets the "boot detected" indicator.
The "timestamp" node then splits its output.
One goes to a delay node set for 10 seconds.
That then goes to a function node which sends a command to "OPEN" the gate in the upper part of the flow. That is how the gate is opened for normal writing of times to the file.
The second output of the function node goes to another function (gets current time/date) node and then to a file name node.
This renames the file to what the time is NOW. (With prefixes, etc.)
So, really (allowing for slight variations in timing), the new "Alive" file can't be written to until 10 seconds after the machine has booted.
In those 10 seconds the existing file would be moved (renamed) to what is the current time.
So it should work.
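To check the ordering on paper, here is a tiny Python simulation of the timing. It only models what is described above (a default-closed gate, a rename at boot, the gate opening after 10 seconds); the function name, the event labels and the idea that the rename happens at a single instant are all assumptions for illustration.

```python
def simulate_boot(tick_times, gate_delay_s=10, rename_at_s=0):
    """Replay heartbeat ticks (seconds after boot) against a default-closed gate.

    Returns every event as (seconds_after_boot, description), sorted by time,
    so the order of rename, gate opening and writes can be inspected.
    """
    events = [
        (rename_at_s, "rename old Alive file"),
        (gate_delay_s, "gate opened"),
    ]
    for t in tick_times:
        if t >= gate_delay_s:
            events.append((t, "write Alive"))
        else:
            events.append((t, "tick dropped (gate closed)"))
    return sorted(events)


# Example: heartbeats arriving 2, 20 and 38 seconds after boot.
for when, what in simulate_boot([2, 20, 38]):
    print(f"{when:>3} s  {what}")
```

With the defaults the early tick is dropped and the first write lands well after the rename. Setting `rename_at_s` to something larger than `gate_delay_s` shows the one case where a write could sneak in ahead of the rename.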