RPz(W), Wireless connection - more/new information

I know I have mentioned it before, but I now have new data and it has been a while since that thread was used.

Here is a clean start on the problem with more up to date information.

I have a RPz(W) connected to my router's WAP.
Also connected (but by Cat 5) is another RPI.

The RPz pings the router and a real external IP address to determine if the link is there or not.

Said machine has a low load. It is basically doing this:

  • Every 18 seconds pinging the modem and the external IP address and reporting back.
  • Receiving MQTT messages from the other RPI.
  • Logging selected stuff to a USB stick.

I noticed every now and then the RPz would not appear (or be pingable) from other machines.

This was attributed to lock ups. So a power cycle was forced.

It would reboot and "all be sweet".

Since then I have added better/more checking and there is something weird happening.

It is still running while I can't see it on my network. I know this because I added a little bit of code which, every 18 seconds, writes the time to a file called "last alive".
So if it is running, that file is constantly being updated with the latest time.

When it re/boots, it copies that file to another file prefixed with "Rebooted at" then the current time.
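
For anyone who wants to try the same idea outside Node-RED, here is a minimal Python sketch of the heartbeat and the rename on boot. The function names are made up for illustration, and the time and file name formats are copied from the listings further down this post; the real flow may well differ in detail.

```python
import os
from datetime import datetime


def stamp(now):
    """Format a time like the logs in this post: no zero padding on month or day."""
    return f"{now.year}-{now.month}-{now.day} {now:%H:%M:%S}"


def write_heartbeat(path, now, uptime_text=""):
    """Overwrite the 'last alive' file with the latest time (meant to run every 18 s)."""
    with open(path, "w") as fh:
        fh.write(f"{stamp(now)} {uptime_text}".rstrip() + "\n")


def archive_on_boot(path, now):
    """On boot, rename the old heartbeat file to 'Rebooted at <time>.db'.

    Returns the new path, or None if there was no heartbeat file to keep.
    """
    if not os.path.exists(path):
        return None
    name = f"Rebooted at {now.year}-{now.month}-{now.day}.{now:%H%M%S}.db"
    target = os.path.join(os.path.dirname(path), name)
    os.replace(path, target)  # rename within the same directory
    return target
```

The archived file then holds the last time the machine was known to be running before the reboot, which is the point of the exercise.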

So:
This morning I got up and looked. RPz was "MIA"/"AWOL"/not responding to pings.
I rebooted it and looked. It was running.

Here are some log files to help with the time line.

pi@TelePi:/media/pi/9020-9C27/events $ cat Main
2019-7-17 20:43:36 Main__  - Online
2019-7-20 07:43:33 Main__  - Online
2019-7-21 16:32:13 Main__  - Online
2019-7-22 03:57:21 Main__  - Offline
2019-7-22 03:57:21 Main__  - Online
2019-7-22 04:05:22 Main__  - Offline
2019-7-22 04:09:22 Main__  - Online
2019-7-22 06:26:12 Main__  - Online
pi@TelePi:/media/pi/9020-9C27/events $ 

You can see clearly that at 04:05:22 the Main WAP was Offline.
It came back Online at 04:09:22
Alas the other end (the other RPI) isn't logging what it sees for other reasons.
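
To put numbers on the outages, the log above can be paired up in a few lines of Python. This is only a sketch: pairing each "Offline" with the next "Online" is my assumption about how the log should be read, and the function names are invented for the example.

```python
from datetime import datetime

LOG = """\
2019-7-17 20:43:36 Main__  - Online
2019-7-20 07:43:33 Main__  - Online
2019-7-21 16:32:13 Main__  - Online
2019-7-22 03:57:21 Main__  - Offline
2019-7-22 03:57:21 Main__  - Online
2019-7-22 04:05:22 Main__  - Offline
2019-7-22 04:09:22 Main__  - Online
2019-7-22 06:26:12 Main__  - Online
"""


def parse_events(text):
    """Turn each log line into a (datetime, state) pair."""
    events = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 5:
            continue  # skip blank or malformed lines
        when = datetime.strptime(f"{parts[0]} {parts[1]}", "%Y-%m-%d %H:%M:%S")
        events.append((when, parts[-1]))
    return events


def outages(events):
    """Pair each Offline with the next Online; return (start, end, seconds)."""
    result = []
    down_since = None
    for when, state in events:
        if state == "Offline" and down_since is None:
            down_since = when
        elif state == "Online" and down_since is not None:
            seconds = int((when - down_since).total_seconds())
            result.append((down_since, when, seconds))
            down_since = None
    return result


for start, end, seconds in outages(parse_events(LOG)):
    print(f"{start} -> {end}: {seconds} s")
```

On this log it reports two outages: a 0 second blip at 03:57:21 and a 240 second one from 04:05:22 to 04:09:22.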

But, when I look at the reboot log:

pi@TelePi:/media/pi/9020-9C27/logs $ lf
last_alive.db                    Rebooted at 2019-7-21.163212.db
Rebooted at 2019-7-20.074332.db  Rebooted at 2019-7-22.062612.db
pi@TelePi:/media/pi/9020-9C27/logs $ cat Rebooted\ at\ 2019-7-22.062612.db 
2019-7-22 06:23:06 up 13 hours, 53 minutes

pi@TelePi:/media/pi/9020-9C27/logs $ 

I rebooted it at 06:22:06
.
.
Oh. Drats. Ok, maybe not.
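
The uptime line is what makes it "maybe not". A rough Python sketch of the arithmetic follows; the function name is hypothetical, and the uptime is only reported to the minute, so the result is approximate.

```python
import re
from datetime import datetime, timedelta

UNIT_SECONDS = {"week": 604800, "day": 86400, "hour": 3600, "minute": 60}


def boot_time(line):
    """Return the approximate boot time implied by '<date> <time> up <uptime>'."""
    stamp, _, uptime = line.partition(" up ")
    seen = datetime.strptime(stamp.strip(), "%Y-%m-%d %H:%M:%S")
    seconds = sum(
        int(count) * UNIT_SECONDS[unit]
        for count, unit in re.findall(r"(\d+) (week|day|hour|minute)s?", uptime)
    )
    return seen - timedelta(seconds=seconds)


print(boot_time("2019-7-22 06:23:06 up 13 hours, 53 minutes"))
# → 2019-07-21 16:30:06
```

That lands within a couple of minutes of the "Rebooted at 2019-7-21.163212" file, so on this reading the machine itself had been up since the previous afternoon and was still running right through the 04:05 outage.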

Ok, but why isn't this machine reconnecting at 04:09:22, when the Main WAP comes back online?

It would seem that if/when the WAP goes down, the machine locks up.

Yes, it loses the MQTT inputs. So what?
Yes, it can't ping the Router or external IP address. So what? It just can't ping them.

It would seem that this happening causes the machine to lock up.

This same thing happened yesterday during the day when the uplink went down. When it came back up I saw that the RPz was not responding to pings.
(The killer there was that when the main link goes down the ROUTER kills its WAP.)

Anyway, anyone have any ideas on what is going on?

This is the flow which I use for boot detection and time stamping, etc.

The top part:
The exec node is polled by the link node. It writes the uptime to a text node.
Below that the date is created and written to a file (Alive) via a gate node.
This gate is closed by default - it doesn't let data through.
(So how does it work? I'm getting there.)

Below is the boot detector.
It is triggered on boot and injects one message.
That is split and a "Boot detected" text node is activated. That is the function node and text node.
Below that is another button (Ack boot) which when pressed resets the "boot detected" indicator.
The "timestamp" node then splits its output.
One goes to a delay node set for 10 seconds.
That then goes to a function node which sends a command to "OPEN" the gate in the upper part of the flow. That is how the gate is opened for normal writing of times to the file.

The second output of the function node goes to another function (gets current time/date) node and then to a file name node.
This renames the file to what the time is NOW. (With prefixes, etc.)

So, really (and given any slight variation in times) the new "Alive" file can't be written to until 10 seconds after the machine has booted.
In that 10 seconds the existing file would be moved (renamed) to what is the current time.

So it should work.
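
The ordering can be checked on paper with a small Python model. This is a sketch of the timing only, not the actual nodes: the class, the event names and the action labels are all made up, and it assumes the rename happens as soon as the boot is detected.

```python
class BootGate:
    """Gate that is closed at boot and opens `delay_s` seconds later."""

    def __init__(self, delay_s=10):
        self.delay_s = delay_s
        self.boot_at = None  # no boot seen yet, so the gate stays closed

    def on_boot(self, now_s):
        self.boot_at = now_s

    def is_open(self, now_s):
        return self.boot_at is not None and now_s - self.boot_at >= self.delay_s


def replay(events, delay_s=10):
    """Replay (time_s, kind) events in time order; return what the flow would do."""
    gate = BootGate(delay_s)
    actions = []
    for now_s, kind in sorted(events):
        if kind == "boot":
            gate.on_boot(now_s)
            actions.append((now_s, "rename old file"))
        elif kind == "alive":
            state = "write" if gate.is_open(now_s) else "blocked"
            actions.append((now_s, state))
    return actions


print(replay([(0, "boot"), (2, "alive"), (20, "alive")]))
```

With a boot at 0 s and "alive" pulses at 2 s and 20 s, the first pulse is blocked and the second is written, so the old file is renamed before anything can overwrite it. The one thing this model does not cover is a rename that takes longer than the 10 second delay.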

[Screenshot from 2019-07-22 07-00-58]

What version of Raspbian are you using? I have a similar problem on my Pi2 running Jessie: if the device switches to UPS it loses its network (wired in this case), which doesn't come back even when mains power is restored. A reboot brings the network back. This doesn't happen on the Pi3 (I think that is running Stretch).

I have a flow that pings the router periodically. If it can't reach it, it reboots the Pi. It also sends a keepalive message to MQTT on the other Pi every minute - that connection has an LWT so the broker marks the Pi as offline if it hasn't sent a ping in the last 90sec.
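
Roughly the same watchdog can be sketched in Python. Treat it as an illustration under assumptions: the class and function names are invented, the `-c` and `-W` options are those of the Linux iputils `ping`, and the "three failures in a row" limit is my addition so that a single dropped ping doesn't trigger a reboot.

```python
import subprocess


def ping_ok(host, timeout_s=2):
    """Return True if one ICMP echo to host succeeds (Linux iputils ping)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


class Watchdog:
    """Call `action` once `limit` checks in a row have failed."""

    def __init__(self, check, action, limit=3):
        self.check = check
        self.action = action
        self.limit = limit
        self.failures = 0

    def tick(self):
        """Run one check; return True if the action was triggered."""
        if self.check():
            self.failures = 0
            return False
        self.failures += 1
        if self.failures >= self.limit:
            self.failures = 0
            self.action()
            return True
        return False
```

In use, `check` would wrap `ping_ok` with the router's address and `action` would run whatever recovery is wanted (a reboot, or just restarting the network interface), with `tick()` called on a timer.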

Your idea is interesting.

This is the info I get from the Pi:

Raspberry Pi Zero W Rev 1.1

PRETTY_NAME="Raspbian GNU/Linux 9 (stretch)"
NAME="Raspbian GNU/Linux"
VERSION_ID="9"
VERSION="9 (stretch)"
ID=raspbian
ID_LIKE=debian
HOME_URL="http://www.raspbian.org/"
SUPPORT_URL="http://www.raspbian.org/RaspbianForums"
BUG_REPORT_URL="http://www.raspbian.org/RaspbianBugs"

Raspberry Pi reference 2017-09-07
Generated using pi-gen, https://github.com/RPi-Distro/pi-gen, 496e41575eeb9fa13f394ffb407b7bc1d00b21c2, stage5

I may have to do something like that here too.

Just out of interest, your thoughts on the flow:
Should it work as I expect?

With the gate (normally closed) and needing a control signal to "open" it to allow the "I'm alive" pulse to update the last alive counter.

And on booting, that being blocked and the file being moved/renamed....

Can post if you want.

Well it is an interesting idea but I can't help thinking it is overkill if you have another device somewhere that you can use with MQTT on it. As long as the other device is more reliable, then when Node-RED starts on the Pi Zero, if it starts sending an "Online" payload to a suitable MQTT topic with an LWT that says "Offline", you will have a very clear indication of when the Pi went offline and came online, especially if you then record that topic to InfluxDB. Then a simple Grafana chart will show you clearly when it goes on/off, at least to the nearest minute which would be more than enough.

Don't forget that your syslog entries should also show when the Pi goes on/offline. You could also use Telegraf to record data to InfluxDB.

Yeah, that's where I am in trouble just now.

I am reluctant at this point to plug anything else into my main RPI MQTT broker.
It is also the WAP which has recently died on me.

I don't like writing log files to the SD Card.
Though I could get away with it for a while. I don't want to push my luck.

I've heard of Telegraf, but have not got into it. I hope MQTT is good enough.

Anyway, I have just completed a fun session with NR and the dashboard, and groups and who is where. (Etc).

And updating a whole lot of flows to the newer version and design. A lot of testing done.

Thanks.

While I realise that you are probably trying to save money, personally I would never choose a Pi as a Wireless Access Point. If you can afford it, get a dedicated WAP such as the UniFi series from Ubiquiti. It will always be more secure, perform better and be more stable.

I've never had a problem since I switched to a decent SD card. Just get one that is big enough to have lots of space for wear levelling. I've been running both my Pis off the same cards for several years now.

Telegraf is for grabbing data from various sources like your Pi's cpu, memory, temperature, service statuses, etc. and forwarding the data onto 1 or more recording/reporting services.

You can output direct to MQTT from Telegraf and that is fine for spot data (what is happening now) but not for historical records. That's where a database comes in useful. InfluxDB can be written to direct from Telegraf as well (you can do both of course).

Out of curiosity:
Somewhere there is a line:
allow-hotplug <interface>
Do you have that in any of the files which points to your wlan0 interface?
That could be a problem from what I am reading.
But I am not yet 100% certain where it goes.
Way back in Jessie (and earlier) I remember it was in files.

This is where I got this from:
allow-hotplug wlan0

I don't use wlan on either of my 2 Pis; they are both wired only.

Ok, but do you have a line like that...

allow-hotplug eth0
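
If it helps with checking, here is a small Python sketch that lists which interfaces have an allow-hotplug line in a config file's text. The function name is made up. On Debian-style systems these lines normally live in /etc/network/interfaces (or files under /etc/network/interfaces.d/), but whether that file is what actually manages wlan0 on this Stretch image is an assumption nobody in the thread has confirmed.

```python
import re


def hotplug_interfaces(config_text):
    """List interface names found on 'allow-hotplug' lines, skipping comment lines."""
    found = []
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith("#"):
            continue  # whole-line comments are ignored
        match = re.match(r"allow-hotplug\s+(.+)", line)
        if match:
            found.extend(match.group(1).split())
    return found


sample = """\
auto lo
iface lo inet loopback

allow-hotplug wlan0
# allow-hotplug eth0
iface wlan0 inet dhcp
"""
print(hotplug_interfaces(sample))
```

To run it against the real file, pass in `open("/etc/network/interfaces").read()` instead of the sample text.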