A bigger problem than the ones I have been mentioning up to now

Trying_to_learn · 15 June 2019 23:33

Recently I have been busy with NR and flows.

I've mentioned weird errors, and a few other things which have been brought about because of strange happenings on my flow/s / machines.

I have a couple of RPIs. One is a RPI 2a and acts as a WAP for local stuff.
I have 2 other WAPs. 1 is my "modem" and one is another RPI which is mostly turned off just now for reasons beyond this post.

The RPI WAP is DHCP. (And different IP to the other WAPs)

(and I now realise this could be the cause of the problem, but: here goes.)

This RPI (as stated) is a WAP. It is on 24/7. It is my "main" machine for collecting information about what's going on in my world stuff.

Saying that, it also scans for available WAPs.
It looks for my main WAP being up/down and the other (third) WAP which is down 99% of the time.

The reason for this is so it can tell me if the main WAP is working or not, and reduce the annoyance when I am wondering why things aren't connecting to it. I can easily see it is down.
(That fact in itself is a problem, but only slightly; let's say that it is mostly up and hasn't really been a problem.)
I did that bit when I was starting off with NR and it also checks if my internet link is working.
Just an exercise in checking things.
Check the WAP, check the modem, check the internet... (a few other checks/pings there too, but...)

The local WAP is a bit smarter in that when a device connects to it, a signal is sent out asking the device to identify itself.
The device responds and all is sweet in the world.

Getting more adventurous I then started adding logs of what was happening and that is when things got bad.

I was/am seeing unexpected "identify yourself" message replies. These should only happen if a device is asked. Why are they happening?

(I did find some "contamination" of messages and quickly sorted that out.) (IQ +1)

I then started logging the requests. They too are happening for no real reason.

Then I got to this point/stage:

I am getting sporadic messages saying that the main WAP is down.
To double check myself, I got the second RPI (which is a RPZ(w)) to also check the state of things.

It is saying that both the main and RPI (local) WAP are going down/up at about the same time.

I stuck a lot of extra stuff on the flows to log what is going on, but it is still happening and even the logs aren't making sense.

Here is a breakdown of the flow monitoring the WAPs:
(pulse trigger at given interval. Every 40 seconds) --> Scan for WAP. --> create list --> split on name.
If name == modem WAP, #1, if name == RPI WAP, #2

#1 and #2
--> timeout trigger (set to 80 seconds) to send false if it times out --> (branch) indicator set to show condition.
(branch) --> switch (if false) --> function node to timestamp message --> (branch) log message.
(branch) --> another indicator.

Yeah, sorry. It will be hard to get your head around if you aren't me. I'll post the flow/screen shots later if you want. (I'm torn between too much and not enough information.)
But I now notice that even this is detecting failures, now and then.
Not as often as I remember, but enough to be of interest.
Mostly because it is usually both which go down/up at about the same time.

Probably timing, but I am just not getting something.

I may extend the timeout to 120 seconds rather than 80 (80 being twice the scan time) and see what happens.
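
For what it's worth, the timing can be roughly modelled outside NR. This is only a sketch, not the actual flow: the function name `safe_misses` and the `jitter` parameter are made up for illustration. It assumes the timeout trigger restarts its countdown every time the WAP name arrives, that the timeout is longer than one scan interval, and that a scan can occasionally fail to list a WAP that is really up. It asks how many scans in a row can miss the WAP while the next good result still beats the timer.

```python
def safe_misses(scan_interval, timeout, jitter=0.0):
    """Consecutive scans that may miss the WAP without a false "down".

    The timer restarts at each sighting. After m missed scans, the next
    sighting arrives (m + 1) * scan_interval + jitter seconds after the
    previous one, and must arrive strictly before the timer fires.
    Assumes timeout > scan_interval + jitter.
    """
    misses = 0
    # Keep going while one more missed scan would still be recovered in time.
    while (misses + 2) * scan_interval + jitter < timeout:
        misses += 1
    return misses


for timeout in (80, 120):
    print(timeout, safe_misses(40, timeout))
```

Under those assumptions an 80 second timeout absorbs no missed scans at all: after a single miss the next result lands at the 80 second mark and races the timer. With 120 seconds one missed scan is absorbed. Both WAPs come from the same scan, so one bad scan would affect both at about the same time.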

TotallyInformation · 16 June 2019 00:01

That whole flow is really hard to follow.

In my own system, I get my devices, including my Pis, to send out an MQTT message about every 50 seconds or so to something like DEVICES/Pi2 with a payload of "Online". I use an MQTT connection that specifies a "Last Will and Testament" message with a timeout of 90 seconds and a payload of "Offline".

This means that I have a very simple set of flows that give a "heartbeat" to the MQTT broker (every 50s) and if the broker doesn't receive such a heartbeat message in 90s, it sets the topic to "Offline".
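
A minimal sketch of that heartbeat logic, simulated in plain Python with no broker involved. It is not the real flow and not how a broker is implemented: the class `HeartbeatMonitor` and its method names are invented for the illustration, and time is passed in as plain seconds so the behaviour is easy to follow. Only the 90 second limit and the DEVICES/Pi2 topic come from the description above.

```python
class HeartbeatMonitor:
    """Track the last heartbeat per topic and report Online/Offline."""

    def __init__(self, timeout_s=90):
        self.timeout_s = timeout_s
        self.last_seen = {}  # topic -> time of the last heartbeat, in seconds

    def beat(self, topic, now):
        """Record a heartbeat received on `topic` at time `now`."""
        self.last_seen[topic] = now

    def status(self, topic, now):
        """Return "Online" if a heartbeat arrived within the timeout."""
        seen = self.last_seen.get(topic)
        if seen is None or now - seen > self.timeout_s:
            return "Offline"
        return "Online"


monitor = HeartbeatMonitor(timeout_s=90)
monitor.beat("DEVICES/Pi2", now=0)
monitor.beat("DEVICES/Pi2", now=50)
print(monitor.status("DEVICES/Pi2", now=100))  # 50 s since the last beat
print(monitor.status("DEVICES/Pi2", now=150))  # 100 s since the last beat
```

The point of the design is that the device only has to send one small message on a schedule; deciding that it has gone quiet happens in one place, away from the device's own flows.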

Now I can have as many instances of Node-RED (on any number of devices) that can listen to those DEVICES/# topics and react if anything goes offline.

Not only do I not have to worry about timings within Node-RED flows, I can keep the flows simple by separating out the processing.

I can get any instance of Node-RED to send out its own heartbeat but other devices can do the same thing directly. So all of my ESP8266 sensor platforms do that same heartbeat processing directly to the MQTT broker.

Trying_to_learn · 16 June 2019 00:13

That is a nice idea.

I may think about that soon.

I also like how you are using the LWT on MQTT to detect link failures.

What I do is very similar, but I PING the devices. I know that is limiting in some ways, but it is also useful in other ways.

On my devices - as most are RPI - they send back their CPU load and temperature once connected.

But I don't think I can use what you suggest to detect a WAP.

To try and make it as easy to understand:
Every 40 seconds I scan the WAP list and sort the names accordingly.
A timeout of 80 seconds (now changed to 120) is used, so that if no message is received in that time it "times out" and sends a false signal indicating the WAP is not seen.

I am getting these bursts of both WAPs being Offline/Down but I am not seeing any MQTT activity to support this in that the MQTT link from the main RPI to the RPZ(w) isn't broken.
I don't see a LWT sent when the RPI (main WAP) (allegedly) dies.

So it doesn't make sense what I am seeing.

To further complicate it: I am getting those weird errors pointing to nodes saying they are wrong but when I test them with the inputs they expect they don't complain.

So I am still at the stage of chasing shadows, rather than actual things.

TotallyInformation · 16 June 2019 10:25

Ping is a very different test. The MQTT/LWT approach ensures that a device's service is really working. All ping does is find whether the device's network card is still alive. Because Ping works on a very low level in the network stack, it doesn't tell you very much.

I have something similar to keep track of network devices on my network but I use an ARP scan with a customised lookup file. ARP scan walks the network and tracks MAC addresses returning the IP addresses, if any, for each device found.

I do use ping to track whether my Node-RED instance can still detect a couple of devices. Namely my router and my NAS. Those flows also use MQTT/LWT to indicate whether the device appears to be contactable. But of course, this only tells me if Node-RED can ping them, it doesn't really tell me if they are actually working. For that, I'd need SNMP or I would need to try and connect to a service such as TELNET or a web interface.

Whichever way you look at it, I would say that it is likely that you've created something too complex to be easily debugged. You need to separate out the parts of the flow so that you can track down the actual error. All those timing delays could be causing an issue but I can't work that out from here.

My advice is to simplify the flow, split it up. You may well find that simplifying it gets rid of the issue anyway.

Trying_to_learn · 16 June 2019 11:20

Yes, and I use PING as a low level detection system.

I've heard of ARP, but I think it is tagged differently in my world at this point.
And, that limits your scanning to fixed devices pretty much the same way my PING limits my scanning too.
(ARP, that's a network thing about Address Resolution Protocol - guessing.)

Simple Network Mail Protocol?
(It's late.... Sorry)

I have got the flows split up into sections which was/is the idea to help track problems.

What is really helping me now is subflows as I am finding a lot of repeat stuff and replacing them with subflows.

Just on how my system works - which isn't really that difficult:

Timed PING to all addresses.
If a machine replies, it is tagged as "alive" and marked as such.
Then, it is given a fixed time to start sending telemetry data (using NR) of CPU temp, processor load, etc.
If that doesn't arrive in a specified time - and format - the machine is flagged as "hung".
However, if the data is received, the machine is shown as working.
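
Boiled down, those rules look something like the sketch below. It is only an illustration in Python, not the NR flow itself: the function name `classify`, the 60 second grace period and the "down" label for a machine that doesn't answer the PING are all assumptions made for the example. "alive", "hung" and "working" are the states described above.

```python
def classify(ping_replied, telemetry_ok, seconds_since_reply, grace_s=60):
    """Return the state of one machine from the PING and telemetry results.

    ping_replied        -- True if the machine answered the timed PING
    telemetry_ok        -- True if telemetry arrived in the expected format
    seconds_since_reply -- time elapsed since the machine first replied
    grace_s             -- assumed time allowed for telemetry to start
    """
    if not ping_replied:
        return "down"      # assumed label for a machine that does not reply
    if telemetry_ok:
        return "working"   # telemetry received in time and in format
    if seconds_since_reply <= grace_s:
        return "alive"     # replied to PING, still inside the grace period
    return "hung"          # replied to PING but never sent telemetry


print(classify(True, False, 10))   # still waiting for telemetry
print(classify(True, False, 90))   # grace period expired
```

Keeping the decision in one small function like this makes it easy to test each case with known inputs, separately from the timers that feed it.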

Yes there is a bit of code to do that and a lot of timers and conditions.
But I think this would be true with your way also. Not that either is better.
My way is just the way I did it with the knowledge I had at the time.

The WAP time outs are maybe just outside the scope of what I originally did, and I did it that way again only because I knew no other way.

I've tweaked the time out from 80 to 120 seconds - another 40 seconds - and that is 3 periods of scanning for the PING to not work.

That may fix it. I guess these things need to be done/tried before anything can be learned.

I have to do a bit more on the LWT/BC of MQTT. I have the "keep alive time" set to 60 seconds, which is more than the ping time of 40 seconds. But again, if it was failing with 80 second time outs on one level, 60 may fail on this.

I have set all three messages for the LWT/BC to again give as much information as possible to what happened/is going on.

As all the LWT/BC are sent on separate channels (EOM and SOM respectively) I then parse any SOM EOM messages.

I am not seeing any of these when I am getting the "identify yourself" message replies.
So that still has me stumped.

But then there is the WAP time out I just adjusted today - from memory. Or yesterday at worst.

Another thing which I am using is the fan node. Though I think it should be called a bus node, as it kind of represents a bus of sorts.

I am using them to help segregate flows into sections, so that there are as few wires as possible running all over the place and the flows are segmented by the fan nodes instead.

But that's just me.

Again: Thanks for the time and suggestions.
Appreciated.

TotallyInformation · 16 June 2019 13:39

Yes. This is the command. The macfile contains identifiers for common MAC address types and known MAC addresses.

sudo arp-scan --localnet --macfile=/home/pi/node/nrlive/mac-vendor-jk.txt

You can also do ip neighbor to see whether addresses in the current ARP table are reachable.
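
If you want to turn that output into something a flow can use, a rough Python sketch follows. It assumes each device line is an IP address, a MAC address and a vendor/label separated by whitespace, with header and summary lines around them that should be skipped. The function name `parse_arp_scan` and the sample text are invented for the example, so check them against what your own arp-scan prints.

```python
import re

# One device per line: IPv4 address, MAC address, optional vendor/label.
DEVICE_LINE = re.compile(
    r"^(\d{1,3}(?:\.\d{1,3}){3})\s+"
    r"((?:[0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2})\s*(.*)$"
)


def parse_arp_scan(text):
    """Return a list of {"ip", "mac", "vendor"} dicts, one per device line."""
    devices = []
    for line in text.splitlines():
        match = DEVICE_LINE.match(line)
        if match:  # header, blank and summary lines do not match
            ip, mac, vendor = match.groups()
            devices.append(
                {"ip": ip, "mac": mac.lower(), "vendor": vendor.strip()}
            )
    return devices


# Made-up sample output, for illustration only.
SAMPLE = (
    "Interface: eth0, type: EN10MB\n"
    "192.168.1.1\t00:11:22:33:44:55\tExample Router Co\n"
    "192.168.1.20\tAA:BB:CC:DD:EE:FF\tMy NAS\n"
    "\n"
    "2 packets received by filter, 0 packets dropped by kernel\n"
)

for device in parse_arp_scan(SAMPLE):
    print(device)
```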

No, you are thinking of SMTP (Simple Mail Transfer Protocol).

Most routers and NAS's can be configured to give useful operational data over SMTP.

Steve-Mcl · 16 June 2019 14:05

No, based on the context of this thread, he will definitely have meant SNMP.

It's an object based protocol often used in managed switches & servers etc for monitoring.

SMTP is just email transport.

Trying_to_learn · 16 June 2019 21:12

Was that a typo?

And thanks for the other stuff too.

I have touched ARP when I was doing routers/CISCO stuff.
Then there is also RIP but I never got too far down that path.

TotallyInformation · 16 June 2019 21:29

Doh! Yes, SNMP of course as I'd already explained. Sorry for the confusion.

RIP is something different: it lets routers collaborate with each other. Largely discredited now, I think, due to vulnerabilities.

Trying_to_learn · 16 June 2019 21:33

Moving on from that, here is a list of outputs I got overnight from my scanners:

This is the "Who are you" requests. (RAW format)

>> -- Mark -- 2019-6-14 20:35:27------------
{"WIFI_DEVICE":"GPS","IP_Address":"192.168.1.4"}
{"WIFI_DEVICE":"GPS","IP_Address":"192.168.1.4"}
>> -- Mark -- 2019-6-15 15:59:52------------
{"WIFI_DEVICE":"GPS","IP_Address":"192.168.1.4"}
{"WIFI_DEVICE":"GPS","IP_Address":"192.168.1.4"}

Newest at the bottom.

Now the ones I see are:

>> -- Mark -- 2019-6-16 21:02:49------------

So a whole day has gone by and all seems good there.

This is my MQTT stuff:

>> -- Mark -- 2019-6-16 21:02:49------------
2019-6-16 21:25:21 TimePi shutting down

Good.

A very brief look at visible WiFi networks.

2019-6-17 07:03:02["Telstra6C2C0D", "Telstra Air", "Fon WiFi", "TelstraCA5E31", "Telstra Air", "Fon WiFi", "PiNet", "BigPond121C", "TPG-VZBS", "Tango2", "Chilli", "TelstraAC4579", "Slowest Internet in the World", "Interweb Thingy", "TP-LINK_40B6"]
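
A line like that can be checked for particular WAPs with a few lines of code. This is a sketch in Python rather than the NR function node, and it assumes the log line is always a timestamp followed directly by a JSON list of names. The function names `parse_scan_line` and `wap_states` are made up, and the watched names in the example are placeholders, not a statement of which networks are being monitored.

```python
import json


def parse_scan_line(line):
    """Split '<timestamp>[...]' into the timestamp and the list of names."""
    start = line.index("[")          # the JSON list starts at the first '['
    timestamp = line[:start].strip()
    ssids = json.loads(line[start:])
    return timestamp, ssids


def wap_states(ssids, watched):
    """Map each watched name to True (seen in this scan) or False."""
    seen = set(ssids)                # duplicates in the scan do not matter
    return {name: name in seen for name in watched}


line = '2019-6-17 07:03:02["Telstra6C2C0D", "Telstra Air", "PiNet", "Tango2"]'
timestamp, ssids = parse_scan_line(line)
print(timestamp, wap_states(ssids, ["PiNet", "NotThere"]))
```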

And any WAP changes of interest: (Last)

>> -- Mark -- 2019-6-16 21:02:49------------

That is a lot better than it used to be with a few "Who are you?" being transmitted and a lot of WAP up / downs.

That extra 40 seconds may have been the trick.

Trying_to_learn · 16 June 2019 21:34

Thanks.

RIP was only covered and I have not used it on my ..... lab as I haven't seen the benefits of it.
But, anyway, thanks again.

No prob's on the typo. We all make mistakes. And I should know.

Gotta go now. Monday morning and a new week of fun to be had.
Back in about 7 hours maybe.
