MQTT is driving me crazy with inconsistent behaviour

Every time I think I have it nutted out, MQTT throws another curve ball at me and just doesn't do things consistently.

(Slight back story)
Multiple machines controlling single device.
(Sorry, it is actually becoming quite long)

To simplify things, this is a breakdown of what happens on each machine.
This is only a functional example. Not the real code.

As the button is a pushbutton, the icon needs to be altered if another machine changes the state of the external device.
That way the button shows the correct indication.

Of course I didn't foresee the problem: if I turn the device on (or off) on one machine and try to turn it off (or on) on another, there is a problem. The toggle node on the second machine doesn't reflect the real condition of the device, so the button has to be pressed twice.

That isn't too much of a problem. I modify the toggle node and connect from the decode message node to the toggle node also.

That way both the button node and the toggle node are on the same page for the device's status.
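A minimal sketch of that idea, not the real flow: one toggle takes both the button press and the state reported by the device, so the next press always flips the real state. The names `makeToggle`, `press`, `sync` and `current` are made up for illustration; in Node-RED the stored state would more likely live in the context of a function node.

```javascript
// Hypothetical helper: a toggle that can be corrected by the device's reported state.
function makeToggle(initial = "OFF") {
    let state = initial; // last known state of the device

    return {
        // Button press ("X" from the dashboard): flip and return the command to send.
        press() {
            state = state === "ON" ? "OFF" : "ON";
            return state;
        },
        // Status message from the device (payload "ON" or "OFF"):
        // overwrite the local state so it matches reality.
        sync(reported) {
            if (reported === "ON" || reported === "OFF") {
                state = reported;
            }
            return state;
        },
        current() {
            return state;
        }
    };
}

// Machine A turns the bulb on; machine B only sees the status message.
const machineB = makeToggle();
machineB.sync("ON");            // the decode message node feeds the toggle
console.log(machineB.press());  // the next press on machine B asks for "OFF"
```

Without the `sync` call, the first press on machine B would ask for "ON" again, which is the two-press problem described above.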

That is fine. One computer is on 24/7. The other isn't: it is this machine, and it gets turned on and off many times a day.

Now, the past few days I have had to roll out that modification to all machines concerned and it seems to be working.
I had to mess around with the MQTT settings about messages being kept (retained) or not.
(End back story - I hope)

To add another layer of truth I also included code so if the device is physically turned off, it disables the buttons on all machines.
This is done with the LWT in MQTT.

After a lot of oops moments rolling out the above, I am still not seeing the correct things on the machines.

This is the screen of the machine which is on 24/7:

Screenshot from 2020-07-22 08-29-56

Which is wrong.

This is the screen from this machine, which was only turned on this morning:

Screenshot from 2020-07-22 08-31-02

Which is more accurate - but still slightly wrong.

When the bulb is physically powered down, an LWT message is sent and it tells MQTT that it is dead.
That message is used to put the ban icon on the button and disable it.
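As a rough sketch of that mapping (assuming the payloads are exactly "Online" and "Offline", as in the switch node of the flow in this post): the function name `lwtToButton` is hypothetical, and the icon markup is copied from the BAN function node.

```javascript
// Icon markup copied from the BAN function node in the flow.
const BAN_ICON = '<font color = "red"><i class="fa fa-ban fa-3x"></i></font>';

// Hypothetical helper: turn an LWT payload into properties for the ui_button.
function lwtToButton(payload) {
    if (payload === "Online") {
        return { enabled: true };                  // device is back: enable the button
    }
    if (payload === "Offline") {
        return { enabled: false, icon: BAN_ICON }; // device is dead: ban icon, disabled
    }
    return null; // anything else is ignored
}

console.log(lwtToButton("Online"));
console.log(lwtToButton("Offline"));
```

Returning `null` for unknown payloads mirrors a function node that sends nothing on.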

Last night I physically turned off the bulb. I did that because this machine was turned off and the other machine which is on 24/7 is headless.

During testing I was seeing the LWT message/s coming through and the buttons changing state obligingly.

Until last night, when the machine which is always on didn't see the message.
(Why? -- semi rhetorical)

A while ago I had to make a special MQTT in node and set the topic to the entire path for the LWT of the device.
This didn't really help.

This is an extract of the actual code that detects the LWT.

[{"id":"9478ca91.86f89","type":"mqtt in","z":"26262ba1.62dcbc","name":"Bulb-2 *","topic":"BULB-2/tele/LWT","qos":"2","datatype":"auto","broker":"378c0403.8cda04","x":2480,"y":1540,"wires":[["33bdf784.1f12e8","eb79a878.978f78","17be4a17.f26556","18542f8a.4235b8"]],"info":"This needs editing for different BULBS"},{"id":"17be4a17.f26556","type":"delay","z":"26262ba1.62dcbc","name":"Delay","pauseType":"delay","timeout":"200","timeoutUnits":"milliseconds","rate":"1","nbRateUnits":"1","rateUnits":"second","randomFirst":"1","randomLast":"5","randomUnits":"seconds","drop":false,"x":2480,"y":1580,"wires":[["47b5cbbd.2089dc","372c717c.ac1446"]]},{"id":"47b5cbbd.2089dc","type":"switch","z":"26262ba1.62dcbc","name":"LWT","property":"payload","propertyType":"msg","rules":[{"t":"eq","v":"Online","vt":"str"},{"t":"eq","v":"Offline","vt":"str"}],"checkall":"true","repair":false,"outputs":2,"x":2650,"y":1580,"wires":[["c04aa13e.9d3ca","b53edd19.caae"],["1cda007c.85c458"]]},{"id":"1cda007c.85c458","type":"function","z":"26262ba1.62dcbc","name":"BAN","func":"msg = {icon: '<font color = \"red\"><i class=\"fa fa-ban fa-3x\"></i></font>'};\nreturn msg;","outputs":1,"noerr":0,"x":2650,"y":1620,"wires":[["ec8dc91b.795888"]]},{"id":"c04aa13e.9d3ca","type":"change","z":"26262ba1.62dcbc","name":"Enable","rules":[{"t":"set","p":"enabled","pt":"msg","to":"true","tot":"bool"}],"action":"","property":"","from":"","to":"","reg":false,"x":2860,"y":1580,"wires":[["9ccce8b4.a581d"]]},{"id":"ec8dc91b.795888","type":"change","z":"26262ba1.62dcbc","name":"Disable","rules":[{"t":"set","p":"enabled","pt":"msg","to":"false","tot":"bool"}],"action":"","property":"","from":"","to":"","reg":false,"x":2860,"y":1620,"wires":[["9ccce8b4.a581d","de51f5b3.e7b2"]]},{"id":"9ccce8b4.a581d","type":"ui_button","z":"26262ba1.62dcbc","name":"BULB#2","group":"c9a39d1d.9fa798","order":2,"width":"1","height":"3","passthru":false,"label":"{{msg.icon}}","tooltip":"","color":"","bgcolor":"{{msg.background}}","icon":"","payload":"X","payloadType":"str","topic":"","x":3070,"y":1580,"wires":[["8318d073.9248c"]]},{"id":"378c0403.8cda04","type":"mqtt-broker","z":"","name":"MQTT HOST","broker":"192.168.0.99","port":"1883","clientid":"","usetls":false,"compatmode":false,"keepalive":"60","cleansession":true,"birthTopic":"","birthQos":"0","birthRetain":"true","birthPayload":"","closeTopic":"","closeQos":"0","closePayload":"","willTopic":"","willQos":"0","willPayload":""},{"id":"c9a39d1d.9fa798","type":"ui_group","z":"","name":"BULB-2","tab":"aa487daa.33c1c","order":5,"disp":true,"width":"3","collapse":false},{"id":"aa487daa.33c1c","type":"ui_tab","z":"","name":"Real_World_Control","icon":"dashboard","order":3,"disabled":false,"hidden":false}]

Both machines have this code. I posted this as there is a bit of explanation in one of the comment nodes. See the one with a * in its name.

So given both machines have the same code:
Why is this machine, which was turned on this morning, showing the correct state of the bulb when the one which is on 24/7 isn't?
The LWT message must have been sent, as it was being sent earlier last night when I was testing the flow.

Here too is a screenshot from MQTT Explorer:

I'm going off line for a few hours.
It isn't that I am doing a runner - asking a question and not replying.

But I am just missing something about why the machine which is on 24/7 is showing the wrong information/state of the button.

I have a few questions with regards to your architecture, but I'll start with a bit of an outline of what I am working with in my own setup.

My setup uses a couple of WiFi enabled micro-controllers (ESP8266) outfitted with a humidity/temperature sensor. The micro-controllers can be controlled by a variety of computers (laptop, phone, tablet, etc.). Movement of the data and commands uses MQTT for the transport. Data from the sensors is stored in a time-series database (InfluxDB).

At the centre of all this is a Raspberry Pi 3 (RPi). The RPi has Node-RED, mosquitto (MQTT broker) and InfluxDB installed. Access to the Node-RED Editor and UI is via web browser. This eliminates any need to move any modifications around. The RPi is inexpensive, resilient and reliable.

I don't have any issues with consistency. For example, I can change the state of an onboard LED using Node-RED and the on/off status of the LED is the same on my laptop, phone and tablet.

I feel this is a similar setup to what you describe you have.

Now for the questions:

  1. Where are Node-RED and the MQTT broker installed?
  2. Are you using a web browser to access the Node-RED ui and control the device?

The MQTT broker is on a RasPi 2 machine. On 24/7.
Node-RED is installed on all machines mentioned.

I am using a browser to access the dashboards. Either on this machine from this machine, or on another RasPi to access its Node-RED dashboard.
That is not the MQTT broker machine.

The device - in this case a bulb - is on the WiFi side of things.

On any given machine (in this case only two: this machine and the aforementioned RasPi) there is a dashboard with a button which sends a simple X when pressed.

That goes into a function node to toggle the message sent to the bulb. Being either ON or OFF.

The problem arises when the other machine does that.
Although the message is received on the second machine, and I had the smarts to make it set the button's icon to match the incoming message's state, the function node isn't updated as well.
So there is disparity if I go to press the button on said other machine.

But that is now no longer the problem.

The problem is that if the bulb is turned off by the actual switch - as in physically - the bulb sends (I say that, but.....) the MQTT message that it is offline.

That then disables the button and sets the icon to indicate that.

One machine - this one which has just been turned on - shows that. It got the preserved MQTT message.
Yet, the machine which is on 24/7 seems to think the bulb is still turned on and active.

I have spent a lot of time looking at MQTT messages and the settings.

AFAIK, all the boxes are ticked and things should work. They obviously don't, and I am at a loss as to why and where this problem is happening.

It sounds like the 24/7 subscriber is not really subscribing as expected. Is the publisher sending to a '+' topic, so that both subscribers are listening? I have 10 different Pi devices that send to one monitoring device, so I structured the MQTT topic to be open ended on the back end. Barring that, divide and conquer. Break the flow(s) apart into their component parts.

For example, I validate that my LED or LCD or whatever status reporting method is working correctly with direct input, no MQTT. Then I confirm that MQTT is working as expected, etc. Recently I was driving CEC to some HDMI devices via buttons on a dashboard tab, and I kept getting odd CEC directives from the buttons on the tab. It was driving me nuts. It turned out one specific button had msg forwarding enabled, so I was getting an unexpected payload that fed into the CEC exec node and messed with the HDMI device connected to my Pi. It was not until I broke the total flow into easier-to-test segments that I traced it back to the real cause of my problem.

Kinda difficult when it is a WiFi bulb.

But I do get what you mean. It is just that one day things happen one way. Another day they seem to happen another way.

Most annoying.

Admittedly I haven't read your original post in its entirety but I would question this as a general design philosophy, simply because it can lead to the sorts of issues that you keep seeing.

It was originally.

Each device sent on off signals and didn't really set the other devices to reflect the change.

That was improved by adding an MQTT in node that got the change of the device from the device itself rather than from the local flow and the button being pressed.

That again had problems, because when one machine changed the state of the device, the other machines reflected the update on the button but their internal function node toggle wasn't changed.

That too was fixed.

But overlaying all of this, the death certificate (LWT) disables the button and sets its icon to the ban icon.

Weirdly the machine which is on 24/7 didn't seem to get the MQTT message, yet the machine (this one) that gets turned on/off several times a day gets the message.

The message is retained, and so I was told / led to believe that this means when this machine powers up, it gets the message from the broker.
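To make the retained-message behaviour concrete, here is a toy in-memory model. It is only a sketch and not a real MQTT broker: `ToyBroker` is invented for illustration, it matches topics exactly (no wildcards), and it ignores QoS, sessions and keepalive. It models the rule, as I understand MQTT 3.1.1, that a message delivered because of a new subscription carries the retain flag, while a live delivery to an existing subscription does not.

```javascript
// Toy in-memory model of retained messages; NOT a real MQTT broker.
class ToyBroker {
    constructor() {
        this.retained = new Map();   // topic -> last retained payload
        this.subscribers = [];       // { topic, handler }
    }

    publish(topic, payload, retain = false) {
        if (retain) {
            this.retained.set(topic, payload); // keep the last message for late joiners
        }
        // Live delivery to existing subscribers: the retain flag is false.
        for (const sub of this.subscribers) {
            if (sub.topic === topic) {
                sub.handler({ topic, payload, retain: false });
            }
        }
    }

    subscribe(topic, handler) {
        this.subscribers.push({ topic, handler });
        // A new subscription gets the stored message, flagged as retained.
        if (this.retained.has(topic)) {
            handler({ topic, payload: this.retained.get(topic), retain: true });
        }
    }
}

const broker = new ToyBroker();
const alwaysOn = [];
const lateJoiner = [];

broker.subscribe("BULB-2/tele/LWT", (m) => alwaysOn.push(m));   // 24/7 machine
broker.publish("BULB-2/tele/LWT", "Offline", true);             // bulb loses power
broker.subscribe("BULB-2/tele/LWT", (m) => lateJoiner.push(m)); // machine booted later

console.log(alwaysOn);
console.log(lateJoiner);
```

In this model both machines end up with the same "Offline" payload. The late joiner sees it with `retain: true` and the always-on subscriber sees it with `retain: false`, so a `retain` of false on a live message does not by itself mean the message was published without the retain flag.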

That aside, the other machine which is on all the time should see the original message when the death certificate is issued and change the button's icon accordingly.

Yet it doesn't.

Why?

Not answering your question specifically, but I run two physical machines (both RPi) with NR. One I call the Prod server and the other the Test server. Obviously their names give away their purpose. I run one MQTT server on Prod, but the Test server also utilises the MQTT server. On the Test server I run multiple instances of NR as that helps with the testing process.

If I knock up a new flow, or test a node to my satisfaction then I move it over to the Prod server. The Prod server also runs other services like web server, Grafana, Influx etc. I have various nodes (esp, arduino and other RPi) connected via MQTT. All rock solid.

I guess what I am trying to suggest is that having a good fundamental design architecture of how your network is set up really helps in eliminating the gremlins.

Having a (reasonably strict) demarcation between Prod and Test allows you to play to your heart's content on the Test server, while still maintaining a viable and happy Prod server. If I was going to suggest one thing, it would be to take a macro look at how you have things set up, with the aim of getting to a "Prod/Test" environment. You don't need NR (or any other server for that matter) running on all your boxes in a home environment.

I do agree @Bobo, but that is in a very specific scenario.

I am writing things and they need to be online (usable) now. Not in 3 months.

So I write the code. It works. This is dynamic - well not really - but the problem is.
I test it and it seems to work. I dot all the I's and cross all the T's and things look good.

The next day: "BANG!" It doesn't, and I don't know why.

Whether it is specific or not is beside the point. The point is it is easy to manage, and doesn't cost much to set up.

Who said anything about 3 months? Once you have your test environment set up there is little extra time needed.

Exactly. Because it sounds like you have a network that is ten times more complicated than it needs to be. Why do you need to run more than one NR server, or more than one MQTT server, as an example? Complexity introduces inconsistencies.

That was an example. As I said at the end: I spend time and it all works. While I am watching it.

As soon as I think everything is working, the next day it all falls over.

Well, it's a pity when such things happen

But I doubt it's because something is wrong with MQTT, as you indicate in the topic. If it were, I would expect a massive thunderstorm of reports.

It seems you are in a bit too much of a hurry. Try to go back to basics; sometimes just pen and paper is a good start. The architecture you have is mentioned above. Have you done a thorough analysis and design? Throwing new nodes and code at a malfunctioning solution is not the recommended way to fix a problem.

Well, this is where I am really confused.

I have the MQTT in node.
Topic: BULB-2/#

Supposedly ALL messages are received and passed by that node.... for BULB-2/#
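For reference, this is roughly how a subscription filter such as `BULB-2/#` is matched against a topic. It is a simplified sketch: `topicMatches` is a made-up helper, not part of Node-RED or any MQTT library, and it skips special cases such as topics starting with `$`.

```javascript
// Hypothetical helper: does an MQTT topic match a subscription filter?
// Handles "+" (exactly one level) and "#" (everything below, last position only).
function topicMatches(filter, topic) {
    const f = filter.split("/");
    const t = topic.split("/");

    for (let i = 0; i < f.length; i++) {
        if (f[i] === "#") {
            return i === f.length - 1; // "#" must be the last level; it matches the rest
        }
        if (i >= t.length) {
            return false; // filter has more levels than the topic
        }
        if (f[i] !== "+" && f[i] !== t[i]) {
            return false; // literal level differs
        }
    }
    return f.length === t.length; // no wildcard left over: lengths must agree
}

console.log(topicMatches("BULB-2/#", "BULB-2/tele/LWT"));
console.log(topicMatches("BULB-2/#", "BULB-1/tele/LWT"));
console.log(topicMatches("BULB-2/tele/LWT", "BULB-2/tele/LWT"));
```

So a node subscribed to `BULB-2/#` and a node subscribed to the full `BULB-2/tele/LWT` path should both be handed the LWT message.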

I physically turn on the bulb. (Hang on)
Message seen:
{"topic":"BULB-2/tele/LWT","payload":"Online","qos":0,"retain":false,"_msgid":"abd3aab5.1a4a38"}

{"topic":"BULB-2/cmnd/POWER","payload":"","qos":0,"retain":false,"_msgid":"1995549a.b8ce1b"}

{"topic":"BULB-2/tele/INFO1","payload":"{\"Module\":\"Generic\",\"Version\":\"8.3.1(tasmota)\",\"FallbackTopic\":\"cmnd/BULB-2_fb/\",\"GroupTopic\":\"tasmotas/cmnd/\"}","qos":0,"retain":false,"_msgid":"976a6835.c83368"}

{"topic":"BULB-2/tele/INFO2","payload":"{\"WebServerMode\":\"Admin\",\"Hostname\":\"BULB-2\",\"IPAddress\":\"192.168.0.26\"}","qos":0,"retain":false,"_msgid":"5cda1d1d.d4f4c4"}

{"topic":"BULB-2/tele/INFO3","payload":"{\"RestartReason\":\"Power On\"}","qos":0,"retain":false,"_msgid":"6dea984b.78d258"}

{"topic":"BULB-2/stat/RESULT","payload":"{\"POWER\":\"ON\"}","qos":0,"retain":false,"_msgid":"7363ffad.62521"}

{"topic":"BULB-2/stat/POWER","payload":"ON","qos":0,"retain":false,"_msgid":"84a96076.6256e"}

{"topic":"BULB-2/stat/RESULT","payload":"{\"POWER\":\"ON\"}","qos":0,"retain":false,"_msgid":"331deac2.6a46c6"}

{"topic":"BULB-2/stat/POWER","payload":"ON","qos":0,"retain":false,"_msgid":"87bc438.9f06ac"}

{"topic":"BULB-2/tele/STATE","payload":"{\"Time\":\"2020-07-22T18:17:42+10:00\",\"Uptime\":\"0T00:00:09\",\"UptimeSec\":9,\"Heap\":29,\"SleepMode\":\"Dynamic\",\"Sleep\":10,\"LoadAvg\":42,\"MqttCount\":1,\"POWER\":\"ON\",\"Dimmer\":70,\"Color\":\"7241\",\"White\":70,\"CT\":280,\"Channel\":[45,25],\"Fade\":\"OFF\",\"Speed\":1,\"LedTable\":\"ON\",\"Wifi\":{\"AP\":1,\"SSId\":\"Marys_Farm_2.4\",\"BSSId\":\"24:F5:A2:B2:2A:07\",\"Channel\":6,\"RSSI\":100,\"Signal\":-50,\"LinkCount\":1,\"Downtime\":\"0T00:00:03\"}}","qos":0,"retain":false,"_msgid":"158bea6b.64bf46"}

Make of that what you will.

But, the first message:
{"topic":"BULB-2/tele/LWT","payload":"Online","qos":0,"retain":false,"_msgid":"abd3aab5.1a4a38"}

extract:
"topic":"BULB-2/tele/LWT","payload":"Online"
Note: Online.

This is what the dashboard looks like:
(on the 24/7 machine)

This is what I see on this machine:

See also that BULB-1 is incorrectly shown on machine 1. (Red cross)

Now I physically turn off the bulb.

MQTT message received:
{"topic":"BULB-2/tele/LWT","payload":"Offline","qos":1,"retain":false,"_msgid":"5064fdea.e0e5a4"}

Note: Offline

The 24/7 machine dashboard:

This machine's dashboard:

So as of now BULB-2 is correct on both machines.

Next time I power this machine up and look at the other machine, which won't (shouldn't) have been powered down or had anything else happen to it:

Its screen will be incorrect, and this machine's screen will be correct.

WHY?

Maybe use node-red-contrib-flogger and connect it to the MQTT nodes subscribed to the LWT for your UI. That way you can see where it went wrong the next time this happens while you're not watching.
Edit:
In addition, also connect flogger nodes to catch and status nodes watching the MQTT nodes in question. That should help you narrow down what is happening, e.g. seeing whether the LWT msg is sent when expected.

I have this connected to the MQTT in node:

[{"id":"9aaa840a.131cf8","type":"subflow","name":"Time Stamp","info":"","category":"","in":[{"x":80,"y":100,"wires":[{"id":"6e2f05f9.3d8634"}]}],"out":[{"x":660,"y":180,"wires":[{"id":"6e2f05f9.3d8634","port":0},{"id":"df2d30dc.4a8ea","port":0}]},{"x":660,"y":230,"wires":[{"id":"6e2f05f9.3d8634","port":0},{"id":"4c048d86.0f87a4","port":0}]},{"x":660,"y":280,"wires":[{"id":"6e2f05f9.3d8634","port":0},{"id":"ebdb6996.ecdbd8","port":0}]}],"env":[],"color":"#FF8888","outputLabels":["For logging use","msg.time","For filename use"],"icon":"node-red/timer.svg"},{"id":"df2d30dc.4a8ea","type":"moment","z":"9aaa840a.131cf8","name":"","topic":"","input":"","inputType":"msg","inTz":"Australia/Sydney","adjAmount":0,"adjType":"days","adjDir":"add","format":"YYYY-MM-DD HH:mm:ss","locale":"en_AU","output":"","outputType":"msg","outTz":"Australia/Sydney","x":400,"y":180,"wires":[["ebdb6996.ecdbd8","c91e62d1.fa5ce"]]},{"id":"ebdb6996.ecdbd8","type":"string","z":"9aaa840a.131cf8","name":"","methods":[{"name":"replaceAll","params":[{"type":"str","value":":"},{"type":"str","value":""}]}],"prop":"payload","propout":"payload","object":"msg","objectout":"msg","x":450,"y":280,"wires":[[]]},{"id":"c91e62d1.fa5ce","type":"change","z":"9aaa840a.131cf8","name":"TOPIC","rules":[{"t":"move","p":"payload","pt":"msg","to":"time","tot":"msg"}],"action":"","property":"","from":"","to":"","reg":false,"x":230,"y":230,"wires":[["4c048d86.0f87a4"]]},{"id":"820c4f44.982f38","type":"change","z":"9aaa840a.131cf8","name":"Save","rules":[{"t":"set","p":"payload","pt":"flow","to":"payload","tot":"msg"}],"action":"","property":"","from":"","to":"","reg":false,"x":230,"y":140,"wires":[["8ebff2c7.232ed"]]},{"id":"4c048d86.0f87a4","type":"change","z":"9aaa840a.131cf8","name":"Get","rules":[{"t":"set","p":"payload","pt":"msg","to":"payload","tot":"flow"}],"action":"","property":"","from":"","to":"","reg":false,"x":450,"y":230,"wires":[[]]},{"id":"6e2f05f9.3d8634","type":"switch","z":"9aaa840a.131cf8","name":"check topic","property":"topic","propertyType":"msg","rules":[{"t":"eq","v":"TIMESTAMP","vt":"str"},{"t":"else"}],"checkall":"true","repair":false,"outputs":2,"x":210,"y":100,"wires":[[],["820c4f44.982f38"]]},{"id":"8ebff2c7.232ed","type":"change","z":"9aaa840a.131cf8","name":"TimeStamp","rules":[{"t":"set","p":"payload","pt":"msg","to":"","tot":"date"}],"action":"","property":"","from":"","to":"","reg":false,"x":210,"y":180,"wires":[["df2d30dc.4a8ea"]]},{"id":"8459c8bc.6bb67","type":"subflow:9aaa840a.131cf8","z":"6dd5ca4d.d1958c","name":"","x":2700,"y":2170,"wires":[[],["ee89490.9432638"],[]]},{"id":"5aa1f4c2.f97dc4","type":"link in","z":"6dd5ca4d.d1958c","name":"","links":["e89586bf.34a2c","7cd6e58b.0faa74"],"x":2545,"y":2170,"wires":[["8459c8bc.6bb67"]]},{"id":"ee89490.9432638","type":"simple-queue","z":"6dd5ca4d.d1958c","name":"queue1","firstMessageBypass":false,"bypassInterval":"0","x":2861,"y":2220,"wires":[["ea7d84e1.273138"]]},{"id":"d3ad6513.ae9e58","type":"change","z":"6dd5ca4d.d1958c","name":"Read","rules":[{"t":"set","p":"trigger","pt":"msg","to":"1","tot":"str"}],"action":"","property":"","from":"","to":"","reg":false,"x":2680,"y":2220,"wires":[["ee89490.9432638"]]},{"id":"ea7d84e1.273138","type":"debug","z":"6dd5ca4d.d1958c","name":"Raw MQTT message incoming","active":true,"tosidebar":true,"console":false,"tostatus":false,"complete":"true","targetType":"full","statusVal":"","statusType":"auto","x":3075,"y":2220,"wires":[]},{"id":"d5ac3e33.7235","type":"change","z":"6dd5ca4d.d1958c","name":"Wipe","rules":[{"t":"set","p":"reset","pt":"msg","to":"1","tot":"str"}],"action":"","property":"","from":"","to":"","reg":false,"x":2680,"y":2260,"wires":[["ee89490.9432638"]]},{"id":"b2fc48c1.90354","type":"inject","z":"6dd5ca4d.d1958c","name":"Read","repeat":"","crontab":"","once":false,"onceDelay":"","topic":"","payload":" ","payloadType":"str","x":2515,"y":2220,"wires":[["d3ad6513.ae9e58"]]},{"id":"208f0bf4.1bf824","type":"inject","z":"6dd5ca4d.d1958c","name":"Wipe","repeat":"","crontab":"","once":false,"onceDelay":"","topic":"","payload":" ","payloadType":"str","x":2515,"y":2260,"wires":[["d5ac3e33.7235"]]}]

It has a sub-flow (mine) to time stamp the event and a queue node that builds a queue so I can scroll through any messages received by that node.

I shall check what that node you mentioned does.

I'm not going to solve your issue, but I would suspect you have created one or more loops somewhere in your solution. If you look at the messages above that you said were incoming when you turn on the bulb, you will find there are duplicates with the same content but different msg IDs. That, to me, would be a warning signal.
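One way to spot such repeats is a small check that remembers when each topic/payload pair was last seen. This is only a sketch: `makeDuplicateDetector` and the one second window are assumptions for illustration, and it shows where repeats occur, not what causes them.

```javascript
// Hypothetical helper: flag messages whose topic and payload repeat within a time window.
function makeDuplicateDetector(windowMs = 1000) {
    const lastSeen = new Map(); // "topic|payload" -> time last seen (ms)

    return function isDuplicate(msg, now = Date.now()) {
        const key = msg.topic + "|" + String(msg.payload);
        const previous = lastSeen.get(key);
        lastSeen.set(key, now);
        return previous !== undefined && now - previous <= windowMs;
    };
}

// Four of the messages from the log above, replayed 100 ms apart.
const isDuplicate = makeDuplicateDetector(1000);
const incoming = [
    { topic: "BULB-2/stat/RESULT", payload: '{"POWER":"ON"}' },
    { topic: "BULB-2/stat/POWER", payload: "ON" },
    { topic: "BULB-2/stat/RESULT", payload: '{"POWER":"ON"}' },
    { topic: "BULB-2/stat/POWER", payload: "ON" }
];
incoming.forEach((msg, i) => {
    console.log(msg.topic, isDuplicate(msg, i * 100) ? "duplicate" : "first seen");
});
```

The first two messages are reported as "first seen" and the last two as "duplicate". The `now` argument exists so the check can be replayed against logged messages with known times.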

As in msg1 and 3 are the same, and 2 and 4?

Hmmm.... Yeah, ok. Thanks.

I may look into that part.

Appreciated.

You are not using Tasmota in combination with MQTT and NR correctly.

The local feedback loops you have created are now sending duplicate messages, and that will cause problems, as you already know.

The idea from Tasmota is that you send the commands to the device with the CMND topic (/vm/cmnd/LichtHal/power2)
Tasmota will switch on the light and send its feedback back to NR over the STAT topic (/vm/stat/LichtHal/POWER2)
The state of the switch on the dashboard is changed by the incoming STAT topic and not local feedback.

To get this working correctly you need to set the switch configuration correctly.
Pass though message if payload... should be checked off
Indicator must be set to "Switch icon shows state of input"

This works perfectly for me. All the phones and tablets are updated fine, even if they were switched off for days.

Same for dimmers etc. send the command with the CMND topic and wait for the response on the STAT topic.
Use this response to change the slider, color picker etc. state in NR.
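A rough sketch of that pattern, using the topic layout seen in the BULB-2 messages earlier in this thread (`BULB-2/cmnd/POWER` and `BULB-2/stat/POWER`). The function names are hypothetical and other Tasmota setups may order the topic levels differently, as in the /vm/cmnd/LichtHal example above.

```javascript
// Build the command to publish; nothing on the dashboard changes at this point.
function buildCommand(device, wanted) {
    return { topic: `${device}/cmnd/POWER`, payload: wanted };
}

// Hypothetical helper: derive the dashboard state only from the device's STAT reply.
function stateFromStat(device, msg) {
    if (msg.topic !== `${device}/stat/POWER`) {
        return null; // not a power status for this device
    }
    if (msg.payload === "ON") return true;
    if (msg.payload === "OFF") return false;
    return null; // unexpected payload
}

// Command goes out, and the UI waits for the confirmation.
console.log(buildCommand("BULB-2", "ON"));
console.log(stateFromStat("BULB-2", { topic: "BULB-2/stat/POWER", payload: "ON" }));
```

Because every dashboard derives its state from the same STAT message, it should not matter which machine sent the command.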

I think I am doing that if you really look at the code.

When I press the button (switch/what ever you want to call it), it sends an X message into the function node.
From there the message generates an ON or OFF message to be sent to the Tasmota device.
At the same time it sends a message to the button to have an orange colour. Indicating the press detection.
When the MQTT message comes in it goes through a switch then change node.
Depending if the message (payload) was ON or OFF the correct colour for the button is set to either RED or LIME, indicating the bulb is OFF or ON.
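In sketch form, the colour handling described above looks something like this. The function names are invented for illustration; `background` is used because the ui_button in the posted flow has its bgcolor set to `{{msg.background}}`.

```javascript
// Colours as described in the post: orange while waiting, LIME for ON, RED for OFF.
// Hypothetical helpers, not the actual nodes from the flow.
function onButtonPress() {
    return { background: "orange" }; // press detected, waiting for the device
}

function onStatus(payload) {
    if (payload === "ON") return { background: "lime" };
    if (payload === "OFF") return { background: "red" };
    return null; // ignore anything else
}

console.log(onButtonPress());
console.log(onStatus("ON"));
console.log(onStatus("OFF"));
```

If the status message never arrives, the button stays orange, which at least makes a missed message visible.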

So I am doing what I believe you are saying I should be doing.

What code? I don't see code that matches your first post.
Can't test anything without the proper code.