Help debugging an MQTT/InfluxDB "host not found" connection issue

Hello guys, I am new to the forums. This is not a Node-RED-specific question, but I would like to share the issue here so you can help me work out how to debug it.

I am running a Raspberry Pi with Docker + Portainer + Grafana + InfluxDB + EMQX.

All the containers share a network, so they can talk to each other by name instead of by IP. My MQTT server is http://emqx:1880/ and my InfluxDB in Grafana is http://influxdb:8086

This same setup is working fine on other Pis. I just did a fresh install on this one and found that the MQTT connection drops sometimes. I was not sure whether the problem came from the two ESP8266s and the local network, so I sent a dummy payload from a Node-RED instance on a Linode, and I got the same results.

The Pi was online the whole time. I was able to log in and view the debug messages in real time, but the MQTT connections were stuck in the "connecting" state, both the one to the local emqx broker and the ones to brokers outside the Pi. The weird thing is that one data point was written every hour.

Also, at a different time, I was able to log in and see the MQTT messages on the debug node, but the influx node was giving me a "host not found" error.

But I was able to log in to Grafana, and Grafana was able to pull the database from the same container, so the InfluxDB server was "online".

After restarting the Node-RED container everything was working again, but roughly every 20 hours it gets stuck again.

I would like any recommendations on how to troubleshoot this.
Also, during all of these events Node-RED was running. I was able to modify, stop and restart the flow, and the state stayed the same: the MQTT connections, both to the local Pi and to the Linode, were unable to connect.

Welcome @Andresc4
I have had issues with my ESP8266s losing connection with the pressure transducers and/or my 1-Wire temperature sensors.
I created a "babysitting" flow that monitors my "sensor" MQTT messages.
Each message feeds a trigger node. If messages keep arriving as they should, it does nothing; if none arrives for 60 seconds, it sends a trigger saying the MQTT messages have stopped coming in.
I actually set it up to send me an email when it goes down and when it is back up.
I set up a Gmail address just for monitoring errors.
I am fairly new to Node-RED, but I do run a brewing system from it and love Node-RED.
It's been addicting learning all the capabilities of Node-RED and how it can help me and help me help others.
Attached is a flow.
I can walk you through what you need to change if needed, but this will get you going on monitoring the specific MQTT messages that seem to stop working.
Also note that this is set to reboot the ESP8266 so it can resync with the input devices connected to it.
Keep in mind I use Tasmota on my ESP8266s, so I didn't have to upload code with the Arduino IDE.

[{"id":"fe9d6e7861d88d65","type":"tab","label":"Flow 2","disabled":false,"info":"","env":[]},{"id":"683bee9bf72b5c31","type":"group","z":"fe9d6e7861d88d65","name":"Temp Sensor Baby sitting","style":{"stroke":"#000000","fill":"#bfdbef","label":true,"color":"#000000"},"nodes":["f716f8c2fdec4f8f","132a36d56fc6365a","b28545a0e6840e56","65d999f28ddb3c3e","806af3511ad100b6","fe44a097296be644","efc35da4f5ca30c3","d1d474e913e03daf","ef35a4d5efef66c6","bfb15d230e8b5b4d","bf095db852075129","e872f76981f1a2b9","87b5769933375420","1f26f5396fa01acf","e793aea4f5277060","7f21a7afae815ec8","063216bb1e227704","8b0d0f21b7085c35","808d3b57149f3e41","98dd613b6546063b","27b346e939184dfa","bf0deb414ca42770","5338d248d38744ad","6803e2d1d44b10b7","544c4fa49d3cab53","680d37e51c4f4b9a"],"x":30,"y":98,"w":1349.4999380111694,"h":476.829083442688},{"id":"fe44a097296be644","type":"e-mail","z":"fe9d6e7861d88d65","g":"683bee9bf72b5c31","server":"smtp.gmail.com","port":"465","secure":true,"tls":true,"name":"railhoperrors@gmail.com","dname":"","credentials":{},"x":659.7011022567749,"y":190.25610637664795,"wires":[]},{"id":"ef35a4d5efef66c6","type":"e-mail","z":"fe9d6e7861d88d65","g":"683bee9bf72b5c31","server":"smtp.gmail.com","port":"465","secure":true,"tls":true,"name":"railhoperrors@gmail.com","dname":"","credentials":{},"x":1243.4999380111694,"y":441.82908153533936,"wires":[]},{"id":"f716f8c2fdec4f8f","type":"inject","z":"fe9d6e7861d88d65","g":"683bee9bf72b5c31","name":"must change topic","props":[{"p":"payload"},{"p":"topic","vt":"str"}],"repeat":"","crontab":"","once":false,"onceDelay":0.1,"topic":"cmnd/YOURTOPIC/RESTART","payload":"1","payloadType":"num","x":609.5000905990601,"y":344.162410736084,"wires":[["132a36d56fc6365a"]]},{"id":"132a36d56fc6365a","type":"ui_button","z":"fe9d6e7861d88d65","g":"683bee9bf72b5c31","name":"needs your topic","group":"f5e2c3bf3d5a99b2","order":1,"width":0,"height":0,"passthru":true,"label":"REBOOT","tooltip":"","color":"","bgcolor":"","icon":"","payload":"1","payloadType":"str","topic":"cmnd/YOURTOPIC/RESTART","x":860.5000905990601,"y":217.1623821258545,"wires":[["d1d474e913e03daf"]]},{"id":"b28545a0e6840e56","type":"trigger","z":"fe9d6e7861d88d65","g":"683bee9bf72b5c31","name":"","op1":"Make a statement that works and makesense","op2":"On","op1type":"str","op2type":"str","duration":"60","extend":true,"overrideDelay":false,"units":"s","reset":"","bytopic":"all","topic":"topic","outputs":2,"x":328.5000066757202,"y":210.82907485961914,"wires":[["806af3511ad100b6","7f21a7afae815ec8","544c4fa49d3cab53"],["65d999f28ddb3c3e","132a36d56fc6365a","bf095db852075129","bf0deb414ca42770"]]},{"id":"65d999f28ddb3c3e","type":"debug","z":"fe9d6e7861d88d65","g":"683bee9bf72b5c31","name":"badsensor","active":true,"tosidebar":true,"console":false,"tostatus":false,"complete":"payload","targetType":"msg","statusVal":"","statusType":"auto","x":526.5000905990601,"y":300.1623821258545,"wires":[]},{"id":"806af3511ad100b6","type":"debug","z":"fe9d6e7861d88d65","g":"683bee9bf72b5c31","name":"sensor good","active":true,"tosidebar":true,"console":false,"tostatus":false,"complete":"payload","targetType":"msg","statusVal":"","statusType":"auto","x":426.1543912887573,"y":139,"wires":[]},{"id":"efc35da4f5ca30c3","type":"link in","z":"fe9d6e7861d88d65","g":"683bee9bf72b5c31","name":"either link or replace with mqtt in","links":["7c4cbc1a995ee8c7"],"x":71,"y":211.82907485961914,"wires":[["b28545a0e6840e56","808d3b57149f3e41"]]},{"id":"d1d474e913e03daf","type":"mqtt out","z":"fe9d6e7861d88d65","g":"683bee9bf72b5c31","name":"your topic needs to go here","topic":"cmnd/YOURTOPIC/RESTART","qos":"","retain":"","respTopic":"","contentType":"","userProps":"","correl":"","expiry":"","broker":"31a99116.50a74e","x":1129.50009059906,"y":222.1623821258545,"wires":[]},{"id":"bfb15d230e8b5b4d","type":"gate","z":"fe9d6e7861d88d65","g":"683bee9bf72b5c31","name":"","controlTopic":"control","defaultState":"open","openCmd":"open","closeCmd":"close","toggleCmd":"toggle","defaultCmd":"default","statusCmd":"status","persist":false,"storeName":"default","x":1055.4999685287476,"y":418.8291301727295,"wires":[["8b0d0f21b7085c35","680d37e51c4f4b9a"]]},{"id":"bf095db852075129","type":"mytimeout","z":"fe9d6e7861d88d65","g":"683bee9bf72b5c31","name":"Close the Gate after a single message","outtopic":"","outsafe":"Sensor Failed your comments that tells you its not working","outwarning":"","outunsafe":"off","warning":"5","timer":"10","debug":false,"ndebug":false,"ignoreCase":false,"repeat":false,"again":false,"x":571.5000905990601,"y":512.8291301727295,"wires":[["bfb15d230e8b5b4d","e872f76981f1a2b9"],["87b5769933375420"]]},{"id":"e872f76981f1a2b9","type":"debug","z":"fe9d6e7861d88d65","g":"683bee9bf72b5c31","name":"After Timer","active":true,"tosidebar":true,"console":false,"tostatus":false,"complete":"true","targetType":"full","statusVal":"","statusType":"auto","x":977.4999380111694,"y":363.8290796279907,"wires":[]},{"id":"87b5769933375420","type":"change","z":"fe9d6e7861d88d65","g":"683bee9bf72b5c31","name":"Close the Gate","rules":[{"t":"set","p":"payload","pt":"msg","to":"Close","tot":"str"},{"t":"set","p":"topic","pt":"msg","to":"control","tot":"str"}],"action":"","property":"","from":"","to":"","reg":false,"x":914.4999303817749,"y":533.829083442688,"wires":[["bfb15d230e8b5b4d"]]},{"id":"1f26f5396fa01acf","type":"link in","z":"fe9d6e7861d88d65","g":"683bee9bf72b5c31","name":"Fermenter (East) Gate Control In","links":["e793aea4f5277060"],"x":926.4999380111694,"y":469.82907581329346,"wires":[["bfb15d230e8b5b4d"]]},{"id":"e793aea4f5277060","type":"link out","z":"fe9d6e7861d88d65","g":"683bee9bf72b5c31","name":"Fermenter (East) Out to Gate Control","mode":"link","links":["1f26f5396fa01acf"],"x":1037.4999990463257,"y":159.8290548324585,"wires":[]},{"id":"7f21a7afae815ec8","type":"change","z":"fe9d6e7861d88d65","g":"683bee9bf72b5c31","name":"Open The Gate after a Good Temperature Read","rules":[{"t":"set","p":"payload","pt":"msg","to":"Open","tot":"str"},{"t":"set","p":"topic","pt":"msg","to":"control","tot":"str"}],"action":"","property":"","from":"","to":"","reg":false,"x":727.4999380111694,"y":153.82910346984863,"wires":[["e793aea4f5277060"]]},{"id":"063216bb1e227704","type":"inject","z":"fe9d6e7861d88d65","g":"683bee9bf72b5c31","name":"Test Flow","props":[{"p":"payload"},{"p":"topic","vt":"str"}],"repeat":"","crontab":"","once":false,"onceDelay":0.1,"topic":"","payload":"","payloadType":"date","x":169.50008296966553,"y":172.162446975708,"wires":[["b28545a0e6840e56"]]},{"id":"8b0d0f21b7085c35","type":"debug","z":"fe9d6e7861d88d65","g":"683bee9bf72b5c31","name":"After Gate","active":true,"tosidebar":true,"console":false,"tostatus":false,"complete":"true","targetType":"full","statusVal":"","statusType":"auto","x":1202.5001058578491,"y":398.16241455078125,"wires":[]},{"id":"808d3b57149f3e41","type":"debug","z":"fe9d6e7861d88d65","g":"683bee9bf72b5c31","name":"","active":true,"tosidebar":true,"console":false,"tostatus":false,"complete":"false","statusVal":"","statusType":"auto","x":173.16674709320068,"y":259.8290777206421,"wires":[]},{"id":"98dd613b6546063b","type":"change","z":"fe9d6e7861d88d65","g":"683bee9bf72b5c31","name":"","rules":[{"t":"set","p":"payload","pt":"msg","to":"1","tot":"str"}],"action":"","property":"","from":"","to":"","reg":false,"x":925.1668386459351,"y":261.8291301727295,"wires":[["d1d474e913e03daf","27b346e939184dfa"]]},{"id":"27b346e939184dfa","type":"debug","z":"fe9d6e7861d88d65","g":"683bee9bf72b5c31","name":"","active":true,"tosidebar":true,"console":false,"tostatus":false,"complete":"false","statusVal":"","statusType":"auto","x":1118.1667165756226,"y":271.829008102417,"wires":[]},{"id":"bf0deb414ca42770","type":"link out","z":"fe9d6e7861d88d65","g":"683bee9bf72b5c31","name":"Restart Out Link","mode":"link","links":["5338d248d38744ad"],"x":493.1667776107788,"y":246.829008102417,"wires":[]},{"id":"5338d248d38744ad","type":"link in","z":"fe9d6e7861d88d65","g":"683bee9bf72b5c31","name":"restart in link","links":["bf0deb414ca42770"],"x":809.1668386459351,"y":258.829008102417,"wires":[["98dd613b6546063b"]]},{"id":"6803e2d1d44b10b7","type":"comment","z":"fe9d6e7861d88d65","g":"683bee9bf72b5c31","name":"Temp Sensor Babysitting","info":"this section controls email notifications if the temperature sensor stops reporting to the esp8266\nemails should be sent to railhoperrors@gmail.com\nths section of flow should also restart the esp8266\nto refresh connection with the ds18b20 temp sensor.","x":173.16680812835693,"y":523.8290548324585,"wires":[]},{"id":"544c4fa49d3cab53","type":"e-mail","z":"fe9d6e7861d88d65","g":"683bee9bf72b5c31","server":"smtp.gmail.com","port":"465","secure":true,"tls":true,"name":"","dname":"","x":590.51171875,"y":187.9238375,"wires":[]},{"id":"680d37e51c4f4b9a","type":"e-mail","z":"fe9d6e7861d88d65","g":"683bee9bf72b5c31","server":"smtp.gmail.com","port":"465","secure":true,"tls":true,"name":"","dname":"","x":1205.5078010559082,"y":486.9277563095093,"wires":[]},{"id":"f5e2c3bf3d5a99b2","type":"ui_group","name":"Default","tab":"450ac4c9b5c2acc6","order":1,"disp":true,"width":"11","collapse":false},{"id":"31a99116.50a74e","type":"mqtt-broker","broker":"localhost","port":"1883","clientid":"","usetls":false,"compatmode":true,"keepalive":"60","cleansession":true,"birthTopic":"","birthQos":"0","birthPayload":"","willTopic":"","willQos":"0","willPayload":""},{"id":"450ac4c9b5c2acc6","type":"ui_tab","name":"Home","icon":"dashboard","disabled":false,"hidden":false}]

@Andresc4 in my flow above there may be some nodes you don't have installed.
Let me know if any show up as missing and I'll get you links to them.

The most important part is how we communicate with the ESP8266.
Mine are flashed with Tasmota. Tasmota has a console where you can enter commands,
but you can also send the commands through MQTT.
If you are using Tasmota on your ESP8266, this will monitor your chosen messages, and if they stop coming in the flow will take over: it lets you know by email that they stopped and reboots the ESP. I have a UI button in the flow so I can also reboot manually from the dashboard if I choose to.
Cheers,

It's not strictly true to say the containers can talk without a network. There is actually a bridge network that Docker creates, and by default all containers join it.

This default bridge does not usually provide DNS resolution between containers, however, which you are implying you have, so I assume you have created an additional bridge and joined all the containers to it?

As an example here is my docker setup on my test/dev host

To try and help you isolate this, I would suggest changing the connections in each container to use IP addressing, to confirm that comms are OK.
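
If it helps with that test, one way to look up the addresses is to ask Docker which IP each container currently has on the shared network. This is only a minimal sketch: the function name is made up for illustration, and `iotstack` in the usage line is a placeholder for whatever your user-defined network is called.

```shell
# List "name ip/prefix" for every container attached to a Docker network.
# $1 = network name (placeholder; use your own).
list_network_ips() {
  docker network inspect "$1" \
    --format '{{range .Containers}}{{println .Name .IPv4Address}}{{end}}'
}

# Usage (on the Docker host):
#   list_network_ips iotstack
```

Keep in mind that these addresses are handed out by Docker and can change when containers are recreated, so they are fine for a test but not necessarily for a permanent setup.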

It would also be worthwhile looking at how busy this Pi is. @BartButenaers has just published a nice node that does CPU profiling, so you can see if the issue is in there somewhere.

Craig

@9toejack thanks for the reply and for your flow. The ESP side is not the issue: I wrote my own firmware with MQTT and Wi-Fi watchdogs. If MQTT is down, it retries the connection 5 times; if that fails, it triggers a Wi-Fi watchdog and rejoins the Wi-Fi network. I have more than 50 of them running in different places and on different networks, and I am quite happy with the result, but what you suggested is true.

@craigcurtin Agreed, they have a "standard" network, and I made a new one like this before creating my containers:

docker network create --subnet=172.18.1.0/16 iot

After I created all the containers, I set them to join my new network "iotstack" so they can "see" each other.

I used to run this using IPs, but in some cases, if the router connected to the Pi loses its connection and the Ethernet link goes down, the Pi loses its IP and the flows stop. As a workaround I put my own router next to the Pi everywhere I go, but I wanted a "standalone" solution. I did not find a way to set a fixed IP between containers, and the tutorials I read said that using hostnames was the solution. It is for many of my other setups running the same flow and settings as this one; the only "difference" is that this Pi is dialing a VPN with OpenVPN.
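
For what it's worth, Docker can pin a container to a fixed address with `--ip`, as long as the container is attached to a user-defined network that was created with an explicit subnet. The following is a sketch rather than a tested recipe for this setup: the network name comes from this thread, while the subnet, the `.10`/`.11` addresses and the function name are example values chosen for illustration.

```shell
# Create a user-defined bridge with a known subnet, then start two of the
# containers with fixed addresses on it. Subnet and addresses are examples.
create_static_stack() {
  docker network create --subnet=172.18.0.0/16 iotstack
  docker run -d --name emqx --network iotstack --ip 172.18.0.10 \
    -p 1883:1883 -p 18083:18083 emqx/emqx:4.3.10
  docker run -d --name influxdb --network iotstack --ip 172.18.0.11 \
    -p 8086:8086 influxdb:1.7
}

# Usage (only on a host where these names are not already in use):
#   create_static_stack
```

These addresses live on Docker's internal bridge, so they should not depend on the Pi's Ethernet link or router the way the LAN IP does.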

I just checked the flow: I was getting an error from the influx node, but I did have MQTT messages.

After restarting the flow, the MQTT part was lost too.

So I jumped into the container console and checked whether the host could be resolved, and it was OK.
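
The same check can be run from the Docker host without opening a console, and it is worth looking at the resolver configuration at the same time. A small sketch, assuming the container image ships `nslookup` (BusyBox-based images usually do); the function name is hypothetical and the container and host names in the usage line are the ones used in this thread.

```shell
# Show the resolver a container is configured to use, then resolve a name
# from inside it. $1 = container to run the check in, $2 = hostname to look up.
check_container_dns() {
  docker exec "$1" cat /etc/resolv.conf
  docker exec "$1" nslookup "$2"
}

# Usage:
#   check_container_dns mynodered influxdb
```

On a user-defined network, `/etc/resolv.conf` normally points at Docker's embedded DNS server, 127.0.0.11. If it shows something else, that would be a lead worth following. A successful lookup is also only a point-in-time sample, so it does not rule out intermittent failures.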

image on next post

So this is getting me very confused.
Why does it work, and why does it stop working every 10 hours or so?

For now I have switched to pointing at the Pi's local IP, 2.0.0.100, as I used to do before.
I am quite confused.

Well, Craig is definitely a better teacher than I am. I was thinking you had issues with MQTT. The flow I shared was the result of @craigcurtin helping me.
Hopefully some of the others chime in. I'll be watching how this pans out.
Sorry I wasn't able to help.

Have you checked the logs in each of the containers ?
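
As a rough starting point for that, the logs of several containers can be filtered for resolution and connection errors in one go. This is a sketch under assumptions: the function name is made up, the container names in the usage line are the `--name` values used in this thread, and the error strings are just common Node.js and network error texts, not a complete list.

```shell
# Print recent log lines that mention name-resolution or connection errors.
# $1 = how far back to look (e.g. 2h); remaining args = container names.
scan_container_logs() {
  since="$1"; shift
  for c in "$@"; do
    echo "== $c =="
    docker logs --since "$since" --timestamps "$c" 2>&1 \
      | grep -iE 'not found|ENOTFOUND|EAI_AGAIN|ECONNREFUSED|timeout' || true
  done
}

# Usage:
#   scan_container_logs 2h mynodered emqx influxdb grafana
```

The timestamps make it easier to line up an error in Node-RED with whatever the broker or database logged at the same moment.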

Also, I am inferring from your ping output above that the IP network and DNS were in fact still running OK (I assume this ping was done while you had an MQTT failure?).

Can we clarify that: does the IP network still seem to be operational when you have the data transmission failure? Can you leave a console open on each container, as above, with a ping running, and then review it when you have a failure?
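
A plain `ping` scrolls away quickly, so a timestamped log may be easier to review after a failure. The sketch below is one possible way to do it, meant to be pasted into a shell inside a container. The function names and the log file name are hypothetical, the host names are the ones from this thread, and the `-c`/`-W` options are assumed to be supported by the container's `ping` (BusyBox and iputils both accept them).

```shell
# Log one timestamped line per probe: "ok" if the name resolves and the host
# answers a single ping within 2 seconds, "FAIL" otherwise.
probe_host() {
  host="$1"
  ts="$(date '+%Y-%m-%d %H:%M:%S')"
  if ping -c 1 -W 2 "$host" >/dev/null 2>&1; then
    echo "$ts $host ok"
  else
    echo "$ts $host FAIL"
  fi
}

# Probe every host given as an argument once a minute until interrupted.
monitor_hosts() {
  while true; do
    for h in "$@"; do probe_host "$h"; done
    sleep 60
  done
}

# Usage inside the Node-RED container, writing to its /data volume:
#   monitor_hosts emqx influxdb >> /data/netcheck.log
```

Comparing the FAIL timestamps (or their absence) with the moments Node-RED reports "host not found" should show whether the network itself or only the Node-RED process is affected.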

I assume you have checked the broker logs, although it appears you are not using the standard Mosquitto broker that most people go for? Have you tried substituting another container running Mosquitto as a test?

I must say, for inter-container traffic those ping times seem quite long. I have just run a test on mine (admittedly a much more powerful Linux PC) and I am getting sub-100ms responses.

Craig

@craigcurtin I did not find any relevant information in the logs, just the same error as the one I found in the debug area, "no host available".

"Also i am inferring based on your pinging above - that in fact the IP network and DNS was still running OK (i assume this pinging was when you had an MQTT failure ?)"
Yes, I ran that ping to emqx at the exact moment Node-RED was not able to connect to it.
And, also relevant, at that same moment I was able to connect to that MQTT server from outside the Pi, so the Docker container running emqx was fine.

To recap:
Node-RED was not able to connect to MQTT, but at the same time I was able to prove that the MQTT broker was online.
Node-RED was able to connect to MQTT but was not able to write to InfluxDB, while Grafana was running and I was able to query the influx db, so those were fine too.
While I had these issues, I was able to connect to the broker from outside the Pi.
From each container's console, I was able to ping the other containers using hostnames like "influx", "nodered" and "emqx"; this worked fine every time.
And this last one is the strangest: Node-RED was not able to open an MQTT connection to a Linode MQTT server (which was online and tested), but it did work after a Node-RED restart.
All the issues were solved by restarting the Node-RED Docker container.

So yes, it seems to be an issue with the hostnames or with the Docker settings, but if that were the case, why does the ping from the console work?
Also, why did I get that strange behavior of it working once every hour (check the Grafana plot)?

Now I have left all the containers reaching the services via the LAN IP of the Pi, 2.0.0.100. Using localhost does not work between containers, but the local IP does.
And I have 30 hours of stable connection.

So I am not sure what other tests to do. I did not find any relevant information in the logs of any of the containers.

Sorry, not understanding here: Node-RED was or was not able to connect to MQTT? You say both above.

Can you post the Docker Compose for each of the containers so we can see the env, ports etc.

Craig

@craigcurtin my bad, I reread my post and it was quite confusing.

Both things are true, at different times.
I saw Node-RED unable to connect to the local "emqx" MQTT server and to the one on the Linode host. I deleted the 2 Linode nodes and it still was not able to connect to the local emqx instance. I was able to connect to emqx on the Pi from my PC.
After restarting the Node-RED container (I had already restarted the flow 5 times by then) it was able to connect to both the emqx node and the Linode node.
At this point, after restarting the container, MQTT and posting to InfluxDB were working just fine.

On a different day, the MQTT messages were arriving from the emqx broker, but the influx node gave me an error, the one I posted before. From the console of the Node-RED container I was able to ping the influxdb host, and I did get a reply.

docker network create iotstack
docker run -it -p 1880:1880 -v node_red_data:/data --name mynodered nodered/node-red --net iotstack  # ok!
sudo docker run -d -p 9000:9000 --name=portainer --restart=always -v /var/run/docker.sock:/var/run/docker.sock -v portainer_data:/data portainer/portainer-ce:latest  # OK
docker run -d --name emqx -p 18083:18083 -p 1883:1883 -p 8083:8083 -p 8084:8084 -e EMQX_ALLOW_ANONYMOUS=true emqx/emqx:4.3.10  # use this one
docker run -d -p 3000:3000 --name grafana grafana/grafana-enterprise:latest
docker run -d -p 8086:8086 -p 2003:2003 -v /volumes/influxdb/data:/var/lib/influxdb -v /backups/influxdb/db:/var/lib/influxdb/backup --name influxdb influxdb:1.7 --net iotstack

After this, I manually select all the containers to join the iotstack network, and they end up on both the bridge network and the iotstack network.
As a test, I removed the bridge network from them, and they all worked with the iotstack network only.
It did work, but after 20 hours I had the same problem: no host, yet every hour it was able to post a few points.
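
One detail in the commands above that may be worth ruling out: `docker run` only reads its own options up to the image name, and anything after the image is handed to the container as its command arguments. In the nodered and influxdb lines, `--net iotstack` comes after the image, so it most likely never reached Docker, which would be consistent with having to join the network by hand afterwards. Below is a sketch of those two commands with the option moved in front of the image. The names, ports and volumes are as posted, `--network` is the current spelling of `--net`, and the wrapper function name is made up. Whether this has anything to do with the 20-hour failures is not established.

```shell
# Same containers as posted, with the network option placed BEFORE the image
# name so that docker itself reads it.
run_on_iotstack() {
  docker run -it -p 1880:1880 -v node_red_data:/data \
    --name mynodered --network iotstack nodered/node-red
  docker run -d -p 8086:8086 -p 2003:2003 \
    -v /volumes/influxdb/data:/var/lib/influxdb \
    -v /backups/influxdb/db:/var/lib/influxdb/backup \
    --name influxdb --network iotstack influxdb:1.7
}

# Usage (after removing the existing containers with the same names):
#   run_on_iotstack
```

Started this way, a container is attached only to iotstack from the beginning, so there is no need to add the network manually or remove the default bridge afterwards.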