MQTT connection to AWS IoT Core disconnecting and not reconnecting without restarting flows

I have been working to implement Node-RED into a few factories. With one of the locations, I have seen multiple times now where the MQTT node is disconnected from AWS IoT Core (sometimes internet outage, other times not sure why) and is unable to reconnect on its own. The current frequency is about once or twice per week.

In order to reconnect the node, we have to log into Node-RED and click "Restart Flows" which will instantly reconnect the broker.

I have already tried to fix the issue automatically by injecting "disconnect" and "connect" into the latest version of the MQTT nodes. This however didn't help it reconnect.

This same instance of Node-RED is also connected to a local Mosquitto broker that does not experience the same issues.

The connection settings for the AWS broker are the following...

  • Server = XXXXXXXXX-ats.iot.us-east-1.amazonaws.com
  • Port = 443
  • Connect Automatically = true
  • Use TLS = true
  • Protocol = MQTT V3.1.1
  • Client ID = auto generated
  • Keep Alive = 60
  • Use Clean Session = true
  • Verify server certificate = true
  • ALPN Protocol = x-amzn-mqtt-ca

The VM is running Node-RED v2.1.3 with Node.JS v17.0.1

Any insight or help will be greatly appreciated.

1 Like

Could you add something to your flow to check the topics you are posting dont contain any invalid characters?

If possible, add the Bad Topic route (2nd output of function) to a log file or somewhere you can find it historically.

chrome_ESw3yTyB4g

[
    {
        "id": "122dbdbf51b1a76c",
        "type": "function",
        "z": "76a6f1dc1056998c",
        "name": "Check topic",
        "func": "\nvar topicOK = !/[\\+#\\b\\f\\n\\r\\t\\v\\0]/.test(msg.topic);\n\nif(topicOK) {\n    return msg; //first output\n}\nreturn [null, msg]; //second output\n",
        "outputs": 2,
        "noerr": 0,
        "initialize": "",
        "finalize": "",
        "libs": [],
        "x": 1206,
        "y": 864,
        "wires": [
            [
                "7de2b1885ee0ccd0",
                "9ffbfa9abc65a106"
            ],
            [
                "b29d540cf2742c0c"
            ]
        ]
    },
    {
        "id": "3612e50214614a5d",
        "type": "inject",
        "z": "76a6f1dc1056998c",
        "name": "ok topic ",
        "props": [
            {
                "p": "payload"
            },
            {
                "p": "topic",
                "vt": "str"
            }
        ],
        "repeat": "",
        "crontab": "",
        "once": false,
        "onceDelay": 0.1,
        "topic": "test/test3/ok",
        "payloadType": "date",
        "x": 1042,
        "y": 832,
        "wires": [
            [
                "122dbdbf51b1a76c"
            ]
        ]
    },
    {
        "id": "46c1fc418caacb69",
        "type": "inject",
        "z": "76a6f1dc1056998c",
        "name": "bad topic ",
        "props": [
            {
                "p": "payload"
            },
            {
                "p": "topic",
                "v": "\"test/test2/test3\\t\"",
                "vt": "json"
            }
        ],
        "repeat": "",
        "crontab": "",
        "once": false,
        "onceDelay": 0.1,
        "topic": "",
        "payloadType": "date",
        "x": 1036,
        "y": 896,
        "wires": [
            [
                "122dbdbf51b1a76c"
            ]
        ]
    },
    {
        "id": "b29d540cf2742c0c",
        "type": "debug",
        "z": "76a6f1dc1056998c",
        "name": "Bad Topic",
        "active": true,
        "tosidebar": true,
        "console": true,
        "tostatus": true,
        "complete": "payload",
        "targetType": "msg",
        "statusVal": "topic",
        "statusType": "msg",
        "x": 1398,
        "y": 896,
        "wires": []
    },
    {
        "id": "7de2b1885ee0ccd0",
        "type": "mqtt out",
        "z": "76a6f1dc1056998c",
        "name": "",
        "topic": "",
        "qos": "",
        "retain": "",
        "respTopic": "",
        "contentType": "",
        "userProps": "",
        "correl": "",
        "expiry": "",
        "broker": "dc811dae.05f5",
        "x": 1378,
        "y": 832,
        "wires": []
    },
    {
        "id": "9ffbfa9abc65a106",
        "type": "debug",
        "z": "76a6f1dc1056998c",
        "name": "OK Topic",
        "active": true,
        "tosidebar": false,
        "console": false,
        "tostatus": true,
        "complete": "topic",
        "targetType": "msg",
        "statusVal": "topic",
        "statusType": "auto",
        "x": 1388,
        "y": 784,
        "wires": []
    },
    {
        "id": "11e6f69902ed534b",
        "type": "inject",
        "z": "76a6f1dc1056998c",
        "name": "ok topic ",
        "props": [
            {
                "p": "payload"
            },
            {
                "p": "topic",
                "vt": "str"
            }
        ],
        "repeat": "",
        "crontab": "",
        "once": false,
        "onceDelay": 0.1,
        "topic": "test/ok",
        "payloadType": "date",
        "x": 1042,
        "y": 784,
        "wires": [
            [
                "122dbdbf51b1a76c"
            ]
        ]
    },
    {
        "id": "8fdd5c87e627832f",
        "type": "inject",
        "z": "76a6f1dc1056998c",
        "name": "bad topic ",
        "props": [
            {
                "p": "payload"
            },
            {
                "p": "topic",
                "v": "\"test/test\\n\"",
                "vt": "json"
            }
        ],
        "repeat": "",
        "crontab": "",
        "once": false,
        "onceDelay": 0.1,
        "topic": "",
        "payloadType": "date",
        "x": 1036,
        "y": 944,
        "wires": [
            [
                "122dbdbf51b1a76c"
            ]
        ]
    },
    {
        "id": "dc811dae.05f5",
        "type": "mqtt-broker",
        "name": "",
        "broker": "broker.emqx.io",
        "port": "1883",
        "clientid": "",
        "usetls": false,
        "compatmode": false,
        "protocolVersion": "5",
        "keepalive": "50",
        "cleansession": true,
        "birthTopic": "",
        "birthQos": "0",
        "birthPayload": "",
        "birthMsg": {},
        "closeTopic": "",
        "closeQos": "0",
        "closePayload": "",
        "closeMsg": {},
        "willTopic": "",
        "willQos": "0",
        "willPayload": "",
        "willMsg": {},
        "sessionExpiry": ""
    }
]

I have a suspicion from another issue I am looking into that you might have the odd bad topic going to the broker

See this issue for further details: MQTT nodes do not reconnect if the broker initiates a disconnect · Issue #3367 · node-red/node-red · GitHub

It would be very helpful to let me know if this is what you are observing when the MQTT issue re-occurs.

Hi Steve-Mcl,

Thank you for your post. I am now investigating the current flows to see if any of them have the possibility of creating that issue. First thoughts are that I don't think this is what is happening, but I definitely want to rule it out first.

Most our our flows take payloads and topics from the local MQTT broker and then pass it straight through to AWS. If this issue is occurring, it would be an issue with AWS not accepting the topic used in the local broker which I know is possible. We ran into this issue when we made a topic that was too deep for AWS but not for Mosquitto.

Again, I appreciate the information. It is something I haven't thought of as a possible issue. More to come.

Hi All,

I think I have found more around why we are seeing this issue. We turned on the logs in AWS IoT Core and now have some information on why it gets stuck in a loop of connect/disconnect.

It seems to me like Node-RED loses connection (for whatever reason) and when it tries to reconnect, it is successful, however, it doesn't know it is connected before trying again which causes an error in AWS by having two instances of a connection with the same client ID. This then restarts the loop of disconnect/reconnect.

Feel free to jump in if I am wrong on this theory. Any help is appreciated!

1 Like

Perhaps you are reusing the same credentials for two different AWS things (not intentionally)? One thing is for sure, this msg error "DUPLICATE_CLIENTID" is generated by AWS, not by Node-RED itself. My guess is that your device loses the connection and when it reconnects the AWS IoT core sees the authentication credentials still in use. Most likely the disconnection cause is exactly duplicated credentials.

PS: As a second thought it it very unlikely that the issue is related to duplicated credentials. I guess the device would not even authenticate to AWS if you were using duplicated credentials. On the other hand duplicated MQTT Client ID is also unlikely. You are probably using the random Client ID generated by the MQTT node. I don´t believe that duplicated Client ID are generated.

A good question to ask is about the ip addresses you see in the AWS log.

Are you seen any duplicated ip address for the same MQTT Client ID ?

Andrei,

Thank you for the information you provided. I am now looking into the potential issues now. First thing I saw is that there is two different pairs of "clientId" and "sourceIp" in the logs.

I need to verify it isn't because another instance of Node-RED is experiencing the same issue with a different set of AWS credentials. We do have a couple AWS "Things" created and used in Node-RED.

I will report back again soon. Thanks!

One theory we have that we are now testing on one of our Node-RED instances is that something happens on the network to cause the connection to be broken. Node-RED then attempts to reconnect before the 60 second keep alive has expired on the AWS side. AWS then thinks it is another connection and errors out.

To test our theory I have changed the keep alive to 30 seconds and the Node-RED reconnect time to 45 seconds

    /** Retry time in milliseconds for MQTT connections */
    mqttReconnectTime: 45000,

We are also testing on a different instance if turning off the "use clean session" has any affect on the connection.

The issue we are seeing seems to pop up across all of our Node-RED instances that are connected to AWS IoT Core. Some of them are able to resolve the issue by what appears to be an automatic restart of the program (eventually).

Sample section of logs on one machine below...

2 Feb 08:43:49 - [info] [mqtt-broker:AWS IoT Core (Prod) - MTYAP006] Disconnected from broker: mqtts://a2syg6wih4fgxt-ats.iot.us-east-1.amazonaws.com:443
2 Feb 08:43:49 - [info] [mqtt-broker:AWS IoT Core (Prod) - MTYAP006] Connected to broker: mqtts://a2syg6wih4fgxt-ats.iot.us-east-1.amazonaws.com:443
2 Feb 08:43:49 - [info] [mqtt-broker:AWS IoT Core (Prod) - MTYAP006] Connected to broker: mqtts://a2syg6wih4fgxt-ats.iot.us-east-1.amazonaws.com:443
2 Feb 08:43:49 - [info] [mqtt-broker:AWS IoT Core (Prod) - MTYAP006] Connected to broker: mqtts://a2syg6wih4fgxt-ats.iot.us-east-1.amazonaws.com:443
2 Feb 08:43:49 - [info] [mqtt-broker:AWS IoT Core (Prod) - MTYAP006] Connected to broker: mqtts://a2syg6wih4fgxt-ats.iot.us-east-1.amazonaws.com:443
2 Feb 08:43:49 - [info] [mqtt-broker:AWS IoT Core (Prod) - MTYAP006] Connected to broker: mqtts://a2syg6wih4fgxt-ats.iot.us-east-1.amazonaws.com:443
2 Feb 08:43:49 - [info] [mqtt-broker:AWS IoT Core (Prod) - MTYAP006] Connected to broker: mqtts://a2syg6wih4fgxt-ats.iot.us-east-1.amazonaws.com:443
2 Feb 08:43:49 - [info] [mqtt-broker:AWS IoT Core (Prod) - MTYAP006] Connected to broker: mqtts://a2syg6wih4fgxt-ats.iot.us-east-1.amazonaws.com:443
2 Feb 08:43:49 - [info] [mqtt-broker:AWS IoT Core (Prod) - MTYAP006] Disconnected from broker: mqtts://a2syg6wih4fgxt-ats.iot.us-east-1.amazonaws.com:443
2 Feb 08:43:49 - [info] [mqtt-broker:AWS IoT Core (Prod) - MTYAP006] Disconnected from broker: mqtts://a2syg6wih4fgxt-ats.iot.us-east-1.amazonaws.com:443
2 Feb 08:43:55 - [info] [mqtt-broker:AWS IoT Core (Prod) - MTYAP006] Disconnected from broker: mqtts://a2syg6wih4fgxt-ats.iot.us-east-1.amazonaws.com:443
2 Feb 08:43:56 - [info] [mqtt-broker:AWS IoT Core (Prod) - MTYAP006] Disconnected from broker: mqtts://a2syg6wih4fgxt-ats.iot.us-east-1.amazonaws.com:443
2 Feb 08:44:13 - [info] [mqtt-broker:AWS IoT Core (Prod) - MTYAP006] Connected to broker: mqtts://a2syg6wih4fgxt-ats.iot.us-east-1.amazonaws.com:443
2 Feb 08:44:13 - [info] [mqtt-broker:AWS IoT Core (Prod) - MTYAP006] Disconnected from broker: mqtts://a2syg6wih4fgxt-ats.iot.us-east-1.amazonaws.com:443
2 Feb 08:44:13 - [info] [mqtt-broker:AWS IoT Core (Prod) - MTYAP006] Connected to broker: mqtts://a2syg6wih4fgxt-ats.iot.us-east-1.amazonaws.com:443
2 Feb 08:44:13 - [info] [mqtt-broker:AWS IoT Core (Prod) - MTYAP006] Connected to broker: mqtts://a2syg6wih4fgxt-ats.iot.us-east-1.amazonaws.com:443
2 Feb 08:44:13 - [info] [mqtt-broker:AWS IoT Core (Prod) - MTYAP006] Connected to broker: mqtts://a2syg6wih4fgxt-ats.iot.us-east-1.amazonaws.com:443
2 Feb 08:44:13 - [info] [mqtt-broker:AWS IoT Core (Prod) - MTYAP006] Connected to broker: mqtts://a2syg6wih4fgxt-ats.iot.us-east-1.amazonaws.com:443
2 Feb 08:44:13 - [info] [mqtt-broker:AWS IoT Core (Prod) - MTYAP006] Disconnected from broker: mqtts://a2syg6wih4fgxt-ats.iot.us-east-1.amazonaws.com:443
2 Feb 08:44:13 - [info] [mqtt-broker:AWS IoT Core (Prod) - MTYAP006] Connected to broker: mqtts://a2syg6wih4fgxt-ats.iot.us-east-1.amazonaws.com:443
2 Feb 08:44:13 - [info] [mqtt-broker:AWS IoT Core (Prod) - MTYAP006] Connected to broker: mqtts://a2syg6wih4fgxt-ats.iot.us-east-1.amazonaws.com:443
2 Feb 08:44:13 - [info] [mqtt-broker:AWS IoT Core (Prod) - MTYAP006] Connected to broker: mqtts://a2syg6wih4fgxt-ats.iot.us-east-1.amazonaws.com:443
2 Feb 08:44:13 - [info] [mqtt-broker:AWS IoT Core (Prod) - MTYAP006] Connected to broker: mqtts://a2syg6wih4fgxt-ats.iot.us-east-1.amazonaws.com:443
2 Feb 08:44:13 - [info] [mqtt-broker:AWS IoT Core (Prod) - MTYAP006] Connected to broker: mqtts://a2syg6wih4fgxt-ats.iot.us-east-1.amazonaws.com:443
2 Feb 08:44:13 - [info] [mqtt-broker:AWS IoT Core (Prod) - MTYAP006] Connected to broker: mqtts://a2syg6wih4fgxt-ats.iot.us-east-1.amazonaws.com:443
2 Feb 08:44:13 - [info] [mqtt-broker:AWS IoT Core (Prod) - MTYAP006] Connected to broker: mqtts://a2syg6wih4fgxt-ats.iot.us-east-1.amazonaws.com:443
2 Feb 08:44:13 - [info] [mqtt-broker:AWS IoT Core (Prod) - MTYAP006] Connected to broker: mqtts://a2syg6wih4fgxt-ats.iot.us-east-1.amazonaws.com:443
2 Feb 08:44:13 - [info] [mqtt-broker:AWS IoT Core (Prod) - MTYAP006] Connected to broker: mqtts://a2syg6wih4fgxt-ats.iot.us-east-1.amazonaws.com:443
2 Feb 08:44:13 - [info] [mqtt-broker:AWS IoT Core (Prod) - MTYAP006] Connected to broker: mqtts://a2syg6wih4fgxt-ats.iot.us-east-1.amazonaws.com:443
2 Feb 08:44:13 - [info] [mqtt-broker:AWS IoT Core (Prod) - MTYAP006] Connected to broker: mqtts://a2syg6wih4fgxt-ats.iot.us-east-1.amazonaws.com:443
2 Feb 08:44:13 - [info] [mqtt-broker:AWS IoT Core (Prod) - MTYAP006] Connected to broker: mqtts://a2syg6wih4fgxt-ats.iot.us-east-1.amazonaws.com:443
2 Feb 08:44:13 - [info] [mqtt-broker:AWS IoT Core (Prod) - MTYAP006] Connected to broker: mqtts://a2syg6wih4fgxt-ats.iot.us-east-1.amazonaws.com:443
2 Feb 08:44:13 - [info] [mqtt-broker:AWS IoT Core (Prod) - MTYAP006] Connected to broker: mqtts://a2syg6wih4fgxt-ats.iot.us-east-1.amazonaws.com:443
2 Feb 08:44:13 - [info] [mqtt-broker:AWS IoT Core (Prod) - MTYAP006] Connected to broker: mqtts://a2syg6wih4fgxt-ats.iot.us-east-1.amazonaws.com:443
2 Feb 08:44:13 - [info] [mqtt-broker:AWS IoT Core (Prod) - MTYAP006] Connected to broker: mqtts://a2syg6wih4fgxt-ats.iot.us-east-1.amazonaws.com:443
2 Feb 08:44:13 - [info] [mqtt-broker:AWS IoT Core (Prod) - MTYAP006] Disconnected from broker: mqtts://a2syg6wih4fgxt-ats.iot.us-east-1.amazonaws.com:443
2 Feb 08:44:13 - [info] [mqtt-broker:AWS IoT Core (Prod) - MTYAP006] Disconnected from broker: mqtts://a2syg6wih4fgxt-ats.iot.us-east-1.amazonaws.com:443
Terminate batch job (Y/N)? 
> node-red-master@2.0.3 start
> node node_modules/node-red/red.js --userDir ./data

2 Feb 08:44:27 - [info] 

Welcome to Node-RED
===================

2 Feb 08:44:27 - [info] Node-RED version: v2.1.3
2 Feb 08:44:27 - [info] Node.js  version: v17.0.1
2 Feb 08:44:27 - [info] Windows_NT 10.0.14393 x64 LE
2 Feb 08:44:28 - [info] Loading palette nodes
2 Feb 08:44:30 - [info] Dashboard version 3.1.1 started at /ui
2 Feb 08:44:30 - [info] Settings file  : C:\NodeRED\data\settings.js
2 Feb 08:44:30 - [info] Context store  : 'default' [module=memory]
2 Feb 08:44:30 - [info] User directory : C:\NodeRED\data
2 Feb 08:44:30 - [info] Projects directory: C:\NodeRED\data\projects

It seems to make sense.

I tried to do some testing with AWS IoT core and Node-RED today. I was surprised to see that the MQTT node was disconnecting for each and every message sent to the cloud. First thing I tested was changing the qos (just because it was the easiest thing to do) from 2 to 1 and apparently the issue stopped. I was not expecting the issue and even less that changing qos would make any difference. I will play a little more with this setup. It is just a pity that I can not activate logs on AWS side to see the disconnection reason.

PS: AWS documentation states that qos 2 is not valid. I was not aware.

AWS IoT and the AWS IoT Device SDKs support the MQTT Quality of Service (QoS) levels 0 and 1 . The MQTT protocol defines a third level of QoS, level 2 , but AWS IoT does not support it.

Yeah, I also learned the hard way the QOS 2 is not an option. I also learned that AWS has other limitations that are tighter than a broker like Mosquitto has.

As far as logging, I was not the one to set it up, but we are able to see logs through a CloudWatch service.

have you thought about trying an alternative cloud MQTT broker service such as Hive?

Even if you don't want to go down that route for reasons, doing a test with another one would at least sort out whether its an AWS issue or not

I have not cymplecy. With the very restrictive network IT department we have in my company, it would be very difficult to use any other cloud broker on our VM's because of permissions and the firewalls in place. The company has partnered up with AWS for cloud services.

We do however have a Mosquitto broker setup in the local network environment that doesn't have any connection issues that I know of.

If needed, I could setup a test environment in my home lab to try another cloud service to see if we experience similar results.

Just another thought then - can you setup a simple Linux instance with AWS and install mosquitto on it. It would be a lot of hassle to make it secure but might be worth a quick test just to prove that everything else works fine

Thank you for the suggestion. I don't know if that is something I can do, but I will inquire with the IT team if that is something I can setup for testing.

@Nirodam , while the root cause for the disconnection is not identified I wonder if you could benefit from some kind of workaround. I remind you that since Node-RED 2.1 there is a dynamic control that may control the connection / disconnection from brokers. I have not used this new feature yet but I can imagine a status node monitoring the MQTT node. When a disconnection happens the status node will send a message to the MQTT node to force a reconnection. It is just a thought (not tested).

@Andrei , thank you for the suggestion!

This is actually something we tried before our latest posts. I set a 5 second delay, then sent a disconnect command, waited 5 more seconds, and then sent a connect command. With that setup, it didn't work as I would have expected. This might have been because too much traffic during the keep alive duration.

Knowing what we know now, there is probably a better method / timing of injecting into the MQTT nodes to handle disconnects. However, for the time being, I have deactivated those nodes while we test the other changes we have made.

I am not saying this is the root cause of issues you are facing.

I just find unusual that you are running Node.JS version 17.

Node-RED does not support this version.

Click here for supported Node versions.https://nodered.org/docs/faq/node-versions

Also the underlying MQTT library does not support Node V17.

@Andrei that is interesting. I initially installed the latest Node-RED version after installing the latest Node.JS from the website which still shows as version 16 under the LTS tab.

I did however run the update commands for both Node-RED and NodeJS recently. I guess at that point it must have jumped to version 17. I will see if I can switch it back over to the latest version 16.

@Nirodam Did you try out what I suggested?

@Steve-Mcl Yes, I took your flow and imported it in. I did some initial testing with bad topics and confirmed that it would cause the broker to disconnect. I could then reconnect as long as the topics didn't automatically subscribe or publish right away.

However, upon inspection, I found that none of the subscription or publishing topics were causing this issue. Upon checking the AWS logs, the reason for disconnect appeared to be because the client was trying to connect again a 2nd time with the same client ID from the perspective of the broker causing the connection to be ended.