MQTT Nodes Connection Behaviour

Dear Forum

I wanted to post my observations of the MQTT nodes connection behaviour.

  1. There is an approximately 20 minute delay from when internet access is lost (SIM card removed) to when the MQTT nodes show disconnected.
  2. There is no data sent via MQTT out nodes for approximately 20 minutes if internet access is lost (SIM card removed) even if internet access is restored (SIM card replaced) almost immediately.
    NB: The modem is being restarted immediately after the SIM card is replaced, but this is with a new IP address as is the nature of mobile/cellular connections.
  3. MQTT out nodes backfill is only for this approximately 20 minute period from when internet access is lost (SIM Card Removed) to when the MQTT nodes show disconnected. If the internet access outage is longer than 20 minutes then data sent after this time is lost.
    NB: I have implemented a contributed Node-RED node (node-red-contrib-msg-queue), that queues undeliverable messages to file for later delivery to avoid this data being lost.

This has raised the following questions for me.

  1. Are others observing the same connection behaviour from their MQTT nodes?
  2. When internet access is lost and then restored, is it possible to force the MQTT nodes to disconnect and reconnect? If so, can someone give me some guidance on doing this?

Configuration details...
Hardware: MTX-GTW-II
OS: Linux, ARM, 4.1.15-1.2.0+gda74a32
Node-RED 1.2.9
Node.js 10.24.0
A SIM card is used for internet access from the device.

Test Case details...
14/05/2021
Test 1: Internet disconnection for ~31 minutes.
This test uses keep alive time: 120s, clean session and auto generated client ID.
10:43:30 SIM Card Removed & MQTT Shows Connected
11:01:45 MQTT Shows Disconnected
11:02:00 MQTT Shows Connecting
11:12:00 SIM Card Replaced
11:14:51 Connection Restored
11:15:40 MQTT Shows Connected

Test 2: Internet disconnection for ~28 minutes.
This test uses keep alive time: 30s, no clean session and client ID:NR_Hanmer.
13:32:15 SIM Card Removed & MQTT Shows Connected
13:52:23 MQTT Shows Disconnected
13:52:37 MQTT Shows Connecting
13:58:34 SIM Card Replaced
14:00:10 Connection Restored
14:01:26 MQTT Shows Connected

Test 3: Internet disconnection for ~7 minutes.
This test uses keep alive time: 30s, no clean session and client ID:NR_Hanmer.
14:09:33 SIM Card Removed & MQTT Shows Connected
14:10:00 SIM Card Replaced
14:16:13 Connection Restored
14:27:59 MQTT Shows Disconnected
14:28:10 MQTT Shows Connecting
14:29:17 MQTT Shows Connected

Any assistance users can provide in answering my questions is appreciated.

Regards
Siothrun

Welcome to the forum @Siothrun

There have been some significant updates to the mqtt nodes since the old version of node red you are using. If possible it would be a good idea to upgrade node red and see if that improves it.

No, it should only take around 15 secs (default settings.js setting is 15000)

    // Retry time in milliseconds for MQTT connections
    mqttReconnectTime: 15000,

Not yet. There is an open PR for a new node called mqtt-control that permits this.
It is still to be reviewed and accepted but I am hopeful it will see its way into the next major release.


In this period of outage, have you tried ...

  • forcing a publish (use an inject node)
  • connecting via a different client (e.g. MQTT Explorer) to verify the broker is actually reachable?

I think that is a different issue to the one here where it is apparently taking 20 minutes for the MQTT node to report that the network is disconnected. I believe that is set by the Keep Alive setting in the MQTT connection node.
@Siothrun what have you got Keep Alive set to in the node?

Thanks for the replies @Colin & @Steve-Mcl!

Thanks for this feedback I'll certainly look to take this action.

Thanks for the confirmation that this control is not yet possible and also the hope that it will soon be available via an 'mqtt-control' node.

In my testing a publish was being triggered every 10 seconds.
I'm not sure I understand the next test you mention trying. I have been connecting to a Mosquitto broker, but I'd assumed the MQTT client connection behaviour I am seeing would be broker independent. I will though have a separate client communicate with the same broker to rule out potential broker issues with reconnecting.

I've tested the connection behaviour with keep alive times of 120s and 30s. This setting had no impact on the approximately 20 minute delay.
As I understand it the keep alive time is registered with the broker by the client when it establishes the connection. When the broker has not heard from the client for this period (plus some extra time) it will then issue the Last Will and Testament (LWT) message. This worked as expected when I tested it, except for the fact that there is an undefined amount of extra time that the broker waits.
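On the extra time: the MQTT 3.1.1 specification sets the server-side limit at one and a half times the Keep Alive period, after which the server must treat the client as disconnected. A minimal sketch of that arithmetic follows. The function name is hypothetical, and a given broker may only run this check on its own internal timer, so the observed delay can be somewhat longer.

```javascript
// Sketch only: the latest point at which a broker following MQTT 3.1.1
// should drop a silent client and publish its Last Will message.
// brokerDeadlineSeconds is a hypothetical helper, not part of any library.
function brokerDeadlineSeconds(keepAliveSeconds) {
  if (keepAliveSeconds === 0) {
    return Infinity; // a Keep Alive of 0 turns the mechanism off
  }
  return keepAliveSeconds * 1.5;
}

console.log(brokerDeadlineSeconds(120)); // keep alive used in Test 1
console.log(brokerDeadlineSeconds(30));  // keep alive used in Tests 2 and 3
```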

That is what happens at the broker, but it is the client (which is node-red) that initiates the regular keep alive ping. If the node red node does not get the ping reply then it should show disconnected so that mechanism is not working as it should. I think you need to upgrade node-red so that at least you can rule out that possibility.

Thanks for that explanation Colin. I didn't realise that there was a keep alive ping from the client node. I'd incorrectly assumed there was a requirement to send messages within this keep alive time to avoid the issue of the LWT message. I will take your advice and look to upgrade Node-RED and Node.js to see if this changes the connection behaviour I am seeing.

My further testing has raised the following question...
Should the Keep Alive time be less than the shortest time between messages?

The results of my further testing...

Upgraded
I upgraded as recommended...
Node-Red 1.3.5
Node.js 10.24.1

Unfortunately this didn't change the MQTT nodes connection behaviour.

Different Broker
I connected using a different broker...
Mosquitto using TLS on port 8883 (Existing Broker)
HiveMQ on port 1883 (Different Broker)

They both had a Keep Alive Time of 30s and didn't use a clean session. The message type and frequency being sent to each of these brokers was different. The internet connection was interrupted for a couple of minutes.
The Mosquitto broker connected MQTT node continued to show connected for approximately 20 minutes without sending data before finally reconnecting and resuming sending data.
The HiveMQ broker connected MQTT node almost immediately reconnected and resumed sending data.

This difference prompted me to look closer at the Mosquitto broker.

PINGREQ & PINGRESP
The Mosquitto broker logs showed "Received PINGREQ" from a different client and the broker "Sending PINGRESP". The broker logs showed no "Received PINGREQ" from my client!

The Keep Alive timer, measured in seconds, defines the maximum time interval between
messages received from a client. It enables the server to detect that the network
connection to a client has dropped, without having to wait for the long TCP/IP timeout.
The client has a responsibility to send a message within each Keep Alive time period. In
the absence of a data-related message during the time period, the client sends a
PINGREQ message, which the server acknowledges with a PINGRESP message.

My client is sending a message (watchdog pulse, QoS 2) every 10 seconds and I am also set up to receive the message being sent so I can display it in a debug node. As a result I do not send a PINGREQ message. So I was curious if the behaviour would change if I selected a Keep Alive time that would force my client to send a PINGREQ. Selecting a Keep Alive time of 9 seconds and monitoring the broker logs showed "Received PINGREQ" from my client. Interrupting the internet connection for a couple of minutes has now delivered mixed results. A number of times it has disconnected quickly and then reconnected once the connection was re-established. However, I've also experienced the ~20 minute disconnection delay again, but I'm wondering if this is now the exception and not the rule.
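The effect described above can be illustrated with a simplified model, assuming the client restarts its keep-alive timer every time it writes a packet. This is a sketch of the behaviour observed in the broker logs, not the actual logic of the MQTT client library, and `pingRequestTimes` is a hypothetical name.

```javascript
// Simplified model (assumption): any outbound packet restarts the
// keep-alive timer, and a PINGREQ is sent only after a full idle period.
// Times are whole seconds from the start of the connection.
function pingRequestTimes(publishIntervalS, keepAliveS, durationS) {
  const pings = [];
  let lastWrite = 0; // time of the most recent outbound packet
  for (let t = 1; t <= durationS; t++) {
    if (t % publishIntervalS === 0) {
      lastWrite = t; // a PUBLISH counts as activity
    } else if (t - lastWrite >= keepAliveS) {
      pings.push(t); // idle for a full keep-alive period: send PINGREQ
      lastWrite = t;
    }
  }
  return pings;
}

// 10 s watchdog pulse with a 30 s keep alive: no PINGREQ is ever due.
console.log(pingRequestTimes(10, 30, 120));
// Same pulse with a 9 s keep alive: a PINGREQ is due before each pulse.
console.log(pingRequestTimes(10, 9, 30));
```

Under this model a client that publishes more often than its Keep Alive period never sends a PINGREQ, so it never waits for a PINGRESP that could time out.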

Disconnection Delay
I suspect now that the ~20 minute disconnection delay is due to the TCP/IP timeout, which is dictated by the following file on Linux.

/proc/sys/net/ipv4/tcp_retries2

The default value is 15, which corresponds to a duration of between approximately 13 and 30 minutes.
A hypothetical timeout of 924.6 seconds is a lower bound for the effective timeout.
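The 924.6 second figure can be reproduced with a short calculation, assuming the retransmission timeout starts at 200 ms, doubles after every unanswered attempt and is capped at 120 seconds (the usual Linux minimum and maximum). The function name and its defaults are assumptions made for illustration. Real connections start from a timeout based on the measured round-trip time, which is why this is only a lower bound.

```javascript
// Sketch only: total time TCP keeps retrying before giving up, in ms.
// Assumes an initial RTO of 200 ms, doubling each time, capped at 120 s.
function tcpRetries2TimeoutMs(retries, rtoMinMs = 200, rtoMaxMs = 120000) {
  let totalMs = 0;
  let rtoMs = rtoMinMs;
  // One wait after the original transmission plus one per retransmission.
  for (let attempt = 0; attempt <= retries; attempt++) {
    totalMs += rtoMs;
    rtoMs = Math.min(rtoMs * 2, rtoMaxMs);
  }
  return totalMs;
}

const ms = tcpRetries2TimeoutMs(15);
console.log(ms / 1000, "seconds");
console.log((ms / 60000).toFixed(1), "minutes");
```

A longer starting timeout, as would be expected on a mobile/cellular link, pushes the total up towards the 20 minutes seen in the tests.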

As I understand it - I thought that any received message (at the broker) would act the same as a ping message and indicate that the client was still alive - and likewise at the client end there was no need to send a keepalive message until the set time after the last transmitted real message. I.E. you only needed to send keepalives if you weren't sending data for a period of time greater than the keepalive timeout.

At least that is how I thought it was supposed to work...

Based on my testing I think this is how it does work. The problem is when the internet connection is interrupted AND the MQTT node continues receiving messages more frequently than the Keep Alive time to transmit. It therefore has no need to send the PINGREQ. If it did send the PINGREQ it would time out waiting for the PINGRESP and disconnect.

What I can't reconcile now though is that I get this behaviour from the Mosquitto broker connection, but not from the HiveMQ broker connection.

How can it continue receiving messages if the connection is interrupted?

You only quoted half of my sentence! :slight_smile:
The problem is when the internet connection is interrupted AND the MQTT node continues receiving messages more frequently than the Keep Alive time to transmit.

I may not be using the best terminology here and I'm happy to be corrected, but worded differently...
The MQTT out node is not sending messages, but it is receiving messages from the flow to transmit to the broker.
The MQTT in node is not receiving messages.

@colin here is a case: you have a broker on the same device as NR. NR sends/receives msgs to other devices on the network AND sends/receives msgs to other tabs/locations within the flow.

So you could lose network connection but the mqtt msgs between the tabs/locations in the flow would (should) still work (I think :thinking:)

I don't think that is the situation here. @Siothrun can you confirm that node red is running on one device and the broker is across the internet?

Messages passed to the mqtt node from the flow are of no consequence in whether ping requests will be sent. It is purely whether data is received from the broker.

What QoS have you specified in the MQTT nodes? If you are using zero does it make a difference if you change it to 1?

I can confirm that Node-RED and the MQTT broker are on different devices.

MTX-GTW II AUS is running Node-RED.
Windows VM is running the Mosquitto MQTT broker.

The MQTT in node is also receiving messages from the broker every 10 seconds (when the internet connection is active). So if this node stops receiving messages for longer than the Keep Alive time when the internet connection is interrupted it should send a PINGREQ, but then fail to receive a PINGRESP. When it fails to receive the PINGRESP within 1.5 x Keep Alive time it should disconnect? I’m certainly not seeing this behaviour.

MQTT in nodes have QoS 0.
MQTT out nodes have QoS 2.
I can experiment with using a different QoS, but ultimately my application requires these QoS settings.

Why does it require QoS 0?

Have you upgraded node red yet?

Node-RED in my application is being used in a control application to interface a cloud HMI to a PLC.

When reading values from the PLC to display on the HMI QoS 2 is desired. The values read are time stamped and the history of what is happening is important. This is why the MQTT backfill is relevant to me.

When writing values from the HMI to the PLC QoS 0 is required. If the internet connection is interrupted I don’t want a value/command being buffered and written to the PLC at a later date.

I suspect that using QoS 0 does not absolutely guarantee that the value will not be buffered in the mqtt driver and sent later. If it is essential that you do not use old data then I suggest adding a timestamp and checking at the receiving end that it is not stale.
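A minimal sketch of such a staleness check, assuming the sender adds its send time in milliseconds since the epoch and that the clocks at both ends are reasonably synchronised. The function name, the `sentAt` field and the 5 second limit are hypothetical.

```javascript
// Sketch only: decide whether a command is too old to act on.
// sentAtMs is the sender's timestamp, maxAgeMs the oldest age accepted.
function isStale(sentAtMs, maxAgeMs, nowMs = Date.now()) {
  return nowMs - sentAtMs > maxAgeMs;
}

// Example with fixed times so the result does not depend on the clock.
console.log(isStale(1000, 5000, 7000)); // 6 s old, limit 5 s
console.log(isStale(1000, 5000, 4000)); // 3 s old, limit 5 s
```

In a function node this could be used along the lines of `if (isStale(msg.payload.sentAt, 5000)) return null;` so that a stale command is dropped before it reaches the PLC.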

I suggest writing the receiving end so that it ignores repeated messages. That will make your life much simpler. Then, if what you are after is guaranteeing that you do not lose any messages you can use something like this flow to send the data. The flow can even buffer the data over a node-red restart if that is required.
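As a rough illustration of a receiving end that ignores repeated messages, the sketch below remembers recently seen message identifiers and reports repeats. It assumes the sender attaches a unique identifier to every message. The function name and the size limit are hypothetical, and this is not the flow referred to above.

```javascript
// Sketch only: remember the most recent message ids and flag repeats.
function createDeduplicator(maxRemembered = 1000) {
  const seen = new Set();
  return function isDuplicate(id) {
    if (seen.has(id)) {
      return true;
    }
    seen.add(id);
    if (seen.size > maxRemembered) {
      // Sets iterate in insertion order, so the first value is the oldest.
      seen.delete(seen.values().next().value);
    }
    return false;
  };
}

const isDuplicate = createDeduplicator(2);
console.log(isDuplicate("a")); // first time seen
console.log(isDuplicate("a")); // repeat
```

Keeping the memory bounded means a very old identifier can eventually be accepted again, so the limit should comfortably exceed the number of messages that could be resent after an outage.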