FlowFuse Team Broker: MQTT QOS 1 or 2 not working unless client disconnects cleanly?

Hi, we are using FlowFuse enterprise 2.12 self-hosted on Docker to manage a small collection of Node-RED instances. We are now exploring Team Broker as we plan to expand our use to build some large event-driven integrations. One of the requirements here is that a subscribing system can go offline temporarily and then receive events that occurred while it was down on being brought up again. Unfortunately my attempts to simulate that are giving surprising results.

As a learning exercise, I created two very simple flows in 2 separate Node-RED instances, one Publisher that sends a message to a topic when the button on an Inject node is pressed, and a subscriber, using a different client ID, which prints received messages in the debug window. The subscriber flow is configured for dynamic subscription. The subscribing flow looks like this:

Both clients are configured as follows:

  • MQTTv5
  • QOS: 2
  • Keep alive: 5s
  • Use clean start: false
  • Automatically unsubscribe when disconnecting: false
  • Session expiry: 7200s

My test scenario is as follows:

  1. I send a couple of messages from the publisher. As expected they are not received by the subscriber, because it hasn't connected or subscribed yet.
  2. I click the "connect" button on the subscriber, followed by the subscribe button, and repeat step 1. As expected, the subscriber receives the new messages.
  3. I click the disconnect button on the subscriber, and send more messages, which of course the subscriber doesn't receive because it is disconnected. I then click the connect button again, and the subscriber receives the messages that were sent while it was disconnected, showing that QOS2 is working.
  4. I repeat steps 1-3, but instead of disconnecting the subscriber, I suspend the subscribing Node-RED instance in FlowFuse, to simulate an infrastructure failure. I expect the broker to detect that the subscriber has disconnected because Keep alive is only 5 seconds, and to store the messages because the QOS is 2. However, when I restart the subscribing Node-RED instance and reconnect the subscriber, any messages that were sent while the subscriber was down are NOT received.
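For anyone following along, the behaviour expected in steps 3 and 4 can be sketched as a toy Python model of a broker-side persistent session. This is only an illustration under stated assumptions, not FlowFuse Team Broker code: `ToyBroker`, `Session` and `drop` are made-up names, topic matching is exact-string only, and session expiry is not modelled.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    """Broker-side state kept for one client ID (toy model, hypothetical)."""
    subscriptions: set = field(default_factory=set)
    queued: list = field(default_factory=list)
    connected: bool = False


class ToyBroker:
    def __init__(self):
        self.sessions = {}

    def connect(self, client_id, clean_start=False):
        # clean_start=False resumes an existing session if one is still held.
        if clean_start or client_id not in self.sessions:
            self.sessions[client_id] = Session()
        session = self.sessions[client_id]
        session.connected = True
        # Anything queued while offline is handed over as the session resumes.
        delivered, session.queued = session.queued, []
        return delivered

    def subscribe(self, client_id, topic):
        self.sessions[client_id].subscriptions.add(topic)

    def drop(self, client_id):
        # Stands in for both a clean disconnect and a keep-alive timeout:
        # in this model the session and its subscriptions survive either way.
        self.sessions[client_id].connected = False

    def publish(self, topic, payload, qos=1):
        live = []
        for client_id, session in self.sessions.items():
            if topic not in session.subscriptions:
                continue
            if session.connected:
                live.append((client_id, payload))
            elif qos > 0:
                # Only QoS 1 and 2 messages are stored for offline sessions.
                session.queued.append(payload)
        return live


broker = ToyBroker()
broker.publish("events/test", "m1")          # step 1: no session yet, nothing stored
broker.connect("subscriber")
broker.subscribe("subscriber", "events/test")
broker.drop("subscriber")                    # suspend, kill -9 or disconnect
broker.publish("events/test", "m2", qos=2)
print(broker.connect("subscriber"))          # ['m2']
```

In this model the way the client goes away makes no difference to what is queued, which is the behaviour the test in step 4 was expecting to see.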

I have tried this procedure with MQTT 3.1, and with QOS1, and many tweaks to other parameters, with the same results. This seems to suggest that the MQTT broker stores messages for disconnected subscribers if the subscriber politely disconnects, but not if the subscriber just disappears, which is what would happen if there was a network failure or someone killed the Node-RED instance. I'm sure I'm doing something wrong here, as MQTT is designed to cope with just those scenarios, but I'm stumped. Can anyone suggest what I need to do to get this working? Is this something to do with the way I'm simulating a failure?

Hi @stevemot

I don't have an immediate answer here, but wanted to acknowledge we'd seen your question. I've asked someone on the team to take a look.

Nick

Thanks Nick, much appreciated!

I've just done a really quick test with the mosquitto command line tools and things appear to be working as I expect from the spec. (including using kill -9 to ensure a drop rather than a disconnect)

I will do some more testing later when I get chance to set up a Node-RED instance with multiple clients.

Thanks @hardillb, I was wondering if it was something specific to the Node-RED MQTT node, or to the behaviour of a Node-RED instance starting up. In the architecture we are planning all of the subscribers and publishers would be NR instances, so I want to be able to demonstrate that we can have guaranteed delivery of data in that design.

I think what you are seeing is down to how Node-RED dynamic MQTT connections and subscriptions work.

Once the Node-RED instance has disconnected from the broker following the initial connection, the persistent session is already set up on the broker, so it knows which topics to queue messages for.

When the Connect inject node is fired, the client will reconnect to the broker and the broker will send the queued messages for the client's persistent session immediately, without waiting for any new subscriptions to be sent.

But the mqtt-broker config node (which can hold the shared mqtt client for multiple different mqtt-in/mqtt-out nodes) doesn't know which nodes to forward the messages to because there hasn't been a subscription requested yet (until the Subscribe inject is fired).

If you do this with a statically configured MQTT node (no inject nodes) then things work as expected because it knows about the topics at connection time.
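The ordering problem described above can be sketched with a small Python model. It is a rough illustration only, not Node-RED's actual implementation: `ToyConfigNode`, `replay` and the topic name are hypothetical, and the assumption that undeliverable messages are simply dropped is a simplification.

```python
class ToyConfigNode:
    """Toy stand-in for a shared client that fans messages out to handlers."""

    def __init__(self):
        self.handlers = {}   # topic -> list of callbacks
        self.dropped = []    # messages that arrived with nowhere to go

    def subscribe(self, topic, callback):
        self.handlers.setdefault(topic, []).append(callback)

    def on_message(self, topic, payload):
        callbacks = self.handlers.get(topic)
        if not callbacks:
            # No node has asked for this topic yet (assumed to be dropped).
            self.dropped.append((topic, payload))
            return
        for callback in callbacks:
            callback(payload)


def replay(node, queued):
    """Simulate the broker flushing a resumed session right after connect."""
    for topic, payload in queued:
        node.on_message(topic, payload)


queued = [("events/test", "m2"), ("events/test", "m3")]

# Dynamic case: connect first, the Subscribe inject fires afterwards.
dynamic_received = []
dynamic = ToyConfigNode()
replay(dynamic, queued)
dynamic.subscribe("events/test", dynamic_received.append)

# Static case: the topic is known before the connection is made.
static_received = []
static = ToyConfigNode()
static.subscribe("events/test", static_received.append)
replay(static, queued)

print(dynamic_received, static_received)  # [] ['m2', 'm3']
```

The only difference between the two cases is whether the handler is registered before or after the queued messages arrive.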

We might need to think about whether we can improve this behaviour, as it is not intuitive, but with dynamic subscriptions there is no guarantee that the client will ever be told to resubscribe to the existing topics in the session, so caching the messages for delivery would get complicated.

Hi @hardillb, thanks for investigating. I have just tried with a static MQTT node, and unfortunately I still have the same problem - messages sent while the subscribing flow is down (after suspending the instance) are not received when I start the instance again, but messages sent after the node has restarted are received.

I'm wondering (based on only a limited understanding), whether the subscribe message sent by the MQTT node when it starts is treated by the broker as being a new subscription, rather than the re-connection of an existing subscription?

Just a side note, please ensure you have the debug nodes set to write to the console log as well, to ensure you are not missing messages because the editor is not connected to the backend.

The MQTT node will connect and receive messages before the editor websocket connection comes back up when you restart a suspended instance. You need to use the Node-RED logs, not the debug sidebar, to validate this.
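As a rough illustration of why the sidebar can mislead here, the following toy Python model treats the debug sidebar as something that only sees messages while an editor is attached, while the console log records everything. `ToyDebugNode` and its fields are hypothetical names, not Node-RED internals.

```python
class ToyDebugNode:
    """Toy model of a debug node; not Node-RED's actual implementation."""

    def __init__(self, to_console=False):
        self.to_console = to_console
        self.editor_connected = False  # is an editor websocket attached?
        self.sidebar = []              # what the debug sidebar would show
        self.console_log = []          # what the runtime log would contain

    def receive(self, payload):
        if self.to_console:
            self.console_log.append(payload)
        if self.editor_connected:
            self.sidebar.append(payload)


node = ToyDebugNode(to_console=True)
node.receive("queued-1")       # replayed by the broker right after restart
node.editor_connected = True   # editor websocket reconnects a moment later
node.receive("live-1")

print(node.sidebar)      # ['live-1']
print(node.console_log)  # ['queued-1', 'live-1']
```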

No, the broker will not double up subscriptions after the reconnect.

I'm embarrassed to say that yes, that was the problem, at least for the static nodes - I can see the messages in the Node-RED logs after updating the debug node attached to the static MQTT in node to output there.

Interestingly, the problem still exists when using a dynamic subscription - nothing appears in the system logs or the debug pane when I reconnect. But perhaps that's expected behaviour for dynamically configured nodes?

From my perspective that's solved the problem, as I have no need to use dynamic connections in our planned work and was just playing with them to try and understand what was happening. Thanks very much for your time on this and sorry for the distraction.