I'm interested in how the MQTT protocol can be utilised in a scenario where data delivery is critical, over a cellular or otherwise unstable WAN connection.
Imagine we have a broker sitting in the cloud and a remote IoT device (Node-RED) connecting to the broker over cellular, publishing timestamped sensor data at regular intervals. Even with the highest Quality of Service (QoS 2), in the event of a cellular outage, this data will still be lost.
Is it possible, either within the MQTT protocol or by using other mechanisms, to queue these publishing events such that in the event of a WAN failure and reconnection, the sensor data will be published again (albeit late), so that no data loss occurs?
Your device should set an LWT (Last Will and Testament). The broker will issue it when there is no update from the device within a certain period. This will inform subscribers that the device has gone offline.
For a message queue you have to implement something yourself that takes care of this while the mqtt node is disconnected from your broker.
A function node that pushes the data into an array in its own context, for as long as it doesn’t get “connected” from a status node supervising the mqtt out node, could do the job (if setting up a SQL server, as in the contrib node mentioned below, is a little bit overkill). Once it receives “connected” it could empty the array to the mqtt out node and then pass all incoming messages straight through.
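The status-gated buffer described above can be sketched in a few lines. This is only a model of the logic, written in Python so it can be run on its own; in Node-RED the same steps would live in a function node. The class name `GatedBuffer`, the `send` callback and the exact status text `"connected"` are assumptions made for illustration.

```python
class GatedBuffer:
    """Holds messages while offline and flushes them, oldest first, on reconnect."""

    def __init__(self, send):
        self.send = send        # callable that hands a message to the MQTT output
        self.connected = False  # start pessimistic until a status arrives
        self.pending = []       # messages waiting for a connection

    def on_status(self, text):
        # Status text as reported by whatever supervises the MQTT output.
        self.connected = (text == "connected")
        if self.connected:
            self.flush()

    def on_message(self, msg):
        if self.connected:
            self.send(msg)            # online: pass straight through
        else:
            self.pending.append(msg)  # offline: keep for later

    def flush(self):
        while self.pending:
            self.send(self.pending.pop(0))


sent = []
buf = GatedBuffer(sent.append)
buf.on_message({"ts": 1, "value": 20.5})  # offline, held back
buf.on_message({"ts": 2, "value": 20.7})  # offline, held back
buf.on_status("connected")                # flushes both in order
buf.on_message({"ts": 3, "value": 20.9})  # online, sent immediately
```

Note that this buffer lives in memory only, so anything still pending is lost on a restart or power cut.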
I believe that the mqtt out node will store a certain number of messages in the event of disconnection. I think you have to clear the Clean Session checkbox. Not sure how many it saves.
If it is critical for the application that no data is lost then you must put some logic in the client, probably something like writing the data to a local database and uploading it when there is a connection.
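As a rough sketch of the "write to a local database, upload when connected" idea, the following uses Python's built-in `sqlite3` module. The table name `outbox`, the helper names and the `publish` callback (assumed to return `True` only when the message was accepted) are all hypothetical; a real flow would swap in its own database and MQTT client.

```python
import sqlite3


def open_outbox(path=":memory:"):
    # Use a file path instead of ":memory:" so rows survive a power down.
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS outbox ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "topic TEXT NOT NULL, payload TEXT NOT NULL)"
    )
    db.commit()
    return db


def store(db, topic, payload):
    # Always write locally first, whether or not the link is up.
    db.execute("INSERT INTO outbox (topic, payload) VALUES (?, ?)", (topic, payload))
    db.commit()


def upload_pending(db, publish):
    """Publish rows oldest first; delete a row only after publish() returns True."""
    uploaded = 0
    rows = db.execute("SELECT id, topic, payload FROM outbox ORDER BY id").fetchall()
    for row_id, topic, payload in rows:
        if not publish(topic, payload):
            break  # link is down again, keep the remaining rows for later
        db.execute("DELETE FROM outbox WHERE id = ?", (row_id,))
        db.commit()
        uploaded += 1
    return uploaded


delivered = []


def fake_publish(topic, payload):
    # Stand-in for a real MQTT publish that succeeded.
    delivered.append((topic, payload))
    return True


db = open_outbox()
store(db, "sensors/temp", '{"ts": 1, "value": 20.5}')
store(db, "sensors/temp", '{"ts": 2, "value": 20.7}')
offline_result = upload_pending(db, lambda topic, payload: False)  # link down
online_result = upload_pending(db, fake_publish)                   # link back up
remaining = db.execute("SELECT COUNT(*) FROM outbox").fetchone()[0]
```

Deleting only after a confirmed publish means a message may be sent twice if the confirmation is lost, so timestamps in the payload help the receiver spot duplicates.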
QoS > 0 should take care of this, but it seems not to be a solution here. If every message is important then I would subscribe to the same topic that I publish to. Put a timestamp on every message and store it in an array. I will get back every message I sent to the broker and can then delete it from the queue. A database sounds a little bit complex; an array of objects should be enough. And use the config node to determine whether it is useless to send messages or time to flush the cache.
Without that feedback you can lose messages between their leaving your node and the detection that MQTT is disconnected.
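A minimal sketch of that echo-confirmed queue, assuming each message carries a unique timestamp in a `ts` field and that the client is subscribed to the topic it publishes to. The class name, the field name and the `fake_publish` stand-in for the broker are illustrative only.

```python
class EchoConfirmedQueue:
    """Keeps every message until the broker echoes it back on the same topic."""

    def __init__(self, publish):
        self.publish = publish
        self.unconfirmed = {}  # timestamp -> message, in insertion order

    def send(self, msg):
        # Queue first, then publish, so nothing is lost if the publish fails.
        self.unconfirmed[msg["ts"]] = msg
        self.publish(msg)

    def on_echo(self, msg):
        # Called for every message received on the subscribed topic.
        self.unconfirmed.pop(msg.get("ts"), None)

    def resend_all(self):
        # Call when the connection comes back: republish everything unconfirmed.
        for msg in list(self.unconfirmed.values()):
            self.publish(msg)


link = {"up": False}  # simulated connection state
wire = []             # timestamps that actually reached the simulated broker


def fake_publish(msg):
    # Stand-in for the broker: when the link is up, it echoes the message back.
    if link["up"]:
        wire.append(msg["ts"])
        q.on_echo(msg)


q = EchoConfirmedQueue(fake_publish)
q.send({"ts": 1, "value": 20.5})  # link down, stays queued
q.send({"ts": 2, "value": 20.7})  # link down, stays queued
pending_while_down = sorted(q.unconfirmed)
link["up"] = True
q.resend_all()                    # both are published and echoed back
```

The echo only proves the broker routed the message back to this client, so a subscriber at the other end could still miss it.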
But QoS should do the business, at least for the last message or for as many messages as its own queue can handle. Perhaps the broker is not working as expected.
Queuing of QoS 1 and 2 messages
All messages sent with QoS 1 and 2 are queued for offline clients until the client is available again. However, this queuing is only possible if the client has a persistent session.
That depends on whether you need them to be remembered and sent following a power down.
But otherwise, as you say, if the client supports persistent sessions and QoS 1 or above then there should be no need to do anything. As you point out, in this situation the MQTT protocol requires the client to buffer up messages until the connection recovers. If that is not happening then it is an issue with the client, not the broker. As I said, I believe that node-red does do the required buffering.
I think the problem here is Node-RED, as a remote client, losing the connection to the broker. Installing a broker on the client and mapping it to the remote broker might not solve the queue issue when every message counts.
Establishing a queue in node-red gives the advantage that you can control the queue (check the length or history) and perhaps establish an alternative connection via gsm/umts or signal an alarm.
Maybe, but I don’t think so from all I know. But perhaps I will take a look into the source or the MQTT spec. I personally would like to have some kind of control over the queue, if possible, when all the collected data is important. Normally the problem I face is detecting whether a client is offline and triggering an alarm. As I mentioned before, I have two temperature sensors which could be essential for my plants to survive, so I would feel uncomfortable trusting a black box in that case.
The homie node, for example, gives feedback on two loops for every command sent: first a short loop via the broker echoing the messages directly (flagged as “predicted”), and second a long loop from the client (clearing the flag). The first is implemented in MQTT itself (clients that are subscribed to a topic receive back the messages they send to that topic); the second is defined by the homie convention (clients receiving a parameter change have to acknowledge it by changing the parameter and sending back the set value). Any queue in the middle will not help with sending commands or setting parameters; it would rather be confusing if the queue replays many state changes as soon as the device reconnects after a while.
But I do see the need for a queue node for a data logger which is not always online, like a dosimeter or, better, a fitness tracker you carry around.
I had a quick look into the MQTT spec. If you do not start a clean session and instead re-establish an old, still available session, the broker will deliver all the QoS > 0 messages for topics you subscribed to with QoS > 0. I still don’t think the mqtt node should interfere with a queue of its own.
OK, that has made me realise what my confusion is. The spec is talking about what happens when a client subscriber is temporarily disconnected. It is not describing what should happen when a client publisher temporarily disconnects. You are right that in that case all bets are off, and it would be up to a higher level in the client to buffer the messages.
To be reliable, this means that the sender should cache first and then send so that you get the max chance of the cache being set before detecting whether the connection works. You also need some kind of feedback to detect a failed send. As the send happens over TCP (I think?), that could take an appreciable amount of time - TCP timeouts can be up to minutes though I think the MQTT timeout is generally a lot less.
I believe this means that you need to ensure that the MQTT send happens async, after the data has been cached, with feedback to remove it from the cache on acknowledgement. But now you are into full ESB/Message Queue territory; MQTT is meant to be lightweight and minimises acks, as I understand it.
You would also need to cater for other messages arriving after the first but before an ack could be processed. So even deleting from the cache may be somewhat complex. And you would need to have limits on cache size with error handling for cache overflows.
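To make the last few points concrete, here is a small sketch of a cache that is written before sending, removes entries by message id when an ack arrives (so late or out-of-order acks are harmless), and raises an error on overflow. The names `BoundedOutbox` and `CacheOverflow` and the choice to reject new messages when full (rather than drop the oldest) are assumptions for illustration, not something the MQTT spec prescribes.

```python
from collections import OrderedDict


class CacheOverflow(Exception):
    """Raised when the cache is full and a new message cannot be accepted."""


class BoundedOutbox:
    def __init__(self, max_size):
        self.max_size = max_size
        self.next_id = 1
        self.cache = OrderedDict()  # message id -> payload, oldest first

    def add(self, payload):
        """Cache first; the caller sends afterwards using the returned id."""
        if len(self.cache) >= self.max_size:
            raise CacheOverflow(f"cache full ({self.max_size} messages)")
        msg_id = self.next_id
        self.next_id += 1
        self.cache[msg_id] = payload
        return msg_id

    def ack(self, msg_id):
        # Remove by id, not by position: other messages may have been
        # added between the send and the arrival of this ack.
        if msg_id in self.cache:
            del self.cache[msg_id]
            return True
        return False  # unknown or duplicate ack

    def pending(self):
        return list(self.cache.items())


box = BoundedOutbox(max_size=3)
a = box.add("reading A")
b = box.add("reading B")
c = box.add("reading C")
first_ack = box.ack(b)    # B is acknowledged before A
second_ack = box.ack(b)   # a duplicate ack is ignored
d = box.add("reading D")  # there is room again
try:
    box.add("reading E")  # cache is full: A, C and D are still unacknowledged
    overflowed = False
except CacheOverflow:
    overflowed = True
```

Whether to reject new data or discard the oldest on overflow depends on which readings matter more for the application.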
Really Nick is the expert on this I believe, having written MQTT clients for Arduino as well. But whatever the implementation details, this is not likely to be a simple ask.
Would this work? Run a local MQTT server on the remote device (the one collecting the data) and publish the data to that server. On the central system also run an MQTT server, and on that machine subscribe to the topic on the remote server and re-publish it to the local server. Then if the connection fails and recovers, the remote server will automatically queue the messages and serve them to the local system on reconnection. If using Mosquitto this is known as MQTT bridging and can be set up on the central system to happen automatically, so it is not necessary to run the re-publish flow.
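For reference, a bridge of that kind might look roughly like the fragment below in `mosquitto.conf` on the central system. The connection name, hostname and topic pattern are placeholders, and the broker on the remote device would also need persistence enabled if queued messages are to survive a restart.

```
# mosquitto.conf on the central system (names and addresses are placeholders)
connection edge-device-1
address edge-device.example.net:1883

# Pull everything under sensors/ from the remote broker at QoS 1
topic sensors/# in 1

# Keep the session so the remote broker queues messages while the link is down
cleansession false
```

The `in` direction means messages flow from the remote broker to this one, which matches the "central subscribes to remote" arrangement described above.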
Yes that would work (assuming persistence correctly configured). BUT in general most systems prefer the edge devices to connect to the central one so firewalls are easier to configure and connections are only outbound - which is the opposite way round from your idea.
A) as the remote broker is “in the cloud” it is more than likely that you can’t configure the remote broker as you like
B) as @dceejay mentioned you will need to be able to establish the connection in both directions and the client has to be addressable from the cloud
C) you only transfer the problem one step further out of your control.
D) let’s assume that the mqtt out node does everything to meet the QoS 1 or 2 requirements (the pubsub client on many microcontrollers intentionally doesn’t support QoS levels other than 0, to limit the memory footprint); then it is only necessary to make sure that the lost connection is re-established for the queue to work. I assume that many cloud services are limited in that respect, otherwise they would end up with a ton of pending sessions.
Unfortunately that is an excellent point that rules out my suggestion in most cases.
It would be ok on an internal system where the problem might be intermittent wifi but for the use case the OP has it won't do.
Thanks for the feedback everyone. Extremely interesting thoughts and ideas. And yes, there was a typo in my original post - it should have been "QoS 2" and not "QoS 3" (have corrected for completeness).
I think the hand-rolling of a queue/cache is probably the best move. I really like the idea about subscribing to the same topic that you are publishing to, however this relies on the message being sent BACK to the edge device as an acknowledgement, which is not the same as a confirmation that the message has been received by the broker. But I guess thinking wider, this is a classic TCP-like conundrum anyway!
I could dig deeper into the TCP mechanism behind MQTT, however I believe that the 'subscribe to the same topic as publishing and implement queue around this' is an elegant solution and one to look into.