I've got a low-to-moderately complex flow that relies on mqtt input/output nodes. It pulls data from a remote source, processes it, and when done, asks for more data. It does this in an endless loop.
Well, it's supposed to. What happens in reality is that anywhere from 20 minutes to 8 hours after deployment, the data processing just stops. Using an inject node to prime the pump whenever this happens causes more data to be retrieved (I see it come through the mqtt input node) but further processing never occurs.
From this description, it might sound like mqtt is working fine and the problem is with the data processing nodes or logic. However, I've deconstructed the flow and added things back piece by piece to isolate the issue. It's not until I add the mqtt input node into the mix that the weirdness begins. Redeploying (effectively restarting node-red) is the only way to restore function.
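
For illustration, the request/process/request loop described above could be sketched as follows. This is a minimal stand-in, not the actual flow: the `handleData` function, the `makeContext` helper, the `data/request` and `data/response` topics, and the context keys are all hypothetical, and the "processing" is a placeholder. The handler is written as a plain function so it can be run outside Node-RED; in a real flow the same logic would presumably sit in a function node between the mqtt input and mqtt output nodes.

```javascript
// Minimal stand-in for a Node-RED flow context (get/set only).
function makeContext() {
  const store = new Map();
  return {
    get: (key) => store.get(key),
    set: (key, value) => { store.set(key, value); },
  };
}

// Hypothetical handler for a message arriving from the mqtt input node.
// It processes the payload, keeps state in the flow context, and returns
// the message that would be wired to the mqtt output node to ask for more.
function handleData(msg, flow) {
  const count = (flow.get("processed") || 0) + 1;
  flow.set("processed", count);
  flow.set("lastPayload", String(msg.payload)); // placeholder for real processing
  return { topic: "data/request", payload: "next" }; // hypothetical topic/payload
}

const flow = makeContext();
const request = handleData({ topic: "data/response", payload: "abc123" }, flow);
console.log(request.topic, flow.get("processed"));
```

Each returned request should trigger the next response from the remote source, which is what keeps the loop going; when the stall happens, data still arrives but this chain stops advancing.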
Salient Facts
- Flow is running on Raspbian Stretch (fully updated) on a Raspberry Pi 3B+
- Node-RED v0.20.5 (behavior was the same when I first created the flow on v0.20.3)
- Total memory usage on the rpi never exceeds 100M out of 1G available
- top never reports above 1% cpu usage, even when the issue is happening
- trace logging is enabled and shows only an ambiguous 'comms.open' entry around the time the issue starts
- In the editor, mqtt in/out nodes show 'connected' at all times
- Editor remains fully responsive; jiggling a node and 'deploy'ing temporarily resolves the issue
- Remote mqtt broker is AWS and TLS is in use
- Same mqtt config node is used for input/output
- 'Use clean session' and 'Use legacy MQTT 3.1 support' settings have no impact on the issue (whether disabled/enabled)
- RPI hardware has been swapped and the issue follows the code
- Different power supplies have been used in order to rule out electrical nuisances
- CPU temp is steady at around 43C
- A timeout loop has been added to the flow to catch the condition and gracefully restart the data flow. Surprisingly, it often works for a few iterations before things grind to a halt altogether and a redeploy/restart is needed. The node-red process could be restarted from within this loop (the 'nuclear option'), but important state info that's stored in the flow context would be lost.
- That same timeout loop also ensures that the flow starts back up after a temporary loss of network connectivity, but connectivity loss has never been an issue.
- No sign of errors anywhere - the flow just stops working. I can point to the node where events seem to stop but it's a relatively simple node (node-red-contrib-morse) and, again, if the mqtt-in node is replaced with a looped data feed, the issue never occurs.
- All data fed into the above mentioned node is sanitized (alphanum only) and no pattern has arisen that would indicate that the data itself is at issue.
- Running strace on node-red while the 'hang' is occurring shows the following sequence repeating. Note that no topic named 'status' is used in the flow:
writev(22, [{iov_base="\201\\", iov_len=2}, {iov_base="[{\"topic\":\"status/7ba70e35.0089e"..., iov_len=92}], 2) = 94
epoll_ctl(3, EPOLL_CTL_MOD, 22, {EPOLLIN, {u32=22, u64=22}}) = 0
epoll_pwait(3, [], 1024, 0, NULL, 8) = 0
epoll_pwait(3, [], 1024, 949, NULL, 8) = 0
epoll_pwait(3, [], 1024, 50, NULL, 8) = 0
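
The timeout loop mentioned in the list above can be sketched roughly like this. It is a simplified illustration under stated assumptions, not the code from the actual flow: `STALL_MS`, `markProgress`, `checkStall`, the `makeContext` helper, the context keys, and the `data/request` topic are all hypothetical names. The idea is to record a timestamp whenever processing completes and, on a periodic tick (for example from a repeating inject node), emit a fresh data request if no progress has been seen for too long.

```javascript
const STALL_MS = 60 * 1000; // hypothetical stall threshold: 60 seconds

// Minimal stand-in for a Node-RED flow context (get/set only).
function makeContext() {
  const store = new Map();
  return {
    get: (key) => store.get(key),
    set: (key, value) => { store.set(key, value); },
  };
}

// Call whenever a message reaches the end of processing.
function markProgress(flow, now) {
  flow.set("lastProgress", now);
  flow.set("restarts", 0); // progress seen, so clear the restart counter
}

// Call on every periodic tick. Returns a message that re-requests data
// if the flow looks stalled, or null (send nothing) otherwise.
function checkStall(flow, now) {
  const last = flow.get("lastProgress");
  // No progress recorded yet: the loop has not been primed, so do nothing.
  if (last === undefined || now - last < STALL_MS) return null;
  flow.set("lastProgress", now); // avoid firing again on the very next tick
  flow.set("restarts", (flow.get("restarts") || 0) + 1);
  return { topic: "data/request", payload: "next" }; // hypothetical request
}

const flow = makeContext();
markProgress(flow, Date.now());
console.log(checkStall(flow, Date.now()));
```

Counting consecutive restarts in the flow context makes it possible to see how many recovery attempts happen before the flow stops for good, without restarting the node-red process and losing that context.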
What other troubleshooting steps can I try here?