MQTT oddness on RPi

I've got a flow of low-to-moderate complexity that relies on mqtt input/output nodes. It pulls data from a remote source, processes it and, when done, asks for more data, in an endless loop.

Well, it's supposed to. What happens in reality is that anywhere from 20 minutes to 8 hours after deployment, the data processing just stops. Using an inject node to prime the pump whenever this happens causes more data to be retrieved (I see it come through the mqtt input node) but further processing never occurs.

From this description, it might sound like mqtt is working fine and the problem is with the data processing nodes or logic. However, I've deconstructed the flow and added things back piece by piece to isolate the issue. It's not until I add the mqtt input node into the mix that the weirdness begins. Redeploying (effectively restarting node-red) is the only way to restore function.

Salient Facts

  • Flow is running on Raspbian Stretch (fully updated) on a Raspberry Pi 3B+
  • Node-RED v0.20.5 (behavior was the same when I first created the flow on v0.20.3)
  • Total memory usage on the rpi never exceeds 100M out of 1G available
  • top never reports above 1% cpu usage, even when the issue is happening
  • trace logging is enabled and shows only an ambiguous 'comms.open' entry around the time the issue starts
  • In Editor, mqtt in/out nodes show 'connected' at all times
  • Editor remains fully responsive; jiggling a node and 'deploy'ing temporarily resolves the issue
  • Remote mqtt broker is hosted on AWS and TLS is in use
  • Same mqtt config node is used for input/output
  • 'Use clean session' and 'Use legacy MQTT 3.1 support' settings have no impact on the issue (whether disabled/enabled)
  • RPI hardware has been swapped and the issue follows the code
  • Different power supplies have been used in order to rule out electrical nuisances
  • CPU temp is steady at around 43C
  • A timeout loop has been added to the flow to try to catch the condition and gracefully restart the data flow and, surprisingly, it'll often work for a few iterations before things just grind to a halt altogether and a redeploy/restart is needed. While the node red process could be restarted from within this loop (aka, 'the nuclear option'), important state info that's stored in the flow context would be lost.
  • That same timeout loop also ensures that the flow starts back up in the event of temporary network connectivity loss but this has never been an issue.
  • No sign of errors anywhere - the flow just stops working. I can point to the node where events seem to stop but it's a relatively simple node (node-red-contrib-morse) and, again, if the mqtt-in node is replaced with a looped data feed, the issue never occurs.
  • All data fed into the above mentioned node is sanitized (alphanum only) and no pattern has arisen that would indicate that the data itself is at issue.
  • strace on node-red when the 'hang' occurs repeats the following and it's worth noting that no topic named 'status' is used in the flow:
    writev(22, [{iov_base="\201\\", iov_len=2}, {iov_base="[{\"topic\":\"status/7ba70e35.0089e"..., iov_len=92}], 2) = 94
    epoll_ctl(3, EPOLL_CTL_MOD, 22, {EPOLLIN, {u32=22, u64=22}}) = 0
    epoll_pwait(3, [], 1024, 0, NULL, 8) = 0
    epoll_pwait(3, [], 1024, 949, NULL, 8) = 0
    epoll_pwait(3, [], 1024, 50, NULL, 8) = 0
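
A note on reading that strace output: taking the first two-byte buffer in the writev call as 0x81 0x5C (an assumption; 0x5C is 92, which matches the iov_len of the second buffer), it decodes as a WebSocket frame header under RFC 6455. That would suggest the 'status/...' traffic is node status being pushed to the editor over its websocket rather than anything published to the mqtt broker, though that interpretation is a guess and not something confirmed here. A minimal JavaScript sketch of the decoding, where decodeWsHeader is a made-up helper name:

```javascript
// Decode the first two bytes of a WebSocket frame header (RFC 6455, section 5.2).
// Assumption: the header bytes are 0x81 0x5C, as read from the strace output.
function decodeWsHeader(bytes) {
    const b0 = bytes[0];
    const b1 = bytes[1];
    const opcodes = {
        0x0: "continuation",
        0x1: "text",
        0x2: "binary",
        0x8: "close",
        0x9: "ping",
        0xa: "pong",
    };
    return {
        fin: (b0 & 0x80) !== 0,                 // top bit: final fragment
        opcode: opcodes[b0 & 0x0f] || "reserved",
        masked: (b1 & 0x80) !== 0,              // top bit: payload is masked
        // 126 and 127 mean the real length follows in extra header bytes.
        payloadLength: b1 & 0x7f,
    };
}

const header = decodeWsHeader(Buffer.from([0x81, 0x5c]));
console.log(header);
```

For those two bytes it reports a final, unmasked text frame with a 92-byte payload.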

What other troubleshooting steps can I try here?

How about trying a local mqtt broker on your pi? That could find service or connection issues with the broker. (Yes, mqtt nodes show "connected," but I'm not sure that's conclusive.)

You suggest that node-red-contrib-morse seems to be involved in some way. If you put one debug node on what is going into that node and another on what is coming out, do you see, when it fails, that messages go in but nothing comes out?
If so, put an additional inject node on its input. Wait until it fails, then inject using that: does anything come out?
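
If watching two debug nodes for hours is impractical, the same before/after comparison can be done by recording timestamps. What follows is only a rough sketch in plain JavaScript: makeProbe and looksStuck are hypothetical names, and in a real flow the recording would presumably live in function nodes on either side of the morse node, with the state kept in flow context.

```javascript
// Hypothetical helper: one probe on the input side, one on the output side.
function makeProbe(label) {
    const state = { label: label, count: 0, lastSeen: null };
    return {
        // Call once per message; returns the message unchanged so it can sit inline.
        record(msg, now = Date.now()) {
            state.count += 1;
            state.lastSeen = now;
            return msg;
        },
        snapshot() {
            return { ...state };
        },
    };
}

// True when something went in more than graceMs ago and nothing has come out since.
function looksStuck(inProbe, outProbe, now, graceMs) {
    const lastIn = inProbe.snapshot().lastSeen;
    const lastOut = outProbe.snapshot().lastSeen;
    if (lastIn === null) return false; // nothing has gone in yet
    const noOutputSinceInput = lastOut === null || lastOut < lastIn;
    return noOutputSinceInput && now - lastIn > graceMs;
}

// Example with fixed timestamps (milliseconds) instead of the real clock.
const before = makeProbe("morse in");
const after = makeProbe("morse out");
before.record({ payload: "SOS" }, 1000);
after.record({ payload: "... --- ..." }, 1200);
before.record({ payload: "CQ" }, 5000);
console.log(looksStuck(before, after, 5100, 2000)); // false: still within the grace period
console.log(looksStuck(before, after, 9000, 2000)); // true: input at 5000, no output since
```

Timestamps are compared rather than counts because the node may emit a different number of messages than it receives.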
Also tell us which version of node, npm and node-red you are using. If, in a terminal, you run

node-red-stop
node-red-start

and paste the results, that will tell us the versions and may also provide clues.

You also might want to put a catch node (connected to a debug) on the tab with the issue to see if it shows anything.

Good suggestion on the local broker. I've also thought about trying one of the AWS-specific mqtt clients to see if the behavior changes. The certificate handling on those is a little hokey though (Node-RED really needs a prescribed mechanism for consistent cert handling), so I just haven't gotten around to it yet. I'll try setting up a separate flow with a local broker first to see what it gets me.

Regarding the morse node, yes, the flow of data does die there. Again though, this is only when using mqtt-in. Using an inject node to manually supply data does not force anything out the other side of the morse node after the issue has been triggered.

node-red-start output:
14 Apr 14:36:27 - [info] Node-RED version: v0.20.5
14 Apr 14:36:27 - [info] Node.js version: v10.15.3
14 Apr 14:36:27 - [info] Linux 4.14.98-v7+ arm LE

I've added a catch node and wired it to debug. I did have a catch node on the flow at some point in the past and never got anything out of it. I suppose it's worth just leaving there, though.

I think it must be a problem with the morse node. That it only happens with the mqtt input in place is probably down to the timing of messages, a race condition or something similar. If the node can be driven into a state where it stops responding, even to an inject, then that is a bug in that node.
I would submit an issue against the node.