I've been using the email-in node for a few months now, and three times I have noticed that it has gotten stuck. I haven't looked into it very deeply yet, but I noticed it wasn't working all day yesterday, so yesterday evening I took a look and it had a blue "fetching" label.
I use a cronplus node to trigger it every 60 seconds (so I can manually trigger it occasionally).
I redeployed and all the backed-up emails came through. Similarly, the previous couple of times, I'd rebooted and it flushed out.
Any tips on how to prevent it from getting into this stuck state or else catch it with a deadman switch and somehow give it a kick?
Yeah. I'm unfamiliar with the countdown timer, but that's exactly what the dead-man-switch node does. It hadn't occurred to me to use exec to restart NR. Doh!
But I might be able to make it a bit smarter... I assume that if I don't receive an email, the email-in node doesn't spit anything out. I get 2 types of emails. One is a very infrequent, but very important email from a water sensor in the basement. The other is a usually frequent email from a security camera (which I ONLY use because it's the only easy way I know of to get a still image from SightHound, which I then forward via SMS).
The cam email is always preceded by a webhook call. So I would want to restart NR in the event that the webhook is not followed by an email say, within 2 minutes. I've done designs like this before, but they're a bit clunky. I usually change a flow variable and check it. I could have the email set a flow variable (indicating an email was received). I could have the webhook go into a 2m delay, followed by a check of the flow variable. If it's false (no email received), trigger a restart.
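A rough sketch of that webhook-then-email check, written as plain Python rather than as a flow so the logic is easy to see. The class and method names are made up for illustration, and the 120-second timeout is just the "2 minutes" mentioned above. In a flow, the same state would live in a flow variable or inside a Trigger node.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WebhookEmailWatchdog:
    """Hypothetical helper: flags a restart when a webhook is not followed by an email."""

    timeout_s: float = 120.0               # assumed 2-minute window
    pending_since: Optional[float] = None  # time of the oldest unanswered webhook

    def on_webhook(self, now: float) -> None:
        # Keep the earliest unanswered webhook so repeated webhooks
        # cannot keep pushing the deadline back.
        if self.pending_since is None:
            self.pending_since = now

    def on_email(self, now: float) -> None:
        # Any received email proves the email-in node is still alive.
        self.pending_since = None

    def should_restart(self, now: float) -> bool:
        # True only if a webhook has been waiting longer than the timeout.
        return (
            self.pending_since is not None
            and now - self.pending_since >= self.timeout_s
        )
```

Calling `should_restart()` on a timer (or from the end of a delay) then decides whether to kick the flows. Keeping the earliest webhook time, rather than the latest, is a deliberate choice so that a busy camera does not mask a stuck fetch.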
Is there a way to trigger just a redeploy of the flow? Seems a tad overkill to restart NR...
To do an action when something doesn't happen for a while, use a Trigger node.
You can use the admin API (POST /flows) to force a restart of the flows.
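For reference, a minimal sketch of what that call might look like from outside Node-RED, using only the Python standard library. It relies on the `Node-RED-Deployment-Type: reload` header of the admin API. The base URL, the optional bearer token, and the placeholder body are assumptions (as far as I know the body is not used for a reload, but check the admin API docs for your version before running this against a live instance). The function names are made up.

```python
import json
import urllib.request
from typing import Optional


def build_reload_request(
    base_url: str = "http://localhost:1880",  # assumed default address
    token: Optional[str] = None,              # only needed if adminAuth is enabled
) -> urllib.request.Request:
    """Build (but do not send) a POST /flows request asking for a reload."""
    headers = {
        "Content-Type": "application/json",
        "Node-RED-API-Version": "v2",
        # "reload" asks the runtime to reload flows from storage and restart all nodes.
        "Node-RED-Deployment-Type": "reload",
    }
    if token:
        headers["Authorization"] = f"Bearer {token}"
    # Placeholder body; assumed to be ignored for a reload deployment.
    body = json.dumps({"flows": [{"type": "tab"}]}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/flows", data=body, headers=headers, method="POST"
    )


def reload_flows(base_url: str = "http://localhost:1880",
                 token: Optional[str] = None) -> int:
    """Send the reload request and return the HTTP status code."""
    with urllib.request.urlopen(build_reload_request(base_url, token), timeout=10) as resp:
        return resp.status
```

Inside a flow, the equivalent would be an http request node with the same method, URL, and headers.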
Before going down that route, though, I suggest slowing the trigger down to 10 minutes. I am wondering whether the fetch is getting stuck and, because you keep retriggering it, it never gets the chance to time out. I don't have any evidence to support that, but it is worth trying. Are you using the latest version of the email node?
Incidentally, I learned the term "deadman switch" via a novel by that name written by my favorite author (Timothy Zahn). I do recommend it.
I use a 60 second fetch interval because I want to know the reported events as soon as possible: water in the basement and someone at the door. I have a rule set up (using the camera motion trigger and a Wemo motion trigger - each of which individually have many false positives, but together are pretty good) that is fairly reliable (correct about 90% of the time), but when I get that message ("Someone is at the door"), I usually wait a few seconds to see the image come in via text to confirm.
One of the things on my to-do list is to set up pilight to recognize and receive the 433 MHz code from my remote doorbell mechanism, so it can send me that "Someone's at the door" text.
That's a good point. I do see there's an update for the email-in node (thinking of your previous suggestion). So I just updated it and will issue a restart. x.17.x to x.18.x. Maybe it will solve the problem and I won't need to implement a mitigation.
Incidentally, where would you find the release notes for the versions? I poked around a bit, but didn't turn them up. The only thing seems to be the git log, but it's not specific to this node (node-red-nodes)...
I'm happy to help debug. If there are some log messages I could uncomment that could, for example, reveal an endless loop, or a timeout block I could add, I'm happy to check it out next time it happens...
Well, I tried out the test flow from this thread. Although it contains an id to ostensibly restart just the flow with the supplied id, it does restart all flows... The docs don't seem to indicate that this is possible, though one dev in the threads said it needed an overhaul in 2020. I don't suppose the overhaul has happened and the docs are outdated? Either way, I'm going to implement the status-watch suggestion and redeploy via a deadman switch on that status node.
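For what it's worth, the status-watch idea boils down to something like the sketch below, again in plain Python just to show the logic. The class name and the 300-second threshold are assumptions. The only thing taken from my setup is that the stuck node shows "fetching".

```python
from typing import Optional


class StuckStatusWatcher:
    """Hypothetical helper: detects a status text that has not changed for too long."""

    def __init__(self, stuck_text: str = "fetching", max_age_s: float = 300.0):
        self.stuck_text = stuck_text
        self.max_age_s = max_age_s            # assumed threshold, tune to taste
        self._since: Optional[float] = None   # when the stuck text first appeared

    def on_status(self, text: str, now: float) -> None:
        if text == self.stuck_text:
            # Remember only the first time we saw the status.
            if self._since is None:
                self._since = now
        else:
            # Any other status means the fetch finished or failed normally.
            self._since = None

    def is_stuck(self, now: float) -> bool:
        return self._since is not None and now - self._since >= self.max_age_s
```

In the flow, a status node pointed at email-in would feed `on_status`, and the deadman switch would fire the redeploy once `is_stuck` becomes true.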