Node-RED watchdog to ensure long-term operation

GiovanniG · 20 February 2019 19:03

Hi, I have a problem with my Raspberry Pi 3. For the second time (each time about 2 weeks after the last reboot) it froze: it still replied to ping, but TCP seemed dead: no SSH, no Node-RED, no telnet, etc.
The second time things seemed to get worse. The syslog contains the same amount of NUL characters, but this time not only TCP was affected: Node-RED also stopped working. (I have some nodes that are independent of the network; they read the status of I2C I/O (buttons) and drive an I2C PWM, which I use to switch a relay.)
I'm away from home and couldn't reach the Raspberry Pi to fix or reboot it, so I asked someone to cut the power for a moment; after the reboot the problem was gone.
I found out what happened from the syslog:
Feb 19 17:00:00 raspberrypi kernel: [1051671.413479] Under-voltage detected! (0x00050005)
The supply was set to 5.05 V. To correct this I'll try raising it a bit; does anyone know what a good value is? I don't want to exceed the limit. I think anything up to 5.5 V would be OK, but maybe 5.3 V is better.
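
The hex value in that log line can be decoded bit by bit. The sketch below assumes the kernel prints the same bitmask that `vcgencmd get_throttled` reports and uses the bit meanings documented for that command; the name `decode_throttled` is made up for illustration.

```python
# Bit meanings as documented for `vcgencmd get_throttled`.
# Assumption: the kernel's "Under-voltage detected! (0x...)" value uses the same layout.
THROTTLED_FLAGS = {
    0: "under-voltage detected",
    1: "ARM frequency capped",
    2: "currently throttled",
    3: "soft temperature limit active",
    16: "under-voltage has occurred",
    17: "ARM frequency capping has occurred",
    18: "throttling has occurred",
    19: "soft temperature limit has occurred",
}


def decode_throttled(value):
    """Return the labels of all flag bits that are set in `value`."""
    return [
        label
        for bit, label in sorted(THROTTLED_FLAGS.items())
        if value & (1 << bit)
    ]


if __name__ == "__main__":
    # Value taken from the syslog line quoted above.
    for label in decode_throttled(0x00050005):
        print(label)
```

Under that assumption, 0x00050005 has bits 0, 2, 16 and 18 set: under-voltage and throttling, both at the moment of the message and at some earlier point since boot.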

Anyway, I need something that tests the main services and, if necessary, reboots the Raspberry Pi. Is that possible with Node-RED? I searched and found a deprecated watchdog module; in another topic a guy asked about it and running sudo reboot was suggested.
Can I test, every x seconds, for example 127.0.0.1:22? It would also be good to test the router at 192.168.1.1 on port 80 and, if needed, execute a sudo reboot.
I'm not really sure this is a good solution. Maybe I need something that runs at a higher privilege level (kernel) and reboots if the kernel panics. That may be my case, because when it happens I get 2 lines in syslog full of NUL characters, so it seems the kernel, or the log service, is not responding either. Can you suggest or recommend something else?
Thanks a lot!
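
A minimal sketch of the periodic port test described above, written as a standalone Python script rather than a Node-RED flow. The two targets come from the post; the names `port_open`, `all_reachable`, `FailureCounter` and `watch`, the 60-second interval and the limit of 3 consecutive failures are assumptions chosen for illustration.

```python
import socket
import subprocess
import time


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def all_reachable(targets, probe=port_open):
    """Return True only if every (host, port) pair answers the probe."""
    return all(probe(host, port) for host, port in targets)


class FailureCounter:
    """Report True once `limit` consecutive rounds have failed."""

    def __init__(self, limit):
        self.limit = limit
        self.failures = 0

    def record(self, ok):
        # A successful round resets the count, so one glitch does not reboot.
        self.failures = 0 if ok else self.failures + 1
        return self.failures >= self.limit


def watch(targets, interval_s=60, limit=3):
    """Probe the targets forever; run `sudo reboot` after `limit` bad rounds."""
    counter = FailureCounter(limit)
    while True:
        if counter.record(all_reachable(targets)):
            subprocess.run(["sudo", "reboot"], check=False)
            return
        time.sleep(interval_s)
```

Calling `watch([("127.0.0.1", 22), ("192.168.1.1", 80)])` would start the loop. Like any check running on the Pi itself, it only helps while the operating system is still able to schedule the script and execute the reboot.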

JayDickson · 20 February 2019 19:14

I'd go for a heartbeat to an external system rather than rely on a device experiencing a brownout to reboot itself.

GiovanniG · 20 February 2019 20:19

Thanks for the reply. Do you mean an external device that interrupts the power?
For example, an Arduino Nano connected to a GPIO pin: Node-RED would send a pulse on the GPIO every XX seconds, and if it doesn't, the Nano would switch off the relay that powers the Raspberry Pi?

GiovanniG · 20 February 2019 23:34

I've found a service for Raspbian. Has anyone tested it?
https://www.domoticz.com/wiki/Setting_up_the_raspberry_pi_watchdog

michaelblight · 21 February 2019 03:50

I have my Pis send heartbeats to another machine, and if a heartbeat doesn't arrive for 30 minutes, I use an exec node to reboot the Pi automatically by connecting over SSH and running sudo reboot (or docker restart). This code used to get called every few days, but since changing to the official Pi power supplies (even on Pi Zeros), it hasn't been needed as they've never had problems since. Money well spent.
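
The bookkeeping behind such a heartbeat monitor can be sketched in a few lines. This is not the flow described above, only an illustration of the idea in Python: the 30-minute timeout is from the post, while `HeartbeatMonitor`, `reboot_command` and the host name `pi-kitchen` are hypothetical.

```python
import time


class HeartbeatMonitor:
    """Track the last heartbeat per device and list the ones that went quiet."""

    def __init__(self, timeout_s=30 * 60, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock  # injectable so the logic can be tested without waiting
        self.last_seen = {}

    def beat(self, device):
        """Record that a heartbeat from `device` has just arrived."""
        self.last_seen[device] = self.clock()

    def overdue(self):
        """Return the devices whose last heartbeat is older than the timeout."""
        now = self.clock()
        return sorted(
            device
            for device, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )


def reboot_command(host):
    """Build the SSH command line that asks `host` to reboot itself."""
    return ["ssh", host, "sudo", "reboot"]


if __name__ == "__main__":
    monitor = HeartbeatMonitor()
    monitor.beat("pi-kitchen")  # hypothetical host name
    for host in monitor.overdue():
        print(" ".join(reboot_command(host)))
```

The monitoring machine would call `beat()` whenever a heartbeat message arrives and check `overdue()` on a timer. Passing the result of `reboot_command()` to `subprocess.run` assumes key-based SSH login and passwordless sudo on the Pi.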

GiovanniG · 21 February 2019 12:39

For me this is not a good solution: when a "freeze" occurs, Node-RED may stop and can't reboot anything, and the kernel may stop too. That's the worst case, but maybe there is a watchdog working at that level that can reboot. So far it seems the kernel has never panicked, even though syslog stopped, because the Raspberry Pi replied to ICMP ping packets, so something must still be working there. Unfortunately I'm on holiday and couldn't do more diagnosis. I think Node-RED was not working either for some reason, and if it doesn't work I can't reboot from it.
I think the best solution for me is something running at kernel priority that can reboot if it notices something wrong. To reset its timeout I would use Node-RED: for example, Node-RED tests a TCP connection and, if it works, sends something to this kernel watchdog to reset its counter and avoid a reboot.
How can I do this? Thank you
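
One way to get this behaviour is the Linux watchdog character device: once a process opens it, the system is reset unless something keeps writing to it, and writing the character `V` just before closing disarms it again (unless the driver was configured never to stop). The sketch below assumes the device is `/dev/watchdog`, that the script runs as root, and that no other program already holds the device open. `WatchdogFeeder` and `feed_while_healthy` are names made up for illustration, and the Python process stands in for the part the post wants Node-RED to do.

```python
import time


class WatchdogFeeder:
    """Thin wrapper around the Linux watchdog character device."""

    def __init__(self, device="/dev/watchdog"):
        # Opening the device arms the watchdog timer.
        self.fd = open(device, "wb", buffering=0)

    def kick(self):
        """Any write restarts the countdown."""
        self.fd.write(b"\0")

    def close(self, disarm=True):
        """Close the device; with disarm=True send the 'V' magic close first."""
        if disarm:
            self.fd.write(b"V")
        self.fd.close()


def feed_while_healthy(feeder, is_healthy, interval_s=5.0):
    """Kick the watchdog for as long as the health check keeps passing.

    When the check fails, the function returns without disarming, so the
    hardware timer runs out and resets the machine.
    """
    while is_healthy():
        feeder.kick()
        time.sleep(interval_s)
```

The health check could be the TCP test mentioned in the post. The kick interval has to be shorter than the driver's timeout, which this sketch does not query.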

edje11 · 21 February 2019 14:31

Can your adapter deliver enough current? 2.5 A is the minimum for the Raspberry Pi 3B+.
Are you using the bigtimer node? I had the same problem as you a couple of weeks ago, and it was due to the bigtimer node.
Updating the bigtimer node solved the heavy CPU load on the Pi.

moebius · 21 February 2019 14:50

A good solution would require a second, independent piece of hardware as the watchdog,
like an ESP8266 and a relay (maybe a Sonoff), so the watchdog can trigger a cold reboot.

To increase reliability further, the main system could also watch the watchdog.

GiovanniG · 23 February 2019 22:12

Thank you for your kind answers.
I agree that a hardware reset is the best option, but I would like to avoid it for 2 reasons:

1. I would need to work on the box I have already installed, which is unfortunately not in a comfortable place to work on; it takes time that I would like to save.
2. Cutting the power while the Pi is running may corrupt data or cause data loss on the SD card if it happens during a write cycle, so another approach is better.

Honestly, I think a high-priority process that can notice a problem in the kernel and reboot is the best solution to try first. In both freezes I had, the kernel was still working, so I suppose a watchdog would work too.
Also, doing it this way means I don't need to touch the hardware (I'll only slightly increase the voltage to the Pi to avoid the errors).

Since Domoticz developed this watchdog for its program, maybe the Node-RED authors could plan to create a high-priority (ring-level) watchdog too? Thank you.
I'm now trying to contact Domoticz to ask whether the watchdog can be made compatible with NR.

dceejay · 24 February 2019 08:10

There is already a watchdog program available in the Debian repository that may be worth trying: www.sat.dundee.ac.uk/psc/watchdog/Linux-Watchdog.html

Colin · 24 February 2019 09:04

I do this by watching from another machine that the Pi is performing correctly (by monitoring the MQTT messages it should be sending). If it stops then I send a reboot command via ssh. If the operating system is still functioning then that should work. If the operating system has ground to a halt then nothing you can do short of a power cycle is likely to get it going again.
In addition I have a watchdog flow running in node-red in the Pi that watches to make sure the sensors, network and so on are working, and if any of those die then again I reboot in the hope that will bring it back to life.
I have to say though that since I started using good quality SD cards problems have been very few and far between. I have four Pis running like that. In addition I keep image backups of the cards so I can easily get them going again should I get a corrupted card.

moebius · 24 February 2019 10:57

I would suggest running it as a "pseudo-hardware" device, with a read-only filesystem and a scheduled reboot every x days.

By the way, I doubt that increasing the voltage will have an effect. If the power supply is the problem, you need more current. Instabilities could also be caused by the power grid.

GiovanniG · 24 February 2019 11:35

Thank you for the kind answers!
dceejay: it seems like good stuff, it does many checks. In my opinion we need a way to interface it with Node-RED. I mean, it's useful to check things from NR (sensors, alive messages, end of cycles, etc.), so I think we need a node in NR with one input: if no message arrives within some time, it triggers a reboot, not by the node itself but through the kernel watchdog. How can data be exchanged between Node-RED and this watchdog?
It seems the Raspbian kernel already contains a watchdog; the link you provided is a program that interfaces to this watchdog and lets us control checks of our own. Maybe NR can connect directly to the kernel watchdog (through a dedicated node)?

Colin, I understand what you're saying, thank you, but since my Node-RED once stopped working I can't consider this enough; I'm focusing on an external kernel watchdog.

dceejay · 24 February 2019 12:59

The system watchdog can be configured in many ways... maybe try interface mode, where it tests that an interface responds... in this case either TCP port 1880 (where Node-RED is running) or some other port that you configure your flow to respond to (e.g. a tcp in/out pair on another port).

Then the system watchdog will kick in whenever Node-RED doesn't respond. There are many options for it to use so reading the help for it should help you sort out what level of configuration you need.

Colin · 24 February 2019 14:03

The strategy I described does recover from a stopped node-red.

GiovanniG · 15 April 2019 15:25

Thank you all for the answers. I think the watchdog from Domoticz is easier to install and configure; as soon as I figure it out I'll post here.
In the meantime, it also seems important to move the logs to RAM, to avoid many writes to the SD card and instability if the card is not of excellent quality. I used log2ram.

GiovanniG · 20 April 2019 12:03

Mates, I need your help here, I mean I need help from experts.

Unfortunately I again experienced a problem with TCP connections after about 10 days. The power is good now; I feel the problem can't come from there. I'm not sure about the SD card; I'm now logging to RAM and dump to SD only once every hour. This last time NR was working (the I2C bus responded promptly), but there were no open TCP ports any more and log2ram also stopped writing to the SD card. If it's an SD problem, how can the TCP stack be affected while other services (like NR) keep working?

So I'm going to create a new topic for anyone interested in developing a real hardware watchdog for Node-RED, unlike the contrib watchdog nodes that can't reboot the Raspberry Pi, to leave the irrelevant messages here behind and be more visible. Thank you

TotallyInformation · 20 April 2019 13:41

Have you created a new installation of Linux on a new SD-Card? If there is a problem with the existing card, it is possible that a file (or several) is corrupted slightly.

On my oldest installation - a Pi2 - I have a problem that, after my UPS flicks to battery and back, the network sometimes fails to re-engage. The stack is up but no traffic flows. I have a flow in Node-RED that monitors for responses from my router; if that flow isn't sending any traffic for a period, it triggers a reboot of the Pi, which fixes the issue. It doesn't happen on my Pi3, so I'm not sure what causes it, but the Pi3 is a much newer build of Raspbian, while the Pi2 has been updated many times.

GiovanniG · 20 April 2019 13:43

Thanks for the reply. How can I check the integrity of files on Raspbian? I recently updated it without problems.
On Windows I would do a chkdsk /f /r c: and a sfc /scannow
Thanks

Colin · 20 April 2019 14:58

You can check for file system corruption using fsck, but you cannot check for corrupted files. SD cards are not like real discs with error recovery bits (as far as I know), so it is possible for something like a power failure while the card is being written to to cause effectively undetectable corruption. However, having said that, I don't think I have had such a corruption since I stopped using cheap cards. I have several devices that have been running for years and surviving the occasional power failure without problems.
To ease the effort of recovery should such a failure occur, I make image backups of SD cards once they are fully functional and keep notes of any changes made (such things as node-red flows are saved to git on a network server so those can easily be brought up to date anyway). When it gets to the point that I think it would be onerous to go back to the last image, I make a new backup.
