Developing a real hardware watchdog module, anyone interested?

Hi mates, we all know how uncomfortable unexpected "hangs" of our systems are, sometimes even critical, and how hard it is to guess what went wrong, as they may happen very seldom and are almost impossible to replicate and very difficult to diagnose.
In the past I had power troubles with my Raspberry Pi 3. I have solved them now, but I had a hang once again, maybe caused by my "not best quality" SD card, maybe not, because it affected the TCP stack. How TCP can be connected to the SD card (while other services like NR, the I2C bus and ICMP ping keep working regularly) I really don't understand.

To avoid any possible problems we need a watchdog, I mean a real watchdog that is able to reboot the Pi and solve the problem, and possibly let us know that it happened so we are warned. I see it as the only way to be sure it will work in any case.
There are different ways to obtain that: some will suggest interrupting the power supply somehow, maybe with an external Arduino, others may suggest using internet services that alert you if the "ping" to the Pi stops. I'm not interested in these solutions. I would like to use the internal hardware watchdog of the Pi, which should be activated by turning on its service and managed properly.

A hardware watchdog is present in every microcontroller/automation system and it can trigger a hardware reboot. It should be present in every automation system; for this reason I'm suggesting that the NR developers consider this too. In my opinion it's important.

Features needed:
The Pi should always be reachable by SSH, so we need to check that port 22 is responsive on loopback.
The NR web management port (1880 by default) is good to check too, to see that NR is responsive.
A module in NR where we can set the ports to be checked, and which accepts incoming messages to check that NR is working properly, would be welcome; if no message arrives within a certain time the watchdog is triggered.
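
A minimal sketch of the loopback port check described in the list above, assuming bash and the coreutils timeout command are available on the Pi. The function names port_open and all_ports_open are made up for illustration, they are not part of Node-RED or of any watchdog package.

```shell
# port_open HOST PORT
# Succeeds if a TCP connection to HOST:PORT can be opened within 2 seconds.
# Uses bash's /dev/tcp pseudo-device, so bash must be installed.
port_open() {
    timeout 2 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# all_ports_open PORT...
# Succeeds only if every listed port answers on the loopback interface.
all_ports_open() {
    for port in "$@"; do
        port_open 127.0.0.1 "$port" || return 1
    done
    return 0
}
```

Something like `all_ports_open 22 1880` would then cover SSH plus the Node-RED editor, if Node-RED listens on its default port.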

About the watchdog service there is documentation around, but it is hard for me to understand how to work with it properly. Domoticz did something to manage it, maybe it already checks ports too, so it may be worth having a look at it.
I asked for their help but received none ((
https://www.domoticz.com/wiki/Setting_up_the_raspberry_pi_watchdog

The problem is that cutting the power on something like a Raspberry Pi can be rather fatal to the SD-Card.

If you really want to do it, you can easily use a SONOFF (ESP8266 based power switch).

On my live system, if Node-RED loses connectivity to the network, it triggers a controlled reboot which is all I've ever needed. My Pi's are on PC style UPS's to ensure that they are always on or that they can gracefully close down if the power has been out for too long.

Good point: it would be great to study this Raspbian watchdog service and see whether it already includes shutdown/unmount scripts before it triggers the hardware reboot.

Bad point: as I said, I would exclude using any external device that can interrupt the power supply. I would kindly ask contributors not to talk about it, thank you.

If it is not going to interrupt the power then what do you want it to do?

In the Raspberry Pi there is the Broadcom BCM2835 watchdog timer. To enable it, many posts say to add this line to /boot/config.txt:

dtparam=watchdog=on 

My goal is to use this timer to hardware-reset the Raspberry Pi, without any external device, etc.
While answering your post I searched and found
https://www.raspberrypi.org/forums/viewtopic.php?t=147501
I hope I'll soon have time to read it.

That looks interesting and potentially very useful. Hopefully you will be able to get something working and will post back here.

I'm really glad you consider this interesting! I've found some more info and I asked for clarification on whether it is necessary to type this over SSH:
sudo modprobe bcm2835_wdt
echo "bcm2835_wdt" | sudo tee -a /etc/modules
sudo apt-get install watchdog
sudo update-rc.d watchdog defaults

Then edit the config:
sudo nano /etc/watchdog.conf
uncomment #watchdog-device
uncomment #max-load-1
and add: watchdog-timeout = 15 (for example)
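
The same three edits could be scripted instead of done by hand in nano. This is only a sketch under assumptions: it assumes the stock file has lines starting with #watchdog-device and #max-load-1, and the function name enable_wd_conf is made up. Try it on a copy of the file first.

```shell
# enable_wd_conf FILE [TIMEOUT]
# Uncomments watchdog-device and max-load-1 in FILE and appends a
# watchdog-timeout line (default 15) if none is present yet.
enable_wd_conf() {
    conf=$1
    timeout_s=${2:-15}
    tmp=$(mktemp) || return 1
    # The trailing [[:space:]=] keeps max-load-15 from being matched too.
    sed -e 's/^#[[:space:]]*\(watchdog-device[[:space:]=]\)/\1/' \
        -e 's/^#[[:space:]]*\(max-load-1[[:space:]=]\)/\1/' \
        "$conf" > "$tmp" || return 1
    grep -q '^watchdog-timeout' "$tmp" || echo "watchdog-timeout = $timeout_s" >> "$tmp"
    # Write back in place so the file keeps its owner and mode.
    cat "$tmp" > "$conf" && rm -f "$tmp"
}
```

For example: `cp /etc/watchdog.conf /tmp/watchdog.conf && enable_wd_conf /tmp/watchdog.conf 15`, then compare the two files before copying anything back as root.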

Now I'm figuring out what resets the counter every 15 seconds...

In another thread someone else wrote:
In /boot/config.txt add/change:

watchdog=on

In /etc/systemd/system.conf, change #RuntimeWatchdogSec= to:

RuntimeWatchdogSec=10s

So which config file should be changed?

Don't know. In a quick bit of googling I found a number of posts suggesting apparently different things. Probably there is more than one way to skin a cat.

Someone suggested https://mmonit.com/monit/ to me, a service able to monitor the system, restart services, send email, etc. It's not clear whether it can also manage the hardware watchdog; I asked.
It looks interesting, I need to test it.

I read the post about the Raspberry Pi bcm2835 watchdog. It looks like they configured it in a simple way: the daemon itself reloads the timer every 10 seconds, and if it doesn't (the daemon is stopped or the kernel has panicked) the Pi reboots.
Now we can have the case where the kernel and the daemon are both working, but TCP is hung and the Pi is unreachable. How do we reboot it in this case?
We can make a function in NR that, if TCP does not reply several times in a row, passes the "bomb" to the daemon (a command string), but what if NR stops working too? Who will set off the bomb?

The daemon has some possible checks:
May 20 09:33:02 raspi-server watchdog[707]: file: no file to check
May 20 09:33:02 raspi-server watchdog[707]: pidfile: no server process to check
May 20 09:33:02 raspi-server watchdog[707]: interface: no interface to check
May 20 09:33:02 raspi-server watchdog[707]: temperature: no sensors to check

But so far I haven't found out how to use them. These could be interesting, for example checking a file from NR: if it is not updated (NR has hung, or we intentionally stop updating it from NR) the daemon will reboot.
I need to figure out how.
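
A possible starting point for the file check, not tested here. If I read the watchdog.conf man page correctly, the "file" test is set up with a `file = <name>` line followed by a `change = <seconds>` line, and the daemon reacts when the file has not been modified within that time. The heartbeat path and both function names below are made up for illustration.

```shell
# Hypothetical heartbeat file that Node-RED would refresh while healthy.
HEARTBEAT_FILE="${HEARTBEAT_FILE:-/tmp/nodered-heartbeat}"

# heartbeat
# Updates the file's modification time; meant to be called periodically,
# for example from an exec node driven by an inject node.
heartbeat() {
    touch "$HEARTBEAT_FILE"
}

# print_file_check_conf [SECONDS]
# Prints the two lines that would go into /etc/watchdog.conf
# (assumption: "file" and "change" are the option names, per the man page).
print_file_check_conf() {
    printf 'file = %s\nchange = %s\n' "$HEARTBEAT_FILE" "${1:-60}"
}
```

If NR hangs, or a flow deliberately stops calling heartbeat, the file goes stale and the daemon should then take over the reboot.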

Reading more, I found a simpler way to manage this watchdog; it needs to be tested:
https://www.raspberrypi.org/forums/viewtopic.php?f=29&t=147501&sid=d0e60e399b3af454586a96b0bb5963da&start=25#p1251435
I quote it here:

  • you do not need to initialize/load any drivers, dtoverlays etc.
  • no special configuration of any kind is required
  • don't install the watchdog package
  • don't fiddle with systemd watchdog settings
  • nonetheless the device "/dev/watchdog" is simply there
  • you can write "." to the device "/dev/watchdog" at any time to start the functionality
  • this triggers the hardware watchdog which expires in 15s per default (I don't know how to change this interval)
  • i.e. the machine reboots unless you write "." again to the device "/dev/watchdog" prior to the 15s expiry period
  • to stop the watchdog you simply write "V" to the device "/dev/watchdog"

most of my applications are running some sort of dispatch loop. So the only thing I have to worry about is
to rewrite the "." just in time.

trivial shell script to explain the watchdog behavior

# Kick the watchdog every 14 s, just inside the 15 s expiry period (run as root).
while :
do
    date
    echo "." > /dev/watchdog
    sleep 14
done

if the loop fails to write the '.' just in time (i.e. prior to the 15s expiry period) the machine reboots instantly

So, as he wrote, we may just check whether the device is present; if so, we can add a function in NR that writes a "." to the device to activate the watchdog, and keeps writing from then on. If this works it will be really convenient: we activate it only when NR is active, and we don't have to worry about how long the boot process takes. We can try a soft reboot (sudo reboot) before the watchdog resets, writing a "." just before sudo reboot to gain 15 more seconds. We can also try to unmount devices if we decide to hard reset.
This needs to be tested.
How can I write a "." to /dev/watchdog from NR?

Maybe try the exec node... name echo as the program and pass in . > /dev/watchdog as the parameters.

Thank you. It would also be interesting to automatically write a "V" to /dev/watchdog when we stop NR, to avoid reboots when we are working/debugging/etc. How can that be done?

Write a script to stop node red, consisting of

node-red-stop
echo "V"  > /dev/watchdog

I think it would be better to put quotes round your dot too, just for clarity.
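
Putting the suggestions above together, a small wrapper might look like this. It is a sketch only: the names wd_kick, wd_disarm and wd_ctl are made up, and WATCHDOG_DEV is an override I added so the functions can be tried against an ordinary file before pointing them at the real device.

```shell
# Device to write to; override WATCHDOG_DEV to test against a plain file.
WATCHDOG_DEV="${WATCHDOG_DEV:-/dev/watchdog}"

# Writing "." starts the watchdog or restarts its countdown.
wd_kick() {
    printf '.' > "$WATCHDOG_DEV"
}

# Writing "V" stops the watchdog, as described in the quoted post.
wd_disarm() {
    printf 'V' > "$WATCHDOG_DEV"
}

# wd_ctl kick|stop
# Single entry point, convenient as the command of an exec node.
wd_ctl() {
    case "$1" in
        kick) wd_kick ;;
        stop) wd_disarm ;;
        *)    echo "usage: wd_ctl kick|stop" >&2; return 2 ;;
    esac
}
```

Saved as a script that ends with `wd_ctl "$@"`, this could be what the exec node calls, and the stop script would become node-red-stop followed by the same script with stop.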

and you may want to give slightly more than 1 second grace period before rebooting... if node.js starts its garbage collection routine it can pause other execution, and if other tasks are occurring on the Pi you could accidentally reboot when you don't actually need to. But yes - this is an interesting topic. Will be great to see where you get to with it.

Good point. Is there a way to know when node.js starts its garbage collection routine? Or some other situation where it would be better not to reboot the system?
My goal is to test something, for example the TCP connections. If all is OK I'll keep kicking the watchdog, maybe every second. If the check fails I'll send an email/Telegram message but keep kicking. If it fails for several consecutive seconds I'll send a last kick plus sudo reboot. If something then goes wrong during the reboot, nothing kicks the watchdog any more and it will reset the system; and if NR for any reason is not terminated, it will not kick any more either.
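
The plan above could be sketched in shell like this, outside NR. Everything here is an assumption for illustration: the names update_failures and supervise, the MAX_FAILS and INTERVAL settings, and the idea of passing the health check in as a command. The notification step is left as a comment.

```shell
WATCHDOG_DEV="${WATCHDOG_DEV:-/dev/watchdog}"
MAX_FAILS="${MAX_FAILS:-5}"   # consecutive failed checks before giving up
INTERVAL="${INTERVAL:-1}"     # seconds between checks

# update_failures COUNT STATUS
# Prints the new consecutive-failure count: 0 after a successful check
# (STATUS 0), COUNT+1 after a failed one.
update_failures() {
    if [ "$2" -eq 0 ]; then echo 0; else echo $(( $1 + 1 )); fi
}

# supervise CHECK_COMMAND [ARGS...]
# Kicks the watchdog on every round, even when the check fails, and
# returns once the check has failed MAX_FAILS times in a row.
supervise() {
    fails=0
    while [ "$fails" -lt "$MAX_FAILS" ]; do
        printf '.' > "$WATCHDOG_DEV"
        if "$@"; then status=0; else status=1; fi
        # A failed round is where an email/Telegram alert would be sent.
        fails=$(update_failures "$fails" "$status")
        sleep "$INTERVAL"
    done
    # Last kick: gains one more expiry period for the soft reboot.
    printf '.' > "$WATCHDOG_DEV"
}
```

Usage would be along the lines of `supervise my_tcp_check && sudo reboot`, where my_tcp_check stands for whatever test you choose. If the soft reboot itself hangs, nobody is kicking any more and the hardware watchdog resets the Pi.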

Tested, it works perfectly, as he described. Writing a . activates and kicks it, writing a V stops it from resetting. But it works only as root!
With the exec node can I write to /dev/watchdog as root?
It would be great to create a module for this feature: receiving a msg.topic "On" it would activate and kick the watchdog, receiving a msg.topic "Off" it would deactivate it by sending an echo V. It's also important to send a V every time NR tells its nodes to shut down, even after a deploy, to avoid undesired reboots.
Can someone kindly help? Thank you!

You would have to allow the user running Node-RED to use sudo without a password for that specific command (safest). Better still would be to change the permissions on /dev/watchdog to allow it to be written to by the user running Node-RED.

Thank you for the kind help! Changing the permissions on the watchdog sounds really good. I did it with chmod 666 /dev/watchdog as root; will it keep this setting forever now?
Thank you

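It will keep the setting only until the next reboot: /dev is normally recreated at every boot, so to make the change permanent you would need a udev rule or to reapply it at startup. However, you might also want to lock that down a little more since you've now gone to the opposite extreme and are allowing anyone with access to your server to mess with the watchdog.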

I would make sure that the ownership of the device is something like root:admins or root:wheel and then set chmod to 660 (rw-rw----). Then make sure that the user running Node-RED is a member of the chosen group.
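
On systems where /dev is rebuilt at boot by udev (which I believe includes Raspbian), ownership and mode set by hand may not survive a reboot, so a udev rule is one way to make the 660 setup stick. A sketch, with a made-up function name and the group passed in as a parameter:

```shell
# print_wd_udev_rule GROUP
# Prints a udev rule giving GROUP read/write access to the watchdog
# device nodes (mode 0660, owner root).
print_wd_udev_rule() {
    printf 'KERNEL=="watchdog*", OWNER="root", GROUP="%s", MODE="0660"\n' "$1"
}
```

The output would go into a file under /etc/udev/rules.d/ (the exact file name is up to you, for example via `print_wd_udev_rule wheel | sudo tee` into that file), and the Node-RED user then needs to be added to the same group.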