Example of Node-RED Going Wacko! (Or Node.js?)

Nodi.Rubrum · 8 September 2020 22:05

While creating a flow, my Pi seemed to freeze, but when I ran top it was not frozen; it was slammed. I had just tried to deploy, and the animated progress bar was endlessly sliding back and forth. So I decided to check the flow file. Why? Because twice before I have seen the flow file disappear. And, you guessed it, the flow file is gone!

root@pi3modelb1:/home/pi/.node-red# ls -l
total 108
drwxr-xr-x   2 pi pi  4096 Sep  4 18:03 context
drwxr-xr-x   2 pi pi  4096 Sep  4 18:03 cronplusdata
-rw-r--r--   1 pi pi   144 Sep  8 03:21 flows_cred.json
drwxr-xr-x   2 pi pi  4096 Aug 31 18:48 JsonDB
drwxr-xr-x   3 pi pi  4096 Aug 30 22:04 lib
drwxr-xr-x 208 pi pi 12288 Sep  8 02:48 node_modules
-rw-r--r--   1 pi pi  1136 Sep  8 02:48 package.json
-rw-r--r--   1 pi pi 68278 Sep  8 02:48 package-lock.json
-rw-r--r--   1 pi pi   538 Aug 31 21:59 settings.js
root@pi3modelb1:/home/pi/.node-red#

This is on a Pi 3 Model B. Below you can see how, and by what, the Pi's processor is (still) slammed.

top - 21:54:08 up 2 days,  1:30,  1 user,  load average: 3.40, 3.35, 2.03
Tasks:  94 total,   1 running,  51 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.6 us,  0.7 sy, 65.5 ni, 33.2 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem :   999812 total,    48016 free,   621804 used,   329992 buff/cache
KiB Swap:   102396 total,    60300 free,    42096 used.   307944 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
21254 pi        25   5  680224 593992  28472 S 262.1 59.4   6:06.51 node
  321 root      20   0   52932   2712   1248 S   5.4  0.3 228:21.51 pigpiod
  625 root      20   0   50420   2128   1248 S   0.4  0.2   5:37.55 python3
18494 root      20   0       0      0      0 I   0.4  0.0   0:00.86 kworker/u8:1-ev
21209 root      20   0   11660   4360   3604 S   0.4  0.4   0:00.42 sshd

Maybe someone has an idea what could cause this, or how it can happen? I have left the Pi alone for now, and it has been slammed for about 15 minutes. Whatever happened, my flow file is gone, so I have lost the latest version of the flow I was working on. It looks like the only option I have is to stop NR.

[Some Time Later...]

I checked it one more time, now about an hour later, and the CPU load of node is still 262%, which is insane, although the deploy progress animation has finished (or timed out?). The flow file is still missing.

root@pi3modelb1:/home/pi/.node-red# ls -l
total 108
drwxr-xr-x   2 pi pi  4096 Sep  4 18:03 context
drwxr-xr-x   2 pi pi  4096 Sep  4 18:03 cronplusdata
-rw-r--r--   1 pi pi   144 Sep  8 03:21 flows_cred.json
drwxr-xr-x   2 pi pi  4096 Aug 31 18:48 JsonDB
drwxr-xr-x   3 pi pi  4096 Aug 30 22:04 lib
drwxr-xr-x 208 pi pi 12288 Sep  8 02:48 node_modules
-rw-r--r--   1 pi pi  1136 Sep  8 02:48 package.json
-rw-r--r--   1 pi pi 68278 Sep  8 02:48 package-lock.json
-rw-r--r--   1 pi pi   538 Aug 31 21:59 settings.js

Here is the kicker: the backup flow file is also gone. So maybe the safe-save logic needs to be reviewed? It was my understanding that this should not happen, i.e. that the backup flows file and the current flows file should never both be missing at the same time.

-rw-r--r-- 1 pi pi 25726 Sep  8 02:48 .config.json.backup
-rw-r--r-- 1 pi pi   144 Sep  8 02:47 .flows_cred.json.backup

This is not the same Pi on which I have seen this issue in the past, and it is not the same SD card. I have seen this on a Pi 4, a Pi Zero, and now a Pi 3, so no single device or medium is common to the three separate times I have seen this issue. Nor is it the same OS image or even the same NR install, although it is the same OS version, i.e. Buster (Debian 10). Nor is it the same version of NR: it happened once each on 1.0.6, 1.1.0, and now 1.1.3. The Node.js versions are, of course, those installed by the NR install script.

This time I checked for the flow file and the backup flow file before I finally killed NR. It seemed that stopping NR could break the safe-save logic, so I was careful not to repeat that step.

Nodi.Rubrum · 8 September 2020 22:36

Just more information: I finally tried to stop NR. It stopped, but node.js continued to slam the processor. So I rebooted, thinking that if it was an odd OS or media issue, it would show up in due course. But the Pi came back up clean: no media check was done, and the process queue looks normal.

top - 22:31:07 up 4 min,  1 user,  load average: 0.05, 0.07, 0.03
Tasks: 105 total,   1 running,  51 sleeping,   0 stopped,   0 zombie
%Cpu(s):  2.3 us,  0.6 sy,  0.0 ni, 97.1 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem :   999812 total,   756616 free,    86224 used,   156972 buff/cache
KiB Swap:   102396 total,   102396 free,        0 used.   853160 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
  362 root      20   0   10064   1660   1448 S   7.6  0.2   0:21.31 pigpiod
  688 root      20   0    8112   3284   2804 R   1.3  0.3   0:00.19 top
    1 root      20   0   27140   6040   4832 S   0.0  0.6   0:04.02 systemd
    2 root      20   0       0      0      0 S   0.0  0.0   0:00.01 kthreadd
    3 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 rcu_gp
    4 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 rcu_par_gp
    5 root      20   0       0      0      0 I   0.0  0.0   0:00.00 kworker/0:0-eve
    6 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 kworker/0:0H-mm
    7 root      20   0       0      0      0 I   0.0  0.0   0:00.01 kworker/u8:0-ev
    8 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 mm_percpu_wq

Hope all this detail helps as applicable.

Trying_to_learn · 8 September 2020 22:53

Hi.

Please don't get excited. I can't help you with the answer.

But a word of caution.

A similar thing happened to me once and I did a silly thing.

I lost ALL my flows on the machine. ALL! Gone-ski!

I wasn't using a RasPi at that time. It was a NUC with heaps of memory and power.

Stop Node-RED, then start it again in safe mode:

cd .node-red
node-red-stop
node-red --safe

Look at what happens there.

No promises. I am not good in this part of how node-red works.

Oh, and above all: be patient. 5 minutes is not too much of an ask.

Nodi.Rubrum · 9 September 2020 00:55

Excited, no. Surprised, yes. I am familiar with this scenario, and yes, it is likely a misbehaving flow. But that does not, or should not, result in a missing flow file as well as a missing flow backup file. The key point here is that the flow file should be protected, and there is an issue in that respect. I seem to have a talent for tripping up the safe-save sequence that should be protecting the flow file(s). One idea was that killing NR could cause a scenario where the flow file disappears. But in this case I never killed NR, and the flow file still disappeared. And even if the flow file did disappear because you happened to kill NR while it was updating or replacing the flow file, say during a deploy, the backup should still exist. In this specific case, the backup file is also gone. This, I suggest, is worth some consideration, hence my posting the scenario I experienced.

Trying_to_learn · 9 September 2020 01:07

I believe that when you press the deploy (big red) button.....

All the .json files are deleted and new ones made.

So be very careful.

Back them up to a completely different directory.
I learnt this the hard way. 3 years of flows.... Gone.

As an offering that may help you avoid this happening again, here is a script I made up.
(Well, OK: someone else wrote most of it, and I just tweaked the paths, etc.)

#!/bin/bash
# ---------------------------------------
# Simple backup script v1.0
# ---------------------------------------

# Variables
myDate=$(date "+%Y-%m-%d.%H.%M.%S")
#backupFolderName="Backup_$myDate"
backupFolderName="$myDate"
backupSource="/home/pi/.node-red"
backupDest="/home/pi/Backups/NR"
# Note: a glob does not match dotfiles, so the hidden .*.backup files are NOT copied
backupFilter="*.j*"
backupExclude="lost\+found"

# Tell the user what we're working with
echo "The myDate variable contains: $myDate"
echo "A backup of $backupSource/$backupFilter will be made and stored in $backupDest/$backupFolderName"

# Make sure the destination folder exists before copying into it
mkdir -p "$backupDest/$backupFolderName"

# Begin backup ($backupFilter is left unquoted on purpose so the shell expands the glob)
rsync -avz --progress "$backupSource"/$backupFilter --exclude="$backupExclude" "$backupDest/$backupFolderName/"
RC=$?

# We're done; pass rsync's exit code back to the caller
echo "Done!"
exit $RC

Then, in my .bash_aliases I have this line:

alias NRB="/home/me/Mine/NR_Backup.sh"

So every now and then, in a terminal on the RasPi, I just type NRB and everything is backed up.

You may want to adjust the destination path/s.

You could also run the script on a schedule, say monthly, to back things up.

Nodi.Rubrum · 9 September 2020 01:32

If that is true, it is not a safe-save model, and I was told by @colin, if memory serves, that safe-save logic was in use.

Moreover, I found something else wrong with the environment: somehow the dashboard is missing. It shows as uninstalled in npm and in the palette, but I did not uninstall it. I restarted NR in safe mode to resolve the bad flow issue, and various ui_* nodes were reported missing. This adds an odd wrinkle to the situation.

Nodi.Rubrum · 9 September 2020 01:44

This is weird as well: after stopping NR and restarting it again in safe mode, the dashboard now shows as installed. Is there any possible reason for such odd behavior? [Why do I get the 'fun' issues?]

Trying_to_learn · 9 September 2020 01:45

I feel this has gone way beyond my skill set.

I hope someone else will help you.

Just now I think most of the brains trust are asleep....

They should be back in about 7-ish hours.

Nodi.Rubrum · 9 September 2020 03:28

Cool. No worries.

This really is just a question of reviewing the code that saves the flows and flows backup files, and seeing whether something can, or even should, be done to avoid losing the flow file. As I noted before, I guess I just seem to trip over this issue more than others.

The issue with the managed palette doing something funky is one I have now been able to recreate three times. So this is also an issue that might interest the developer(s) of the NR core. Or it might be something they already know of, and I just had not discovered it until now.

Bobo · 9 September 2020 03:32

While it would be unlucky, it is certainly possible that you have more than one dodgy SD card. That would depend on their age, how much they have been accessed, how they have been treated, etc.

So I wouldn't entirely dismiss the notion that the issue may be media related.

Or power related? Have you run dmesg to check for voltage or other issues?
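Besides searching the kernel log (for example `dmesg | grep -i volt`), Raspberry Pi OS can report power problems through `vcgencmd get_throttled`, which prints a hex bitmask such as `throttled=0x50005`. Below is a minimal bash sketch that decodes the commonly documented bits. The function name `decode_throttled` is my own invention, and it is worth checking the bit meanings against the Raspberry Pi documentation for your model (the soft temperature limit bits only apply to some boards).

```shell
# Hypothetical helper (bash): decode the bitmask printed by `vcgencmd get_throttled`.
# Usage: decode_throttled "$(vcgencmd get_throttled)"
# Returns 0 if no flags are set, 1 otherwise.
decode_throttled() {
    local raw="${1#throttled=}"   # strip the "throttled=" prefix if present
    local value=$((raw))          # bash arithmetic understands the 0x prefix
    if (( value == 0 )); then
        echo "OK: no under-voltage or throttling flags set"
        return 0
    fi
    # Bits 0-3: conditions active right now
    (( value & 0x1 ))     && echo "Under-voltage detected NOW"
    (( value & 0x2 ))     && echo "ARM frequency capped NOW"
    (( value & 0x4 ))     && echo "Currently throttled"
    (( value & 0x8 ))     && echo "Soft temperature limit active NOW"
    # Bits 16-19: conditions that have occurred at some point since boot
    (( value & 0x10000 )) && echo "Under-voltage has occurred since boot"
    (( value & 0x20000 )) && echo "ARM frequency capping has occurred since boot"
    (( value & 0x40000 )) && echo "Throttling has occurred since boot"
    (( value & 0x80000 )) && echo "Soft temperature limit has occurred since boot"
    return 1
}
```

The "since boot" bits are the interesting ones here: they stay set even after the supply recovers, so a brief dip during heavy CPU load would still show up later.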

TotallyInformation · 9 September 2020 08:14

Something strange is happening certainly since your experience is very rare. I've not had a loss of data (outside some early experiences with the projects feature but that was me experimenting) ever since I started using Node-RED years ago. Indeed, one of the strengths of Node-RED is its stability.

As others have said, first things to check are:

Make sure you get a good SD card. Samsung Evo or Evo Pro for example. And get an oversized one, say 32GB (allows lots of room for wear levelling).
Check the power supply. Pis are notorious for being picky about their power.

Finally check that you haven't got an OS level script or service running that touches Node-RED's folders (misbehaving backup?) and check that there is nothing in your flows that similarly may touch the userDir.

Colin · 9 September 2020 08:46

Just working through this now, but my first comment is that when it deploys, it renames the original file to .flows_whatever.json.backup, so you need to use ls -al to see it (as it starts with a dot). That would get you back to the version before the deploy.

[Edit] Reading on, I see that you have commented on the hidden files, even though you did not show them in the initial ls.
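Checking the live flow file and its hidden backup in one go can be sketched as below. It assumes the flow file is called `flows.json`: the `flows_cred.json` in the listing at the top of the thread hints at that, since Node-RED derives the credentials file name from the flow file name. If `settings.js` sets a different `flowFile`, pass that name instead. `check_flow_files` is a hypothetical helper, not part of Node-RED.

```shell
# Hypothetical helper: report whether FLOWFILE and its hidden backup
# (.FLOWFILE.backup) exist in USERDIR.
# Usage: check_flow_files [USERDIR] [FLOWFILE]
# Returns 0 if both exist, 1 if either is missing.
check_flow_files() {
    local userdir="${1:-$HOME/.node-red}"
    local flowfile="${2:-flows.json}"   # assumption: default name, override if needed
    local f missing=0
    for f in "$userdir/$flowfile" "$userdir/.$flowfile.backup"; do
        if [ -e "$f" ]; then
            echo "present: $f"
        else
            echo "MISSING: $f"
            missing=1
        fi
    done
    return "$missing"
}
```

For example, `check_flow_files /home/pi/.node-red flows.json` prints one line per file, which is easier to read at a glance than a full `ls -al` of the userDir.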

Colin · 9 September 2020 08:55

I believe that is not correct.

Trying_to_learn · 9 September 2020 08:56

@Colin ok, maybe not 100%. But I was only saying it sounded similar to what happened to me and I lost them all.

I was not wanting to go too deep into things. Also because I am not certain.

knolleary · 9 September 2020 09:02

In the existing code, the following is done whenever it writes a file:

Rename the existing file to the backup file
Write the new file

So the backup file should always exist - it is never explicitly deleted.

It has been noted, however, that by renaming the file first, if there's an issue whilst writing the file, you'll be potentially left without the file intact and you'll have to restore the backup.
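To make that failure mode concrete, here is a minimal shell sketch of the rename-then-write order. It is an illustration only, not Node-RED's actual (JavaScript) storage code; `save_rename_first` is a hypothetical helper that runs a command and stores its output as the file.

```shell
# Hypothetical sketch of rename-then-write.
# Usage: save_rename_first FILE CMD [ARGS...]  (CMD's stdout becomes FILE)
save_rename_first() {
    local file="$1"; shift
    local backup
    backup="$(dirname "$file")/.$(basename "$file").backup"
    if [ -e "$file" ]; then
        # Step 1: rename the existing file to the backup file.
        # From here until step 2 completes, there is no intact FILE.
        mv -f "$file" "$backup" || return 1
    fi
    # Step 2: write the new file. If CMD fails part-way (or the process is
    # killed), FILE is left empty or partial and only the backup holds the
    # last good version.
    "$@" > "$file"
}
```

Running it with a command that fails half-way, e.g. `save_rename_first flows.json sh -c 'printf partial; exit 1'`, leaves `flows.json` containing only `partial`, with the previous version sitting in `.flows.json.backup`.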

So the code has been changed for the next release to do:

Copy (not rename) the existing file to the backup file
Write the new file to a temporary file
Once written, rename the temporary file to the actual file

This process means the original version of the file remains in place until the very last rename step (which should be close to an atomic action) - you cannot be left with the file corrupted or deleted.
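The revised copy, write-to-temp, rename sequence can be sketched the same way in shell. Again, this is an illustration under assumptions, not Node-RED's code: `save_copy_then_rename` and the `.tmp` naming are hypothetical, and the `sync` call is my own addition (not one of the listed steps) to push data out of the write cache before the final rename.

```shell
# Hypothetical sketch of copy / write-to-temp / rename.
# Usage: save_copy_then_rename FILE CMD [ARGS...]  (CMD's stdout becomes FILE)
save_copy_then_rename() {
    local file="$1"; shift
    local dir base backup tmp
    dir="$(dirname "$file")"
    base="$(basename "$file")"
    backup="$dir/.$base.backup"
    tmp="$dir/.$base.$$.tmp"     # same directory, so the final mv is a plain rename
    # Step 1: copy (not rename) the existing file to the backup; FILE stays in place.
    if [ -e "$file" ]; then
        cp -p "$file" "$backup" || return 1
    fi
    # Step 2: write the new content to a temporary file.
    if ! "$@" > "$tmp"; then
        rm -f "$tmp"             # discard the partial write; FILE is untouched
        return 1
    fi
    sync                         # flush dirty data to disk before swapping files
    # Step 3: rename the temp file over FILE. Within one file system,
    # rename(2) replaces the target in a single step.
    mv -f "$tmp" "$file"
}
```

With this order, a command that fails half-way leaves the previous `flows.json` intact and simply discards the temp file; at no point does the directory lack a complete copy of the flows.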

All of that said, if there are any errors writing the file (in the current or future code), they should have been logged to give a clear indication as to what happened.

Colin · 9 September 2020 09:21

At what point in the sequence does node start processing the modified flow, particularly in the case of a partial deploy?

knolleary · 9 September 2020 09:41

The process I've described is what happens when the storage layer is asked to save the flow file. The flow engine doesn't start stopping/starting nodes (regardless of the type of deploy) until the request to save the flow file has completed.

Colin · 9 September 2020 10:00

OK, you can see where my mind was going, no doubt. Presumably that means fully complete, not just requested.
Although even that would really only mean that the system has written it to the disk write cache, I imagine; for that not to get committed to the SD card would need a pretty major event at the OS level, or a hardware problem.

Could it be relevant that top shows node hogging the processor rather than node-red? Normally I see node-red there.

@Nodi.Rubrum are you running node-red in the normal way using the systemd script installed by the update/install script?

[Edit] Another thought: is it possible that there is an error condition in the storage layer that is not properly caught and acted on, so that an error in the sequence goes undetected? Suppose, for example, that the rename failed for some reason, but this was not noticed.

knolleary · 9 September 2020 10:13

It is entirely possible, although as I hope you can appreciate, this code has had quite a lot of testing and been exercised quite a lot over the last 6 years.

You're welcome to inspect the code and see for yourself.

cymplecy · 9 September 2020 10:17

Could I dare suggest doing operation 2 before operation 1, just to try to have at least two copies of the flow file in existence at the same time?
