Yes, but this problem has only been seen a very small number of times, so it must be something subtle. The question is: what is @Nodi.Rubrum doing that is unusual? Unless it is hardware, of course.
When this has happened, I have had a runaway flow that slams the processor(s), in the latest case to between 200 and 300% load. I can now create the issue at will: all it takes is a for loop in a function node that fails to update its index, so the loop runs forever, very fast.
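To illustrate, this is roughly the shape of the function node code I mean (a minimal hypothetical example, not my actual flow):

```javascript
// Runaway loop: the index is never incremented, so the condition never
// becomes false and the node process spins at full CPU, blocking the
// Node.js event loop.
for (let i = 0; i < 10; ) {
    // i++ forgotten here
    msg.payload = i;
}
return msg;
```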
There was a working theory that stopping NR could be part of the file loss issue, given how I killed NR when I first discovered the problem. Colin suggested that I might have killed NR while the flow file was being updated, and this seemed reasonable at the time. However, given that in the latest tests I was careful to validate the state of the flow and flow backup files before stopping NR, I no longer believe our past theory is applicable.
The 1st issue is the file loss itself, which I tripped over about 2 months ago and first posted here. My surprise was that it could happen at all: when I develop code that writes to key files, I always use a safe-save model. IMHO this should always be done for configuration files that are key to the robustness of a solution.
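By a safe-save model I mean something along these lines (a minimal Node.js sketch of the general pattern; I am not claiming this is how NR does it):

```javascript
const fs = require('fs');

// Write to a temporary file, flush it to disk, then atomically rename it
// over the real file, so a crash mid-write never leaves a truncated config.
function safeSave(path, data) {
    const tmp = path + '.tmp';
    const fd = fs.openSync(tmp, 'w');
    fs.writeSync(fd, data);
    fs.fsyncSync(fd);          // force the data out of the OS buffers
    fs.closeSync(fd);
    fs.renameSync(tmp, path);  // atomic on the same filesystem
}
```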
The 2nd issue, which is odd as well, is that after the processor slam, if I stop NR gracefully (via systemctl stop nodered, for example) and then run node-red --safe, I consistently saw the manage palette reporting odd results, as well as a popup dialog stating that various nodes were (now) missing. I documented this earlier in this thread. If I stopped and restarted NR one more time, the issue would disappear. This was a new twist.
As to whether this is hardware or software based, I have tested on multiple Pi devices with multiple microSD cards; this is not device or media specific IMHO. I have used the same OS version for each test, i.e. Pi OS Buster 10 with all updates available at the time of the test installed. I have also been able to create the file loss issue on various versions of NR, but I am sure that is not a new wrinkle, given the file save code has not changed from version to version per my understanding.
I always run Node-RED via systemctl for start, stop, and status. The only exception is when I know I have a bad flow and run node-red --safe so the flows are not started automatically.
Well yes - if you put an infinite loop in your code then, as with any programming language, you are going to tie up the process.
As I've described above, we do use a safe method for writing the file. We have identified an improvement that will be in the next release, but from an external point of view, when the call to write the file completes, we have done everything in our power to ensure the file has been written before moving on. Only then does the new flow get started. So it's hard to see how a badly behaved flow could cause the storing of the flow file to fail - given that is meant to have long completed before the flows are started.
The only explanation I can think of is that there is some remnant of the file writing process still left outstanding, and by putting the runtime into an infinite loop, that process is never allowed to complete. But I really cannot see what that could be, given the lengths we go to to get the file write completed before the flows are started.
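In outline, the sequencing being described is something like this (an assumed sketch with hypothetical helper names, not the actual Node-RED source):

```javascript
// The runtime is only meant to start the new flows once the save has
// fully resolved, so a runaway flow should not be able to interfere
// with the write that preceded it.
async function deploy(newFlows) {
    await saveFlowsToDisk(newFlows);  // safe write: temp file + rename
    await startFlows(newFlows);       // only reached after the save resolves
}
```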
The only way I can explain that is if, when you start it with node-red (rather than as a service with node-red-start), you are running as a different user, so it is loading from a different userDirectory - one without your extra nodes installed in it. Check the startup log and compare the paths it logs.
I suggest two actions. 1) Keep this thread in mind, if or when this issue becomes significant. 2) This should be easy to recreate; I was able to do it several times last night. So if anyone else can reproduce it consistently, I leave it to the development team to address as reasonable or applicable. There is no doubt I find these issues; a significant part of my past IT career was QA/QC validation of software design, i.e. I was tasked with trying to break things.
I agree, this may be a case of the file system telling NR all is good while the actual write is still pending, or some variant of that scenario. As I said, I documented it, so there is a reference for future need or consideration.
@Colin, I believe I was always running under my pi user id for all tests and such. When I ran top, that was from a different SSH session as root. But if I test again, I will explicitly confirm the user context. Yes, NR is installed under the pi user id.
Is your system doing a lot of file reads or writes, either in Node-RED or in other processes?
Good question... No, I was developing and testing on a Pi device that only had a couple of flows, which drive GPIO input and output. The GPIO-based IO is automated tasking that runs once per minute, so this Pi device was not busy. I have other Pi devices with extensive GPIO tasking; it takes a lot of flows, or complex flows, to tax Pi3 and Pi4 devices. A Pi Zero is a different animal, but I don't do serious development on a Pi Zero.
I just got some 32GB microSD cards that are really fast, faster than any I have had before. All of my microSD cards are class 10 or better, but there is some speed variance across vendors of course.
And I realized I did not answer another question... I use high quality power supply units. The under-voltage issues on the newer Pi3 and Pi4 devices have, I think, driven everyone to better quality power, no?
Are you using the standard systemd script or have you tweaked anything?
Meant to say, when you say it is repeatable, do you mean just the hang, or also losing the files?
All standard. I use systemctl status/restart/stop/start as needed. The unit file is as the script deploys it, and I use node-red --safe as noted above. The hang is repeatable, and in last night's testing I was able to get the files to disappear twice in a row. In previous experience the file loss was not as consistent. This morning I was not able to get the file loss to happen in the one test I did. I will do some more testing in the next few days with the newer, faster microSD cards I just got, to see whether that has any impact on the scenario.
I note your file listings at the very start of this topic were done with ls -l. That will not show the backup files, as they start with a dot. Have you checked with ls -la to see if they are still there? (Apologies if you've already addressed this... I have not read back through the whole thread again.)
Can you describe exactly the procedure you go through after deploying please (and before, if you think it relevant), so I can try to replicate it? How is the function node containing the loop triggered? In fact, if you have a flow segment that can be exported, that would be good.
Well, power issues can be caused by all sorts of things, including the cable.
I would still run dmesg (dmesg(1) - Linux manual page) at the command prompt and examine the kernel log. It is the best way to see if any voltage issues are being reported, and it only takes a second to do.
Sure, will do as I test, but last I checked, no such issues. I did just check the Pi3 device, no under-volt warnings in dmesg.
Steps to replicate:
- Start SSH session 1 and start top; start SSH session 2, making sure the user id is the same as the owner of the NR instance
- Start a browser, jump to the wire editor, and create a flow that just invokes a function node with an endless loop (like the one sketched earlier)
- Deploy the flow
- Check top: the node process should go crazy and climb above 100%; on a Pi3 it will continue to about 262% or so and hold at or near that point for an extended period
- Return to the wire editor and create another flow; the specifics are not critical, just do something to flag a pending deploy
- Do the deploy
- This should trigger the deploy animation and take a very long time; sometimes it times out, sometimes it continues for a far longer time
- While the deploy is attempting to complete, check whether the flow file and flow backup file still exist
- In my recent cases, poof, the files have disappeared
- At this point the deploy may still be running or have finally timed out; it may even report that the connection to the server is lost
- Check top; node should still be slamming the processor
- At this point, decide whether to let things continue or to trigger a graceful shutdown of NR, i.e. get node to back off slamming the processor. If stopping NR, use session 2 and make sure the user id is the same as the NR context.
I think I archived the broken flow, which I have since corrected. If I can find the broken original, I will forward it to you. It was just parsing a few web page GET requests against a cable modem. It was the function node that parsed the page results that had the bad loop.
Ok, I am glad I asked; a possibly key point is that you try to deploy on an already clogged-up system, which I had not realised before.
I thought that node was single-threaded, so I am not sure how that ever manages to do anything. @knolleary any thoughts on that?
@Nodi.Rubrum have you verified the flow files exist and have been updated before you do the second deploy at step 5?
Everything is threaded in JS from what I have read; I see references to how node.js is threaded. The fact that you have to jump through hoops to get a simple pause to work that does not block the main event queue of node/JS suggests that it is threaded. Now, could NR itself be single-threaded, given that it is really just a node app/project?
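The hoops I mean look something like this (just a sketch, the names are for illustration only):

```javascript
// Non-blocking pause: setTimeout hands control back to the event loop,
// so other callbacks can still run while we "wait".
function pause(ms) {
    return new Promise(resolve => setTimeout(resolve, ms));
}

// versus a busy-wait, which ties up the whole process for the duration:
function badPause(ms) {
    const end = Date.now() + ms;
    while (Date.now() < end) { /* spins, starving the event loop */ }
}
```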
If it is single-threaded, NR or node, that would still not explain why NR does not get a failed write error on the deployment commit. Are you doing the file saves, file renames, etc. as async or sync method calls? Are the file operations wrapped in Promises?
I suspect NR is being fooled by something somehow. The wire editor acts like it thinks everything is OK, node hogs the processor capacity, and the files disappeared during the slamming of the processor.
But the system was not completely locked up: I could open and close SSH sessions and run top in a separate session (I had another open watching the NR log in the 2nd test). That suggests the OS was still responsive even if NR was 'stuck' completing a deployment.
No. User-space code is single threaded - it's called the event loop.
If you have a function node in an infinite loop you are locking up the event loop and I can't see how the runtime would be able to even begin to handle the http request for the second deploy.
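A tiny stand-alone illustration of the effect (an assumed example, not NR code):

```javascript
// A pending callback (a timer here, but it could just as well be the
// handler for an incoming HTTP deploy request or an fs completion
// callback) never runs while the loop spins, because a single thread
// services the entire event loop.
setTimeout(() => console.log('this never prints'), 100);

while (true) {
    // busy loop: control is never returned to the event loop,
    // so the callback above is never dispatched
}
```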
No, but they exist in the editor, so at least the main flow file must have existed. I had 5 flows in the editor, and it was a 6th new flow I was working on that had the bad loop. I had worked on the 6th flow in several stages, testing each page request/response, before I created the bad loop issue, so I was deploying and checking the dashboard often after each deployment.