No, I can not.
I can't see many other things that could go wrong. I just don't have this setup running.
That's why I didn't name the running process. It could be anything, starting from, I don't know... an inconsistent power supply to the hard drive.
But what I could see was what I mentioned, and those things can be somehow fixed or improved.
Certainly, if there is a hardware problem then all bets are off, but the claim by @thebaldgeek, if I understand correctly, is that when node red crashes due to a bug in a contrib node, then the stored context can become corrupted. If that is correct, then arguably that indicates a bug in node-red, which would be good to identify and fix.
That has never been a recommended version of node.js. Only LTS versions (that is, the even-numbered versions) should be used. So you should upgrade to 16.
node-red 1.3.5 is pretty ancient, but it should have the code I showed earlier to prevent context file corruption.
You haven't said what versions you are using on the Pis (or whatever they are).
@davidz doesn't say how to duplicate it, unfortunately, though he is coming at the problem from the other side: Once corrupted, Node-red doesn't restart.
I have only bare-minimum knowledge of the different operating systems/platforms Node-RED can run on, and of what is and isn't possible on each of them, but:
If it is possible to determine at startup that some context file is corrupted, and the user has a flag set in settings.js indicating that it is OK to delete or recreate empty context files (or take some other reasonable action) just to get Node-RED started, then I would take that as a solution, because it is entirely possible to build your flows to tolerate an unexpected restart with a bit of data loss.
I agree that Node-red starting but without persistent context data is probably better than not starting at all.
But presumably people use filesystem storage precisely because they need to maintain previous data over a restart, and they might expect that different stores are either all valid or all invalid.
From the information available it seems that something around persistent context storage causes Node-red to be more fragile than other applications.
We really need a recipe to reproduce the issue on demand.
Well, file corruption has been with Windows all along. Even system files get corrupted. Different file systems and hardware behave differently, and there are options to make file writes immediate, without caching, and so on. Everything is somehow related, but it is too hard to prove anything. So to try it out, it just takes powering the machine up, giving it enough workload, and pulling the plug. If nothing happened, you were lucky. But you may lose that Windows machine.
So the thing to rely on is not only your choice of context (or any data) storage.
It is not something Node.js or Node-RED can change. Having taken the time to read the code and the proposed best practices, I have found that Node-RED follows them as well as is possible.
So what I was proposing is just an option to work around it, in case the user decides so.
It is of course much better now than it used to be. And it does use journaling (by default at least). Some say it doesn't do it properly, but that is truly beyond my expertise.
I have seen the issue with power losses back when I was using a small box without a UPS. By the way, I was also using PowerShell to periodically grab data, clean it up to my needs, and store it as a simple, small text file. That was corrupted once. I didn't blame Node-RED for that; the solution was a UPS.
Now I run an old laptop (with a battery still good enough), so I can't complain. But that works for me and can't be taken as a solution for everyone.
But for that other case I don't know. Can it be tested with process.abort() somehow? Would that be "unexpected enough"?
Just to clarify, it is following power failure without clean shutdown. There have been suggestions that it can also happen under node-red crash situations, but that is still unclear.
I think there are two issues here. The first is the restart failure following a corruption. I would vote for a settings.js setting to tell node-red to treat that as a warning rather than a fatal error.
The second is whether the code can be made more resilient to pulling the plug. I posted the code that writes the context earlier: it writes to a temporary file and then renames that to be the real file, which would seem to be solid. However, when using an SD card the data is first written to RAM in the card, and only some time later is that RAM copied to the flash memory on the card. I am wondering whether it is there that the issue arises. Possibly forcing a file system sync operation (to flush the data to flash) before performing the rename might make it more resilient. I thought I might test that, but it isn't clear to me how to do that using the fs module. @knolleary I wonder whether you could comment on whether this might be worth trying, or perhaps someone else in the team could comment.
Yes, the context file corruption only happened after an improper shutdown such as a power interruption.
The first issue arises from Node-RED failing to start when there is a corrupt context file. Hopefully this can be addressed soon.
The second issue (improper shutdown) could be mitigated with hardware design, such as low-voltage detection and shutdown. Increasing the default context file save interval also helps.
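For reference, that save interval can be changed per store in settings.js; the localfilesystem context module takes a `flushInterval` in seconds (the default is 30). A minimal fragment, with the value chosen only as an example:

```javascript
// settings.js (fragment): persistent context with a longer save interval
contextStorage: {
    default: {
        module: "localfilesystem",
        config: {
            flushInterval: 60   // write cached context to disk at most every 60s
        }
    }
},
```

A longer interval means fewer writes (and so fewer chances to be caught mid-write by a power cut), at the cost of losing more recent context data when one does happen.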
As an experiment I want to try executing the system command `sync tmpFile` before the `return fs.rename(..)` in

```javascript
async function writeFileAtomic(storagePath, content) {
    // To protect against file corruption, write to a tmp file first and then
    // rename to the destination file
    let finalFile = storagePath + ".json";
    let tmpFile = finalFile + "." + Date.now() + ".tmp";
    await fs.outputFile(tmpFile, content, "utf8");
    return fs.rename(tmpFile, finalFile);
}
```
but I don't know how to handle the async requirement. Alternatively there may be a way of doing the sync using fs but I can't immediately see it.
Any suggestions gratefully received.
Yes, as long as you are using the default NTFS, that is a fully journaling and (mostly) self-healing FS.
When the OS crashes badly, however, it is possible to end up with a slightly damaged FS, and there are included tools for running consistency checks. If you are having repeated problems with the FS, you should run those first. After that, it is almost certainly a drive fault or some other hardware fault.