Context File corrupt/truncated - node-red auto-restart failure

When my device restarts abruptly, the context file gets corrupted because the JSON is truncated. The automatic restart of Node-RED then fails because of this.

One option is to use a DB-based context store instead of a file, but MySQL or SQLite could be a little heavy for the device.

Has anyone solved this problem using something like untruncate-json or an equivalent?

TIA.

Are you seeing the file getting corrupted regularly? It should be a very rare occurrence. The file only gets written every 5 minutes, by default, and on node-red shutdown, so the probability of it being in the middle of writing it at the moment of power fail should be low. Also I would expect the node red code writing it to write a new file and then rename the files so that the new one takes over, I would not expect node red to be overwriting the existing file.
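
As a rough illustration of the "write a new file and then rename" approach, here is a minimal Node.js sketch. It is not Node-RED's actual code; the helper name `writeJsonAtomic` and the `.tmp` suffix are assumptions made for the example.

```javascript
const fs = require("fs");

// Hypothetical helper, not taken from Node-RED: write JSON so that a power
// failure leaves either the complete old file or the complete new one.
function writeJsonAtomic(file, data) {
    const tmp = file + ".tmp"; // assumed temporary name next to the target
    const fd = fs.openSync(tmp, "w");
    try {
        fs.writeSync(fd, JSON.stringify(data));
        fs.fsyncSync(fd); // ask the OS to push the data to the storage device
    } finally {
        fs.closeSync(fd);
    }
    // The rename replaces the old file in one step on POSIX filesystems
    fs.renameSync(tmp, file);
}
```

If power is lost before the rename, the old file is still intact; only the temporary file may be truncated.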

Colin
Yes, it happens quite a few times, and we only find out a few hours after it has happened (IoT gateway edge use case). There are multiple such devices, and some of them can restart frequently because of site-specific issues.

Importantly, we send data to the server using a guaranteed-delivery flow. This flow first writes to the local context (as a queue), and the data transfer flow sends the data by reading from this context storage, so a fairly large amount of data is frequently read and written.

Thanks

Perhaps you could extend the existing file context store code to optionally write to two files instead of just one, and on startup allow for the fact that one of them may be corrupt. That could be a useful optional feature for others.

You mentioned the possibility of using mysql or sqlite instead of a simple file store. I would not expect either of those to survive the corruption any better than a simple file.

I am not an expert on Node-RED, so I would like to understand what writing to two files means: how does it help, how do I write them, how is the context loaded from either file during startup, what is the chance that both files are corrupt, and wouldn't this be a significant overhead?

A database maintains atomicity and data integrity - there is no such thing as a half-written state, is there?

Provided it wrote one of them, waited for the write to complete, and then wrote the other one then presumably only one of them could get corrupted by a sudden power fail.
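
To make the ordering concrete, here is a minimal Node.js sketch of writing two copies one after the other. The function names `writeFlushed` and `saveBoth` are hypothetical and this is not how the Node-RED file store is implemented today.

```javascript
const fs = require("fs");

// Write one file and wait until the OS reports it flushed to the device.
function writeFlushed(file, text) {
    const fd = fs.openSync(file, "w");
    try {
        fs.writeSync(fd, text);
        fs.fsyncSync(fd);
    } finally {
        fs.closeSync(fd);
    }
}

// Hypothetical helper: the backup is only touched after the primary is
// complete, so at any instant at most one of the two files is mid-write.
function saveBoth(primaryFile, backupFile, data) {
    const text = JSON.stringify(data);
    writeFlushed(primaryFile, text);
    writeFlushed(backupFile, text);
}
```

The cost is roughly double the write volume per flush, which matters on an SD card.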

What hardware is it being written to?

By default, the data is only written every 5 minutes. Have you changed the default update rate in settings.js?

I see that MySQL does use a journaling system, so by all means you could write a database version of the file store. As for how to do that, or how to write a double file version of the existing code (which is effectively a simple journaling system), I don't know as I have never been involved in that code.

It is an excellent idea for Node-RED to have a backup context, and load the previous version of context if the current context file is corrupted, or simply skip the corrupted context file to boot up.

BTW, according to the documentation, the context is updated every 30 seconds in default? The probability of getting corrupted is much higher in this case.

Provided it wrote one of them, waited for the write to complete, and then wrote the other one then presumably only one of them could get corrupted by a sudden power fail.

We write with a flushInterval of 10 seconds to ensure data is not lost. That makes data corruption more likely. If we write to two files, both writes would have to complete within that interval. Should I look at extending the localfilesystem module here: node_modules/@node-red/runtime/lib/storage/localfilesystem? Modifying something like this looks beyond me. :) Maybe some pointers would help?

I see that MySQL does use a journaling system, so by all means you could write a database version of the file store.

Yeah, doing what MySQL does with a journaling system would be interesting. But wouldn't it be safer to simply use such a database instead?

What kind of device is Node-red running on, and how come it restarts unexpectedly?

Is some sort of UPS possible to allow controlled shutdown?

It is running at several sites as an IoT edge device, on something like an RPi. Most sites have a backup such as a UPS, but some do not, and that is where we see this problem most often.

I am a bit surprised that these "IoT Edge devices like RPis" can survive multiple power outages (or whatever) with no other effect than corrupt context stores.
Are they running an operating system? Does it reside on a read-only filesystem with just the context stores on a read/write filesystem?

How big are the context store files? How slow is the filesystem? Writing these files really should be an infinitesimal proportion of the device's time.

Is the device connected to a LAN? Are there devices on the network not affected by the catastrophe which brings down the device?

It seems strange that you are using "a flushInterval of 10seconds to ensure data is not lost" but running devices without a UPS. As you say, flushing this frequently makes corruption more likely.

The device should be able to test the context stores when it boots up and before starting Node-red, so there is no good reason for you not to know about it until some hours later.

If you write two copies, one after the other, the chance of them both being corrupt ought to be zero.
At boot, test for valid json. If it fails, test and use the back-up copy.
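
A boot-time check along these lines could look like the following Node.js sketch. The names `tryReadJson` and `loadContext` are hypothetical, and falling back to an empty context when both copies fail is an assumption, not something Node-RED does for you.

```javascript
const fs = require("fs");

// Try to read and parse one file; report failure instead of throwing.
function tryReadJson(file) {
    try {
        return { ok: true, value: JSON.parse(fs.readFileSync(file, "utf8")) };
    } catch (err) {
        return { ok: false, error: err }; // missing, empty or truncated file
    }
}

// Hypothetical startup check: prefer the primary copy, then the backup.
function loadContext(primaryFile, backupFile) {
    for (const file of [primaryFile, backupFile]) {
        const result = tryReadJson(file);
        if (result.ok) {
            return { source: file, data: result.value };
        }
    }
    // Both copies unusable: start empty rather than refuse to boot
    return { source: null, data: {} };
}
```

A script like this could run before Node-RED starts and log which copy was used, so a corrupt file is noticed at boot rather than hours later.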

You didn't answer my earlier question (or I missed it), what device is the context file being written to? If it is a flash device then the write consists of two stages. First the data is written to internal RAM in the chip. That is very fast. Then it is burnt to the flash part of the chip. That is slow. I expect you have noticed that if you write to an SD card (for example) and then unmount it, you may get a popup telling you to wait while the data is written. It is the flash write that you are waiting for. How big is the flash file? Are you sure that the write operation (including the flush to flash) can be completed in 10 seconds? If not then you may be telling the system to write a new version even before it has finished flushing the previous one. Since you are continually rewriting it then more and more updates will get queued, which sounds like a recipe for disaster.

You are correct about the default context save being 30 seconds. I wonder where I got the idea that it is 5 minutes.

[Edit] How is the storage device formatted (FAT, ext3 etc)?

Also, have you looked at the REDIS context store? That might be more robust?

I've long wanted to write a better context store library to be able to add events on update and other actions but also maybe to have it use a simple non-SQL db as well.

Sadly, time seems to be constantly against me.

Sorry Colin - I missed answering that part. Node-RED on an RPi-like device writes to an SD card. It is also installed on Windows machines, where it writes to a hard disk. Mostly the devices are RPi types.

Regarding your point that SD card writes are slow: we have a data capture frequency (from the underlying equipment) of about 2-5 seconds. We have not seen delayed writes at sites with a UPS backup, so I am assuming the bottleneck is not there.

I will check how the device storage is formatted and get back to you soon.

Yes, I have looked at Redis. Again, I would need to run a Redis server on the device or on the local network, and that would be about the same overhead as running a lightweight database. Hence it is not my preferred option.

Also, how large is the context file?

The context file goes from 1 MB to 100 MB depending on how much data is written and how long it is offline. One record is about 2-5 KB.

I am thinking of an alternative where we write the data to a CSV file instead of the context file, and maybe keep the file name(s) of the CSV in the context storage. That way incomplete records/rows can be ignored. At least the system won't fail to restart on incomplete records as it does today, and there are no parsing issues with a truncated file.
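
The idea of ignoring an incomplete trailing row can be sketched as below. This assumes one record per line with every complete record terminated by a newline; the function name `completeRows` is hypothetical, and fields containing embedded newlines are not handled.

```javascript
// Return only the rows that were fully written, i.e. terminated by a newline.
// Anything after the last newline is treated as a truncated record and dropped.
function completeRows(text) {
    const lastNewline = text.lastIndexOf("\n");
    if (lastNewline === -1) {
        return []; // no complete row yet
    }
    return text
        .slice(0, lastNewline)
        .split("\n")
        .map(row => row.replace(/\r$/, "")) // tolerate Windows line endings
        .filter(row => row.length > 0);
}
```

With this approach a power failure costs at most the record being written at that moment, and the rest of the file stays readable.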

100MB every 10 seconds is 36GB/hour. I hope you are using top quality SD cards, otherwise they will not last long.

The reason for asking about the format of the SD card is that if it is using ext2 there is no journaling, so will not survive power down well. Better would be ext3 I think. You can check using df -T and looking for the appropriate mount point. Probably /.

Colin,
The type is ext4. On a normal day the file is only a few KB. Only when there is a network failure (uplink MQTT server) does data pile up and the file grow, which is rare (at worst once a day).

The current problem is that at some sites without a UPS this issue occurs, the device does not come back up, and we do not find out for a few hours or, in some cases, even days. I want to fix that case as well.

I have read that ext4 is bad for sd cards, but I do not have expert knowledge in that area. Ext4 does include journaling, so the file system should automatically recover from a power fail.

A Node-RED command line option to treat context read failures as a warning might be worthwhile.