I'm sure a similar question has been asked many times, but I couldn't quite find an answer that I felt comfortable with. I tried reaching out on the Influx community but have had no responses so far.
I'm running InfluxDB 1 on a Raspberry Pi 4B with 4GB RAM. I am using Node-RED to log sensor data and would like to log several measurements once a second. I am pretty novice when it comes to databases and best practices.
I realize that logging values once a second indefinitely is going to create millions of records very quickly and would be difficult for the Pi to query. In theory I will be capturing reports monthly. So if I were to set up a retention policy of 3-6 months (just in case I mess something up on the reports and need to backtrack), records older than this would be deleted, freeing up space? Would this be a reasonable approach?
I know another route would be data aggregation, but since the sensors could register quick, small spikes that I would want to capture, wouldn't aggregation "smooth" these out?
I appreciate any advice on best practices for this type of logging. In the past the number of records I've dealt with has been much lower, so this will be a good learning opportunity for me.
In principle the data access should not be a problem for a Pi 4. The problem with using InfluxDB on a Pi is the SD card. InfluxDB moves data about a lot and SD cards don't like that. If you use, for example, a USB disc or SSD for the Influx data then there should not be a problem.
Yes
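As a rough sketch (the database and policy names here are just placeholders), a 6-month retention policy in InfluxDB 1.x could be created like this:

```
-- Create a 6-month (26 week) retention policy on a database
-- named "sensors" and make it the default, so new writes use
-- it automatically. Names are illustrative.
CREATE RETENTION POLICY "six_months" ON "sensors"
  DURATION 26w REPLICATION 1 DEFAULT
```

Data older than the duration is dropped automatically as its shards expire, so you don't need to delete anything yourself.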
You can use InfluxDB to do the aggregation on older data, using Continuous Queries, but rather than keeping the mean you could keep the max value (or max and min) for each period.
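For illustration only, with made-up measurement and field names, such a continuous query might look something like this:

```
-- Downsample raw readings into 10-minute max/min values,
-- writing the results into a measurement kept under a
-- longer-lived retention policy. All names are placeholders.
CREATE CONTINUOUS QUERY "cq_10m_maxmin" ON "sensors"
BEGIN
  SELECT max("value") AS "max_value", min("value") AS "min_value"
  INTO "sensors"."long_term"."readings_10m"
  FROM "readings"
  GROUP BY time(10m), *
END
```

The `GROUP BY time(10m), *` keeps the tags intact, so the downsampled data can still be filtered by sensor.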
Or, if you are only interested in the spikes, do the aggregation before it goes to influxdb, keeping the max value every 10 seconds, or whatever is appropriate.
Thank you for the response Colin. This is exactly the type of explanation I was looking for.
Unfortunately for me I'm stuck with the SD card, although it is a 128GB Samsung EVO+ from Canakit so as far as SD cards go, I should get decent performance.
I wouldn't say that I'm ONLY interested in the spikes, but they are probably the most important. I think doing the aggregation on older data and keeping the max would be acceptable.
Is there anything else I should keep in mind when working with this amount of data? Will something like Grafana struggle to show graphs or pull in data? I won't be using the browser on the Pi to view Grafana.
You may be ok, I don't know. Certainly make sure you back up the data if it is important not to lose it.
I bought an old laptop with a dodgy screen for next to nothing for running InfluxDB and Grafana. I manage it mostly through SSH and it rarely has the screen open; it has been running for several years without a hitch. It gets all its data through MQTT from the Pi running the automation.
Ok, that is good to know. I have a local weather station that grabs several metrics and logs every 5 minutes, and that's been running great for a couple of years now. But even with two years of data, the number of records is much, much less than this one will have with a reading logged every second.
I just wanted to make sure I wasn't being foolish by thinking I could log once a second for a few months at a time, and then query one month's worth of data, which would be around 2.5 million records.
When I create graphs in Grafana, if I use the "group by" parameter with a fixed time period and then set the aggregation to "max", that is the best way to perform the aggregation we mentioned earlier, right?
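If I understand correctly, that panel setup boils down to an InfluxQL query along these lines (measurement and field names are just examples):

```
-- Roughly what a Grafana panel with GROUP BY time + max runs.
-- $timeFilter and $__interval are Grafana template variables
-- filled in from the dashboard's time range and zoom level.
SELECT max("value")
FROM "readings"
WHERE $timeFilter
GROUP BY time($__interval) fill(null)
```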
Would I be able to set a retention policy of 1 year? Or would 3-6 months be better?
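For example, something like this, if I have the syntax right (database name is just an example):

```
-- Extend the default retention policy to one year (52 weeks).
ALTER RETENTION POLICY "autogen" ON "sensors" DURATION 52w
```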
There are four float values and two integers. They are leak detection sensors and there is a possibility of small quick leaks that I need to capture. I'm also storing the alarm and error state of the sensors.
If you are monitoring for small leaks and using rapid sensor readings to support that, you probably don't need to keep the data for very long. Maybe consider using a continuous query and automated deletion to aggregate the rapid data and restrict it to a day or so.
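As a sketch of that idea (all names assumed), you could keep the raw data in a short default retention policy and write the aggregates into a longer one:

```
-- Keep raw once-a-second data for only a couple of days (the
-- default policy for new writes), and keep aggregates for six
-- months. A continuous query like the one shown earlier would
-- write its output into the "agg" policy.
CREATE RETENTION POLICY "raw" ON "sensors" DURATION 2d REPLICATION 1 DEFAULT
CREATE RETENTION POLICY "agg" ON "sensors" DURATION 26w REPLICATION 1
```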
As with any database server, you are trying to keep the majority of the data in memory so that it isn't hitting the drives. This is especially true of a Pi using an SD card.
InfluxDB is excellent at managing these things. It is also very efficient at handling data of this type, so you would almost certainly be better off using it rather than the Node.js-based Node-RED to do that aggregation, with the possible exception of using a compute flow to turn a stream of inputs into an exception event, since that is all done in memory by a focused process.
I would also check your sensors to make sure that they can actually cope with readings every second - many, especially cheaper ones, cannot. The danger is that you don't actually get the data you think you do. You've probably done this already, but do check the datasheet for the sensor carefully to determine its minimum reliable cycle time.
I was running Node-RED, Mosquitto, InfluxDB, Telegraf and Grafana on a Pi 3 with an Evo SD card and it ran perfectly for several years. The DB was receiving not only data from Node-RED but also data directly from Telegraf (system performance and networking metrics, etc.) and so would be receiving many inputs per second. But I was only keeping that detail data for a week and continuously aggregating to hourly values (keeping a max, min and avg value for each measure).
Like others, I eventually moved over to an old laptop as a single Pi became a limiting factor and even the Evo card eventually started to degrade.
BTW, in terms of using SD cards - you do need to make sure you have a card that does wear levelling (the Samsung Evo cards do) AND that you have lots of spare capacity on the card (otherwise the wear levelling can't do anything useful). Do keep a spare card on the shelf, though, for when your card does fail, and plan to replace the card every few years.
That is good to know. I have verified in the datasheet that the update rate can exceed once a second.
This is part of an internal R&D project, and a lot of the requirements are being set by another engineering team that I'm trying to accommodate.
I did push back on them to try to determine exactly what they are looking for and what realistic behavior they are expecting. I think I've convinced them that instead of logging basically every reading from the sensor, I can log state changes where applicable and any value above a threshold of interest. Then I can also provide a dashboard with a "live view" of the last 12-24 hours that won't need to be logged. They seemed to agree this was a good alternative.
You could get a cheap SSD and USB3 adapter for not a lot more. I use a couple of drives and periodically clone one to the other, then swap them over, which means I have a full backup and reduce wear on the individual drives (much faster too).
Thank you everyone for the discussion. I'm marking Colin's answer as the solution because he was the first to answer my original question. But I appreciate all the responses and hopefully others can learn if they have similar questions.