Using Parquet file instead of csv?

nodenoob · 31 January 2022 12:54

I am currently using a CSV file for saving and reading my data. Is there an easy way to use a Parquet file instead? I read that this would be faster and would save storage. Does anyone have experience with Parquet files?
Oh, and what else would I need to do to use Parquet files on my Raspberry Pi, where Node-RED is also running?

TotallyInformation · 31 January 2022 13:26

Interesting, I'd never heard of that format. It is column-based rather than row-based.

I think the reason I've never come across it is because it is one of those appalling Apache things where they insist on everything using Java.

So unless you are using a Java service infrastructure, using something like Parquet is probably not on the cards.

Good news is that there are some other column-based databases out there if you look for them.

But the bad news is that comparing CSV to Parquet is like comparing cars to rockets - they both can get you from a-to-b but you will only want to drive one type when going to work!

CSV files do not require a service in order to interact with them and that is their main strength.

A better comparison would be comparing Parquet to a relational DB (which is row-based) I would have thought?

Colin · 31 January 2022 13:39

How big are the csv files going to be that you are concerned about size and speed?

nodenoob · 31 January 2022 13:51

The plan is to store about 100,000 rows per year. I do not want to use InfluxDB because I do not want to pay anything. Is there any other easy-to-use method for saving and reading data with Node-RED on a Raspberry Pi?

kuema · 31 January 2022 14:00

Unless you plan to use their hosted service, InfluxDB is open source MIT-licensed software and can be run for free on your own hardware (even a Pi).

nodenoob · 31 January 2022 14:17

Oh, I thought it was not open source and that you had to pay for the service. Can I read the data from InfluxDB as easily as from a CSV file, and can I display saved data in a chart after Node-RED has been stopped?

bakman2 · 31 January 2022 14:18

Yes, and/or you could even use SQLite. Inserting 25k rows takes 1.5 s.
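To give an idea of what a bulk insert like that looks like, here is a minimal sketch using Python's built-in sqlite3 module. The table name `readings`, its columns, and the synthetic values are made up for illustration, and the timing you see will depend on your hardware and storage, so do not read it as a benchmark.

```python
import sqlite3
import time


def bulk_insert(rows, db_path=":memory:"):
    """Insert (timestamp, value) rows in a single transaction.

    Returns the open connection and the elapsed time in seconds.
    The table layout is hypothetical, chosen only for this example.
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS readings (ts INTEGER NOT NULL, value REAL NOT NULL)"
    )
    start = time.perf_counter()
    # One transaction for the whole batch: commits on success, rolls back on error.
    with conn:
        conn.executemany("INSERT INTO readings (ts, value) VALUES (?, ?)", rows)
    elapsed = time.perf_counter() - start
    return conn, elapsed


# 25,000 synthetic rows, one every 5 minutes (values are invented).
rows = [(1_643_630_400 + i * 300, i * 0.1) for i in range(25_000)]
conn, elapsed = bulk_insert(rows)
count = conn.execute("SELECT COUNT(*) FROM readings").fetchone()[0]
print(f"inserted {count} rows in {elapsed:.3f} s")
```

Wrapping the whole batch in one transaction is usually what makes this fast; committing after every row tends to be much slower, especially on an SD card.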

kuema · 31 January 2022 14:23

Yes, of course. I would prefer InfluxDB over using files in this use-case. It even opens up new possibilities, like using Grafana for nice dashboards and reporting.

I'd recommend reading up on the InfluxDB basic concepts to find a schema that fits your data, meaning how to structure your measurements using tags and fields. It all depends on your data and how you want to use it, of course.

Colin · 31 January 2022 14:24

Yes, use Grafana for that.

bakman2 · 31 January 2022 14:25

"I would prefer InfluxDB over using files in this use-case."

What is the use-case here? InfluxDB is perfect for, and specifically meant for, time series data.

kuema · 31 January 2022 14:29

You're right, of course. I was unknowingly assuming it's about time series data. But if that's the case, I'd use Influx.

nodenoob · 31 January 2022 14:35

I have electrical data from three inverters. Yes, all of the data is time-based, so I think I will go for InfluxDB. That would be more elegant than using CSV files, because I have to save about ten values per timestamp, and I will probably have about 105,000 timestamps a year.
Is Grafana also open source? Is there a way to use Grafana in my Node-RED dashboard?
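As a rough sanity check on those numbers, here is a small sketch of the arithmetic. The 5-minute logging interval is an assumption on my part; it is simply the interval that gives roughly 105,000 timestamps per year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # non-leap year


def samples_per_year(interval_minutes):
    """Number of timestamps logged per year at a fixed interval."""
    return MINUTES_PER_YEAR // interval_minutes


timestamps = samples_per_year(5)   # assumed 5-minute interval
values = timestamps * 10           # about ten values per timestamp
print(timestamps, values)          # 105120 1051200
```

So the data set grows by roughly a million individual values per year.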

kuema · 31 January 2022 14:40

With that kind of data, Influx is the way to go. Evaluating and aggregating your data will be a lot easier, assuming you make proper use of tags and fields to structure your measurements.
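To illustrate the tags-versus-fields split, here is a minimal sketch that formats one reading as an InfluxDB line protocol string. The measurement name `inverter`, the tag `inverter_id`, and the field names are assumptions for this example, not a recommended schema. The helper only handles float fields and the basic escaping of commas, equals signs, and spaces.

```python
def to_line_protocol(measurement, tags, fields, ts_ns):
    """Format one point as InfluxDB line protocol (float fields only).

    Layout: measurement,tag=value,... field=value,... timestamp_in_ns
    """
    if not fields:
        raise ValueError("a point needs at least one field")

    def esc(text):
        # Tag keys, tag values and field keys escape commas, equals signs and spaces.
        return text.replace(",", "\\,").replace("=", "\\=").replace(" ", "\\ ")

    # Measurement names escape commas and spaces only.
    name = measurement.replace(",", "\\,").replace(" ", "\\ ")
    tag_part = "".join(f",{esc(k)}={esc(v)}" for k, v in sorted(tags.items()))
    field_part = ",".join(f"{esc(k)}={float(v)}" for k, v in fields.items())
    return f"{name}{tag_part} {field_part} {ts_ns}"


line = to_line_protocol(
    "inverter",                                   # hypothetical measurement name
    {"inverter_id": "1"},                         # tag: which of the three inverters
    {"power_w": 1520.5, "voltage_v": 230.1},      # fields: the measured values
    1_643_630_400 * 1_000_000_000,                # 2022-01-31 12:00:00 UTC in nanoseconds
)
print(line)
# inverter,inverter_id=1 power_w=1520.5,voltage_v=230.1 1643630400000000000
```

With a layout like this, the inverter number is a tag you can filter and group by, while the ten or so measured values per timestamp become fields on the same point.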

Grafana is free and open source as well. It comes with a standalone webserver, so you can show it in your NR Dashboard using an iframe.

TotallyInformation · 31 January 2022 15:23

It's a slight faff, as you have to change some settings in Grafana so that it allows embedding, and you will need to take a little care with security settings. But in general, as kuema says, yes.

Not as easy as a CSV file - what is?

But once you've got your head around it, it is OK. Using Grafana also has the advantage that it has a query builder for InfluxDB built in - though do yourself a favour and avoid the newer Flux query language as Grafana doesn't have a point & click query builder for that yet.

Grafana will output data as charts or tables and there are many extensions for doing other clever things like overlaying data on an SVG image, creating graphs (nodes connected with wires) and much more. You can even create interactive dashboard elements by creating links back to Node-RED http endpoints. So depending on your needs, you might not even need the Node-RED dashboard.

nodenoob · 31 January 2022 15:31

Okay, thanks for the advice. When I read it I was about to set up an initial user in InfluxDB 2.1.1, but I guess I am going to use 1.8 instead, since the Flux query language is not so convenient to use, as you said.

TotallyInformation · 31 January 2022 15:46

Flux will get there I'm sure, it's just a bit early. v1.8 can use both if you like, so you can use it where it makes sense and ignore it elsewhere.

Do yourself another favour and read up on continuous queries and retention policies.

A common mistake to make is letting InfluxDB continue to accumulate detailed data for ever. Eventually you will find it slowing your machine down or even crashing. Avoid this by using a continuous query to aggregate data for long term storage along with a retention policy that automatically trims the detail data keeping it to a manageable size.

For example, I may keep all of my environmental sensor data at 1-minute intervals for a month. But a continuous query aggregates all of the detail to hourly avg/max/min data points which I keep for 5 years. (*)
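To show what that kind of aggregation computes, here is a minimal sketch in plain Python. It is not how InfluxDB implements continuous queries and it does not talk to InfluxDB at all; it only illustrates reducing 1-minute samples to one mean/min/max point per hour. The function name and the sample data are made up.

```python
from collections import defaultdict


def downsample_hourly(points):
    """Aggregate (unix_seconds, value) samples into hourly mean/min/max.

    Returns a dict keyed by the start of each hour (unix seconds).
    """
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % 3600].append(value)  # truncate timestamp to the hour
    return {
        hour: {"mean": sum(vals) / len(vals), "min": min(vals), "max": max(vals)}
        for hour, vals in sorted(buckets.items())
    }


# Two hours of synthetic 1-minute samples starting on an hour boundary.
start = 1_643_630_400
points = [(start + 60 * i, float(i)) for i in range(120)]
hourly = downsample_hourly(points)
print(hourly[start])         # {'mean': 29.5, 'min': 0.0, 'max': 59.0}
print(hourly[start + 3600])  # {'mean': 89.5, 'min': 60.0, 'max': 119.0}
```

Here 120 detailed points collapse to 2 aggregated ones, which is why the long-term data stays small even when kept for years.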

Similarly, I use Telegraf to collect system data into another db. That is a lot of data since most of it is taken at 15-30s intervals and there are hundreds of items. So I only keep that for a week.

That way, everything is kept manageable and tidy without any effort after the initial setup.

Terminology can also be confusing for InfluxDB beginners - it is helpful to know:

An InfluxDB measurement is similar to an SQL database table.
InfluxDB tags are like indexed columns in an SQL database.
InfluxDB fields are like unindexed columns in an SQL database.
InfluxDB points are similar to SQL rows.

This thread has more details: Need more detailed information on influxdb - General - Node-RED Forum (nodered.org)

(*) Incidentally, you may be wondering about performance. I once had all of the sensor data kept for about 3 years on a Raspberry Pi 2. It was about that time that I began to see performance issues on the Pi.

Bigger devices should have no problems with millions of entries.

BTW, if you do eventually need to go full enterprise mode, note that the version of InfluxDB that fully supports clustering does cost money.

system · 1 April 2022 15:47

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.
