Select files within a date range and merge them into one file

Hello,

I have a directory that contains a csv file for each day. I now want to select all files within a month e.g. May and merge them into one big csv file. The csv files all have the same structure.

I have already experimented a bit with red-contrib-fs-ops and can view all files in the directory and read their stats. Is there a node with parameterizable sorting functions or is this only possible via a function node with it's own code?

What operating system are the files under? And does each file have a column headings line?

Linux and yes, they have a header line. But this line is the same in all files.

Personally I'd do it with a shell script called from a Node-red exec node.

#! /bin/bash
INPUTDIR=/home/pi/data
OUTPUTFILE=/home/pi/data/result.csv

if test -e $OUTPUTFILE
then
  echo $OUTPUTFILE already exists - exiting
  exit 1
fi

if [ "$INPUTDIR" == $(dirname "$OUTPUTFILE") ]
then
  echo warning creating output file in same directory. Please move it!
fi

CSVFILES=$(ls $INPUTDIR/*.csv)    #Maybe need to sort list of files?
HEADERLINE=$(head -1 $(ls $CSVFILES | head -1))
echo $HEADERLINE > $OUTPUTFILE
for file in $CSVFILES
do
  echo processing $file
  grep -v $HEADERLINE $file >> $OUTPUTFILE
done

1 Like

I'm not good at shell programming and don't know how to add the date range selection. So i tried it with this function node but I get in the debug node always msg.combined -> undefined

var combined = [];
    
for (var i = 0; i < msg.files.length; i++) {
    var file = msg.files[i];
    var stat = msg.stats[i];
    var D1 = "01/03/2022";
    var D2 = "31/03/2022";
    var D3 = "15/03/2022";
    
    D1 = new Date(D1);
    D2 = new Date(D2);
    D3 = new Date(D3);
    
    if (D3.getTime() <= D2.getTime()
        && D3.getTime() >= D1.getTime()) {
        
        combined.push({
        path: msg.payload,
        file,
        ...stat
        })
    }
}

msg.combined = combined;
return msg;

What am i doing wrong ? The node works without the date range check. So it must have something to do with it.

I'm not good with javascript but your use of date strings looks iffy.

I think that if you want to pass a short date format to Date() it has to be in YYYY/MM/DD format.
Date("31/03/2022") is invalid.

That being said, I could only get msg.combined == [] never undefined

If you can pass start and end dates as parameters in "YYYYMMDDhhmm" format, I think this variation of the script should work. I have not tested it.

#! /bin/bash
INPUTDIR=/home/pi/data
OUTPUTFILE=/home/pi/data/result.csv
FROMDATE=$1    # Should be YYYYMMDDhhmm eg 202212310000
TODATE=$2

if test -e $OUTPUTFILE
then
  echo $OUTPUTFILE already exists - exiting
  exit 1
fi

if [ "$INPUTDIR" == $(dirname "$OUTPUTFILE") ]
then
  echo warning creating output file in same directory
fi

touch -t $FROMDATE /tmp/start_date
touch -t $TODATE /tmp/end_date
CSVFILES=$(find $INPUTDIR -type f \( -newer /tmp/start_date -a ! -newer /tmp/end_date \))

# ls -l /tmp/start_date /tmp/end_date
# echo $CSVFILES
# exit
HEADERLINE=$(head -1 $(ls $CSVFILES | head -1))
echo $HEADERLINE > $OUTPUTFILE
for file in $CSVFILES
do
  echo $file
  grep -v $HEADERLINE $file >> $OUTPUTFILE
done

Here is another way to do it :grinning:

Using just a function node.
image
You can use fs instead of fs-extra if you like but then you will end up with a more complex first line of the for block. fx-extra is used by Node-RED by the way. I'm using fast-glob here instead of using the fs-ops node.

// Use Fast-Glob to get the list of files
const entries = await fg(['*/csv/*.csv'])
msg.entries = entries

// get the files
msg.files = []
let fline
for (const fname of entries) {
    let f = await fs.readFile(fname, 'utf8')
    f = f.split('\n')
    fline = f.shift() // remove the first line
    f = f.join('\n')

    msg.files.push( f )
}

// Add the first line back and join all the headerless files together
msg.payload = `${fline}\n${msg.files.join('')}`

return msg

It uses a number of interesting modern JavaScript constructs so may be a useful learning exercise as well.

Obviously, this approach as with the others given has a problem. You can run out of memory. You can improve things by not keeping the individual files on the msg as well as the final output but you can still easily run out of memory if handling very large files.

So for large file handling, it is better to use a streaming approach whereby you stream in the files and stream them straight out to the output file.

The other limitation of this approach is that it only works if the columns are exactly the same for each input file.

Incidentally, both Python (via libraries) and PowerShell (no need for libraries) are better at CSV handling than JavaScript is.

Oh, and of course, you should look on npmjs.org for libraries that will do CSV merge directly - there are plenty.

Does that solution include the header line from each file? Is that a problem?

Nope :grinning:

It only adds the header again at the end, it removes it in the loop.

I "cheated" by letting each loop grab the header line. Though of course it doesn't need to. But to only grab it once would need more code and is more bother than it is worth. Maybe if you were dealing with thousands of files it might be worth doing some optimisation. But honestly, I'd use a library to do that and would certainly choose a streaming method.

I added a couple of comments to hopefully make things clearer. It was a fun brain exercise to do at the end of a long week.

Ha! If I was a js native I could have answered my own question without the comments :wink:

Fwiw I don't think my grep solution runs the risk of memory problems either. But I don't think the OP should use it if he is not a bash native.

1 Like

@TotallyInformation First of all, thanks for the help. In comparison my js knowledge is rudimäntär :wink: When i try your suggested solution i always get undefined errors. Therefore a few questions :wink:

Is it correct that the path to the csv file must be specified? Like:

const entries = await fg(['/home/pi/csv/*.csv'])

The date range selection part is not included in your solution or? My csv file looks like this:
I think the first line must be irgnored or?

sep=;
Datum/Uhrzeit;Verbrauchswert;Einheit;Verbrauch in Euro;Einheit;CO2-Ausstoss;Einheit;;Ansicht:;Datum;;24 Stunden;01.03.2022 23:52 Uhr
00:00;0,000;Wh;0,00;Euro;0,000;kg CO2
00:15;0,000;Wh;0,00;Euro;0,000;kg CO2
00:30;0,000;Wh;0,00;Euro;0,000;kg CO2
00:45;0,000;Wh;0,00;Euro;0,000;kg CO2
01:00;0,000;Wh;0,00;Euro;0,000;kg CO2

Yes, that looks fine. That should select all of the CSV files in the CSV sub-folder in /home/pi/ which, if you are logged in as the pi user will be your home folder and accessible as cd ~/ from the command line.

You can always add a line to the function node.warn(entries) after that line and it will dump the list of entries to the debug panel.

No, I didn't do that as I only had a few minutes spare.

However, because I used fast-glob to get the listing, it becomes fairly straight-forwards to change the call to fg to be an array of potential names instead of a single wildcard. But you might not even need that. If you want to get the files for the month of May this year and if your filenames use a sensible format such as YYYYMMDD-something.csv, you can build the glob pattern:

const month = (new Date()).toISOString().slice(0,7).replace('-','')
const entries = await fg([`/home/pi/csv/${month}*.csv`])

Note that I changed the simple single quotes to back-ticks so that I could embed the variable which will be done before the function gets called.

If you wanted more than 1 month, you could have a second entry in the fg array parameter.

I note that in your original code, your dates had / in them and this will not work for file names as the / and \ characters are reserved as folder delineators.

Yes, you will need to drop the first line for each file. Thats what the shift function will do for you so just add another one in the loop but don't bother to assign it to a variable as you don't need it.

const month = (new Date()).toISOString().slice(0,7).replace('-','')
const entries = await fg([`/home/pi/csv/${month}*.csv`])

Stupid question. Where to set the month I want to select?

Yes, you will need to drop the first line for each file

Now I get a long string output :+1: But the rows of the individual files were not added but appended one after the other. My goal is that the values of the columns add up and the rows stay the same. Merging was probably the wrong expression for it.

I can't tell what you are trying to achieve.
Maybe show us some sample data, a couple of files, each having just a few lines, and a file to show what the result should be?

Yes, I have expressed myself a little misleadingly. Am not so good at English :see_no_evil:
The picture should make it more understandable I hope.

The values in the second column are to be added. The other columns should remain the same. The number of rows should not change.

Do you want that to be a fixed month - that means you need to change and redeploy the flow to change it. Or do you want to send a message that changes the month to use?

You've also not actually said what format your filenames are in.

Well that puts a completely different aspect on the whole thing and you didn't mention that before I don't think?

We've been working on the assumption that by the word "merge", you meant that each file represented a different date and therefore you wanted them appended to each other.

To do a calculation on one of the columns, you have to break each line of each file into its component cells. It also requires you to specify one of the columns as a KEY.

Now you really need to tell us what you are trying to achieve because things have just taken 2 orders of magnitude more complex.

It is starting to look like that data belongs in a relational database rather than csv files.

How do you acquire the data - does it arrive as a spreadsheet?

Do you want that to be a fixed month - that means you need to change and redeploy the flow to change it. Or do you want to send a message that changes the month to use?

A message/button to change the month would be perfect.

You've also not actually said what format your filenames are in.

They look like this: 20220301_235257_id00016_24h.csv

Well that puts a completely different aspect on the whole thing and you didn't mention that before I don't think?

No, as I said, I unfortunately expressed myself incorrectly.

Now you really need to tell us what you are trying to achieve because things have just taken 2 orders of magnitude more complex.

I get every day a csv file which shows me the consumption (column 2) in 15 minute steps over the day. At the moment I read the latest file and create a bar chart. But now I would like to show the consumption over a month(Maybe selectable via the dashboard) in 15 minute steps.

I thought I can read the csv files, add the values for column 2 and then pass the payload to my existing nodes for the bar chart display.

Is that more understandable now? Or is something like this not possible with node red?

It is, though I don't know as I'd recommend trying to do it with just Node-RED since it is quite possible you will want to do more processing later on.

As JBudd say's, you should be looking at a database ideally. With Node-RED helping you get the data and push it to the DB. A timeseries DB such as InfluxDB would probably be a good choice. Or a DB that allows multi-dimensional "cube" type processing.

If you really don't want a db. I'd recommend reading each file using the file-in node and converting the data to JSON using the CSV node. Then as you read each file and after conversion, you can add the new data to a global or flow variable.

To be able to select the month, You will be able to do that using Dashboard with an input - you will save the month to another variable for use when you process the file(s).

You probably also need some method of marking each file as processed. Perhaps move them to another folder once you've processed them.

Frankly though, in my opinion (and doubtless others may disagree), you've now stepped beyond what Node-RED can currently do comfortably.

If I were to do this, I would use Microsoft Excel with PowerQuery. If you don't have a license for an up-to-date version of Excel, I'd use a Python Jupyter notebook using datatables with one or more of the common tabular processing libraries such as Pandas and NumPy.