Cleaning CSV data?

I have been working on creating an FTP server with Node Red and upload CSV files to it. Thank all for helping me so far on this quest ...

One thing I am sort of curious to implement when CSV files get appended is create another CSV file that is "cleaned." For example I cant figure out how to get around not appending the first row of the dataset that is the Date and variable name as well as it appears that last row is just blank. In Excel it looks like this:

image

Does this flow seem silly to try for the novice that has yet to fully grasp implementing databases into data acquisition practices?

Just practicing JavaScript functions and running this in Node I am sort of stuck trying to create future function block for Node Red to clean the data.

var MultilineString = `
Date,kW
1/1/2020 0:00,16.4
1/1/2020 0:15,16.8
1/1/2020 0:30,16.8
1/1/2020 0:45,16.8
1/1/2020 1:00,16.8
1/1/2020 1:15,16.4
1/1/2020 1:30,17.6
1/1/2020 1:45,16.4
1/1/2020 2:00,16.8
1/1/2020 2:15,16.8
1/1/2020 2:30,17.2
1/1/2020 2:45,18.8
`;

var csvHeaderDate = 'Date';
var csvHeaderVar = 'kW';
var rawData = MultilineString.split(',');
cleanedCsvData = []

for(var i = 0; i < rawData.length; i++){  
  console.log(rawData[i] === csvHeaderDate);
  console.log(rawData[i] === csvHeaderVar);
  //cleanedCsvData.push(rawData[i]);
}

console.log(cleanedCsvData)

At least in my head the ideal data cleaning would be removing any blank rows as well as leaving the first row where in this case it would be Date,kW but different dataset getting FTP'd to the server could have many columns and names.

Any continued tips greatly appreciated...

You can use the magical CSV node for both converting to and from comma seperated values
(with option in the settings to skip the headers)

Example :

[{"id":"0ce178f0cfcd4901","type":"inject","z":"54efb553244c241f","name":"","props":[{"p":"payload"},{"p":"topic","vt":"str"}],"repeat":"","crontab":"","once":false,"onceDelay":0.1,"topic":"","payload":"","payloadType":"date","x":320,"y":820,"wires":[["b57e15d5e4a4d29d"]]},{"id":"b57e15d5e4a4d29d","type":"function","z":"54efb553244c241f","name":"","func":"msg.payload = `\nDate,kW\n1/1/2020 0:00,16.4\n1/1/2020 0:15,16.8\n1/1/2020 0:30,16.8\n1/1/2020 0:45,16.8\n1/1/2020 1:00,16.8\n1/1/2020 1:15,16.4\n1/1/2020 1:30,17.6\n1/1/2020 1:45,16.4\n1/1/2020 2:00,16.8\n1/1/2020 2:15,16.8\n1/1/2020 2:30,17.2\n1/1/2020 2:45,18.8\n`;\n\nreturn msg","outputs":1,"noerr":0,"initialize":"","finalize":"","libs":[],"x":470,"y":820,"wires":[["285a98dd61b977fb","d68e63091f3cb2c9"]]},{"id":"7f12fdd8afedf14a","type":"debug","z":"54efb553244c241f","name":"2","active":true,"tosidebar":true,"console":false,"tostatus":false,"complete":"payload","targetType":"msg","statusVal":"","statusType":"auto","x":810,"y":820,"wires":[]},{"id":"285a98dd61b977fb","type":"csv","z":"54efb553244c241f","name":"","sep":",","hdrin":true,"hdrout":"none","multi":"mult","ret":"\\n","temp":"Date, KW","skip":"1","strings":true,"include_empty_strings":"","include_null_values":"","x":630,"y":820,"wires":[["7f12fdd8afedf14a","fa014084ec231bb1"]]},{"id":"d68e63091f3cb2c9","type":"debug","z":"54efb553244c241f","name":"1","active":true,"tosidebar":true,"console":false,"tostatus":false,"complete":"payload","targetType":"msg","statusVal":"","statusType":"auto","x":570,"y":760,"wires":[]},{"id":"fa014084ec231bb1","type":"csv","z":"54efb553244c241f","name":"","sep":",","hdrin":"","hdrout":"none","multi":"one","ret":"\\n","temp":"Date, KW","skip":"0","strings":true,"include_empty_strings":"","include_null_values":"","x":820,"y":900,"wires":[["3f88baac437af267"]]},{"id":"3f88baac437af267","type":"debug","z":"54efb553244c241f","name":"reverse","active":true,"tosidebar":true,"console":false,"tostatus":false,"complete":"payload","targetType":"msg","statusVal":"","statusType":"auto","x":970,"y":900,"wires":[]}]

I would definitely recommend using the CSV node rather than using the function node to manage your headers. It's a huge pain in the backside to add columns to a CSV after the fact, however, so I also recommend having your schema thought out first if you have multiple measurements that may end up in the same file.

As for schema considerations, I'd personally use the columns "Date," "Value," and "Unit of Measurement." That way you could ingest additional columns from input files in a way that would yield more rows without having to re-build the entire file (and make it much less storage efficient) if a new column is added in the future.

There is a lot to be said for using a proper database for your data storage, though it wouldn't remove the need to clean your data up before storing it.
Where does your data come from, can you present it without headers, new lines etc?

Regarding databases, I prefer Mariadb (= MySQL), some others seem to like MSSQL. And InfluxDB is specifically designed for time related data like yours.
I don't find it as simple to retrieve and analyse the data with Influx as with MariaDB and SQL.

Edited - possibly clarified :grinning:

FLUX is a really weird query language if you're coming from SQL. I really hated it at first coming from influxQL, which was a lot more SQL-like, but have slowly gotten used to it. The piped notation style it uses actually reminds me a lot of the logic in Node-RED flows. (Now I just wish it had more statistical functions than it currently does, particularly stats functions like kurtosis and skewness, because time series data begs to be used for statistical analysis.)

1 Like

@bbartling I would remove the Date,KW\n using a change node and JSONata expression.
example

[{"id":"d17a881.81b99f8","type":"inject","z":"c791cbc0.84f648","name":"","props":[{"p":"payload"},{"p":"topic","vt":"str"}],"repeat":"","crontab":"","once":false,"onceDelay":0.1,"topic":"","payload":"echo -n \"/public/api/ver1/accounts/new?type=binance&name=binance_account&api_key=XXXXXX&secret=YYYYYY\" | openssl dgst -sha256 -hmac \"NhqPtmdSJYdKjVHjA7PZj4Mge3R5YNiP1e3UZjInClVN65XAbvqqM6A7H5fATj0j\"","payloadType":"str","x":110,"y":860,"wires":[["aae75ed.57968a"]]},{"id":"aae75ed.57968a","type":"template","z":"c791cbc0.84f648","name":"","field":"payload","fieldType":"msg","format":"handlebars","syntax":"mustache","template":"Date,KW\n12/23/23,16.4\n13/34/45,15.6\n14/45/56,14.7\n","output":"str","x":290,"y":900,"wires":[["39f9ee50.cd87aa"]]},{"id":"39f9ee50.cd87aa","type":"change","z":"c791cbc0.84f648","name":"","rules":[{"t":"set","p":"payload","pt":"msg","to":"$substringAfter($$.payload, \"\\n\")","tot":"jsonata"}],"action":"","property":"","from":"","to":"","reg":false,"x":480,"y":900,"wires":[["c4f45a89.bcb1d8"]]},{"id":"c4f45a89.bcb1d8","type":"debug","z":"c791cbc0.84f648","name":"","active":true,"tosidebar":true,"console":false,"tostatus":false,"complete":"false","statusVal":"","statusType":"auto","x":680,"y":860,"wires":[]}]
$substringAfter($$.payload, "\n")

then append to file.

[edit] Unborn sorry pressed wrong reply, Again!

Thank you for your solution once again @E1cid ... Ill have to study up on this expression node! Any chance would you know how to remove blank cells?

For example when the CSV files get appended there is always a row of blank data nothing present.

image

Tell the file node not to append a newline or remove the newline from the original file.

Oh geez thanks for all the help learning a ton!