Hello,
I am glad to share a couple of nodes I have made to allow uploading very large datasets (e.g. gigabytes of CSV) to a back-end (e.g. an SQL database) through Node-RED, with a UI for Node-RED Dashboard.
The first node is a UI widget for Node-RED Dashboard, which streams the data to Node-RED as chunks (text or binary) with proper back-pressure (i.e. waiting for the back-end to be ready before sending more data, to avoid memory exhaustion): https://github.com/alexandrainst/node-red-contrib-ui-upload
The second node is a normal Node-RED node, similar to the default "split" node but more powerful: it accepts a stream of binary or text chunks as input, supports back-pressure (waiting for the rest of the downstream flow to be ready before telling the upstream flow to send more data), and outputs lines of text, either one by one or in batches of lines for higher performance (e.g. when inserting into an SQL database): https://github.com/alexandrainst/node-red-contrib-chunks-to-lines
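To give an idea of how the consuming end of such a flow can look, here is a minimal sketch of a Function node placed between the chunks-to-lines node and an SQL node. The message shapes are assumptions for illustration only (an array of CSV lines arriving in msg.payload, and the SQL node executing the query it finds in msg.topic); the table and column names (my_table, col_a, ...) are made up:

```javascript
// Sketch for a Function node between chunks-to-lines and an SQL node.
// Assumptions (to be adapted to the real message formats of your nodes):
// - the incoming msg.payload is an array of CSV lines (one batch),
// - the SQL node downstream executes the query it finds in msg.topic.
const lines = Array.isArray(msg.payload) ? msg.payload : [String(msg.payload)];

// Very naive value quoting, for illustration only (no quoted CSV fields,
// no full escaping); use a real CSV parser and parameterised queries in practice.
const quote = v => "'" + String(v).replace(/\\/g, '\\\\').replace(/'/g, "\\'") + "'";

const rows = lines
    .filter(line => line.trim().length > 0)                        // skip empty lines
    .map(line => '(' + line.split(',').map(quote).join(',') + ')');

if (rows.length === 0) {
    return null;  // nothing to insert for this batch
}

// One multi-row INSERT per batch is much cheaper than one INSERT per line.
msg.topic = 'INSERT INTO my_table (col_a, col_b, col_c) VALUES ' + rows.join(',') + ';';
return msg;
```

The batch size configured in chunks-to-lines then directly controls how many rows each INSERT carries.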
I hope some of you will find that useful. Feedback welcome!
Best regards, Alkarex
I was just wondering: MySQL and PostgreSQL have the LOAD DATA INFILE and COPY ... FROM ... CSV options (respectively), which insert CSV data in bulk and are extremely fast (at least in MySQL) compared to separate SQL INSERTs.
300,000+ records per second should be doable (depending on the database engine type). You should still split a large CSV into smaller files, though.
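For reference, a rough sketch of what such a bulk load can look like when triggered from Node-RED, assuming a MySQL node that executes the query found in msg.topic; the file path, table and column names are placeholders:

```javascript
// Sketch only: prepare a bulk-load statement for a MySQL node that executes
// the query in msg.topic. The file path, table and column names are placeholders.
// LOAD DATA needs the relevant privileges, and LOCAL must be allowed by both
// server and client. The PostgreSQL equivalent would be something like:
//   COPY my_table FROM '/tmp/data.csv' WITH (FORMAT csv, HEADER true);
msg.topic = `
    LOAD DATA LOCAL INFILE '/tmp/data.csv'
    INTO TABLE my_table
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
    LINES TERMINATED BY '\\n'
    IGNORE 1 LINES
    (col_a, col_b, col_c)`;
return msg;
```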
A word of warning, however: looking at the code for these nodes, I can see they access the internal runtime data structures. These data structures are not part of any published API for nodes to use and might change from time to time.
@bakman2 The database tools you mention are indeed good, and I use them myself, but they first require that you either have the file uploaded to the server somehow (also doubling the amount of disk space used) and/or have some kind of server access (e.g. SSH). Furthermore, the option I propose allows on-the-fly processing of the data, e.g. for custom transformations or filtering before insertion. So those are not for the same type of workflows.
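As an illustration of the kind of on-the-fly processing I mean, a small Function node placed before the insert step could look like this sketch (assuming again an array of CSV lines in msg.payload; the column layout and the filtering rule are made up):

```javascript
// Hypothetical transform/filter step, assuming msg.payload is an array of CSV lines.
const lines = Array.isArray(msg.payload) ? msg.payload : [String(msg.payload)];

msg.payload = lines
    .map(line => line.split(','))                         // naive CSV split, for illustration
    .filter(cols => cols[2] && cols[2].trim() !== '')     // e.g. drop rows with an empty third column
    .map(cols => {
        cols[1] = (cols[1] || '').trim().toLowerCase();   // e.g. normalise the second column
        return cols.join(',');
    });

return msg;
```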
@knolleary Indeed! As far as I can see, there is no "official" mechanism (yet?) in Node-RED for handling back-pressure in a (semi-)automated way (e.g. like piping in Bash/Unix). I believe that with simple conventions (similar to what I have done in those two nodes), such a mechanism could be made "official" and would cope out-of-the-box with larger payloads. I can gladly provide more information, a more precise proposal, and some help if you are interested.
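To make the idea of such a convention a bit more concrete, here is a sketch of one possible shape for it; the message format is purely an assumption of mine, not anything official: the producer sends one chunk and then waits, and a small node at the end of the slow part of the flow sends an acknowledgement back to the producer once that chunk has actually been processed.

```javascript
// Sketch of one possible back-pressure convention (not an official Node-RED
// mechanism): a Function node wired after the slow consumer (e.g. the SQL node)
// and looped back to the producer's input, so that the producer only sends the
// next chunk once the previous one has really been processed.
//
// Conceptual wiring:
//   [upload producer] -> [chunks-to-lines] -> [build INSERT] -> [SQL node] -> [this node]
//           ^                                                                      |
//           +--------------------------- acknowledgement ------------------------<+
//
// The "tick" property name is an assumption made for this sketch.
return { tick: true };
```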
but they first require that you either have the file uploaded to the server somehow (also doubling the amount of disk space used) and/or have some kind of server access (e.g. SSH).
This is true if you use the LOCAL keyword; otherwise, you can serve a file via the http-in/http-response nodes to the MySQL database, which it can directly read. You could then perform transforms on the CSV output to the http-response node, and it would still be faster than separate INSERT commands.
@bakman2 I might have missed something, but as far as I can see, the HTTP-In node can only cope with very tiny files of at most a few megabytes, right? (Same for the CSV node.) Furthermore, although not as fast as native commands, the suggested flow (which inserts 4,000 lines at a time) has a decent (to me) speed of ~30,000 lines per second in my quick test (including network transfer to a humble remote server with a non-SSD disk and a somewhat dated Intel i5 processor).