Darn! Lots less expensive than most middleware. Thank you
And today we find ourselves with contributions in every direction that no one, not even you, could have imagined.
Wow, this tool has become super powerful. Thanks for that
Did you have a garage at that time?
Is there something similar for Kafka? I remember putting together some code to view messages flowing through Kafka and a minimal interface for rewinding consumers. At the time, there were no visual tools for doing this.
There is a Kafka client for NR, and looking at the speed at which you are able to produce things, you'll have this available in no time :')
There are Kafka explorers, according to Google.
The ideas in that tool are very similar to what I put together but they definitely have a far better UI!
Well that's because Node-RED is so easily extended and there are so many JS packages
@gregorius, one option for a web-based Flow Based Programming tool written in Go is Project Flogo, http://www.flogo.io/. In theory it is suitable for ETL and streaming, but its community is really small compared to Node-RED, and it is really hard to get an answer when you have questions.
Thanks for the tip
I had a quick look at the source code but it hasn't been updated in the last two years - do you know whether it is actively maintained?
They changed the GitHub repo, but didn't update the page. The actual GitHub repo is this one:
And the last version is 1.6.7, from November 2023.
You can see a question about this in the old GitHub repo you mention:
Thanks for the info & Sorry for not checking myself
Broken links on the website; mostly text/code based; needs Go for some of the code.
This is something I ran into today as Kafka "explorer":
https://github.com/obsidiandynamics/kafdrop
It seems to be used a lot. Haven't tried it myself, but I will check it out soon.
Lucky you, that looks like a pretty nice tool for Kafka - wish I had it back in the day
Slightly back on topic: I have completed an initial "I am happy with it" version of my ETL pipeline --> here. For that version, I have now created an initial RFC version of a bunch of nodes that help with doing streaming in Node-RED --> the pipestream nodes.
My initial nodes were too specific, i.e., a streaming node that knows that the Zip file contains CSV. Instead, the pipestream streaming nodes can unzip, untar, ungzip, parse CSV, parse JsonL etc., all as separate nodes which are combined into a pipeline:
In the screenshot, the top flow retrieves and stores a .jsonl.bz2 file using streaming, i.e., no data in memory (from web to disk without sending any messages!). The right flow streams the file from disk, unbzips it and then parses it as JsonL - all in a stream. What comes out at the end of the PipeEnd node are JSON objects, in a stream.
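In plain Node.js terms, the right flow boils down to something like this - a minimal sketch only, using gzip instead of bz2 so that core modules suffice (bz2 would need an extra package), and with a made-up file name:

```javascript
// Sketch: stream a compressed JSONL file from disk, decompress it,
// and emit one parsed object per line - nothing is buffered in full.
const fs = require('fs');
const zlib = require('zlib');
const readline = require('readline');

const lines = readline.createInterface({
  input: fs.createReadStream('data.jsonl.gz').pipe(zlib.createGunzip()),
  crlfDelay: Infinity
});

lines.on('line', (line) => {
  if (!line.trim()) return;        // skip blank lines
  const obj = JSON.parse(line);    // one JSON object per line (JSONL)
  // ... here each object could be sent on as an individual msg
});
```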
But wait, all these nodes are connected but they don't actually do anything? Yes, the PipeStart and PipeEnd "flow" is actually a kind of meta flow describing what the PipeEnd node should create as a pipeline. I know this isn't what Node-RED was meant to be but it can do it and all other nodes can be placed in between the stream nodes, there is no limit to what comes between a PipeStart and a PipeEnd.
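To make the meta-flow idea concrete, here is a rough illustration of what PipeEnd could do with the stage descriptions it collects - the stage names and the factory table are invented for illustration, not the actual pipestream internals:

```javascript
// Sketch: turn a list of stage descriptors (the "meta flow") into one
// real Node.js stream pipeline.
const { pipeline } = require('stream');
const fs = require('fs');
const zlib = require('zlib');

// Hypothetical mapping from node type to a stream factory.
const factories = {
  readfile:  (cfg) => fs.createReadStream(cfg.path),
  gunzip:    ()    => zlib.createGunzip(),
  writefile: (cfg) => fs.createWriteStream(cfg.path)
};

// The "flow" collected from the editor, as plain data:
const stages = [
  { type: 'readfile',  path: 'data.jsonl.gz' },
  { type: 'gunzip' },
  { type: 'writefile', path: 'data.jsonl' }
];

pipeline(
  ...stages.map((s) => factories[s.type](s)),
  (err) => { if (err) console.error('pipeline failed', err); }
);
```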
But wait, why not use the inbuilt CSV node and JSON nodes etc.? Because the existing nodes cannot be integrated into a stream. They would all need to be rebuilt to work as a stream. This is for example - don't quote me on this - not possible for the Excel format, since that's not a streaming format. CSV and JsonL are both streaming formats since they consist of independent lines that can be handled individually ... just as gzip, bz2 and zip are all streaming formats.
I encourage more work on streaming for Node-RED and any ideas and suggestions are very much welcome.
Just out of curiosity what are you doing that needs streaming? Or nodes of any great size?
An HTTP request with a 600MB file that gets passed to a write file node. That means a msg object with 600MB is generated between the http request node and the write file node. That causes NR to suffer.
So instead 600mb gets streamed to the file write directly from the http request without going through NR yet it's still built with NR.
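Outside of NR, the streamed version is essentially this - just a sketch, with a placeholder URL and file name:

```javascript
// Sketch: pipe the HTTP response straight to disk, so the 600MB body
// never becomes a msg.payload buffer.
const https = require('https');
const fs = require('fs');
const { pipeline } = require('stream');

https.get('https://example.com/big-dataset.jsonl.bz2', (res) => {
  pipeline(res, fs.createWriteStream('big-dataset.jsonl.bz2'), (err) => {
    if (err) console.error('download failed', err);
    else console.log('done - nothing buffered in memory');
  });
});
```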
This is quite normal for ETL pipelines where large datasets are pushed into data warehouses.
Wow. Great concept!
Are the standard nodes "good enough" to define the stream logic - for their particular share? If this was the case, it should be possible to (re)use their editor interface (to create & collect the logic), but not invoke their runtime code. We've done a similar thing for node-red-mcu
. Following this path, one could integrate the streaming functionality seamlessly into NR - even red-triangle-ing nodes that cannot be part of a stream might be possible quite easily.
This is a very good question; it goes to the heart of the Unix philosophy, of what flow based programming is, and of where the boundaries for tools lie.
For those that don't know: Unix philosophy says do one thing and one thing only and do it well. Flow based programming is the concept of an assembly line where each stage makes a minor modification to the thing on the assembly line. Each stage does the same thing but to different things that are moving along the assembly line.
Node-RED is the combination of those two concepts: data flows along the lines (errr wires) and gets modified by the rectangles (errr nodes) until the modified data flows out. Each node applies its specific modifications to the data that passes through it.
What's the advantage of doing it this way? Complexity is minimised because nodes have very specific and clear responsibilities; secondly, the focus is on data flows and not on text describing algorithmic logic, i.e., code. (Amongst other advantages, such as better communication between techies and non-techies.)
Now the question comes: is the central responsibility of the CSV node to do all things CSV in every environment (msg based or stream based) or does adding stream support to the existing CSV node break the unix philosophy because the current CSV node is a msg-oriented node and not designed for streaming applications?
The CSV node is only a placeholder here, but it's also a good example of the problems involved. I'm using fast-csv because that's the library I found that supports CSV streaming. The current CSV node uses a different library and does not support streaming.
However I c&p'ed most of the UI for the streaming CSV node from the existing CSV node - basically to duplicate its functionality. But since the CsvStream node does not do output, I didn't take the UI for the output part.
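For reference, the pattern the CsvStream node is built around is roughly the documented fast-csv usage - a sketch only, not the node's actual code:

```javascript
// Sketch: parse a CSV file as a stream - each row comes out as its own
// object, the whole file is never held in memory.
const fs = require('fs');
const csv = require('fast-csv');

fs.createReadStream('input.csv')
  .pipe(csv.parse({ headers: true }))
  .on('error', (err) => console.error(err))
  .on('data', (row) => {
    // row is one parsed record, e.g. { name: 'a', value: '1' }
  })
  .on('end', (rowCount) => console.log(`parsed ${rowCount} rows`));
```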
I see just as many valid arguments for integrating streaming into the existing nodes as there are for saying that streaming should be done separately. For me though, the best argument for doing it separately is that I can't go and modify the existing CSV node without hacking the core of Node-RED - remember, the CSV node is integrated into NR.
Also, integrating streaming into Node-RED would require a major rethink of the concept of Node-RED. After all, Node-RED is a pipeline just as streaming is a pipeline, so why aren't Node-RED flows streaming flows by default? Because Node-RED was built before the invention (or rather the popularisation) of streaming. And it doesn't really matter why NR is not streaming by default; the point is that NR went down one road and streaming is a parallel road to that.
My use of "meta programming" - a flow that describes a pipeline that gets executed in the PipeEnd node - is a hack; it's not what NR is meant to be used for. That demonstrates that supporting streaming in NR should be well thought through, since NR can support streaming, but not in a Node-RED way. That, for me, would be another argument to keep it separate: it's possible, but not in the intended way (i.e. the way Node-RED would or should do it).
Sorry for the long rant but it's an important topic
I don't understand it as a rant - rather as a valuable exchange.
My view: You don't need to hack anything in the core at all. PipeStart is the entry point to the streaming flow. If you never call send() there, no (standard) message will run down this flow. The hacking part concentrates on getting the relevant logic from the nodes in-between PipeStart & PipeEnd and executing this in PipeEnd. That's - as you said - meta programming at its best...
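Roughly sketched - the node type names and the single-output assumption are purely illustrative - the collecting part could look like this:

```javascript
// Sketch: walk the flow configuration along the wires from PipeStart,
// gathering each in-between node's settings without running those nodes.
// Assumes the runtime's RED.nodes.eachNode to access raw flow configs.
function collectPipeline(RED, startId) {
  const byId = {};
  RED.nodes.eachNode((n) => { byId[n.id] = n; });   // raw node configs

  const stages = [];
  let current = byId[startId];
  while (current && current.type !== 'PipeEnd') {
    if (current.type !== 'PipeStart') stages.push(current);  // keep its config
    const next = current.wires && current.wires[0] && current.wires[0][0];
    current = next ? byId[next] : null;             // follow first wire only
  }
  return stages;  // PipeEnd can now build the real stream from these configs
}
```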
For me, one of the beautiful aspects of Node-RED lies in the fact that using it for tasks beyond what it is/was intended for is part of its earliest legacy. This might collide with the Unix philosophy - yet it opens endless options, like the one you're exploring here. The key to success is - again, my view - to make those options as accessible for potential users as possible.
Finally - it's your call. It's not to be questioned if you decide it's better to use dedicated nodes...
That's what I meant: I would have to extend the existing CSV node with this logic ... that would be in the core ... then I'd have to wait until a release was made. I'd do that if everything was sorted and everyone was happy with making a core change to Node-RED and having it "natively" support streaming.
The alternative is that PipeEnd has all the logic built in, i.e., it has the streaming logic for the CSV, JSON, YAML ... etc. nodes. This would bloat the PipeEnd node and also completely break the Unix philosophy - which is also about stateless, independent codebases.
110% this! I can only encourage everyone to think of NR as a bunch of lines connecting rectangles! Not as wires and nodes ... a step back can give a different perspective.
Ironically, the Unix philosophy encourages this, since base applications (e.g. ed, more, awk, ...) rarely get touched or extended, so others build things such as gawk, less, sed[1][2]. I think a classic problem is knowing where to draw the line - that's where it gets hard!
For example, I started with web2disk but realised that it would soon become bloated because it combined two responsibilities: http request and file writing. So I split that node up and made it more useful, since the stream-write-to-file can be used independently of the http request streamer.
Having said that, I think NR should not become too bloated with functionality that can be provided by external nodes. My thinking is that it's better to improve or extend the extensibility of NR than to extend its functionality. This is also something that Linux is facing, hence I tend to compare Linux and NR - for better or for worse.
Standing on the shoulders of giants: sure, I can do what I want, but it's better if everyone has a say and it's clear what the ideas are. I put together a prototype to scratch an itch, but perhaps there is a better way to do it, or more itches that can be scratched. (itch == use case ;))
Ironically though, having now learnt that sed is the streaming ed, it does seem clear that Unix would use dedicated nodes for streaming!
[1] = "sed was based on the scripting features of the interactive editor ed" - wikipedia
[2] = btw stream ed - so unix in fact created separate applications to support streaming!
I was wondering whether it would be better to move the actual processing from PipeEnd to a config node. Then every stream-capable node would have a reference to that config node and, by definition, anything that didn't have such a reference would not be a streaming node?
I think the logic of that would be more obvious to editors and flow designers. It also puts the actual processing firmly into an obvious place.
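In rough terms - all the names here are invented, just to illustrate the shape of the idea - the config-node approach could look like:

```javascript
// Sketch: a shared "pipe-context" config node owns the pipeline, and
// every stream-capable node simply registers its stage with it.
module.exports = function (RED) {
  function PipeContextNode(config) {
    RED.nodes.createNode(this, config);
    this.stages = [];                              // stage descriptors collected here
    this.addStage = (stage) => this.stages.push(stage);
  }
  RED.nodes.registerType('pipe-context', PipeContextNode);

  function CsvStreamNode(config) {
    RED.nodes.createNode(this, config);
    const ctx = RED.nodes.getNode(config.pipe);    // reference to the config node
    if (ctx) ctx.addStage({ type: 'csv', options: { headers: config.headers } });
    // no ctx reference => this node is not part of any stream
  }
  RED.nodes.registerType('csv-stream', CsvStreamNode);
};
```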
Which is why, in the other thread I talked about a single capability rather than a single function.