ETL Pipelines with Node-RED?

gregorius · 8 January 2024 00:35

Don't take that too seriously, my sentiment was more related to the fact it was still just one core and not multi-core. But that's Node.js.

I would like to try the same experiment with multiple instances of NR and an MQTT broker between them.

But on the cloning question: am I doing it right in the etltarball node? I clone the message and call send with false. Is that correct?

Edit: The link above now works but the question relates to:

var onFile = (path, content) => {
        var m = RED.util.cloneMessage(msg);
        m.path = path
        m.payload = content
        send(m, false)
};

which is what I call as each file becomes available in the stream. My question is whether this is optimal when called a million times, with path and content being the only changes. The alternative I was using:

var onFile = (path, content) => {
  send({
     ...msg,
     path: path,
     payload: content
  }, false)
};

Is this version OK if nothing else gets modified on the msg? Each msg also contains a hash with configuration details, and this hash would be a shared reference in the {...msg} version, whereas a new copy is made in the cloneMessage version.

So cloneMessage would take longer and need more memory, but the shallow copy would be problematic if anything modifies the hash, since all messages reference the same configuration object.

Is that correct?
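
To make the difference concrete, here is a minimal sketch that runs outside Node-RED. It assumes structuredClone (Node.js 17+) as a stand-in for RED.util.cloneMessage, which only exists inside the runtime, and the msg contents are made up for illustration:

```javascript
// Stand-in for an incoming Node-RED message; `config` plays the shared hash.
// All property values here are hypothetical.
const msg = { _msgid: 'abc', config: { delimiter: ',', skipHeader: true } };

// Shallow copy: top-level properties are copied, nested objects are shared.
const shallow = { ...msg, path: 'a.csv', payload: 'row1' };

// Deep copy: structuredClone is an assumption standing in for
// RED.util.cloneMessage, which is only available inside Node-RED.
const deep = structuredClone(msg);
deep.path = 'b.csv';
deep.payload = 'row2';

// Something downstream mutates the config on the shallow copy...
shallow.config.delimiter = ';';

// ...and the original sees the change, because the hash is shared.
console.log(msg.config.delimiter);          // ';'
console.log(deep.config.delimiter);         // ','
console.log(shallow.config === msg.config); // true
console.log(deep.config === msg.config);    // false
```

So the shallow copy is only safe as long as nothing downstream writes to the nested hash.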

bakman2 · 8 January 2024 06:12

This sentence piques my interest.
I was looking at flowhub, this:

If you click the nR - the client keeps reloading itself.

gregorius · 8 January 2024 07:34

Ah, yes, that's an inject node that reloads the browser when the flow starts... which works sometimes in the real NR. Unfortunately, it works all the time in the in-browser version.

Thanks for the heads up, will fix it in the next version.

The purpose of the node is to automatically reload NR client once the server has been restarted. The server needs restarting because I have installed a new version of the etl nodes.

Edit: it is fixed in the new (and default) version:

[Screenshot: Screen Shot 2024-01-08 at 10.02.32]

gregorius · 8 January 2024 12:39

Ah ok, just understood what you meant - then the counter does actually work

gregorius · 8 January 2024 16:05

Just an update on this, I've got the following going:

[Image: flatmem4]

What this shows is the web2disk node streaming 145MB from web to disk. The disk2stream node then unzips that file, sees a 1.8GB JSONL file and begins to stream it line by line, generating 2M messages that end at the debug node.

The point here is that memory is completely flat the entire time; it's all being streamed. The other thing is that the whole thing took 90s to do.

That's a promising start ...
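
For anyone wondering why memory stays flat, here is a rough sketch of the general technique in plain Node.js. It is not the actual disk2stream code; it assumes a gzip-compressed file (the real archive format may differ), and the function name streamJsonl and the demo data are made up:

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');
const readline = require('node:readline');
const zlib = require('node:zlib');

// Stream a gzipped JSONL file line by line and call onRecord per parsed line.
// Only the current chunk and line are buffered, so memory use does not grow
// with the size of the file.
async function streamJsonl(file, onRecord) {
  const lines = readline.createInterface({
    input: fs.createReadStream(file).pipe(zlib.createGunzip()),
    crlfDelay: Infinity, // treat \r\n as a single line break
  });
  let count = 0;
  for await (const line of lines) {
    if (line.trim() === '') continue; // skip blank lines
    onRecord(JSON.parse(line));
    count += 1;
  }
  return count;
}

// Tiny demo file (hypothetical data) so the sketch runs on its own.
const demoDir = fs.mkdtempSync(path.join(os.tmpdir(), 'jsonl-demo-'));
const demoFile = path.join(demoDir, 'demo.jsonl.gz');
fs.writeFileSync(demoFile, zlib.gzipSync('{"id":1}\n{"id":2}\n{"id":3}\n'));

const seen = [];
const done = streamJsonl(demoFile, (rec) => seen.push(rec.id)).then((count) => {
  console.log(`streamed ${count} records`);
  return count;
});
```

In a Node-RED node the onRecord callback would be where each line gets sent on as a message.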

jsprenkle · 9 January 2024 04:50

It'll be interesting to follow your project.

I'm a professional BizTalk developer. I immediately saw strong parallels between BizTalk and Node-Red. I don't think ETL is as strong a match though, for the reasons you already give.

I was considering implementing features from BizTalk in node-red, but don't really want to bloat it with features it doesn't need.

A message tracking database for debugging complex flows might be useful, but so far I haven't seen many flows complex enough to really need that. I noted OWASP recommends logging as a security measure. Amongst other things, it's hard to detect intrusions without it.

The ability to subscribe to flows would be very powerful too, but with great power comes great responsibility - sorry, couldn't resist

Good luck with the project!

TotallyInformation · 9 January 2024 11:25

One of the nice features about Node-RED is that contributed nodes only have an impact if you install and use them. So you wouldn't be adding bloat if the features were contained in contributed nodes, only if you were suggesting adding to the core.

It would be interesting to hear from you what features you were considering.

Yes, logging of user access should indeed be done in production environments. But this isn't necessarily done in Node-RED itself of course if you are using external IDM which I would always recommend for production instances.

Logging of changes to flows is a more interesting challenge however as I'm not sure that is something that Node-RED currently does? To be honest, I've never checked whether the higher-level logging or audit tracks a user deploying a change to the flows. That is something that really should be checked and if not, would be a useful addition to Node-RED. Though it would, again, need to take into account external IDM tooling so might need to specify a given header for example that would indicate the current user id.

What would that look like? Do you mean a way to get output from a flow? If so, the recommended approach would generally be to use MQTT to handle that aspect. Even if the output was complex, you might use MQTT as a notification service so that other tools would know to get an update query from a database for example.

gregorius · 9 January 2024 11:42

Thank you

This is true but I see Node-RED as a visual front end to NodeJS and obviously NodeJS can definitely handle large data files.

I originally thought ETL and Node-RED wouldn't work, but it's more a case that no one has come up with a good solution. It remains to be seen how well or how far I get this done. Either way, I will learn a lot along the way - that's the most important part!

Already - IMHO - the web2disk node and disk2stream are useful nodes for handling large datasets and they have no requirement for being used in an ETL pipeline. Another advantage of Node-RED: nodes are context free and can be used for anything.

100% this, that's why Node-RED development can be very fast - there is no reason to hack around in the core of NR, you do everything using your own nodes and ideas. And combined with existing nodes, it becomes insanely fast to create something useful.

And the most important point for me is that doing this using a visual approach makes understanding far simpler than a bunch of Python libraries that get glued together in a complex collection of text-based code files. Am not talking about coders/hackers/programmers, am talking about data analysts, stakeholders and product owners.

TotallyInformation · 9 January 2024 11:50

That is absolutely true and why I'm keen to help with your initiatives in this area. Those stakeholders are poorly supported by visual processing tools. What they have are things like Microsoft Power Platform, Tableau and such like. Expensive, difficult to learn and use, and EXPENSIVE!

I have, many times, wished that Python had its own equivalent of Node-RED. Because of the massive support in Python for data analysis and data science, it has all the ETL and other processing libraries that Node.js does not. Its integration with C/C++/Fortran/etc compiled libraries also can make it a very efficient language for large-scale data handling (even though Python itself I don't think is massively efficient?).

gregorius · 9 January 2024 12:07

Don't tempt me

I've been thinking about that ever since I realised that a flow is just a large json file. Perfect intermediate format that can be interpreted by any other programming language. But then I realised how much server-side code there is attached to the nodes ...

I think there have been efforts to do this but the work involved is/would be massive!

I believe that Airflow is probably the closest equivalent in the Python world. Combined with Jupyter and you have the typical ETL pipeline that most companies utilise in one form or another.

One thing I forgot to say was that NR also has the advantage that I can create flows for handling http requests (i.e. I can build a website), create data importers that fill random databases, I can do data conversion, I can create dashboards for my IoT devices, I can set up crontab triggers, I can send emails, ... all in one tool, all in one flow and all visually, using the same coding paradigm.

How many companies use various web frameworks in Ruby/React/Vue/jQuery, use Python/Jupyter for their data analysis and SaaS solutions for customer support? And google sheets for payroll and HR!

TotallyInformation · 9 January 2024 12:26

Except: "Airflow was not built for infinitely running event-based workflows" whereas Node-RED is very much built that way. And it requires its own database engine instance.

This is certainly an advantage for certain types of organisation. But as I'm known to say (or drone on about depending on your point of view!), just because you can, doesn't mean you should. As with security and identity management, production services should use what is efficient and effective to get the job done while choosing technologies that are supportable and stable. In a production environment, I'd rarely suggest Node-RED for delivering a website - at least unless you wanted something highly flexible and data-driven. Even then, I'd certainly put it behind a proxy. Node-RED is fast to develop but will never be super-efficient because it has to carry a lot of baggage to deliver its flexibility. That isn't a criticism, merely an observation of reality.

But that it CAN do all that isn't in doubt and certainly has many advantages in the right situation and for the right kind of organisation.

BTW, anyone running their business off Google tools is asking for trouble - just ask the people who have randomly had their accounts deleted by Google. Not an uncommon experience I might add.

gregorius · 9 January 2024 12:59

You'd be surprised (or rather frightened) by the startup scene in Berlin ... it's google all the way down! Mail, Sheets, Documents, Wikis, Questionnaires, Single-Sign-On .... everything via google.

and you know me: it's Node-RED or the highway!

For me it's not the technology that is important and yes, Node-RED isn't perfect but what Node-RED does extremely well, is the UI and the overview of the code. And that's why I keep going on about Node-RED for everything - I avoid focussing on the technology, rather I focus on the wetware, i.e., humans!

Collaboration and communication amongst folks working on the same project is what I see and what I want to improve. If there was a Python Node-RED, then I might use that instead. Or if there was a web-based Flow Based Programming tool written in Go, then I might use that. The absolute focus for me is to help folks work better together, and FBP is one paradigm that makes coding less mysterious for non-techies and that's a good thing!

jsprenkle · 9 January 2024 13:03

Yes, in part.

BizTalk uses the messaging model adopted by many modern systems.
MQTT, Azure cloud services, ESB, MSMQ, etc. In AWS it's Amazon MQ, and the Apache project has ActiveMQ. Node-Red and your own uiBuilder use them too.

For the unfamiliar, events publish a message with a tag usually called a topic.
One, or many, bits of code subscribe to topics, and can take actions based on events.

For example: The "Fully Kiosk" Android app publishes the status of the Android device it's installed on using MQTT. In my setup, to preserve battery life, a charger controller subscribes and turns the battery charger on when the battery is below 15% charge, and off when above 80%.
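
As a rough sketch of that publish/subscribe idea, the snippet below uses a tiny in-memory bus as a stand-in for a real MQTT broker. The TopicBus class and the topic name tablet/battery are made up for illustration; only the 15% and 80% thresholds come from my setup:

```javascript
// Minimal in-memory topic bus: a stand-in for an MQTT broker, illustration only.
class TopicBus {
  constructor() {
    this.subscribers = new Map(); // topic -> array of handler functions
  }
  subscribe(topic, handler) {
    if (!this.subscribers.has(topic)) this.subscribers.set(topic, []);
    this.subscribers.get(topic).push(handler);
  }
  publish(topic, message) {
    for (const handler of this.subscribers.get(topic) ?? []) handler(message);
  }
}

const bus = new TopicBus();
const chargerLog = []; // records each time the charger is switched
let chargerOn = false;

// Charger controller: on below 15%, off above 80%, unchanged in between.
bus.subscribe('tablet/battery', (level) => {
  if (level < 15 && !chargerOn) chargerOn = true;
  else if (level > 80 && chargerOn) chargerOn = false;
  else return; // no state change, nothing to log
  chargerLog.push({ level, chargerOn });
});

// The device publishes its battery level (hypothetical topic and readings).
for (const level of [50, 14, 40, 81, 60]) bus.publish('tablet/battery', level);

console.log(chargerLog);
```

The publisher knows nothing about the charger; any number of other subscribers could react to the same topic.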

BizTalk adds connectors to integrate large scale systems and format converters to support that. In the case of Node-Red imagine:

Publish messages from your weather station to openweather.org.
Subscribe to the National Weather Service to get local storm warnings.
Have parts of flows running on a cloud service (AI starts to get real here...)

Messages are supported in Node-Red already, but it's not as fully realized as it could be. There's no integrated way to view what's published and who subscribes to it other than digging through the flows with the mark 1 eyeball. Debugging nodes now is done by reading through text. It works fine until the system gets too large. It does nothing to help you analyze failures after they happen.
MQTT explorer is a small start in that direction.

I'm just not sure it's worth the effort to implement for small systems, or that anyone would be interested in it.

TotallyInformation · 9 January 2024 13:37

Been there and got both the tee-shirt (literally I think!) and scars.

OK, now we are getting somewhere. I think this may be worth taking to a new topic if I'm honest. Because I think it is an idea worth thinking about further. And it also mirrors some ideas I'm developing in UIBUILDER as well.

Please do start a new thread and let's talk it through some more

aguida79 · 10 January 2024 00:58

If you want an alternative to NodeRED, that can run Python, maybe you can try Shuffle:

paulkeates · 10 January 2024 14:55

Hi,

I believe (in a deep irony) that long, long ago Node-RED started as a MQTT 'explorer' for an internal project at IBM...

Cheers,

Paul

jsprenkle · 10 January 2024 17:15

LOL! That's great.

Steve-Mcl · 10 January 2024 17:25

from Node-RED about page:

Node-RED started life in early 2013 as a side-project by Nick O'Leary and Dave Conway-Jones of IBM's Emerging Technology Services group. What began as a proof-of-concept for visualising and manipulating mappings between MQTT topics, quickly became a much more general tool that could be easily extended in any direction

So I would hesitate to describe its origins as an "explorer". More like a "visual gateway"?

jsprenkle · 10 January 2024 18:22

Very much like BizTalk in concept. One key difference being that BizTalk uses XML for data versioning and XSLT for transformations. MQTT is pretty agnostic about data. Thanks for sharing

dceejay · 10 January 2024 19:16

Yes - it started as a way to map one MQTT topic to another and transform the data on the way through, only I rather quickly went and ruined Nick's pure vision by hacking in other transports like serial and tcp, and oops, it became Node-RED.
