Introducing node-red-cluster - a complete clustering solution for Node-RED (feedback and alternatives welcome)

Hi everyone,
I’ve recently developed and open-sourced a project called node-red-cluster - a full clustering solution for Node-RED, built around Redis/Valkey.

It aims to make Node-RED horizontally scalable while keeping a familiar workflow for developers.

What it provides

  • Clustered storage - one admin instance (with editor) and multiple workers that automatically sync flows from the admin.
  • Distributed context store - global, flow, and node contexts shared across all nodes via Redis/Valkey, with atomic ops and compression.
  • Leader election - ensures that scheduled jobs or singleton tasks run only once across the cluster, with automatic failover.
  • Package sync - keeps all Node-RED nodes/modules consistent across admin and workers.
  • Docker/Kubernetes examples
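
To make the first two bullets a bit more concrete: on a worker, the wiring goes through Node-RED's standard settings.js hooks. The `storageModule` and `contextStorage` keys below are real Node-RED settings; the module paths and options are placeholders for illustration, not the project's documented API:

```ts
// settings.js on a worker - illustrative sketch only; module paths/options are placeholders
module.exports = {
    // Load flows/credentials from Redis instead of the local filesystem,
    // so every worker runs whatever the admin instance last saved.
    storageModule: require("node-red-cluster/storage"),   // hypothetical entry point

    // Back global/flow/node context with Redis/Valkey so state is shared cluster-wide.
    contextStorage: {
        default: {
            module: require("node-red-cluster/context"),  // hypothetical entry point
            config: { url: process.env.REDIS_URL, role: "worker" }
        }
    }
};
```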

I built this because I couldn’t find a clear or complete clustering solution for Node-RED.
Most discussions or examples focus on partial setups (like shared contexts or replicated flows) but not a full architecture with leader election and plugin sync.

Now that the project is working well in my own tests, I’d love to hear the community’s thoughts:

Open questions / feedback I’m looking for

  • Have you ever needed to scale Node-RED horizontally? How did you approach it?
  • Do you see any better or simpler alternatives to this approach?

Any feedback, criticism, or alternative designs are very welcome.
I’d really like this to be a discussion about how Node-RED could scale more natively in distributed environments.

Thanks a lot for your time and ideas!

9 Likes

@Siphion good work.

What happens if the admin instance dies, or gets so many requests that it starts to drop messages or throttle? I'm asking this question because I understood that the admin has a flow with a node that routes messages to workers. Is my understanding right?

FlowFuse has a cluster solution that enables HA. I never tried it, but maybe you should look at it. I think they solved message distribution across all instances using the network layer with some Kubernetes-native feature. And the sync of flows between instances may have been done like you did, with custom storage and context plugins, but using Postgres instead of Redis.

No, the FlowFuse HA mode isn't that sophisticated today; each instance is its own thing, with shared context and load-balancing of incoming HTTP traffic. You still have to build your flows knowing that multiple copies are running in parallel - so making sure to use things like shared subscriptions in the MQTT nodes etc.

1 Like

To clarify this for any reading along, who, like me, can be confused by the terminology, HA mode is presumably High Availability mode, not Home Assistant.

2 Likes

To clarify - the admin doesn’t route messages to workers. The architecture works differently:

  • Admin role: Only responsible for the flow editor UI and saving flows to Redis. When you save a flow in the editor, it gets written to Redis and a pub/sub notification is sent to all workers.
  • Worker role: Each worker runs its own complete flow execution engine. They all execute the same flows independently - there’s no message routing from admin to workers.

The admin instance is essentially a “control plane” - if it goes down, workers continue executing flows without interruption. The only thing you lose temporarily is the ability to edit flows. Once the admin comes back up, you can resume editing.
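
For anyone curious what that hand-off looks like in practice, here is a minimal sketch of the pattern using ioredis (key and channel names are made up; this is not the project's actual storage code):

```ts
import Redis from "ioredis";

const FLOWS_KEY = "nr:flows";              // assumed key name
const FLOWS_CHANNEL = "nr:flows:changed";  // assumed channel name

// Admin side: persist the flow configuration, then notify every worker.
export async function saveFlows(redis: Redis, flows: object[]): Promise<void> {
  await redis.set(FLOWS_KEY, JSON.stringify(flows));
  await redis.publish(FLOWS_CHANNEL, Date.now().toString());
}

// Worker side: on notification, pull the new revision and hand it to the runtime.
// A subscribed ioredis connection can't run other commands, hence the second client.
export function watchFlows(sub: Redis, data: Redis, onChange: (flows: object[]) => void): void {
  sub.subscribe(FLOWS_CHANNEL);
  sub.on("message", async (channel) => {
    if (channel !== FLOWS_CHANNEL) return;
    const raw = await data.get(FLOWS_KEY);
    if (raw) onChange(JSON.parse(raw));    // e.g. trigger the storage plugin's reload path
  });
}
```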

For incoming requests (HTTP, MQTT, etc.), you’d typically use a standard load balancer (nginx, k8s service, etc.) to distribute traffic across workers. Each worker handles its own messages independently.

The leader election feature is specifically for scenarios where you need singleton behavior - like scheduled jobs (inject nodes with intervals) or tasks that should run exactly once across the cluster, not once per worker.
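
The usual Redis recipe for this is a lock key with a TTL that the current leader keeps renewing; if the leader dies, the key expires and another worker takes over. A rough sketch (key name and timings are assumptions, and node-red-cluster's actual implementation may differ):

```ts
import Redis from "ioredis";

const LEADER_KEY = "nr:cluster:leader";  // assumed key name
const TTL_MS = 10_000;

// Call this on an interval shorter than TTL_MS. Returns true while this worker is leader.
export async function tryAcquireLeadership(redis: Redis, workerId: string): Promise<boolean> {
  // SET ... PX ttl NX: only succeeds if no other worker currently holds the key.
  const acquired = await redis.set(LEADER_KEY, workerId, "PX", TTL_MS, "NX");
  if (acquired === "OK") return true;

  // Already the leader? Renew the TTL. (Simplified - a Lua script would make this atomic.)
  if ((await redis.get(LEADER_KEY)) === workerId) {
    await redis.pexpire(LEADER_KEY, TTL_MS);
    return true;
  }
  return false;
}

// Usage: only the current leader fires the scheduled/inject-style job.
// if (await tryAcquireLeadership(redis, workerId)) runSingletonJob();
```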

My solution tries to provide:

  1. Centralized flow editing (one source of truth)
  2. Automatic worker synchronization
  3. Built-in leader election for singleton tasks
3 Likes

Dope. Really good job. I got confused because there is a flow with worker nodes. Will read more later.

1 Like

@Siphion could you explain why you only sync dependencies that are part of this heuristic instead of copying the same package.json found in the admin? node-red-cluster/src/storage.ts at 9003ea9f2d66a25974ed17addb493523e6525f43 · Siphion/node-red-cluster · GitHub

1 Like

That would have worked perfectly fine!

But the main reason I went with the heuristic is to keep the synced message as small as possible, so workers can reach the correct state as quickly as possible. The smaller the message, the faster it arrives.
I know it's probably negligible in most cases, but I thought there might be projects with very large package.json files where this could make a difference.
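
Roughly, the heuristic looks like this (a simplified sketch, not the exact code in storage.ts) - it keeps only dependency names that look like Node-RED node packages:

```ts
// Simplified sketch of the dependency filter - keeps only "node-looking" package names.
function pickNodePackages(dependencies: Record<string, string>): Record<string, string> {
  return Object.fromEntries(
    Object.entries(dependencies).filter(
      ([name]) => name.startsWith("node-red-contrib-") || name.startsWith("@")
    )
  );
}
```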

2 Likes

Looking at it more carefully, the package.json managed by Node-RED already contains only the packages needed for flows (no devDependencies or extraneous stuff). If I wanted admin-specific packages, I wouldn't install them through the Node-RED editor anyway... I'd install them at the system level outside of Node-RED's package management.
The heuristic could actually cause problems by filtering out legitimate packages that don't match the node-red-contrib-* or @* patterns. For example, a custom package named mycustom-nodes would be excluded.
You're right that simply syncing the entire package.json is simpler, more reliable, and doesn't exclude any valid packages. The message size difference is negligible since we're only sending package names, not the actual package contents.
I'll refactor this to sync the entire package.json instead.

Thanks for catching this! <3

2 Likes

Right, that’s what I meant. When I said “message distribution,” I wasn’t referring to routing Node-RED’s internal runtime messages between instances — that would break object references and class instances. I meant the load balancing of incoming HTTP requests across multiple Node-RED runtimes, which Kubernetes handles at the network layer using round-robin or similar algorithms.

As I understand it, you have a master editor and slave workers - that seems like a good approach to scale compute but not functionality - what if I want X instances of NR all with slightly different functionality? I don't want to criticise, it's just that I took another approach :slight_smile:

I don't know about simpler, but here's an alternative approach. What I did was use the internal Node-RED API for updating flows on an instance.

What I did was create a set of nodes around those APIs to install packages and flows on remote NR instances. Basically the SendFlow and InstallPackage nodes do the heavy lifting.
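
For anyone who wants to roll something similar by hand, the core of it is just two calls against the Node-RED Admin HTTP API; a rough TypeScript sketch (token handling simplified, module name made up):

```ts
// Push a module and a flow configuration to a remote Node-RED instance via its Admin API.
async function deployToRemote(target: string, token: string, flowNodes: object[]): Promise<void> {
  const headers = {
    "Content-Type": "application/json",
    Authorization: `Bearer ${token}`,
  };

  // Install a required node module on the remote instance (POST /nodes).
  await fetch(`${target}/nodes`, {
    method: "POST",
    headers,
    body: JSON.stringify({ module: "node-red-contrib-example" }), // hypothetical module
  });

  // Replace the remote flow configuration (POST /flows).
  await fetch(`${target}/flows`, {
    method: "POST",
    headers: { ...headers, "Node-RED-Deployment-Type": "full" },
    body: JSON.stringify(flowNodes),
  });
}
```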

Example flow that collects together a bunch of flow tabs and then sends them to an external NR installation, including updating packages as required. That flow also uses FlowHub.org nodes to retrieve the required flows, so they do not need to be installed on the existing "master" NR instance. So you end up creating flows that describe flows on an external NR - meta programming for the win, or what the f2k when viewed from the other direction.

Also, something I haven't yet created but could also be useful: linked nodes (like the classic link nodes inside NR) that use MQTT (or some other message bus) to seamlessly transport messages between NR instances. The compute engine flow goes in that direction by sending a flow to a remote instance, having that instance execute the flow, and then passing the result back to the calling NR.
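
Nothing exists for this yet, but the shape of such an MQTT-backed link would be something like the sketch below (topic name and msg shape are just placeholders):

```ts
import mqtt from "mqtt";

// Sketch of an MQTT-backed "link" between two NR instances - not an existing node.
const client = mqtt.connect(process.env.MQTT_URL ?? "mqtt://broker:1883");
const LINK_TOPIC = "nr/link/example";  // assumed topic

// "link out" side: serialise the msg and hand it to the broker.
export function linkOut(msg: { payload: unknown }): void {
  client.publish(LINK_TOPIC, JSON.stringify(msg));
}

// "link in" side: turn broker messages back into msg objects for the local flow.
export function linkIn(onMsg: (msg: { payload: unknown }) => void): void {
  client.subscribe(LINK_TOPIC);
  client.on("message", (_topic, payload) => onMsg(JSON.parse(payload.toString())));
}
```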

The advantage of these approaches is that they use existing pieces of NR to do something different, the disadvantage is that none of this has been tested in the wild - so no idea how maintainable this all is.

So you're not alone in thinking about these things :wink:

P.S. I've also been thinking of giving flow tabs hostnames (as env variables) to ensure that these flows get deployed to specific hosts. Again, one master NR that describes the architecture for several instances of NR.

1 Like

Hi gregorious, thanks for sharing your approach, but that's actually solving a different problem than what node-red-cluster addresses.

Your system is more about orchestration and multi-tenant deployment: one master deploying different flows to different instances with different purposes. That's really useful for managing multiple independent node-red installations.

node-red-cluster is focused on clustering in the traditional sense:

  • Multiple workers executing the same flows
  • Shared state and context across all instances
  • Load balancing and high availability
  • Leader election for singleton tasks (e.g., scheduled jobs that should run only once across the cluster, not once per worker)

They're complementary approaches for different use cases:

  • Your approach: "I have 50 factories, each needs different automation flows"
  • My approach: "I have one application that needs to handle high load and stay up 24/7"

You could use your orchestration system to deploy flows to multiple node-red-cluster admin instances, and each cluster would then distribute those flows to its workers automatically, scale horizontally based on load, and handle HA locally.

So your master orchestrator would see each cluster as a single "logical" NR instance, but under the hood each one is actually a scaled cluster handling high availability and load balancing.

3 Likes

Thank you for the detailed answer and yes we definitely have different approaches.

One thing to note is that High Availability is something that FlowFuse solves for NR. So it would be nice if they chimed in and commented on your approach.

Can I ask: are you using this approach in a production setting? I ask because I am wondering what your use-case is that you go to such lengths to have a 24/7 NR cluster. NR itself is fairly stable and most use-cases for it can deal with downtime. For example, using Kafka as a message broker you can automatically replay messages not yet handled by NR if it goes down.
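
To illustrate the replay point: if the consumer only commits its offset after Node-RED has finished handling a message, a crashed instance simply resumes from the last committed offset on restart, and nothing is lost. A rough kafkajs sketch (broker, topic and group names are made up):

```ts
import { Kafka } from "kafkajs";

const kafka = new Kafka({ clientId: "node-red", brokers: ["kafka:9092"] });
const consumer = kafka.consumer({ groupId: "node-red-flows" });

export async function consume(handle: (value: string) => Promise<void>): Promise<void> {
  await consumer.connect();
  await consumer.subscribe({ topic: "machine-events", fromBeginning: false });
  await consumer.run({
    autoCommit: false,
    eachMessage: async ({ topic, partition, message }) => {
      await handle(message.value?.toString() ?? "");
      // Commit only after successful processing; on a crash, unacknowledged
      // messages are re-delivered instead of being lost.
      await consumer.commitOffsets([
        { topic, partition, offset: (Number(message.offset) + 1).toString() },
      ]);
    },
  });
}
```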

So there are alternative mitigations to the problems caused if NR fails. Which is also something to keep in mind when adding new technology (Redis/Valkey) to make something more stable. It is kind of weird to say "I'm adding this technology to make this other technology more stable" - which happens a lot in the IT industry.

Also, High Availability using a programming language that isn't designed for HA is a strange situation. And I'm not criticising here, I just want to raise awareness. There are languages out there that are designed for HA - for example Erlang, but also Rust. So certain issues go away when using those languages.

I mention Erlang not without reason; my Erlang-Red project is not about HA but about porting visual programming to another community of developers. But what I have learnt is that the Erlang community takes a different approach to high availability by emphasising recovery: define clear steps for how a process can recover if it fails. So there is less emphasis on keeping things running and more on healing systems when something does go wrong.

The advantage of this approach is that there is less error handling code. Basically you code the happy path in Erlang and define recovery steps if it fails. When you then run your code, if you have cases that fail that shouldn't, then you add code to handle that - if you want to prevent recovery and handle the error situation.

2 Likes

Hello, this is so very interesting to read! I'm definitely not on your very skilled and advanced level but I certainly enjoy it. I see a combo of both as a really interesting industrial solution. A long time ago I actually raised questions about when Node-RED would support multi-user and multi-tenancy. I think when it all began, no one really knew how it would evolve into what it has become today.

I have a couple of thoughts and questions, maybe you could enlighten me?

  • in both of your approaches, how do you approach and solve the important data security aspect in the communication between master and workers?
  • in most common IoT solutions data collection (information services) is basic but remote access to a "worker" is missing. So I wonder if it could be possible to add/develop such? Basically a feature to establish a VPN channel directly between the master (computer) and a specific worker?

EDIT: In addition we now also see this, where are we going tomorrow?

Both of these are hand-waving: the underlying VPN can handle that! Normally corporate VPNs are configured to seamlessly provide communication between two endpoints - something akin to http vs. https - for the end user (in this case NR) the difference is an extra 's' in the URL.

In the corporate world, VPNs are the responsibility of the IT department while Node-RED is the responsibility of the developers. So there is no possibility of the developer setting up a VPN; it's done by the IT folks, but they don't set up NR.

These types of questions are great for conferences and meetups, but they rarely play a role in corporates: there are fixed VPNs and security policies that you have to fit into. You just have to get NR working in that environment - full stop.

Access to the remote worker can be done via MQTT/message bus. Or RabbitMQ or Kafka or TCP/IP... If you want to log in to the box hosting the worker, you're doing it wrong - IMHO. I think it's important to realise that Node-RED is all about communication, so whether I'm communicating with a device or NR itself, it's all done via message buses and flows.

I guess tomorrow will be like today, after all, it seems to be working.

IMHO, scaling/HA doesn't primarily come from the programming language (even though Erlang has it "built-in by design"), but mainly from the architecture. I really appreciate Erlang and its great design, but it's not for everyone, and this is demonstrated by the fact that despite being around for over 30 years, it's still not mainstream and lacks the rich library ecosystem of other languages (comparing it to the Node.js ecosystem).

I'm using Node-RED in production in several contexts where we need a data orchestrator, particularly in situations where zero message loss and immediate responses are required.
I also use Apache NiFi in similar situations, but I don't appreciate its verbosity, heaviness, and the architectural complexity required to achieve horizontal scaling/HA.
NR is extremely lightweight, much simpler to build flows via low-code (with a bit of JS), very easy to create plugins, and consequently has a nice selection of plugins in the palette manager (although most are not well maintained).

My goal was to give Node-RED clustering capabilities without changing the conceptual way of working with Node-RED. There are no new entry points or architectural changes from the user's perspective.

I'm not aware of any low-code orchestrator written in Erlang, so I'll follow your project with great interest... I believe it could benefit from node-red-cluster integration! :slight_smile:

On Kafka as an alternative: certainly, setting up a message broker like Kafka (which you'd then need to cluster and make HA, exactly as you would with Redis in my solution) allows Node-RED to scale to some extent. However, if it goes down, you lose:

  • Real-time responses
  • Consumption of critical messages that need to be processed as quickly as possible

That's what Kafka has been doing for about ten years now. Why would you want to replicate that?

A weird thing to be saying when you're adding Redis to NR - that's heaviness that might not be needed. I guess it's a point of view thing - for me, adding Redis to NR is heaviness, for you it's scalability - fair enough.

I could say the same thing about npmjs.com and the NodeJS ecosystem in general. Whether it's most or some or X%, that's the nature of open source - some things lose traction and are then forgotten.

That's not how Erlang works - Erlang processes can communicate amongst themselves out-of-the-box. I can have one Erlang running half-way around the world and another locally on my laptop. I can communicate with all processes as if they are all running locally. Erlang does all the heavy lifting of locating processes and sending them messages - without me having to do anything special.

I simply don't need Redis to provide a communication layer for Erlang-Red - if I wanted to scale horizontally.

And here's the point: Erlang does get used, quite a bit in fact, but unfortunately it's mostly behind NDA doors, people can't talk about it. There's little open source because corporations - at least in the Erlang sphere - don't like sharing their solutions. It's a pity, but it's also because Erlang isn't a web language; it's a telco language and lives in boxes located on poles or street corners. Apparently it's also very popular with gambling sites - for its real-time distributed nature. It's also good for trading, but web isn't its main use case.

Hello, thank you for the feedback. I would like to detail the architectural decisions we made, as they directly address the scaling and security concerns raised.

First, I want to clarify the strategic separation of roles. Developers are responsible for their standard workflow: creating flows, adding devices, and modifying them within their respective Node-RED instances. This core workflow remains unchanged. End Users, however, are the sole focus of our centralized permission system. All their access and modification rights are handled exclusively by the backend, ensuring a clear division between flow development rights and device interaction rights.

To establish secure communication, every Node-RED instance must register with the Django devices table. We employ Asymmetric Cryptography for robust authentication. The device’s public key (be it RSA or elliptic curve) is stored in the Django backend. The corresponding private key remains secret on the worker node and is used to sign a JWT (JSON Web Token). This JWT payload contains crucial information like the device ID, its expiration time, and the creation timestamp. When the worker node initiates the websocket connection, it includes this JWT in the Authorization header as Bearer JWT. This approach allows the Django backend to instantly verify the token's authenticity using the stored public key, ensuring Non-Repudiation and secure delegation of responsibility.
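
To illustrate the worker side of that handshake, here is a rough Node sketch (jsonwebtoken + ws; the key path, URL, and claim names are placeholders, and the backend verifies with the stored public key):

```ts
import { readFileSync } from "node:fs";
import jwt from "jsonwebtoken";
import WebSocket from "ws";

// The private key never leaves the device; the backend only holds the public key.
const privateKey = readFileSync("/etc/node-red/device.key", "utf8");

// Sign a short-lived token identifying this device (RS256 here; ES256 for EC keys).
const token = jwt.sign(
  { deviceId: process.env.DEVICE_ID },
  privateKey,
  { algorithm: "RS256", expiresIn: "5m" }
);

// Open the TLS-protected websocket with the JWT as a Bearer token.
const ws = new WebSocket("wss://backend.example.com/devices/ws", {
  headers: { Authorization: `Bearer ${token}` },
});
ws.on("open", () => console.log("registered with backend"));
```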

Regarding data security, we utilize TLS over the websocket connection, leveraging the same widely tested and trusted security mechanisms as HTTPS. The selection of websocket over MQTT was deliberate: 1) TLS provides superior security compared to older SSL implementations. 2) Websocket is native to the Django ecosystem (via Django Channels), which simplifies our infrastructure. 3) Many open-source MQTT brokers present significant challenges when trying to achieve a truly distributed architecture at scale, a complexity we avoid with websocket management.

Synchronization across multiple concurrent user sessions (e.g., 10 open tabs) is managed via Redis acting as a real-time pub/sub layer. Any signal originating from a Node-RED element is instantly broadcasted by Redis to all clients with read access, maintaining immediate consistency. When a user sends a command (e.g., toggles a switch), the signal enters Node-RED, the flow logic executes, and the resulting output state is broadcasted through Redis to update every user's view. Critical for recovery and initial loading, we allow granular control over state caching: for every element (sensor/actuator), the system defines how many historical states (or last messages) should be retained in the Redis cache. Upon losing connection or opening a fresh browser session, the user's frontend fetches this defined number of cached states (e.g., 20 points for a chart, 1 state for a switch). This guarantees that every user sees the exact same dashboard state, irrespective of their connection history.
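
A simplified sketch of that per-element cache (key names and the retention handling are illustrative only):

```ts
import Redis from "ioredis";

// Record a new state: trim the history to the element's retention limit and fan out live.
export async function recordState(redis: Redis, elementId: string, state: unknown, keep: number): Promise<void> {
  const key = `state:${elementId}`;                 // assumed key layout
  await redis.lpush(key, JSON.stringify(state));
  await redis.ltrim(key, 0, keep - 1);              // keep only the newest N entries
  await redis.publish(`state-changed:${elementId}`, JSON.stringify(state));
}

// On (re)connect: replay the retained history, newest first
// (e.g. 20 points for a chart, 1 state for a switch).
export async function loadHistory(redis: Redis, elementId: string, keep: number): Promise<unknown[]> {
  const raw = await redis.lrange(`state:${elementId}`, 0, keep - 1);
  return raw.map((s) => JSON.parse(s));
}
```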

Finally, addressing the challenge of multi-tenancy and access control in Node-RED: while simple token-based authentication can secure Node-RED externally (as discussed here: [How secure node red by tokens - #3 by tahasamy]), it does not address granular user permissions. Our solution bypasses the complexity of modifying Node-RED's internal permission system entirely. We deploy multiple dedicated Node-RED instances and centralize all access control in our backend (Django or the upcoming Quack Quack microservice). This architecture makes the backend the definitive gatekeeper, enabling us to decide precisely who can access which instance, facilitating clear developer isolation (assigning separate Node-RED instances to different teams). The immediate focus is evolving the "Quack Quack" module into a microservice to effectively manage all communication orchestrations (whether through dedicated Redis, Kafka, or Websocket management), which is vital for true distributed scaling.

1 Like

Just to clarify: my previous message wasn't meant as criticism towards Erlang at all. I actually appreciate Erlang's design a lot.

My comment about "zero message loss and immediate responses" was referring to the consumer side of Kafka, not Kafka itself. Kafka is excellent for durability and zero-loss delivery, but (as you know) it shines when you fire-and-forget a message and you don't need a synchronous reply from the consumer. That's what I meant: Kafka is great for streaming and decoupling, but it's not designed for immediate request/response flows where the producer needs the result immediately.

Regarding the part:

"(although most are not well maintained)."

I was referring specifically to Node-RED nodes, which live in the Node.js/npm ecosystem, not Erlang. Absolutely no criticism intended towards Erlang or its ecosystem.

About:

"A weird thing to be saying when you're adding Redis..."

The "heaviness" I mentioned wasn't about Redis at all, but about the workflow-building experience in Nifi. Nifi is extremely powerful, but the UI and component model tend to be verbose and not as intuitive as NR.
(And in the end, even Apache nifi still requires Apache Zookeeper for clustering.)

Now, regarding this part:

"That's not how Erlang works..."

I think here we slightly misunderstood each other.
What I meant by "it could benefit from node-red-cluster integration" was not "Erlang needs Redis". Rather, I was wondering how the Node-RED instances inside Erlang-Red behave.

From what I understood (please correct me if I'm wrong):

  • Erlang-Red can spawn and orchestrate multiple NR runtimes.
  • Those runtimes do not inherently gain Erlang's fault-tolerance model.
  • If an NR runtime crashes, Erlang can restart it, but the flow that was running inside that specific NR instance is still momentarily down.
  • So the Erlang layer is fault-tolerant, but the NR workers are not distributed/replicated by design.

EDIT:
Sorry, I completely misunderstood how erlang-red works from your initial messages!
Now that I've analyzed the repo, I understand it's actually a complete backend rewrite: each node becomes a native Erlang process, there's no Node.js runtime underneath at all. The NR frontend is just the visual editor, but the entire execution engine is pure Erlang.

So yes, you're right, my approach wouldn't be relevant there. I've scaled Node-RED keeping the original runtime and features intact, while you rewrote the backend from scratch, essentially creating a different application altogether.

I was probably unclear here. The scenario I have in mind is a typical industrial company; let's say they produce and sell machines to customers around the world. Now they want to keep track of how those machines behave:

  • collecting data, a typical IoT topic; I call that information services
  • being able to remotely support the customer, making adjustments to configurations etc.; I call that remote services

So the master is with the manufacturer and the workers are installed in the machines. When the manufacturer needs to support the customer, he connects to the worker via a VPN that is not handled by the customer's IT department.