Hello, thank you for the feedback. I would like to detail the architectural decisions we made, as they directly address the scaling and security concerns raised.
First, I want to clarify the strategic separation of roles. Developers keep their standard workflow: creating flows, adding devices, and modifying them within their respective Node-RED instances. This core workflow remains unchanged. End Users, on the other hand, are handled entirely by our centralized permission system: all of their access and modification rights are enforced exclusively by the backend, giving us a clear division between flow development rights and device interaction rights.
To establish secure communication, every Node-RED instance must first register with the Django backend, which stores the device's public key (RSA or elliptic-curve) in its devices table. We rely on asymmetric cryptography for robust authentication: the corresponding private key never leaves the worker node and is used to sign a JWT (JSON Web Token). The JWT payload carries the essential claims: the device ID, an expiration time, and the creation timestamp. When the worker node opens the websocket connection, it sends this JWT in the Authorization header as a Bearer token. The Django backend can then instantly verify the token's authenticity using the stored public key, which gives us non-repudiation and a secure delegation of responsibility.
Regarding data security, all websocket traffic runs over TLS, so we inherit the same widely tested and trusted mechanisms that protect HTTPS. The choice of websocket over MQTT was deliberate: 1) TLS-secured websockets (wss://) give us modern, well-audited transport security rather than relying on older SSL implementations. 2) Websocket is native to the Django ecosystem (via Django Channels), which simplifies our infrastructure. 3) Many open-source MQTT brokers are difficult to run as a truly distributed cluster at scale, a complexity we avoid by managing the websockets ourselves.
Synchronization across multiple concurrent user sessions (e.g., 10 open tabs) is handled by Redis acting as a real-time pub/sub layer. Any signal originating from a Node-RED element is immediately broadcast by Redis to all clients with read access, keeping every view consistent. When a user sends a command (e.g., toggles a switch), the signal enters Node-RED, the flow logic executes, and the resulting output state is broadcast through Redis to update every user's view. For recovery and initial loading, we allow granular control over state caching: for every element (sensor/actuator), the system defines how many historical states (or last messages) should be retained in the Redis cache. On reconnecting or opening a fresh browser session, the frontend fetches that configured number of cached states (e.g., 20 points for a chart, 1 state for a switch). This guarantees that every user sees exactly the same dashboard state, regardless of their connection history.
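The retention policy can be sketched in a few lines. In production this lives in Redis (e.g., a bounded list per element plus pub/sub fan-out); the in-memory class below, whose names are purely illustrative, only demonstrates the cache-and-replay behavior:

```javascript
// Minimal in-memory stand-in for the Redis state cache: each element
// keeps a configurable number of recent states, and new states are
// fanned out to every subscriber (the pub/sub role Redis plays).
class StateCache {
  constructor() {
    this.history = new Map();     // elementId -> recent states, newest last
    this.limits = new Map();      // elementId -> how many states to retain
    this.subscribers = new Set(); // callbacks for connected sessions
  }

  // Configure per-element retention: 1 state for a switch,
  // e.g. 20 points for a chart.
  register(elementId, retain) {
    this.limits.set(elementId, retain);
    this.history.set(elementId, []);
  }

  // Called when a Node-RED flow emits a new output state: cache it
  // (trimming to the retention limit) and broadcast to all clients.
  publish(elementId, state) {
    const states = this.history.get(elementId);
    states.push(state);
    const retain = this.limits.get(elementId);
    if (states.length > retain) states.splice(0, states.length - retain);
    for (const notify of this.subscribers) notify(elementId, state);
  }

  // A fresh or reconnecting session replays the cached states so every
  // tab converges on the same dashboard view.
  replay(elementId) {
    return [...this.history.get(elementId)];
  }
}

const cache = new StateCache();
cache.register('switch-1', 1); // a switch only needs its last state
cache.register('chart-1', 3);  // a chart keeps a short history
['on', 'off', 'on'].forEach((s) => cache.publish('switch-1', s));
[1, 2, 3, 4].forEach((v) => cache.publish('chart-1', v));
console.log(cache.replay('switch-1')); // [ 'on' ]
console.log(cache.replay('chart-1'));  // [ 2, 3, 4 ]
```

With Redis, the same policy maps naturally onto a bounded list per element, with pub/sub handling the fan-out to open sessions.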
Finally, on multi-tenancy and access control in Node-RED: simple token-based authentication can secure Node-RED externally (as discussed here: [How secure node red by tokens - #3 by tahasamy]), but it does not provide granular per-user permissions. Our solution sidesteps the complexity of modifying Node-RED's internal permission system entirely. We deploy multiple dedicated Node-RED instances and centralize all access control in our backend (Django or the upcoming Quack Quack microservice). This makes the backend the definitive gatekeeper, deciding precisely who can access which instance, and it gives us clean developer isolation (separate Node-RED instances for different teams). The immediate focus is evolving the "Quack Quack" module into a microservice that manages all communication orchestration (whether through dedicated Redis, Kafka, or websocket management), which is vital for true distributed scaling.
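The gatekeeper idea reduces to a single check performed before any request or websocket is routed to an instance. The grant table shape and names below are assumptions for illustration only (in our case this lookup would be backed by the Django database), but they show why Node-RED itself never needs to know about tenants:

```javascript
// Hypothetical per-user grants: user -> { instanceId -> permission }.
// In the real system this mapping lives in the backend's database.
const grants = new Map([
  ['alice', { 'node-red-team-a': 'write' }],
  ['bob',   { 'node-red-team-a': 'read', 'node-red-team-b': 'write' }],
]);

// The routing layer calls this before forwarding anything to an
// instance; an unknown user/instance pair is simply rejected.
function canAccess(user, instance, needed = 'read') {
  const level = (grants.get(user) || {})[instance];
  if (!level) return false;
  return needed === 'read' || level === 'write';
}

console.log(canAccess('alice', 'node-red-team-a', 'write')); // true
console.log(canAccess('alice', 'node-red-team-b'));          // false
```

Because the check happens entirely outside Node-RED, adding a new team is just a new instance plus new rows in the grant table, with no changes to Node-RED itself.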