Performance change in handling messages between versions 1.x and 2.x

Hi, I've been using an older version of Node-RED (1.2.9) for a while and it has been working great. I have flows that handle several thousand messages per second coming from a message broker.

I recently decided to update my version of Node-RED to a newer version (2.2.4) in order to install newer palettes that require Node-RED v2 or greater. Once I did this, I noticed that my Node-RED instance wasn't able to handle the same number of messages per second from the message broker that the older version could.

The flows would appear to run as normal, but changes such as enabling/disabling a debug node, or making an edit and deploying, would not take effect immediately; the flow would appear to keep running with the old configuration.

I can see the memory of the system slowly increasing over time, which leads me to believe that Node-RED is storing the input messages in memory and they are being buffered faster than they can be processed.

I decided to make a test flow to try to reproduce this behaviour in a separate environment, to take the message broker out of the equation. I used Docker to run multiple different versions starting from 1.2.9. I determined that the flow works as I expect in v1.2.9 through v1.3.5, but something changed in the v2.0.0 release that causes this behaviour.

Is this just an environment issue? I've tested on a couple of different devices with different resources and the issue is reproducible on both. Is there a setting I could change that would fix this?

Here is the test flow that I've made:

[{"id":"21f1836d8aed02aa","type":"inject","z":"f6f2187d.f17ca8","name":"","props":[{"p":"payload"},{"p":"topic","vt":"str"}],"repeat":"0.001","crontab":"","once":false,"onceDelay":0.1,"topic":"","payloadType":"date","x":130,"y":220,"wires":[["1fc8f3ec1b0af4ba"]]},{"id":"8d2b2bfd70b7107e","type":"debug","z":"f6f2187d.f17ca8","name":"","active":false,"tosidebar":true,"console":false,"tostatus":false,"complete":"payload","targetType":"msg","statusVal":"","statusType":"auto","x":510,"y":220,"wires":[]},{"id":"3ede02d65f3f6a01","type":"inject","z":"f6f2187d.f17ca8","name":"","props":[{"p":"payload"},{"p":"topic","vt":"str"}],"repeat":"0.001","crontab":"","once":false,"onceDelay":0.1,"topic":"","payloadType":"date","x":130,"y":280,"wires":[["1b5ea672305cd65e"]]},{"id":"5a96c53bae933a8b","type":"inject","z":"f6f2187d.f17ca8","name":"","props":[{"p":"payload"},{"p":"topic","vt":"str"}],"repeat":"0.001","crontab":"","once":false,"onceDelay":0.1,"topic":"","payload":"","payloadType":"date","x":130,"y":340,"wires":[["0c04a913fe01dd88"]]},{"id":"6c84a07a16070fb5","type":"inject","z":"f6f2187d.f17ca8","name":"","props":[{"p":"payload"},{"p":"topic","vt":"str"}],"repeat":"0.001","crontab":"","once":false,"onceDelay":0.1,"topic":"","payload":"","payloadType":"date","x":130,"y":400,"wires":[["38389c8db37b777f"]]},{"id":"dd6c0c01e39ae2a9","type":"inject","z":"f6f2187d.f17ca8","name":"","props":[{"p":"payload"},{"p":"topic","vt":"str"}],"repeat":"0.001","crontab":"","once":false,"onceDelay":0.1,"topic":"","payload":"","payloadType":"date","x":130,"y":460,"wires":[["387a397b2674643a"]]},{"id":"1fc8f3ec1b0af4ba","type":"function","z":"f6f2187d.f17ca8","name":"","func":"let ctx = context.get(\"ctx\") || 0;\n\nctx = (ctx + 1) % 10000;\nnode.status({fill:\"green\",shape:\"dot\",text:ctx});\n\ncontext.set(\"ctx\", ctx);\n\nreturn msg;","outputs":1,"noerr":0,"initialize":"// Code added here will be run once\n// whenever the node is started.\ncontext.set(\"ctx\", 0);\nnode.status({fill:\"green\",shape:\"dot\",text:0});","finalize":"","x":320,"y":220,"wires":[["8d2b2bfd70b7107e"]]},{"id":"1b5ea672305cd65e","type":"function","z":"f6f2187d.f17ca8","name":"","func":"let ctx = context.get(\"ctx\") || 0;\n\nctx = (ctx + 1) % 10000;\nnode.status({fill:\"green\",shape:\"dot\",text:ctx});\n\ncontext.set(\"ctx\", ctx);\n\nreturn msg;","outputs":1,"noerr":0,"initialize":"// Code added here will be run once\n// whenever the node is started.\ncontext.set(\"ctx\", 0);\nnode.status({fill:\"green\",shape:\"dot\",text:0});","finalize":"","x":320,"y":280,"wires":[["3c04f5df57f62903"]]},{"id":"0c04a913fe01dd88","type":"function","z":"f6f2187d.f17ca8","name":"","func":"let ctx = context.get(\"ctx\") || 0;\n\nctx = (ctx + 1) % 10000;\nnode.status({fill:\"green\",shape:\"dot\",text:ctx});\n\ncontext.set(\"ctx\", ctx);\n\nreturn msg;","outputs":1,"noerr":0,"initialize":"// Code added here will be run once\n// whenever the node is started.\ncontext.set(\"ctx\", 0);\nnode.status({fill:\"green\",shape:\"dot\",text:0});","finalize":"","x":320,"y":340,"wires":[["23e3b5c2ac967f22"]]},{"id":"38389c8db37b777f","type":"function","z":"f6f2187d.f17ca8","name":"","func":"let ctx = context.get(\"ctx\") || 0;\n\nctx = (ctx + 1) % 10000;\nnode.status({fill:\"green\",shape:\"dot\",text:ctx});\n\ncontext.set(\"ctx\", ctx);\n\nreturn msg;","outputs":1,"noerr":0,"initialize":"// Code added here will be run once\n// whenever the node is started.\ncontext.set(\"ctx\", 
0);\nnode.status({fill:\"green\",shape:\"dot\",text:0});","finalize":"","x":320,"y":400,"wires":[["55b9465d93c9451c"]]},{"id":"387a397b2674643a","type":"function","z":"f6f2187d.f17ca8","name":"","func":"let ctx = context.get(\"ctx\") || 0;\n\nctx = (ctx + 1) % 10000;\nnode.status({fill:\"green\",shape:\"dot\",text:ctx});\n\ncontext.set(\"ctx\", ctx);\n\nreturn msg;","outputs":1,"noerr":0,"initialize":"// Code added here will be run once\n// whenever the node is started.\ncontext.set(\"ctx\", 0);\nnode.status({fill:\"green\",shape:\"dot\",text:0});","finalize":"","x":320,"y":460,"wires":[["e8990acfc90d5a7f"]]},{"id":"3c04f5df57f62903","type":"debug","z":"f6f2187d.f17ca8","name":"","active":false,"tosidebar":true,"console":false,"tostatus":false,"complete":"payload","targetType":"msg","statusVal":"","statusType":"auto","x":510,"y":280,"wires":[]},{"id":"23e3b5c2ac967f22","type":"debug","z":"f6f2187d.f17ca8","name":"","active":false,"tosidebar":true,"console":false,"tostatus":false,"complete":"payload","targetType":"msg","statusVal":"","statusType":"auto","x":510,"y":340,"wires":[]},{"id":"55b9465d93c9451c","type":"debug","z":"f6f2187d.f17ca8","name":"","active":false,"tosidebar":true,"console":false,"tostatus":false,"complete":"payload","targetType":"msg","statusVal":"","statusType":"auto","x":510,"y":400,"wires":[]},{"id":"e8990acfc90d5a7f","type":"debug","z":"f6f2187d.f17ca8","name":"","active":false,"tosidebar":true,"console":false,"tostatus":false,"complete":"payload","targetType":"msg","statusVal":"","statusType":"auto","x":510,"y":460,"wires":[]}]

Each branch has a function node that keeps a number in context memory and increments it every time it receives a message. I display this number as the node status so it is easier to observe whether the function is still receiving messages.
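
For reference, this is the code inside each of the function nodes, taken straight from the flow above. The On Start tab resets the counter, and the On Message tab increments it and updates the node status:

// On Start
context.set("ctx", 0);
node.status({fill:"green",shape:"dot",text:0});

// On Message
let ctx = context.get("ctx") || 0;
ctx = (ctx + 1) % 10000;
node.status({fill:"green",shape:"dot",text:ctx});
context.set("ctx", ctx);
return msg;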

To reproduce the behaviour, I did the following:

  • Ran the Docker image with: docker run -d -p 1880:1880 nodered/node-red:<version> (any version 2.0.0 or above should show the issue)
  • Imported the JSON above into a flow and deployed it
  • Made a change to the flow, such as removing a wire from one of the inject nodes, and deployed
  • Expected result: the function nodes reset their context and start counting up from 0 again (this is what happens in v1.2.9)
  • Actual result: the function nodes continue to count up from their current context values. Even the function node that was disconnected from its inject node continues to count.

I tried playing around with the inject node intervals and adding/removing copies of the flow. I've found that in v2.0.0 and above I start to see the behaviour at around 1,000 messages per second, whereas the older versions don't seem to have any issues.

Any help to point me in the right direction on what's happening here would be greatly appreciated! This is my first post, so if there's anything I've missed or any additional information is needed, please let me know.

Thanks,

Justin

Have you tried v3.0.2? If not, I think you need to, I'm afraid, since v2 is no longer being developed.

I seem to remember that it may have been around the v2 timeframe that the handling of nodes with multiple output wires from the same port was changed? In order to reduce problems with async nodes making changes that impact downstream node output, any node that has more than one wire on the same output port forces a clone of the message, and that is certainly slower. I've not tried your flow, but that might be something to look for?
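
Not the actual runtime code, but roughly the kind of extra work involved. In a Function node you can make the same cost visible by cloning explicitly (RED.util.cloneMessage is the documented helper for deep-copying a message) - a sketch only:

// Send the same message down two paths from one output port.
// The runtime does an equivalent deep clone for every extra wire on a port.
const copy = RED.util.cloneMessage(msg);   // deep clone - relatively expensive
node.send([[msg, copy]]);                  // two messages out of output 1
return null;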

Did the change to the way messages flow through the nodes happen at that time too? It used to be, if I remember correctly, that a message flowing through a series of nodes would flow right through until it completed or came to some async code. The change was that each message moves on by only one node at a time, then the next message waiting in the system (for whatever node) gets a look in, and so on. I can't find the detailed explanation of that now. It is conceivable that could result in performance differences.

The changes around message cloning were introduced in Node-RED 1.0 - Cloning messages in a flow : Node-RED - so they will not be relevant here.

Given 2.0 was published 2 years ago, I cannot immediately think what was in there that would explain this.

I think we'd need to try to reproduce it (thanks for the steps to do so) and investigate a bit more. As mentioned, 2.x is in maintenance mode so our focus will be much more on 3.x - but that major version bump was more about dropping support for older Node.js versions than about any major internal changes to the runtime.

There is no "real" change in the performance of the runtime, but I suspect you think there is due to how the updates are coming from the Node-RED runtime to your browser.

As you can see from the two GIFs below, the client-side status messages are updated faster in the older version than in V2+.

[GIF: chrome_4OCOg8AtDB]

[GIF: WindowsTerminal_NPBteHp3n2]

I did try it with version 3.0.2 and I get the same result as with any 2.x version.

I saw the article regarding the move to async between versions. In my test flow I avoided any nodes with multiple wires at the input or output. But as I understand it, this should really just affect the order in which nodes are processed when you have branching paths.

Thanks for testing this out! This is great.

You are right, I am basing my theory only on the frontend page. It appears that there is a loss of control when deploying, and the flow is "stuck" processing the old messages. Could this be because the UI now attempts to "replay" each message on the browser side?

The other impact I mentioned briefly in my original post is that I see an increase in memory over time while letting this flow run. For me this causes an issue where Node-RED will eventually crash due to a heap allocation error. On my actual instance running in production, I run it with the node command and pass in the --max-old-space-size flag to limit the memory Node-RED has to work with. This is done because there are other services/applications running on the same device and I want to make sure that no single service/app hogs all of the memory.
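
For reference, the start command looks roughly like node --max-old-space-size=256 node_modules/node-red/red.js (the exact value and path depend on the install).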

I will do some more testing to see if the impact is only on the frontend.

Have you actually confirmed that you do eventually get a crash, or does it eventually top out at some value?
What memory is it that you see increasing?

Yes, I have seen the process crash because of this. Admittedly, it is a small amount (256 MB) that I am giving the application.

The memory that is increasing is the RES (resident) memory of the node process running Node-RED when I look at it using htop.

I can increase this and let it run to see if the memory does settle, or if it continues to increase over time.

Is it heavily using the processor? Is there a significant amount of swap being used?
If you look at the docs you may find that max-old-space-size does not actually do what you think it does.
However, if you do see it crash eventually, and the processor is not being heavily utilised, then it may be that you do have a memory leak somewhere. Again, you should probably make sure you are using the latest versions of all nodes, and upgrade to the latest Node-RED.

To be clear, all that flag does is change the heap limit that the V8 JavaScript engine uses to decide when to run its garbage collection. Increasing it can sometimes have an unfortunate side-effect of creating pauses in your application, because GC waits until a larger amount of memory needs recovering and that (AFAIK) stops the main processing loop until it completes.

No, the CPU is not maxed out, it is under 50%. There is no swap memory being used.

Noted about the flag, I will have a look at the docs. I did another test using a Docker container and limiting the memory available to the container, so the flag is not used.

  • Ran docker run -dit -p 1880:1880 -m 512m nodered/node-red:3.0.2
  • Imported the flow and deployed
  • Let it run; eventually the container exits
  • Ran docker logs <id>, which shows the following:
Welcome to Node-RED
===================

2 Mar 16:06:34 - [info] Node-RED version: v3.0.2
2 Mar 16:06:34 - [info] Node.js  version: v16.16.0
2 Mar 16:06:34 - [info] Linux 5.10.102.1-microsoft-standard-WSL2 x64 LE
2 Mar 16:06:35 - [info] Loading palette nodes
2 Mar 16:06:35 - [info] Settings file  : /data/settings.js
2 Mar 16:06:35 - [info] Context store  : 'default' [module=memory]
2 Mar 16:06:35 - [info] User directory : /data
2 Mar 16:06:35 - [warn] Projects disabled : editorTheme.projects.enabled=false
2 Mar 16:06:35 - [info] Flows file     : /data/flows.json
2 Mar 16:06:35 - [warn]

---------------------------------------------------------------------
Your flow credentials file is encrypted using a system-generated key.

If the system-generated key is lost for any reason, your credentials
file will not be recoverable, you will have to delete it and re-enter
your credentials.

You should set your own key using the 'credentialSecret' option in
your settings file. Node-RED will then re-encrypt your credentials
file using your chosen key the next time you deploy a change.
---------------------------------------------------------------------

2 Mar 16:06:35 - [info] Server now running at http://127.0.0.1:1880/
2 Mar 16:06:35 - [warn] Encrypted credentials not found
2 Mar 16:06:35 - [info] Starting flows
2 Mar 16:06:35 - [info] Started flows
2 Mar 16:06:53 - [info] Stopping flows
2 Mar 16:06:53 - [info] Stopped flows
2 Mar 16:06:53 - [info] Updated flows
2 Mar 16:06:53 - [info] Starting flows
2 Mar 16:06:53 - [info] Started flows

<--- Last few GCs --->

[7:0x7f28971573b0]   759068 ms: Mark-sweep 243.5 (258.3) -> 242.2 (258.3) MB, 256.3 / 0.0 ms  (average mu = 0.175, current mu = 0.106) allocation failure scavenge might not succeed
[7:0x7f28971573b0]   759330 ms: Mark-sweep 243.3 (258.3) -> 242.2 (258.0) MB, 233.0 / 0.0 ms  (average mu = 0.144, current mu = 0.108) allocation failure scavenge might not succeed


<--- JS stacktrace --->

FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory
./entrypoint.sh: line 14:     7 Aborted                 /usr/local/bin/node $NODEOPTIONS node_modules/node-red/red.js --userDir /data $FLOWS

Could this be because node.status is called on every message by the function nodes?

No, not unless there is a so far undetected bug in that area. It is perfectly normal to update the status on each message.

Have you checked that you are using the latest versions of all nodes you have installed manually?

What additional nodes have you installed? You can look in the file package.json to see that.

Are you manipulating any large objects? Images for example, or files?

The logs I shared are from the docker container I ran yesterday. Fresh container, no palettes installed. No large objects, files or images involved. All I did was import the flow shared in the original post and deploy. You should be able to test this and reproduce.

I did two tests.

  1. Deploy the flow and let it run. Notice a steady increase in the container's memory usage over time.


    Only at ~200MB when I took this screenshot, but it increases steadily and would continue to increase if I let it run.

  2. Modify the flow by commenting out the call to node.status in all 5 function nodes and deploy (see the snippet after this list).


    Now notice that the container's memory usage has stopped increasing. It remains steady at ~200MB for a while, and then starts to decrease and eventually settles around ~100MB.
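
The only change for test 2 was commenting out the status line in each function node, i.e.:

let ctx = context.get("ctx") || 0;
ctx = (ctx + 1) % 10000;
// node.status({fill:"green",shape:"dot",text:ctx});   // commented out for test 2
context.set("ctx", ctx);
return msg;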

The nodes I am using on my production instance also update node status on every message, and it can be receiving several thousand messages per second from the message broker it is connected to.
For me, this solves the memory issue as I can update the nodes that are doing this. I figured I'd share my test results with you in case it is valuable info.

Which nodes in particular are causing this? Is it just function nodes? (In general we try to use status to indicate things like connectedness or errors rather than per-message state.)

If you know you are handling high message rates then maybe only update the status when the change is more than a full percent - e.g. compare with the previous value and only set the status when necessary. Or only update once per second. Etc.
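
For example, something along these lines in a Function node - just a sketch, using a "lastStatus" context key of my own choosing:

let ctx = context.get("ctx") || 0;
ctx = (ctx + 1) % 10000;
context.set("ctx", ctx);

// Only push a status update at most once per second
const now = Date.now();
const last = context.get("lastStatus") || 0;
if (now - last >= 1000) {
    node.status({fill:"green",shape:"dot",text:ctx});
    context.set("lastStatus", now);
}
return msg;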

I don't use Docker.

Does the memory continue increasing if you close the browser?

What if you halve the update rate? If it is a memory leak then it should take twice as long to get to the crash, but it should still crash.

In the example, it appears the function node is directly related to the increase in memory, and it is the constant calls to node.status that are the culprit. On my production instance there are some custom palettes that do the same thing; I can update these.

Understood, I intend to follow this as best practice and only make a call to node.status when there is a meaningful update to be displayed.

No, when I close the browser the memory stops increasing.

This is exactly what happens in my experience. I have played with the update rates and find that at lower rates the memory is stable, but at a certain threshold the memory starts to increase (with the browser open).

Based on that description of what you're seeing, I suspect it is simply swamping the comms link from the runtime to the browser and the status updates are backing up.

I would agree with this. Thanks for helping me understand what is happening. I have a way to avoid this issue by making sure none of the nodes are calling node.status too frequently.
