Memory leak. Heap dump shows it may be related to CommsConnection. How to pinpoint the culprit?

First, some general info
I have several flows, each with multiple MQTT in and MQTT out nodes. There are also multiple queries to a MySQL database, using node-red-node-mysql. About 250 incoming messages are processed per second, all flows combined. The flows have run without problems for several months now. A few weeks ago, however, node-red started crashing due to what appears to be a memory leak. I did not update node-red or any nodes in the weeks prior to the start of the crashes.

Here's the stack trace from the logs at the point of a crash, which I'll add for completeness. It doesn't tell me much (I think), except that the garbage collector isn't succeeding in keeping the memory from growing.

Mar 21 00:29:59 node-red[319284]: <--- Last few GCs --->
Mar 21 00:29:59 node-red[319284]: [319284:0x5877b90] 27751590 ms: Scavenge 3726.7 (4126.7) -> 3713.7 (4126.7) MB, 8.86 / 0.00 ms  (average mu = 0.343, current mu = >
Mar 21 00:29:59 node-red[319284]: [319284:0x5877b90] 27751637 ms: Scavenge 3728.1 (4126.7) -> 3714.5 (4128.0) MB, 7.46 / 0.00 ms  (average mu = 0.343, current mu = >
Mar 21 00:29:59 node-red[319284]: [319284:0x5877b90] 27751686 ms: Scavenge 3729.6 (4128.0) -> 3715.3 (4128.5) MB, 7.73 / 0.00 ms  (average mu = 0.343, current mu = >
Mar 21 00:29:59 node-red[319284]: <--- JS stacktrace --->
Mar 21 00:29:59 node-red[319284]: FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory
Mar 21 00:29:59 node-red[319284]: ----- Native stack trace -----
Mar 21 00:29:59 node-red[319284]:  1: 0xca5580 node::Abort() [node-red]
Mar 21 00:29:59 node-red[319284]:  2: 0xb781f9  [node-red]
Mar 21 00:29:59 node-red[319284]:  3: 0xeca4d0 v8::Utils::ReportOOMFailure(v8::internal::Isolate*, char const*, v8::OOMDetails const&) [node-red]
Mar 21 00:29:59 node-red[319284]:  4: 0xeca7b7 v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, v8::OOMDetails const&) [node-red]
Mar 21 00:29:59 node-red[319284]:  5: 0x10dc505  [node-red]
Mar 21 00:29:59 node-red[319284]:  6: 0x10dca94 v8::internal::Heap::RecomputeLimits(v8::internal::GarbageCollector) [node-red]
Mar 21 00:29:59 node-red[319284]:  7: 0x10f3984 v8::internal::Heap::PerformGarbageCollection(v8::internal::GarbageCollector, v8::internal::GarbageCollectionReason, >
Mar 21 00:29:59 node-red[319284]:  8: 0x10f419c v8::internal::Heap::CollectGarbage(v8::internal::AllocationSpace, v8::internal::GarbageCollectionReason, v8::GCCallb>
Mar 21 00:29:59 node-red[319284]:  9: 0x10ca4a1 v8::internal::HeapAllocator::AllocateRawWithLightRetrySlowPath(int, v8::internal::AllocationType, v8::internal::Allo>
Mar 21 00:29:59 node-red[319284]: 10: 0x10cb635 v8::internal::HeapAllocator::AllocateRawWithRetryOrFailSlowPath(int, v8::internal::AllocationType, v8::internal::All>
Mar 21 00:29:59 node-red[319284]: 11: 0x10a8c86 v8::internal::Factory::NewFillerObject(int, v8::internal::AllocationAlignment, v8::internal::AllocationType, v8::int>
Mar 21 00:29:59 node-red[319284]: 12: 0x1503a16 v8::internal::Runtime_AllocateInYoungGeneration(int, unsigned long*, v8::internal::Isolate*) [node-red]
Mar 21 00:29:59 node-red[319284]: 13: 0x7f2a3fed9ef6

Things I've tried
I was using node-red v1.3.7. I've now upgraded to v3.1.7, based on some other forum posts related to memory leaks. Unfortunately, that didn't change the behavior.

I've tried increasing --max_old_space_size from 512 to 2048 MB. This just delays the crash (the crashes happen about every 8 hours at this point). I also clearly notice a performance drop long before the crash actually happens; that's the bigger issue.

I'm using some flow context here and there, so I checked its size using a small code snippet. The flow context size is well below 1 MB in each flow.
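For reference, something along these lines in a function node is the kind of check I mean (a rough sketch: stringifying each key only gives an approximate size and won't handle Buffers or circular references well):

```javascript
// Rough estimate of the flow context size: stringify each key and add up the lengths.
// Triggered by an inject node; the result is reported via node.warn().
let total = 0;
for (const key of flow.keys()) {
    try {
        total += JSON.stringify(flow.get(key)).length;
    } catch (e) {
        node.warn(`could not stringify flow.${key}: ${e.message}`);
    }
}
node.warn(`approx. flow context size: ${(total / 1024).toFixed(1)} KB`);
return msg;
```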

I've tried limiting the number of incoming messages from MQTT with a delay node, set to 'limit messages per topic'. I set it so that the number of incoming messages is reduced by 80%. This actually removed the issue! Memory hovers around 400 MB, as it did before all the issues, and has stayed that way for 24 hours :grinning: Note that reducing the incoming messages by only 50% was not enough.

Naturally, at this point I assumed there was an inefficiency somewhere in my flows, causing messages to build up at a certain node. I've placed (many!) debug nodes set to 'count the number of messages', usually right before and after the SQL queries. However, I'm not seeing the expected buildup of messages anywhere, not even when things have already slowed down considerably.
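Another thing I'm considering, to watch for a queue building up in front of the mysql node: a pair of function nodes keeping an in-flight counter in flow context (just a sketch; the context key name is a placeholder):

```javascript
// Function node placed directly BEFORE the mysql node:
// count one more query as "in flight".
flow.set('inflightQueries', (flow.get('inflightQueries') || 0) + 1);
return msg;
```

```javascript
// Function node placed directly AFTER the mysql node:
// count the query as finished and show the backlog in the node status.
const inflight = Math.max(0, (flow.get('inflightQueries') || 0) - 1);
flow.set('inflightQueries', inflight);
node.status({ text: `queries in flight: ${inflight}` });
return msg;
```

If that counter keeps climbing while the flows slow down, messages are piling up in front of the database after all.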

The heap dump
In an attempt to get more clues, I took 3 heap dumps with node-red-contrib-heap-dump. The first was taken 5 minutes after startup, the others 20 and 30 minutes after startup. The first screenshot shows that after 30 minutes, there are 1.3 million more objects than at the first point in time, corresponding to a growth of about 57 MB.

Note: using top in Linux, I can see that the difference in memory used by node-red between the same two points in time is a lot larger: about 600 MB. I'm not sure what to make of this.
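One thing I plan to try, to see whether the extra memory lives outside the V8 heap (Buffers, native memory), is logging process.memoryUsage() from a function node fed by a repeating inject. A sketch, assuming process is exposed through functionGlobalContext in settings.js, since it doesn't seem to be part of the normal function-node sandbox:

```javascript
// Assumes settings.js contains:  functionGlobalContext: { process: process }
// Logs V8 heap usage next to the resident set size that top reports.
const proc = global.get('process');
const mem = proc.memoryUsage();
node.warn(`rss=${(mem.rss / 1048576).toFixed(0)} MB, ` +
          `heapUsed=${(mem.heapUsed / 1048576).toFixed(0)} MB, ` +
          `external=${(mem.external / 1048576).toFixed(0)} MB`);
return msg;
```

If rss keeps climbing while heapUsed only grows by tens of MB (as in the heap dumps), the growth would be happening outside the V8 heap, which might explain the gap between top and the snapshots.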

Finally, the second screenshot explains the title of this post.
I see 100,000+ objects that all have a very similar 'retainers' profile. They seem to be retained by clearImmediate, which doesn't seem to be clearing much, as it's holding on to 80 MB worth of objects. I don't recognize anything of my own code in this 'retainers' profile, which makes it hard to know what I'm dealing with. Of course, I understand that the origin of the problem could still be in my own code :wink:

Does anyone know what clearImmediate and CommsConnection refer to?
Can they have anything to do with node-red-node-mysql?
I found this about commsConnection.
But that may be a red herring, because I usually don't have the editor open in the hours before a crash.

Thanks for any help, I've been struggling with this for almost 2 weeks! :grimacing:

What platform is this? A Pi? Have you checked your system stats? Is there SWAP happening?

This was my first thought - that the data you are handling has crept up and you are maintaining a larger dataset in memory, for which the system struggles to find a clear heap location when it gets updated. So not necessarily running out of memory but rather running out of large enough clear spots in the heap.

This is common in Node.js apps. Memory increases until the GC kicks in. As you forced it to delay GC, the memory grows even larger if it can.

I very much suspect that you are over-retaining data somehow. Quite possibly a large object, maybe a flow or global variable, but it could just be a context variable that you are manipulating perhaps.
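For example (a contrived sketch, not taken from your flows), a pattern like this in a function node will grow without bound, because the flow context keeps every payload reachable and the GC can never reclaim it:

```javascript
// Contrived over-retention example: every message is appended to a
// flow-context array and nothing is ever removed, so the array and all
// of its payloads stay reachable forever.
const history = flow.get('history') || [];
history.push(msg.payload);
flow.set('history', history);
return msg;
```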

That shouldn't be an issue, it is just a socket connection for live data comms.

What platform is this? A Pi? Have you checked your system stats? Is there SWAP happening?

Not a Pi :wink: it's a Red Hat server. No swapping is happening.

So not necessarily running out of memory but rather running out of large enough clear spots in the heap.

If I understand you correctly, that is still a possibility: there may be some data structure that's growing too large to place in the heap. So it's taking up memory space, but I won't see it in the heap dump. If that's true, I think I won't be able to reverse the problem with the limit node trick mentioned above. Once the object is too large, I would need a restart. I'm going to try that.

This is common in Node.js apps. Memory increases until the GC kicks in. As you forced it to delay GC, the memory grows even larger if it can.

I'm not delaying GC in general, right? Just for the hypothetical object (and related objects) in question? As they're not in the heap, they can't be GC'd? Because I do see that GC is happening.

I very much suspect that you are over-retaining data somehow. Quite possibly a large object, maybe a flow or global variable, but it could just be a context variable that you are manipulating perhaps.

I can't imagine how reducing the incoming messages 5-fold would help if that's the case. If that were the issue, I would expect the memory to grow more slowly, but still reach the limit after around 5*8 = 40 hours. But it's been running for ~40 hours now without any increase. Anyway, I'll check again, I might have missed something.

That shouldn't be an issue, it is just a socket connection for live data comms.

Okay, I can rule that out then, thanks for the response!

Are you running mysql locally on the same machine or across the network?

Run sudo iotop to check whether the system is I/O bound for some reason.

Then you should not be using --max_old_space_size; that won't help you at all. Take it off and let Node.js manage itself dynamically.

Or changing too fast.

No, you are. That's what --max_old_space_size does: it changes the limit at which GC kicks in.

I'm not a great expert in this, so I'm somewhat guessing. But reducing the speed at which data is coming in may change how much processing of a large in-memory data object is being done, or how fast you are asking it to change.

That would indeed seem logical, but JavaScript isn't always that logical in how it handles large data changes (well, it probably is, but it is certainly not transparent). This is one of the downsides of using it.

Are you running mysql locally on the same machine or across the network?

Across the network, indeed. And the queries are happening sequentially. So a lot of back and forth.

Run sudo iotop to check whether the system is I/O bound for some reason.

Will try on Monday!

@tnltj one question would be whether you have any debug nodes in your flows and whether they are getting sent lots of messages.

If this is something related to CommsConnection, then two likely drivers of traffic over comms will be the Debug messages and any node status updates.

Also, if you have not already done so, check that you are using the latest versions of all additional nodes (such as the mysql node).

Then post the node-red startup log in case we can see anything in there.