First, some general info
I have several flows, each with multiple MQTT in and MQTT out nodes. There are also multiple queries to a MySQL database, using node-red-node-mysql. About 250 incoming messages are processed per second, all flows combined. The flows have run without problems for several months. A few weeks ago, however, node-red started crashing due to what appears to be a memory leak. I did not update node-red or any nodes in the weeks prior to the start of the crashes.
Here's the stack trace from the logs at the point of a crash, added for completeness. It doesn't tell me much (I think), except that the garbage collector isn't succeeding in keeping the memory from growing.
Mar 21 00:29:59 node-red[319284]: <--- Last few GCs --->
Mar 21 00:29:59 node-red[319284]: [319284:0x5877b90] 27751590 ms: Scavenge 3726.7 (4126.7) -> 3713.7 (4126.7) MB, 8.86 / 0.00 ms (average mu = 0.343, current mu = >
Mar 21 00:29:59 node-red[319284]: [319284:0x5877b90] 27751637 ms: Scavenge 3728.1 (4126.7) -> 3714.5 (4128.0) MB, 7.46 / 0.00 ms (average mu = 0.343, current mu = >
Mar 21 00:29:59 node-red[319284]: [319284:0x5877b90] 27751686 ms: Scavenge 3729.6 (4128.0) -> 3715.3 (4128.5) MB, 7.73 / 0.00 ms (average mu = 0.343, current mu = >
Mar 21 00:29:59 node-red[319284]: <--- JS stacktrace --->
Mar 21 00:29:59 node-red[319284]: FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory
Mar 21 00:29:59 node-red[319284]: ----- Native stack trace -----
Mar 21 00:29:59 node-red[319284]: 1: 0xca5580 node::Abort() [node-red]
Mar 21 00:29:59 node-red[319284]: 2: 0xb781f9 [node-red]
Mar 21 00:29:59 node-red[319284]: 3: 0xeca4d0 v8::Utils::ReportOOMFailure(v8::internal::Isolate*, char const*, v8::OOMDetails const&) [node-red]
Mar 21 00:29:59 node-red[319284]: 4: 0xeca7b7 v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, v8::OOMDetails const&) [node-red]
Mar 21 00:29:59 node-red[319284]: 5: 0x10dc505 [node-red]
Mar 21 00:29:59 node-red[319284]: 6: 0x10dca94 v8::internal::Heap::RecomputeLimits(v8::internal::GarbageCollector) [node-red]
Mar 21 00:29:59 node-red[319284]: 7: 0x10f3984 v8::internal::Heap::PerformGarbageCollection(v8::internal::GarbageCollector, v8::internal::GarbageCollectionReason, >
Mar 21 00:29:59 node-red[319284]: 8: 0x10f419c v8::internal::Heap::CollectGarbage(v8::internal::AllocationSpace, v8::internal::GarbageCollectionReason, v8::GCCallb>
Mar 21 00:29:59 node-red[319284]: 9: 0x10ca4a1 v8::internal::HeapAllocator::AllocateRawWithLightRetrySlowPath(int, v8::internal::AllocationType, v8::internal::Allo>
Mar 21 00:29:59 node-red[319284]: 10: 0x10cb635 v8::internal::HeapAllocator::AllocateRawWithRetryOrFailSlowPath(int, v8::internal::AllocationType, v8::internal::All>
Mar 21 00:29:59 node-red[319284]: 11: 0x10a8c86 v8::internal::Factory::NewFillerObject(int, v8::internal::AllocationAlignment, v8::internal::AllocationType, v8::int>
Mar 21 00:29:59 node-red[319284]: 12: 0x1503a16 v8::internal::Runtime_AllocateInYoungGeneration(int, unsigned long*, v8::internal::Isolate*) [node-red]
Mar 21 00:29:59 node-red[319284]: 13: 0x7f2a3fed9ef6
Things I've tried
I was using node-red v1.3.7. I've now upgraded to v3.1.7, based on some other forum posts related to memory leaks. Unfortunately, that didn't change the behavior.
I've tried increasing --max_old_space_size from 512 to 2048. This just delays the crash (the crashes now happen about every 8 hours). I also clearly notice a performance drop long before the crash actually happens; that's the bigger issue.
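For reference, this is roughly how I'm setting the flag (assuming a systemd setup like mine, with the nodered.service unit from the standard install script; per the Node-RED docs, the node-red launcher forwards --max-old-space-size to the underlying node process):

```
# sudo systemctl edit nodered.service
# the empty ExecStart= is needed to clear the original before overriding it
[Service]
ExecStart=
ExecStart=/usr/bin/env node-red --max-old-space-size=2048
```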
I'm using some flow context here and there, so I checked its size using a small code snippet. The flow context size is well below 1 MB in each flow.
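For reference, the snippet was roughly this, in a function node (flow.keys() and flow.get() are the standard context API; the JSON.stringify length is only a rough approximation of the real heap cost, not an exact measurement):

```javascript
// rough flow-context size check: serialise every key and sum the bytes
const keys = flow.keys();
let total = 0;
for (const key of keys) {
    total += JSON.stringify(flow.get(key)).length;
}
node.warn(`flow context: ~${(total / 1024).toFixed(1)} KB over ${keys.length} keys`);
return msg;
```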
I've tried limiting the number of incoming messages from MQTT with a delay node, set to 'limit messages per topic'. I set it so the number of incoming messages is reduced by 80%. This actually removed the issue! Memory hovers around 400 MB, as it did before all the issues, and has stayed that way for 24 hours. Note that reducing the incoming messages by 50% was not enough.
Naturally, at this point I was assuming an inefficiency in my flow somewhere, causing messages to build up at a certain node. I've placed (many!) debug nodes set to 'count the number of messages', usually right before and after the SQL queries. However, I'm not seeing the expected buildup of messages anywhere, not even when things have already slowed down considerably.
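For completeness, the counting I did amounts to something like this in a function node (just a sketch; I actually used the debug node's built-in message counter, so this is an equivalent, not what the debug node does internally):

```javascript
// count messages passing through this point and show the running
// total in the node's status, similar to the debug node's counter
const count = (context.get('count') || 0) + 1;
context.set('count', count);
node.status({ fill: 'blue', shape: 'dot', text: count + ' msgs' });
return msg;
```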
The heap dump
In an attempt to get more clues, I took 3 heap dumps with node-red-contrib-heap-dump. The first was taken 5 minutes after startup, the others 20 and 30 minutes after startup. The first screenshot shows that after 30 minutes, there are 1.3 million more objects than at the first point in time, corresponding to a growth of about 57 MB.
Note: using top on Linux, I can see that the difference in memory used by node-red between the same two points in time is a lot larger: about 600 MB. Not sure what to make of this.
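My working assumption (unverified) is that top's RSS includes memory outside the V8 heap, such as Buffers and native allocations, which a heap dump won't show. A quick way to compare the two views, e.g. from an inject-triggered function node (process is available in the function node sandbox, as far as I know):

```javascript
// compare what top reports (rss) with what a V8 heap dump can see
// (heapUsed); 'external' covers Buffers and other native allocations
const mu = process.memoryUsage();
const mb = (n) => (n / 1048576).toFixed(0);
node.warn(`rss=${mb(mu.rss)} MB, heapUsed=${mb(mu.heapUsed)} MB, external=${mb(mu.external)} MB`);
return msg;
```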
Finally, the second screenshot explains the title of this post.
I see 100,000+ objects that all have a very similar 'retainers' profile. They seem to be retained by clearImmediate, which doesn't seem to be clearing much, as it's holding on to 80 MB worth of objects. I don't recognize anything of my own code in this 'retainers' profile, making it hard to know what I'm dealing with. Of course, I understand that the origin of the problem could still be my own code.
Does anyone know what clearImmediate and CommsConnection refer to? Can they have anything to do with node-red-node-mysql?
I found this about commsConnection.
But that may be a red herring, because I usually don't have the editor open in the hours before a crash.
Thanks for any help, I've been struggling with this for almost 2 weeks!