I am trying to isolate/resolve excessive socket resource consumption. The problem: the daily reboot of the Android panels results in gradual socket consumption until the server (RPi) exhausts its socket pool at around 500 sockets (the default max socket count for the OS).
The environment: the NodeRED/ExpressJS servers are Raspberry Pis. The user-facing front ends are a number of Android panels with kiosk-ed browsers running a Vue2 app in UIBUILDER. For operational reasons the Android panels are configured to reboot daily at 4am, and take approx 50 sec to reboot. I'm running several of these setups across multiple sites, resulting in some variance in node module versions, but I'm seeing the same issue on all instances. All are running NodeRED v3.0.2 with Node 16. Module versions in use (according to npm ls) include:
UIBUILDER | socket.io
v6.4.0 | v4.6.1
v6.4.1 | v4.6.1
v6.5.0 | v4.7.2
v6.6.0 | v4.7.2 (This unit is not yet in "Production" or part of these tests, but will be by the end of this week.)
Android panels are running a mix of Android 8/10/12 with no correlation between Android version and socket consumption. I'm primarily using one brand of Android panel, but I'm seeing the same issue with another brand. I'm gradually updating all panels of my primary brand to Android 12.
Restarting the NodeRED service is required to free up system resources. Prior to restarting I use netstat to observe current socket resource consumption. There are multiple sockets per Android panel lingering in the ESTABLISHED state. socket.io appears to create a surplus socket (or possibly more) at initial connection, and fails to clean up the socket when each panel goes through a 4am reboot. (Comments on the interwebs/elsewhere regarding Vue "use strict" make me a bit unsure whether to expect one-vs-two sockets.)
I can accept that perhaps two sockets per panel are needed, but the growth gets well beyond this, in some cases up to 50 or more sockets for a single Android panel. The distribution of this issue is somewhat inconsistent - a panel will behave fine for a time, while at other times it consumes multiple sockets. I put the random distribution/frequency down to things beyond my control - power/network instabilities. I can recreate the issue on my workbench by power-cycling a test Android panel running a bare-bones UIBUILDER instance templated from the "VueJS v2 & bootstrap-vue" template (to rule out any of my Vue coding).
My first thought was: if I can reduce the socket timeout to something much shorter (< 20 sec) than the panel reboot time (50 sec), old sockets should get cleaned up before new ones are established. That would possibly avoid any socket recycling/reuse issues which might be at play. I dug through the following docs to find settings to control the socket.io timeout durations.
https://totallyinformation.github.io/node-red-contrib-UIBUILDER/#/dev/server-libs/socket?id=socketio-server-options
Server options | Socket.IO
From that, I modified my NodeRED settings.js file, adding the following.
    uibuilder: {
        socketOptions: {
            pingInterval: 5000, // 5 sec
            pingTimeout: 4000,  // 4 sec
        },
    },
AFAIK, this should theoretically give me a maximum of 9 seconds (pingInterval + pingTimeout) before a socket is considered to have timed out. I restarted the NodeRED service to apply the changed settings, and used netstat to observe socket utilisation.
Test 0 - Establish a "normal" state. After powering up, the test panel initially establishes five sockets to the server, then three of the five quickly enter the FIN_WAIT2 state and ultimately time out/disappear. The remaining two sockets stay in the ESTABLISHED state. All sockets were/are accessing the server on port 1880. As you read on, I suspect one of these two remaining sockets might already be exhibiting the underlying issue, since I believe there should only be one socket between each UIBUILDER instance and the NodeRED engine.
Test 1 - Simulate a long power outage. I disconnect the (PoE-powered) panel. It takes approx 30 sec for one of the two sockets to change from ESTABLISHED to the FIN_WAIT1 state, with final teardown of this socket occurring another 120 sec after that. I find it curious that the timeout happened after 30 sec, not 9 sec per my socket.io config - there's something for me to learn there... The other socket of the Test #0 pair remains in the ESTABLISHED state indefinitely (12+ min beyond the first socket's teardown), even though the panel is not powered up. I think this demonstrates the core issue I'm having - I expect this second socket to time out at some point but it never does, and it can only be freed up via a restart of the NodeRED service.
Test 2 - Simulate power restore. After the panel boots up it again establishes five sockets, three quickly time out, and it stabilises at two ESTABLISHED sockets. Looking at the unique/semi-random client port numbers of the sockets, one of these two is the original socket from Test #0 which never timed out in Test #1.
Test 3 - Simulate the 4am reboot. For this test I did a quick disconnect/reconnect of power to the panel, looking for possible race conditions between socket timeout and panel boot time. Socket timeout behaviour is per Test #1 above. I think I might be teetering around a timeout threshold value, though this observation could be unfounded correlation. If the Android panel attempts to connect to the RPi while the previous socket is in the FIN_WAIT1 state on the RPi side, the initial cluster of five sockets drops back to three instead of two. (Spawning another lingering socket?) These three sockets include the original ESTABLISHED socket from Test #0 and two new sockets from the "cluster of five" in Test #3. The FIN_WAIT1 socket established in Test #2 times out.
Test 4 - Repeat of Test #3 above. While the Android panel was rebooting, one RPi socket entered FIN_WAIT1 state while the remaining sockets stayed in ESTABLISHED state. Post-reboot we settle to five ESTABLISHED sockets, so a growth of two extra sockets going from Test #3 to Test #4.
As a side-test, I was a bit frustrated that my settings.js => uibuilder => socketOptions timeouts seemed to be ignored, so I tried to find/modify the default behaviour of UIBUILDER's socket.io library. All modifications were temporary, and have since been restored to default settings.
~/.node-red/node_modules/node-red-contrib-UIBUILDER/nodes/libs/socket.js, line 216, maxDisconnectionDuration: changed this from 120 seconds to 12 seconds (12000 ms).
This did not appear to have any impact on the test results.
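If the settings continue to be ignored, my next step is probably to take NodeRED/UIBUILDER out of the picture entirely and check whether the ping settings behave as documented against a bare socket.io v4 server. A minimal sketch of that test (the port number, logging interval and logging details are arbitrary choices of mine, not anything from my production setup):

// Minimal stand-alone socket.io v4 server, used only to verify pingInterval/pingTimeout
// behaviour without NodeRED/UIBUILDER in the path. Port 3001 and the 10 sec logging
// interval are arbitrary.
const { Server } = require('socket.io')

const io = new Server(3001, {
    pingInterval: 5000, // 5 sec
    pingTimeout: 4000,  // 4 sec
})

io.on('connection', (socket) => {
    console.log(`connect: ${socket.id} from ${socket.handshake.address}`)
    socket.on('disconnect', (reason) => {
        console.log(`disconnect: ${socket.id} (${reason})`)
    })
})

// Log how many clients socket.io itself believes are connected, for comparison
// against what netstat reports at the TCP level.
setInterval(() => {
    console.log(`socket.io client count: ${io.engine.clientsCount}`)
}, 10000)

If the lingering ESTABLISHED sockets still appear when a test client is power-cycled against this, the problem sits below UIBUILDER; if they don't, that points back at UIBUILDER's socket handling.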
Other general observations:
- I don't think the presence of a socket in FIN_WAIT1 state is influencing subsequent socket creations, since Test #0 had no such FIN_WAIT1 sockets when it created a "faulty" socket.
- I don't see any obvious performance degradation on the panel when the RPi is reporting multiple socket connections to it - I suspect the issue is contained to the RPi side and there are probably not 50+ sockets connected end-to-end between server and client.
- I'm not seeing message duplication between my Vue app and the NodeRED flow, so I think such "duplicate" sockets are not replicating/passing data. They are simply consuming system resources (due to my daily 4am Android reboot) and remaining in the ESTABLISHED state on the RPi side until we reach a critical OS resource limit.
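To test the theory that these lingering ESTABLISHED sockets are not live end-to-end connections, I'm planning to compare the TCP-level count from netstat with what UIBUILDER itself thinks is connected, using the control messages on the uibuilder node's second output port. A rough Function-node sketch (this assumes the control messages are labelled 'client connect' / 'client disconnect' and carry a _socketId - I still need to confirm the exact names in my versions):

// Function node wired to the uibuilder node's second (control) output.
// Maintains the set of clients UIBUILDER believes are connected so the
// application-level count can be compared with the TCP-level count from netstat.
const clients = flow.get('uibClients') || {}

if (msg.uibuilderCtrl === 'client connect') {
    clients[msg._socketId] = Date.now()
} else if (msg.uibuilderCtrl === 'client disconnect') {
    delete clients[msg._socketId]
}

flow.set('uibClients', clients)
node.status({ text: `UIBUILDER clients: ${Object.keys(clients).length}` })

return null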
Where to from here? I'm currently mitigating this by monitoring the total socket count via netstat -n | grep tcp | wc -l, flagging it when it gets above 350 sockets, and arranging an out-of-hours restart of the NodeRED service on an as-needed basis (a rough sketch of that check is below). I'd prefer to resolve the root cause rather than set up an automated service restart and ignore it.
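For reference, the check itself is trivial to script; a rough Node sketch of what I currently do by hand (the 350 threshold and the alert mechanism are my own arbitrary choices):

// Rough sketch of the manual check: count TCP sockets via netstat and flag when the
// total exceeds an arbitrary threshold of 350, at which point I arrange an
// out-of-hours NodeRED restart. Could be run from cron; in practice the warning
// would go to an alerting channel rather than the console.
const { execSync } = require('child_process')

const THRESHOLD = 350
const count = parseInt(
    execSync('netstat -n | grep tcp | wc -l').toString().trim(), 10)

if (count > THRESHOLD) {
    console.warn(`TCP socket count ${count} exceeds ${THRESHOLD} - schedule a NodeRED restart`)
} else {
    console.log(`TCP socket count ${count} is OK`)
}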
Yours,
Annuello.