tl;dr When I see around 250 users on my Node-RED powered website, the site load times become much slower (over 30 seconds) and I can't find why or how to fix it.
Website URL: https://tbg.airframes.io/
I’ve not made any changes to the website for a few months, yet load times have slowly been getting worse around 8am each morning (Pacific Daylight Time), for about 2-3 hours.
Most of the site users are in North America.
Right or wrong, my best guess is that the growing number of users is the cause of the issue, as very little else has changed.
I think what happens is that people get annoyed with the slow load times, give up and leave. Once 50+ people do that, the site load times go back to sub 1 second.
The virtual machine running the site is in a data center. I cannot see any stress on the VM indicating that it may be the cause.
Data that I’ve been using to try to track down the root cause of the load times:
I’ve been looking at btop for any hints:
BTW, the reason Node-RED seems to be using so much RAM is that a core part of my website uses 2-3 largish (1.2 million records) SQLite databases that I run in RAM for speed. The actual flows are only about 3 MB in size.
I also leave pm2 monit running, as it shows errors that don’t show up in the editor debug tab.
This is one example I have seen from time to time... I have no idea which node causes this or how it gets generated from my flow (if that is even the cause).
I also see these disconnect and force close errors scrolling up all the time - note the time stamps.
The disconnect and force close errors are constant. I don’t understand what these messages are or where they come from.
Even when the site is loading smoothly, the errors scroll up at around 1-2 per minute, which is much slower than when the latency is higher.
The http latency value is real. When the time shown in pm2 monit goes over 2000 ms, the site really does take 2 seconds to first load, information on the site updates much more slowly, etc.
This is my biggest ‘tell’ as to when things are not working smoothly.
Here is an example of PM2 information when there is a 28 second load time for the site:
I don't really understand a lot of the numbers here beyond the P95 latency. That one is real, easy to measure on the site itself, and the biggest pain point for my site users.
I use Node-RED to do a call to PM2 once a minute and extract the P95 latency number. I have been plotting its value for a few months.
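Roughly, the extraction step could look like the sketch below. It assumes the text comes from `pm2 jlist` and that each process entry carries its metrics under `pm2_env.axm_monitor`, keyed by the metric name shown in pm2 monit; both the function name `extractMetric` and that layout are assumptions to check against your own pm2 output.

```javascript
// Pull one numeric metric for one app out of the JSON text printed by `pm2 jlist`.
// ASSUMPTION: metrics live at pm2_env.axm_monitor[metricName].value
function extractMetric(jlistText, appName, metricName) {
  const processes = JSON.parse(jlistText);
  const proc = processes.find((p) => p.name === appName);
  if (!proc) return null; // no such app in the list

  const monitor = (proc.pm2_env && proc.pm2_env.axm_monitor) || {};
  const metric = monitor[metricName];
  if (!metric) return null; // metric not reported for this app

  // Convert the reported value to a number, whether it arrives as text or not.
  const value = parseFloat(metric.value);
  return Number.isNaN(value) ? null : value;
}

// Made-up sample input, shaped as assumed above (not real output from my server).
const exampleText = JSON.stringify([
  {
    name: 'node-red',
    pm2_env: { axm_monitor: { 'Event Loop Latency p95': { value: '12.5', unit: 'ms' } } },
  },
]);

console.log(extractMetric(exampleText, 'node-red', 'Event Loop Latency p95'));
```

In a flow, the same function body could sit in a function node fed by an exec node that runs `pm2 jlist`.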
Here is a graph of the pm2 ‘Event Loop Latency p95’ value from mid July to Oct:
Here is the Google Analytics user activity over the same time period:
In my mind, the two correlate, with a tipping point at 8am each morning in North America: the number of users spikes and the site load time spikes.
The tipping point is around 250 users.
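So far I have only judged this by eye from the two graphs. A minimal sketch of how it could be put into numbers, assuming the user counts and p95 latency values have been exported as two arrays sampled at the same times (the function names and the arrays in the example are made up for illustration):

```javascript
// Pearson correlation between two equal-length numeric series (-1 to 1).
function pearson(xs, ys) {
  if (xs.length !== ys.length || xs.length < 2) {
    throw new Error('need two equal-length series with at least 2 points');
  }
  const n = xs.length;
  const meanX = xs.reduce((a, b) => a + b, 0) / n;
  const meanY = ys.reduce((a, b) => a + b, 0) / n;
  let cov = 0;
  let varX = 0;
  let varY = 0;
  for (let i = 0; i < n; i++) {
    const dx = xs[i] - meanX;
    const dy = ys[i] - meanY;
    cov += dx * dy;
    varX += dx * dx;
    varY += dy * dy;
  }
  if (varX === 0 || varY === 0) return NaN; // a flat series has no correlation
  return cov / Math.sqrt(varX * varY);
}

// Mean latency for samples below the user threshold versus at or above it.
function meanLatencyByThreshold(users, latency, threshold) {
  const mean = (arr) => (arr.length ? arr.reduce((a, b) => a + b, 0) / arr.length : null);
  const below = latency.filter((_, i) => users[i] < threshold);
  const above = latency.filter((_, i) => users[i] >= threshold);
  return { below: mean(below), above: mean(above) };
}

// Hypothetical numbers only, not measurements from my site.
const exampleUsers = [100, 200, 300, 400];
const exampleLatencyMs = [1, 1, 10, 20];
console.log(pearson(exampleUsers, exampleLatencyMs));
console.log(meanLatencyByThreshold(exampleUsers, exampleLatencyMs, 250));
```

A correlation near 1, together with a big jump between the two means, would back up the 250 user tipping point. It would not prove that the users are the cause.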
Here is what I have tried:
apt-get update
apt-get upgrade
reboot now
Currently, my versions are:
Node-RED 4.0.3
NodeJS 18.20.4
Linux 5.15.0-112-generic
VM is a 2 GHz AMD EPYC 7282 16-Core Processor with 32 GB of RAM and a 500 GB SSD
I start Node-RED thus: pm2 start node-red --node-args="--max-old-space-size=8192" -- -v
I have had this issue with dash v1, but with only about 50 to 80 users. You can read about that here:
https://discourse.nodered.org/t/how-to-make-dashboard-webserver-more-robust/61295
The lack of visibility of these core metrics led me to put in this GitHub request:
I understand that this use case is a bit out of the usual, so it's hard to give an example flow to reproduce the issue.
I'm at a loss as to what to try next. I think the hardware I am running Node-RED on should be more than enough to keep up.
In short, I just don't see why the site should not run as smoothly today as it did 3 months ago, before the user count went up.
I have to be missing a setting somewhere....
Thanks.