Dashboard load times > 30 sec with more than 250 connections to website

thebaldgeek · 10 October 2024 21:24

tl;dr When I see around 250 users on my Node-RED powered website, the site load times become much slower (over 30 seconds) and I can't find why or how to fix it.

Website URL: https://tbg.airframes.io/

I’ve not made any changes to the website for a few months and it's slowly been getting worse in regards to load times around 8am each morning (Pacific Daylight Time) for about 2-3 hours.
Most of the site users are in North America.
Right or wrong, this is my best guess as to the cause of the issue, as very little else has changed beyond the number of users.
I think what happens is that people get annoyed with the slow load times, give up and leave. Once 50+ people have done that, the site load times go back to under 1 second.

The virtual machine running the site is in a data center, and I cannot see any stress on the VM that would indicate it is the cause.

Data that I’ve been using to try and help track down the root cause of the load times:

I’ve been looking at btop for any hints:

BTW, the reason Node-RED seems to be using so much RAM is because a core part of my website uses 2-3 largish (1.2 million records) sqlite dbs that I run in RAM for speed. The actual flows are only about 3mb in size.

I also leave pm2 monit running as it shows errors that don’t show up in the editor debug tab.
This is one example I have seen from time to time... I have no idea what Node causes this or how it gets generated from my flow (if that is even the cause).

I also see these disconnect and force close errors scrolling up all the time - note the time stamps.

The disconnect and force close errors are constant. I don’t understand what these messages are or where they come from.
Even when the site is loading smoothly the errors scroll up around 1-2 per minute. Much slower than when the latency is higher.

The http latency value is real. When the time shown in pm2 monit goes over 2000 msec, the site really does take 2 seconds to first load, information on the site updates much more slowly, etc.
This is my biggest ‘tell’ as to when things are not working smoothly.

Here is an example of PM2 information when there is a 28 second load time for the site:

I don't really understand a lot of the numbers here beyond the P95 latency as that is real and easy to measure on the site itself and is the biggest pain point to my site users.

I use Node-RED to do a call to PM2 once a minute and extract the P95 latency number. I have been plotting its value for a few months.
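For anyone wanting to reproduce that once-a-minute measurement, here is a minimal sketch of the extraction step. It assumes the text output of `pm2 jlist` is passed in as a string and that the metric lives at the `pm2_env.axm_monitor["HTTP P95 Latency"].value` path quoted in the GitHub issue further down this post. The function name `extractPm2Metric` and the sample JSON are made up for illustration; they are not part of PM2 or Node-RED.

```javascript
// Hypothetical helper: pull one metric out of the text printed by `pm2 jlist`.
// Returns a number, or null when the process or metric is missing.
function extractPm2Metric(jlistText, metricName, processIndex = 0) {
  const processes = JSON.parse(jlistText);
  const monitor = processes[processIndex]?.pm2_env?.axm_monitor ?? {};
  const raw = monitor[metricName]?.value;
  if (raw === undefined) return null;
  const value = parseFloat(raw); // the value may arrive as a number or a string
  return Number.isNaN(value) ? null : value;
}

// Made-up sample, shaped only like the key path quoted in the issue text.
const sampleJlist = JSON.stringify([
  {
    name: 'node-red',
    pm2_env: { axm_monitor: { 'HTTP P95 Latency': { value: 2150, unit: 'ms' } } },
  },
]);

console.log(extractPm2Metric(sampleJlist, 'HTTP P95 Latency'));
```

In a function node fed directly by the exec node, the same logic would be applied to `msg.payload` before the result is sent on to whatever does the plotting.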

Here is a graph of the pm2 ‘Event Loop Latency p95’ value from mid July to Oct:

Here is the Google Analytics user activity over the same time period:

In my mind, the two correlate with a tipping point at 8am each morning in North America: the number of users spikes and the site load time spikes.
The tipping point is around 250 users.
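To put a number on "the two correlate" rather than eyeballing two graphs, something like the sketch below could be run over the per-minute samples. It computes a plain Pearson correlation coefficient. The function name `pearson` and the sample arrays are hypothetical, not my real data. A sharp tipping point is not a straight-line relationship, so the coefficient will understate it; treat the result as a rough indicator only.

```javascript
// Pearson correlation between two equal-length numeric series.
// Returns a value between -1 and 1, or NaN if either series is constant.
function pearson(xs, ys) {
  if (xs.length !== ys.length || xs.length < 2) {
    throw new Error('series must have the same length (at least 2 samples)');
  }
  const n = xs.length;
  const meanX = xs.reduce((sum, x) => sum + x, 0) / n;
  const meanY = ys.reduce((sum, y) => sum + y, 0) / n;
  let cov = 0;
  let varX = 0;
  let varY = 0;
  for (let i = 0; i < n; i++) {
    const dx = xs[i] - meanX;
    const dy = ys[i] - meanY;
    cov += dx * dy;
    varX += dx * dx;
    varY += dy * dy;
  }
  return cov / Math.sqrt(varX * varY);
}

// Hypothetical samples: connected users and P95 latency (ms) at the same minutes.
const sampleUsers = [120, 180, 240, 260, 300];
const sampleLatencyMs = [300, 400, 900, 9000, 28000];
console.log(pearson(sampleUsers, sampleLatencyMs).toFixed(2));
```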

In regards to what I have tried:
apt-get update
apt-get upgrade
reboot now

Currently, my versions are:
Node-RED 4.0.3
NodeJS 18.20.4
Linux 5.15.0-112-generic
VM is a 2 GHz AMD EPYC 7282 16-Core Processor with 32 GB of RAM and 500 GB SSD

I start Node-RED thus: pm2 start node-red --node-args="--max-old-space-size=8192" -- -v

I have had this issue with dash v1, but with only about 50 to 80 users. You can read about that here:
https://discourse.nodered.org/t/how-to-make-dashboard-webserver-more-robust/61295

The lack of visibility of these core metrics led me to put in this GitHub request:

github.com/FlowFuse/node-red-dashboard

Dashboard metrics widget (connected users / load times / memory use)

opened 04:10PM - 29 Nov 23 UTC

thebaldgeek

size:L - 5 feature-request widget

### Description

The v1 dashboard is 'ugly' to get system type metrics out of. … I have found that the bare minimum metrics would be number of connected humans, dashboard load time and Node-RED memory use.

First up, any large dashboard needs to be run with the `--max-old-space-size=8192` (https://discourse.nodered.org/t/out-of-memory-crash-on-windows-10-pc/54141/22?u=thebaldgeek) start up command. Must. Not optional.

Next up, make changes to settings.js (https://discourse.nodered.org/t/out-of-memory-crash-on-windows-10-pc/54141/4?u=thebaldgeek) so that heap and memory use can be seen at start up.

Make heavy use of browser developer tools (https://discourse.nodered.org/t/how-to-make-dashboard-webserver-more-robust/61295/26?u=thebaldgeek)

![image](https://github.com/FlowFuse/node-red-dashboard/assets/7385116/f603d65f-0fdf-4c00-8f9c-78b5c33d1289)

Extensive time with Google is required to dig into the parts loaded by the dashboard so long load items can be identified and tweaked (or usually given up on and simply removed) for faster response.

Then use extensive wrangling with the ui_control node to try and measure user clicks etc to track the number of connected users vs Cloudflare cache IP addresses. (Using geo-ip API to try and find 'cloudflare' in the IP address return to remove those from 'tracking' so knowledge of actual number of connected users can be determined).

Note about Cloudflare. Since the dash is both single user and single page app, using Cloudflare cache is a real blessing and curse. It's needed so users don't trample over each other, but since the whole site is dynamic, there is very little content that can be cached, so it does not actually help beyond helping each user to see their data for as long as needed.

Also the cache causes the editor to have little melt-downs pretty often and you lose changes unless deployed before the cache timer expires, so you end up editing and deploying 100's of almost click by click tiny changes vs making the changes you want over about 5 minutes and thus timing out the cache for the editor and losing all that work.

Lastly, it's critical to make use of the exec node to pull PM2 metrics (`pm2 jlist`, then a JSON node, then pull this key:value -> `msg.payload[0].pm2_env.axm_monitor["HTTP P95 Latency"].value`)

This boils down to a 'connected user graph vs dashboard load latency' graph.

![image](https://github.com/FlowFuse/node-red-dashboard/assets/7385116/ee589e92-6a28-4fa8-9494-9a7856bbe7f4)

Graph of the past month showing 80 users results in upwards of 80 second load times before seeing the /ui/ v1 dashboard home page. (Graph in non-Node-RED dashboard as long graph times are a problem for me in using the v1 graph node).

I was able to use all this data over the space of a year or so to make improvements to the flows running the dashboard and thus get the load times down to the 'acceptable' value of 60 to 80 seconds and disconnects no more than about every 5 minutes. I found that people give up when it's over 2 minutes to load the dashboard and frequent timeouts are the biggest turn-aways.

I propose a 'dash_metrics' widget that does a lot of this digging (and more) for the user. Difficult things made easy.

### Properties
_No response_

### Events
_No response_

### Controls
_No response_

### Existing Examples
_No response_

### Have you provided an initial effort estimate for this issue?
I am no FlowFuse team member

I understand that this use case is a bit out of the usual and so it's hard to give an example flow to reproduce the issue.
I'm at a loss as to what to try next. I think the hardware I am running Node-RED on should be more than enough to keep up.
In short, I just don't see why the site should not run as smoothly today as 3 months ago before the user count went up.

I have to be missing a setting somewhere....

Thanks.

pandoras_node · 11 October 2024 00:09

Networking bottlenecks?

thebaldgeek · 11 October 2024 01:00

I moved the site to a data center to ensure I had a big fat pipe to avoid any of these issues.
The dash v1 site was at a different data center and yes, there I had some network contention issues. It was a LOT easier to troubleshoot and see the issue using standard Linux CLI tools.

The new data center does not have any of those issues.
Here is what I have moved so far today.
169 MB of aircraft data into the data center (I use MQTT as it's super small and works so well with Node-RED).
3.6 GB out to browser clients. The day has just begun, so the math says yet another 100-odd GB day.
today 169.67 MiB / 3.64 GiB / 3.81 GiB / 121.78 GiB

It's a hard fact that Node-RED is a single page app, but I moved from dash v1 to dash v2 as I know the team is working hard to make the site have more than 1 base URL.
When that happens, it will be SUPER interesting to see the network traffic stats.
I have the top 5 pages and they are WAY above all the rest of the site. Looking forward to splitting at least those pages into their own base URLs.

Bottom line, there is zero evidence that the data center or VM is choking the upload data.

joepavitt · 11 October 2024 06:52

I was hoping to try to access the site at this time today, but I need to finish early, so won't be able to.

The amazing analysis you've done here definitely suggests bottlenecks in NR or Dashboard. An interesting analysis would be to use the in-browser performance tools to see what is actually taking the time on load. Whether it's the initial connect, the data retrieval, etc.

It may even boil down to a limitation on the concurrent connections the WebSocket (SocketIO) server can handle.

hotNipi · 11 October 2024 07:09

Couple of things I saw:
Site has rendering errors, most probably originating from ui_template nodes
Data over the socket can be optimized. Seems like payloads contain raw data which is not used for components. Any data the front-end doesn't use should be filtered out. The gain may be quite significant.

TotallyInformation · 11 October 2024 10:18

Clearly the front-end scripting is a big issue in terms of user experience. I don't think that is the reason for the impact you are describing, though. But it makes the user experience relatively slow even when you aren't having the described issue.

Also:

Really very high levels of comms - that could well be part of the issue.

Do you have a reverse proxy you can use? That may help, at least with loading of more static resources and so would reduce the specific impact on Node-RED when lots of users connect.

thebaldgeek · 11 October 2024 13:09

I am ALL ears here!!
The data tables MUST be responsive, so I had to hand code the HTML and CSS to get that to sort of work.
It ended up being just a big old slug of web code in a ui_template node.

What 'front end' data do you see? How can I identify it and how can I strip it out before it gets sent?
I did see a LOT of code in my browser CSS debugging sessions that was not mine and I just ignored it as I was not sure who was making it, or if it was coming from D2, or was just part of the browser debug code.

This sounds like it's in Vuetify's playing field? Nothing I can do about it?

Is this under my control somehow?

I think I am stuck here given the data heavy aspect of the website that has been user driven. (They are asking for pretty much all the tables to be tripled in length!!)
Not much I can do about it until the D2 base URL can support more than one?

Even after reading a few Google results for 'reverse proxy' I am not sure what one is, or how to configure it or what it might do for me.
The data center has set up Caddy for me.... Something to do with the SSL certificate?
Seems that Caddy has a reverse proxy function: Reverse proxy quick-start — Caddy Documentation
I will take a look.

Thanks all for your valuable feedback.
Very much appreciate it.

thebaldgeek · 11 October 2024 13:13

Quick update. The data center already has the Caddy reverse proxy turned on for me.
The site has been running it this whole time.

I guess the core issue is that the site has very few static resources vs the large number of dynamic data tables, so I am not getting much help from it.
I don't want to, but am oddly interested in how bad the site would be with it turned off...

TotallyInformation · 11 October 2024 13:18

Not when using D2. Use UIBUILDER if you want full control over the front-end.

One issue you are facing is the overheads of how D2 interacts between Node-RED and the browser by using VueJS and Vuetify. There are a lot of overheads to that because D2, like Node-RED, is designed to be beginner friendly. So just like Node-RED itself, there are a LOT of overheads that you may or may not be utilising here. In the same way, for example, you could get much higher throughput from a hand-crafted system than from Node-RED.

Do you need D2 here? If you don't, consider keeping D2 for other tasks and use UIBUILDER for this page. UIBUILDER supports multiple independent pages and even multiple independent uibuilder nodes.

Caddy can be used as a reverse proxy.

Think of it simply as an advanced cache. If a resource is fairly static and especially if used by many client connections, the proxy will hold a copy of the static data and serve it directly. This is as opposed to how Node-RED has to dynamically build the page for every single connection (a slight simplification of what actually happens but close enough).

That is a bit more complex but there should be examples you can find here in the forum and possibly elsewhere online.

Another thing that a proxy can do for you is manage the overheads of TLS encryption. This would remove the overheads from Node-RED. Indeed, Caddy has a built-in feature to make working with Let's Encrypt certificates very easy.

If you need to optimise the performance of a web page, you need to start by making as much of it as possible STATIC. Then only updating the bits that really need to be updated. This is explicitly what UIBUILDER is designed to help you do.
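As a generic illustration of "only updating the bits that really need to be updated" (this is not UIBUILDER's API, just plain JavaScript), the sketch below compares the previous and current table rows and keeps only rows that are new or changed. The function name `changedRows`, the `station` key field and the sample rows are assumptions for the example.

```javascript
// Return only the rows of `current` that are new or differ from `previous`.
// Rows are matched on `keyField`. The comparison uses JSON.stringify, so it
// assumes both snapshots list their properties in the same order.
// Rows that disappeared from `current` are not reported by this sketch.
function changedRows(previous, current, keyField) {
  const before = new Map(
    previous.map((row) => [row[keyField], JSON.stringify(row)])
  );
  return current.filter(
    (row) => before.get(row[keyField]) !== JSON.stringify(row)
  );
}

// Hypothetical snapshots of a small table.
const lastSent = [
  { station: 'A', today: 10 },
  { station: 'B', today: 20 },
];
const latest = [
  { station: 'A', today: 10 }, // unchanged, so not sent again
  { station: 'B', today: 21 }, // changed
  { station: 'C', today: 1 },  // new
];

console.log(changedRows(lastSent, latest, 'station'));
```

The browser side would then need to merge those rows into the table it already holds, which is the part that needs front-end code.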

thebaldgeek · 11 October 2024 13:33

The user onslaught is starting a little early this morning.

This graph is updated once a minute with the P95 latency value from PM2.

Looking at all my netdata graphs (list):

I don't see anything jumping out at me.
I guess it really is just the nodejs overhead.

Time to make some users unhappy and strip some pages out.
I think I can run more than 1 Node-RED and thus more than one dashboard URL on the same server?

Steve-Mcl · 11 October 2024 13:35

Just spotted a big clue:

transport=polling:

This indicates that HTTP polling is being used as the transport mechanism instead of WebSockets. Polling is used as a fallback (mechanism provided by socket.io) when a WebSocket connection cannot be established immediately. Polling is much slower than WebSockets because it involves repeated HTTP requests to check for updates.
Clue: socket.io initially tries polling and later upgrades to WebSockets when possible. A 10-second hang could mean it's getting stuck trying to upgrade to WebSockets or the polling process itself is slow due to network, server load, or configuration issues.

Check points:

Load balancers or proxies: If you have load balancers or reverse proxies in place (like NGINX), ensure they're properly configured to handle WebSocket traffic. Misconfiguration here often causes fallback to polling.
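One way to check how often clients are stuck on polling is to export the request URLs from the browser network tab and tally the `transport` query parameter that socket.io puts on each request. The sketch below does only that tallying. The function name `countTransports` and the example host and path are made up for illustration.

```javascript
// Tally socket.io request URLs by their `transport` query parameter.
// URLs without that parameter are ignored.
function countTransports(urls) {
  const counts = {};
  for (const raw of urls) {
    // The base is only used when `raw` is a relative URL.
    const url = new URL(raw, 'http://placeholder.invalid');
    const transport = url.searchParams.get('transport');
    if (transport === null) continue;
    counts[transport] = (counts[transport] || 0) + 1;
  }
  return counts;
}

// Hypothetical URLs of the kind seen in a browser network tab.
const sampleUrls = [
  'https://example.invalid/dashboard/socket.io/?EIO=4&transport=polling&t=a1',
  'https://example.invalid/dashboard/socket.io/?EIO=4&transport=polling&t=a2',
  'wss://example.invalid/dashboard/socket.io/?EIO=4&transport=websocket&sid=x',
  'https://example.invalid/dashboard/assets/index.js',
];

console.log(countTransports(sampleUrls));
```

A session that shows many `polling` requests and no `websocket` entry at all would point towards the upgrade never completing.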

What @hotNipi is stating here is: If your table displays 10 rows with 15 columns each BUT your data transmission contains data for 30 columns (i.e. 15 columns are just simply never displayed) then this is wasteful. I have not seen (or looked for) that condition on your site, so I cannot comment further.
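A minimal sketch of that idea: before the rows are sent to the dashboard, keep only the columns the table actually displays. The column keys used here (`station`, `today`, `yest`, `speed`) are the header keys from the template posted later in this thread. The function name `pickColumns` and the extra fields in the sample row are hypothetical.

```javascript
// Keep only the listed keys on every row; everything else is dropped.
function pickColumns(rows, keys) {
  return rows.map((row) => {
    const slim = {};
    for (const key of keys) {
      if (key in row) slim[key] = row[key];
    }
    return slim;
  });
}

// Hypothetical row carrying fields that the table never renders.
const rawRows = [
  { station: 'A', today: 10, yest: 12, speed: 3, rawFrame: '...', lat: 0, lon: 0 },
];
const displayedKeys = ['station', 'today', 'yest', 'speed'];

console.log(pickColumns(rawRows, displayedKeys));
```

In a function node this would be applied to whichever message property feeds the table, just before the ui_template node.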

PS: The dashboard table is responsive. Columns collapse to row based data based on either the default breakpoints (or your custom breakpoints set in the page config).

thebaldgeek · 11 October 2024 13:37

I've tried using UIBUILDER 4 times in earnest (ie, spent a few hours working through the docs and testing it) and kicked at it a few times in passing.
I just can't seem to get past go with it and actually get any Node-RED data to show up in it.
Sadly, I am just not the target audience for it since I am not a coder.

thebaldgeek · 11 October 2024 13:39

When I built the site in June, this was not the case.
I am happy to revisit this, but am not sure that changing from ui_template to ui_table will fix this issue?

hotNipi · 11 October 2024 13:40

The websocket messages.
Here's one for you. It seems to carry data for first table in main page.

tbArgs duplicates the payload
drop, qos, retain I don't think are useful here. Seems like the origin of this message is MQTT so the properties were carried over to the front end.
I can't of course analyse in depth cos the data may do other things than just to be rendered but those just popped out immediately.

And this is only one message! The traffic is pretty heavy.
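A rough sketch of stripping those properties before the message reaches the dashboard node. It assumes the template only reads `msg.tbArgs`, so `payload` can go along with the MQTT leftovers; if something else downstream needs `payload`, leave it off the list. The function name `withoutProps` and the sample message are made up for illustration.

```javascript
// Return a shallow copy of `msg` without the listed properties.
// The original message object is left untouched.
function withoutProps(msg, unused) {
  const slim = { ...msg };
  for (const name of unused) {
    delete slim[name];
  }
  return slim;
}

// Hypothetical message shaped like the one in the websocket frame.
const sampleMsg = {
  topic: 'stations',
  payload: [{ station: 'A', today: 10 }],
  tbArgs: [[{ station: 'A', today: 10 }]],
  qos: 0,
  retain: false,
  drop: false,
  _msgid: 'abc123',
};

const slimMsg = withoutProps(sampleMsg, ['payload', 'qos', 'retain', 'drop']);
console.log(Object.keys(slimMsg));
```

In a function node the equivalent would be `return withoutProps(msg, [...])` placed directly in front of the ui_template node.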

thebaldgeek · 11 October 2024 13:47

I'm trying to wrap my head around what you mention here as it seems important.
Here is the code and payload for your screenshot:

This is then fed into the ui_template node

I don't see where its duplicated more than once?

EDIT, sorry, there is a bug in Node-RED that does not show all the template code...

```
<template>
  <v-data-table
    v-model:search="search"
    :items="msg?.tbArgs[0]"
    :headers="headers"
  >
    <template v-slot:header.station="{ item }"> Station </template>
    <template v-slot:item.today="{ item }"> {{ item.today }} </template>
    <template v-slot:item.yest="{ item }"> {{ item.yest }} </template>
    <template v-slot:item.speed="{ item }">
      <!-- Render a Linear Progress Bar for the "current" column -->
      <v-progress-linear
        v-model="item.speed"
        min="0"
        max="100"
        height="25"
        color="blue"
      >
        <template v-slot:default="{ value }">
          <strong>{{ item.speed }}mpm</strong>
        </template>
      </v-progress-linear>
    </template>
    <template #bottom>
      <!-- Leave this slot empty to hide pagination controls -->
      <!-- <hr> -->
    </template>
  </v-data-table>
</template>

<script>
  export default {
    data() {
      return {
        search: '',
        headers: [
          { key: 'station', title: 'station' },
          { key: 'today', title: 'msg today' },
          { key: 'yest', title: 'msg yest' },
          { key: 'speed', title: 'msg speed' },
        ],
      }
    },
  }
</script>
<style>
  tbody tr:nth-of-type(even) {
    background-color: rgb(63, 63, 63);
  }

  tbody tr:nth-of-type(odd) {
    background-color: rgb(98, 98, 98);
  }

</style>
```

hotNipi · 11 October 2024 13:49

Set the debug node to show full msg. Then you see what goes out really.

Steve-Mcl · 11 October 2024 13:53

Perhaps, but then you are using the old ACE editor which does not understand Vue.

Monaco editor (the editor that powers VSCode) has been built in since Node-RED v2 and became the default in Node-RED v3.

What versions of Node-RED are you using?

Did you deliberately use the less capable ACE editor over MONACO?

Or perhaps you have an old settings.js with the editor hard coded to use ACE?

Below is what your template code looks like in MONACO. Note the syntax highlighting colours and full text display with minimap on the right:

thebaldgeek · 11 October 2024 14:02

I'm on 4.0.3

No, never changed it from the default install.

I will check how to change it, but can't restart Node-RED as it dumps the sqlite memory db and it makes a lot of grumpy users, so will need to pick a time when there are a lot of updates that need doing.

Thanks for pointing it out, I had no idea.

thebaldgeek · 11 October 2024 14:04

Ok, this is sort of mind blowing.
I had no idea that msg.payload was ALSO going out over the wire to each users browser.
I thought just the HTML and CSS was going out.....

I have a TON of cleaning up to do.
Thanks for pointing this out.

TotallyInformation · 11 October 2024 15:07

Quite possibly the result of an unconfigured proxy? If the Caddy proxy is live, it needs to be configured for websockets.

Hmm, and yet you are delivering a website that is in need of some TLC? There is a bit of a disconnect here.

I am generally very responsive to people who reach out for advice by the way.
