Performance for industrial data gathering

I've been building a Node-RED application for tracking material data at work (a PCB facility).

Currently, I poll data from different machinery at different intervals:

  • 3s intervals from 26 machines (for now) that run on Siemens S7 PLCs.
  • 15s intervals from 3 databases running Microsoft SQL Server Express to get the material stored in smart storage units.

The data gathered is written to a MySQL database that holds a table for each machine type, tracking the timestamp and characteristics of the material at the moment it passed by.

In the future, I'll add several more machines that run on Omron NX PLCs, and possibly gather information from some additional databases from process machinery.

When I started the trials a few weeks ago, I was using node-red-contrib-ui-led to display the machine status, but the impact on performance was enormous. I was also scanning much more often (every 500 ms) to catch the actual status of the machine lights (they blink on a 1 Hz clock).

If I tried accessing the dashboard from a Raspberry Pi or one of the connected Smart TVs we have in the plant, they would display "Connection lost" quite often and miss the refresh. Changing from one tab to another took a long time, even when most tabs only had about 6 to 8 machines on them.

To improve performance, I removed all the node-red-contrib-ui-led nodes and replaced them with text nodes that use Font Awesome icons to simulate lights. I also increased the polling interval to 3000 ms and, instead of reading the status lights, I now read the general status and some critical errors I need. This increased the number of variables I have to read (a few more booleans), but reduced the polling frequency.

Instead of using many text fields on the dashboard, I just used a few and sent them strings with formatted HTML code that used tables to display the information. This allowed me to increase the information density on each group, and only use about 4-5 text nodes per machine instead of 15-16 nodes. This is how it looks currently (the data has been obscured for obvious reasons, but you get the idea):
[image: dashboard screenshot]

Each row is a different text node formatted as a table. I used CSS to apply the card-like borders.
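
For readers who want to try the same approach, here is a minimal sketch of how a function node could build one HTML table string for a single text node. The helper names (`escapeHtml`, `buildStatusTable`), the `machine-card` CSS class, and the sample rows are all hypothetical, not taken from the flow described above.

```javascript
// Hypothetical helper: escape values so stray "<" or "&" in machine data
// cannot break the markup shown in the text node.
function escapeHtml(value) {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Hypothetical helper: turn [label, value] pairs into one table string.
// "machine-card" is an assumed CSS class for the card-like borders.
function buildStatusTable(rows) {
  const body = rows
    .map(([label, value]) =>
      `<tr><td>${escapeHtml(label)}</td><td>${escapeHtml(value)}</td></tr>`)
    .join("");
  return `<table class="machine-card">${body}</table>`;
}

// In a function node this would typically end with:
//   msg.payload = buildStatusTable(rows); return msg;
const html = buildStatusTable([
  ["Status", "Running"],
  ["Panel ID", "A<12>"],
]);
console.log(html);
```

Building the whole table in one pass keeps the number of dashboard widgets (and messages sent to the browser) low, which is the point of the change described above.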

I also started using subflows to standardise things. Basically, for regular automation from the same maker, I have a subflow that looks like this:

The input is connected to each S7 node that polls the PLCs and returns the data I need, outputs 1-5 are connected to the text nodes in each group, and outputs 6 and 7 are used for diagnostics.

You'll notice that I split the Data Tracking nodes (essentially, it's the same database), because the polling is almost synchronous for all machines, and the DB couldn't handle so many requests made at the same time on the same connection. Splitting it like this seemed to fix that particular problem.
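
A different way to reduce the number of simultaneous requests (not what the flow above does) is to collect the records from one poll cycle and write them with a single multi-row INSERT. The sketch below only builds the statement; the function name, table name, and column names are hypothetical, and it assumes the MySQL node in use accepts a query with `?` placeholders plus an array of values to bind.

```javascript
// Hypothetical helper: build one parameterised multi-row INSERT.
// Table and column names are interpolated into the SQL text, so they must
// come from trusted constants in the flow, never from machine data.
function buildBatchInsert(table, columns, records) {
  if (records.length === 0) return null; // nothing to write this cycle
  const rowPlaceholder = `(${columns.map(() => "?").join(", ")})`;
  const sql =
    `INSERT INTO ${table} (${columns.join(", ")}) VALUES ` +
    records.map(() => rowPlaceholder).join(", ");
  // Flatten the records into one value list, in column order.
  const values = records.flatMap((rec) => columns.map((col) => rec[col]));
  return { sql, values };
}

// Made-up sample data for two machines polled in the same cycle.
const batch = buildBatchInsert(
  "tracking_laminator",
  ["machine", "panel_id", "ts"],
  [
    { machine: "M01", panel_id: "P100", ts: "2021-01-01 08:00:00" },
    { machine: "M02", panel_id: "P101", ts: "2021-01-01 08:00:03" },
  ]
);
console.log(batch.sql);
console.log(batch.values);
```

In a function node, `batch.sql` and `batch.values` would then be placed wherever the database node expects the query and its parameters. Whether this helps more than splitting the tracking nodes would need to be measured.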

On a PC, RPi or Smart TV browser, performance is now very similar: the dashboard looks responsive and refreshes often, but from time to time the page looks as if it missed an update: most data disappears, only to be refreshed in the next poll cycle (about 3s later).

This didn't happen before I included the database part, when I was only displaying the data in real time. It reminds me of what happened before with the Pis or the Smart TV browsers, except that I don't get the "Connection lost" message.

Even though it's not a big problem, it is a bit annoying, especially considering I will have to add more information in the next few months. I was even thinking about using Node-RED to handle the Smart Storage units and AGVs in the plant instead of the suppliers' proprietary software, which is terribly buggy, has no documentation (and is only sparsely commented), is extremely demanding on system resources, and performs poorly.

This would put more stress on node-red. If it can't handle so much data effectively, I could have one node-red server to gather information and another one to handle smart storage and AGV solutions, or even two. I could also split the data-gathering tasks in two (or more) servers for different parts of the factory, with all of them writing the tracking information to the same centralised database.

Since I only have a few weeks' experience with node-red, I would appreciate any input/advice regarding performance improvements.

I haven't got any immediate suggestions, but I will point out, in case you didn't realise, that most of the issues are probably in the browser rather than on the machine running node-red (you haven't told us what that is). You could check this by running the browser on a separate PC. Are you running a browser on the machine running node-red?

The node-red server is running in a VM cluster we have at work.

I've tried accessing the dashboard from Chromium-based Edge and Firefox on Windows 10, from Chromium on a Raspberry Pi model B with 4 GB RAM, and from a browser on an LG Smart TV.

Results are almost identical regardless of the browser and device I use to access the dashboard.

Have a look at the VM stats (CPU usage, memory and so on) and see if it is overloaded.
Is the SQL server running in the VM or elsewhere?

The SQL servers I use run on different machines: I gather data from three PCs, each in a different machine, and write the data from those and from the PLCs to a server in our VM cluster that is running Windows.

Node-red is running in a different VM that runs (I think) Red Hat Linux. I will check the VM stats next week, because I do not have direct access to the server and the IT guy who manages it is working from home at the moment.

Will report back as I find out something, thanks for the suggestion.

In the meantime, I'm checking whether it would be feasible to implement a finite state machine in node-red to run the AGVs and smart storage units, because the supplier's C# application running on those PCs is buggier than a plague of locusts.

Apparently, the node-red server is having very high CPU usage:

I'll try to see if I can get an accurate diagnosis. I started having lag after including the database (SQL Express and MySQL) sections, but it might just as well be that all the S7 polling, with multiple variables in different memory blocks, overtaxes it from the start.

Maybe splitting the messages in smaller chunks, passing only the data each subflow needs, instead of shoving the full message to each subflow?
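
A minimal sketch of what "passing only the data each subflow needs" could look like in a function node placed before a subflow. The helper name and the field names are hypothetical; the real ones depend on the variable list configured in the S7 node. Whether this makes a measurable difference here is untested; the idea is only that smaller messages are cheaper for Node-RED to clone when one output feeds several nodes.

```javascript
// Hypothetical helper: copy only the named fields into a new object.
function pickFields(payload, fields) {
  const out = {};
  for (const name of fields) {
    if (Object.prototype.hasOwnProperty.call(payload, name)) {
      out[name] = payload[name];
    }
  }
  return out;
}

// Made-up poll result with more fields than the status card needs.
const polled = {
  running: true,
  alarm: false,
  panelCount: 42,
  recipe: "R7",
  temperature: 181.5,
};

// Only the two booleans travel on to the (hypothetical) status subflow.
const forStatusCard = pickFields(polled, ["running", "alarm"]);
console.log(forStatusCard);
```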

Do subflows tax the CPU more than duplicating the blocks several times?

Interesting project. Subflows vs. duplicating blocks won't make a difference, but how many nodes there are and what type they are will impact performance. The function node, for example, is quite expensive, so any optimizations there will be of immediate benefit. Try to do as much work as you can in one function. Can you show us the code in these function nodes? There might be a lot of optimizations that can be done there.

Edit: if not enough performance can be gained from fixing the function nodes, check out https://flows.nodered.org/node/node-red-contrib-unsafe-function

I doubt very much if minor changes such as subflows or using unsafe-function will make much difference. It is more likely to be something like the database activity that is causing the problem. In your situation I would disable sections of the flow in an attempt to work out what the critical components are, though I know that can often be difficult.

If the database was the problem, the database host would be showing lots of CPU usage, not node-red. Changes to the code can absolutely make a difference; you can have very inefficient code taking up lots of CPU cycles. Don't forget that every invocation of the function node spins up a new Node.js VM, so unsafe functions do make a difference.

@RedShift Regardless of whether it solves your CPU usage issue, one of the strongest recommendations I can make is: if at all possible, reduce the number of separate PLC polls and, if possible, group all values into a contiguous block. This will become especially important if you need consistent data. Under the hood, many drivers don't (or can't) grab values from different areas in one transaction. Essentially, they access each area you request in separate polls. (Disclaimer: I have not studied the S7 protocol or the driver implementation.)

As an example, on another issue I assisted on, the engineer was polling many different PLC areas and was getting strange results in the database. In the end, because the 1st area POLL occurred almost 200ms BEFORE the last POLL, values in the PLC had changed.

Another thing to consider: instead of polling individual bools or bytes, read all data as bytes and parse it on the node-red side. We do this at work and make use of the buffer-parser node to turn the 200 contiguous bytes (that we read in one single fast operation) into the required bits, bytes, ints, uints etc.
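
To illustrate the idea without the buffer-parser node, here is a minimal sketch that parses a contiguous block with plain Node.js `Buffer` methods, as one could do in a function node. The byte layout, field names, and sample bytes are entirely hypothetical. It assumes big-endian byte order, which is what S7 PLCs use; other PLC families may differ.

```javascript
// Hypothetical layout of a block read from the PLC in one request:
//   byte 0     : status bits (bit 0 = running, bit 1 = alarm)
//   byte 1     : unused padding
//   bytes 2..3 : panel counter, signed 16-bit integer
//   bytes 4..7 : temperature, 32-bit float
// Big-endian readers (*BE) are used because S7 stores values big-endian.
function parseBlock(buf) {
  const statusByte = buf.readUInt8(0);
  return {
    running: (statusByte & 0x01) !== 0,
    alarm: (statusByte & 0x02) !== 0,
    panelCount: buf.readInt16BE(2),
    temperature: buf.readFloatBE(4),
  };
}

// Made-up sample: running bit set, counter 42, temperature 42.0.
const sample = Buffer.from([0x01, 0x00, 0x00, 0x2a, 0x42, 0x28, 0x00, 0x00]);
const parsed = parseBlock(sample);
console.log(parsed);
```

Because all fields come from one read, they describe the same instant in the PLC, which addresses the consistency problem mentioned above.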

Two last things:

  • OMRON NX can do MQTT.
  • S7 also has MQTT blocks.

Perhaps you might want to consider pushing data instead of pulling?

Hope that helps - even if only to give more avenues of thought.

That might be the issue. I use a lot of function nodes to process the data gathered there and prepare the HTML I send to the text nodes for display. I also use functions to prepare the SQL strings to send to the MySQL and SQL Express nodes.

This is what a typical machine looks like on the outside. Names are blurred for confidentiality reasons. I gather the data from an S7 node, and use a change node to add some configuration constants to the message. Then it goes into a subflow where I process all the data and send out HTML strings to each text node for displaying the information as cards:

[image: main flow for a typical machine]

This is what the big subflow looks like. Basically, I divert the information depending on which type of machine it is, then process the information with the function nodes that may apply, then format the information into an HTML string, and output that for the text nodes in the main flow:

Regarding the S7 programming:
The machines are still under warranty, so modifying their program at this point is tricky without voiding it. The Taiwanese supplier gave us the PLC program, but its structure is a mess and there are barely any comments. Even for Chinese speakers, the few comments that exist are hard to understand, to the point that only the person who programmed each machine can fully understand it, and other programmers from the supplier don't even want to touch the program.

I know that eventually we'll have to go in deep and probably end up rewriting the whole thing, but this will literally take years because the factory is in production and I can't stop a machine for a few months to fully reprogram it. Next year, I'll try setting up a separate DB (data block) inside the PLC and write all the relevant information there to optimize reads.

Another supplier is using OMRON PLCs and he'll set up his 10+ machines for us, so that's some work that I won't have to do. Yesterday I got another supplier that will gather all the data from his 2 machines using MySQL, so I can just poll once every few minutes, which should not add much load.

Eventually, I'll have over 100 devices in total reporting to this set-up, but the ones that are missing will likely use databases or CSV files to report the data.

No - Only the first invocation instantiates the sandbox - it is then re-used for each subsequent call. Yes there is a small overhead passing parameters in and out - but much less than the initial instantiation. Other calls like jsonata in the change/switch nodes can be just as expensive.
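
The create-once, run-many pattern described here can be illustrated with Node's built-in `vm` module. This is a simplified sketch of the general idea, not the actual Node-RED function node implementation; the sandbox contents and the script body are made up for the example.

```javascript
const vm = require("vm");

// Done once: create the sandbox and compile the "user code".
// This is the comparatively expensive part.
const sandbox = vm.createContext({ msg: null, result: null });
const script = new vm.Script("result = { payload: msg.payload * 2 };");

// Done per message: swap the input in, re-run the compiled script,
// and read the output back. No new sandbox is created here.
function invoke(msg) {
  sandbox.msg = msg;
  script.runInContext(sandbox);
  return sandbox.result;
}

const outputs = [1, 2, 3].map((n) => invoke({ payload: n }).payload);
console.log(outputs);
```

The per-call cost that remains is passing values in and out of the context, which matches the "small overhead" mentioned above.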

Turns out we're a bit puzzled about the CPU usage... I had updated some nodes last week (including the dashboard), and apparently the IT manager hadn't rebooted the server yet (I don't have direct access).

After rebooting the server, the CPU usage went back to 8-12%. Looks like it could be some side effect of the update not being applied yet, because I didn't do any modification to the flows.

I have a somewhat similar setup. I monitor 20 "machines" at 2sec intervals. I then run that data through quite a few functions and output results. I am using a combination of web sockets and usb-serial interfaces. I am not storing the data for any purpose as the flows are acting solely as a central control hub. Running on a Pi 3B the typical cpu usage is less than 1%.

At one point in the past I had a memory runaway issue and the machine started thrashing, which produced symptoms similar to what you were seeing. Next time the CPU is maxing out, it would be good to run the "top" command to see which processes are responsible and what the disk I/O looks like.