Modbus TCP losing connection to clients, does not reconnect

Hi All,

I know this is not a new issue; I have read about it here before, but all the topics and references to GitHub issues seem to get closed without any solution as far as I can see. I wonder if anyone else here has had similar issues.

I am reading data from a Modbus TCP device using the node-red-contrib-modbus node, and I notice that from time to time it loses the connection to the client. The getter nodes are still triggering, but nothing is read from the Modbus master.
I tried sending the reset command to the queue, but it does not help. I need to re-deploy the flow to make it work.

I also noticed that this usually occurs after a network outage, for example when my ISP does something on the router, like a firmware update, and the network is lost for a few minutes.
I have seen references suggesting that the TCP error handling may not be implemented correctly, but those references are from 2020 and 2021. Could it be that this has not been fixed for more than a year?

Anyway, I am interested if anyone has similar issues here.

Regards,
Csongor

What version of node-red and node.js are you using? For node.js use
node -v
What version of the modbus node? Look in Manage Palette. Upgrade it if not on the latest.

I have to admit, I am not on the latest version of anything:
Node: v14.18.2
NR: v2.1.3
node-red-contrib-modbus: v5.16.0

Do you know which one is the issue, or is it rather a case of upgrading everything to the latest version?

I would start with the modbus node version. However, you might think it a good time to upgrade to nodejs 16 or 18 and the latest node red at the same time. Also check all your other nodes are on the latest. The further behind you let your system get the more problems you may encounter doing an upgrade.

OK, let me upgrade the full stack tonight. Probably a general Linux update is due as well. I hope nothing gets broken, I have too much stuff running in NR now.
I may be able to simulate broken connection by just unplugging the network cable from the PLC.
I will be back.

As long as you do not overwrite your existing SD card, then if it is a disaster all you have to do is plug it back in again. I do hope that you already have backups of everything important; remember, SD cards (and discs and computers) can go up in smoke, literally or metaphorically, at any moment.

About a year ago I moved from an RPi to a Dell Optiplex micro PC (i5, 8GB RAM, 256GB SSD). It is a lot faster but still feels like a Pi.
I do back up my node-red folder to my NAS, but maybe I should do a full backup of the hard drive? I am just not sure how to do that in Linux.

I use clonezilla for that. You would need to boot off a clonezilla live USB/DVD. I don't generally back up the whole disk though, just my home directory. My assumption is that re-installing the OS should not be that big a deal, and in fact it is sometimes a good idea, to get rid of the crud that seems to accumulate over a few years.

OK, what difference does clonezilla make compared to just using a regular copy to back up my home directory?

Nothing. I said I don't generally use it, but if, for example, you want to change the disk in a laptop to a larger one, or to an SSD, then clonezilla is a good way of doing that. Also I do use it for making image backups of my Pi SD cards, as they are rather less reliable than discs. Though modern ones are much better than they used to be.

Thanks, I did the NR update to the current 3.0.2 version and updated all the nodes as well. Let me run this for a few hours and I will try to simulate a network outage.

It is still not working.
I unplugged the network cable from the PLC (Modbus TCP), waited for 5 minutes, and after plugging it back in it was not able to resume communication.

And the following warnings are coming from the configuration node:

sequential dequeue command not possible for Unit 1
no sending is allowed for Unit 1
valid Unit 1
Saia PLC serial sending allowed for Unit 1

And these messages continue. Once every second.

I kind of understand these messages, and I understand that it keeps trying, but why can it not recover from it?
I was thinking that maybe it is because all the messages have piled up in the queue, but even when I send a msg.resetQueue = true message nothing happens.

Any ideas?

What is it that's triggering your MODBUS? Do you have a timed inject node with the message or something? I have several MODBUS modules I run on a five second refresh rate. I can unplug any of them for long lengths of time, and have done so on numerous occasions, and they all pick right back up with no issue.

If you can, share your flow or that part of your flow with us so we can see what's going on. Like I say, I can literally duplicate what's happening to your module and not get the same result so I don't know what to point to yet.

But so as not to make a fruitless post, I'll venture this. Not a lot of people use the flex-getter module. It is actually where the developer has spent most of his recent bug-fixing and improvement efforts, according to the posts I've seen, and it's a lot more robust and capable of handling issues. Also, it sounds like you have multiple getter nodes firing into the same MODBUS unit.

The problem that can arise from having several getter nodes is that the traffic can collide on the MODBUS system and cause issues. That's less of a problem over TCP, because of its full-duplex nature, than it is over serial connections that don't have it, but it can still happen and cause problems. The flex-getter has a built-in queue, so you can feed several simultaneous queries into it at once and it will process them one at a time without the collisions. That may solve your problem: just switch from several getter nodes to one flex-getter and have all your queries piped into that.
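
If it helps, this is roughly what a query going into a flex-getter looks like from a function node. The property names follow the flex-getter's documented input format, but check the help text of your installed version; the values here are only placeholders:

// Hypothetical request for a Modbus-Flex-Getter; the values are placeholders.
msg.payload = {
    fc: 3,         // function code 3 = read holding registers
    unitid: 1,     // Modbus unit id of the device
    address: 0,    // first register to read
    quantity: 10   // number of registers to read
};
return msg;

Each of your timed injects can feed a function like this into the same flex-getter, and its internal queue will serialise them.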

Give that a try. And if it doesn't work, post what you can and we'll go from there.

Yes, yes, yes, it seems to be working.

I had 3 getter nodes reading holding registers from different ranges every 10 seconds, another getter reading 1000 coils every 2 seconds, and one more reading a holding register every 15 minutes.
I have replaced all of these with a flex-getter.

I did the same test (unplug the plc, wait and plug it back), and indeed it does restart polling without any issues.

I also update a few registers much less frequently, and for that I used flex-write and also the regular write node. Since those do not have an internal trigger, I assume it is OK to leave modbus-write as it is and not change them to flex-write. Or what do you think?

That can get tricky. While the flex-getter and flex-write nodes both have internal message queues, they don't interface with each other to make the queues sequential between the two of them. If a write sends a message while a read is still processing, it could collide and cause an error where neither the write nor the read completes.

There is a node called node-red-contrib-simple-message-queue that has helped with mine and others' MODBUS implementations in the past. You'll end up with a setup like this when you use it:


Whatever is currently sending a message to a MODBUS node will instead send it to the simple queue. This includes reads and writes. You will want to set up some kind of flag in the messages to distinguish what they are (read or write), like setting msg.topic = "read" or something. Configure the simple queue to send the first message through so that the process can start normally.
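
For example, a small function (or a change node) in front of the queue could set that flag; "read" and "write" here are just labels this flow agrees on, not anything the MODBUS nodes require:

// Tag the request so the diverter further down knows which node gets it.
msg.topic = "read";   // use "write" in the branch that builds write commands
return msg;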

Once the messages start entering the queue, the queue will send them one at a time through the diverter function. This is where you'll figure out whether they're a read or a write and send them to the appropriate node:

// Diverter function with two outputs:
// output 1 -> the Modbus read node, output 2 -> the Modbus write node.
if(msg.topic == "read"){
    return([msg,null]);
}
if(msg.topic == "write"){
    return([null,msg]);
}

The message will then be processed on the MODBUS channel, depending on whether it's a read or a write. When that is done, the result is sent to the trigger sender. The main purpose of this function is to send the {trigger:true} property back to the simple queue so it releases the next available message. You can also use this function to do any post-message processing or diverting if necessary, such as discarding the results of the write messages so they don't mess up the nodes down the line (which I would recommend). But its main job is to send the trigger back to the simple queue. At its simplest, this is the least the function needs to contain:

// Output 1 -> back to the simple queue (trigger), output 2 -> downstream nodes.
return([{trigger:true}, msg]);

Adding the discarding of write results would look like this:

if(msg.topic == "write"){
    // Drop the write result, just release the next queued message.
    return([{trigger:true}, null]);
}
if(msg.topic == "read"){
    // Pass the read result on and release the next queued message.
    return([{trigger:true}, msg]);
}

This will ensure that no matter when messages are sent to the MODBUS channel, whether they're read or write, they won't hit the channel at the same time because you have a gatekeeper watching to make sure something goes out the other side before something else enters.

Hopefully that helps any future problems before they happen. Glad it's working for you!

Thanks a lot for this. I used the same concept, but I wrote my own message queue in a function node. When you are continuously querying the device, I think there is no point in storing the same read or write message multiple times; it is enough to keep only the latest one. This is what I have done in my custom code.
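
Roughly, the idea looks something like this in the function node (a simplified sketch rather than my exact code; the context keys and the busy flag are just for illustration):

// Keep-only-the-latest queue in a function node.
// New requests are keyed by msg.topic, so a newer request for the same
// range overwrites the pending one instead of piling up.
// The Trigger Sender feeds back msg.trigger = true when a request has completed.
const queue = context.get("queue") || {};
let busy = context.get("busy") || false;

if (msg.trigger === true) {
    // Feedback from downstream: the previous request has completed.
    busy = false;
} else {
    // New request: remember only the most recent one per topic.
    queue[msg.topic] = msg;
}

let out = null;
const keys = Object.keys(queue);
if (!busy && keys.length > 0) {
    // Send the next pending request and mark the channel as busy.
    out = queue[keys[0]];
    delete queue[keys[0]];
    busy = true;
}

context.set("queue", queue);
context.set("busy", busy);
return out;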

I am only concerned about one thing: I have implemented the same feedback loop as in your example above (Trigger Sender). But this relies on the flex-getter or flex-write remembering the last request and sending its result when the device is available. If for any reason that information is lost and the getter/write never emits a message, the Trigger Sender will never instruct the message queue to send out the next message.
I don't know if this scenario exists or not.

Therefore I am thinking that this flow may need a fallback, where it keeps re-sending the last message every minute? But that may fill up the queue for no reason.

What do you think?

If you need to queue messages and, if necessary, retry until they are successfully sent then have a look at this flow which is designed for exactly that situation. All you need to do is to be able to feed back a success or failure message and it will do the rest. The example shows it being used for sending emails, but the principle is the same.
https://flows.nodered.org/flow/05e6d61f14ef6af763ec4cfd1049ab61

This is good for efficiency when you're looking at a program from a scalability perspective. When you may be handling thousands of queries a second (e.g. search engine traffic), you want to squeeze every drop of efficiency out where you can, and eliminating duplicates is a good way to enhance the flow. If the scalability isn't required, though, simplicity can become the more important focus, which means putting in whatever works the easiest so it can be maintained more simply. I can't make the call on your program, but it sounds like you have an idea where it needs to go. Since you're looking at eliminating the extra traffic, let's work with that.

The TCP/IP protocol uses a scheme you might find helpful in this situation. When TCP sends information from something like a file, it flags each packet with a sequence number, something like packet 1 of 5. When the receiver gets the packet, it simply sends back that number as the acknowledgement that the packet was received. When the sender gets the number, it knows the packet arrived and can send the next one, or it resends the packet if it isn't acknowledged. Perhaps you could implement something similar in your case? Resetting the count whenever the queue is empty and all messages have been handled will help you keep the count low.

A setup I can imagine is that you have a queue of messages going into your function. You hold them in your incoming queue, each with a number. Whenever a message is transmitted, it goes into a transmitted state (or think of it as a transmitted queue with one slot). If the transmission is successful, it goes into an outgoing queue or is deleted. When all queues are empty, reset the counter and wait.
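
A rough sketch of that idea in a function node might look like this (all the property and context names below are invented for illustration, not anything the MODBUS nodes provide):

// Numbered queue with acknowledgements; msg.ack and msg.seq are made-up names.
// New requests arrive without msg.ack; acknowledgements carry msg.ack = <number>.
let seq = context.get("seq") || 0;
const pending = context.get("pending") || {};

if (msg.ack !== undefined) {
    // Acknowledgement received: drop the matching pending message.
    delete pending[msg.ack];
} else {
    // New request: stamp it with the next sequence number and park a copy.
    seq += 1;
    msg.seq = seq;
    pending[seq] = msg;
}

// When everything has been acknowledged, reset the counter and wait.
if (Object.keys(pending).length === 0) {
    seq = 0;
}

context.set("seq", seq);
context.set("pending", pending);
// Forward new requests; acknowledgements stop here.
return msg.ack === undefined ? msg : null;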

The only thing I have to point out is that the only queries you're concerned about are your writes. If a read errors, it will just be updated on the next read, so the only information lost is for that time period. If a write errors, it may make the difference in functionality you need. That missing write can mean the difference between successful control and disaster.

It's your code so you can make the call on what you do. But my suggestion would be to keep the simple queue setup and feed back your write status. If the write was successful, delete the write command. If the write was unsuccessful, resend it. You've mentioned you don't anticipate the writes to be frequent, so handling write errors will be easy without the queue being very complex. And even if you had a big problem with writes, infrequent write commands would not fill a queue very fast before you could implement a fix to handle it.
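
As a sketch of that feedback, assuming the diverter stashes the outgoing write somewhere (flow context here) and that a failure can be detected, for example via a catch node setting msg.error; both of those are assumptions you would have to wire up yourself:

// Write-status feedback, sketch only. Sits on the write result path.
// Output 1 -> simple queue trigger input, output 2 -> simple queue data input (resend).
const lastWrite = flow.get("lastWrite");

if (msg.error && lastWrite) {
    // Failed write: release the next message and re-queue the failed command.
    return [{ trigger: true }, lastWrite];
}

// Successful write (or nothing to resend): forget it and release the next message.
flow.set("lastWrite", null);
return [{ trigger: true }, null];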

Keep it simple and it will be easier to make work in the long run.

Start with the Modbus node version, in my opinion. However, you might feel that this is a good moment to simultaneously upgrade to nodejs 16 or 18 and the most recent version of node-red. Make sure all of your other nodes are updated as well. The farther behind your system gets, the more difficulties you could have upgrading.

Thanks @jackharry. I am up to date with the exception of Node.js. I used the node-red update script, and I believe it used to update Node.js as well, but this time it did not.

But just today, I lost connection with my Modbus master again, and this time there was no network issue that I could notice. What I can tell is that my scheduler was waiting for the trigger feedback from the flex node confirming that it had executed the last request, but it never came.
When I sent the trigger manually, the communication came back up immediately. So I think I need to implement some sort of retry in my setup.