Trouble with unreliable modbus-flex-getter

Roll back more than 2 years' worth of updates? Sounds risky! It would be a long day going through all the updates to see what I'd be trading away.

I have loads of checks on the msg content to determine whether there's an error. For example checking msg.error, as you point out, but I prefer:

const errors = [];

if (msg?.payload == null || msg?.payload?.length === 0 || msg?.payload === "") {
  // payload should be a non-empty array or string
  errors.push("Empty payload.");
}

if (msg?.error != null) {
  // error object exists in the output
  if (msg?.error?.message) {
    errors.push(msg.error.message);
  } else {
    errors.push(JSON.stringify(msg.error));
  }
}

if (errors.length > 0) {
  throw new Error(errors.join(" "));
}

// the original msg is carried through the modbus node in msg.topic
const output = msg.topic;

output.payload = msg.payload;
output.topic = msg.retryConfig.topic;

// clean up
delete output.retryConfig;
delete output.logMessage;
return output;

A bit overkill here, but it's the same way I do it for other I/O nodes, which sometimes don't produce any error at all but instead just spit out an empty payload.

How do I retry without complex logic and without throwing the exception to a catch node? I did find the retry node in the library, so that could simplify things. But since I've effectively built the same thing myself, I get more fine-grained control.

Before developing the guaranteed-delivery node I attempted to do the same thing with a subflow, and found it impossible to avoid race conditions that could mess things up. The sort of situation that is difficult to handle with a number of separate nodes is when messages arrive at the input at unexpected times.

I haven't looked in detail at your flow, but imagine a message is passed in and some sort of error occurs; this is passed to a node that decides that the message must be retried, but just before it is sent back to the beginning to be retried, another message arrives at the input. Confusion can reign. Perhaps you have handled that condition, but there are numerous other timing issues and I was unable to find a flow that coped with them all, other than using semaphore nodes. It just got more and more complex to handle.

Putting all the retry handling in one node (either a function node or a contrib node) is, I believe, the only way to guarantee reliability. Hence I developed the guaranteed-delivery node.


I hope not to encounter any async or race condition problems. The mechanics are really separate. It starts with guaranteed output.

This is what I do before sending it to modbus (unsafe I/O node):

  1. Get an object from context (or create it the first time if it doesn't exist).
  2. Generate a unique id and store it in the msg and in the context obj as a key. The value is the current timestamp.
  3. Send 2 copies of the msg: one to modbus, one to a wait node.

At this point, modbus can handle the request with its own timeout. Meanwhile, the shadow copy waits in a delay node and eventually goes into a function node that adds the error message "timeout - msg lost".

Finally, both paths meet:

  1. Get the object from context (it must already exist at this point).
  2. Look up the unique id from the msg in the context obj.
  3. If it exists, this message is the first to arrive, so delete it from the context obj, then pass along the msg.
  4. If it doesn't exist, the other copy has already arrived, so return null.

Do you think this use of keys on a context object could run into race conditions?

The retry mechanic is much simpler and doesn't use context at all. The incoming message clones itself into a backup property on the msg and initializes an attempt/retry counter. After the request, if there is any indication it failed, it throws an exception. The catch node then sends it to a function node, which tests whether the attempt/retry counter has reached the max. If it has, route the msg to error. If not, reapply the backup of the original msg, increment the attempt/retry counter and send it to the retry route with a delay for good measure (roughly like the sketch below).
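To make that concrete, here's a minimal sketch of that catch-side function node with two outputs (1 = retry, 2 = give up). The property names backup, attempt and maxAttempts under msg.retryConfig are illustrative assumptions, not the exact names from my flow:

// Catch-side retry logic - a sketch only, not the exact code from my flow.
// Assumed layout: msg.retryConfig.backup holds a clone of the original msg,
// msg.retryConfig.attempt / msg.retryConfig.maxAttempts hold the counters.
const cfg = msg.retryConfig || {};
const attempt = cfg.attempt || 0;
const maxAttempts = cfg.maxAttempts || 3;

if (attempt >= maxAttempts) {
  // max retries reached: route to the error output
  return [null, msg];
}

// reapply the backup of the original msg and increment the counter
const retryMsg = RED.util.cloneMessage(cfg.backup || msg);
retryMsg.retryConfig = Object.assign({}, cfg, {
  backup: cfg.backup,
  attempt: attempt + 1,
  maxAttempts: maxAttempts
});
delete retryMsg.error;

// output 1 loops back to the modbus request, via a delay node for good measure
return [retryMsg, null];

The delay node on the retry wire is the "for good measure" part; it also stops a failing request from hammering the modbus node in a tight loop.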

  1. Register incoming messages:
function getTrackedMessages() {
  let trackedMessages = flow.get("trackedMessages", "memory");
  if (!trackedMessages) {
    trackedMessages = {}; // initialize (if doesn't exist)
    flow.set("trackedMessages", trackedMessages, "memory");
  }
  return trackedMessages;
}

msg.topic.retryConfig.trackingId = RED.util.generateId();
const trackedMessages = getTrackedMessages();

trackedMessages[msg.topic.retryConfig.trackingId] = new Date().getTime();
return msg;
  2. Set error (on shadow copy, after delay):
// prepare failed msg
delete msg.payload;
msg.error = {
  message: "ERROR: Request timeout - msg lost in modbus node."
};
return msg;
  3. Register outgoing messages:
const trackingId = msg.topic.retryConfig.trackingId;
if (!trackingId) {
  node.error("ERROR: Message without tracking id!", msg);
  return;
}

const trackedMessages = flow.get("trackedMessages", "memory");
if (trackedMessages[trackingId]) {
  // first message arrives
  delete trackedMessages[trackingId];
  delete msg.topic.retryConfig.trackingId; // clean up
  return msg;
}

// msg already passed
return;

What happens if another modbus request arrives at the start before the first one has been completed?

If 2 msgs are handled at the same time by the modbus flex getter? They have unique IDs. When one is completed, it goes to "register outgoing messages". Here we check if it exists in the trackedMessages object:

if (trackedMessages[trackingId]) {
  // first message arrives
  delete trackedMessages[trackingId];
  delete msg.topic.retryConfig.trackingId; // clean up
  return msg;
}

The next message has a different trackingId. I'm not 100% sure there won't be a race condition, but at least it shouldn't be across different trackingIds?

Last week, thanks to community feedback in this discussion, extensive debugging, attempts at understanding the documentation and a deep dive into the source code, it became apparent that the modbus flex getter by design doesn't always produce an empty message on failure. Further, it doesn't always throw an error. So there is no way to detect or catch failures without adding extra logic around it.

Today, I noticed something else baffling about the modbus flex getter. Steps to reproduce:

  1. Add config node (as normal).
  2. Add flex-getter (as normal) using the config node.
  3. All works ok.
  4. Add a 2nd flex-getter, also using the same config node.
  5. Disable it.
    Result: Now the 1st modbus flex-getter is dead!

It turns out that when you disable/remove one flex-getter, it also disables the config node, even if it is used by another flex-getter! The shared config node is dead. This of course is not visible anywhere in the editor; you don't even receive any output. Normally this won't happen, as we only use one flex-getter per config node, but it happened during debugging/development/testing.

So now we have to be super careful when adding, removing, disabling or configuring flex-getters?!?

Are you using Serial or TCP modbus?


TCP modbus, yeah. Or rather, TCP to a dongle converting to serial RTU. We have another device with modbus TCP and that runs flawlessly.

I still don't understand the Optionals configurations:


Are the Show/Log options a matter of stdout/stderr versus the NR debug side panel? Which takes precedence, the config node or the flex getter?

  • Show Activities: Displays node status
  • Show Warning: ?
  • Show Errors: ?
  • Log failures: ?
  • Log states changes: ?

In the past I've been caught out with message collisions and contention for the serial port on serial modbus, largely due to the relatively slow speed, and having logic spread across multiple threads/flows. The same thing would have worked flawlessly over modbus TCP.

The solution was to carefully sequence requests so I didn't ever send a request until the previous response was either received or timed out.
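For what it's worth, a minimal sketch of that kind of gate in a function node (the context key names and the msg.release flag are made up for illustration; the response path, or the timeout path, has to loop back into this node with release set):

// Single-request gate - a sketch only, not production code.
// Requests come in normally; the response (or a timeout msg) must be fed
// back into this node with msg.release = true to let the next request out.
let queue = context.get("queue") || [];
const busy = context.get("busy") || false;

if (msg.release) {
  // previous request finished (or timed out): send the next one, if any
  if (queue.length > 0) {
    const next = queue.shift();
    context.set("queue", queue);
    return next;              // gate stays busy for the next request
  }
  context.set("busy", false);
  return null;
}

if (busy) {
  // a request is already outstanding: park this one
  queue.push(msg);
  context.set("queue", queue);
  return null;
}

// nothing outstanding: let this request straight through
context.set("busy", true);
return msg;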

Just saying, it's something worth checking. I understand your gear is old and maybe less reliable, and you do need it to handle errors gracefully, but generally RS485 is pretty solid, especially at lower baud rates.

I realize this thread is getting a little old but just wondered if the issues I am having are in any way related...

I had a 4 port USB/RS485 module, a Waveshare, connecting 3 separate devices: a PLC running at 19200, a meter at 9600 and an IO module also at 9600.
Additionally there is an inverter which is Modbus TCP.

Initially I was just polling the lot every second, for basic data, and not worrying if I needed to send something to the PLC, assuming that the Flex Nodes would handle the queuing.

That wasn't stable, I assumed due to the USB device, so I dropped the poll frequency and sequenced them manually.

I still wasn't worrying about occasionally sending stuff to the PLC, because of the queues, but it became obvious fairly quickly that all was not well, and still isn't...

When first deployed all looks OK; there would be the very occasional fail, but generally stuff works almost all the time. However, depending on the poll rate, and remembering that stuff is still sequenced within the flow, after an hour or two the fail rate on all devices would start creeping up, and once it did it would go up exponentially until nothing was talking.

At which point the only way to restore comms was to restart Node-RED. Disabling all Flex Nodes and re-enabling them didn't work; disabling/enabling flows didn't work either.
The rest of the flows looked just fine, and looking at the OS, a Raspberry Pi 5 running Pi OS from an SSD, there didn't appear to be any issue with processor load or memory.
And yet the issue feels like a memory leak.

I swapped out the 4 port USB serial converter for 3 separate units, on separate USB ports.
That helped with the PLC error rate somewhat, the original worst offender, so perhaps port contention was a contributory factor, but the time it takes to 'lock up' hasn't changed much; it's just that the errors show up on the grid meter first now.

I am now polling all 4 getter nodes at the same time, albeit only every 2 seconds, and stopping the poll to the PLC if I need to write something, and it is still locking up.

I haven't posted code because it is a mess after so much mucking about. I think I have tried everything obvious, including only ever doing one thing at a time to a single device on its own USB port, and nothing stays stable.
Less traffic results in fewer errors, so a longer time before it needs to be restarted, but after fiddling with everything I can think of for well over 6 weeks I am unable to get it stable.

The really odd thing is that I have another system elsewhere, talking to several Modbus devices: 2 PLCs, 4 grid meters and 4 inverters, the latter 4 being RTU whilst the rest are TCP.
I only have one issue with that system, which is a single getter latching up at initialise every few weeks, and that getter is reaching out over the web to a remote TCP/RTU converter to read a meter on the other side of the estate.
I don't see the same degradation over time, and the one node that I haven't been able to get properly stable can be 'recovered' by disabling/enabling it.

Were it not for the problem node, with the TCP/RTU converter on the other system, I would be swapping out USB/RTU to TCP/RTU on the newer system that is locking up.

I am utterly at a loss as to where to go with it, and it is driving me daft now as well as making me look somewhat inept...

Anyone have any thoughts? Is there a simpler Modbus node I can use, even if I have to build a flow to handle queues and errors?

Lastly, I use the basic getter (no inbuilt queuing) and have 3 of them utilizing separate USB ports. Should I still avoid traffic on more than one of them at a time, or can I use them in parallel?

FYI.
I have asked about the 'Flex getter stuck initializing' thing before... Didn't find an answer!

Odd issue with Modbus Flex Getter - General - Node-RED Forum

Thanks for looking,
Al

Thanks for sharing your experience. We had one modbus USB adapter that smelled burnt, haha, so also use your nose when troubleshooting. One time we had 2 devices that failed 99% of the time, but when using separate modbus adapters they were stable. Perhaps due to cheap Chinese hardware?!? We're getting all sorts of problems and different behaviors, not just with this but with other protocols too, both hardware and software (including Node-RED palette community nodes).