Client disconnects but status still shows connected

Hi, I'm having some problems with MQTT clients that disconnect from time to time. I have read several topics, but honestly I still don't know why this happens and I cannot figure it out. I have several checks in place to see what's going on, but nothing so far.
This is the situation.

  • I have node-red running on two Raspberry Pis. Each of them has a different MQTT client
  • Clients send messages at intervals ranging from one second to one minute (depending on the topic)
  • The Mosquitto broker runs on a Debian machine (TLS connection)
  • On the Debian machine, another node-red instance receives the MQTT messages and processes them. Here I can also check the birth/will/close messages related to the clients on the Pis.

I have three kinds of problems, and I don't know if they are all related to each other. These situations don't happen regularly, and not all together or in the same order:

  • The client on the Raspberry Pi closes its connection, without any apparent reason
    2023-04-17 16:12:16: Client Local_2 closed its connection --> broker's log
    However node-red is still running and in its log I don't see any trace related to the mqtt node. What is strange is that in this situation node-red still displays the MQTT node as connected. I can confirm it because I've put a status node with a function node saving the status in context, and if I check it, it says connected. But, since the MQTT settings have birth, will and close messages, the node-red running on Debian receives the close message of this client

  • The broker closes the client connection (Raspberry Pi client)
    2023-04-17 14:11:40: Client Local_2 has exceeded timeout, disconnecting --> broker's log
    In this situation I can see the MQTT node activity in the node-red log, but I don't understand why this happens

17 Apr 14:52:04 - [info] [mqtt-broker:Client] Disconnected from broker: Local_2@mqtts://www.example.comt:12345
17 Apr 14:52:20 - [info] [mqtt-broker:Client] Connected to broker: Local_2@mqtts://www.example.comt:12345
  • On the Raspberry Pi, after some notifications like this one

(node:5151) MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 1001 drain listeners added to [TLSSocket]. Use emitter.setMaxListeners() to increase limit

node-red stops running, exiting with a fatal error:

<--- Last few GCs --->
[5151:0x4d0cee0] 63827198 ms: Mark-sweep (reduce) 318.7 (360.6) -> 318.7 (352.8) MB, 195.2 / 0.1 ms  (average mu = 0.993, current mu = 0.001) external memory pressure GC in old space requested
[5151:0x4d0cee0] 63827408 ms: Mark-sweep (reduce) 318.7 (352.8) -> 318.6 (352.3) MB, 210.3 / 0.1 ms  (average mu = 0.984, current mu = 0.001) external memory pressure GC in old space requested
<--- JS stacktrace --->
FATAL ERROR: v8::ArrayBuffer::NewBackingStore Allocation failed - process out of memory
nodered.service: Main process exited, code=killed, status=6/ABRT
nodered.service: Failed with result 'signal'.

This is the client configuration on the Pi
[image: MQTT client configuration]

What version of Node-RED and Node-JS are you running?

Why do you have such a HUGE keep alive? Set this to something sensible like 30s ~ 60s

How are the PI devices connected to the network - WiFi? Is it reliable?

Do you have a unique clientId set for EVERY MQTT client connection? (That is, for each "broker config" on every node-red instance and for every client in your network, whether it is a NodeJS thing, a smart thing, another PC running MQTT Explorer, etc.)

Have you tried MQTT V5? The broker provides much richer information for disconnection etc.

Thanks @Steve-Mcl.

  1. Node-RED is v3.0.2, Node.js v16.19.0, Mosquitto v2.0.12
  2. This huge keep alive is just a test, since even with the default 60s the client gets disconnected. Studying how the keep alive mechanism works, I hoped this would prevent the broker from disconnecting the client if it doesn't publish anything for a few minutes (which can happen). From what I've understood, if the broker doesn't receive anything for one and a half times the keep alive, it disconnects the client.
  3. All devices are connected via cable. They are in different places, so not on the same LAN
  4. Yes correct, every node-red instance (MQTT client) has its own MQTT credentials and its own client ID as well. I also have MQTT Explorer for further checks, with its own different configuration
  5. All clients use MQTT V5

Now I have added an alert so that the main node-red (the one on the same Debian machine where Mosquitto is installed) informs me as soon as possible when something goes wrong, so that I can immediately check the logs, if this can help.

Not quite. There is an internal PING (an MQTT ping packet) that keeps the connection alive. If this (or other traffic) is not seen, the connection is terminated. I.e. you don't have to send or receive messages to stay connected, the internal PING does that.

Fwiw: I once had a minimal PLC MQTT client that did not publish pings and in that case I had to manually publish once every 30s but that was a basic implementation - node-red uses mqtt.js which DOES do mqtt pings under the hood.
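For reference, this is roughly how the options and events look in a standalone mqtt.js sketch (only an illustration; the hostname, port and clientId below are placeholders, not your settings):

// Minimal mqtt.js sketch - the same library node-red uses internally.
// Hostname, port and clientId are placeholders.
const mqtt = require('mqtt');

const client = mqtt.connect('mqtts://broker.example.com:8883', {
  clientId: 'Test_Client_1',   // must be unique across every client on the broker
  protocolVersion: 5,          // MQTT v5
  keepalive: 60,               // seconds; mqtt.js sends PINGREQ automatically
  reconnectPeriod: 5000        // ms between reconnect attempts
});

client.on('connect', () => console.log('connected'));
client.on('close', () => console.log('connection closed'));
client.on('offline', () => console.log('client went offline'));
client.on('error', (err) => console.log('error:', err.message));

With no other traffic you should see a PINGREQ/PINGRESP exchange at least once per keepalive interval in a wireshark capture or in a verbose broker log.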

If you suspect this is the issue, you should wireshark the connection (or if your broker has verbose logging), look for PINGREQ and PINGRESP messages.

There is a reasonably straight forward description here: MQTT Keep Alive Interval Explained With Examples

Thanks, yes, I have already read this explanation more than once. The problem is that I cannot understand where these problems of mine come from :frowning:

  • Which log level do you suggest for node-red? I know how to change it, and it is currently set to trace.
    Mosquitto runs on the server. How can I set its log to verbose?

Edit mosquitto.conf

If you see # log_type all or log_type info or log_type error, then change that to log_type all, save, and restart mosquitto.
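For example, the logging section might end up looking like this (paths assume the default Debian package layout):

# in /etc/mosquitto/mosquitto.conf (or a file under /etc/mosquitto/conf.d/)
log_dest file /var/log/mosquitto/mosquitto.log
log_type all
log_timestamp true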

Ok, let's try to solve one problem at a time.
This is the node-red log. I suspect it is due to the MQTT node because it seems to be related to TLSSocket, which the MQTT connection uses. But I'm not able to identify and solve the problem.

19 Apr 04:26:09 - [info] [mqtt-broker:Client] Connected to broker: Local_1@mqtts://www.myexample.com:12345
19 Apr 08:02:44 - [info] [mqtt-broker:Client] Disconnected from broker: Local_9@mqtts://www.myexample.com:12345
19 Apr 08:03:16 - [info] [mqtt-broker:Client] Connected to broker: Local_1@mqtts://www.myexample.com:12345
19 Apr 11:20:00 - [debug] [cronplus:5 HistSave] running 'schedule1' ~ 'topic1'
 now time Wed Apr 19 2023 11:20:00 GMT+0200 (Ora legale dell’Europa centrale)
 crontime Wed Apr 19 2023 11:20:00 GMT+0200 (Ora legale dell’Europa centrale)
(node:626) MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 1001 drain listeners added to [TLSSocket]. Use emitter.setMaxListeners() to increase limit
(Use `node --trace-warnings ...` to show where the warning was created)
19 Apr 11:30:00 - [debug] [cronplus:5 HistSave] running 'schedule1' ~ 'topic1'
 now time Wed Apr 19 2023 11:30:00 GMT+0200 (Ora legale dell’Europa centrale)
 crontime Wed Apr 19 2023 11:30:00 GMT+0200 (Ora legale dell’Europa centrale)
<--- Last few GCs --->
[626:0x5d6fee0] 28598558 ms: Mark-sweep 233.9 (261.6) -> 232.1 (261.9) MB, 228.2 / 0.1 ms  (average mu = 0.548, current mu = 0.192) allocation failure scavenge might not succeed
[626:0x5d6fee0] 28599552 ms: Mark-sweep 234.2 (262.1) -> 232.1 (262.1) MB, 276.6 / 0.1 ms  (average mu = 0.663, current mu = 0.722) allocation failure scavenge might not succeed
<--- JS stacktrace --->
FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory
nodered.service: Main process exited, code=killed, status=6/ABRT
nodered.service: Failed with result 'signal'.

Have you got multiple broker config nodes for the same broker in one instance of node-red? Normally you should only have one config node for each broker. Multiple MQTT nodes can share the same broker.

Thanks Colin. This is exactly what I do. I have just one broker configuration shared by the mqtt in/out nodes


Is there a way I can track down this detected memory leak?
I don't have any http nodes, so the TLSSocket makes me think of MQTT

I don't understand how you have got two different client ids

Every 10 mins you have a schedule named 'HistSave' - what does this do?

I assume you are "collecting" values in memory then dumping them to a database at the 10m mark?

If so, how much data is being collected? And where are you saving it (and how)?

Sorry, my mistake, I edited some sensitive data and mixed up Local_1 and Local_2... I'm sorry. Yes, there is just one client, as I show in the image

I think the fact you are crashing node-red is the first step to solving your issues.

As I asked before:

Yes @Steve-Mcl I agree, thanks. Sorry, I was replying to you but I was in a hurry and couldn't finish it.
It works like this: data are sent directly to MQTT as they become available, with qos=0. The interval ranges from one second to one minute, it depends. I have a status node for the MQTT node, so if it's not connected, data are temporarily saved in context. Every 10 minutes I make a check and, if there is still no connection, I save this context data to a csv file and clear the context memory. When the MQTT client is connected again, first I send the data present in context, and then these csv files. The files are sent one by one once the context is empty, and in order to try to avoid overload or connection speed problems I also keep a count, so a file is sent only after every 20 messages sent on the fly (roughly as in the sketch below).
Dimension: data sent on the fly are around 5-10 kB. A file can reach 300-400 kB. I noticed that this problem mainly happens on just one Raspberry Pi (both of them are identical), so I even suspect it could be an internet connection problem, since they are in two different places. But node-red shouldn't crash, so there must be something else
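The buffering part of the function node is roughly this (a simplified sketch, not my real code; the context keys and the second output are just examples):

// Simplified sketch of the buffering logic described above.
// 'mqtt_connected' is kept up to date in flow context by the MQTT status node.
const connected = flow.get('mqtt_connected') || false;
const buffer = context.get('buffer') || [];
let counter = context.get('counter') || 0;

if (connected) {
    counter += 1;
    context.set('counter', counter);
    // every 20 live messages, ask the file-sending branch (second output) to release one csv
    const releaseFile = (counter % 20 === 0) ? { topic: 'send_next_file' } : null;
    return [msg, releaseFile];
} else {
    // not connected: keep the payload in context until the next 10-minute check
    buffer.push(msg.payload);
    context.set('buffer', buffer);
    return null;
}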

If you run top on the Pi, is the memory going up rapidly?

Sometimes I have checked it, but it seems normal. I'll do a more accurate check

Memory is increasing little by little, so it will probably crash in a couple of hours or so. I'm following it.
In the meantime I'm looking at the messages passing through the server and received by the Debian PC. The messages received on the fly are 10-20 minutes old (not the files). How is the queue held and managed? The on-the-fly messages are not retained and use qos = 0. So how is it possible that I'm receiving such old messages?

Add debug nodes on the Pi to see what is going on. If messages are not going out as quickly as they are coming in then perhaps that is your memory build-up. If it is difficult to follow in real time then, if you set the debug nodes to output to the console, you can analyse the log later to see what happened.
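If it helps, a small function node in front of each point you want to watch can keep a running count, so you can compare the in and out rates (just a rough sketch, the context key name is arbitrary):

// Rough sketch: place one copy before the data source and one before the mqtt-out node,
// each with its own key, then compare the two counters.
const key = 'count_in';                      // e.g. 'count_in' or 'count_out'
const count = (flow.get(key) || 0) + 1;
flow.set(key, count);
node.status({ text: key + ': ' + count });   // show the running total under the node
return msg;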

Some more updates

  1. Node-red crashed again but I wasn't able to watch the top command during this event. I will try it again. But as explained above, I'm more and more confident that the problem is somehow related to the message queue that keeps growing, probably due to the bad internet connection.
  2. I was able to look in more detail at what happens with the other problem, the unusual client disconnection. I have noticed this situation: from time to time, the broker sends the will message. What sounds strange to me is that after the will message the Pi is still reachable remotely, the MQTT node is shown as connected, its status node doesn't register any change, and in the log there is no disconnection activity. So, from node-red's point of view, the client seems to be correctly connected.
    Moreover, if I publish a message to a topic this client is subscribed to, it never receives it. So, from the server's point of view it is effectively disconnected (I can see this message correctly reach the broker without leaving it).
    After some minutes I can see in the node-red log (as well as from the MQTT status node) that the client disconnects and immediately reconnects, and the broker correctly sends the birth message. So, in the end, it is as if there is a period of time in which the broker has registered the client as disconnected, but for this client everything is ok and it still sees the connection as active, continuing to send messages. Where these messages go I don't know, but as I said in the previous reply, it is strange that I receive old on-the-fly messages, messages with qos 0 and no retained flag.
    How is this possible? Is this behavior correct?

This is the broker log, where I have added some notes based on what I have understood.
Client_0 is node-red running on the debian PC
Local_1 is the client running on the pi
Client_Test is the client I used to test the real Local_1 connection status, as described above
sudo tail -f /var/log/mosquitto/mosquitto.log | grep "Area1"

2023-04-20 18:34:44: Sending PUBLISH to Client_0 (d0, q0, r0, m0, 'Area1/test', ... (367 bytes))
2023-04-20 18:34:44: Received PUBLISH from Client_0 (d0, q0, r0, m0, 'Area1/test', ... (46 bytes))
2023-04-20 18:34:44: Sending PUBLISH to Local_1 (d0, q0, r0, m0, 'Area1/test', ... (46 bytes))
2023-04-20 18:36:52: Sending PUBLISH to Client_0 (d0, q2, r1, m314, 'Area1/Connect', ... (21 bytes))   ---> WILL MESSAGE
2023-04-20 18:46:08: Received PUBLISH from Client_Test (d0, q0, r0, m0, 'Area1/test', ... (8 bytes))   ---> SENT BY THE TEST CLIENT
2023-04-20 18:47:35: Received PUBLISH from Client_Test (d0, q0, r0, m0, 'Area1/test', ... (8 bytes))   ---> SENT BY THE TEST CLIENT
2023-04-20 18:47:37: Received PUBLISH from Client_Test (d0, q0, r0, m0, 'Area1/test', ... (8 bytes))   ---> SENT BY THE TEST CLIENT
2023-04-20 18:53:03: New client connected from 127.0.0.1:46658 as Local_1 (p5, c0, k64800, u'user_1').
2023-04-20 18:53:03:    Area1/Connect
2023-04-20 18:53:03:    Area1/# (QoS 0)
2023-04-20 18:53:03: Local_1 0 Area1/#
2023-04-20 18:53:03: Sending PUBLISH to Local_1 (d0, q0, r1, m0, 'Area1/Connect', ... (21 bytes)) 		---> BIRTH MESSAGE
2023-04-20 18:53:03: Received PUBLISH from Local_1 (d0, q2, r1, m30396, 'Area1/Connect', ... (22 bytes))
2023-04-20 18:53:03: Sending PUBLISH to Client_0 (d0, q2, r1, m316, 'Area1/Connect', ... (22 bytes)) 	---> BIRTH MESSAGE
2023-04-20 18:53:45: Received PUBLISH from Local_1 (d0, q0, r0, m0, 'Area1/test', ... (5439 bytes))
2023-04-20 18:53:45: Sending PUBLISH to Client_0 (d0, q0, r0, m0, 'Area1/test', ... (5439 bytes))
2023-04-20 18:53:45: Received PUBLISH from Client_0 (d0, q0, r0, m0, 'Area1/test', ... (46 bytes))
2023-04-20 18:53:45: Sending PUBLISH to Local_1 (d0, q0, r0, m0, 'Area1/test', ... (46 bytes))

This is the node-red log

20 Apr 18:30:00 - [debug] [cronplus:5 HistSave] running 'schedule1' ~ 'topic1'
 now time Thu Apr 20 2023 18:30:00 GMT+0200 (Ora legale dell’Europa centrale)
 crontime Thu Apr 20 2023 18:30:00 GMT+0200 (Ora legale dell’Europa centrale)
20 Apr 18:30:20 - [debug] [cronplus:2 Old_in_DB] running 'schedule1' ~ 'topic1'
 now time Thu Apr 20 2023 18:30:20 GMT+0200 (Ora legale dell’Europa centrale)
 crontime Thu Apr 20 2023 18:30:20 GMT+0200 (Ora legale dell’Europa centrale)
20 Apr 18:40:00 - [debug] [cronplus:5 HistSave] running 'schedule1' ~ 'topic1'
 now time Thu Apr 20 2023 18:40:00 GMT+0200 (Ora legale dell’Europa centrale)
 crontime Thu Apr 20 2023 18:40:00 GMT+0200 (Ora legale dell’Europa centrale)
20 Apr 18:41:34 - [trace] utils.writeFile - written content to /home/pi/.node-red/.sessions.json.$$$
20 Apr 18:41:34 - [trace] utils.writeFile - renamed /home/pi/.node-red/.sessions.json.$$$ to /home/pi/.node-red/.sessions.json
20 Apr 18:41:34 - [trace] comms.open nsMIs4p7+n3Q80M/y+SgRnN0nst98IsublVFKc2HM64=
20 Apr 18:44:32 - [trace] utils.writeFile - copied /home/pi/.node-red/flows.json TO /home/pi/.node-red/.flows.json.backup
20 Apr 18:44:32 - [trace] utils.writeFile - written content to /home/pi/.node-red/flows.json.$$$
20 Apr 18:44:32 - [trace] utils.writeFile - renamed /home/pi/.node-red/flows.json.$$$ to /home/pi/.node-red/flows.json
20 Apr 18:50:00 - [debug] [cronplus:5 HistSave] running 'schedule1' ~ 'topic1'
 now time Thu Apr 20 2023 18:50:00 GMT+0200 (Ora legale dell’Europa centrale)
 crontime Thu Apr 20 2023 18:50:00 GMT+0200 (Ora legale dell’Europa centrale)
20 Apr 18:52:48 - [info] [mqtt-broker:Client] Disconnected from broker: Local_1@mqtts://www.myexample.com:12345
20 Apr 18:53:03 - [info] [mqtt-broker:Client] Connected to broker: Local_1@mqtts://www.myexample.com:12345

I have never done it, but I think I got what you mean. I'll try it.
Anyway, I'm sure about the delay because I have a timestamp inside the payload, so from the debug node I can see the current time and the timestamp at which the message was sent to the broker.
I'm also noticing that when the messages are on time, it works as expected (except for this disconnection that happens from time to time, even 1-2 times per hour). But once this delay starts, it becomes bigger and bigger, probably causing this memory leak
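The check itself is nothing more than this (a simplified sketch; msg.payload.ts is an example field name, not necessarily the real one):

// Compare the timestamp embedded in the payload with the arrival time.
const sentAt = msg.payload.ts;                        // epoch ms written by the publisher
const delaySeconds = (Date.now() - sentAt) / 1000;
node.warn('message delayed by ' + delaySeconds.toFixed(1) + ' s');
return msg;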