Exec fails to execute all instructions in a bash script

I am executing a bash script from the exec node. It's a fairly long script with 18 sleeps of 15 seconds each, 18 ssh commands, and some local commands in between. It seems to stop at somewhat random points in the script: I have seen it stop at the first ssh, at the 3rd sleep, and at several other places.
When I run the script at the command line it runs to completion every time. Has anyone else seen this behavior?

This is part of the script; there are 9 sections like this for 9 different QMGRs.

echo " " >> $LOGFILE
echo "#############################################" >> $LOGFILE
echo "## Starts SSH to MMIS_PROD_BATCH_CLAIMS ##" >> $LOGFILE
ssh 192.168.0.220 << CMDS
nodered
node20!RED22
mqcli
strmqm MMIS_PROD_BATCH_CLAIMS
exit
exit
CMDS
echo "## Spot 01 MMIS_PROD_BATCH_CLAIMS ##" >> $LOGFILE
sleep 15
echo "## Spot 02 MMIS_PROD_BATCH_CLAIMS ##" >> $LOGFILE
export MQSERVER="MQEXPLORER.SVRCONN/TCP/192.168.0.220(9401)"
QMGR=MMIS_PROD_BATCH_CLAIMS
FIX=$TODAY"_"$QMGR".fixed"
FIXED=/home/wsadmin/mqbackup/$FIX
/opt/mqm/bin/runmqsc -c $QMGR -u nodered < $FIXED >> $LOGFILE
echo "## Spot 03 MMIS_PROD_BATCH_CLAIMS ##" >> $LOGFILE
sleep 15
echo "## Spot 04 MMIS_PROD_BATCH_CLAIMS ##" >> $LOGFILE
ssh 192.168.0.220 << CMDS
nodered
node20!RED22
mqcli
endmqm MMIS_PROD_BATCH_CLAIMS
exit
exit
CMDS
echo "## Stop of MMIS_PROD_BATCH_CLAIMS COMPLETE ##" >> $LOGFILE
echo "#############################################" >> $LOGFILE

Any ideas what could be causing the exec node to "hang"? Though "hang" may not be quite the right word: the pid shown by the exec node is gone, so the execution has somehow been terminated.

Looking for ideas.

BTW ... all of the variables ARE defined above this section. Like I said, it works fine if I just run the script at the command line.

Are you running in exec mode or spawn mode?

Other than that, since it is pretty much impossible to say what your script is actually doing, it is hard to know what might be wrong. You seem to have quite a number of env variables though - have you verified that the node-red host environment actually contains those?
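One quick way to check: add a line near the top of the script that dumps the environment into your log file, then compare it against what an interactive shell gives you, e.g.

env | sort >> $LOGFILE    # then diff against "env | sort" from your login shell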

Also are you running node-red as a service or directly in a terminal? If you are running as a service then stop that and run it directly in a terminal, using the node-red command, and see if it makes any difference.

Otherwise, if you have not already done so, connect three debug nodes, showing the full message, to the outputs of the exec node and check to see what comes out when it fails.

Finally, start node-red in a terminal and post the terminal output here, up to the point where it fails. Copy/paste please, not screenshot.

Do you mean that you see (for example)
"## Spot 01 MMIS_PROD_BATCH_CLAIMS ##" in the logfile but not "## Spot 02 MMIS_PROD_BATCH_CLAIMS ##"?
Does the machine executing this script freeze or crash?
Do you have logfiles on the remote machine?
What is the exit code from the exec node?
Do the actions on the remote machine output to stdout/stderr? Are you capturing them?
Your script doesn't have the error handling I'd expect to see (a sketch of what I mean is at the end of this post).

PS Do you have an explicit timeout set on the exec node?
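By "error handling" I mean a check like this after each ssh and runmqsc call; a minimal sketch reusing the variables from your script (the login lines in the heredoc are omitted here):

# same heredoc as in the original script, login lines omitted
ssh 192.168.0.220 << CMDS
mqcli
strmqm MMIS_PROD_BATCH_CLAIMS
exit
CMDS
rc=$?
echo "## ssh step finished, rc=$rc ##" >> $LOGFILE
if [ $rc -ne 0 ]; then
    echo "## aborting: start of QMGR failed ##" >> $LOGFILE
    exit $rc
fi

At the very least that would tell you which step died and with what exit code.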

@TotallyInformation

I tried both exec and spawn modes, it didn't make any difference.

The variables are all contained within the script and are all defined in a section above the clip. It is self-contained and does not require external data, other than some files it moves.

Node-RED is running on CentOS v8 and the remote "machine" is a virtual MQ Appliance.

What I am doing is "a poor man's HA" with MQ appliances ... long story short, normal MQ Appliance HA does not work well in our network infrastructure; it was causing outages even when it didn't fail over. Not MQ's fault. These machines are all virtual (a lab environment) and running in the same subnet. I have absolute control over all the machines.

There is nothing in the logs for the MQ Appliance that shows a failed connection; it either works or it doesn't, it would seem. There is nothing in my log file that shows a failure of any kind either, it just seems to stop at some random point.

The script manages 9 QMGRs on the appliance, the section of code does the following:

  • SSH to the appliance, and starts QMGR #1
  • run a local instance of runmqsc to make a remote connection to QMGR #1 and push a config file to it (the config is generated in another script, which extracts a mqdmpgfc from QMGR #1 on another appliance; that script works fine from Node-RED)
  • SSH to the appliance and stop QMGR #1
  • Rinse and Repeat 8 more times

As I mentioned, if the script is run from the user that runs Node-RED, it works like a charm every time.

@Colin
It is running as a service using PM2. I will try running it directly today.
I do have debug nodes on all 3 outputs of the exec node and they never show a failure, just the current command prompts, etc.
I will try running in a terminal and capturing the output today as well.

@jbudd
That is exactly what I see: it will show the debug messages and then just stop ... when I get tired of waiting I do "ps -ef" and grep for the PID that Node-RED shows under the exec node, and it's just gone.
There is no exit code, no error, no anything.
The remote machine is an MQ Appliance; there are no failures or errors there that I can find either.
As for a timeout, I never set anything like that. Is there a default? It does wait 15 sec for the QMGR to start, then 15 sec after the update runs, just to make sure things "happen" in the QMGR, for each of the 9 QMGRs. But I think if this were a timing issue it would stop at the same place every time, and that is not the case here ... it does seem to be random. I have seen it stop on sleep commands in various places, and I have seen it stop between a sleep and an SSH command in multiple places.

As a note, when this moves out of the lab into our "real" environments there will be 20 QMGRs, 5 each in 4 environments.

If you run just one of the remote sessions then exit the script, is that reliable?

I have not tried that, will do that today as well. I generally get the script "right" running from the command line, then run it in Node-RED once it works. That is the way I developed it: get one working, then cut/paste/edit 8 more times. But the script runs to completion every time without errors when run at the command line.

That is what I generally do as well. But I suspect something odd, maybe timing-wise, when exiting one SSH session and moving to another.

If it does work as a single, it would be easy enough to parameterise the script so that you could run it multiple times with different parameters from node-red.
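Something along these lines, a rough sketch based on the section you posted (the script name is made up, and it assumes TODAY and LOGFILE are set at the top as in your full script):

#!/bin/bash
# manage_qmgr.sh - start, configure, and stop one QMGR per invocation
QMGR=$1
FIX=$TODAY"_"$QMGR".fixed"
FIXED=/home/wsadmin/mqbackup/$FIX

# start the QMGR on the appliance
ssh 192.168.0.220 << CMDS
nodered
node20!RED22
mqcli
strmqm $QMGR
exit
exit
CMDS
sleep 15

# push the config file to it over a client connection
export MQSERVER="MQEXPLORER.SVRCONN/TCP/192.168.0.220(9401)"
/opt/mqm/bin/runmqsc -c $QMGR -u nodered < $FIXED >> $LOGFILE
sleep 15

# stop the QMGR again
ssh 192.168.0.220 << CMDS
nodered
node20!RED22
mqcli
endmqm $QMGR
exit
exit
CMDS

The exec node could then run /home/wsadmin/manage_qmgr.sh with "append msg.payload" ticked, and an inject or split node feeding in the nine QMGR names one at a time.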

You could also create a script that sits on each remote and just execute that for each server rather than trying to run a set of commands each time. I also sometimes use a wget command in my scripts that passes result data back to node-red using an http-in/-out pair of nodes. I also have a way of outputting log data to system logs rather than file logs which means that you can use standard rotations if you want to.
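The wget trick is just a one-liner dropped into the script at each step; a sketch, assuming an http-in/http-out pair listening on /progress in Node-RED (the endpoint and query names are invented for the example):

wget -q -O /dev/null "http://127.0.0.1:1880/progress?qmgr=$QMGR&step=spot01"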

Do you get an output on o/p 3 when it fails? If so what return code does it show?
Do you ever get an output on o/p 2?

What versions of nodejs and node-red are you running?

I also think you need to monitor the second and third outputs from the exec node (stderr and exit code).

I don't know what your remote commands do

nodered
node20!RED22
mqcli
strmqm MMIS_PROD_BATCH_CLAIMS

However I note that none of them have nohup, so if the ssh connection fails for some reason, the remote process will be terminated. No idea if that would cause these symptoms, nor why the connection would fail from NR but not from the CLI.
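On a normal Linux remote that would look something like the line below; whether the appliance's mqcli shell accepts nohup I don't know, so treat it as a sketch only:

nohup strmqm MMIS_PROD_BATCH_CLAIMS > /dev/null 2>&1 &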

Is it possible to encapsulate those remote commands in a script, with progress messages, and feed them to a debug node?

Thank you @Colin !!
I tried running Node-RED from the command line instead of PM2 and it worked just fine. So then I stopped it, restarted it in PM2, and ran the same script, and it failed again. This time, though, I was watching the PM2 log for Node-RED, and after about 30 seconds PM2 simply restarted Node-RED .... hence the PID just going away.

Now I have to dig into PM2 and see what is causing that behavior.
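If I am reading the PM2 docs right, these should show the restart count and the log around each restart (assuming the process is registered under the name node-red):

pm2 describe node-red
pm2 logs node-red --lines 100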
But Node-RED and the exec node are off the hook on this one. :slight_smile:
Thanks guys!!

The usual issues that may cause problems when running as a service are:

  1. No attached display.
  2. Environment variables missing or different.
  3. In particular the PATH or working directory may be different - the solution is to specify full path names for all commands and files accessed.

Point 3 is probably the commonest.
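For example, a couple of lines like these at the top of the script make it independent of whatever environment the service provides (the PATH shown is only illustrative):

export PATH=/usr/local/bin:/usr/bin:/bin:/opt/mqm/bin
cd /home/wsadmin || exit 1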


Update: Got rid of PM2 and created a Linux service and it's still doing it. :frowning:
Here is my service file: node-red.service

[Unit]
Description=Run Node-RED as a system service
Wants=network.target network-online.target
After=network.target network-online.target

[Service]
Type=forking
User=wsadmin
Group=wsadmin
Restart=on-failure
SuccessExitStatus=0
ExecStart=/bin/bash -c "exec /usr/local/bin/node-red & >>/home/wsadmin/node-red-logs/node-red.log 2>&1"

[Install]
WantedBy=multi-user.target

Seems to be every 90 seconds that it restarts.

What is in the node-red log?

What OS are you running?

I think this one was my bad .... when I wrote the service file I was missing a & after node-red in the startup before piping the outputs to a log (with Type=forking and no &, node-red never forks, so systemd's default 90-second start timeout kills and restarts it) .... I added that (which is in the code above) and the restarts stopped, but now the node-red log does not seem to populate ... still looking into the syntax for:

ExecStart=/bin/bash -c "exec /usr/local/bin/node-red & >>/home/wsadmin/node-red-logs/node-red.log 2>&1"

Got it .... the final service file looks like this (the & has to come after the redirections; written before them, bash backgrounds node-red without attaching the redirections, so the >> ends up applied to an empty command) ....

[Unit]
Description=Run Node-RED as a system service
Wants=network.target network-online.target
After=network.target network-online.target

[Service]
Type=forking
User=wsadmin
Group=wsadmin
Restart=on-failure
SuccessExitStatus=0
ExecStart=/bin/bash -c "exec /usr/local/bin/node-red >>/home/wsadmin/node-red-logs/node-red.log 2>&1 &"

[Install]
WantedBy=multi-user.target

It does not restart every 90 seconds and does update the logs if I kill -9 the node-red pid ... so I think I'm all good to go.
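In case it helps anyone else: with the unit saved in the usual place (/etc/systemd/system/node-red.service) the activation steps are the standard ones:

sudo systemctl daemon-reload
sudo systemctl enable --now node-red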

Thanks for your patience everyone, and thanks to Colin for putting me on the right track.


Which OS are you using?

It's CentOS 8 ... in the real environments it will be RH 7 or 8.

OK, I don't know about those.
