Advice needed: Job handler with backup

I've been using node-red to monitor data from a few dozen machines in an industrial plant. Some of these machines are loaders that get the material from a carrier rack and put it into a production line, or unloaders that get the material from the line and put it into a carrier rack. We also have 3 smart storage units that store both full and empty carrier racks.

We have some AGVs (mobile robots) that move full and empty carrier racks between the lines and the smart storage. The AGVs are controlled by their own Fleet Manager, which can receive instructions through a REST API. The control program from the AGV integrator is unstable, and the programmer is leaving the company at the end of this month, so we're not sure we'll get any more support after that. Since the data-collection application was successful, I started building a parallel AGV control system, also using Node-red.

So far, I started by using the node-red-contrib-finite-statemachine node to create the states for a single AGV throughout a job. There are basically three types of jobs: go home, go charge, or a put/get job.

  • Go home just involves moving to the home position and waiting in an idle state.
  • Go charge involves going to the charger, plugging in, waiting for the charge to finish, and detaching from the charger.
  • A get/put job involves navigating to a machine port, docking, transferring a carrier rack from machine to robot or vice versa, and undocking.

Therefore, taking a rack from a machine to the storage is a get job plus a put job. And since the empty port needs to be refilled, a full cycle is actually 4 jobs: GET a full rack, PUT the full rack into storage, GET an empty rack, and PUT it into the line so it can be filled up next.
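The four-job decomposition above can be sketched as a small helper. All the names here (ports, IDs, field names) are illustrative, not from the real system:

```javascript
// Sketch: decompose one "full exchange" into its four get/put jobs.
// Port names, IDs, and field names are illustrative assumptions.
function buildFullExchange(machinePort, storagePort, exchangeId) {
  return [
    { id: `${exchangeId}-1`, type: "GET", rack: "full",  at: machinePort },
    { id: `${exchangeId}-2`, type: "PUT", rack: "full",  at: storagePort },
    { id: `${exchangeId}-3`, type: "GET", rack: "empty", at: storagePort },
    { id: `${exchangeId}-4`, type: "PUT", rack: "empty", at: machinePort },
  ];
}

const jobs = buildFullExchange("LINE1-P2", "STORAGE-A", "EX-001");
console.log(jobs.map(j => j.type).join(" ")); // GET PUT GET PUT
```

Keeping each full exchange as an ordered array like this makes it trivial to queue, persist as JSON, and resume mid-cycle.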

This, of course, is from the perspective of a single robot. The layer on top must be a dispatcher system that monitors the machine statuses and creates the jobs as needed.

And here comes my dilemma. I can create the jobs as messages, put them in a queue for each robot, and once a robot has finished one of the partial jobs, dequeue it and move on to the next one. This is a handy option because, if there is an error or malfunction of some sort (the battery dies or the WiFi drops), I know exactly which job the robot was doing, so once it is back online I can send that job again and pick up from there.

However, suppose the node-red server crashes, or needs rebooting because of updates, network maintenance, or a flow redeployment. If I haven't backed up the queues, I will lose all the job information and chaos will ensue. The current application uses an MSSQL database to track the jobs, but it misses some of the parameters and the job tracking is generally lousy.

I am already using a MySQL DB to log the commands sent to the robots, so I could use the same DB to back up every message sent to a queue and delete the entry when the job is finished.

So far, the idea I had is as follows:

  • Store the job queues directly in JSON format with a timestamp and a code as an ID when they are generated.
  • Whenever a queue is updated, I overwrite the whole entry with the updated queue.
  • If something happens, upon powering up the node-red flow will start by checking whether there are any stored job queues for the AGVs, load them into memory, reset all the FSMs to their initial state, let each FSM load its job queue, and repeat the last assignment if it was unfinished.
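A minimal sketch of that recovery step, assuming one MySQL row per AGV with the queue stored as a JSON string (the column, field, and status names are my invention):

```javascript
// Rebuild the in-memory queues from the rows persisted in MySQL.
// Each row: { agv_id, queue_json }, where queue_json is the serialized
// job array. Field and status names are illustrative assumptions.
function restoreQueues(rows) {
  const queues = {};
  for (const row of rows) {
    const jobs = JSON.parse(row.queue_json);
    queues[row.agv_id] = {
      jobs,
      // Re-issue whatever was running when the server went down,
      // otherwise start with the first pending job.
      resume: jobs.find(j => j.status === "in_progress")
           || jobs.find(j => j.status === "pending")
           || null,
    };
  }
  return queues;
}
```

On startup the flow would run the SELECT, feed the rows through something like this, write the result into context, and reset each FSM before handing it its `resume` job.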

I think this approach would work, but it might be too convoluted. Can anyone think of a more straightforward method?

We lose all in-memory data when node-red reboots, so anything you need later must be saved to disk or to a database.

I know.

I was thinking of backing up the working queue as a JSON object, in a MySQL DB every time it's updated.

But at least one person has told me to forget about that and use the DB to store the queue directly instead of using it as a mere backup.

What concerns me about the DB-only approach is speed: if the DB is handling several requests at the same time, it might take a moment to reply. With several jobs running at once, plus data monitoring for 30+ machines, the system will slow down a bit.

On the other hand, if I only use it as a backup, it doesn't really matter whether the DB replies with a 2-3 second delay, because I just push the updated queue to the DB (which will catch up eventually) while I work on the in-memory queue. As a bonus, it would still work even if the DB server were down.
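That write-behind idea can be sketched like this, with `saveToDb` standing in for an async MySQL upsert (the class and all its names are mine, not an existing node):

```javascript
// "Memory first, DB as backup": mutate the in-memory queue
// synchronously, then push a snapshot to the DB without awaiting it.
// saveToDb is a stand-in for an async MySQL upsert: (agvId, json) => Promise.
class BackedUpQueue {
  constructor(agvId, saveToDb) {
    this.agvId = agvId;
    this.jobs = [];
    this.saveToDb = saveToDb;
  }
  push(job) {
    this.jobs.push(job);
    this.backup();
  }
  shift() {
    const job = this.jobs.shift();
    this.backup();
    return job;
  }
  backup() {
    // Fire and forget: a slow or down DB never blocks the robots.
    this.saveToDb(this.agvId, JSON.stringify(this.jobs))
        .catch(err => console.error("backup failed", err));
  }
}
```

The queue is always mutated in memory first, so a 2-3 second DB delay, or even a dead DB, only affects the backup, never the robots.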

However, I don't know if this is just overcomplicating things or there is a simpler solution.

Long story short, it's a matter of I/O. If too much I/O causes problems, then it isn't a question of Node RED or database choice, but of whole-system design.

Since you're running a real system to industrial standards, you need to estimate the number and size of your records and compare that against your actual hardware to see whether it will be slow.
You can also move the DB to a separate server, which will keep your node-red server fast and stable. If there is too much data, then a server cluster is a must.

I do have a cluster server. For development, I'm using node-red and the MySQL DB in the same server, but I can split them in the future if I need to.

I was wondering if there are other (simpler) ways to do what I am suggesting, since I started with node-red less than three months ago, and tend to overthink some stuff.

I think the first question I'd ask is: what is the total size of things? That is, what order of magnitude of robots do you need to manage, and how long will each robot's queue get?

If the whole thing can be managed in-memory, then you might get away with using just the retained context feature which simply saves context/flow/global variables to file in JSON format (note the default write timeout, you can adjust that though).
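For reference, the retained-context option mentioned here is enabled in Node-RED's settings.js; this is a minimal sketch (the store name "file" is my choice, and `flushInterval` is the write-timeout knob being referred to, in seconds):

```javascript
// settings.js (Node-RED): persist context to disk so flow/global
// variables survive a restart. "file" is a store name I chose;
// flushInterval (seconds) controls how quickly changes hit the disk.
module.exports = {
  contextStorage: {
    default: { module: "memory" },
    file: {
      module: "localfilesystem",
      config: { flushInterval: 5 },
    },
  },
};
```

In a function node, `flow.set("queue", jobs, "file")` then writes the queue to the file-backed store instead of memory.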

If you need to use a DB, and that certainly has benefits in terms of scalability and reliability, I think that I would write each job to the db and use my flow to pass a pointer - essentially just using that as a prompt (a baton if you like) to keep the current process going. I would probably have a separate flow that read the DB when the robot was ready for its next job.
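A rough sketch of that baton idea, faking the DB with a Map (in a real flow the lookups would be mysql nodes, and all names here are illustrative):

```javascript
// "Baton" pattern: the flow only carries a job id; the job body lives
// in the DB. The Map stands in for a jobs table; names are illustrative.
const db = new Map();

// Dispatcher side: persist the job, pass only the id through the flow.
function enqueueJob(job) {
  db.set(job.id, { ...job, status: "pending" });
  return job.id; // the baton
}

// Robot-ready side: a separate flow reads the DB for the next job.
function nextJobFor(agvId) {
  for (const job of db.values()) {
    if (job.status === "pending" && job.agv === agvId) {
      job.status = "in_progress";
      return job;
    }
  }
  return null;
}
```

Because the job bodies live only in the DB, a node-red restart loses nothing; the flow just asks the DB again when each robot reports ready.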

While that is somewhat less efficient from a read/write perspective, that isn't likely to be a performance bottleneck unless you are dealing with thousands of robots (though obviously I'm somewhat guessing there). It is more flexible because it would mean that a future potential improvement would be to be able to dynamically task robots. So if one robot had to make a longer journey, you could have a separate flow that watched for that and made a decision to swap tasks and give the next one to a different robot.

All of this is speculation of course :grin:

@TotallyInformation:

At the moment, things are fairly small on the AGV side (we have just 6 of them). Because of the unreliability of the current system, some operators are reluctant to use them, and that's one of the main reasons I want to improve it.

Queues would never get too big. The reason for that is that the AGVs are assigned to different areas, and the biggest one of them has three machines (with two ports each) and a smart storage unit, so the worst case scenario there in a normal flow is that in a specific area, we have three full racks to be picked up (that would be three full exchanges, or 12 single jobs).

Under most conditions, each area has an AGV assigned, but the biggest one can have two, if both of them have enough battery charge. Under normal conditions, each AGV handles the full exchange by itself, so if both are available, the area would assign a full exchange to one AGV, then when the next job comes, it would be assigned to the next AGV available (if the other is charging, it will keep the job in the queue until there is an available one).

This means that, at most, each AGV would have 4 different get/put jobs queued, and the area would have the information to create the following ones as they come; even for the biggest area that could not be more than 6 full exchanges between the lines and the storage, each of which turns into 4 separate jobs when assigned.

Due to the size of it, it would be entirely possible to handle everything in memory, but I still would like to keep a back-up in a DB, just in case.

At the moment I'm also using global variables for each AGV to handle the context, so both the AGV and the dispatcher finite state machines can see the current status.

My plan is to log the relevant information about the AGV state each time it receives a command (for debugging), and in another table keep a full record of the last state, including the job queue. I would probably store the full JSON object of the AGV, complete with voltage, every time it is assigned a job queue, and update it when each single job is completed.
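The per-AGV record could look something like this; every field name here is an assumption about the eventual schema, not the real one:

```javascript
// Illustrative shape of the per-AGV snapshot that would be upserted
// into the "last state" table as a JSON blob. All field names are
// assumptions, not the actual schema.
function snapshotAgv(agv) {
  return {
    agvId: agv.id,
    state: agv.fsmState,       // current finite-state-machine state
    voltage: agv.voltage,      // battery voltage, for charge decisions
    queue: agv.jobs,           // remaining single jobs
    updatedAt: new Date().toISOString(),
  };
}
```

Storing the whole snapshot on every assignment/completion keeps the restore logic simple: the last row per AGV is always the complete truth.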

There's an optimization I may want for quick exchanges, but it requires two available AGVs: one picks up the full rack, the other picks up the empty rack, and then they swap places to deliver. This makes the exchange faster and can increase throughput, but it needs two operational AGVs in a single area. Even if two are assigned, they eventually need charging (leaving me one AGV short), and the transition between the normal cycle and the quick-swap cycle is complex, because it only works cleanly if both AGVs are idle when the job comes in.

For further optimization, I'll need to rewrite the control software for the smart storage units so they anticipate demand and have an empty rack prepared; that way one AGV can pick it up straight away while the other goes to fetch the full one.

Two thoughts:

Something you might try, at least for development purposes, is node-red-contrib-multiple-queue. If you use the Restore from state saved in option, with non-volatile context storage and buffering disabled, you will always have a backup of the queues.

In principle, it is not possible to implement cooperative behavior using simple queues. In general, a central dispatcher would be needed that can manage the content and operation of the queues. At that point, the queues are just a data store for the dispatcher, and there might be better architectures -- for example, a single job pool from which the dispatcher assigns jobs to robots on the basis of an algorithm that uses the complete state of the system and some priority scheme.

@drmibell I will check the node, thanks. It might be interesting for me depending on the non-volatile storage possibilities.

About the dispatcher: I know. I described exactly that in my previous message: each area has a job handler and assigns jobs to the AGVs available in it. Where areas overlap a bit, there will be a higher-level instance reassigning AGVs between areas.
