Locating arbitrarily named log files

I wanted to poll some ideas for a particular problem I have.

Of the machines I have at work, some keep their activity logs / data in databases (optimal), whereas many others just keep them in text files.

Of the machines that use text files for their activity logs, some keep them as rotating logs with a fixed name (xxxxxxxxx.log, xxxxxxxxx.log.001, and so on), which is the sensible way to handle logs.

However, some of the machines are a specific problem: they write files with arbitrary, inconsistent names (-<batch_number>.log, for instance), sometimes all dropped into the same folder, and sometimes with a consistent file name but placed inside arbitrarily named folders (named after the batch number, for instance).

Now, this might be fine if you are looking for a specific batch, or if you want to manually parse a specific file. However, I was hoping to monitor the log files with tail and get (more or less) real-time updates, and without knowing the file name (or path) in advance, that seems a tad difficult to do.

Is there any way I could locate the latest file in a folder/subfolder and monitor it dynamically with the Tail node?
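
To make the question more concrete, this is roughly the discovery step I have in mind (just a sketch; the mount path and the .log filter are assumptions on my part), run periodically to find the newest log anywhere under a machine's folder tree, so its path could then be handed to whatever does the tailing:

```typescript
// Sketch only: find the newest *.log file under a base folder, walking into
// the arbitrarily named batch subfolders. BASE_DIR is an assumed mount point.
import * as fs from "fs";
import * as path from "path";

const BASE_DIR = "/mnt/machine_logs"; // assumption: where the machine share is mounted

function newestLog(dir: string): { file: string; mtime: number } | null {
  let best: { file: string; mtime: number } | null = null;
  for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
    const full = path.join(dir, entry.name);
    if (entry.isDirectory()) {
      // Recurse into the batch-named subfolders
      const sub = newestLog(full);
      if (sub && (!best || sub.mtime > best.mtime)) best = sub;
    } else if (entry.isFile() && entry.name.endsWith(".log")) {
      const mtime = fs.statSync(full).mtimeMs;
      if (!best || mtime > best.mtime) best = { file: full, mtime };
    }
  }
  return best;
}

const latest = newestLog(BASE_DIR);
if (latest) console.log("Would tail:", latest.file);
```

Whether the Tail node can be pointed at that path dynamically (instead of a fixed filename set in the node config) is really the core of my question.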

The other problem I see is that, if the file already has more than one line when I start monitoring, I will only receive the new lines, so I should probably use a File node to parse anything the file already contains, and the Tail node to monitor any new additions.
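
For illustration, this is a minimal sketch of what I mean, assuming a fixed path and simple polling (in NR the offset could live in flow context instead of a variable): the first read returns the backlog, later reads only what was appended since.

```typescript
// Sketch only: remember a byte offset so the first poll returns the backlog
// and later polls return only what was appended. LOG_FILE is an assumed path.
import * as fs from "fs";

const LOG_FILE = "/mnt/machine_logs/current.log"; // assumption
let offset = 0; // 0 means "not read yet", so the first poll returns everything

function readNewLines(): string[] {
  const size = fs.statSync(LOG_FILE).size;
  if (size < offset) offset = 0;   // file was rotated or truncated: start over
  if (size === offset) return [];  // nothing new since the last poll
  const fd = fs.openSync(LOG_FILE, "r");
  const buf = Buffer.alloc(size - offset);
  fs.readSync(fd, buf, 0, buf.length, offset);
  fs.closeSync(fd);
  offset = size;
  // A robust version would also hold back a half-written last line.
  return buf.toString("utf8").split(/\r?\n/).filter((l) => l.length > 0);
}

setInterval(() => readNewLines().forEach((line) => console.log(line)), 5000);
```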

Or should I make my own log rotation, locating the new logs and overwriting a single file in a known position? That might work for monitoring, but would I need a separate service for that, or could I do it in NR?
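
If I went that route, I imagine something like this (again just a sketch, with assumed paths): keep a symlink at a fixed, known location pointing at whatever the newest log is, so the monitoring side always watches the same filename.

```typescript
// Sketch only: keep a symlink at a fixed, known path pointing at whatever the
// newest log currently is. FIXED_LINK is an assumed path; newestLogPath would
// come from the discovery step above.
import * as fs from "fs";

const FIXED_LINK = "/data/machine-current.log"; // the "known position"

function pointAtNewest(newestLogPath: string): void {
  let current = "";
  try {
    current = fs.readlinkSync(FIXED_LINK); // throws if the link does not exist yet
  } catch {
    current = "";
  }
  if (current !== newestLogPath) {
    if (current !== "") fs.unlinkSync(FIXED_LINK); // drop the old link
    fs.symlinkSync(newestLogPath, FIXED_LINK);     // point at the new log
  }
}
```

I think this could run inside NR on an Inject timer (Function or Exec node) rather than as a separate service, although whether the tail implementation notices when the link is re-pointed is something I would have to test.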

Logging is an industry all of its own! And many organisations spend a LOT of time, effort and money to consolidate, monitor and alert on logs.

So there are a lot of variables you need to consider up-front if you want to manage the problem.

The first thing to think about is WHY: why do you want to capture the logs? Then WHAT: what do you want to do with the information?

You also need to consider the platform. Linux, for example, has quite a few logging capabilities that can be pressed into service to help bring logs together. But there are also tools such as Telegraf that can consolidate data from logs and send it to a timeseries DB such as InfluxDB, to MQTT (making it easy for Node-RED to process), or both.

In other words, you might want to give some thought to what tools might make your life easier BEFORE you get into potentially complex Node-RED flows.

Log data can be absolutely massive, so working out how to limit the data is one of the key factors.

For some context:

I work in a PCB factory. One of the production plants (where I am) has all new machines, most of them prepared for tracking and easily connected. Not all of them, though: some still use these old text files instead of, let's say, a MySQL database (something I can't understand; it's like building a machine controlled by relays when you could use a PLC).

For about 2-3 years, I've been collecting these machines' data (both real-time data, for monitoring and creating alerts, and production tracking data) and putting it in a MySQL database. Using queries, I can later retrieve information like how much time it took to manufacture a specific batch, and so on. Everything is done in Node-RED because I had essentially zero budget, so I am restricted to whatever does not require licenses to run (like Node-RED and MySQL).

During this time, I repeatedly asked everyone above me (including top management) to expand the scope of what I was doing and include the older plants in the data collection scheme. However, for various reasons, it wasn't considered important at the time by the CEO.

Said CEO was fired at the beginning of the year, and the new one wants real production data and ultimately to insert it into our SAP system (batch tracking is currently done by hand). Now I have been given a two-month deadline (end of October) to get accurate enough production time readings, so they can calculate production costs better and start optimizing from there.

Of course, it is impossible to have everything, so I will start by focusing on batch and individual workpiece tracking. For the lines that keep a consistent speed at all times, or where the production time is one of the key process parameters (like chemical baths), we will start with an estimated calculation based on that parameter and the machine throughput.

For some stand-alone machines, I have text logs that should tell me when the first and last panels of a batch are inspected, and that will be enough for now. In the future, I will work with the production data within the logs and extract what the production and quality departments may need, but that will be done in a second phase.

So the scope and what I want to do are very clear. The machines I have to get the data from are already set: some of them are 30 years old or more, some have text logs, some have nothing, and some have a modern enough PLC that I can get what I need directly from it. Many of the machines (especially the complex ones for which we still have a service contract) are locked by the supplier, and no documentation or API is provided, so if they have text logs, that's all they have (and I can't switch them to MQTT or OPC UA).

As for the database space, that is IT's problem, not mine. In the end, everything should go to the SAP database, and I will only be keeping the last month or two of data in my MySQL.


Cool. How close to "real-time" do you need the calculations?

For example, in some scenarios it might be better to grab whole logs on a timer (CRON for example) and then calculate things a bit more leisurely. That would save a lot of hassle over trying to watch log files for changes.

For normal logs (single file, always the same name) I was planning to mount the remote folders on the machine PCs as folders on my Node-RED server, then use the Tail node to get new lines as they are added. The load on the NR server and the DB is more evenly distributed that way (most process machines will add 1-3 lines per minute, at most).

If I schedule processing of whole logs, I'll have peaks of really high usage that will probably slow down the NR server. I tested it with a standard log file (about 20 MB of text) and the DB slowed down noticeably. Although all lines of the log eventually made it into the DB, NR got some timeout errors from the DB node when I processed the file line by line without throttling. The structure of my flow for log parsing is this:


Note that I have two inputs, one with a File node (for tests) and one with a Tail node (for actual monitoring). I also have Inject nodes to test different types of lines.
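
On the throttling problem above: one thing I'm considering is batching the lines before they hit the DB node, for example grouping them into multi-row INSERTs (sketch below; the table and column names are made up for illustration) and then pacing the resulting messages with a rate-limiting Delay node in front of the MySQL node.

```typescript
// Sketch only: group parsed lines into multi-row INSERTs instead of one query
// per line, so a 20 MB replay becomes a few hundred queries rather than many
// thousands. Table and column names are made up for illustration.
type LogRow = { ts: string; batch: string; message: string };

function toBatchedInserts(rows: LogRow[], batchSize = 200): { sql: string; params: string[] }[] {
  const stmts: { sql: string; params: string[] }[] = [];
  for (let i = 0; i < rows.length; i += batchSize) {
    const chunk = rows.slice(i, i + batchSize);
    const placeholders = chunk.map(() => "(?, ?, ?)").join(", ");
    stmts.push({
      sql: `INSERT INTO machine_log (ts, batch, message) VALUES ${placeholders}`,
      params: chunk.flatMap((r) => [r.ts, r.batch, r.message]),
    });
  }
  return stmts;
}
```

Each element would then go out as one message, spaced out before the DB node, instead of thousands of single-row inserts.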