I think I need machine learning, do you agree?

Over the past 5+ years I have built an extensive aircraft tracking/decoding website around Node-RED.
As time goes on, I am spending more and more time writing custom function block code to decode each type of string / text message.
It's got to the point where most of my time is spent on substring and indexOf wrangling to extract each part of each message.
In short, it seems to me that this approach is unsustainable as each aircraft and airline has subtle differences in message structure.

It was suggested to me over the weekend that I should look at machine learning to model the information and have it extract the data I want out of the text strings.

Reviewing the forums and flows library, there seem to be a few nodes that might work, but before I go down the rabbit hole, I was wondering if it's even the right approach.

Here is a small example of the raw strings I am working with.

00:16:0808-11-20AES:AE022AGES:022.60032S!20FINI/ID60032S,BLUE01,ZPR008A21308/MR1,2/AFLPLA,KPSM/TD071030,10325053
01:30:0808-11-20AES:AE20C7GES:442.77186A!H1G-#MDINI/ID77186A,RCH356,AAM18131E306/MR0,0/AFLPLA,KPSM/TD070705,113038C4
19:37:0708-11-20AES:AE1472GES:822.77180A!H1N-#MDINI/ID77180A,RCH871,AJRF3368F313/MR1,0/AFFJDG,OAIX/TD081130,1059997B
20:49:0908-11-20AES:AE123AGES:822.44128A!H1P-#MDINI/ID44128A,RCH836,AJZA3362C312/MR0,0/AFFJDG,OKAS/TD081245,1245C15B
04:16:0508-11-20AES:AE0580GES:D02.70035B!33F/AMCTACC.INI01081215INITIALIZERCH551MC0035ABR02Y5XD313KDOVLERT081400/
22:33:0708-11-20AES:AE10BFGES:822.10196A!H1D-#MDINI/ID10196A,RCH873,JJRF3361F312/MR0,0/AFRODN,FJDG/TD071355,135525BA
22:47:0908-11-20AES:AE117EGES:822.21112A!H1C-#MDINI/ID21112A,RCH877,JJZA3363C312/MR0,0/AFFJDG,OKAS/TD081345,1345F5F3
07:32:0808-11-20AES:AE0243GES:022.80047S!20G01/Y/MC/0047/08/KIAB/KSVN/1701/0000/SNAP85//6A9E

It's hard to highlight the data I want extracted in the forums, but in short,
Get 6 characters after 'AES:'
Get 6 characters after the first dot.
(They are the super easy ones).
Get the callsign (it's in different places), but as an example, BLUE01, RCH356, RCH871, RCH551 etc. (Note, it's not always 6 characters, it can often be 7).
Then the really tricky stuff, the airport codes, they are usually toward the end of the message.
eg, first message, LPLA and KPSM
Fifth message is a bit tricky, but I need KDOV and LERT.
Lastly, I need the date and time; in the first message it's at the end, '07' and '1030'.
Fifth message I need '08' and '1400'.
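
The two "super easy" fields above can each be pulled with one capturing regex instead of indexOf arithmetic. This is a minimal sketch, assuming the value after 'AES:' is always six hex characters and the field after the first dot is always six characters (true for the samples shown, not verified beyond them). The names `extractHeader` and `afterDot` are made up for illustration.

```javascript
// Hypothetical helper: pulls the two fixed-width header fields.
// ASSUMPTION: six hex characters follow "AES:" and six characters follow the first dot.
function extractHeader(raw) {
    const aes = raw.match(/AES:([0-9A-F]{6})/);
    const dot = raw.match(/\.(.{6})/);   // an unanchored match finds the first dot
    return {
        aes: aes ? aes[1] : null,
        afterDot: dot ? dot[1] : null
    };
}

const sample = "00:16:0808-11-20AES:AE022AGES:022.60032S!20FINI/ID60032S,BLUE01,ZPR008A21308/MR1,2/AFLPLA,KPSM/TD071030,10325053";
console.log(extractHeader(sample));
```

Returning null for a missing field, rather than throwing, lets a downstream switch node route incomplete messages somewhere for inspection.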

As I said, I have been doing a LOT of JavaScript string code in function blocks, each fed a particular message type and structure by upstream switch blocks and more function blocks that test and direct each message format to the correct 'decoder'.

I have hundreds and hundreds of raw and decoded messages, so I feel I could 'feed' a machine learning system pretty well and teach it what I want.

Note that the example messages are a very small subset of messages; I have about 5-7 main message sources, each with a few dozen message sub types.

I would really like to stay in the Node-RED ecosystem, but can branch out-decode-branch in if need be.

Thanks for your thoughts.

Hi @thebaldgeek interesting topic - I will keep an eye on this thread for sure.

In the meantime, have you considered REGEX for extracting the parts of the string? While it would still mean maintaining the flow manually, it should simplify your flows somewhat. (guessing)

Hi Steve. Until I started decoding the messages, I never even knew about REGEX. I am not any good at it, and have to lean on a few of the web-based testing sites, but yes, I make as much use of it as I can.

It seems really powerful in 'checking' rather than 'getting'?
For example, here I test if an aircraft is part of the USA air force;

var re = new RegExp("^ADF[7-9]..$|^ADF[A-F]..$|^A(E|F)....$");
if (re.test(tail)){
    msg.airforce = "USA"; return msg}

Here is another use case where I check my 'getsubstring' code;

// If it has a number at the start of the message or ILS or RWY a - anywhere or brackets anywhere or a ! anywhere in the 'call' it aint no call1sign
if (msg.call1.match(/^\d/) || (msg.call1.match(/^(ILS)$/)) || (msg.call1.match(/^(RWY.*)$/)) || (msg.call1.match(/^[-]$/)) || (msg.call1.match(/(\))$|(\()$/)) || (msg.call1.match(/(\!)$/))){
    msg.call1 = ":";
}

So yeah, I use it, but as I said, only to check rather than get.
If you have any example websites where you can use it to intelligently find and extract the sort of data I gave in the example, I am more than open to that approach.
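
Regex can "get" as well as "check": anything inside parentheses is captured, and `match()` hands those captures back. The sketch below does this for the INI-style strings from the first post. It assumes the `/ID…,callsign,`, `/AFxxxx,yyyy` and `/TDddhhmm,` layout seen in those samples holds in general. The pattern, the group names and `parseIni` are illustrative only, and named groups need a reasonably recent Node.js (10 or later).

```javascript
// Illustrative pattern for the "INI/ID...,CALLSIGN,.../AFdep,dest/TDddhhmm," layout.
// ASSUMPTION: airports are four letters and /TD carries day (2 digits) + time (4 digits).
const INI_RE = /\/ID[^,]+,(?<callsign>[^,]+),.*\/AF(?<dep>[A-Z]{4}),(?<dest>[A-Z]{4})\/TD(?<day>\d{2})(?<time>\d{4}),/;

function parseIni(raw) {
    const m = raw.match(INI_RE);
    // Copy the captures into a plain object; null means "not this format"
    return m ? { ...m.groups } : null;
}

const first = "00:16:0808-11-20AES:AE022AGES:022.60032S!20FINI/ID60032S,BLUE01,ZPR008A21308/MR1,2/AFLPLA,KPSM/TD071030,10325053";
console.log(parseIni(first));
```

For the first sample this yields callsign BLUE01, airports LPLA and KPSM, day 07 and time 1030. A string without the `/ID` section returns null, so the same call doubles as the "check".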

Is it possible to write some rules in English that define for the general case (but with variable parameters of some sort) for what you need to extract?

Really interesting question.
After being at this for a few years, yes, I can look at any of the messages and quickly 'read' them and extract the keywords I am after.

The challenge is that the messages have different delimiters. Sometimes a single dot, sometimes a double dot, sometimes a forward slash, sometimes a double back slash. This makes it hard for me to say out loud 'For this message, go forward three slashes and get the next 4 characters or until the first comma'.

Is that what you mean? Being able to say out loud how to get the data from each message structure?

Is there a name for this format.. presumably there is a spec for it somewhere ?

Generally it is known as ACARS (Aircraft Communication Addressing and Reporting System.)
The spec is laughable. Each airline, each aircraft manufacturer and each aircraft tech just makes it up as they go along. There are so many variations and differences that it is just astonishing the system even works as well as it does.
Then you have the main types. ACARS (analog), VDL (digital ACARS), L-Band (1.5gig), C-Band (3.8gig) and HF (High-Frequencies). Each has its own sub flavor of the spec.

hah - great - gotta love those "evolving" specs ! ... I was just looking at https://github.com/TLeconte/acarsdec#acarsdec - which seems to do some of it using an SDR... and has a JSON output option.

Yes, I use that exact software on the remote stations (Raspberry Pi), each remote station feeds the JSON into my central Node-RED server.
It is those JSON-formatted strings that I am battling with, among other things.

@thebaldgeek are the strings coming from one source consistent?

If so, then it should be quite easy to capture the parts of interest for each string format.

e.g...

this regex...

captures these...

link to the regex I used

some online regex sites even generate working JS for you.

EDIT...
obviously the captures I did are likely nonsense :slight_smile:


The example strings I first posted are all from the same source and are only a very small subset of possible string sequences.
The inconsistencies from any one source are the core of my challenge.

I have never seen REGEX concatenated like that.
If I could detect each variation of message types from each source and direct it to a function block with the REGEX extractor, it might make things a bit easier.
Pre-detecting the variations within each source is a big part of my challenge.

Here is a partial screen shot of one of my methods of routing and decoding the different message formats from one source.

Each inject block has an example string I have seen from that source.
You can see 15 of them stacked up there.
I use this so I can 'walk the tree' when I make a change to the decoding flow and make sure I have not broken any other combination of string structure from that source.
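
The same "walk the tree" check can also be scripted: keep each known string next to the values it should decode to, and re-run the whole table after every change. This is only a sketch. `decode` is a deliberately tiny stand-in so the example runs on its own, and the fixture layout and the names `fixtures` and `runFixtures` are assumptions, not part of the flow described above.

```javascript
// Stand-in decoder so the example is self-contained; swap in the real one.
function decode(raw) {
    const m = raw.match(/,([A-Z]+\d+),/);   // toy rule: first ",LETTERSdigits," field
    return { callsign: m ? m[1] : null };
}

// Each fixture pairs a raw string with the fields it is expected to decode to.
const fixtures = [
    {
        raw: "00:16:0808-11-20AES:AE022AGES:022.60032S!20FINI/ID60032S,BLUE01,ZPR008A21308/MR1,2/AFLPLA,KPSM/TD071030,10325053",
        expect: { callsign: "BLUE01" }
    },
    {
        raw: "19:37:0708-11-20AES:AE1472GES:822.77180A!H1N-#MDINI/ID77180A,RCH871,AJRF3368F313/MR1,0/AFFJDG,OAIX/TD081130,1059997B",
        expect: { callsign: "RCH871" }
    }
];

// Returns one entry per mismatching field; an empty array means nothing broke.
function runFixtures(decoder, cases) {
    const failures = [];
    for (const c of cases) {
        const got = decoder(c.raw) || {};
        for (const key of Object.keys(c.expect)) {
            if (got[key] !== c.expect[key]) {
                failures.push({ raw: c.raw, field: key, expected: c.expect[key], got: got[key] });
            }
        }
    }
    return failures;
}

console.log(runFixtures(decode, fixtures));
```

An existing archive of raw and decoded messages could serve as the fixture table, so every known string structure gets re-checked in one pass.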


I would possibly approach it like this...
for any one source, define a REGEX for each "style" then push that through a second REGEX to capture the parts.

e.g. Lines 1, 2, 3, 4, 6 & 7 appear to be the same format - let's call them FORMAT-1 and they can be quickly identified by .+AES:.+?:.+?!.+?,.+?,.+?,.+?,.+?,\d+.*

e.g...


regex

Then pass the FORMAT-1 strings through a FORMAT-1 REGEX to capture the parts of interest

e.g..

regex


Line 5 appears to be another format and can be quickly identified by .+AES:.+?:.+?!.+?\/.+?\..+?.*


Line 8 appears to be another format and can be quickly identified by .+AES:.+?:.+?!.+?\/.+?\/.+?\/.+?\/.+?\/\d+.*


Hope that gives you an idea as to how you might streamline this?

If not, you might learn a little more about REGEX :blush:

Huh, I never thought of using REGEX to detect and 'switch' the different formats to the correct decoder.
Might be a bit more reliable than my current JavaScript string hell that I am in at the moment.

I will still end up with 20-30 REGEX detectors and the same number of decoders for each stream (as I said, I have 6 main streams, with around 25 variations each), so we will end up with around 300 REGEX blocks.
Doable I guess..

Thinking a little more...

If the variations are sufficiently different (in any one stream), you could pass all strings straight through a well crafted "decoder" & only one will match. This would negate the need for a "detector" regex (just jump straight into the decoder/capture regex)

In theory, if there are 6 streams with 25 variations max, then that is 150 (or fewer) regexes to set up.
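
A rough sketch of that "decoder only" idea: keep the capture regexes in a list and let the first one that matches decide the format. The three patterns are fitted only to the eight sample strings from the first post and are assumptions, not a spec. In particular, treating SNAP85 as the callsign in the 20G string is a guess, and no date or time is captured for that format because the samples do not make the field positions clear. `DECODERS` and `decodeFirstMatch` are hypothetical names.

```javascript
// Each entry is one capture regex; the first that matches decides the format.
// ASSUMPTION: patterns are illustrative, fitted only to the eight posted samples.
const DECODERS = [
    { name: "INI", re: /\/ID[^,]+,(?<callsign>[^,]+),.*\/AF(?<dep>[A-Z]{4}),(?<dest>[A-Z]{4})\/TD(?<day>\d{2})(?<time>\d{4}),/ },
    { name: "33F", re: /INITIALIZE(?<callsign>[A-Z]+\d+).*(?<dep>[A-Z]{4})(?<dest>[A-Z]{4})(?<day>\d{2})(?<time>\d{4})\/$/ },
    { name: "20G", re: /!20G(?:[^\/]*\/){5}(?<dep>[A-Z]{4})\/(?<dest>[A-Z]{4})\/(?:[^\/]*\/){2}(?<callsign>[^\/]+)\// }
];

function decodeFirstMatch(raw) {
    for (const d of DECODERS) {
        const m = raw.match(d.re);
        if (m) return { format: d.name, ...m.groups };
    }
    return { format: null };   // nothing matched: a candidate for a new decoder
}

const fifth = "04:16:0508-11-20AES:AE0580GES:D02.70035B!33F/AMCTACC.INI01081215INITIALIZERCH551MC0035ABR02Y5XD313KDOVLERT081400/";
console.log(decodeFirstMatch(fifth));
```

On the fifth sample this gives RCH551, KDOV, LERT, day 08 and time 1400. Adding a new variation means appending one entry to the list rather than wiring up another detector and decoder pair.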


I noticed in the data that everything up to the bang (!) is consistent. Then after that are three characters that determine the 'format' of the rest of the line. Though all of the 'H' lines seem to be the same format. So in the sample of data I see 20F, 20G, H**, 33F formats. Write your code to look at those three characters and then call the right extractor based on that. Or as later suggested ... possibly just do a switch/case on those three characters. That's how I would probably go. Yes it could possibly be a large code block but very little of it would actually execute for each run. You would have a "catch all" at the end that could send you an email with a new or unrecognized pattern too. My 2 or 3 cents.
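
One way to sketch that routing with a "catch all": a lookup table keyed on the three characters after the bang, plus a counter for anything unrecognized. The extractors here are empty placeholders. Treating every label that starts with "H" as the same layout as 20F is an assumption taken from the eight samples only, and `labelOf`, `route` and `unknownLabels` are made-up names.

```javascript
// Label = the three characters after the first "!" (layout assumed from the samples).
function labelOf(raw) {
    const i = raw.indexOf("!");
    return i === -1 ? null : raw.slice(i + 1, i + 4);
}

// Placeholder extractors; real ones would return the decoded fields.
const extractors = new Map([
    ["20F", raw => ({ format: "INI" })],
    ["20G", raw => ({ format: "20G" })],
    ["33F", raw => ({ format: "33F" })]
]);

const unknownLabels = new Map();   // label -> how many times it was seen

function route(raw) {
    const label = labelOf(raw);
    // ASSUMPTION: all "H.." labels share the 20F layout, as in the samples
    const key = label && label.startsWith("H") ? "20F" : label;
    const extractor = extractors.get(key);
    if (!extractor) {
        // Catch-all: remember the label so new patterns can be reported later
        unknownLabels.set(label, (unknownLabels.get(label) || 0) + 1);
        return null;
    }
    return { label, ...extractor(raw) };
}

console.log(route("01:30:0808-11-20AES:AE20C7GES:442.77186A!H1G-#MDINI/ID77186A,RCH356,AAM18131E306/MR0,0/AFLPLA,KPSM/TD070705,113038C4"));
```

The `unknownLabels` map is what an email or dashboard alert could be built from, without the main decode path needing to know about it.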

Because I felt like playing with a little code this afternoon. I built this little "tester".


Of the 8 samples you provided 6 actually shared the same data format, no. 8 was easy and no. 5 ... well PITA comes to mind. :slight_smile:

I made a few assumptions on no. 5

  1. That the actual callsign is RCH551
  2. That since that is the callsign it will always follow INITIALIZE
  3. That callsigns generally end with a number.

Given those assumptions, here is a nearly pure JavaScript solution (1 line of RegEx ... because I'm not that good at it either).

[
    {
        "id": "3cd1ece7.287d74",
        "type": "inject",
        "z": "39cd371d.047428",
        "name": "S1",
        "topic": "",
        "payload": "00:16:0808-11-20AES:AE022AGES:022.60032S!20FINI/ID60032S,BLUE01,ZPR008A21308/MR1,2/AFLPLA,KPSM/TD071030,10325053",
        "payloadType": "str",
        "repeat": "",
        "crontab": "",
        "once": false,
        "onceDelay": 0.1,
        "x": 110,
        "y": 80,
        "wires": [
            [
                "a7c1ae43.1a457"
            ]
        ]
    },
    {
        "id": "84578dbd.1999a",
        "type": "inject",
        "z": "39cd371d.047428",
        "name": "S2",
        "topic": "",
        "payload": "01:30:0808-11-20AES:AE20C7GES:442.77186A!H1G-#MDINI/ID77186A,RCH356,AAM18131E306/MR0,0/AFLPLA,KPSM/TD070705,113038C4",
        "payloadType": "str",
        "repeat": "",
        "crontab": "",
        "once": false,
        "onceDelay": 0.1,
        "x": 110,
        "y": 120,
        "wires": [
            [
                "a7c1ae43.1a457"
            ]
        ]
    },
    {
        "id": "f72e756f.2e6628",
        "type": "inject",
        "z": "39cd371d.047428",
        "name": "S3",
        "topic": "",
        "payload": "19:37:0708-11-20AES:AE1472GES:822.77180A!H1N-#MDINI/ID77180A,RCH871,AJRF3368F313/MR1,0/AFFJDG,OAIX/TD081130,1059997B",
        "payloadType": "str",
        "repeat": "",
        "crontab": "",
        "once": false,
        "onceDelay": 0.1,
        "x": 110,
        "y": 160,
        "wires": [
            [
                "a7c1ae43.1a457"
            ]
        ]
    },
    {
        "id": "b8e9e997.676798",
        "type": "inject",
        "z": "39cd371d.047428",
        "name": "S4",
        "topic": "",
        "payload": "20:49:0908-11-20AES:AE123AGES:822.44128A!H1P-#MDINI/ID44128A,RCH836,AJZA3362C312/MR0,0/AFFJDG,OKAS/TD081245,1245C15B",
        "payloadType": "str",
        "repeat": "",
        "crontab": "",
        "once": false,
        "onceDelay": 0.1,
        "x": 110,
        "y": 200,
        "wires": [
            [
                "a7c1ae43.1a457"
            ]
        ]
    },
    {
        "id": "b86269ef.934348",
        "type": "inject",
        "z": "39cd371d.047428",
        "name": "S5",
        "topic": "",
        "payload": "04:16:0508-11-20AES:AE0580GES:D02.70035B!33F/AMCTACC.INI01081215INITIALIZERCH551MC0035ABR02Y5XD313KDOVLERT081400/",
        "payloadType": "str",
        "repeat": "",
        "crontab": "",
        "once": false,
        "onceDelay": 0.1,
        "x": 110,
        "y": 240,
        "wires": [
            [
                "a7c1ae43.1a457"
            ]
        ]
    },
    {
        "id": "ee4733dd.ab206",
        "type": "inject",
        "z": "39cd371d.047428",
        "name": "S6",
        "topic": "",
        "payload": "22:33:0708-11-20AES:AE10BFGES:822.10196A!H1D-#MDINI/ID10196A,RCH873,JJRF3361F312/MR0,0/AFRODN,FJDG/TD071355,135525BA",
        "payloadType": "str",
        "repeat": "",
        "crontab": "",
        "once": false,
        "onceDelay": 0.1,
        "x": 110,
        "y": 280,
        "wires": [
            [
                "a7c1ae43.1a457"
            ]
        ]
    },
    {
        "id": "1330b08b.18526f",
        "type": "inject",
        "z": "39cd371d.047428",
        "name": "S7",
        "topic": "",
        "payload": "22:47:0908-11-20AES:AE117EGES:822.21112A!H1C-#MDINI/ID21112A,RCH877,JJZA3363C312/MR0,0/AFFJDG,OKAS/TD081345,1345F5F3",
        "payloadType": "str",
        "repeat": "",
        "crontab": "",
        "once": false,
        "onceDelay": 0.1,
        "x": 110,
        "y": 320,
        "wires": [
            [
                "a7c1ae43.1a457"
            ]
        ]
    },
    {
        "id": "3e6fe386.f1ccec",
        "type": "inject",
        "z": "39cd371d.047428",
        "name": "S8",
        "topic": "",
        "payload": "07:32:0808-11-20AES:AE0243GES:022.80047S!20G01/Y/MC/0047/08/KIAB/KSVN/1701/0000/SNAP85//6A9E",
        "payloadType": "str",
        "repeat": "",
        "crontab": "",
        "once": false,
        "onceDelay": 0.1,
        "x": 110,
        "y": 360,
        "wires": [
            [
                "a7c1ae43.1a457"
            ]
        ]
    },
    {
        "id": "a7c1ae43.1a457",
        "type": "function",
        "z": "39cd371d.047428",
        "name": "Parser",
        "func": "var raw = msg.payload;\nvar recType = raw.substring(41, 44);\n\nvar output = {};\noutput.AES = raw.substring(20, 26);\noutput.DOT = raw.substring(34, 40);\nvar parts;\n\n\nswitch (recType){\n    case '20F':\n    case 'H1G':\n    case 'H1N':\n    case 'H1P':\n    case 'H1C':\n    case 'H1D':\n        parts = raw.split(',');\n        output.callsign = parts[1];\n        output.AC1 = parts[3].substring(parts[3].length-4, parts[3].length);\n        output.AC2 = parts[4].substring(0, 4);\n        output.ts = {\n            'day':parts[4].substring(parts[4].length-6, parts[4].length-4),\n            'time':parts[4].substring(parts[4].length-4, parts[4].length)\n        };\n        break;\n\n    case '33F':\n        parts = raw.substring(raw.length-15, raw.length);\n        \n        // ASSUME: RCH is the start of this callsign\n        // ASSUME: callsigns start after the INITIALIZE substring\n        // ASSUME: callsigns end with a number ???\n        \n        // What are we going to look for?\n        var sstring = 'INITIALIZE';\n        // Where is the end of it?\n        var csStart = raw.indexOf(sstring)+sstring.length;\n        // Get 7 characters\n        var tmpCS = raw.substring(csStart, csStart+7);\n        // Strip the last character if it is a LETTER.\n        output.callsign = tmpCS.replace(/[A-Z]$/,\"\");\n        output.AC1 = parts.substring(0, 4);\n        output.AC2 = parts.substring(4, 8);\n        output.ts = {\n            'day':parts.substring(8, 10),\n            'time':parts.substring(10, 14)\n        };\n        break;\n\n    case '20G':\n        parts = raw.split('/');\n        output.callsign = parts[9];\n        output.AC1 = parts[5];\n        output.AC2 = parts[6];\n        output.ts = {\n            'day':parts[4],\n            'time':parts[3]\n        };\n        break;\n\n    default:\n        output = 'No parser for type \"'+recType+'\"';\n}\n\nmsg.payload = {\n    'raw':raw,\n    'recType':recType,\n    'output':output\n}\n\nreturn msg;",
        "outputs": 1,
        "noerr": 0,
        "x": 350,
        "y": 220,
        "wires": [
            [
                "fb49e3c8.93543"
            ]
        ]
    },
    {
        "id": "fb49e3c8.93543",
        "type": "debug",
        "z": "39cd371d.047428",
        "name": "Show Payload",
        "active": true,
        "tosidebar": true,
        "console": false,
        "tostatus": false,
        "complete": "payload",
        "targetType": "msg",
        "x": 560,
        "y": 220,
        "wires": []
    }
]

There are better RegEx parser people than I on here, I'm sure, but these are fairly straightforward parses (at least these samples) so I'm not sure if there would be a significant performance difference either way. Anyway, hope this helps a little.

  • ray

Thanks everyone for your replies thus far.
I can see that I have done something bad, and something "good".
The bad thing I did was provide too small an example data set.
The "good" thing I did was exactly the same as everyone replying to this thread did.... and that is, build code to detangle the data in front of me.... The problem is that is exactly what I have been doing for the past 5 years and it's just getting out of hand.... As I said, I am looking at roughly 300 different string constructions. Not 3-5.

Here is a better sample set from one of my 5 data sources. (In this case, L-Band ACARS messages).
Keep in mind that these are just another small example data set, I have seen dozens of strings from this one data source.


L-Band 18:31:07 13-11-20 AES:AE0681 GES:82 2 .80086S ! 20 D

	INI/ID80086S,PEARL17,CPZCPFF1P317/MR0,1/AFRJSM,RODN/TD122255,0615C439


L-Band 18:36:33 13-11-20 AES:AE1452 GES:82 2 .55148A ! H1 U

	- #MD/A6 FUKJJYA.ADS.55148A070C0BC40D010E010F01150125DFD6

	ADS-C message:
	 Periodic contract request:
	  Contract number: 12
	  Reporting interval: 256 seconds
	  Predicted route: every 1 reports
	  Earth reference data: every 1 reports
	  Air reference data: every 1 reports
	  Aircraft intent data: every 1 reports, projection time: 37 minutes
	

L-Band 18:37:55 13-11-20 AES:AE1452 GES:82 2 .55148A ! H1 X

	- #MD/A6 FUKJJYA.ADS.55148A01E281

	ADS-C message:
	 Cancel all contracts and terminate connection
	

L-Band 18:38:05 13-11-20 AES:AE1452 GES:82 2 .55148A ! H1 Y

	- #MD/AA FUKJJYA.AT1.55148A22225C6840F8F6

	FANS-1/A CPDLC Message:
	 CPDLC Uplink Message:
	  Header:
	   Msg ID: 4
	   Timestamp: 08:37:49
	  Message data:
	   END SERVICE
	

L-Band 00:46:04 13-11-20 AES:AE0577 GES:D0 2 .60026B ! 31 W

	/AMCTACC.FTX01130845
	FRE
	ETAD NOT GOOD AT THIS TIME. EDDN AND ETAR ARE GOOD.


L-Band 00:46:55 13-11-20 AES:344344 GES:D0 2 .EC-LUX ! AA L

	/NYCODYA.AT1.EC-LUX22A2EC2840821F

	FANS-1/A CPDLC Message:
	 CPDLC Uplink Message:
	  Header:
	   Msg ID: 5
	   Timestamp: 08:46:48
	  Message data:
	   END SERVICE
	

C-Band 00:53:15 13-11-20 AES:3B7781 GES:D0 2 .F-RAJB ! SA 9 Flight E11011

	S17AE110110ES085310VS

	Media Advisory, version 0:
	 Link Default SATCOM established at 08:53:10 UTC
	 Available links: VHF ACARS, Default SATCOM
	

L-Band 00:54:14 13-11-20 AES:AE146A GES:02 2 .77172A ! H1 U

	- #MDWXR/ID77172A,RCH484,PAM410001311/MR1,18/MTKRIV,1014,TAF KRIV 131014Z VRB04KT 9999 SCT200 QNH3007INS T7C-AUTO RESPONSE BASED ON AVAILABLE DATA-CONTACT C2 AGENCY FOR ADDITIONAL INFO9CBC


L-Band 00:54:25 13-11-20 AES:344344 GES:D0 2 .EC-LUX ! H1 R

	- #MDREQPOS037B

I need to extract different data from the different strings.
This is why I am wondering if machine learning is not a better way to go?
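
Whatever ends up decoding the bodies, the first line of each message in this second sample set looks regular enough to split off with one regex. This is a sketch under that assumption only. The group names (`band`, `afterDot`, `label`, `block`, `flight`) are my own guesses at what the fields mean, and `splitMessage` is a hypothetical helper.

```javascript
// ASSUMPTION: header line shape taken from the samples above, e.g.
// "L-Band 18:31:07 13-11-20 AES:AE0681 GES:82 2 .80086S ! 20 D"
const HEADER_RE = /^(?<band>\S+)\s+(?<time>\d{2}:\d{2}:\d{2})\s+(?<date>\d{2}-\d{2}-\d{2})\s+AES:(?<aes>[0-9A-F]{6})\s+GES:(?<ges>[0-9A-F]{2})\s+\d\s+\.(?<afterDot>\S+)\s+!\s+(?<label>\S{2})\s+(?<block>\S)(?:\s+Flight\s+(?<flight>\S+))?\s*$/;

// Splits one message into its parsed header and its non-empty body lines.
function splitMessage(text) {
    const lines = text.split(/\r?\n/);
    const m = lines[0].match(HEADER_RE);
    if (!m) return null;   // header did not fit the assumed shape
    const body = lines.slice(1).map(l => l.trim()).filter(l => l.length > 0);
    return { header: { ...m.groups }, body };
}

const msgText = "L-Band 18:31:07 13-11-20 AES:AE0681 GES:82 2 .80086S ! 20 D\n\n\tINI/ID80086S,PEARL17,CPZCPFF1P317/MR0,1/AFRJSM,RODN/TD122255,0615C439\n";
console.log(splitMessage(msgText));
```

With the header out of the way, each body can be routed on its own content (for example the text before the first slash or dot), which keeps the per-format decoders shorter.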

Quick big picture - There are dozens of websites that gather and display aircraft position data. (ADSBExchange, Flightaware, FlightRadar24, RadarBox, OpenSky, SkyScan etc).
There are zero sites that provide ACARS (message) data with ADSB (position) data. This is what I am trying to build in Node-RED.
The problem I am up against is that over the past 5 years I am spending more time writing REGEX/JavaScript decoders than putting the data together on a map!

OK, I can see your issue. But not being "an air traffic" person I have noticed, from a purely data pattern point of view, that some of the samples in the new data set you provided do not fit even remotely into what you were asking for originally. So that begs another few questions.

  1. How many different data sets are there in the 300 strings?
  2. Are there any standards in place for this, or do airlines just make it up as they need it?

If both of those are variables, I don't even know how machine learning will help. You can have the best AI/machine learning out there and if I just make up a new string of letters and numbers that only really mean anything to me ... how is it going to figure it out? There almost has to be some sort of standard, and if there is you can code to it as long as they don't just willy-nilly change it on a whim, like some other "standards" that are out there.

Sorry for the delay getting back to you. I was out of town for a few days....

  1. There are probably 40-50 sets. That's why I have been at it for 5ish years, I just kept thinking 'just one more string and I will be done'. It really has been death by a thousand regexes. It's only now I sort of woke up one day while deep in Node-RED and came to see that I am never going to be just 'one more' away from finished.... The encouraging thing is that I have both decoded and raw strings to feed the beast.
  2. Nope, no standard.

The problem and question is an odd one for sure.