Extracting values from very long text files

Hello everyone.
I have a flow that uses OCR to convert long PDF files (usually around 30 pages) to text files.
I then read those text files using a file-in node.
From those text files I need to extract specific paragraphs.
I have two questions about this:

  1. How can this be done?
  2. When reading the text files, the debug node does not show the entire payload/context because it is very long. Is there a workaround for this?

Thanks!

You would need to show us an example & indicate what part you need.

Yes, but don't do it (hint: it's in settings.js). Making the debug output larger can cause slowdowns.
Instead, in the debug node settings, tick the box that says Console. The full value will then be sent to the Node-RED console output.
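If you only want a quick look at a long payload in the sidebar, another option might be to log a shortened preview rather than the whole thing. A minimal sketch, assuming the payload is a string; `previewText` and its `head`/`tail` parameters are made-up names for illustration, not part of Node-RED:

```javascript
// Hypothetical helper: keep the first `head` and last `tail` characters
// of a long string and note how many characters were left out in between.
function previewText(text, head = 200, tail = 200) {
    if (text.length <= head + tail) {
        return text; // short enough to show in full
    }
    const omitted = text.length - head - tail;
    const start = text.slice(0, head);
    const end = text.slice(text.length - tail);
    return `${start} ...[${omitted} chars omitted]... ${end}`;
}

// Inside a function node this could be used roughly like:
//   node.warn(previewText(msg.payload));
//   return msg;
```

The full text still travels on in msg.payload; only the logged preview is shortened.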

Thanks for the response,

Say I am looking for a paragraph that begins with a specific word, and I'd like to extract the entire paragraph or the next X lines.

Again, without an example, there could be multiple solutions (some better than others).

For example...

  • do the data lines end with \n, \r or both?
  • is the data enclosed in HTML?
  • how do you recognise the end of a paragraph?

It would be better if you provided real-world sample data and highlighted what you need.
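To show why those questions matter, here is a rough sketch of one possible approach. It assumes the text is plain (no HTML) and that paragraphs are separated by blank lines, neither of which is confirmed for the real data. `paragraphsStartingWith` is a hypothetical name:

```javascript
// Return every paragraph that begins with `word`.
// Assumption: paragraphs are separated by one or more blank lines.
function paragraphsStartingWith(text, word) {
    // Normalise \r\n and lone \r to \n so the split below sees one style only.
    const normalised = text.replace(/\r\n?/g, "\n");
    return normalised
        .split(/\n\s*\n/)            // a blank line ends a paragraph
        .map((p) => p.trim())
        .filter((p) => p.startsWith(word));
}

// Made-up sample, not taken from the real document.
const sampleText = "Intro line\r\nmore intro\r\n\r\nTermination clause one\r\nsecond line\r\n\r\nOther stuff";
console.log(paragraphsStartingWith(sampleText, "Termination"));
```

If the OCR output has no blank lines between paragraphs, this will not work as written and the end-of-paragraph rule would need to be something else.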

I prefer to use operating system utilities for a job like that because I know they can handle arbitrarily large files without loading the entire file into memory.

Since I use Linux on a Raspberry Pi, I can select individual lines with grep.
To extract entire paragraphs I use awk or perl.

I'm sure Windows has similar utilities, just not so good 😜
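For anyone who wants the same "don't load the whole file" behaviour without leaving JavaScript, something along these lines might do. It is only a sketch, using Node's built-in fs and string_decoder modules. `findLines` is a made-up name, and it assumes UTF-8 text with \n or \r\n line endings:

```javascript
const fs = require("fs");
const { StringDecoder } = require("string_decoder");

// Return the first line containing `keyword` plus up to `extraLines` lines after it.
// The file is read in fixed-size chunks, so memory use stays roughly constant.
function findLines(filePath, keyword, extraLines = 0, chunkSize = 64 * 1024) {
    const fd = fs.openSync(filePath, "r");
    const buffer = Buffer.alloc(chunkSize);
    const decoder = new StringDecoder("utf8"); // handles multi-byte characters split across chunks
    const found = [];
    let leftover = ""; // incomplete last line carried over from the previous chunk

    // Collect a line if it matches or follows a match; return true once enough lines are collected.
    const take = (line) => {
        if (found.length > 0 || line.includes(keyword)) {
            found.push(line);
        }
        return found.length > extraLines;
    };

    try {
        let bytesRead;
        while ((bytesRead = fs.readSync(fd, buffer, 0, chunkSize, null)) > 0) {
            const text = leftover + decoder.write(buffer.subarray(0, bytesRead));
            const lines = text.split(/\r?\n/);
            leftover = lines.pop();
            for (const line of lines) {
                if (take(line)) {
                    return found; // stop early, the rest of the file is never read
                }
            }
        }
        const last = leftover + decoder.end();
        if (last !== "") {
            take(last);
        }
        return found;
    } finally {
        fs.closeSync(fd);
    }
}
```

Called as findLines("/path/to/file.txt", "The Contract end date", 3), it would return the matching line and the three lines after it, or an empty array if nothing matches. The path here is a placeholder.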

I uploaded a sample payload and marked in red the lines I want to extract.

In which case I would use regex.

Unfortunately, you posted the sample as an image, so I cannot paste the text into a demo flow to show you how.

But the regex would be something like /The Contract end date.*?\./gms

so here is a demo with crap test data...

[{"id":"83746a428adcbc17","type":"inject","z":"7eecf5f1d763605a","name":"dummy data","props":[{"p":"payload"},{"p":"topic","vt":"str"}],"repeat":"","crontab":"","once":false,"onceDelay":0.1,"topic":"","payload":" The Contract start date blah blah  asd sdlfkj s;doafk ja;sldkf j asd asdodijapsiodfj aspidj;asld 2022.  The Contract end date blah blah  asd sdlfkj s;doafk ja;sldkf j asd asdodijapsiodfj aspidj;asld 2022.   The contract end date blah blah  asd sdlfkj s;doafk ja;sldkf j asd asdodijapsiodfj aspidj;asld 2022.  The Contract end date on a single line 2021.   The contract end date blah blah 2022.","payloadType":"str","x":998,"y":384,"wires":[["056008d289d9ada1"]]},{"id":"056008d289d9ada1","type":"function","z":"7eecf5f1d763605a","name":"get paragraph \"The Contract end date\"","func":"const regex = /The Contract end date.*?\\./gms;\nlet m;\n\nwhile ((m = regex.exec(msg.payload)) !== null) {\n    // This is necessary to avoid infinite loops with zero-width matches\n    if (m.index === regex.lastIndex) {\n        regex.lastIndex++;\n    }\n\n    // The result can be accessed through the `m`-variable.\n    m.forEach((match, groupIndex) => {\n       node.send({payload: match});\n    });\n}\n","outputs":1,"noerr":0,"initialize":"","finalize":"","libs":[],"x":1070,"y":432,"wires":[["52182669c4a415b7"]]},{"id":"52182669c4a415b7","type":"debug","z":"7eecf5f1d763605a","name":"payload","active":true,"tosidebar":true,"console":false,"tostatus":true,"complete":"payload","targetType":"msg","statusVal":"payload","statusType":"auto","x":1164,"y":480,"wires":[]}]
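Since the function code is hard to read inside the exported flow JSON, here is roughly the same idea as plain JavaScript. It uses the same regex but swaps the exec loop for matchAll. The `extractClauses` name and the sample string are made up for illustration:

```javascript
// Same pattern as in the flow:
//   g = find every match
//   s = let "." match newlines too, so a clause can span several lines
//   m = multiline mode for ^ and $ (not used by this pattern, so harmless)
const regex = /The Contract end date.*?\./gms;

function extractClauses(text) {
    // matchAll yields one match array per hit; element 0 is the full matched text.
    return Array.from(text.matchAll(regex), (m) => m[0]);
}

// Made-up sample text, not taken from the real document.
const sample = "The Contract start date is 2020.\nThe Contract end date\nis moved to 2021. The contract end date is 2022.";
console.log(extractClauses(sample));
```

Two things to be aware of: the match is case sensitive, so "The contract end date" with a lowercase c is skipped, and the lazy .*? stops at the first full stop, so an abbreviation with a full stop inside the clause would cut it short.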

Your data does not seem to have any newlines, never mind paragraphs!

This looks like a single clause:

Contract Term
The Contract end date, wherever such reference appears in the Contract, shall be changed from May 31, 2020 to May 31, 2021.

So does this:

Child Support (Applicable to natural persons only; not applicable to corporations, partnerships or LLCs).
Contractor is under no obligation to pay child support or is in good standing blah blah.

The first clause heading (?) has no full stop; the second has one.
That's going to make it harder to extract the clauses you want.
Does your OCR program have the ability to include newline characters?
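If the OCR output did keep newlines, with each heading on its own line and the clause text on the line(s) below it, the extraction could be driven by the heading instead of by full stops. A sketch under exactly that assumption; `clauseUnderHeading` is a hypothetical name:

```javascript
// Return up to `maxLines` lines that follow the line equal to `heading`,
// stopping early at a blank line. Returns null if the heading is not found.
function clauseUnderHeading(text, heading, maxLines = 1) {
    const lines = text.split(/\r?\n/);
    const start = lines.findIndex((line) => line.trim() === heading);
    if (start === -1) {
        return null;
    }
    const body = [];
    for (const line of lines.slice(start + 1)) {
        if (line.trim() === "" || body.length >= maxLines) {
            break;
        }
        body.push(line.trim());
    }
    return body.join("\n");
}

// Text taken from the two clauses quoted above, with assumed newlines.
const clauses = [
    "Contract Term",
    "The Contract end date, wherever such reference appears in the Contract, shall be changed from May 31, 2020 to May 31, 2021.",
    "",
    "Child Support (Applicable to natural persons only; not applicable to corporations, partnerships or LLCs).",
    "Contractor is under no obligation to pay child support or is in good standing blah blah.",
].join("\n");
console.log(clauseUnderHeading(clauses, "Contract Term"));
```

This way it no longer matters whether a heading ends with a full stop.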

Thank you very much for the detailed explanation!
