Extracting values from very long text files

milner236 · 2 May 2022 09:34

Hello everyone.
I have a flow that uses OCR to convert long PDF files (usually around 30 pages) to text files.
I then read those text files using a file-in node.
From those text files I need to extract certain specific paragraphs.
I have 2 questions regarding that matter.

1. How can that be possible?
2. When reading the text files, the debug node does not show the entire payload/context as it's very long. Is there a workaround for this?

Thanks!

Steve-Mcl · 2 May 2022 09:38

You would need to show us an example & indicate what part you need.

Yes, but don't do it (hint: it's in settings.js). Making the debug output larger can cause slowdowns.
Instead, on the debug node settings, tick the box that says Console - then the full value will be sent to the node-red console output.
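
Another way to keep the debug sidebar usable is to send it a short summary of the text rather than the text itself. The following is only a sketch: `summarise` and its `edge` parameter are hypothetical names, not part of Node-RED.

```javascript
// Hypothetical helper: describe a long string instead of dumping all of it.
// `edge` is the number of characters kept from the start and from the end.
function summarise(text, edge = 200) {
    const lines = text.split(/\r\n|\r|\n/);
    return {
        length: text.length,          // total characters
        lineCount: lines.length,      // lines, whatever the line-ending style
        head: text.slice(0, edge),    // first `edge` characters
        tail: text.length > edge ? text.slice(-edge) : ""  // last `edge` characters
    };
}

// Plain Node.js usage with made-up data:
const samplePayload = "line one\nline two\nline three";
console.log(summarise(samplePayload, 8));
```

In a function node the last two lines would be replaced by something like `msg.payload = summarise(msg.payload); return msg;`.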

milner236 · 2 May 2022 09:41

Thanks for the response,

Assume I am looking for a paragraph that begins with a specific word, and I'd like to extract the entire paragraph or X number of lines.
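
As a rough illustration of that request, here is a minimal sketch in plain JavaScript. It assumes the text really does contain line breaks and that a paragraph ends at the first blank line; the function name `extractFrom` and its parameters are made up for the example.

```javascript
// Hypothetical helper: return the block starting at the first line that
// begins with `keyword`.
// - If `maxLines` is an integer, return at most that many lines.
// - Otherwise return lines up to (not including) the next blank line.
function extractFrom(text, keyword, maxLines) {
    const lines = text.split(/\r\n|\r|\n/);
    const start = lines.findIndex(line => line.trimStart().startsWith(keyword));
    if (start === -1) return null;   // keyword not found

    if (Number.isInteger(maxLines)) {
        return lines.slice(start, start + maxLines).join("\n");
    }

    let end = start;
    while (end < lines.length && lines[end].trim() !== "") end++;
    return lines.slice(start, end).join("\n");
}

// Made-up sample text
const doc = "Intro\n\nTerm starts here\nsecond line\n\nOther";
console.log(extractFrom(doc, "Term"));     // whole paragraph
console.log(extractFrom(doc, "Term", 1));  // first line only
```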

Steve-Mcl · 2 May 2022 09:43

Again, without an example, there could be multiple solutions (some better than others).

For example...

Do the data lines have \n, \r or both?
Is the data enclosed in HTML?
How do you recognise the end of a paragraph?

It would be better if you provided real-world sample data and highlighted what you need.
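
The line-ending question can be checked in code before writing any extraction logic. A small sketch, with hypothetical function names, that counts each style and then normalises everything to \n:

```javascript
// Hypothetical helper: count each kind of line ending in a string.
function describeLineEndings(text) {
    const crlf = (text.match(/\r\n/g) || []).length;      // Windows style
    const cr = (text.match(/\r(?!\n)/g) || []).length;    // lone \r
    const lf = (text.match(/(?<!\r)\n/g) || []).length;   // lone \n
    return { crlf, cr, lf };
}

// Hypothetical helper: convert \r\n and lone \r to \n so later code
// only has to deal with one style.
function normaliseLineEndings(text) {
    return text.replace(/\r\n|\r/g, "\n");
}

// Made-up sample with mixed endings
const mixed = "a\r\nb\nc\rd";
console.log(describeLineEndings(mixed));
console.log(JSON.stringify(normaliseLineEndings(mixed)));
```

If all three counts come back as zero, the text has no line breaks at all and a line-based approach will not work.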

jbudd · 2 May 2022 09:46

I prefer to use operating system utilities for a job like that because I know they can handle arbitrarily large files without loading the entire file into memory.

Since I use Linux on a Raspberry Pi I can select individual lines with grep.
To extract entire paragraphs I use awk or perl.
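
The same idea of handling one paragraph at a time, rather than the whole file, can be sketched in JavaScript. This is not what the command-line tools do internally, just an illustration: the generator below takes any synchronous iterable of lines and buffers only the current paragraph. The name `paragraphsStartingWith` is hypothetical, paragraphs are assumed to be separated by blank lines, and wiring it to a real file stream is left out.

```javascript
// Hypothetical generator: consume lines one at a time and yield each
// blank-line-separated paragraph whose first line starts with `keyword`.
function* paragraphsStartingWith(lines, keyword) {
    let current = [];
    for (const line of lines) {
        if (line.trim() === "") {
            // Blank line closes the current paragraph
            if (current.length && current[0].startsWith(keyword)) {
                yield current.join("\n");
            }
            current = [];
        } else {
            current.push(line);
        }
    }
    // The last paragraph may not be followed by a blank line
    if (current.length && current[0].startsWith(keyword)) {
        yield current.join("\n");
    }
}

// Made-up input lines
const inputLines = ["Intro text", "", "Contract Term", "The end date is changed.", "", "Other"];
for (const paragraph of paragraphsStartingWith(inputLines, "Contract Term")) {
    console.log(paragraph);
}
```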

I'm sure Windows has similar utilities, just not so good

milner236 · 2 May 2022 09:54

I uploaded a sample payload and marked in red the lines I want to extract.

Steve-Mcl · 2 May 2022 10:02

In which case I would use regex.

Unfortunately, you posted the sample as an image, so I cannot paste the text into a demo flow to show you how.

But the regex would be something like /The Contract end date.*?\./gms
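
To make that pattern easier to read: `.*?` is a lazy match that stops at the first full stop, `g` finds every match, `s` lets `.` cross line breaks, and `m` only affects `^` and `$`, which this pattern does not use. A minimal sketch in plain JavaScript with made-up text (the function name `findClauses` is hypothetical):

```javascript
// Return every stretch of text that starts with "The Contract end date"
// and runs up to and including the next full stop.
function findClauses(text) {
    const regex = /The Contract end date.*?\./gms;
    return text.match(regex) || [];   // match() returns null when nothing is found
}

// Made-up sample: the second sentence spans a line break
const sampleText =
    "The Contract start date is 2020.\n" +
    "The Contract end date\nis changed to 2021. " +
    "The Contract end date stays 2022.";

console.log(findClauses(sampleText));
```

Note that the match is case-sensitive, so "The contract end date" would not be picked up unless the `i` flag is added.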

Steve-Mcl · 2 May 2022 10:05

so here is a demo with crap test data...

[{"id":"83746a428adcbc17","type":"inject","z":"7eecf5f1d763605a","name":"dummy data","props":[{"p":"payload"},{"p":"topic","vt":"str"}],"repeat":"","crontab":"","once":false,"onceDelay":0.1,"topic":"","payload":" The Contract start date blah blah  asd sdlfkj s;doafk ja;sldkf j asd asdodijapsiodfj aspidj;asld 2022.  The Contract end date blah blah  asd sdlfkj s;doafk ja;sldkf j asd asdodijapsiodfj aspidj;asld 2022.   The contract end date blah blah  asd sdlfkj s;doafk ja;sldkf j asd asdodijapsiodfj aspidj;asld 2022.  The Contract end date on a single line 2021.   The contract end date blah blah 2022.","payloadType":"str","x":998,"y":384,"wires":[["056008d289d9ada1"]]},{"id":"056008d289d9ada1","type":"function","z":"7eecf5f1d763605a","name":"get paragraph \"The Contract end date\"","func":"const regex = /The Contract end date.*?\\./gms;\nlet m;\n\nwhile ((m = regex.exec(msg.payload)) !== null) {\n    // This is necessary to avoid infinite loops with zero-width matches\n    if (m.index === regex.lastIndex) {\n        regex.lastIndex++;\n    }\n\n    // The result can be accessed through the `m`-variable.\n    m.forEach((match, groupIndex) => {\n       node.send({payload: match});\n    });\n}\n","outputs":1,"noerr":0,"initialize":"","finalize":"","libs":[],"x":1070,"y":432,"wires":[["52182669c4a415b7"]]},{"id":"52182669c4a415b7","type":"debug","z":"7eecf5f1d763605a","name":"payload","active":true,"tosidebar":true,"console":false,"tostatus":true,"complete":"payload","targetType":"msg","statusVal":"payload","statusType":"auto","x":1164,"y":480,"wires":[]}]

jbudd · 2 May 2022 10:24

Your data seems not to have any new lines, never mind paragraphs!

This looks like a single clause:

Contract Term
The Contract end date, wherever such reference appears in the Contract, shall be changed from May 31, 2020 to May 31, 2021.

So does this:

Child Support (Applicable to natural persons only; not applicable to corporations, partnerships or LLCs).
Contractor is under no obligation to pay child support or is in good standing blah blah.

The first clause heading (?) has no full stop, the second has one.
That's going to make it harder to extract the clauses you want.
Does your OCR program have the ability to include new line characters?
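
The difference between the two clauses can be demonstrated with a "match up to the first full stop" pattern. This is only a sketch using the two clauses quoted above as test strings; `firstSentenceFrom` is a hypothetical name.

```javascript
// Hypothetical helper: match from `heading` up to and including the
// first full stop that follows it.
function firstSentenceFrom(text, heading) {
    // Escape regex metacharacters so the heading is matched literally
    const escaped = heading.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
    const m = new RegExp(escaped + ".*?\\.", "s").exec(text);
    return m ? m[0] : null;
}

const termClause =
    "Contract Term\n" +
    "The Contract end date, wherever such reference appears in the Contract, " +
    "shall be changed from May 31, 2020 to May 31, 2021.";

const childClause =
    "Child Support (Applicable to natural persons only; not applicable to " +
    "corporations, partnerships or LLCs).\n" +
    "Contractor is under no obligation to pay child support or is in good standing blah blah.";

// The heading has no full stop, so the match runs to the end of the clause
console.log(firstSentenceFrom(termClause, "Contract Term"));

// The heading ends with a full stop, so only the heading is returned
console.log(firstSentenceFrom(childClause, "Child Support"));
```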

milner236 · 2 May 2022 10:55

Thank you very much for the detailed explanation!

system · 16 May 2022 10:55

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.
