Text search in docx and pdf files

alexander77 · 16 December 2024 08:22

Hi,
I am looking for a node that acts like the F3 (find) function in a PDF document. The purpose is to search for keywords in a bunch of technical documents available as PDF or docx.

alexander77 · 16 December 2024 08:25

Sorry, I pressed the return key too early.
I searched the forum and found a hint about the switch node and a post about text search in log files. I tried to find something in the flows section, but maybe I used the wrong keywords; English is not my first language.

Any hints available?

Thanks in advance!

bakman2 · 16 December 2024 08:57

You can install the node-red-contrib-pdf-reader node.

Use a file read node, set it to output buffer, run it through the pdf node and attach a debug node.

TotallyInformation · 16 December 2024 21:30

Both of those file types are binary, not text, so you need something that can either peer inside the binary for embedded text or convert it to something simpler.

docx files are actually largely zipped XML, so you could simply unzip them and search them as text, but you would likely get a lot of hard-to-understand XML markup as well. Otherwise, look for a node.js library capable of parsing a docx and then use its API to search.
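
To illustrate the "unzip and search" idea, here is a rough sketch. In a typical docx the body sits in word/document.xml, with paragraphs in <w:p> elements and the visible text in <w:t> runs. The snippet assumes the file has already been unzipped by some other means and that the XML is passed in as a string. The helper names (decodeEntities, docxXmlToText, findKeywords) and the sample XML are made up for illustration, and the regex approach ignores edge cases that a real docx parser would handle.

```javascript
// Sketch only: assumes `xml` is the already-unzipped content of word/document.xml.
// All helper names here are hypothetical, not part of any library.

function decodeEntities(s) {
  return s
    .replace(/&lt;/g, '<')
    .replace(/&gt;/g, '>')
    .replace(/&quot;/g, '"')
    .replace(/&apos;/g, "'")
    .replace(/&amp;/g, '&'); // decode &amp; last so "&amp;lt;" becomes "&lt;", not "<"
}

// Return one string per paragraph, with the XML markup stripped.
function docxXmlToText(xml) {
  const paragraphs = [];
  // <w:p> or <w:p ...> opens a paragraph; <w:pPr> and similar do not match.
  const paraRe = /<w:p[ >][\s\S]*?<\/w:p>/g;
  // Visible text sits in <w:t> runs; one word may be split across several runs.
  const textRe = /<w:t(?:\s[^>]*)?>([\s\S]*?)<\/w:t>/g;
  for (const para of xml.match(paraRe) || []) {
    let line = '';
    for (const m of para.matchAll(textRe)) line += m[1];
    paragraphs.push(decodeEntities(line));
  }
  return paragraphs;
}

// Case-insensitive keyword lookup, reporting 1-based paragraph numbers.
function findKeywords(paragraphs, keywords) {
  const hits = [];
  paragraphs.forEach((text, index) => {
    const lower = text.toLowerCase();
    for (const kw of keywords) {
      if (lower.includes(kw.toLowerCase())) {
        hits.push({ keyword: kw, paragraph: index + 1, text });
      }
    }
  });
  return hits;
}

// Invented sample resembling the structure of word/document.xml.
const sampleXml =
  '<w:document><w:body>' +
  '<w:p><w:pPr><w:jc w:val="left"/></w:pPr><w:r><w:t>Pump </w:t></w:r>' +
  '<w:r><w:t xml:space="preserve">torque &amp; speed</w:t></w:r></w:p>' +
  '<w:p><w:r><w:t>Maintenance interval</w:t></w:r></w:p>' +
  '</w:body></w:document>';

const paragraphs = docxXmlToText(sampleXml);
const hits = findKeywords(paragraphs, ['torque', 'interval']);
console.log(hits);
```

Joining the runs per paragraph before searching matters because Word often splits a single word across several <w:t> elements, so a search on the raw XML can miss keywords that are plainly visible in the document.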

Alternatively, use something like the Pandoc CLI tool to convert to simpler and more easily parsed text.
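
A minimal sketch of the Pandoc route from node.js, assuming the pandoc executable is installed and on the PATH. As far as I know, Pandoc reads docx but not PDF, so this only covers the docx side. The function names pandocArgs and docxToText and the file name in the usage comment are hypothetical.

```javascript
const { execFile } = require('child_process');

// Build the Pandoc argument list for converting one file to plain text.
// "-t plain" selects the plain-text writer; "--wrap=none" disables hard line
// wrapping so a keyword is not split across an inserted line break.
function pandocArgs(inputPath) {
  return [inputPath, '-t', 'plain', '--wrap=none'];
}

// Run Pandoc and resolve with the converted text, which it writes to stdout.
// Assumption: `pandoc` is available on the PATH of the Node-RED host.
function docxToText(inputPath) {
  return new Promise((resolve, reject) => {
    const options = { maxBuffer: 64 * 1024 * 1024 }; // allow large documents
    execFile('pandoc', pandocArgs(inputPath), options, (err, stdout) => {
      if (err) reject(err);
      else resolve(stdout);
    });
  });
}

// Usage (hypothetical file name):
// docxToText('manual.docx').then((text) => console.log(text.includes('torque')));
```

Using execFile with an argument array instead of a shell command string avoids quoting problems with file names that contain spaces.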

alexander77 · 17 December 2024 09:34

@bakman2: Thank you. I tried it with 3 different PDFs and always get the message 'FieldError: Missing payload data'.
There are more PDF nodes available, so I will try something else.

@TotallyInformation: Very useful information, indeed! I will follow this branch of development!

alexander77 · 17 December 2024 11:03

@bakman2: pdf hummus does a much better job.

TotallyInformation · 17 December 2024 13:17

There appear to be a few node.js libraries for this, such as pdf-parse (on npm).

Should be usable with a function node.
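
Whichever library ends up extracting the text, the F3-style search itself could be a few lines in a function node. A minimal sketch, assuming the extracted text is already available as a plain string; searchText is a hypothetical helper, not part of pdf-parse or Node-RED, and the sample text is invented.

```javascript
// Sketch: report every occurrence of each keyword in already-extracted text,
// with 1-based line and column numbers, similar to stepping through F3 hits.
function searchText(text, keywords) {
  const results = [];
  const lines = text.split(/\r?\n/);
  lines.forEach((line, i) => {
    const lower = line.toLowerCase();
    for (const kw of keywords) {
      const needle = kw.toLowerCase();
      if (needle === '') continue; // an empty keyword would never advance the loop below
      let pos = lower.indexOf(needle);
      while (pos !== -1) {
        results.push({ keyword: kw, line: i + 1, column: pos + 1, context: line.trim() });
        pos = lower.indexOf(needle, pos + needle.length);
      }
    }
  });
  return results;
}

// Invented sample standing in for text extracted from a PDF or docx.
const sample = 'Safety valve V12\nCheck the valve before start.\nNo match here.';
const found = searchText(sample, ['valve', 'start']);
console.log(found);
```

In a function node, the same helper could be applied to msg.payload once an upstream node has put the extracted text there, with the result array assigned back to msg.payload before returning msg.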

