HTML node parser: how to use the selector function to look up a value?

ChillXXL · 8 October 2020 16:14

I am trying to extract a value from a website, using an HTTP (GET) node connected to an HTML node.
I cannot figure out how to extract the value from the page.

The value I want to retrieve is "F56386285397M4Y403". It is at the bottom of the page and is different each time the page is refreshed.

The value is located at the bottom of the HTML:

*<....*
*<script type="text/javascript">*
*// <![CDATA[*
*jQuery(document).ready(function() {liftAjax.lift_successRegisterGC();});*
*var lift_page = "F56386285397M4Y403";*
*// ]]>*
*</script></body>*
*</html>*

I know the path is: "/html/body/script[10]/text()"

How can I point the HTML node to this path and retrieve the value of the variable "lift_page"?
(The HTML node does receive input from the HTTP node.)

TotallyInformation · 8 October 2020 17:02

The problem is that your value is not in the DOM so you cannot use a simple selector to get hold of it.

The simplest approach is to read the HTML page as text and then apply a change node that uses a regular expression to grab the text.

This expression for example, will find the text and store it in a token:

/\*var lift_page = "(.*)"/

You can use a replace option in a change node to just output $1 which will return the code on its own.
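For readers who would rather do this in a function node, the same capture-group idea can be sketched in plain JavaScript. This is only a minimal sketch, not how the change node works internally; `extractLiftPage` is a hypothetical helper name, and the sample text is the snippet from the first post.

```javascript
// Hypothetical helper: pull the value of lift_page out of the page source.
function extractLiftPage(html) {
    // The parentheses form capture group 1; [^"]* stops at the closing quote.
    const match = /var lift_page = "([^"]*)"/.exec(html);
    return match ? match[1] : null; // null when the pattern is not found
}

// Sample input, copied from the snippet in the first post.
const sample = [
    '<script type="text/javascript">',
    '// <![CDATA[',
    'jQuery(document).ready(function() {liftAjax.lift_successRegisterGC();});',
    'var lift_page = "F56386285397M4Y403";',
    '// ]]>',
    '</script></body>',
    '</html>'
].join("\n");

console.log(extractLiftPage(sample)); // F56386285397M4Y403
```

Inside a function node, the last line would become something like `msg.payload = extractLiftPage(msg.payload); return msg;`.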

ChillXXL · 9 October 2020 08:06

Thanks! Now I understand why it didn't work: the text is not in the DOM.

I cannot get the change node to work. The text is output by the HTML node (a string with 132 characters, just like the light grey text in the post above):

and the change node connected to the HTML node is configured like this:

The change node does not replace anything but outputs the complete msg. Can you see what is wrong?
(PS: could a split in a function node also work?)

zenofmud · 9 October 2020 08:15

In your first post you have

*<....*
*<script type="text/javascript">*
*// <![CDATA[*
*jQuery(document).ready(function() {liftAjax.lift_successRegisterGC();});*
*var lift_page = "F56386285397M4Y403";*
*// ]]>*
*</script></body>*
*</html>*

Does each line actually have an asterisk as the first character? If there is no asterisk, then you need to change your change node to reflect that.

ChillXXL · 9 October 2020 08:19

The "*" characters are put in by this Node-RED forum post editor when using the "</>" button. The actual output of the HTML node is now:

zenofmud · 9 October 2020 09:00

Hmmm not when I do it

<....*
<script type="text/javascript">*
// <![CDATA[*
jQuery(document).ready(function() {liftAjax.lift_successRegisterGC();});*
var lift_page = "F56386285397M4Y403";*
// ]]>*
</script></body>*
</html>*

Anyway, in your regex you have /\*var lift_page = "(.*)"/. Not being a regex expert (it makes my brain hurt): should you have the \* at the beginning?

TotallyInformation · 9 October 2020 19:11

Nope! And regex hurts less than JSONata. Anyway, there are lots of useful online regex testers.

The following is enough anyway:

/lift_page = "(.*)"/

And I wouldn't bother with the html extract, just send it the whole page text.

ChillXXL · 10 October 2020 21:13

Still the whole output. Is this setting OK (see highlight)?

Update: replacing with [$] "env variable" is not the right way; it should be a string.

TotallyInformation · 11 October 2020 00:02

You need:

.*lift_page = "(.*)".*

I forgot that you don't need the initial/trailing regex slashes. Also, since you are trying to replace everything except the value, the pattern needs to include everything before and after it as well.

ChillXXL · 11 October 2020 13:44

Thanks for the adjustment. It is a bit better but not yet perfect. It replaces the whole 'var lift_page....' bit with only the code I need. But how can I get rid of all the other parts before and after? The output is now:

// <![CDATA[
jQuery(document).ready(function() {liftAjax.lift_successRegisterGC();});
F6057066220042K3GSG
// ]]>

TotallyInformation · 11 October 2020 15:32

Not in my test it isn't, and I don't know how it could be, since .*lift_page = " selects everything before the value and all of that is thrown away when the replace value is just $1.

The main things that might be improved in that regex are:

It is possible that "(.*)" might actually select too much, since regex is "greedy" by default. That won't happen with the example text you've shown, though.
Selecting for " might fail if the author of the page decides to switch to single quotes instead of double quotes.
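Both points can be illustrated with a small JavaScript sketch. It is one possible hardening, not a tested recommendation for this particular site: the names `liftPagePattern` and `findLiftPage` and the two-string test line are made up for illustration.

```javascript
// Sketch only: accept either quote style and stop at the first closing quote.
// (["']) captures the opening quote, (.*?) is a lazy (non-greedy) capture,
// and \1 requires the same kind of quote to close the string.
const liftPagePattern = /lift_page\s*=\s*(["'])(.*?)\1/;

// Hypothetical helper returning the captured value, or null if nothing matches.
function findLiftPage(text) {
    const m = liftPagePattern.exec(text);
    return m ? m[2] : null;
}

// Greedy vs. lazy on a (made-up) line that holds two quoted strings:
const line = 'var lift_page = "F56386285397M4Y403"; var other = "x";';
const greedy = /lift_page = "(.*)"/.exec(line)[1];

console.log(greedy);             // F56386285397M4Y403"; var other = "x
console.log(findLiftPage(line)); // F56386285397M4Y403
console.log(findLiftPage("var lift_page = 'ABC123';")); // ABC123
```

The greedy version runs on to the last double quote on the line, while the lazy version stops at the first matching quote.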

Best thing to do is to find a regex testing website, paste the HTML source into it and try out the regex.

ChillXXL · 11 October 2020 17:11

Thanks again. I think I know why it isn't working. The text I quoted before (see the post below) was obtained by clicking on the node and copying the string. BUT when I just look at the node (see picture below), little "Carriage Return" symbols are shown (see yellow highlight):

It seems that these carriage returns somehow break the replacement node. I say this because the word "var" in front of "var lift_page" is removed correctly with your supplied regex formula.
See above output of the change node. What do you think?

I cannot get the output of the HTML node any better than this:

In the meantime I have used a function node with a split, and that works correctly:

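// Split on spaces, take element 5, skip its first character and keep the next 18
msg.payload = msg.payload.split(" ")[5].substr(1, 18);
flow.set("endpointid", msg.payload); // store the value in flow context as "endpointid"
return msg;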
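A likely explanation for the partial replacement: in a JavaScript regular expression, `.` does not match line-break characters unless the `s` (dotAll) flag is set, so `.*` stops at the end of each line. The sketch below reproduces this with plain `String.prototype.replace`. Whether the change node applies the pattern in exactly this way is an assumption, and the sample text is taken from the change node output quoted earlier in the thread.

```javascript
// Multi-line sample, joined with line feeds.
const payload = [
    "// <![CDATA[",
    "jQuery(document).ready(function() {liftAjax.lift_successRegisterGC();});",
    'var lift_page = "F6057066220042K3GSG";',
    "// ]]>"
].join("\n");

// "." stops at line breaks, so only the lift_page line is replaced.
const perLine = payload.replace(/.*lift_page = "(.*)".*/, "$1");

// [\s\S] matches any character, line breaks included, so the whole text is replaced.
const wholeText = payload.replace(/[\s\S]*lift_page = "(.*)"[\s\S]*/, "$1");

console.log(perLine.split("\n").length); // 4 (the other lines are still there)
console.log(wholeText);                  // F6057066220042K3GSG
```

Using `[\s\S]*` in place of the outer `.*` is one way to make the pattern span the whole text; it should behave the same whether the line endings are LF or CRLF.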

system · 10 December 2020 17:11

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.
