HTML node parser: how to use the selector function to look up a value?

I am trying to extract a value from a website, using the HTTP (GET) node connected to an HTML node.
I cannot figure out how to extract the value from the page.

The value I want to retrieve is "F56386285397M4Y403". It is at the bottom of the page and will be different each time the page is refreshed.

The value is located at the bottom of the HTML:

*<....*
*<script type="text/javascript">*
*// <![CDATA[*
*jQuery(document).ready(function() {liftAjax.lift_successRegisterGC();});*
*var lift_page = "F56386285397M4Y403";*
*// ]]>*
*</script></body>*
*</html>*

I know the path is: "/html/body/script[10]/text()"

How can I point the HTML node to this path and retrieve the value of the variable "lift_page"?
(I do receive input in the HTML node from the HTTP node)

The problem is that your value is not in the DOM so you cannot use a simple selector to get hold of it.

The simplest approach is to read the HTML page as text and then apply a change node using a regular expression to grab the text.

This expression for example, will find the text and store it in a token:

/\*var lift_page = "(.*)"/

You can use a replace option in a change node to just output $1 which will return the code on its own.

Thanks! Now I understand why it didn't work: the text is not in the DOM.

I cannot get the change node to work. The text is output by the HTML node (a string with 132 characters, just like the light grey text in the post above):

and the change node connected to the HTML node is adjusted like:

The change node does not replace anything; it outputs the complete msg. Can you see what is wrong?
(PS can a split in a function also work?)

In your first post you have

*<....*
*<script type="text/javascript">*
*// <![CDATA[*
*jQuery(document).ready(function() {liftAjax.lift_successRegisterGC();});*
*var lift_page = "F56386285397M4Y403";*
*// ]]>*
*</script></body>*
*</html>*

Does each line actually have an asterisk as the first character? If there is no asterisk then you need to change your change node to reflect that.

The "*" characters are put in by this Node-RED forum post app with the "</>" button. The actual output of the HTML node is now:

Hmmm, not when I do it:

<....*
<script type="text/javascript">*
// <![CDATA[*
jQuery(document).ready(function() {liftAjax.lift_successRegisterGC();});*
var lift_page = "F56386285397M4Y403";*
// ]]>*
</script></body>*
</html>*

Anyway, in your regex you have /\*var lift_page = "(.*)"/. I'm no regex expert (it makes my brain hurt), but should you have the \* at the beginning?

Nope! And regex hurts less than JSONata :smiley: Anyway there are lots of useful online regex testers.

The following is enough anyway:

/lift_page = "(.*)"/

And I wouldn't bother with the html extract, just send it the whole page text.
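For anyone who would rather do this in a function node, the same pattern can be applied with plain JavaScript. This is only a minimal sketch: `pageText` is a made-up stand-in for the page text arriving in `msg.payload`, and `extractLiftPage` is a hypothetical helper name, not something Node-RED provides.

```javascript
// Made-up sample standing in for the page text from the HTTP node.
const pageText = [
    'jQuery(document).ready(function() {liftAjax.lift_successRegisterGC();});',
    'var lift_page = "F56386285397M4Y403";'
].join("\n");

// Hypothetical helper: returns the captured value, or null if the pattern is absent.
function extractLiftPage(text) {
    const match = text.match(/lift_page = "(.*)"/);
    return match ? match[1] : null; // match[1] is the first capture group
}

console.log(extractLiftPage(pageText));
```

With `match` there is no need to consume the text before and after the value, because only the capture group is read back.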

Still the whole output. Is this setting OK (see highlight)?
image

Update: replacing with [$] "env variable" is not the right way; the replacement should be a string.

You need:

.*lift_page = "(.*)".*

I forgot that you don't need the initial/trailing regex slashes. Also, you are trying to replace everything except the value, so the pattern needs to include everything before and after it as well.

Thanks for the adjustment. It is a bit better but not yet perfect. It replaces the whole 'var lift_page...' bit with only the code I need. But how can I get rid of all the other parts before and after? The output is now:

// <![CDATA[
jQuery(document).ready(function() {liftAjax.lift_successRegisterGC();});
F6057066220042K3GSG
// ]]>

Not in my test it isn't, and I don't know how it could be, since .*lift_page = " selects everything before the value, and all of that is thrown away when the replace value is just $1.

The main things that might be improved in that regex are:

  1. It is possible that "(.*)" might actually select too much since regex is "greedy" by default. Won't happen with the example text you've shown though.
  2. Selecting for " might fail if the author of the page decides to switch to default single quotes instead of double.

Best thing to do is to find a regex testing website, paste the HTML source into it and try out the regex.
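As a rough illustration of those two points, here is a sketch of a pattern that is non-greedy (`.*?`) and accepts either quote style by capturing the opening quote and requiring the same one to close (`\1`). The input strings and the `extractValue` name are made up for the example.

```javascript
// (["']) captures the opening quote, (.*?) lazily captures the value,
// and \1 requires the same kind of quote to close it.
const liftPagePattern = /lift_page\s*=\s*(["'])(.*?)\1/;

// Hypothetical helper: returns the value, or null if nothing matches.
function extractValue(text) {
    const match = liftPagePattern.exec(text);
    return match ? match[2] : null;
}

// A second quoted string on the same line would be swallowed by a greedy "(.*)".
console.log(extractValue('var lift_page = "F56386285397M4Y403"; var other = "x";'));
// Single quotes are accepted as well.
console.log(extractValue("var lift_page = 'F56386285397M4Y403';"));
```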

Thanks again. I think I know why it isn't working. The text I quoted before (see this post below) was copied by clicking on the node and copying the string. BUT when I just look at the node (see picture below), little "Carriage Return" symbols are shown (see yellow highlight):
image

It seems that these carriage returns somehow break the replacement node. I say this because the word "var" in front of "var lift_page" is removed correctly with your supplied regex formula.
See above output of the change node. What do you think?
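That observation fits how JavaScript regular expressions behave: without the `s` flag, `.` does not match line terminators such as `\n` or `\r`, so `.*` stops at the end of each line and only the line containing `lift_page` is replaced. The sketch below reproduces this on a made-up multi-line sample and then swaps in `[\s\S]*`, which also matches line breaks. It uses plain `String.prototype.replace`; that the change node applies the pattern in the same way is an assumption here.

```javascript
// Made-up sample with CRLF line endings, similar to the output shown above.
const sample = [
    '// <![CDATA[',
    'jQuery(document).ready(function() {liftAjax.lift_successRegisterGC();});',
    'var lift_page = "F56386285397M4Y403";',
    '// ]]>'
].join("\r\n");

// "." stops at line breaks, so only the lift_page line is replaced.
const perLine = sample.replace(/.*lift_page = "(.*)".*/, "$1");

// [\s\S] matches any character, including line breaks, so the whole text is replaced.
const wholeText = sample.replace(/[\s\S]*lift_page = "(.*)"[\s\S]*/, "$1");

console.log(JSON.stringify(perLine));
console.log(JSON.stringify(wholeText));
```

In a change node the second pattern would be entered without the surrounding slashes, as noted earlier in the thread.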

I can not change the output of the HTML node any better than this:

In the meantime I've used a function node with a split, and that works correctly:

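// Take the 6th space-separated token ("F...";) and keep the 18 characters after the opening quote.
msg.payload = msg.payload.split(" ")[5].slice(1, 19);
flow.set("endpointid", msg.payload); // store the value in flow context as "endpointid"
return msg;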
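One thing to watch with the split approach: it depends on the token position and on a fixed length of 18 characters, while the two values quoted in this thread appear to differ in length (18 and 19 characters). A possible alternative is sketched below, assuming the value is always wrapped in double quotes. `storeEndpointId` and the mock `flow` object are hypothetical and only there so the logic can run outside Node-RED; in a real function node, `msg` and `flow` are supplied by the runtime.

```javascript
// Hypothetical wrapper around what would be the function node body.
function storeEndpointId(msg, flow) {
    // [^"]* captures everything up to the closing quote, whatever its length.
    const match = /lift_page\s*=\s*"([^"]*)"/.exec(String(msg.payload));
    if (!match) {
        return null; // returning null from a function node sends no message on
    }
    msg.payload = match[1];
    flow.set("endpointid", msg.payload); // store the value in flow context
    return msg;
}

// Minimal mock of the flow context, for trying this outside Node-RED.
const store = new Map();
const flow = { set: (key, value) => store.set(key, value), get: (key) => store.get(key) };

const result = storeEndpointId({ payload: 'var lift_page = "F6057066220042K3GSG";' }, flow);
console.log(result.payload, flow.get("endpointid"));
```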
