Question about html node

gregorius · 4 January 2024 17:28

Hi There,

I'm just utilising the html node and have a question about the options:

Screen Shot 2024-01-04 at 18.19.20

with those three options, it's not possible to get the attributes and the content.

For example with the selector being 'a':

<a href="url" target="_blank"><b>some text</b></a>

with the first option "html content", I get <b>some text</b> as payload
second option, payload becomes some text, and
third option: { href: 'url', 'target': '_blank' }

What I would like to have is something like:

{ href: 'url', 
  'target': '_blank', 
  _: "<b>some text</b>" 
}

is that possible? What am I doing wrong?

Thanks for any tips ...

UnborN · 4 January 2024 18:00

What if you use the Html node twice?
Once to get the attributes and once for the content, and
then use a Join node to merge the results into one msg.

Test Flow :

[{"id":"3bb37f38619ad9e9","type":"inject","z":"54efb553244c241f","name":"html","props":[{"p":"payload"},{"p":"topic","vt":"str"}],"repeat":"","crontab":"","once":false,"onceDelay":0.1,"topic":"","payload":"<a href=\"url\" target=\"_blank\"><b>some text</b></a>","payloadType":"str","x":270,"y":3180,"wires":[["b332d974fcb36acd","71c95918952af649"]]},{"id":"b332d974fcb36acd","type":"html","z":"54efb553244c241f","name":"attributes","property":"payload","outproperty":"payload","tag":"a","ret":"attr","as":"multi","x":460,"y":3140,"wires":[["5d8eb3e4bf5506ae"]]},{"id":"6cd4a1f22a03b652","type":"debug","z":"54efb553244c241f","name":"debug 59","active":true,"tosidebar":true,"console":false,"tostatus":false,"complete":"false","statusVal":"","statusType":"auto","x":860,"y":3180,"wires":[]},{"id":"71c95918952af649","type":"html","z":"54efb553244c241f","name":"content","property":"payload","outproperty":"payload","tag":"a","ret":"html","as":"multi","x":460,"y":3220,"wires":[["3a885d7c3413c462"]]},{"id":"b1508985cc7d5986","type":"join","z":"54efb553244c241f","name":"","mode":"custom","build":"object","property":"payload","propertyType":"msg","key":"topic","joiner":"\\n","joinerType":"str","accumulate":false,"timeout":"","count":"2","reduceRight":false,"reduceExp":"","reduceInit":"","reduceInitType":"","reduceFixup":"","x":690,"y":3180,"wires":[["6cd4a1f22a03b652"]]},{"id":"5d8eb3e4bf5506ae","type":"change","z":"54efb553244c241f","name":"","rules":[{"t":"set","p":"topic","pt":"msg","to":"attributes","tot":"str"}],"action":"","property":"","from":"","to":"","reg":false,"x":555,"y":3140,"wires":[["b1508985cc7d5986"]],"l":false},{"id":"3a885d7c3413c462","type":"change","z":"54efb553244c241f","name":"","rules":[{"t":"set","p":"topic","pt":"msg","to":"content","tot":"str"}],"action":"","property":"","from":"","to":"","reg":false,"x":555,"y":3220,"wires":[["b1508985cc7d5986"]],"l":false}]

Result :

gregorius · 4 January 2024 18:16

Sure I was thinking that too .... but that would assume that the underlying Cheerio lib never changes and will always provide the same order ... .oO(what could possibly go wrong?!?)

plus it would mean parsing the same content twice and that isn't exactly efficient.

I was looking at the implementation and before I parse the content twice and merge and hope the order is the same (since there isn't anything else to link the data), I would just roll my own!

Thanks for the idea/approach

dceejay · 4 January 2024 18:18

Happy for anyone wanting to enhance the node with a PR

gregorius · 4 January 2024 18:29

Hm. Then give me an idea how the core team would like to do this?

Looking at the code:

if (node.ret === "html") { pay2 = cheerio.load($(this).html().trim(),null,false).xml(); }
if (node.ret === "text") { pay2 = $(this).text(); }
if (node.ret === "attr") {
    pay2 = Object.assign({},this.attribs);
}
//if (node.ret === "val")  { pay2 = $(this).val(); }

Is val() actually the complete element? I don't know what Cheerio's val() returns ...

Would something like:

if ( node.ret === "complete" ) {
    pay2 = Object.assign({ _: $(this).html() },this.attribs);
}

do the job, or is _ reserved? From memory, XML parsers usually use _ as the reference to the contents:

Screen Shot 2024-01-04 at 19.27.36

Btw: I could also use the XML node to parse the original HTML, but the HTML is broken:

Screen Shot 2024-01-04 at 19.28.46

gregorius · 4 January 2024 19:31

One more question, is this line

if (msg.hasOwnProperty("select")) { tag = node.tag || msg.select; }

wrong and shouldn't it be

if (msg.hasOwnProperty("select")) { tag = msg.select || node.tag; }

since otherwise, if the tag is set on both the node and the msg, the node's value will continue to be used? I thought the msg overrides the values of nodes?

TotallyInformation · 4 January 2024 19:32

No, Cheerio follows jQuery-like syntax but val() only returns the value of an input, select or textarea.

To get slot content (effectively the innerHTML) and attributes at the same time seems to require 2 calls, one to attr() and one to html() (or text()).

dceejay · 4 January 2024 19:52

... worth a try and the _ should be as per how the user configures it - so yes _ by default.
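
A minimal sketch of that merge with a configurable content key, in plain JavaScript. The helper name `combineElement` and its parameters are made up for illustration: `attribs` stands in for the attribute object Cheerio exposes on an element, and `innerHtml` for the string returned by html(). This is not the code from the html node.

```javascript
// Hypothetical helper, not part of the html node or Cheerio.
// attribs:    plain object of attribute name -> value (stand-in for element.attribs)
// innerHtml:  the element's inner HTML as a string (stand-in for $(el).html())
// contentKey: property name that receives the content, "_" by default
function combineElement(attribs, innerHtml, contentKey = "_") {
    // The content is assigned last, so it wins if an attribute
    // happens to have the same name as contentKey.
    return Object.assign({}, attribs, { [contentKey]: innerHtml });
}

const combined = combineElement(
    { href: "url", target: "_blank" },
    "<b>some text</b>"
);
console.log(combined);
// combined carries the properties href, target and _
```

The argument order to Object.assign matters here. With the content first, as in the snippet proposed earlier in the thread, an attribute that happens to be named _ would replace the content; assigning the content last keeps it. Which of the two is preferable is a design choice.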

dceejay · 4 January 2024 19:54

hmm no - generally - by default if the user has set a parameter in the config then the msg should not override it. If they leave it blank (or unset) then yes it can be overridden. So the order is correct as-is
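
The precedence described here can be sketched in a few lines of plain JavaScript. `resolveSelector` is a hypothetical helper written only to illustrate the `node.tag || msg.select` line quoted above; it is not taken from the node's source.

```javascript
// Hypothetical helper mirroring the `node.tag || msg.select` line from the thread.
// configuredTag: selector typed into the node's edit dialog ("" when left blank)
// msg:           incoming message, which may carry its own msg.select
function resolveSelector(configuredTag, msg) {
    let tag = configuredTag;
    if (Object.prototype.hasOwnProperty.call(msg, "select")) {
        // A non-empty configured value is truthy, so it is kept;
        // msg.select is used only when the configured value is blank.
        tag = configuredTag || msg.select;
    }
    return tag;
}

console.log(resolveSelector("a", { select: "div" })); // a
console.log(resolveSelector("", { select: "div" }));  // div
```

Swapping the operands to `msg.select || configuredTag` would give the message priority instead, which is the behaviour the question assumed.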

gregorius · 4 January 2024 21:12

So I created a node to replace the original node - originally I called it html2 but only for purposes of testing my changes.

It does what it's meant to do but before I submit a PR, I just want to check that code is ok:

including the html to attrib hash: here and here
adding the chr (name taken from the xml node) property for defining the attr name
added the option
added chr field here and here
and moved name field to top

The UI changed by moving the name field to the top but also there is a fourth option:

Screen Shot 2024-01-04 at 22.04.07

and the field for the attribute is shown if the fourth option is selected:

Screen Shot 2024-01-04 at 22.01.05

but hidden when not:

Screen Shot 2024-01-04 at 22.01.19

UnborN · 4 January 2024 22:13

Nice work

Since you are in the code, have the experience and are planning a PR,
can you check why setting a new field for the Output doesn't seem to work correctly?

For example i want the output in

doesn't produce the correct result .. it just outputs the original msg.payload

Test html2 flow

[{"id":"3bb37f38619ad9e9","type":"inject","z":"54efb553244c241f","name":"html","props":[{"p":"payload"},{"p":"topic","vt":"str"}],"repeat":"","crontab":"","once":false,"onceDelay":0.1,"topic":"","payload":"true","payloadType":"bool","x":310,"y":3380,"wires":[["80f45e51097c7ded"]]},{"id":"db6cb891535b2186","type":"debug","z":"54efb553244c241f","name":"debug 60","active":true,"tosidebar":true,"console":false,"tostatus":false,"complete":"true","targetType":"full","statusVal":"","statusType":"auto","x":780,"y":3380,"wires":[]},{"id":"30216608caf646c7","type":"html2","z":"54efb553244c241f","name":"","property":"payload","outproperty":"payload.newField","tag":"a","ret":"compl","as":"single","chr":"_","x":610,"y":3380,"wires":[["db6cb891535b2186"]]},{"id":"80f45e51097c7ded","type":"template","z":"54efb553244c241f","name":"","field":"payload","fieldType":"msg","format":"handlebars","syntax":"mustache","template":"<a href=\"url1\" target=\"_blank\"><b>aaaaaa</b></a>\n<a href=\"url2\" target=\"_blank\"><b>bbbbbb</b></a>","output":"str","x":460,"y":3380,"wires":[["30216608caf646c7"]]}]

[EDIT]
It must be because the original payload is a string and not an object, so we cannot tack another key onto it?

gregorius · 4 January 2024 22:52

correct, try using something like msg.snafu.fubar as output property and you get what you want.

The thing doing the setting is this line, and setMessageProperty handles just about any magic you throw at it! But not if the attribute already exists and you expect it to be an object when it's a string.

So I would say it's not a bug but a feature!

TotallyInformation · 4 January 2024 23:54

You have to change msg.payload into an object before you can add a property to it. As @gregorius says, it is because msg.payload already exists as a string and you cannot add a property to a string.
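
A small plain-JavaScript sketch of why the nested output property fails on a string payload but works on a fresh path. `setNestedProperty` is a simplified, hypothetical stand-in written for illustration; it is not the implementation of Node-RED's setMessageProperty and only handles dot-separated paths.

```javascript
// Hypothetical, simplified setter: writes `value` at a dot-separated `path`
// on `msg`, creating missing intermediate objects, and gives up when an
// intermediate value exists but is not an object (for example a string).
function setNestedProperty(msg, path, value) {
    const keys = path.split(".");
    let target = msg;
    for (const key of keys.slice(0, -1)) {
        if (target[key] === undefined) {
            target[key] = {};   // create the missing intermediate object
        } else if (typeof target[key] !== "object" || target[key] === null) {
            return false;       // cannot attach a property beneath a string
        }
        target = target[key];
    }
    target[keys[keys.length - 1]] = value;
    return true;
}

// payload is already a string, so payload.newField cannot be created
const stringMsg = { payload: "<a href=\"url1\">aaaaaa</a>" };
const setOnString = setNestedProperty(stringMsg, "payload.newField", { href: "url1" });

// snafu does not exist yet, so the whole path is created
const freshMsg = { payload: "<a href=\"url1\">aaaaaa</a>" };
const setOnFresh = setNestedProperty(freshMsg, "snafu.fubar", { href: "url1" });

console.log(setOnString, stringMsg.payload);
console.log(setOnFresh, freshMsg.snafu.fubar);
```

In the first case the message is left untouched, which matches the symptom reported above: the node outputs the original msg.payload.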

bakman2 · 5 January 2024 04:36

This is great stuff.

In the "previous" html node (years ago), there was an xpath option as well, which was very useful, because when capturing from the web inspector, one can copy the xpath which could immediately be used/replicated within NR, while the selector copy only gets the class/id and does not traverse into the DOM tree.

gregorius · 5 January 2024 08:48

I could imagine - but I am only guessing - that disappeared because you can create a function node with cheerio and then do your own parsing of html content:

[{"id":"e44d1e4a55be1eee","type":"function","z":"07755b701d94ac63","name":"function 32","func":"var $ = cheerio.load(msg.payload)\n\n$(\"a[title=Sitemap]\").each( (idx,e) => {\n    node.send({\n        ...msg,\n        payload: Object.assign({ _: $(e).html().trim() }, e.attribs)\n    })\n})\n","outputs":1,"timeout":0,"noerr":0,"initialize":"","finalize":"","libs":[{"var":"cheerio","module":"cheerio"}],"x":1254,"y":3237,"wires":[["9f4deae00a9b478f"]]}]

this function node searches for a "Sitemap" link and emits it when found. I believe function nodes couldn't include third party modules for a long time? (It could also be that the html node changed the underlying library from xpath to cheerio - cheerio doesn't really support xpath)

Of course this is a perfect replacement for what I did above, and it is also a reminder that most nodes within Node-RED can/could be replaced by a well-written function node. Hence I would - to a certain extent - argue that the above changes to the html node should not make it into the core, since my opinion is that the core should be as thin/stable/solid as possible, with extra functionality being added via plugins.

Why? Because the core should be as stable as possible, without random failures and constant updating. Plugins are simpler to debug, fix and update than having to patch the core, followed by making a new release and finally having everyone update their installations.

I don't have to update my Linux just because Firefox has a bug - that's how I see Node-RED and its plugins. Linux has the C standard library so it's clear what Linux has to provide as APIs - this isn't so clear for Node-RED since there is no "base" functionality that is standardised. So should the html node offer a fourth option or not? No standard defines a html node and its functionality.

Sorry for the rant, snowy cold morning here

TotallyInformation · 5 January 2024 14:23

No snow here, just cold. Though myself and Mrs K. have been planning an Arctic Circle trip today - we hope to be visiting Finland or Norway soon.

But more on-topic, while I mostly agree with what you've said, I think there would be room for an option that recovers both an element's content and its attributes. I don't think that is unreasonable, and without it this would be complex for a non-coder to achieve. But yes, grabbing content from an HTML page can get arbitrarily complex, needing complex queries which are best served by code.

gregorius · 5 January 2024 15:02

I don't need to, they've come to me!

Ok, so you would be for a fourth option on the existing html node but no xpath specification? I would see that as a good option ...

btw just hit the utf-8 v. iso-8859-1 encoding on CSV data problem ... should the CSV node handle that? No, for there is the iconv node!

TotallyInformation · 5 January 2024 16:07

Honestly, I'm not fussed about an xpath option. Not against it, just not interested in it.

I've done more than my fair share of xpath and I've no intention of ever going back to it!! The world moved on and so did the HTML spec. CSS Selectors are the way forward and that's all I use.

gregorius · 5 January 2024 16:15

xpaths are still a big part of scraping data from websites - I'm just playing around with a data-crawler implementation in Node-RED and xpaths are definitely useful. Perhaps that is also related to this being server-side and not in the browser.

Cheerio is a big exception since it implements the jQuery API on the server side - mind-blowing to be honest!

TotallyInformation · 5 January 2024 16:26

As I say, I'm not against xpath - just had too much of it in the bad-old days when XML ruled the world - before JSON was a thing.

Yes, I even considered it for the uib-html node but jsdom was the better and more comprehensive choice for being able to convert from a JSON schema to HTML.
