HTML Scrape - missing elements

dougle03 · 11 October 2022 10:17

Hi all,
So trying to help someone out on another group with a problem.
I'm using an http request node to fetch a webpage, then passing the result into an html node to select on a div. I get a predictable number of arrays back, but one vital item of content is missing.
The web address is:
https://www.decathlon.it/p/gazebo-campeggio-arpenaz-base-m-6-posti-con-telo-pavimento/_/R-p-157674?fbclid=IwAR2pEGMC_dbWnQ-6NSLsgNMC3UhAE0iRE2Y6jjVCaoaS-boimCOFmapJ2ps

The text we're looking for is this:

However in NR debug it's missing.
The price is coming in ok, just not the availability. Sorry it's in Italian, the person I'm trying to help is... Italian.. lol
Any ideas?

Steve-Mcl · 11 October 2022 11:41

Yes. It will be dynamically populated by a script that queries something else (possibly another endpoint).

See similar post about 1h ago: Dynamic HTML scraping - #2 by Steve-Mcl

TotallyInformation · 11 October 2022 11:46

This is almost certainly due to the web page creating that information dynamically using JavaScript. When you request a page using http-request, you are not running the page's JavaScript so you don't get the data. This is a very common problem when trying to scrape modern pages.
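As a rough illustration of that point, here is a minimal sketch using a made-up page source. The HTML, the class names and the `/hypothetical/stock-endpoint` URL are all assumptions for the example and are not taken from the real site. The `divText` helper is likewise just a stand-in for what the html node's selector would do.

```javascript
// Hypothetical page source, as an http request node would receive it.
// Nothing here is copied from the real site.
const staticHtml = `
<div class="price">49,99 €</div>
<div class="availability"></div>
<script>
  // A browser would run this and fill in the empty div.
  // A plain HTTP fetch only downloads the text and never executes it.
  fetch('/hypothetical/stock-endpoint')
    .then(r => r.json())
    .then(d => { document.querySelector('.availability').textContent = d.label })
</script>`

// Return the text inside the first <div> with exactly this class attribute,
// or null when no such div exists. Illustrative only, not a real HTML parser.
function divText(html, className) {
    const re = new RegExp(`<div class="${className}">([^<]*)</div>`)
    const match = html.match(re)
    return match ? match[1].trim() : null
}

const price = divText(staticHtml, 'price')
const availability = divText(staticHtml, 'availability')

// The price is in the downloaded text; the availability div is present but empty.
console.log({ price, availability })
```

The element is found, so the number of results looks right, but its content is empty because the script that would fill it never ran.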

The 2 ways around it are:

1. Find out what the remote JavaScript does to get the data (possibly a separate API call) and see if you can make that call directly yourself.
2. Use a web test tool that runs a headless browser and use that to get the page, since that will run JS on the page if you want it to.
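For the first option, a minimal sketch of what the follow-up processing might look like once the separate call has been found (for example in the browser's dev tools Network tab). The response shape, the `stock` field and the `readAvailability` name are all hypothetical. The real endpoint and its field names have to be discovered for the site in question.

```javascript
// Hypothetical JSON body from a stock endpoint. The field names are assumptions.
const sampleBody = '{"sku":"example-sku","stock":{"available":true,"label":"Disponibile online"}}'

// Accepts either the raw JSON text or an already parsed object
// (an http request node can be set to return either).
function readAvailability(body) {
    const data = typeof body === 'string' ? JSON.parse(body) : body
    const stock = data && data.stock
    if (!stock) {
        // The expected field is not there: report that rather than guessing.
        return { found: false, available: null, label: null }
    }
    return {
        found: true,
        available: Boolean(stock.available),
        label: stock.label ?? null
    }
}

const result = readAvailability(sampleBody)
console.log(result)
```

In a function node, the same helper could be applied to `msg.payload` after an http request node that calls the discovered endpoint directly.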

Library - Node-RED (nodered.org)

dougle03 · 11 October 2022 11:58

Thanks for responding. I'd come to the conclusion that elements rendered in Chrome were not rendering within the HTTP node request.
I think your suggested ways around it might be a bit beyond my abilities, and those of the person I'm helping.
He'll just have to refresh the page until it's in stock... lol

Thanks again, and it's good to know I wasn't going mad trying to find it.

Whilst I'm on, are there any plans to add a search function to the debug panel? I feel like that would be a useful tool when sorting through large arrays.

paulkeates · 11 October 2022 12:07

Hi,
I don't know about 'search in debug' but there is a feature I find very useful in debug - the ability to 'pin' an item. Once pinned, you rerun the flow that created the debug and the pinned item is now automatically 'opened' as well as has a highlight to help you visually find it amongst the 'clutter'. Look for the small exclamation point when hovering on a row with the mouse pointer.

And for your other question: totally agree with [Steve-Mcl]. It is possible to get that data - either through an (implied) API or by using WDIO (webdriver IO) nodes to completely automate a set of pages. Both can be done if the need is there - both take more time / effort.

Cheers,

Paul

dougle03 · 11 October 2022 13:00

Yep, didn't realise when I posted that it would be dynamic content issues. I do now. Ta

TotallyInformation · 11 October 2022 13:06

There are, of course, tools that will automate the refresh for you in the browser and even tools that will watch for a text or image change on a page and do something.

TotallyInformation · 11 October 2022 13:29

You can filter of course. And if the entries are expanded, you can use your browser's Find on Page to search. You also need to remember that the outputs are restricted by default to 1k of text.

If you need more, a custom logger might be useful to you. You would output the debug data to log and use a custom logger function in settings.js - you could then output the data in a way that lets you have a separate browser tab. I do that so that I can have uibuilder trace logging without having to wade through trace logs for everything.

In settings.js logging property:

        mqttLog: {
            level: 'trace',
            metrics: false,
            audit: false,
            handler: function (settings) {
                const nrLogLevels = {
                    10: 'FATAL', 20: 'ERROR', 30: 'WARN ', 40: 'INFO ', 50: 'DEBUG', 60: 'TRACE', 98: 'AUDIT', 99: 'MTRIC'
                }
                const myCustomLevels = {
                    levels: {
                        'FATAL': 10,
                        'ERROR': 20,
                        'WARN ': 30,
                        'INFO ': 40,
                        'DEBUG': 50,
                        'TRACE': 60,
                        'AUDIT': 98,
                        'MTRIC': 99
                    },
                    colors: {
                        'FATAL': 'redBG',
                        'ERROR': 'red',
                        'WARN ': 'orange',
                        'INFO ': 'yellow',
                        'DEBUG': 'green',
                        'TRACE': 'cyan',
                        'AUDIT': 'grey',
                        'MTRIC': 'grey'
                    }
                }

                const mqtt = require('mqtt')
                const client = mqtt.connect('mqtt://home.knightnet.co.uk')

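                return function (msg) {
                    // msg.msg is not always a string (it can be an Error or other object), so normalise it before matching
                    const text = typeof msg.msg === 'string' ? msg.msg : String(msg.msg)
                    // Publish everything at DEBUG or more severe, plus uibuilder and banner lines at any level
                    if (msg.level < 51 || text.includes('[uibuilder') || text.startsWith('+-') || text.startsWith('| ') || text.startsWith('>>')) {
                        client.publish('nrlog/live', JSON.stringify(msg))
                    }
                }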
            }
        },

So the log output that I'm interested in is sent, a message at a time, to an MQTT topic. I then use a simple flow (just a single uibuilder node) to provide an output web endpoint. In my case, I add an MQTT library directly to the page to listen to the MQTT output, but of course you could also manage that via Node-RED instead.
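On the receiving side, each message arriving on `nrlog/live` is the JSON text published by the handler above. A minimal sketch of turning one such message into a readable line follows. The `formatLogLine` name and the output layout are assumptions for illustration, and it presumes each entry carries `level`, `timestamp` (milliseconds) and `msg` properties.

```javascript
// Same numeric level to label mapping as in the settings.js snippet.
const levelLabels = {
    10: 'FATAL', 20: 'ERROR', 30: 'WARN ', 40: 'INFO ',
    50: 'DEBUG', 60: 'TRACE', 98: 'AUDIT', 99: 'MTRIC'
}

// Turn one JSON log entry (as published to the MQTT topic) into a single text line.
function formatLogLine(json) {
    const entry = JSON.parse(json)
    // Fall back to the raw number if the level is not in the table.
    const label = levelLabels[entry.level] || String(entry.level)
    const time = new Date(entry.timestamp).toISOString()
    // msg may be an object rather than a string, so stringify it if needed.
    const text = typeof entry.msg === 'string' ? entry.msg : JSON.stringify(entry.msg)
    return `${time} [${label}] ${text}`
}

const line = formatLogLine('{"level":40,"timestamp":0,"msg":"hello"}')
console.log(line)
```

Whether this runs in the page or in a function node fed by an mqtt in node is a matter of preference. The formatting logic is the same either way.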

paulkeates · 11 October 2022 13:51

Hi,

Before I forget: the other cool feature in debugging is the ability to open the Debug panel in a new browser. There is a (quite small) icon at the very bottom-right of the Debug panel. Hover text shows: 'Open in new window'. Click it. You can then position that new (browser) window on a separate monitor for when you want to manage your screen real-estate efficiently.

Cheers,

Paul

dougle03 · 18 October 2022 15:26

Thanks for the info.
For sure yes setting up some other logging would be an option. It's only when doing web scraping that I generally would need to search the debug, rather than twirling down endless arrays hoping to find the entry needed - also the reason the ctrl+f is a bit useless. It just struck me as odd that a panel with potentially lots of info in it does not have a simple search function. I'll try the mqtt logging, but it feels like a sledgehammer to crack a low-code nut..
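In the meantime, one lightweight workaround is to search the data before it reaches the debug node. The following is a minimal sketch only: `findPaths` is a hypothetical helper, not a Node-RED feature, and the sample data is made up. It walks nested arrays and objects and reports where a piece of text occurs.

```javascript
// Recursively collect the paths of all leaf values that contain the needle
// (case-insensitive). Paths are written the way the debug panel shows them.
function findPaths(value, needle, path = 'payload', hits = []) {
    if (value !== null && typeof value === 'object') {
        for (const [key, child] of Object.entries(value)) {
            const childPath = Array.isArray(value) ? `${path}[${key}]` : `${path}.${key}`
            findPaths(child, needle, childPath, hits)
        }
    } else if (String(value).toLowerCase().includes(needle.toLowerCase())) {
        hits.push({ path, value })
    }
    return hits
}

// Made-up sample resembling nested arrays from an html node.
const sample = [
    ['49,99 €'],
    ['Disponibile online', 'other text']
]

const hits = findPaths(sample, 'disponibile')
console.log(hits)
```

In a function node this could be used as `msg.payload = findPaths(msg.payload, 'disponibile'); return msg`, so the debug node shows only the matching entries and their paths.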

TotallyInformation · 18 October 2022 20:15

I don't necessarily disagree there. I guess you get used to it and then forget what it can be like to get started.

Did anyone mention the contributed logging node(s), by the way? I've lost track now but there is at least one. The trouble with web-based logging is that it quickly slows down the browser as the amount of data builds up. The ideal would probably be some kind of dedicated log view client separate from Node-RED itself.

dougle03 · 18 October 2022 20:59

No, but I'll have a search tomorrow, see what I can find. Never occurred to me that there might be a dedicated logging node...

Yes, maybe a dedicated logging UI would be useful. There is the option to pop out the existing window, but that does not change the function or features of course.

