Unable to scrape a not fully loaded website

jenssen · 25 July 2024 19:43

Hello,

Is it possible to scrape all the data from a site that is not "fully loaded"?

For example, I want to scrape the website Bleep - Pre-Orders. When you load it in a browser, you can see a big scrollbar on the right: the page is not fully loaded, and it loads more content as you scroll down.

When I try to scrape it, it looks like I only receive the data that is loaded initially, i.e. without scrolling. My flow is below; as you can see in the debug node, not all of the data is loaded.

```
[{"id":"d3a1e077d8c0f541","type":"inject","z":"2fc137b35399c3d4","name":"","props":[{"p":"payload"},{"p":"topic","vt":"str"}],"repeat":"","crontab":"","once":false,"onceDelay":0.1,"topic":"","payload":"","payloadType":"date","x":80,"y":220,"wires":[["df5cda1d001255bd"]]},{"id":"df5cda1d001255bd","type":"http request","z":"2fc137b35399c3d4","name":"","method":"GET","ret":"txt","paytoqs":"ignore","url":"https://bleep.com/stream/pre-order-products","tls":"","persist":false,"proxy":"","insecureHTTPParser":false,"authType":"","senderr":false,"headers":[],"x":230,"y":200,"wires":[["620f5bf0deb1416b"]]},{"id":"620f5bf0deb1416b","type":"html","z":"2fc137b35399c3d4","name":"","property":"payload","outproperty":"payload","tag":"div > div.product-info.music > dl > dd > a","ret":"text","as":"single","x":400,"y":80,"wires":[["ff8ec50abd8668cf"]]},{"id":"ff8ec50abd8668cf","type":"debug","z":"2fc137b35399c3d4","name":"debug 616","active":true,"tosidebar":true,"console":false,"tostatus":false,"complete":"false","statusVal":"","statusType":"auto","x":710,"y":140,"wires":[]}]
```

Steve-Mcl · 25 July 2024 19:49

Hi. Firstly, the flow you have posted is corrupt because it was posted incorrectly.

To make code readable and usable, you need to surround it with three backticks (also known as a left quote or backquote ```).

``` 
   code goes here 
```

You can edit and correct your post by clicking the pencil icon.

See this post for more details - How to share code or flow json

On to your issue. As with most websites, the content is loaded in part by other URLs and JavaScript, i.e. what you are after is probably not returned by the URL you requested but rather comes later via JavaScript/Ajax/other means.

Steve-Mcl · 25 July 2024 19:52

Open your browser's devtools, watch the network tab and refresh the page. Check the responses of all the requests. With luck, the data you are interested in comes via an HTTP request to a different path. Then you can use that URL in an http request node.
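
If the network tab shows the extra items arriving from a paged endpoint, the same requests can be repeated in a loop. The following is only a minimal sketch of that idea, not the site's actual API: `collectAllPages`, `fetchPage`, and the `?page=` URL pattern in the comments are all hypothetical names chosen for illustration, and the real URL and response format have to be read from the network tab.

```javascript
// Sketch: keep requesting pages until one comes back empty.
// `fetchPage` is a hypothetical callback: given a page number, it resolves to
// an array of items (an empty array means there is nothing left to load).
async function collectAllPages(fetchPage, maxPages = 50) {
    const items = [];
    for (let page = 1; page <= maxPages; page++) {
        const batch = await fetchPage(page);
        // Stop as soon as the endpoint returns no more data.
        if (!Array.isArray(batch) || batch.length === 0) break;
        items.push(...batch);
    }
    return items;
}

// Example wiring (commented out; the URL pattern is an assumption,
// use whatever the network tab actually shows):
// const fetchPage = async (n) => {
//     const res = await fetch(`https://example.com/products?page=${n}`);
//     return res.json();
// };
// collectAllPages(fetchPage).then((all) => console.log(all.length));
```

The `maxPages` cap is there so that a misbehaving endpoint that never returns an empty page cannot keep the loop running forever.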

jenssen · 25 July 2024 19:52

Hello, yes, I made a mistake, but I have already fixed the post.

Ok, thanks, will check that.

Steve-Mcl · 25 July 2024 19:56

See the advice in my previous post about finding where the data comes from using the network tab in devtools.

Failing that, search the forum for selenium, puppeteer or playwright. They are like a web browser you can control and use to scrape, but they are slow, heavy, often difficult to get working (or don't work at all) and therefore (imo) a last resort.
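
For the last-resort route, the usual trick with a controlled browser is to scroll until the page stops growing and only then read the elements. The sketch below assumes `page` is a Puppeteer (or Playwright) page object and uses only its `evaluate` method; `scrollUntilStable`, `delayMs`, and `maxScrolls` are hypothetical names, and the fixed delay is a guess that may need tuning for a given site.

```javascript
// Sketch: scroll a controlled-browser page until its height stops growing.
// `page` is assumed to be a Puppeteer or Playwright Page object; only its
// `evaluate` method is used here. Returns the number of scrolls performed.
async function scrollUntilStable(page, { delayMs = 1000, maxScrolls = 30 } = {}) {
    let previousHeight = await page.evaluate(() => document.body.scrollHeight);
    for (let i = 0; i < maxScrolls; i++) {
        // Jump to the bottom so the site's lazy loader fetches the next batch.
        await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
        // Give the page some time to load and render the new items.
        await new Promise((resolve) => setTimeout(resolve, delayMs));
        const height = await page.evaluate(() => document.body.scrollHeight);
        if (height === previousHeight) return i + 1; // nothing new appeared
        previousHeight = height;
    }
    return maxScrolls;
}

// Possible usage with Puppeteer (commented out; requires `npm install puppeteer`):
// const puppeteer = require('puppeteer');
// const browser = await puppeteer.launch();
// const page = await browser.newPage();
// await page.goto('https://bleep.com/stream/pre-order-products');
// await scrollUntilStable(page);
// const titles = await page.$$eval(
//     'div > div.product-info.music > dl > dd > a',
//     (els) => els.map((el) => el.textContent.trim())
// );
// await browser.close();
```

The selector in the usage comment is the one from the html node in the original flow; whether it still matches once the page has been rendered in a real browser is not something this sketch checks.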

system · 23 October 2024 19:57

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.
