Unable to scrape a not fully loaded website

Hello,

Is it possible to scrape all the data from a site that is not "fully loaded"?

For example, I want to scrape the website Bleep - Pre-Orders. When you load it in a browser, you can see a big scrollbar on the right, which shows that the page is not fully loaded. More content loads as you scroll down.

When I try to scrape it, it looks like I only receive the data that is loaded initially, that is, without scrolling. My flow is below; as you can see in the debug node, not all of the data is loaded.

[{"id":"d3a1e077d8c0f541","type":"inject","z":"2fc137b35399c3d4","name":"","props":[{"p":"payload"},{"p":"topic","vt":"str"}],"repeat":"","crontab":"","once":false,"onceDelay":0.1,"topic":"","payload":"","payloadType":"date","x":80,"y":220,"wires":[["df5cda1d001255bd"]]},{"id":"df5cda1d001255bd","type":"http request","z":"2fc137b35399c3d4","name":"","method":"GET","ret":"txt","paytoqs":"ignore","url":"https://bleep.com/stream/pre-order-products","tls":"","persist":false,"proxy":"","insecureHTTPParser":false,"authType":"","senderr":false,"headers":[],"x":230,"y":200,"wires":[["620f5bf0deb1416b"]]},{"id":"620f5bf0deb1416b","type":"html","z":"2fc137b35399c3d4","name":"","property":"payload","outproperty":"payload","tag":"div > div.product-info.music > dl > dd > a","ret":"text","as":"single","x":400,"y":80,"wires":[["ff8ec50abd8668cf"]]},{"id":"ff8ec50abd8668cf","type":"debug","z":"2fc137b35399c3d4","name":"debug 616","active":true,"tosidebar":true,"console":false,"tostatus":false,"complete":"false","statusVal":"","statusType":"auto","x":710,"y":140,"wires":[]}]

Hi. Firstly, the flow you have posted is corrupt because it was posted incorrectly.

In order to make code readable and usable it is necessary to surround your code with three backticks (also known as a left quote or backquote ```)

``` 
   code goes here 
```

You can edit and correct your post by clicking the pencil icon.

See this post for more details - How to share code or flow json


Onto your issue. As with most websites, the content is loaded in part by other URLs and JavaScript. That is, what you are after is probably not at the path you requested but arrives later via JavaScript/Ajax or other means.

Open your browser's devtools, watch the network tab, and refresh the page. Check the responses of all the requests. With luck, the data you are interested in comes via an HTTP request to a different path. You can then use that URL in an http request node.
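As a rough illustration of what to do once the network tab has revealed the underlying request, here is a minimal sketch in plain JavaScript. It assumes the site loads further items from an endpoint that takes a page number and returns a JSON array. The endpoint URL, the `page` parameter name, and the helper names `buildPageUrl` and `collectAllPages` are all hypothetical; substitute whatever the network tab actually shows.

```javascript
// Hypothetical endpoint: replace with the request URL seen in the network tab.
const BASE_URL = "https://example.com/api/products";

// Build the URL for one page of results.
// The "page" query parameter is an assumption, not something the site is known to use.
function buildPageUrl(baseUrl, page) {
  const url = new URL(baseUrl);
  url.searchParams.set("page", String(page));
  return url.toString();
}

// Request page 1, 2, 3, ... until a page comes back empty (or maxPages is hit).
// fetchPage is any async function that takes a page number and resolves to an array.
async function collectAllPages(fetchPage, maxPages = 50) {
  const items = [];
  for (let page = 1; page <= maxPages; page++) {
    const batch = await fetchPage(page);
    if (!Array.isArray(batch) || batch.length === 0) {
      break; // no more data, stop asking
    }
    items.push(...batch);
  }
  return items;
}

// Possible usage with the global fetch available in Node.js 18+ (not run here):
// const all = await collectAllPages(async (page) => {
//   const res = await fetch(buildPageUrl(BASE_URL, page));
//   return res.json();
// });
```

In a flow, the same idea would probably be a loop around the http request node, with the URL for each pass supplied in `msg.url`.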

Hello, yes, I made a mistake, but I have already fixed the post.

Ok, thanks, will check that.

See the previous post's advice about finding where the data comes from using the network tab in devtools.

Failing that, search the forum for selenium, puppeteer, or playwright. They are like web browsers that you can control and use to scrape - but they are slow, heavy, and often difficult to get working (or don't work at all), and are therefore (imo) a last resort.
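If you do end up going the browser-automation route, the core of it for a page like this is usually a "scroll until nothing new loads" loop. The following is only a sketch: it assumes `page` is a Puppeteer or Playwright `Page` object (both provide `page.evaluate()` to run a function inside the browser), and the helper name `scrollUntilStable` and its default limits are made up for illustration.

```javascript
// Scroll to the bottom repeatedly until the document height stops growing.
// Returns the number of scrolls that were performed.
async function scrollUntilStable(page, maxRounds = 20, pauseMs = 1000) {
  let previousHeight = 0;
  for (let round = 0; round < maxRounds; round++) {
    // Read the current height of the document inside the browser.
    const height = await page.evaluate(() => document.body.scrollHeight);
    if (height === previousHeight) {
      return round; // nothing was added since the last scroll
    }
    previousHeight = height;
    // Jump to the bottom to trigger the next lazy-load request.
    await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
    // Give the site some time to fetch and render the new items.
    await new Promise((resolve) => setTimeout(resolve, pauseMs));
  }
  return maxRounds; // gave up: the page may still be growing
}
```

After the loop finishes you would read the rendered HTML from the page and parse it as before. The fixed pause is the crude part; a slow site may need a longer `pauseMs`.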
