Http-request node does not get all the page data (AKA How to scrape dynamic data from a web page)

TotallyInformation · 17 October 2022 18:24

This question comes up regularly in the forum.

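It usually starts with someone asking "how do I get XYZ data from this web page?", and it turns out that the page creates the required data dynamically using JavaScript. The http-request node only fetches the raw HTML that the server sends; it does not run the page's JavaScript, so the data never appears in its output.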

There are two ways to resolve this issue. Which one you use depends on how the page works. The first option is probably only open to you if you can make sense of the page's code in your browser (view it by right-clicking on the page and selecting "Inspect", or by opening the browser's developer tools and going to the "Elements" tab).

1. Find an API call on the page that returns the data you really want.
2. Use a "headless browser" to run the page's JavaScript and scrape the resulting data.

1) Find an API call on the page that returns the data you really want

To do this, search through the web page's code or look at the "Network" tab in the browser dev tools. Once you have found the call, check whether you can call the API yourself (it is just another web endpoint) or whether it is protected by security that is too hard to work out.

The advantage of this is that you are likely to get exactly the data you want in a form easily consumed in Node-RED (e.g. JSON or XML) and you won't have to pull the data out of some HTML.
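As a rough illustration of option 1, the sketch below prepares a message for the core http request node once you have found the endpoint in the "Network" tab. The URL, the query string, the bearer token and the `API_TOKEN` environment variable are all hypothetical placeholders, not something taken from a real site; whether a token is needed at all depends on the page you are looking at.

```javascript
// Hypothetical endpoint copied from the browser's Network tab; replace with the real one.
const API_URL = "https://example.com/api/v1/readings";

/**
 * Fill in the properties that the http request node reads from the message.
 * msg.url is used when the node's URL field is left empty, and msg.method
 * when its method is set to "- set by msg.method -".
 */
function buildApiRequest(msg, token) {
    msg.method = "GET";
    msg.url = API_URL + "?limit=10"; // "limit" is an assumed query parameter
    msg.headers = { Accept: "application/json" };
    if (token) {
        // Only needed if the API expects a bearer token (assumption)
        msg.headers.Authorization = "Bearer " + token;
    }
    return msg;
}

// Inside a function node, the body could then be just:
//   return buildApiRequest(msg, env.get("API_TOKEN"));
```

Wire the function node into an http request node with its return type set to "a parsed JSON object" and the data should arrive in msg.payload as a normal JavaScript object, with no HTML parsing needed.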

2) Use a "headless browser" to run the page's JavaScript and scrape the resulting data.

There are a number of existing nodes that you may be able to use. Search for nodes that use nbrowser or puppeteer, for example.

You could even run your own browser "headless" if you are running Node-RED on a server that has a Chromium- or Firefox-based browser installed. On Windows, for example, the following command line will grab a processed page using the Chromium-based Microsoft Edge browser:

"C:\Program Files (x86)\Microsoft\Edge\Application\msedge.exe" --headless --disable-gpu --enable-logging --dump-dom https://nodered.org
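The same command can be run from Node-RED with the core exec node, or from plain Node.js. Below is a minimal standalone Node.js sketch of the latter, reusing only the flags from the command line above. The helper names `buildHeadlessArgs` and `dumpDom`, the 30 second timeout and the 16 MB output limit are my own assumptions, so adjust them to suit.

```javascript
const { execFile } = require("node:child_process");

// Build the argument list used in the command line above.
function buildHeadlessArgs(url) {
    return ["--headless", "--disable-gpu", "--enable-logging", "--dump-dom", url];
}

// Run the browser headless and resolve with the processed HTML it prints to stdout.
function dumpDom(browserPath, url) {
    return new Promise((resolve, reject) => {
        execFile(
            browserPath,
            buildHeadlessArgs(url),
            // Rendered pages can be large, so raise the default output limit (assumed values)
            { maxBuffer: 16 * 1024 * 1024, timeout: 30000 },
            (err, stdout) => (err ? reject(err) : resolve(stdout))
        );
    });
}

// Usage (the path is the Edge location from the command line above):
//   dumpDom("C:\\Program Files (x86)\\Microsoft\\Edge\\Application\\msedge.exe",
//           "https://nodered.org").then((html) => console.log(html.length));
```

The resulting HTML is what the browser sees after the page's scripts have run, so it can then be passed to the html node or any other parser to pull out the values you need.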

Jaxom_99 · 9 November 2022 09:33

Thank you @TotallyInformation for this concise and clear explanation. It is exactly what I was looking for in this forum. Could you point me to more resources on using headless browsers?

I have a website showing a solar production graph where the "hidden" API is hard to use (JWT auth through JavaScript, multiple requests, etc.), so I would like to use option 2 as you describe it, but I don't know where to start. I did find the data as a JSON object from within my browser, though, so I guess it should be doable.

Thanks in advance to the community for any pointers.
(PS: depending on the answers, I could split this into a new thread...)

TotallyInformation · 9 November 2022 15:36

Start here: Library - Node-RED (nodered.org)

