Http-request node does not get all the page data (AKA How to scrape dynamic data from a web page)

This question comes up regularly in the forum.

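It usually starts with someone asking "how do I get XYZ data from this web page?", and it turns out that the page creates the required data dynamically using JavaScript. The http-request node only fetches the page source as the server sends it and does not run that JavaScript, so the data never appears in the response.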

There are two ways to resolve this issue. Which one you use depends on how the page works. The first option is probably only open to you if you can make sense of the page's code in your browser (view it by right-clicking on the page and selecting "Inspect", or by opening the browser's developer tools and going to the "Elements" tab).

  1. Find an API call on the page that returns the data you really want
  2. Use a "headless browser" to run the page's JavaScript and scrape the resulting data.

1) Find an API call on the page that returns the data you really want

To do this, search through the web page's code or watch the "Network" tab in the browser's developer tools while the page loads. Once you have found the request that returns the data, check whether you can call that API (which is just another web endpoint) yourself, or whether it is protected by security that is too hard to work out.

The advantage of this is that you are likely to get exactly the data you want in a form easily consumed in Node-RED (e.g. JSON or XML) and you won't have to pull the data out of some HTML.
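As a rough illustration of option 1, here is a minimal sketch of a function node that prepares a message for an http-request node. The endpoint URL, the `day` parameter and the helper name `buildApiRequest` are hypothetical placeholders, not taken from any real site. It also assumes the http-request node has its method set to "- set by msg.method -" and its return type set to "a parsed JSON object".

```javascript
// Sketch only: prepares msg for a following http-request node.
// ASSUMPTION: apiUrl and params are placeholders for whatever request
// you found in the browser's Network tab.
function buildApiRequest(msg, apiUrl, params) {
    // Encode each query parameter so special characters survive the trip.
    const query = Object.entries(params || {})
        .map(([key, value]) => `${encodeURIComponent(key)}=${encodeURIComponent(value)}`)
        .join("&");

    msg.method = "GET";
    msg.url = query ? `${apiUrl}?${query}` : apiUrl;
    // Ask the endpoint for JSON; some APIs ignore this header.
    msg.headers = { Accept: "application/json" };
    return msg;
}

// In a function node you might end with something like:
// return buildApiRequest(msg, "https://example.com/api/data", { day: "2024-01-01" });
```

If the endpoint needs cookies or tokens, you would have to copy those from the browser request as well, which is where this approach can become impractical.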

2) Use a "headless browser" to run the page's JavaScript and scrape the resulting data.

There are a number of existing nodes that you may be able to use. Search for nodes that use nbrowser or puppeteer for example.

You could even run your own browser "headless" if you are running Node-RED on a server that has a Chromium or Firefox based browser installed. On Windows, for example, the following command line will grab a processed page using the Microsoft Edge Chromium-based browser:

"C:\Program Files (x86)\Microsoft\Edge\Application\msedge.exe" --headless --disable-gpu --enable-logging --dump-dom https://nodered.org
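The same command can be driven from Node.js. The following is a minimal sketch, assuming a Chromium-based browser is installed at the path you pass in; the helper names `dumpDomArgs` and `dumpDom`, the timeout and the buffer size are illustrative choices, not part of any Node-RED API.

```javascript
const { execFile } = require("child_process");

// Build the same argument list as the command line above.
function dumpDomArgs(url) {
    return ["--headless", "--disable-gpu", "--enable-logging", "--dump-dom", url];
}

// Run the browser headless and resolve with the rendered HTML from stdout.
// ASSUMPTION: browserPath points at an installed Chromium-based browser.
function dumpDom(browserPath, url, timeoutMs = 30000) {
    return new Promise((resolve, reject) => {
        execFile(
            browserPath,
            dumpDomArgs(url),
            // Rendered pages can be large, so allow more than the default buffer.
            { timeout: timeoutMs, maxBuffer: 16 * 1024 * 1024 },
            (err, stdout) => (err ? reject(err) : resolve(stdout))
        );
    });
}

// Example (not run here):
// dumpDom("C:\\Program Files (x86)\\Microsoft\\Edge\\Application\\msedge.exe", "https://nodered.org")
//     .then((html) => console.log(html.length));
```

Inside Node-RED, the core exec node should be able to run the same command without any code, with the resulting HTML then passed to an html node to pick out the elements you need.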

Thank you @TotallyInformation for this concise and clear explanation. It is exactly what I was looking for in this forum. Could you point me to more resources on using headless browsers?

I have a website showing a solar production graph where the "hidden" API is hard to use (JWT auth through JavaScript, multiple requests, etc.), so I would like to use option 2 as you describe, but I don't know where to start. I did find the data as a JSON object from within my browser though, so I guess it should be doable :smiley_cat:

Thanks in advance to the community for any pointers :wink:
(PS: depending on the answers, I could split this off into a new thread.)

Start here: Library - Node-RED (nodered.org)
