How to deconstruct web pages with http-request

This is the baseline of my code:

[{"id":"dcd80a56f41caa81","type":"http request","z":"65c9b63cb09879a0","name":"","method":"GET","ret":"txt","paytoqs":"ignore","url":"","tls":"","persist":false,"proxy":"","insecureHTTPParser":false,"authType":"","senderr":false,"headers":[{"keyType":"Accept","keyValue":"","valueType":"text/plain","valueValue":""},{"keyType":"User-Agent","keyValue":"","valueType":"other","valueValue":"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0"}],"x":5160,"y":370,"wires":[["3af232784383f3b8"]]},{"id":"89f681ae64bd6135","type":"inject","z":"65c9b63cb09879a0","name":"Node-Red","props":[{"p":"url","v":"https://discourse.nodered.org/","vt":"str"},{"p":"topic","vt":"str"}],"repeat":"","crontab":"","once":false,"onceDelay":0.1,"topic":"","x":4990,"y":370,"wires":[["dcd80a56f41caa81"]]},{"id":"3af232784383f3b8","type":"debug","z":"65c9b63cb09879a0","name":"debug 10","active":true,"tosidebar":true,"console":false,"tostatus":false,"complete":"true","targetType":"full","statusVal":"","statusType":"auto","x":5320,"y":370,"wires":[]}]

Not rocket science.

It works but it also doesn't - for me.

This is what I'm given:

{"_msgid":"953d597ecca0db9b","payload":"<!DOCTYPE html>\n<html lang=\"en-GB\">\n<head>\n  <meta charset=\"utf-8\">\n  <title>Page Not Found - Node-RED Forum</title>\n  <meta name=\"description\" content=\"\">\n  <meta name=\"generator\" content=\"Discourse 3.5.0.beta6-dev - https://github.com/discourse/discourse version 09a457f2ab3e0f3c0e78eb105dcbe0f98afcf85b\">\n<link rel=\"icon\" type=\"image/png\" href=\"https://us1.discourse-cdn.com/flex026/uploads/nodered/optimized/1X/598c4a26af3e3272e341a28c9f4adb5c75c8f5dc_2_32x32.png\">\n<link rel=\"apple-touch-icon\" type=\"image/png\" href=\"https://us1.discourse-cdn.com/flex026/uploads/nodered/optimized/1X/d073cd938eafa2e558d7c2cd59003b3ef4963033_2_180x180.png\">\n<meta name=\"theme-color\" media=\"all\" content=\"#fff\">\n\n<meta name=\"color-scheme\" content=\"light\">\n\n<meta name=\"viewport\" content=\"width=device-width, initial-scale=1.0, minimum-scale=1.0, viewport-fit=cover\">\n<link rel=\"canonical\" href=\"https://discourse.nodered.org/404\" />\n\n<link rel=\"search\" type=\"application/opensearchdescription+xml\" href=\"https://d...","topic":"","statusCode":404,"headers":{"server":"nginx","date":"Thu, 29 May 2025 11:09:36 GMT","content-type":"text/html; charset=utf-8","transfer-encoding":"chunked","vary":"Accept-Encoding, Accept","content-security-policy":"upgrade-insecure-requests; base-uri 'self'; object-src 'none'; script-src 'nonce-5fJZWcHpglQsgqKwPJETa2GTf' 'strict-dynamic'; frame-ancestors 'self'; manifest-src 'self'","x-xss-protection":"0","x-content-type-options":"nosniff","x-permitted-cross-domain-policies":"none","referrer-policy":"strict-origin-when-cross-origin","cross-origin-opener-policy":"same-origin-allow-popups","x-request-id":"87c71362-4510-464f-ad5e-d6ada482d008","cdck-proxy-id":"app-router-tieadvanced03.sea2, 
app-balancer-tieinterceptor1b.sea2","strict-transport-security":"max-age=31536000","x-node-red-request-node":"4917474a"},"responseUrl":"https://discourse.nodered.org/","redirectList":[],"retry":0}

But that seems to be the bigger page.
Which is alright.

But I want to learn .... how/where I get the stuff needed to then list the posts/thread.

It is really confusing for me as I am really flying blind, but I'm just curious with something and I don't really even know what it is. Just curiosity - I think.

Maybe I should ask here, but I'm at a loss where I could better ask.
And I am using NR.

(Oh, to help I hope)
NR v 4.0.9
node v20.19.2
Ubuntu 22.04

Firstly, for whatever reason, your response is a 404 (not found)

Forgetting that for one moment, even if you did get the full Node-RED Discourse page, it would likely NOT contain any entries.

What many sites (including Discourse) do is use JavaScript to call APIs (other URLs) that return the data in a convenient format such as JSON. The JavaScript then renders those values into the page.
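A quick way to tell which of the two you received (the HTML page or the JSON data) is to check the status code and content type that the http request node puts on the message. Here is a minimal sketch for a function node, assuming the message carries the `statusCode` and `headers` properties seen in the debug output in this thread. `classifyResponse` is a made-up helper name, not part of Node-RED.

```javascript
// Hypothetical helper: decide what kind of response the http request node returned.
// Assumes msg.statusCode and msg.headers["content-type"] are present on the message.
function classifyResponse(msg) {
    const status = msg.statusCode;
    const type = String((msg.headers || {})["content-type"] || "").toLowerCase();

    // Anything outside 2xx (for example the 404 above) is treated as an error.
    if (typeof status !== "number" || status < 200 || status >= 300) {
        return "error " + status;
    }
    if (type.startsWith("application/json")) {
        return "json";
    }
    if (type.startsWith("text/html")) {
        return "html";
    }
    return "other";
}

// In a function node you might then write:
// msg.kind = classifyResponse(msg);
// return msg;
```

A switch node on `msg.kind` could then route JSON replies to further processing and send everything else to a debug node.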


On static web pages, where you get ALL of the HTML (with the tables and rows pre-populated), you would use the HTML nodes (and carefully crafted selectors) to break it down.

Yes, ok. You are right. I am sure it didn't always give me an error 404.

So if I am wanting to parse discourse - say node-red.......

I'm trying to get my head around the DEBUG in the browser.
(Whole new world)

Could you help me get my head around the mechanics of how it works for NR?
What URL do I need to get the discourse page then when I get that...... How do I find the magic stuff that will then give me the thread list?

I may need to change the parsing from text to JSON.... :person_shrugging:

Sorry and thanks.

It isn't that easy. Because Discourse uses modern authentication, you need valid tokens in order to be able to access it. The browser generally hides that stuff from you but if you want to go in via a flow, you will have to manage it all.

Have a look at the network tab in dev tools - you will want to reload the page when you do. You will see HUNDREDS of entries. Each one is a connection from your browser tab to the server.

Click on the main entry which I think is shown just as a number. You should then see the headers. In there you will find a session cookie, some specific Discourse headers and a bunch of security headers.

Then check out the Application tab. You will see that Discourse uses local storage and cookies.

You might have joy using the headers as-is in a Node-RED request and that might work temporarily. But once your session expires, it will then fail.
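If you want to try that from a flow, a function node in front of the http request node can set the headers. This is only a minimal sketch: `buildRequest` is an illustrative name, and the cookie is whatever value you copy from dev tools yourself. The http request node reads request headers from `msg.headers`, and takes the URL from `msg.url` when its own URL field is left blank.

```javascript
// Hypothetical helper for a function node placed before the http request node.
// "cookie" is a string you copied from the browser; nothing here generates one.
function buildRequest(msg, cookie) {
    // URL taken from the later flow in this thread.
    msg.url = "https://discourse.nodered.org/latest";
    msg.headers = {
        "Accept": "application/json"
    };
    if (cookie) {
        // Copied session cookie: expect it to stop working once the session expires.
        msg.headers["Cookie"] = cookie;
    }
    return msg;
}

// In a function node you might then write:
// return buildRequest(msg, "paste-your-cookie-string-here");
```

Leaving the cookie out sends an anonymous request, which is worth trying first since the thread list is visible without logging in.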

Yeah, ok. That is a whole other level of stuff I hadn't considered.

But for the sake of trying, I am logging into discourse.nodered.org from a private browser and I still get to see the thread list.

I'll stop posting in THIS reply as the rest of it isn't to you and I don't want to pollute this reply.

If you've not really used http-request before - you might want to choose a simpler site to learn on. :slight_smile:

Steve,

thanks for the picture, but I seem to be missing something.

I was looking in the wrong place though already.

Network. Thanks.
But I can't get a preview, and I am not seeing the latest.json message.

(Just saw a reply. I'll post this and read it now.)

Yeah, thanks.

I seem to throw myself at difficult things.

I just checked and I don't get a latest.json file at all when looking at the Node-RED forum home page (which defaults to the Latest view).

It is a bit different when I use OPERA with the .... debug/tools window open.

But again, I am not seeing the latest.json (sorry hard to read the pic while replying)
I get a LOT of stuff scrolling past me.
If I click on just about any entry on the left, I do get some ....

Hang on!

Found it, but it is named only latest, not latest.json. (My mistake.)

Ok, I'm getting there.

I'll post this and keep digging to see where I get.

Worth also trying Firefox. I don't normally use it but it can give a somewhat different perspective in the dev tools. Worth trying anyway.

That's what I use, but when I tried, I didn't see the latest one.....
Ok, but maybe I wasn't looking for it with a close match.
But all I saw was ......

FF:

You can see I have NETWORK ticked.
I reloaded the page.

But OPERA....

Whole different enchilada.

Maybe I should ask in the FF group as to why/HOW there is such a difference between the tools and what they show.

But I'm not sure that will help me or not.

That's what I was saying. I now see that you only get latest.json if you were on a different view and you clicked on "Latest" with the network tab open.

You have different filters set. In FF, you are filtered on html, in Opera you have no filter.

Thanks.

Mea culpa.

(Dumb question - but I'm sure you know me enough by now...) :wink:

How do I change the settings?

Oh, late note: I said OPERA as the other browser..... I meant CHROME.
But.....

(I think it would be obvious if you looked at the top of the screen shot where it says CHROME and not OPERA)

Those are the pre-set filters, you click on the one(s) you want.

Ok, when I press the ALL button, better.

But I'm not seeing a preview option.

which is stumping me, as in Chrome, I see the page on the right side.

Here, I can't find that option.

AH!

Is it called response?

In Chrome, there is preview and response.
To me - indulge me - they look like the same thing, but in different layouts.

One is how I see it on the web page and one is the code behind the web page.

But in FF, there is only the response option and that only shows me the web side.
Ok, fair enough.
The code isn't too important - though I have been wrong before.

It is interesting that in Chrome, I am seeing the code and I would think that I would/should get that information returned to me.

But when I run FF and use NR to do the request I don't seem to be getting it.

I'll keep digging.

Progress - of sorts

But not really.

So this is where I'm at.
(Using Chrome.)

So in this mode I am at discourse.nodered.org and I can see my thread/topic.
this one.
Sorry not marked.
I think it is the second one from the top on the right side.

But it's there.

Now if I look at what NR gives me - even running (to call it) via Chrome

So I've sent the request in.... got a reply.
(I'll post it in a second.)

All things being equal, it should have the data I showed in the previous piccie - yes?

It doesn't.
Copied the entire message.

{"_msgid":"91c6bf933f5f09bb","url":"https://discourse.nodered.org/","topic":"","statusCode":404,"headers":{"server":"nginx","date":"Thu, 29 May 2025 12:44:58 GMT","content-type":"text/html; charset=utf-8","transfer-encoding":"chunked","vary":"Accept-Encoding, Accept","content-security-policy":"upgrade-insecure-requests; base-uri 'self'; object-src 'none'; script-src 'nonce-Gq2p73DX2XRNJfINxuN99WPyX' 'strict-dynamic'; frame-ancestors 'self'; manifest-src 'self'","x-xss-protection":"0","x-content-type-options":"nosniff","x-permitted-cross-domain-policies":"none","referrer-policy":"strict-origin-when-cross-origin","cross-origin-opener-policy":"same-origin-allow-popups","x-request-id":"2d89d71a-07d9-4683-9e5b-7b151dd1b453","cdck-proxy-id":"app-router-tieadvanced03.sea2, app-balancer-tieinterceptor1b.sea2","strict-transport-security":"max-age=31536000","x-node-red-request-node":"147a9164"},"responseUrl":"https://discourse.nodered.org/","payload":"<!DOCTYPE html>\n<html lang=\"en-GB\">\n<head>\n  <meta charset=\"utf-8\">\n  <title>Page Not Found - Node-RED Forum</title>\n  <meta name=\"description\" content=\"\">\n  <meta name=\"generator\" content=\"Discourse 3.5.0.beta6-dev - https://github.com/discourse/discourse version 09a457f2ab3e0f3c0e78eb105dcbe0f98afcf85b\">\n<link rel=\"icon\" type=\"image/png\" href=\"https://us1.discourse-cdn.com/flex026/uploads/nodered/optimized/1X/598c4a26af3e3272e341a28c9f4adb5c75c8f5dc_2_32x32.png\">\n<link rel=\"apple-touch-icon\" type=\"image/png\" href=\"https://us1.discourse-cdn.com/flex026/uploads/nodered/optimized/1X/d073cd938eafa2e558d7c2cd59003b3ef4963033_2_180x180.png\">\n<meta name=\"theme-color\" media=\"all\" content=\"#fff\">\n\n<meta name=\"color-scheme\" content=\"light\">\n\n<meta name=\"viewport\" content=\"width=device-width, initial-scale=1.0, minimum-scale=1.0, viewport-fit=cover\">\n<link rel=\"canonical\" href=\"https://discourse.nodered.org/404\" />\n\n<link rel=\"search\" 
type=\"application/opensearchdescription+xml\" href=\"https://d...","redirectList":[],"retry":0}

So, there is a disconnect - for me - to what I'm being given.

Chrome shows me (clearly) that the data is there. My thread.
But when I ask via NR, it just gives the upper layer of stuff.

So is it that I am not asking for the full page in the flow, or ...... ?????

This may be a simple thing over which I am tripping.

Thanks in advance.

Oh.....

It is getting the infamous 404 error.
Just noticed that.
:confused:
It's 22:55. I've been up since 06:30.
I think it is time for bed.

Did a slight tweak to the flow.

[{"id":"dcd80a56f41caa81","type":"http request","z":"65c9b63cb09879a0","name":"","method":"GET","ret":"txt","paytoqs":"ignore","url":"https://discourse.nodered.org/latest","tls":"","persist":false,"proxy":"","insecureHTTPParser":false,"authType":"","senderr":false,"headers":[{"keyType":"Accept","keyValue":"","valueType":"text/plain","valueValue":""},{"keyType":"User-Agent","keyValue":"","valueType":"other","valueValue":"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0"},{"keyType":"Accept","keyValue":"","valueType":"application/json","valueValue":""}],"x":5160,"y":370,"wires":[["3af232784383f3b8"]]},{"id":"89f681ae64bd6135","type":"inject","z":"65c9b63cb09879a0","name":"Node-Red","props":[{"p":"topic","vt":"str"},{"p":"payload"}],"repeat":"","crontab":"","once":false,"onceDelay":0.1,"topic":"","payload":"foo","payloadType":"str","x":4990,"y":370,"wires":[["dcd80a56f41caa81"]]},{"id":"3af232784383f3b8","type":"debug","z":"65c9b63cb09879a0","name":"debug 10","active":true,"tosidebar":true,"console":false,"tostatus":false,"complete":"true","targetType":"full","statusVal":"","statusType":"auto","x":5320,"y":370,"wires":[]}]

Accept JSON too.

Changed reply, but still not there.

{"_msgid":"30c9cf56b3d9ed59","topic":"","payload":"{\"users\":[{\"id\":2,\"username\":\"knolleary\",\"name\":\"Nick O'Leary\",\"avatar_template\":\"/user_avatar/discourse.nodered.org/knolleary/{size}/3_2.png\",\"admin\":true,\"moderator\":true,\"trust_level\":4},{\"id\":4905,\"username\":\"Christian-Me\",\"name\":\"Chris\",\"avatar_template\":\"/user_avatar/discourse.nodered.org/christian-me/{size}/10774_2.png\",\"trust_level\":2},{\"id\":9113,\"username\":\"euromem\",\"name\":\"Euromem\",\"avatar_template\":\"https://avatars.discourse-cdn.com/v4/letter/e/d9b06d/{size}.png\",\"trust_level\":0},{\"id\":2367,\"username\":\"mbonani\",\"name\":\"Mauricio Bonani\",\"avatar_template\":\"/user_avatar/discourse.nodered.org/mbonani/{size}/10631_2.png\",\"trust_level\":2},{\"id\":4220,\"username\":\"dclear\",\"name\":\"\",\"avatar_template\":\"/user_avatar/discourse.nodered.org/dclear/{size}/9107_2.png\",\"trust_level\":2},{\"id\":58,\"username\":\"Paul-Reed\",\"name\":\"Paul\",\"avatar_template\":\"/user_avatar/discourse.nodered.org/paul-reed/{size}/66906_2.png\",\"moderator\":true,\"trust_level\":3},{\"id\":-1,\"username\":\"system\",\"name\":\"system\",\"...","statusCode":200,"headers":{"server":"nginx","date":"Thu, 29 May 2025 13:02:01 GMT","content-type":"application/json; charset=utf-8","transfer-encoding":"chunked","vary":"Accept-Encoding, Accept","x-frame-options":"SAMEORIGIN","x-xss-protection":"0","x-content-type-options":"nosniff","x-permitted-cross-domain-policies":"none","referrer-policy":"strict-origin-when-cross-origin","x-discourse-route":"list/latest","cache-control":"no-cache, no-store","x-discourse-cached":"skip","x-request-id":"8f76f6fb-5b97-4476-bd09-e72dd3f3ee01","cdck-proxy-id":"app-router-tieadvanced03.sea2, app-balancer-tieinterceptor1b.sea2","strict-transport-security":"max-age=31536000","x-node-red-request-node":"0531d0ea"},"responseUrl":"https://discourse.nodered.org/latest","redirectList":[],"retry":0}

Looks like progress.
But I'm not sure it is in the direction of help for me.
(I'm off now.)

In FF, the page's source code is found in "Debugger", which is similar to "Sources" in Chromium-based browsers.

You requested the latest.json resource file. So set the http-request output to "Parsed JSON":
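With that setting, `msg.payload` arrives as an object rather than a string, so a function node can pick fields straight out of it. Here is a minimal sketch. The `users` array and its `username` field appear in the debug output above, but `topic_list.topics` and `title` are assumptions about the part of the response that was cut off, so check them against your own debug output. `summarise` is just an illustrative name.

```javascript
// Hypothetical helper: summarise a parsed /latest response.
function summarise(payload) {
    // "users" is visible in the truncated debug output in this thread.
    const users = Array.isArray(payload.users) ? payload.users : [];

    // ASSUMPTION: the thread list lives at payload.topic_list.topics,
    // with each entry carrying a "title". Verify before relying on it.
    const topicList = payload.topic_list || {};
    const topics = Array.isArray(topicList.topics) ? topicList.topics : [];

    return {
        usernames: users.map(function (u) { return u.username; }),
        titles: topics.map(function (t) { return t.title; })
    };
}

// In a function node you might then write:
// msg.payload = summarise(msg.payload);
// return msg;
```

If the assumed property is missing, `titles` simply comes back as an empty array, which is a hint to expand the payload in the debug sidebar and look for the right name.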