Fetching number from web page

I'm not new to node-red, but new to web scraping, and need some help.

I'd like to fetch the Hit Counter number in the footer of the following web page:
www.katharina-eppenberger.ch

Here is the web developer info of that section of the page:

As a flow I did the following:
inject -> http request -> html -> debug

In the http request node I inserted the URL for the web page.
In the html node I used the found class ".footerHolder" in the Selector field.
As output I selected 'only the text part'.
This worked nicely. As output in the debug node I got the following string: "Hit counter #"

Instead of the number, I got a hash sign.
The web developer page from my browser clearly shows a number.
The node-red flow produces a hash sign.

So my browser evidently gets something different back than the http request node does.

Does someone have any hints how I can get at the real number with my web scraping attempt?

Kind regards,
Urs.

[{"id":"3b418d942099c879","type":"tab","label":"KE Web","disabled":false,"info":"","env":[]},{"id":"6ad1dcfd41d6c086","type":"comment","z":"3b418d942099c879","name":"Fetch hit counter from katharina-eppenberger.ch","info":"","x":360,"y":40,"wires":[]},{"id":"912e78ff26768e82","type":"inject","z":"3b418d942099c879","name":"","props":[{"p":"payload"},{"p":"topic","vt":"str"}],"repeat":"","crontab":"","once":false,"onceDelay":0.1,"topic":"","payload":"","payloadType":"date","x":120,"y":100,"wires":[["87694a90cd180284"]]},{"id":"87694a90cd180284","type":"http request","z":"3b418d942099c879","name":"","method":"GET","ret":"txt","paytoqs":"ignore","url":"https://katharina-eppenberger.ch/","tls":"","persist":false,"proxy":"","insecureHTTPParser":false,"authType":"","senderr":false,"headers":[],"x":290,"y":100,"wires":[["e5632512d48c79ce"]]},{"id":"e5632512d48c79ce","type":"html","z":"3b418d942099c879","name":"","property":"payload","outproperty":"payload","tag":".footerHolder","ret":"text","as":"multi","chr":"_","x":470,"y":100,"wires":[["c39bda80eb6dbfd9"]]},{"id":"c39bda80eb6dbfd9","type":"debug","z":"3b418d942099c879","name":"debug 3","active":true,"tosidebar":true,"console":false,"tostatus":false,"complete":"payload","targetType":"msg","statusVal":"","statusType":"auto","x":640,"y":100,"wires":[]}]

Not all things on a web page come from HTML.

Many things are populated by JavaScript after the main HTML is loaded.

Typically, on a page with dynamic values (stock quantity, temperature, etc.), the value is retrieved through a separate API call to some resource, and JS then populates the HTML/DOM at run time.

Therefore, requesting the page itself is not enough. You need to look at the network requests the page makes and see which one brings back the value you want. Once you know its URL, you might be able to pull that in (depending on security and how the data is returned).

curl 'https://katharina-eppenberger.ch/index.php' \
  --compressed \
  -X POST \
  -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:146.0) Gecko/20100101 Firefox/146.0' \
  -H 'Accept: */*' \
  -H 'Accept-Language: en-GB,en;q=0.5' \
  -H 'Accept-Encoding: gzip, deflate, br, zstd' \
  -H 'Content-Type: application/x-www-form-urlencoded; charset=UTF-8' \
  -H 'X-Requested-With: XMLHttpRequest' \
  -H 'Origin: https://katharina-eppenberger.ch' \
  -H 'Sec-GPC: 1' \
  -H 'Connection: keep-alive' \
  -H 'Referer: https://katharina-eppenberger.ch/index.php' \
  -H 'Cookie: PHPSESSID=c38ebe0b3afa91cf53c4a5d06e77ef0d' \
  -H 'Sec-Fetch-Dest: empty' \
  -H 'Sec-Fetch-Mode: cors' \
  -H 'Sec-Fetch-Site: same-origin' \
  -H 'TE: trailers' \
  --data-raw 'supermode=stats&page=/index.php&referrer='

Apparently that's the request, copied as cURL from the browser's network panel.

Good stuff.

To help the OP (and future readers)...

As can be seen, the HTML is indeed sans number!

So, if the OP/user were to further inspect the page source, they might see...

        <script type="text/javascript">
            $(function() {
                var dataText = "supermode=stats&page=/&referrer=";
                $.ajax({
                    type: "POST",
                    url: "index.php",
                    data: dataText,
                    success: function(data) {
                        $("#footerHitCounter").text("Hit counter #".replace(/#/, data)).addClass("show");
                    }
                });
            });
        </script>

This suggests they need to do an http request (POST) to index.php with the data value supermode=stats&page=/&referrer= (and probably a cookie).
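For future readers, here is a minimal sketch of what that page script boils down to, in plain JavaScript without jQuery. The function names buildStatsBody and renderCounter are made up for illustration; only the body string and the "Hit counter #" template come from the page source above.

```javascript
// Build the form-encoded body that the page script posts to index.php.
// Plain concatenation mirrors the literal string in the page source,
// which sends "/" unescaped.
function buildStatsBody(page, referrer) {
    return "supermode=stats&page=" + page + "&referrer=" + referrer;
}

// Mirror of the page's success handler: the static HTML only contains
// the "#" placeholder, and the response text is swapped in afterwards.
function renderCounter(template, data) {
    return template.replace(/#/, String(data).trim());
}
```

This is why the html node only ever sees "Hit counter #": the replacement happens in the browser, after the HTML has been delivered.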

And, as expected, you can see that network request in the browser's dev tools.

Here is how you get what @gregorius got...

[screenshot]


Many thanks for the hints. I marked it as the solution because it obviously is the solution.
I'll still have to figure out how to do it in my node-red flow. So for me it is not solved yet. But this is the fun part of it anyway and that's what cold winter nights have been designed for.


Perhaps my showing that I know how to open a web developer window gave you the impression that I understand more than I do in reality.

From what I gathered above I did the following:
Inject node: msg.payload = supermode=stats&page=/&referrer=
http request node:
Method: POST
URL: "https://katharina-eppenberger.ch/index.php"
Headers: Cookie PHPSESSID=c38ebe0b3afa91cf53c4a5d06e77ef0d

My guess/hope was that the injected payload would then be sent to the URL with the index.php script. But I do not get anything sensible back, just part of an HTML page.

Please suppose the dumbest version of me and help me with one or two more hints.
Many thanks and kind regards,
Urs.
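One thing that might be worth trying here, sketched as the body of a function node placed between the inject and http request nodes. It assumes (nothing in this thread confirms it) that the attempt above returned HTML because the string payload went out without the form Content-Type and X-Requested-With headers that the browser sends. The wrapper name prepareStatsRequest is hypothetical and only there so the snippet runs outside Node-RED; inside a function node, just the body is needed.

```javascript
// Hypothetical wrapper; in a function node only the body is needed.
function prepareStatsRequest(msg) {
    // Same form-encoded string that the page's own script posts.
    msg.payload = "supermode=stats&page=/&referrer=";
    // The http request node uses msg.headers as the request headers.
    msg.headers = {
        "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
        "X-Requested-With": "XMLHttpRequest"
    };
    return msg;
}
```

The http request node itself would stay on Method POST with the index.php URL, as described above.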

Good question, how to copy a request into Node-RED. I used the "Copy as fetch" option and got this JS code:

await fetch("https://katharina-eppenberger.ch/index.php", {
    "credentials": "include",
    "headers": {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:146.0) Gecko/20100101 Firefox/146.0",
        "Accept": "*/*",
        "Accept-Language": "en-GB,en;q=0.5",
        "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
        "X-Requested-With": "XMLHttpRequest",
        "Sec-GPC": "1",
        "Sec-Fetch-Dest": "empty",
        "Sec-Fetch-Mode": "cors",
        "Sec-Fetch-Site": "same-origin"
    },
    "referrer": "https://katharina-eppenberger.ch/",
    "body": "supermode=stats&page=/&referrer=https://katharina-eppenberger.ch/",
    "method": "POST",
    "mode": "cors"
});

Strangely, the fetch(...) function isn't natively supported by the function node, so I had to add the node-fetch library to get the function node working with that bit of code:

[{"id":"77e620a38b02f1a5","type":"function","z":"abef864c7a219350","name":"function 2","func":"var d = await fetch(\"https://katharina-eppenberger.ch/index.php\", {\n    \"credentials\": \"include\",\n    \"headers\": {\n        \"User-Agent\": \"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:146.0) Gecko/20100101 Firefox/146.0\",\n        \"Accept\": \"*/*\",\n        \"Accept-Language\": \"en-GB,en;q=0.5\",\n        \"Content-Type\": \"application/x-www-form-urlencoded; charset=UTF-8\",\n        \"X-Requested-With\": \"XMLHttpRequest\",\n        \"Sec-GPC\": \"1\",\n        \"Sec-Fetch-Dest\": \"empty\",\n        \"Sec-Fetch-Mode\": \"cors\",\n        \"Sec-Fetch-Site\": \"same-origin\"\n    },\n    \"referrer\": \"https://katharina-eppenberger.ch/\",\n    \"body\": \"supermode=stats&page=/&referrer=https://katharina-eppenberger.ch/\",\n    \"method\": \"POST\",\n    \"mode\": \"cors\"\n});\nconsole.log(d)\nmsg.payload = await d.text()\n\nreturn msg;","outputs":1,"timeout":0,"noerr":0,"initialize":"","finalize":"","libs":[{"var":"fetch","module":"node-fetch"}],"x":483,"y":492,"wires":[["b24e60dddafd86d1"]]},{"id":"90017ef0b46fd3c4","type":"inject","z":"abef864c7a219350","name":"","props":[{"p":"payload"},{"p":"topic","vt":"str"}],"repeat":"","crontab":"","once":false,"onceDelay":0.1,"topic":"","payload":"","payloadType":"date","x":252,"y":609,"wires":[["77e620a38b02f1a5"]]},{"id":"b24e60dddafd86d1","type":"debug","z":"abef864c7a219350","name":"debug 43","active":true,"tosidebar":true,"console":false,"tostatus":false,"complete":"false","statusVal":"","statusType":"auto","x":732,"y":437,"wires":[]}]

But that's basically now doing the request you require.

P.S.

You'll have to edit the function node, I left a console.log in there by mistake!

P.P.S. fetch started out as frontend JS code (i.e. a browser API). Node.js has shipped a built-in fetch since version 18, but it evidently isn't available inside the function node, hence node-fetch. Hm.


Hello Gerrit
Many thanks. I stumbled over the node-fetch thing.
I imported your flow.json and did not change anything (except for deleting the console.log line).
The function node complains that 'fetch' is not defined.
Then I did what I usually do, try to install missing stuff with the palette manager. node-fetch is not installable via the palette manager.
Then I successfully installed it via the shell 'npm install node-fetch', rebooted, still got the same error. I'm running it on a Raspberry Pi, standard installation via the script provided. Do I need to be in a specific directory for the npm command?

Then I had a closer look at a screenshot in your post. It shows a configuration part 'Modules' where you seem to be able to import Modules. I can't do that in my version of node-red (v4.1.3)

I'm stuck again, slightly further down the road, but still stuck. This is getting embarrassing. But I need another hint again.

Kind regards,
Urs.

Yes you can. That has been a feature for a long time.

Encouraged by the clear statement of Colin 'Yes you can', I read the documentation of the function node.
I found the relevant section: By setting functionExternalModules to true in your settings.js file ...
This is not yet enabled in my settings, probably because it is not the default.
Small steps, I know.
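For anyone else landing here, the entry in settings.js looks roughly like this. It is a fragment only: the rest of the file is omitted, and the path in the comment assumes a default install. Node-RED needs a restart after the change.

```javascript
// ~/.node-red/settings.js (fragment; location assumes a default install)
module.exports = {
    // ...other settings unchanged...

    // Allow function nodes to load the npm modules listed on their Setup tab.
    functionExternalModules: true,
};
```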

Ah yes, you are correct. I had forgotten that you need to set that.

I was way over my head with this problem. It fetches the correct number now.

How to put it into influxdb and make a grafana chart, that part I know already.
Thanks @Steve, @Gerrit and @Colin for guiding me along here.
Kind regards,
Urs.


Crazy, I didn't know that you had to activate that, otherwise I would have said something. I guess I activated it once (about 3 years ago) and then copied my original settings file across to all the other instances I use :)

Ideally, that option (adding libraries in the function node) shouldn't be shown when it can't be used, or there should at least be a help message about enabling it in the settings.js file.

It is also possible with just http request nodes, but you need to install the gzip node (node-red-contrib-gzip) because the response is compressed.

In short: perform a GET request, which sets the PHP session cookie (keep the connection alive), read the cookie value, set the headers, perform the POST, and unzip the response.

Keep in mind that the hit counter will increase when you perform these requests, so the readings might be inflated.

[
    {
        "id": "d98914d687afb2cd",
        "type": "http request",
        "z": "78fecb3942131a17",
        "name": "",
        "method": "GET",
        "ret": "txt",
        "paytoqs": "ignore",
        "url": "https://katharina-eppenberger.ch/index.php",
        "tls": "",
        "persist": true,
        "proxy": "",
        "insecureHTTPParser": false,
        "authType": "",
        "senderr": false,
        "headers": [
            {
                "keyType": "other",
                "keyValue": "",
                "valueType": "other",
                "valueValue": ""
            }
        ],
        "x": 350,
        "y": 140,
        "wires": [
            [
                "d9d6923755b5b6ac"
            ]
        ]
    },
    {
        "id": "4b4f3d0b748ca8d8",
        "type": "inject",
        "z": "78fecb3942131a17",
        "name": "",
        "props": [
            {
                "p": "payload"
            }
        ],
        "repeat": "",
        "crontab": "",
        "once": false,
        "onceDelay": 0.1,
        "topic": "",
        "payload": "supermode=stats&page=/index.php&referrer=",
        "payloadType": "str",
        "x": 190,
        "y": 140,
        "wires": [
            [
                "d98914d687afb2cd"
            ]
        ]
    },
    {
        "id": "924d11de457a1d21",
        "type": "debug",
        "z": "78fecb3942131a17",
        "name": "debug 11",
        "active": true,
        "tosidebar": true,
        "console": false,
        "tostatus": false,
        "complete": "true",
        "targetType": "full",
        "statusVal": "",
        "statusType": "auto",
        "x": 960,
        "y": 140,
        "wires": []
    },
    {
        "id": "fe8f54806292d11f",
        "type": "http request",
        "z": "78fecb3942131a17",
        "name": "",
        "method": "POST",
        "ret": "bin",
        "paytoqs": "ignore",
        "url": "https://katharina-eppenberger.ch/index.php",
        "tls": "",
        "persist": false,
        "proxy": "",
        "insecureHTTPParser": false,
        "authType": "",
        "senderr": false,
        "headers": [],
        "x": 670,
        "y": 140,
        "wires": [
            [
                "0a2ccce3d738964c"
            ]
        ]
    },
    {
        "id": "d9d6923755b5b6ac",
        "type": "function",
        "z": "78fecb3942131a17",
        "name": "headers",
        "func": "msg.payload = \"supermode=stats&page=/index.php&referrer=https://katharina-eppenberger.ch\"\nmsg.headers = {};\n\n\n\nmsg.headers['User-Agent'] = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:146.0) Gecko/20100101 Firefox/146.0'\nmsg.headers['Accept'] = '*/*'\nmsg.headers['Accept-Language'] = 'en-GB,en;q=0.5'\nmsg.headers['Accept-Encoding'] = 'gzip, deflate, br, zstd'\nmsg.headers['Content-Type'] = 'application/x-www-form-urlencoded; charset=UTF-8'\nmsg.headers['X-Requested-With'] = 'XMLHttpRequest'\nmsg.headers['Origin'] = 'https://katharina-eppenberger.ch'\nmsg.headers['Sec-GPC'] = '1'\nmsg.headers['Connection'] = 'keep-alive'\nmsg.headers['Referer'] = 'https://katharina-eppenberger.ch/index.php'\nmsg.headers['Sec-Fetch-Dest'] = 'empty'\nmsg.headers['Sec-Fetch-Mode'] = 'cors'\nmsg.headers['Sec-Fetch-Site'] = 'same-origin'\nmsg.headers['TE'] = 'trailers'\nmsg.headers['Cookie'] = msg.responseCookies.PHPSESSID.value;\n\nreturn msg\n\n",
        "outputs": 1,
        "timeout": 0,
        "noerr": 0,
        "initialize": "",
        "finalize": "",
        "libs": [],
        "x": 520,
        "y": 140,
        "wires": [
            [
                "fe8f54806292d11f"
            ]
        ]
    },
    {
        "id": "0a2ccce3d738964c",
        "type": "gzip",
        "z": "78fecb3942131a17",
        "name": "",
        "x": 810,
        "y": 140,
        "wires": [
            [
                "924d11de457a1d21"
            ]
        ]
    },
    {
        "id": "1231b1ac49a9032a",
        "type": "global-config",
        "env": [],
        "modules": {
            "node-red-contrib-gzip": "0.0.3"
        }
    }
]
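The cookie and unzip steps of the sequence described above can also be sketched in plain Node.js. This is only an illustration: the helper names are hypothetical, the network calls are left out, and the gzip magic-byte check is my own addition rather than something the gzip node is known to do.

```javascript
const zlib = require("node:zlib");

// Hypothetical helper: pull the PHPSESSID value out of a Set-Cookie header
// such as "PHPSESSID=abc123; path=/". Returns null if it is not present.
function extractSessionId(setCookieHeader) {
    const match = /(?:^|;\s*)PHPSESSID=([^;]+)/.exec(setCookieHeader || "");
    return match ? match[1] : null;
}

// A Cookie request header carries name=value pairs, not the bare value.
function buildCookieHeader(sessionId) {
    return "PHPSESSID=" + sessionId;
}

// Decompress the body only if it starts with the gzip magic bytes (1f 8b);
// otherwise treat it as plain text.
function decodeBody(buffer) {
    const isGzip = buffer.length > 2 && buffer[0] === 0x1f && buffer[1] === 0x8b;
    return (isGzip ? zlib.gunzipSync(buffer) : buffer).toString("utf8");
}
```

In the flow, these correspond to reading msg.responseCookies after the GET, setting msg.headers before the POST, and the gzip node after it.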

For reference, a new instance does not need to set this flag: functionExternalModules has been true by default since NR 2.0.0. I suspect that in this case the user either set it to false at some point or (more likely) is using a settings file from the NR V1.3.x days, when functionExternalModules was false by default.


That is exactly the case. The node-red-update-script refuses to replace my old settings.js file with the new v2.x one because I seem to have made changes to it (which I had completely forgotten about). I decided to ignore that error message since everything worked fine after each upgrade.

I took this opportunity to finally fix this and use the settings.js file from NR V2.x

I discovered that the hoster of my choice (infomaniak.com in Switzerland) uses a hit counter with some built-in cleverness. It seems to ignore repeated requests from the same IP, or something like that. As a consequence, despite my tons of trials (that is the way I learn node-red), the hit counter did not increase.