Parsing html data

alexander77 · 5 January 2020 21:04

Hi,
this is the first time I've tried to read data from a web page. I can read the data (about 7700 characters), and when I look through it manually I find the value I'm looking for somewhere in the middle:
· <a href=results.php?userid=xxxxxxx&offset=0&show_names=0&state=2&appid=>Überprüfung ausstehend (459)
The variable part, in this case 459, is what I want to extract from all of that. Unfortunately I have no idea how to parse and extract it.
I'd be happy for any help!

E1cid · 5 January 2020 22:56

To get you started.

Match() https://www.google.com/url?sa=t&source=web&rct=j&url=https://www.w3schools.com/jsref/jsref_match.asp&ved=2ahUKEwiN3Zf_xu3mAhWPh1wKHfUXASQQFjAAegQIAxAB&usg=AOvVaw2YJp4ROddwl7vIVk3_Jd1v

and for regex https://stackoverflow.com/questions/13802334/js-regex-to-find-href-of-several-a-tags
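
As a minimal sketch of the match() approach, assuming the label text and the "(459)" format are exactly as in the first post. The helper name extractPendingCount and the sample string are hypothetical and only for illustration:

```javascript
// Hypothetical helper: pulls the number in parentheses that follows the label.
function extractPendingCount(html) {
    // \( and \) match literal parentheses; (\d+) captures the digits between them.
    const m = html.match(/Überprüfung ausstehend \((\d+)\)/);
    // match() returns null when the pattern is not found.
    return m ? Number(m[1]) : null;
}

// Sample taken from the first post.
const sample = '<a href=results.php?userid=xxxxxxx&offset=0&show_names=0&state=2&appid=>Überprüfung ausstehend (459)';
console.log(extractPendingCount(sample)); // 459
```

In a function node the same two lines would operate on msg.payload, with the result assigned back to msg.payload before returning msg.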

cymplecy · 6 January 2020 07:11

If the string "Überprüfung ausstehend (" is unique within the 7700 characters, you could use this method to avoid writing a function node or using a regex:

[{"id":"79354ee5.cb5df","type":"inject","z":"35be93aa.c0667c","name":"","topic":"","payload":"userid=xxxxxxx&offset=0&show_names=0&state=2&appid=>Überprüfung ausstehend (459) foo bar ","payloadType":"str","repeat":"","crontab":"","once":false,"onceDelay":0.1,"x":670,"y":100,"wires":[["eea2ec68.a8565"]]},{"id":"eea2ec68.a8565","type":"split","z":"35be93aa.c0667c","name":"","splt":"Überprüfung ausstehend (","spltType":"str","arraySplt":"1","arraySpltType":"len","stream":false,"addname":"","x":790,"y":100,"wires":[["f33f628c.167d3"]]},{"id":"49d28a89.dee4d4","type":"debug","z":"35be93aa.c0667c","name":"","active":true,"tosidebar":true,"console":false,"tostatus":false,"complete":"true","targetType":"full","x":1270,"y":100,"wires":[]},{"id":"f33f628c.167d3","type":"switch","z":"35be93aa.c0667c","name":"","property":"parts.index","propertyType":"msg","rules":[{"t":"eq","v":"1","vt":"num"}],"checkall":"true","repair":false,"outputs":1,"x":910,"y":100,"wires":[["564b6f92.d9063"]]},{"id":"564b6f92.d9063","type":"split","z":"35be93aa.c0667c","name":"","splt":")","spltType":"str","arraySplt":"1","arraySpltType":"len","stream":false,"addname":"","x":1030,"y":100,"wires":[["7e78ca37.387f54"]]},{"id":"7e78ca37.387f54","type":"switch","z":"35be93aa.c0667c","name":"","property":"parts.index","propertyType":"msg","rules":[{"t":"eq","v":"0","vt":"num"}],"checkall":"true","repair":false,"outputs":1,"x":1150,"y":100,"wires":[["49d28a89.dee4d4"]]}]
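
For comparison, the same two-step split can be sketched in plain JavaScript: split on the label and keep the part after it, then split on ")" and keep the part before it. This mirrors the two split/switch pairs in the flow above; the function name extractBySplit is hypothetical, and the sample string is the inject payload from the flow:

```javascript
// Hypothetical helper mirroring the split -> switch -> split -> switch chain.
function extractBySplit(text) {
    // Step 1: everything after the label (the flow keeps parts.index 1).
    const afterLabel = text.split("Überprüfung ausstehend (")[1];
    if (afterLabel === undefined) {
        return null; // label not present in the text
    }
    // Step 2: everything before the first ")" (the flow keeps parts.index 0).
    return afterLabel.split(")")[0];
}

const payload = "userid=xxxxxxx&offset=0&show_names=0&state=2&appid=>Überprüfung ausstehend (459) foo bar ";
console.log(extractBySplit(payload)); // "459"
```

Note that the result is a string; wrap it in Number() if a numeric value is needed.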

alexander77 · 6 January 2020 16:58

Thanks for both replies. The sample flow works standalone but not with the real data; I need to figure out why.
I'm currently reading the RegExp chapter on W3Schools.

alexander77 · 6 January 2020 20:16

It looks like the HTTP request node does not pass on all of the data it receives.
Debug on the output of the HTTP request looks like this:
payload: string

    <html lang="en">
    <head>

    <meta name="viewport" content="width=device-width, initial-scale=1">
<title>Log in</title>

    <meta charset="utf-8">
    <link type="text/css" rel="stylesheet" href="https://setiathome.berkeley.edu//bootstrap.min.css" media="all">

        <link rel=stylesheet type="text/css" href="https://setiathome.berkeley.edu/sah_custom_dark.css">
    <link rel="icon" type="image/x-icon" href="https://setiathome.berkeley.edu/images/logo7.ico"/>

    <link rel=alternate type="application/rss+xml" title="RSS 2.0" href="https://setiathome.berkeley.edu/rss_main.php">
    </head>
<body >

<div class="navbar-...
statusCode: 200
headers: object
    date: "Mon, 06 Jan 2020 19:45:43 GMT"
    server: "Apache/2.2.15 (Scientific Linux)"
    x-powered-by: "PHP/5.3.3"
    expires: "Mon, 26 Jul 1997 05:00:00 UTC"
    last-modified: "Mon, 06 Jan 2020 19:45:43 UTC"
    cache-control: "no-cache, must-revalidate, post-check=0, pre-check=0"
    pragma: "no-cache"
    content-length: "7792"
    connection: "close"
    content-type: "text/html; charset=utf-8"
    x-node-red-request-node: "217924cc"

When I route this to a function node containing

var str = msg.payload;
var n = str.search(/ausstehend</i);
msg.payload = n;
return msg;

I get -1.
When I change the search string to something closer to the beginning, like /container-fluid/,
the result is 886.
This supports my assumption that the payload is shortened somewhere.
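
One way to test that assumption is sketched below. The debug sidebar shortens long strings for display, so the "..." in the debug output does not by itself show that the payload was cut; comparing the payload size with the content-length header is a more direct check. The pattern /ausstehend</ also requires a "<" directly after the word, whereas the line quoted in the first post continues with " (459)", so a -1 would be expected even on complete data. The <title>Log in</title> in the output may additionally mean that the response is a login page rather than the results page. The function name diagnose and the sample string are hypothetical:

```javascript
// Hypothetical diagnostic helper; in a function node, pass msg.payload and
// msg.headers["content-length"] as the two arguments.
function diagnose(payload, contentLengthHeader) {
    return {
        chars: payload.length,                        // UTF-16 code units in the string
        bytes: Buffer.byteLength(payload, "utf8"),    // size when encoded as UTF-8
        declaredBytes: Number(contentLengthHeader),   // size announced by the server
        withAngleBracket: payload.search(/ausstehend</i),  // pattern used in the post
        withParenthesis: payload.search(/ausstehend \(/i), // pattern matching "ausstehend ("
        looksLikeLoginPage: /<title>Log in<\/title>/i.test(payload)
    };
}

// Sample line from the first post, not the real 7792-byte response.
const sample = '<a href=results.php?userid=xxxxxxx&offset=0&show_names=0&state=2&appid=>Überprüfung ausstehend (459)';
const report = diagnose(sample, "7792");
console.log(report.withAngleBracket); // -1
console.log(report.withParenthesis >= 0); // true
```

If bytes and declaredBytes agree on the real payload, the data is complete and the problem lies in the search pattern or in the page that was returned.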

molesworth · 7 January 2020 18:22

Haven't you seen the warnings? You should never parse HTML using a regex...

E1cid · 7 January 2020 19:46

Never ever? Why, what will happen? Will the earth end or my PC explode?
