Scraping and making sense of a website

Hi,

I'm looking for some help making sense of the output of a website that I'm scraping. It's more of a "how do I organise the data logically, given the varying number of games in each league?" question.

So far I have the code below, but my major hurdle at the moment is that the games are separated into leagues, which range from 1 to about 15 games per league. There are also adverts in between all of that (the code for removing those is at the bottom, and I think it works correctly).

I've checked the website with the browser's dev tools and there's no JSON to use for all of this, or maybe someone who knows better than I do can spot something? The site owners are also not contactable, so I can't ask for an API or some other feed.

[{"id":"760938c6afbe7e2b","type":"moment","z":"769bf287f2a6b1df","name":"","topic":"","input":"payload","inputType":"msg","inTz":"Europe/London","adjAmount":"1","adjType":"days","adjDir":"add","format":"[Soccer football predictions, statistics, bet tips, results]YYYY-MM-DD[/starttime]","locale":"en-US","output":"url","outputType":"msg","outTz":"Europe/London","x":570,"y":790,"wires":[["7388280fef6ee3b1"]]},{"id":"7388280fef6ee3b1","type":"http request","z":"769bf287f2a6b1df","name":"","method":"GET","ret":"txt","paytoqs":"ignore","url":"","tls":"","persist":false,"proxy":"","insecureHTTPParser":false,"authType":"","senderr":false,"headers":,"x":780,"y":790,"wires":[["a3dd532fe83bc01d"]]},{"id":"a11c8222d998d100","type":"inject","z":"769bf287f2a6b1df","name":"","props":[{"p":"payload"},{"p":"topic","vt":"str"}],"repeat":"","crontab":"","once":false,"onceDelay":0.1,"topic":"","payload":"","payloadType":"date","x":425,"y":790,"wires":[["760938c6afbe7e2b"]],"l":false},{"id":"a3dd532fe83bc01d","type":"html","z":"769bf287f2a6b1df","name":"Predictions","property":"payload","outproperty":"payload","tag":"div.predictions:nth-child(6)","ret":"text","as":"multi","x":950,"y":790,"wires":[["dd4c5c1d5ec50b10"]]},{"id":"dd4c5c1d5ec50b10","type":"debug","z":"769bf287f2a6b1df","name":"debug 217","active":true,"tosidebar":true,"console":false,"tostatus":false,"complete":"payload","targetType":"msg","statusVal":"","statusType":"auto","x":1005,"y":730,"wires":,"l":false}]

I'm trying to get something like this: 4 lines, namely Country, League, Game Info and then other stats. I already have a solution to further make sense of these 4 lines of text, so I don't need help with that bit.


[{"id":"62b931abdecebc1f","type":"inject","z":"769bf287f2a6b1df","name":"","props":[{"p":"payload"}],"repeat":"","crontab":"","once":false,"onceDelay":0.1,"topic":"","payload":"[\"IRAN\",\"PERSIAN GULF PRO LEAGUE\",\"1X 2022-12-30 10:30-Zob Ahan-Sanat Naft\",\"4937144441155025103565\"]","payloadType":"json","x":485,"y":910,"wires":[["a3fd03329ccd5439"]],"l":false},{"id":"a3fd03329ccd5439","type":"debug","z":"769bf287f2a6b1df","name":"debug 218","active":true,"tosidebar":true,"console":false,"tostatus":false,"complete":"payload","targetType":"msg","statusVal":"","statusType":"auto","x":585,"y":910,"wires":[],"l":false}]

Removing adverts

[{"id":"4872ac58c9cbbfe1","type":"change","z":"769bf287f2a6b1df","name":"","rules":[{"t":"change","p":"payload","pt":"msg","from":"actions Advertisement\\s+TIP","fromt":"re","to":"actions TIP","tot":"str"},{"t":"change","p":"payload","pt":"msg","from":"actions Advertisement\\s+.+\\s+.+\\s+.+\\s+TIP","fromt":"re","to":"actions TIP","tot":"str"}],"action":"","property":"","from":"","to":"","reg":false,"x":670,"y":100,"wires":[["82bb6ee264445af1","3ee86e7ea913d296"]]}]
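
For anyone who would rather do this step in a function node, the two change-node rules above could be sketched in JavaScript roughly as follows. This is only a sketch: it assumes the change node applies each regular expression globally, `stripAdverts` is a hypothetical name, and the sample text is made up for illustration.

```javascript
// Hypothetical helper mirroring the two change-node rules above.
// Assumption: each regex is applied globally, one rule after the other.
function stripAdverts(text) {
    return text
        // Rule 1: an "Advertisement" label followed directly by the next TIP header.
        .replace(/actions Advertisement\s+TIP/g, "actions TIP")
        // Rule 2: an "Advertisement" label followed by three lines of advert text.
        .replace(/actions Advertisement\s+.+\s+.+\s+.+\s+TIP/g, "actions TIP");
}

// Made-up sample input, not real page content.
const cleaned = stripAdverts("actions Advertisement\nline one\nline two\nline three\nTIP");
console.log(cleaned);
```

In a function node this would be used as `msg.payload = stripAdverts(msg.payload); return msg;`.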

This is where I gave up and decided to ask for help! :)

[{"id":"6458d2195aeba8f0","type":"change","z":"769bf287f2a6b1df","name":"Build array","rules":[{"t":"set","p":"payload","pt":"msg","to":"$split(payload,',')","tot":"jsonata"},{"t":"set","p":"payload[0]","pt":"msg","to":" TIP","tot":"str"}],"action":"","property":"","from":"","to":"","reg":false,"x":395,"y":150,"wires":[["fd64e246d8ec7a11","5cc9b4f36aa05401"]],"l":false},{"id":"3ee86e7ea913d296","type":"change","z":"769bf287f2a6b1df","name":"Change","rules":[{"t":"change","p":"payload","pt":"msg","from":"your","fromt":"str","to":",your","tot":"str"},{"t":"change","p":"payload","pt":"msg","from":"prediction","fromt":"str","to":"prediction,","tot":"str"},{"t":"change","p":"payload","pt":"msg","from":"STAGE","fromt":"str","to":"STAGE,","tot":"str"},{"t":"change","p":"payload","pt":"msg","from":"TIP","fromt":"str","to":"TIP,","tot":"str"},{"t":"change","p":"payload","pt":"msg","from":"1X2H1HXH21.52.53.5BTSOTS","fromt":"str","to":",1X2H1HXH21.52.53.5BTSOTS,","tot":"str"},{"t":"change","p":"payload","pt":"msg","from":"close","fromt":"str","to":",close","tot":"str"},{"t":"delete","p":"topic","pt":"msg"}],"action":"","property":"","from":"","to":"","reg":false,"x":270,"y":150,"wires":[["6458d2195aeba8f0","da8dfd3d890df76a"]]},{"id":"da8dfd3d890df76a","type":"debug","z":"769bf287f2a6b1df","name":"debug 202","active":false,"tosidebar":true,"console":false,"tostatus":false,"complete":"payload","targetType":"msg","statusVal":"","statusType":"auto","x":295,"y":200,"wires":,"l":false},{"id":"82bb6ee264445af1","type":"debug","z":"769bf287f2a6b1df","name":"debug 203","active":false,"tosidebar":true,"console":false,"tostatus":false,"complete":"payload","targetType":"msg","statusVal":"","statusType":"auto","x":835,"y":100,"wires":,"l":false},{"id":"fd64e246d8ec7a11","type":"debug","z":"769bf287f2a6b1df","name":"debug 
204","active":false,"tosidebar":true,"console":false,"tostatus":false,"complete":"payload","targetType":"msg","statusVal":"","statusType":"auto","x":385,"y":200,"wires":,"l":false},{"id":"4872ac58c9cbbfe1","type":"change","z":"769bf287f2a6b1df","name":"","rules":[{"t":"change","p":"payload","pt":"msg","from":"actions Advertisement\s+TIP","fromt":"re","to":"actions TIP","tot":"str"},{"t":"change","p":"payload","pt":"msg","from":"actions Advertisement\s+.+\s+.+\s+.+\s+TIP","fromt":"re","to":"actions TIP","tot":"str"}],"action":"","property":"","from":"","to":"","reg":false,"x":670,"y":100,"wires":[["82bb6ee264445af1","3ee86e7ea913d296"]]},{"id":"deeaeec07e09e271","type":"split","z":"769bf287f2a6b1df","name":"","splt":"\n","spltType":"str","arraySplt":"3","arraySpltType":"len","stream":true,"addname":"","x":1075,"y":150,"wires":[["4b43fac699daf4e4"]],"l":false},{"id":"397ea147bfb45adc","type":"change","z":"769bf287f2a6b1df","name":"","rules":[{"t":"change","p":"payload[0]","pt":"msg","from":"^\s","fromt":"re","to":"","tot":"str"},{"t":"change","p":"payload[0]","pt":"msg","from":" - ","fromt":"str","to":", - ,","tot":"str"},{"t":"change","p":"payload[0]","pt":"msg","from":", - 
,","fromt":"str","to":",","tot":"str"},{"t":"change","p":"payload[0]","pt":"msg","from":"\d{2}:\d{2}","fromt":"re","to":",","tot":"str"},{"t":"change","p":"payload[0]","pt":"msg","from":",\s.+","fromt":"re","to":"","tot":"str"}],"action":"","property":"","from":"","to":"","reg":false,"x":585,"y":150,"wires":[["da6eb1c4cb91ee42","6627b661e585cac9"]],"l":false},{"id":"da6eb1c4cb91ee42","type":"change","z":"769bf287f2a6b1df","name":"","rules":[{"t":"delete","p":"payload[1]","pt":"msg"},{"t":"delete","p":"payload[2]","pt":"msg"},{"t":"delete","p":"payload[3]","pt":"msg"},{"t":"delete","p":"payload[3]","pt":"msg"},{"t":"delete","p":"payload[3]","pt":"msg"},{"t":"delete","p":"payload[3]","pt":"msg"}],"action":"","property":"","from":"","to":"","reg":false,"x":665,"y":150,"wires":[["9fe06ab3c07444b9","0da6b3514c1d48f1"]],"l":false},{"id":"9fe06ab3c07444b9","type":"change","z":"769bf287f2a6b1df","name":"","rules":[{"t":"change","p":"payload[3]","pt":"msg","from":"^\s","fromt":"re","to":"","tot":"str"},{"t":"change","p":"payload[3]","pt":"msg","from":" - ","fromt":"str","to":", - ,","tot":"str"},{"t":"change","p":"payload[3]","pt":"msg","from":", - 
,","fromt":"str","to":",","tot":"str"},{"t":"change","p":"payload[3]","pt":"msg","from":"\d{2}:\d{2}","fromt":"re","to":",","tot":"str"},{"t":"change","p":"payload[3]","pt":"msg","from":",\s.+","fromt":"re","to":"","tot":"str"}],"action":"","property":"","from":"","to":"","reg":false,"x":755,"y":150,"wires":[["42e79eb7bf89c4a9","fc8e0611a62b4253"]],"l":false},{"id":"42e79eb7bf89c4a9","type":"change","z":"769bf287f2a6b1df","name":"","rules":[{"t":"delete","p":"payload[4]","pt":"msg"},{"t":"delete","p":"payload[5]","pt":"msg"},{"t":"delete","p":"payload[6]","pt":"msg"},{"t":"delete","p":"payload[6]","pt":"msg"},{"t":"delete","p":"payload[6]","pt":"msg"},{"t":"delete","p":"payload[6]","pt":"msg"}],"action":"","property":"","from":"","to":"","reg":false,"x":835,"y":150,"wires":[["43c2917c922b2505","e1f9dba13ae8dfc7"]],"l":false},{"id":"43c2917c922b2505","type":"change","z":"769bf287f2a6b1df","name":"","rules":[{"t":"change","p":"payload[6]","pt":"msg","from":"^\s","fromt":"re","to":"","tot":"str"},{"t":"change","p":"payload[6]","pt":"msg","from":" - ","fromt":"str","to":", - ,","tot":"str"},{"t":"change","p":"payload[6]","pt":"msg","from":", - 
,","fromt":"str","to":",","tot":"str"},{"t":"change","p":"payload[6]","pt":"msg","from":"\d{2}:\d{2}","fromt":"re","to":",","tot":"str"},{"t":"change","p":"payload[6]","pt":"msg","from":",\s.+","fromt":"re","to":"","tot":"str"}],"action":"","property":"","from":"","to":"","reg":false,"x":915,"y":150,"wires":[["260ce30f0a1a876d","98863841996a0c76"]],"l":false},{"id":"260ce30f0a1a876d","type":"change","z":"769bf287f2a6b1df","name":"","rules":[{"t":"delete","p":"payload[7]","pt":"msg"},{"t":"delete","p":"payload[8]","pt":"msg"},{"t":"delete","p":"payload[9]","pt":"msg"},{"t":"delete","p":"payload[9]","pt":"msg"},{"t":"delete","p":"payload[9]","pt":"msg"},{"t":"delete","p":"payload[9]","pt":"msg"}],"action":"","property":"","from":"","to":"","reg":false,"x":995,"y":150,"wires":[["6925443292b5ae4c"]],"l":false},{"id":"5cc9b4f36aa05401","type":"change","z":"769bf287f2a6b1df","name":"Change","rules":[{"t":"delete","p":"payload[0]","pt":"msg"},{"t":"delete","p":"payload[0]","pt":"msg"},{"t":"delete","p":"payload[0]","pt":"msg"}],"action":"","property":"","from":"","to":"","reg":false,"x":485,"y":150,"wires":[["86496dfcc2d3d34c","397ea147bfb45adc"]],"l":false},{"id":"86496dfcc2d3d34c","type":"debug","z":"769bf287f2a6b1df","name":"debug 205","active":false,"tosidebar":true,"console":false,"tostatus":false,"complete":"payload","targetType":"msg","statusVal":"","statusType":"auto","x":475,"y":200,"wires":,"l":false},{"id":"6627b661e585cac9","type":"debug","z":"769bf287f2a6b1df","name":"debug 206","active":false,"tosidebar":true,"console":false,"tostatus":false,"complete":"payload","targetType":"msg","statusVal":"","statusType":"auto","x":575,"y":200,"wires":,"l":false},{"id":"0da6b3514c1d48f1","type":"debug","z":"769bf287f2a6b1df","name":"debug 
207","active":false,"tosidebar":true,"console":false,"tostatus":false,"complete":"payload","targetType":"msg","statusVal":"","statusType":"auto","x":665,"y":200,"wires":,"l":false},{"id":"fc8e0611a62b4253","type":"debug","z":"769bf287f2a6b1df","name":"debug 208","active":false,"tosidebar":true,"console":false,"tostatus":false,"complete":"payload","targetType":"msg","statusVal":"","statusType":"auto","x":755,"y":200,"wires":,"l":false},{"id":"e1f9dba13ae8dfc7","type":"debug","z":"769bf287f2a6b1df","name":"debug 209","active":false,"tosidebar":true,"console":false,"tostatus":false,"complete":"payload","targetType":"msg","statusVal":"","statusType":"auto","x":835,"y":200,"wires":,"l":false},{"id":"98863841996a0c76","type":"debug","z":"769bf287f2a6b1df","name":"debug 210","active":false,"tosidebar":true,"console":false,"tostatus":false,"complete":"payload","targetType":"msg","statusVal":"","statusType":"auto","x":915,"y":200,"wires":,"l":false},{"id":"6925443292b5ae4c","type":"debug","z":"769bf287f2a6b1df","name":"debug 211","active":true,"tosidebar":true,"console":false,"tostatus":false,"complete":"payload","targetType":"msg","statusVal":"","statusType":"auto","x":995,"y":200,"wires":,"l":false},{"id":"4b43fac699daf4e4","type":"debug","z":"769bf287f2a6b1df","name":"debug 212","active":false,"tosidebar":true,"console":false,"tostatus":false,"complete":"payload","targetType":"msg","statusVal":"","statusType":"auto","x":1075,"y":200,"wires":,"l":false},{"id":"fce0c48a266dfb2b","type":"moment","z":"769bf287f2a6b1df","name":"","topic":"","input":"payload","inputType":"msg","inTz":"Europe/London","adjAmount":"1","adjType":"days","adjDir":"add","format":"[Soccer football predictions, statistics, bet tips, results]YYYY-MM-DD[/starttime]","locale":"en-US","output":"url","outputType":"msg","outTz":"Europe/London","x":375,"y":100,"wires":[["dc40285b1dd779cf"]],"l":false},{"id":"dc40285b1dd779cf","type":"http 
request","z":"769bf287f2a6b1df","name":"","method":"GET","ret":"txt","paytoqs":"ignore","url":"","tls":"","persist":false,"proxy":"","insecureHTTPParser":false,"authType":"","senderr":false,"headers":,"x":425,"y":100,"wires":[["1181e2d265754707"]],"l":false},{"id":"1181e2d265754707","type":"html","z":"769bf287f2a6b1df","name":"Predictions","property":"payload","outproperty":"payload","tag":"div.predictions:nth-child(6)","ret":"text","as":"multi","x":475,"y":100,"wires":[["4872ac58c9cbbfe1"]],"l":false},{"id":"34d8983c5c733950","type":"inject","z":"769bf287f2a6b1df","name":"","props":[{"p":"payload"},{"p":"topic","vt":"str"}],"repeat":"","crontab":"","once":false,"onceDelay":0.1,"topic":"","payload":"","payloadType":"date","x":315,"y":100,"wires":[["fce0c48a266dfb2b"]],"l":false}]

FYI, the end result is this, which I already have working, but I have to scrape each league separately. I'm trying to scrape the entire page and also pick up the Country and League names from the scraped data, rather than defining them manually as I do now.

Sorry, but your initial flow is being reported as invalid JSON by Node-RED, so I can't really help. Try exporting it again in "compact" form and pasting it as text between 3 backticks.

Sorry, try this

[{"id":"760938c6afbe7e2b","type":"moment","z":"769bf287f2a6b1df","name":"","topic":"","input":"payload","inputType":"msg","inTz":"Europe/London","adjAmount":"1","adjType":"days","adjDir":"add","format":"[http://www.statarea.com/predictions/date/]YYYY-MM-DD[/starttime]","locale":"en-US","output":"url","outputType":"msg","outTz":"Europe/London","x":400,"y":330,"wires":[["7388280fef6ee3b1"]]},{"id":"7388280fef6ee3b1","type":"http request","z":"769bf287f2a6b1df","name":"","method":"GET","ret":"txt","paytoqs":"ignore","url":"","tls":"","persist":false,"proxy":"","insecureHTTPParser":false,"authType":"","senderr":false,"headers":[],"x":610,"y":330,"wires":[["a3dd532fe83bc01d"]]},{"id":"a11c8222d998d100","type":"inject","z":"769bf287f2a6b1df","name":"","props":[{"p":"payload"},{"p":"topic","vt":"str"}],"repeat":"","crontab":"","once":false,"onceDelay":0.1,"topic":"","payload":"","payloadType":"date","x":255,"y":330,"wires":[["760938c6afbe7e2b"]],"l":false},{"id":"a3dd532fe83bc01d","type":"html","z":"769bf287f2a6b1df","name":"Predictions","property":"payload","outproperty":"payload","tag":"div.predictions:nth-child(6)","ret":"text","as":"multi","x":780,"y":330,"wires":[["dd4c5c1d5ec50b10"]]},{"id":"dd4c5c1d5ec50b10","type":"debug","z":"769bf287f2a6b1df","name":"debug 217","active":true,"tosidebar":true,"console":false,"tostatus":false,"complete":"payload","targetType":"msg","statusVal":"","statusType":"auto","x":835,"y":270,"wires":[],"l":false}]

I think you can select all competitions with something like (assuming you are using the html node to parse it):

.competition :not([id='']) 

The adverts sit in similar divs but don't have an id attribute, so exclude them.
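
Once the competition blocks are selected, the varying number of games per league could be handled in a function node by emitting one row per game. Below is a minimal sketch of that idea. The block shape (`id`, `country`, `league`, `games` with `info` and `stats`) and the name `flattenCompetitions` are assumptions made for illustration, not something the html node produces by itself; the sample row reuses the example from the question. If the adverts really have no id, an attribute-presence selector such as `.competition[id]` might be another way to skip them, but that depends on the page's markup.

```javascript
// Sketch only: the input shape is hypothetical. Each block is assumed to look like
// { id, country, league, games: [{ info, stats }, ...] }.
function flattenCompetitions(blocks) {
    const rows = [];
    for (const block of blocks) {
        // Advert blocks are assumed to have no id, so skip them.
        if (!block.id) continue;
        // One 4-element row per game, however many games the league has.
        for (const game of block.games) {
            rows.push([block.country, block.league, game.info, game.stats]);
        }
    }
    return rows;
}

// Sample data: one league taken from the question, plus a placeholder advert block.
const sampleBlocks = [
    {
        id: "c1", // placeholder id
        country: "IRAN",
        league: "PERSIAN GULF PRO LEAGUE",
        games: [
            { info: "1X 2022-12-30 10:30-Zob Ahan-Sanat Naft", stats: "4937144441155025103565" }
        ]
    },
    { country: "", league: "", games: [] } // advert placeholder: no id
];
console.log(flattenCompetitions(sampleBlocks));
```

Each row then has the Country, League, Game Info and stats layout asked for, and a league with 1 game or 15 games is handled by the same loop.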
