HTTP - Extracting Specific Information

Hi Everyone,

I am somewhat new to Node-RED but I am learning a lot. I use it for Hassio automation.

One of the things I want to do is extract some information from a website to use in a notification automation.

The url for the site is: https://www.toronto.ca/services-payments/recycling-organics-garbage/houses/collection-schedule/?addr=5%20Everson%20Dr&lat=43.757538663&lng=-79.406747012

Highlighted are the pieces of information I want to extract.

I am not sure how to proceed on this. Can you please provide me with some directions?

Thanks!

Maybe use a service like https://www.octoparse.com/ (free tier) to scrape the data from the webpage, and then access Octoparse's API via Node-RED for the result.
I haven't tried it, but a quick look shows promise!

Scraping projects always interest me. I just took a quick look at the Toronto gov page, and it's a difficult one. The data is loaded dynamically through JavaScript, but I haven't found the source for that yet. Give me half an hour to an hour and I might have the source directly for easy parsing.

Edit: Okay, easier than expected, though of a type I've never seen before. The collection data is actually stored on Google Docs, inside a spreadsheet, and gets loaded into the page and shown from there. You can get this for any address in Toronto through a request to the city's GIS API on their site; from the response you get a key, in this case Friday2, similar to that PDF that is linked. Next, you can get a Google Spreadsheets URL from somewhere (I've yet to find the exact link to it), and the API call for it allows you to download it in CSV format, meaning you can easily read it out with Node-RED. Some good news for a change. That CSV file has all collection dates for the year 2020, for all of Toronto, but each row is prefixed with the key from the GIS API response, so the ones relevant for your address are those with that key. I'll see if I can find you the exact links in a bit.

I'm not sure that the API is in their free tier; it looks as though it isn't.

This JS file present on the page handles all the magic behind the scenes, and defines where the spreadsheet can be found and how to use the GIS. However, it is minified and obfuscated, so it's not the easiest one to read :slight_smile: https://www.toronto.ca/resources/swm_collection_calendar/scripts/swm_collection_calendar.js

In short, you've two options:

  1. Trust that the site (it's WordPress based) and the Google APIs and generated URLs are stable enough that those addresses don't change over the course of 2020, and that you can just hardcode those links to read out the information you need. If you prettify the code you'll find a link to this spreadsheet: https://docs.google.com/spreadsheets/d/1Om0nwrYzeombeuMf-1pMksyG7oaTdXVpN3vR7-qrjdo/pubhtml It is also available as a CSV file, and with a bit of additional searching as JSON too. If you do a search through the GIS service, your location is returned with the Friday2 code. Meaning that you can now take the spreadsheet as a base and look for rows with Friday2 in the calendar column. That should be the easiest way (see the sketch after this list).
  2. This is usually my preferred way: I don't trust those addresses to be stable and, if possible, will have my scraper figure out that information on its own. Take the swm_collection_calendar.js file as a base, do a text search (regex maybe) for where the gis.toronto.ca endpoint and the corresponding path are defined, dynamically format the URL, and continue from there so the system can recover if the files were to change. From there, do the same with the spreadsheets address and dynamically set it up. Do note that this option is definitely advanced usage. I actually used this method once in a black-box third-party automated testing job for my work.
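For option 1, the Node-RED side can stay small: an HTTP request node pointed at the CSV export of that spreadsheet, a csv node (with "first row contains column names" ticked) to parse it, and a function node to keep only the rows for your key. A minimal sketch of that last step; the column name "Calendar" is an assumption, so check the actual header row:

    // Function node: keep only the rows for this address's schedule key.
    // "Calendar" is an assumed column name -- verify against the CSV header.
    const key = "Friday2";  // the key the GIS API returned for this address
    msg.payload = msg.payload.filter(row => row.Calendar === key);
    return msg;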

For a more detailed description of how the page works: swm_collection_calendar.js defines a number of endpoints that are checked one by one. First comes the address search field on the page. The contents of that box are appended to https://map.toronto.ca/geoservices/rest/search/rankedsearch?searchString=, with the special characters URL encoded. You can automate this step in Node-RED with an inject node and a simple HTTP request:

[{"id":"5b2164ff.2af9f4","type":"http request","z":"a3aaf733.357c88","name":"","method":"GET","ret":"obj","paytoqs":true,"url":"https://map.toronto.ca/geoservices/rest/search/rankedsearch","tls":"","persist":false,"proxy":"","authType":"","x":250,"y":400,"wires":[["bba60015.07f7d8"]]},{"id":"e14673e0.7b49f8","type":"inject","z":"a3aaf733.357c88","name":"","topic":"","payload":"{\"searchString\":\"<search query>\"}","payloadType":"json","repeat":"","crontab":"","once":false,"onceDelay":0.1,"x":110,"y":400,"wires":[["5b2164ff.2af9f4"]]},{"id":"bba60015.07f7d8","type":"debug","z":"a3aaf733.357c88","name":"","active":true,"tosidebar":true,"console":false,"tostatus":false,"complete":"false","x":430,"y":400,"wires":[]}]

Through that JSON response you get the latitude and longitude of the location, which are then inserted into the GIS service API. The base address for that can be found in the JavaScript file too, and is https://gis.toronto.ca/arcgis/rest/services/primary/cot_geospatial21_mtm/MapServer.
This is further appended with the following syntax: <HOST>/3/query?where=&text=&objectIds=&time=&geometry=<LNG>%2C<LAT>&geometryType=esriGeometryPoint&inSR=4326&spatialRel=esriSpatialRelWithin&relationParam=&outFields=AREA_ID%2CAREA_ATTR_ID%2CPARENT_AREA_ID%2CAREA_SHORT_CODE%2CAREA_NAME%2CAREA_DESC&returnGeometry=false&maxAllowableOffset=&geometryPrecision=&outSR=3857&returnIdsOnly=false&returnCountOnly=false&orderByFields=&groupByFieldsForStatistics=&outStatistics=&returnZ=false&returnM=false&gdbVersion=&returnDistinctValues=false&returnTrueCurves=false&resultOffset=&resultRecordCount=&f=pjson. Here <HOST> is the base URL, <LNG> has to be replaced with the longitude, and <LAT> with the latitude. This can be done in a change node with JSONata, for example. That request also returns JSON, in which you can find the key.
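If you'd rather build that URL in a function node than with JSONata, something along these lines could sit between the rankedsearch request and a second HTTP request node (leave that node's URL field empty so it uses msg.url). The lat/lng property names on the payload are assumptions, so inspect the real rankedsearch response in a debug node first. I've also dropped the empty query parameters for readability; if the service is picky about them, keep the full string from above:

    // Function node: build the GIS query URL from the geocoding result.
    // msg.payload.lat / msg.payload.lng are assumed property names.
    const host = "https://gis.toronto.ca/arcgis/rest/services/primary/cot_geospatial21_mtm/MapServer";
    msg.url = host + "/3/query"
        + "?geometry=" + encodeURIComponent(msg.payload.lng + "," + msg.payload.lat)
        + "&geometryType=esriGeometryPoint&inSR=4326"
        + "&spatialRel=esriSpatialRelWithin"
        + "&outFields=" + encodeURIComponent("AREA_ID,AREA_ATTR_ID,PARENT_AREA_ID,AREA_SHORT_CODE,AREA_NAME,AREA_DESC")
        + "&returnGeometry=false&outSR=3857&f=pjson";
    return msg;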

Finally, the Google Sheets URL is formatted; this is a bit more complicated, but it can be found there too. Check the JavaScript file if you're truly interested in how that part works :slight_smile:

Thank you and I will try to make sense of all of this. This is definitely beyond me at the moment, but I am hoping some heavy reading will allow me to come up with something useful.

Just as an FYI, the schedule for a specific address does not change over the year. So if the CSV can be read once and stored, that could be enough to do the notifications for the entire year.
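One way to do the read-once-and-store part is to stash the parsed rows in flow context, for example with a function node placed right after the csv node. A minimal sketch:

    // Function node: keep the parsed schedule rows in flow context so
    // later parts of the flow can read them back with flow.get("schedule")
    // instead of re-reading the file.
    flow.set("schedule", msg.payload);
    return msg;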

This is not the right solution, but I can set that up until I figure out all the good info you provided. I probably should have mentioned that I haven't been a developer for over 15 years, so I have a learning curve. But this stuff is exciting, so I will be able to get there with help.

Edit: Thanks for posting the flow. I will import and dissect...

Hi @afelix, again, thank you very much for your help so far. I am at stage 1 of this solution, where I am reading the CSV file (yes, I straight up downloaded and saved the file on the Home Assistant server :stuck_out_tongue:). Also, the address I need to use actually needs the Friday1 schedule.

(screenshot: Friday1 schedule)

I used a file node and then a csv node to read the file. I now want to use an inject node, every Thursday at 8 pm, to read the file and extract the "M" values, to figure out whether it is garbage, recycling, etc., and create a notification.

Is writing a custom function the right way to do this? (I might need some help writing that too :P)
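A function node is a reasonable fit here. Below is a minimal sketch, assuming the csv node is configured to output a single message whose payload is an array of row objects, and assuming the column names (WeekStarting, GreenBin, Garbage, Recycling, YardWaste) match your file's header row; those names are guesses, so check the CSV and adjust:

    // Function node: find the row covering today's date and report which
    // collections are marked "M". All column names here are assumptions --
    // check the header row of your CSV and rename as needed.
    const rows = msg.payload;  // array of row objects from the csv node
    const now = new Date();
    const week = rows.find(r => {
        const start = new Date(r.WeekStarting);  // may need a date-format tweak
        return now >= start && now - start < 7 * 24 * 60 * 60 * 1000;
    });
    if (!week) return null;  // no matching row: stop the flow here
    const types = ["GreenBin", "Garbage", "Recycling", "YardWaste"]
        .filter(t => week[t] === "M");  // "M" appears to mark a pickup
    msg.payload = "This week's pickup: " + types.join(", ");
    return msg;

Wired as inject (every Thursday at 8 pm) -> file in -> csv -> function -> notification, the file is simply re-read each week, so caching it is optional.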