Using Node-RED for RSS / website monitors to replace Huginn

Greetings everyone,

I'm writing here as I try to transition from Huginn to Node-RED, since I found Node-RED much easier to install while being leaner and more versatile on the hardware.

The problem is that I cannot program and don't really understand how to set up in Node-RED what I want. Basically, I just want to replace Huginn for:

  1. RSS agents that collect some 50 feeds and push them through filters, then output the result as RSS again.

  2. Monitoring changes on websites and alerting me about each change as an RSS output.

  3. Combining multiple Twitter channels, filtering them, and outputting the result as RSS.

I know it is possible, but as a noob I would like to ask for help here: could somebody provide me with example flows to learn from in practice? Any input is highly appreciated.

Thanks and cheers,
Hein

There is a reason for Huginn's existence: this is not simple stuff.

RSS feeds, no problem.
RSS feeds through filters and back to RSS could already be quite involved.
Monitoring changes on websites could be hard or simple, depending on exactly what you want.

But you need to start with something simple and go from there.

If you add an inject node + http request node + xml node + debug node, connect those nodes together in that order, and put an RSS feed URL in the http request node, then click deploy and press the inject node's button, you will get the RSS feed in JSON format.
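
For reference, here is a minimal, importable sketch of that flow (the feed URL is a placeholder - swap in one of your own feeds):

[{"id":"1e2f3a4b.5c6d7e","type":"inject","z":"d0c1b2a3.f4e5d6","name":"","topic":"","payload":"","payloadType":"date","repeat":"","crontab":"","once":false,"onceDelay":0.1,"x":120,"y":100,"wires":[["2f3a4b5c.6d7e8f"]]},{"id":"2f3a4b5c.6d7e8f","type":"http request","z":"d0c1b2a3.f4e5d6","name":"","method":"GET","ret":"txt","paytoqs":false,"url":"https://example.com/feed.xml","tls":"","proxy":"","authType":"","x":290,"y":100,"wires":[["3a4b5c6d.7e8f9a"]]},{"id":"3a4b5c6d.7e8f9a","type":"xml","z":"d0c1b2a3.f4e5d6","name":"","property":"payload","attr":"","chr":"","x":450,"y":100,"wires":[["4b5c6d7e.8f9a0b"]]},{"id":"4b5c6d7e.8f9a0b","type":"debug","z":"d0c1b2a3.f4e5d6","name":"","active":true,"tosidebar":true,"console":false,"tostatus":false,"complete":"false","x":610,"y":100,"wires":[]}]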

What do you want to filter?

For RSS, after you've worked out your filtering, I'd then replace the http request node with the feedparser node, which only passes on new entries.
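
If it helps, a single feedparser node wired to a debug node is about as small as it gets. Something like this should import, though the url and interval values are placeholders and the exact properties may vary between versions of node-red-node-feedparser:

[{"id":"5c6d7e8f.9a0b1c","type":"feedparse","z":"d0c1b2a3.f4e5d6","name":"example feed","url":"https://example.com/feed.xml","interval":15,"x":160,"y":100,"wires":[["6d7e8f9a.0b1c2d"]]},{"id":"6d7e8f9a.0b1c2d","type":"debug","z":"d0c1b2a3.f4e5d6","name":"","active":true,"tosidebar":true,"console":false,"tostatus":false,"complete":"false","x":350,"y":100,"wires":[]}]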

Although I would agree, the feedparser node does not accept input, so you would need 50 of those nodes in your flow :')

Is there a way to have a list of RSS sources in one node rather than one RSS feed per node? That alone would already be a great starting point.

Sure, multiple ways.

You could use a function node containing an array of URLs, output those one by one by setting msg.url, and feed them into an http request node.
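
A rough sketch of such a function node (the URLs are placeholders, and the downstream http request node's URL field is left empty so that it uses msg.url):

// Function node: fan out one message per feed URL.
// Replace these placeholder URLs with your own feeds.
const urls = [
    'https://example.com/feed1.xml',
    'https://example.com/feed2.xml'
];
for (const url of urls) {
    node.send({ url: url, topic: url }); // picked up by the http request node
}
return null;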

Thanks for the help so far.

But how come there are no standard flows for something this basic to download from the repository? I am aware that I am fishing for a starting point here.

Either way, it feels like I am stuck on Huginn for now.

This is a great idea - unfortunately for those of us who would like to see something like this, nobody has yet had this come to the top of their project ideas list! :cry:

As others have said, this absolutely is not "basic" I'm afraid. Just handling RSS alone is not simple. You have to track updates, control caching, think about invalid feeds, filter unsafe HTML content.

I've looked at this topic repeatedly over the years I've been using Node-RED, but the complexity means it has never made it to the top of my list of projects.

Anyway, why don't we break your request down into parts and we can think about how to achieve it?

Firstly - web page monitoring - this should be fairly easy to achieve but we need some info:

  • How many pages do you think you will monitor?
  • How often do you want to check them?
  • Do you want to check by page metadata (which might not always work - e.g. a cache expiry header) or by actual page difference (which might give lots of false positives due to, for example, on-page adverts)?
  • Do you want to check just a subset of a page rather than the whole page? (more complex).

Thanks, TotalInformation.
I understand these things are not high on the agenda, but I also was not aware that RSS feeds are so difficult. I stumbled upon Node-RED only through this old blog entry, and to my ears it just sounded easier and echoed exactly my concerns. The difference seems to be that he is a whiz and I am not. :wink:
https://dannysu.com/2016/12/29/huginn-to-node-red/

I also contacted him, as he seemed to have had success with his flow and was the developer of the feedparser node. No response though, hence me being here. Node-RED seems not made for the mainstream; neither is Huginn, but the agents in Huginn are easier to work with. A shame actually, as I can easily see more people without programming skills being highly interested in using Node-RED. Especially with an installation as easy as Node-RED's!

As for the website monitor:
a) For now I have about 15 website monitors, which I assume would translate into 15 monitoring nodes, all feeding into a flow where there is only one output node.
b) I am looking at subsets of pages as triggers. To identify the relevant parts/CSS selectors I use the Chrome extension SelectorGadget. With that, I monitor some pages each day and some once a week.
c) Metadata is not relevant for me as a trigger, only the published content. Ideally, the content goes through a filter and, if it matches the keywords, triggers an action such as extracting the changed text from that subset of the website.

How does that sound?

I think, given the time this discussion has taken, I could have copied and pasted 50 feedparser nodes and configured them :slight_smile:

b) I am looking at subsets of pages as triggers. To identify the relevant parts/CSS selectors I use the Chrome extension SelectorGadget.

You can use the html node to 'extract' a part of a page.

What is the end goal of using RSS as an output format?

If you follow lots of feeds, you will probably know that not all feeds are equal, just as not all web pages are. As feeds are generally auto-generated, they have to extract data from the HTML of each page and then take or create the appropriate metadata to be able to form the XML required for the RSS feed. Take a look at the raw RSS if you haven't already and you will see just how complex it is and how much can go wrong.

Adding to that, many feeds only take an "above the line" extract from the HTML, which is very unhelpful for anyone who wants to read the feed offline (on public transport with poor connectivity, for example). To say nothing of advertising being inserted into some feeds.

I did like his explanation in that post and absolutely agree with the sentiments. As I say, I've also struggled with this for years. In fact the only reason I still use an iPad is that I have what I consider to be the best RSS reader of any I've ever tried on any platform - no longer developed so at some point it will doubtless stop working. I also pay for Feedly which I use to manage several dozen feeds - some of which I read occasionally and some regularly.

So I do have a useful workflow, and that, combined with the complexity of creating something new, keeps me from progressing.

I used to use Yahoo! Pipes to reanimate partial feeds back to full pages but of course this is also long gone. I had hoped this could be reproduced in Node-RED but so far the complexity has defeated me.

And he hasn't updated the node either - not sure if that matters. If he has abandoned it, the joy of open source is that someone else can pick it up.

Untrue, there are many mainstream uses for Node-RED, but its background is IoT, not ETL (Extract, Transform & Load), and this use case falls firmly into the ETL camp. NR continues to improve in this area but there is much still to do.

Yes, I agree totally and I try to do my bit to further that agenda. However, development needs people and time - both of which are in very short supply.


Now to something more positive...

OK, that is manageable I think. Each site will need its own flow, though we might find that parts of each flow are the same and can be bundled into a subflow.

Subsets are more complex, but the good news is that there is a node or two that will help us, and you already know how to get to the right section of HTML using CSS selectors.

Yes, we can do that for sure.

So, next question:

Are any of the sites publicly accessible? If so, can you share the URL and the CSS selector? Then we can create an example flow.

This is absolutely up Node-RED's street, and it has been covered before; indeed, I think there is at least one complex example flow on the flows site. I believe I also have a copy of something on my blog. Yes ...

I want to have all alerts about website changes together with the RSS feeds in one feed reader instead of checking multiple systems. I'll add more details when I answer TI!

Well, similar here. I also use Feedly, combined with Newsify on the iPad. The combination works flawlessly and beautifully. Yahoo! Pipes was great, and it eventually led me to Huginn - which is just a resource hog. Being bold a few weeks ago, I decided to try and dump this combo in favor of Node-RED + Tiny Tiny RSS on a self-hosted box. Not as convenient as Newsify, but it has some features that I find quite attractive for the future.

On a lighter note, I do not expect full-length articles in the feeds. I am fine with heterogeneous feeds as long as they end up structured in my reader.

All sites are publicly accessible! There are just too many of them, and it's too annoying to visit them all. I'll come back and post an example later today. Again, thanks for the support so far - both moral and intellectual. :wink:

Hello TotalInformation,

I got around to looking up one of the sites.

URL:

Subpart:
.sectionpar

Of course, this site also has a Twitter account but no RSS. I think it should work as an example.

Cheers,
Hein

So, correct me if I'm wrong, but those websites you want to monitor have news sections, and you want those parsed into RSS feeds as if they already offered their own feed?

Basically, it boils down to that, yes. Some sites have no news but rather publications, such as SEC reports, court decisions, etc., but the overall principle is the same: at one specific point on the site, something changes, and that triggers an alert notifying me of what actually changed.

Coming from a Python world, this sounded pretty simple to me, until I realised that it requires several common, carefully curated libraries that don't have an equivalent in JavaScript. In Python, there's BeautifulSoup, aimed specifically at parsing websites that might not even be valid HTML. There's Bleach, a library created by Mozilla to sanitise untrusted HTML; it's a must-have when working with user input or any kind of third-party HTML. But without an equivalent of the same quality in JavaScript (I'm aware of the npm library bleach; it's similar but implemented from a different concept with a different strategy), porting any of this to JS might not be as wise.

The monitoring aspect is easy enough if you don't care about valid HTML and sanitising potentially dangerous HTML: take an http request node, put in the website you want to check, and trigger that node at regular intervals. Then parse the page, using an xml/html node and JSONata if you prefer that. I use xmldom + xpath (from a function node) when I'm working with an HTML/XML source, and JSONata when the source is JSON. Next, parse it into an array with the latest articles/items. You could put an RBE node behind it, but since the output will be an RSS feed, that can be skipped. Add a last-updated timestamp, and parse the entire thing back into the RSS/XML standard, for example through a template node.
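
To make the xmldom + xpath step concrete, here is a rough function node sketch. It assumes both npm modules have been made available to function nodes (e.g. via functionGlobalContext in settings.js), and the XPath expression and field names are purely illustrative:

// Function node sketch: pull the latest items out of fetched HTML.
// Assumes settings.js contains something like:
//   functionGlobalContext: { xmldom: require('xmldom'), xpath: require('xpath') }
const xmldom = global.get('xmldom');
const xpath = global.get('xpath');

// Real-world HTML is rarely valid XML, so silence the parser's complaints.
const doc = new xmldom.DOMParser({
    errorHandler: { warning: () => {}, error: () => {}, fatalError: () => {} }
}).parseFromString(msg.payload);

// Illustrative expression: every link inside an element whose class contains "sectionpar".
const links = xpath.select("//*[contains(@class,'sectionpar')]//a", doc);

msg.payload = links.map(a => ({
    title: xpath.select('string(.)', a).trim(),
    link: a.getAttribute('href')
}));
return msg;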

If you use a subflow for it and abstract the variables into "environment variables" (the subflow properties, if you look at the configuration), you get a single node that does all the magic and can be configured with, for example, the web address, the identifier to look for, the number of results to return, and so on.
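
Inside such a subflow, a function node can read those instance properties with env.get(). The property names here (URL, SELECTOR, LIMIT) are made up for illustration - use whatever you define on the subflow:

// Function node inside the subflow: pull the per-instance settings.
msg.url = env.get('URL');                          // web address to fetch
msg.selector = env.get('SELECTOR');                // identifier to look for
msg.limit = parseInt(env.get('LIMIT'), 10) || 10;  // number of results to return
return msg;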

As for getting it back out as a usable RSS feed, you can go with an http-in and an http-response node, with the feed generated in between. Some local caching, where a call simply returns the locally stored version, makes it even lighter, but putting that subflow node in between could work too.
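
A sketch of that serving side, assuming the polling flow has already stored the extracted items in flow context under a made-up key (latestItems) - this function node sits between the http-in and http-response nodes:

// Function node between http-in and http-response:
// render whatever is cached as RSS 2.0 and hand it back.
const esc = s => String(s)
    .replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;');

const items = flow.get('latestItems') || []; // stored elsewhere by the polling flow
const body = items.map(i =>
    '<item><title>' + esc(i.title) + '</title><link>' + esc(i.link) + '</link></item>'
).join('');

msg.payload = '<?xml version="1.0" encoding="UTF-8"?>' +
    '<rss version="2.0"><channel>' +
    '<title>Site monitor</title>' +
    '<link>https://example.com</link>' +
    '<description>Generated by Node-RED</description>' +
    body + '</channel></rss>';
msg.headers = { 'Content-Type': 'application/rss+xml' };
return msg;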

There’s definitely options, but the question will be if you should use NR to replace Huginn here or maybe use an exec node to call scripts (in other languages) that do this.

OK, cool. Do you have an example CSS selector for that site?

For example, the following selector matches the container holding the 8 "Insights" on the page:

#faceted-app-content-pwc-gx-en-industries-tmt-media-jcr-content-content-free-1-a987-par-collection > div.container.ng-scope

But I note that turning off JavaScript prevents that data from loading. This means that you actually need a headless "browser", so just doing a request and scanning the returned data may not be enough - it will need checking.

And as expected, the built-in html node is not able to extract the data, because the page uses Angular to deliver those items to the user dynamically.

[{"id":"7efa14c7.bc1cac","type":"inject","z":"63281c77.40a064","name":"","topic":"","payload":"","payloadType":"date","repeat":"","crontab":"","once":false,"onceDelay":0.1,"x":120,"y":1000,"wires":[["d77cae08.8c25d"]]},{"id":"adb70203.b0191","type":"debug","z":"63281c77.40a064","name":"","active":true,"tosidebar":true,"console":false,"tostatus":false,"complete":"false","x":710,"y":1000,"wires":[]},{"id":"d77cae08.8c25d","type":"http request","z":"63281c77.40a064","name":"","method":"GET","ret":"txt","paytoqs":false,"url":"https://www.pwc.com/gx/en/industries/tmt/media.html","tls":"","proxy":"","authType":"","x":290,"y":1000,"wires":[["b8429ad4.e11d78"]]},{"id":"b8429ad4.e11d78","type":"html","z":"63281c77.40a064","name":"","property":"payload","outproperty":"payload","tag":".collectionv2__content","ret":"html","as":"single","x":500,"y":1000,"wires":[["adb70203.b0191"]]}]

So you are going to need something more comprehensive, such as the nbrowser node:

Here is an example that is getting closer - BUT notice how complex this single page is! If all of your sites are this complex, it will take quite some time to get your data. However, you can see that Node-RED can do all of this and it can handle the complexity.

[{"id":"b0c2e02a.20c2c","type":"nbrowser","z":"63281c77.40a064","name":"","methods":[{"name":"gotoURL","func":"goto","params":[{"type":"str","value":"https://www.pwc.com/gx/en/industries/tmt/media.html","typeDefault":"str"}]},{"name":"getHTML","func":"getHTML","params":[{"type":"str","value":".collectionv2__content","typeDefault":"str"},{"type":"output","value":1,"typeDefault":"output"}]}],"prop":"nbrowser","propout":"payload","object":"msg","objectout":"msg","close":true,"show":true,"ssl":false,"outputs":1,"x":280,"y":1080,"wires":[["9f2d3650.01e398"]]},{"id":"85a40495.f8cd38","type":"inject","z":"63281c77.40a064","name":"","topic":"","payload":"","payloadType":"date","repeat":"","crontab":"","once":false,"onceDelay":0.1,"x":120,"y":1080,"wires":[["b0c2e02a.20c2c"]]},{"id":"16be2e09.5b70a2","type":"debug","z":"63281c77.40a064","name":"","active":true,"tosidebar":true,"console":false,"tostatus":false,"complete":"false","x":710,"y":1080,"wires":[]},{"id":"9f2d3650.01e398","type":"html","z":"63281c77.40a064","name":"","property":"payload","outproperty":"payload","tag":"*","ret":"html","as":"single","x":450,"y":1080,"wires":[["16be2e09.5b70a2"]]}]