Automatically extracts the main text from a webpage; in other words, it turns pretty webpages into plain text/JSON data. This is a Node-RED wrapper for the npm module unfluff.
Combined with node-red-contrib-wavenet, it makes it super easy to create a custom Alexa Flash Briefing.
title - The document's title (from the <title> tag)
softTitle - A version of title with less truncation
date - The document's publication date
copyright - The document's copyright line, if present
author - The document's author
publisher - The document's publisher (website name)
text - The main text of the document with all the junk thrown away
image - The main image for the document (what's used by facebook, etc.)
videos - An array of videos that were embedded in the article. Each video has src, width and height.
tags - Any tags or keywords that could be found by checking <rel> tags or by looking at href urls.
canonicalLink - The canonical url of the document, if given.
lang - The language of the document, either detected or supplied by you.
description - The description of the document, from <meta> tags
favicon - The url of the document's favicon.
links - An array of links embedded within the article text. (text and href for each)
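Put together, a downstream node might receive a payload like the following. This is a hypothetical sketch: the field names match the list above, but the sample values are invented, and the assumption that the result lands on msg.payload follows the usual Node-RED convention rather than anything stated here.

```javascript
// Hypothetical example of the extraction result a downstream node might
// receive on msg.payload. Field names follow the list above; the values
// are made up for illustration.
const payload = {
  title: "Example Article Title",
  softTitle: "Example Article Title",
  date: "2019-01-01T00:00:00Z",
  copyright: "Example News Corp",
  author: ["Jane Doe"],
  publisher: "example.com",
  text: "The main article text with navigation and ads stripped out...",
  image: "https://example.com/lead-image.jpg",
  videos: [{ src: "https://example.com/clip.mp4", width: 640, height: 360 }],
  tags: ["news", "example"],
  canonicalLink: "https://example.com/article",
  lang: "en",
  description: "A short summary from the page's meta tags.",
  favicon: "https://example.com/favicon.ico",
  links: [{ text: "related story", href: "https://example.com/related" }]
};

// Later nodes pick out just the fields they need, e.g. for a briefing:
console.log(payload.title);
console.log(payload.text.slice(0, 40));
```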
Perhaps how the node works is unclear: it will fail if the URL does not point to an article. I use it with the node-red-contrib-get-reddit node to fetch the top 5 news headlines, then parse each post's content URL for the full story and pipe the result to Alexa.
Your URL parser has no support for ports set in the address.
The GetAddrInfoReqWrap.onlookup error in the second message comes from a failed DNS lookup, where the address it tried to resolve was actually an IP address, so no DNS query should have been needed. The Google query (text: "The requested URL /null was not found on this server. That's all we know.") failed because the parser takes whatever follows the slash after the domain/TLD as the path; since there was no slash and no check was done, the value was set to null. So rather than requesting http://google.com it requested http://google.com/null.
This is caused by the following, when parsed_url.path is set to null:
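The node's actual parsing code isn't reproduced here, but the failure mode described above can be sketched. Assume a hand-rolled parser (the function name and regex below are hypothetical, not the node's real code) that captures everything after the host as the path and stores null when nothing is there:

```javascript
// Minimal sketch of the reported failure mode (not the node's actual code):
// a naive regex parser that takes whatever follows the host as the path.
function naiveParse(address) {
  const match = address.match(/^https?:\/\/([^/]+)(\/.*)?$/);
  return {
    host: match[1],
    // Bug: when the URL has no path, match[2] is undefined and the
    // parser stores null instead of defaulting to "/".
    path: match[2] || null,
  };
}

const parsed = naiveParse("http://google.com");
// Building the request by string concatenation turns null into the
// literal string "null", producing the URL Google answered with a 404:
const requestUrl = "http://" + parsed.host + "/" + parsed.path;
console.log(requestUrl); // http://google.com/null
```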
I think the DNS lookup failed on IP addresses because of the use of the HTTPS client, but I do not know enough about the Node internals to explain exactly why.
Beyond that, your URL parser will also likely fail on addresses like sub.sub.domain.tld, because nested subdomains are just as valid as multi-part TLDs of the sub.domain.gov.uk variety.
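All three issues raised here (ports, missing paths, nested subdomains) go away if the parsing is delegated to Node's built-in WHATWG URL class instead of a hand-rolled regex, a sketch of which follows:

```javascript
// Node's built-in URL class (global since Node 10) handles ports,
// empty paths, and nested subdomains without hand-rolled parsing.
const u1 = new URL("http://google.com");
console.log(u1.hostname); // google.com
console.log(u1.pathname); // "/" -- never null, so no ".../null" requests

const u2 = new URL("http://sub.sub.domain.tld:8080/page?q=1");
console.log(u2.hostname); // sub.sub.domain.tld
console.log(u2.port);     // 8080
console.log(u2.pathname); // /page
```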
I have been trying to do that for a few days; it's a little more complicated than unfluff. Are you having no success with unfluff? I'm also curious how you are using the node.