[Announce] node-red-contrib-unfluff

balsimpson · 22 December 2019 17:38

An automatic web page content extractor

Automatically grabs the main text out of a webpage; in other words, it turns pretty webpages into boring plain text/JSON data. This is a Node-RED wrapper for the npm module unfluff.

It makes it super easy to create a custom Alexa Flash Briefing when combined with node-red-contrib-wavenet.

Any feedback is welcome.

zenofmud · 22 December 2019 18:26

Crashed NR when I used the IP of my Node-RED: `http://192.168.48.99:1880`

22 Dec 18:24:29 - [red] Uncaught Exception:
22 Dec 18:24:29 - Error: getaddrinfo ENOTFOUND 192.168.48.99:188
    at GetAddrInfoReqWrap.onlookup [as oncomplete] (dns.js:60:26)

balsimpson · 22 December 2019 18:29

I'm sorry to hear that, but why were you using the node-red IP? The node takes a web page URL and parses the content.

zenofmud · 22 December 2019 18:35

"why were you using the node-red IP?"

Because I wanted to see what it would do.

It also has a problem with http://google.com; I get this:

12/22/2019, 1:27:43 PM  node: db0f5c61.e30f88
msg.payload : Object
object
title: "Error 404 (Not Found)!!1"
softTitle: "Error 404 (Not Found)!!1"
date: null
author: array[0]
publisher: null
copyright: null
lang: "en"
tags: array[0]
image: null
videos: array[0]
links: array[1]
text: "The requested URL /null was not found on this server. That’s all we know."

balsimpson · 22 December 2019 18:38

Yes, that's the expected behaviour. If you supply the URL of an article, like http://www.polygon.com/2014/6/26/5842180/shovel-knight-review-pc-3ds-wii-u, it will correctly parse the details. I guess I need to add some error handling. I appreciate the feedback.

bakman2 · 22 December 2019 18:40

What is the difference compared to the HTML node?

zenofmud · 22 December 2019 18:42

It crashes using http://wordpress.org, and after a very long time it crashed NR using http://w3schools.com with:

22 Dec 18:40:37 - Error: connect ETIMEDOUT 199.59.242.153:443
    at TCPConnectWrap.afterConnect [as oncomplete] (net.js:1128:14)

I think I'll wait for the next update...

balsimpson · 22 December 2019 18:48

It outputs a JSON object with the following fields:

title - The document's title (from the <title> tag)
softTitle - A version of title with less truncation
date - The document's publication date
copyright - The document's copyright line, if present
author - The document's author
publisher - The document's publisher (website name)
text - The main text of the document with all the junk thrown away
image - The main image for the document (what's used by facebook, etc.)
videos - An array of videos that were embedded in the article. Each video has src, width and height.
tags - Any tags or keywords that could be found by checking <rel> tags or by looking at href urls.
canonicalLink - The canonical url of the document, if given.
lang - The language of the document, either detected or supplied by you.
description - The description of the document, from <meta> tags
favicon - The url of the document's favicon.
links - An array of links embedded within the article text. (text and href for each)
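
As a small illustration of working with these fields, here is a sketch of a helper that could sit in a function node after the unfluff node and condense msg.payload into one line of text. The name `summarize` and the 200-character limit are invented for this example, and the sample object only copies values from the debug output posted earlier in the thread.

```javascript
// Hypothetical helper for a function node: condense the extracted fields
// into one line of plain text.
function summarize(payload, maxLength = 200) {
    const title = payload.softTitle || payload.title || '(untitled)';
    // The debug output earlier in the thread shows author as an array.
    const authors = Array.isArray(payload.author) ? payload.author.join(', ') : '';
    // Collapse runs of whitespace so the excerpt reads as a single line.
    const text = (payload.text || '').replace(/\s+/g, ' ').trim();
    const excerpt = text.length > maxLength ? text.slice(0, maxLength) + '...' : text;
    return authors ? `${title} (${authors}): ${excerpt}` : `${title}: ${excerpt}`;
}

// Sample shaped like the debug output posted earlier in this thread.
const sample = {
    title: 'Error 404 (Not Found)!!1',
    softTitle: 'Error 404 (Not Found)!!1',
    author: [],
    text: 'The requested URL /null was not found on this server. That’s all we know.'
};
console.log(summarize(sample));
```

Fields such as date, publisher or image can be null when the page does not provide them, so any flow built on top should check for that before using them.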

balsimpson · 22 December 2019 18:52

Maybe the functioning of the node is not clear. It will fail if the URL is not an article. I use it with the node-red-contrib-get-reddit node to get the top 5 news headlines and then parse the content URL of each post for the full story and pipe to Alexa.

zenofmud · 22 December 2019 18:59

No matter what, it shouldn't cause NR to crash.

balsimpson · 22 December 2019 19:00

I agree. Will look into it.
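
For reference, a minimal sketch of one way to keep a failed request from taking Node-RED down: attach an 'error' listener to the request and report every failure through a callback. The helper name `fetchPage` and the 10 second timeout are assumptions made for this example, not code from the node.

```javascript
const http = require('http');
const https = require('https');

// Hypothetical helper: fetch a page and always report failures through the
// callback instead of letting them surface as uncaught exceptions.
function fetchPage(rawUrl, callback) {
    let finished = false;
    const done = (err, body) => {
        if (finished) return;           // make sure the callback fires once
        finished = true;
        callback(err, body);
    };

    let u;
    try {
        u = new URL(rawUrl);            // invalid input throws synchronously
    } catch (err) {
        return done(err);
    }
    if (u.protocol !== 'http:' && u.protocol !== 'https:') {
        return done(new Error('Unsupported protocol: ' + u.protocol));
    }

    const client = u.protocol === 'https:' ? https : http;
    const req = client.get(u, (res) => {
        let body = '';
        res.setEncoding('utf8');
        res.on('data', (chunk) => { body += chunk; });
        res.on('end', () => done(null, body));
    });
    // Without an 'error' listener, failures such as ENOTFOUND or ETIMEDOUT
    // are thrown as uncaught exceptions and stop the whole process.
    req.on('error', (err) => done(err));
    // Give up after 10 seconds (an arbitrary value); destroying the request
    // with an error triggers the 'error' listener above.
    req.setTimeout(10000, () => req.destroy(new Error('Request timed out')));
}

// Bad input is reported to the callback rather than thrown.
fetchPage('not a url', (err) => {
    console.log('failed:', err.message);
});
```

Inside a node's input handler, the callback could pass the error on with node.error(err, msg), so the failure shows up in the debug sidebar and can be handled by a Catch node.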

afelix · 22 December 2019 19:24

I'm quickly reading through the code and I'm noticing a number of things, in no particular order:

Abstract comparison (==) instead of the usual strict comparison (===) that's preferred most of the time when working with javascript
You wrote your own URL parser rather than using the builtin URL parser Node has: https://nodejs.org/docs/latest-v8.x/api/url.html
You check for the protocol in your own URL parser, but then use Node's HTTPS client for all requests (https://nodejs.org/api/https.html#https_https_request_options_callback), regardless of whether the webpage is served over HTTPS or plain HTTP.
Your URL parser has no support for ports set in the address.
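
As a rough sketch of the points above, the WHATWG `URL` class built into Node can replace a hand-written parser, and its `protocol` field can pick between the `http` and `https` modules. The helper name `buildRequest` is made up for illustration and is not taken from the node's source.

```javascript
const http = require('http');
const https = require('https');

// Hypothetical helper: turn a URL string into a client module plus the
// options object expected by http.request / https.request.
function buildRequest(rawUrl) {
    const u = new URL(rawUrl); // throws a TypeError on input it cannot parse
    if (u.protocol !== 'http:' && u.protocol !== 'https:') {
        throw new Error('Unsupported protocol: ' + u.protocol);
    }
    return {
        client: u.protocol === 'https:' ? https : http,
        options: {
            hostname: u.hostname,         // host without the port
            port: u.port || undefined,    // '' means the protocol default
            path: u.pathname + u.search,  // always starts with '/'
            method: 'GET'
        }
    };
}

const local = buildRequest('http://192.168.48.99:1880');
console.log(local.options.hostname, local.options.port, local.options.path);
```

Because `pathname` is '/' when the URL has no path, there is no need to prepend a slash or to special-case a missing path.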

The GetAddrInfoReqWrap.onlookup error in the second message comes from a failed DNS lookup, even though the address being looked up was an IP address, so no DNS query should have been needed. The Google request (text: "The requested URL /null was not found on this server. That’s all we know.") failed because the parser sets the part after the slash that follows the domain/TLD as the path, but since there was no slash and no checks were done, the path is set to null. So rather than requesting http://google.com it requests http://google.com/null.

This is caused by the following, when parsed_url.path is set to null:

let options = {
    host: parsed_url.domain,
    path: '/' + parsed_url.path,
    method: 'GET'
}

    parsed_url = {}

    if ( url == null || url.length == 0 )
        return parsed_url;

    protocol_i = url.indexOf('://');
    parsed_url.protocol = url.substr(0,protocol_i);

    remaining_url = url.substr(protocol_i + 3, url.length);
    domain_i = remaining_url.indexOf('/');
    domain_i = domain_i == -1 ? remaining_url.length - 1 : domain_i;
    parsed_url.domain = remaining_url.substr(0, domain_i);
    parsed_url.path = domain_i == -1 || domain_i + 1 == remaining_url.length ? null : remaining_url.substr(domain_i + 1, remaining_url.length);

I think the reason it failed the DNS lookup on IP addresses is because of the use of the HTTPS client, but I do not know enough about the Node internals to explain exactly why.
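
Tracing the quoted parsing logic by hand points to another possible contributor. The sketch below is a transcription of that logic into a standalone function (the name `parseLikeQuoted` is invented here, and the URL is assumed to be the one from the crash report). It suggests that the port stays inside the host string and that the last character is cut off whenever the URL has no path.

```javascript
// Transcription of the quoted parsing logic into a standalone function.
// The function name is invented here; the statements mirror the snippet above.
function parseLikeQuoted(url) {
    const parsed = {};
    if (url == null || url.length === 0) return parsed;

    const protocolIdx = url.indexOf('://');
    parsed.protocol = url.substr(0, protocolIdx);

    const rest = url.substr(protocolIdx + 3, url.length);
    let domainIdx = rest.indexOf('/');
    // With no slash present, this drops the final character of the host.
    domainIdx = domainIdx === -1 ? rest.length - 1 : domainIdx;
    parsed.domain = rest.substr(0, domainIdx);
    parsed.path = domainIdx === -1 || domainIdx + 1 === rest.length
        ? null
        : rest.substr(domainIdx + 1, rest.length);
    return parsed;
}

console.log(parseLikeQuoted('http://192.168.48.99:1880'));
// → { protocol: 'http', domain: '192.168.48.99:188', path: null }
console.log(parseLikeQuoted('http://google.com'));
// → { protocol: 'http', domain: 'google.co', path: null }
console.log('/' + parseLikeQuoted('http://google.com').path);
// → /null
```

If that reading is right, a host string such as 192.168.48.99:188 is not a valid IP literal, so Node falls back to a DNS lookup, which would account for the ENOTFOUND error in the log regardless of which client module is used.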

Beyond that, your URL parser will also (likely) fail on addresses like sub.sub.domain.tld, because nested subdomains are just as valid as sub.domain.gov.uk type tlds.

balsimpson · 23 December 2019 02:55

Thank you so much for the detailed feedback. I'll see how I can add error handling for it.

johnmoe · 21 March 2020 18:31

Hi @balsimpson, I was wondering if you could integrate with Mozilla's Readability library, which seems to be updated more regularly than the unfluff library.

balsimpson · 19 June 2020 10:42

I have been trying to do that for a few days. It's a little more complicated than unfluff. Are you having no success with unfluff? I'm also curious how you are using the node.
