[Announce] node-red-contrib-unfluff

An automatic web page content extractor

Automatically grabs the main text out of a web page; in other words, it turns pretty web pages into boring plain text/JSON data. This is a Node-RED wrapper for the npm module unfluff.

Makes it super easy to create a custom Alexa Flash briefing when combined with node-red-contrib-wavenet.

Any feedback is welcome.


Crashed NR when I used the IP of my node-red `http:

22 Dec 18:24:29 - [red] Uncaught Exception:
22 Dec 18:24:29 - Error: getaddrinfo ENOTFOUND
    at GetAddrInfoReqWrap.onlookup [as oncomplete] (dns.js:60:26)

I'm sorry to hear that, but why were you using the Node-RED IP? The node takes a web page URL and parses the content.

why were you using the node-red IP?

Because I wanted to see what it would do.

It also has a problem with http://google.com. I get this:

12/22/2019, 1:27:43 PM  node: db0f5c61.e30f88
msg.payload : Object
title: "Error 404 (Not Found)!!1"
softTitle: "Error 404 (Not Found)!!1"
date: null
author: array[0]
publisher: null
copyright: null
lang: "en"
tags: array[0]
image: null
videos: array[0]
links: array[1]
text: "The requested URL /null was not found on this server. That’s all we know."

Yes, that's the expected behaviour. If you supply the URL of an article, like http://www.polygon.com/2014/6/26/5842180/shovel-knight-review-pc-3ds-wii-u, it will correctly parse the details. I guess I need to add some error handling. I appreciate the feedback.

What is the difference compared to the html node ?

It crashes using http://wordpress.org, and after a very long time it crashed NR using http://w3schools.com with:

22 Dec 18:40:37 - Error: connect ETIMEDOUT
    at TCPConnectWrap.afterConnect [as oncomplete] (net.js:1128:14)

I think I'll wait for the next update... :stuck_out_tongue_winking_eye:

It outputs a JSON object with the following fields:

title - The document's title (from the <title> tag)
softTitle - A version of title with less truncation
date - The document's publication date
copyright - The document's copyright line, if present
author - The document's author
publisher - The document's publisher (website name)
text - The main text of the document with all the junk thrown away
image - The main image for the document (what's used by facebook, etc.)
videos - An array of videos that were embedded in the article. Each video has src, width and height.
tags - Any tags or keywords that could be found by checking <rel> tags or by looking at href urls.
canonicalLink - The canonical url of the document, if given.
lang - The language of the document, either detected or supplied by you.
description - The description of the document, from <meta> tags
favicon - The url of the document's favicon.
links - An array of links embedded within the article text. (text and href for each)
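Since several of these fields can come back as null or as an empty array (see the debug output above), a downstream function may want to read them defensively. The following is only a minimal sketch: the helper name `summarize` and the sample object are made up for illustration and are not part of the node.

```javascript
// Hypothetical helper: build a short plain-text summary from the node's
// output object, tolerating null fields and empty arrays.
function summarize(article) {
    const title = article.softTitle || article.title || '(untitled)';
    const authors = Array.isArray(article.author) && article.author.length > 0
        ? ' by ' + article.author.join(', ')
        : '';
    const date = article.date ? ' (' + article.date + ')' : '';
    const text = (article.text || '').trim();
    return title + authors + date + '\n\n' + text;
}

// Illustrative input, shaped like the debug output in this thread.
const sample = {
    title: 'Example title',
    softTitle: 'Example title',
    date: null,
    author: [],
    text: '  Main text of the article.  '
};

const summary = summarize(sample);
```

In a Node-RED function node the same logic could be applied to msg.payload before passing the text on to a speech node.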

Maybe the functioning of the node is not clear. It will fail if the URL is not an article. I use it with the node-red-contrib-get-reddit node to get the top 5 news headlines and then parse the content URL of each post for the full story and pipe to Alexa.

No matter what, it shouldn't cause NR to crash.

I agree. Will look into it.
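For what it's worth, the "Uncaught Exception" lines in the logs above are what Node prints when a request emits an 'error' event that nothing listens for. Below is a rough sketch of a fetch helper that routes every failure into a callback instead. The name `fetchPage` and the 10 second timeout are assumptions for illustration, not the node's actual code.

```javascript
const http = require('http');
const https = require('https');
const { URL } = require('url');

// Hypothetical helper: fetch a page and report every failure through the
// callback, so a bad URL or network error never becomes an uncaught exception.
function fetchPage(pageUrl, callback) {
    let done = false;
    const finish = (err, body) => {
        if (done) return;          // report the outcome only once
        done = true;
        callback(err, body);
    };
    try {
        const u = new URL(pageUrl);                 // throws on malformed input
        const client = u.protocol === 'https:' ? https : http;
        const req = client.get(u, (res) => {
            let body = '';
            res.setEncoding('utf8');
            res.on('data', (chunk) => { body += chunk; });
            res.on('end', () => finish(null, body));
            res.on('error', finish);
        });
        // Assumed timeout of 10 s; destroying the request emits 'error'.
        req.setTimeout(10000, () => req.destroy(new Error('Request timed out')));
        req.on('error', finish);   // ENOTFOUND, ETIMEDOUT, ECONNREFUSED, ...
    } catch (err) {
        finish(err);               // invalid URL or unsupported protocol
    }
}
```

Inside a node's input handler, the callback could then call node.error(err, msg) and return, so that the flow sees a catchable error rather than the runtime going down.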

I'm quickly reading through the code and I'm noticing a number of things, in no particular order:

  1. Abstract comparison (==) instead of the usual strict comparison (===) that's preferred most of the time when working with javascript
  2. You wrote your own URL parser rather than using the builtin URL parser Node has: https://nodejs.org/docs/latest-v8.x/api/url.html
  3. You check for the protocol in your own URL parser, but then use Node's HTTPS client for all requests: https://nodejs.org/api/https.html#https_https_request_options_callback , regardless of whether the webpage is on a secure HTTP connection or not.
  4. Your URL parser has no support for ports set in the address.
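To make points 2 to 4 concrete, here is a small sketch of how the built-in URL class could replace the hand-written parser and pick the matching client. The helper name `buildRequest` and the example addresses are made up for illustration. The default ports are the standard 80 and 443.

```javascript
const http = require('http');
const https = require('https');
const { URL } = require('url');

// Hypothetical helper: turn a page URL into request options plus the
// client (http or https) that matches the URL's protocol.
function buildRequest(pageUrl) {
    const u = new URL(pageUrl);    // throws a TypeError on malformed input
    if (u.protocol !== 'http:' && u.protocol !== 'https:') {
        throw new Error('Unsupported protocol: ' + u.protocol);
    }
    const isHttps = u.protocol === 'https:';
    const options = {
        hostname: u.hostname,                        // works for IPs and nested subdomains
        port: u.port ? Number(u.port) : (isHttps ? 443 : 80),
        path: u.pathname + u.search,                 // '/' when the URL has no path
        method: 'GET'
    };
    return { client: isHttps ? https : http, options: options };
}

// Example addresses are illustrative only.
const local = buildRequest('http://192.168.1.10:1880/ui?tab=1');
const bare = buildRequest('http://google.com');
```

The returned pair would be used as `client.request(options, callback)`, so plain HTTP pages are no longer sent through the HTTPS client.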

The GetAddrInfoReqWrap.onlookup error in the second message comes from a failed DNS lookup. The address being looked up was actually an IP address, so no DNS query should have been needed. The Google request (text: "The requested URL /null was not found on this server. That’s all we know.") failed because the parser sets whatever follows the first slash after the domain/TLD as the path; since there was no slash and no check was done, the path is set to null. So rather than requesting http://google.com it would request http://google.com/null.

This is caused by the following, when parsed_url.path is set to null:

    let options = {
        host: parsed_url.domain,
        path: '/' + parsed_url.path,
        method: 'GET'
    };

    parsed_url = {}

    if ( url == null || url.length == 0 )
        return parsed_url;

    protocol_i = url.indexOf('://');
    parsed_url.protocol = url.substr(0,protocol_i);

    remaining_url = url.substr(protocol_i + 3, url.length);
    domain_i = remaining_url.indexOf('/');
    domain_i = domain_i == -1 ? remaining_url.length - 1 : domain_i;
    parsed_url.domain = remaining_url.substr(0, domain_i);
    parsed_url.path = domain_i == -1 || domain_i + 1 == remaining_url.length ? null : remaining_url.substr(domain_i + 1, remaining_url.length);
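The '/null' path can be reproduced in isolation. The sketch below wraps the parser lines as quoted in this post in a function (the name `parseUrlAsQuoted` is made up here) and compares the result with the built-in URL class. Assuming the quoted snippet is accurate, it also seems to drop the last character of the host when the URL has no slash after the domain.

```javascript
const { URL } = require('url');

// Re-creation of the parser lines quoted above, wrapped in a function
// (hypothetical name) so the behaviour can be checked in isolation.
function parseUrlAsQuoted(url) {
    const parsed_url = {};
    if (url == null || url.length == 0) return parsed_url;

    const protocol_i = url.indexOf('://');
    parsed_url.protocol = url.substr(0, protocol_i);

    const remaining_url = url.substr(protocol_i + 3, url.length);
    let domain_i = remaining_url.indexOf('/');
    domain_i = domain_i == -1 ? remaining_url.length - 1 : domain_i;
    parsed_url.domain = remaining_url.substr(0, domain_i);
    parsed_url.path = domain_i == -1 || domain_i + 1 == remaining_url.length
        ? null
        : remaining_url.substr(domain_i + 1, remaining_url.length);
    return parsed_url;
}

const quoted = parseUrlAsQuoted('http://google.com');
// String concatenation turns null into the text 'null'.
const brokenPath = '/' + quoted.path;

// The built-in parser yields '/' when no path is given.
const fixedPath = new URL('http://google.com').pathname;
```

Here `brokenPath` ends up as '/null' while `fixedPath` is '/', which matches the 404 page Google returned.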

I think the reason it failed the DNS lookup on IP addresses is because of the use of the HTTPS client, but I do not know enough about the Node internals to explain exactly why.

Beyond that, your URL parser will also (likely) fail on addresses like sub.sub.domain.tld, because nested subdomains are just as valid as sub.domain.gov.uk type tlds.


Thank you so much for the detailed feedback. I'll see how I can add error handling for it.

Hi @balsimpson, I was wondering if you could integrate with Mozilla's Readability library, which seems to be updated more regularly than the unfluff library.

I have been trying to do that for a few days. It's a little more complicated than unfluff. Are you having no success with unfluff? I'm also curious how you are using the node.