Web scraping / crawling & HTML content analysis based on keywords - Images & Text

Tags: crawling, HTML, puppeteer, cheerio, search by keyword

Hey guys, a quick introduction first:

I recently found myself in a situation where we had to check a lot of websites for certain keywords (something to do with compliance). It was being done manually, by hand, and it took a lot of time.

We came up with the idea of writing a script to do it for us. The task wasn't that hard: all we had to do was check the images on an affiliate's website, look at the content and evaluate it.

After a few thought cycles, the idea emerged to not build that one specific script, but to take a step back and split the process into logical steps such as the image analysis and the scrape/crawl itself. We also wanted to keep it open ended: a lot of the data we were dealing with was flying around in all kinds of formats, so being able to extend the process (transform the data downstream, etc.) was quite important.

Solution

I present my humble contribution (one of many to come) to the Node-RED community: a web content analysis package, featuring:

  • page-finder - a node that crawls from the parent URL and finds all pages that sit under it (child pages relative to the given URL)
  • image-analyzer - a node that analyzes the images found on the given page URL
  • text-analyzer - a node that analyzes the text found on the given page URL

How it works

Crawling through the website

The job wasn't that hard: the main idea is to use page-finder to crawl through the URL. The node then emits the found links downstream, one by one. I wanted to keep this process as extendable and open ended as possible, since the node might find use cases elsewhere too. The crawler stays within the parent link, and the logic behind it is admittedly a bit naive and inefficient.
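To make that concrete, here's a minimal sketch of the same naive approach outside of Node-RED, using cheerio and Node's built-in fetch (Node 18+). This is not the package's actual code, just the idea: resolve every link on a page, keep only the ones under the parent origin, and visit each one once.

```js
// Minimal sketch of the crawl idea (not the package's actual implementation).
const cheerio = require('cheerio');

async function crawl(startUrl, maxPages = 50) {
  const origin = new URL(startUrl).origin;
  const seen = new Set([startUrl]);
  const queue = [startUrl];
  const found = [];

  while (queue.length > 0 && found.length < maxPages) {
    const current = queue.shift();
    let html;
    try {
      html = await (await fetch(current)).text(); // built-in fetch, Node 18+
    } catch (err) {
      continue; // unreachable page, skip it
    }
    found.push(current); // this is what page-finder would emit downstream

    const $ = cheerio.load(html);
    $('a[href]').each((_, el) => {
      try {
        // resolve relative links and stay inside the parent origin
        const link = new URL($(el).attr('href'), current);
        link.hash = '';
        if (link.origin === origin && !seen.has(link.href)) {
          seen.add(link.href);
          queue.push(link.href);
        }
      } catch (err) {
        // ignore hrefs that don't parse as URLs
      }
    });
  }
  return found;
}

crawl('https://example.com').then(links => links.forEach(l => console.log(l)));
```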

Content Fetching & Analyzing

The other nodes, the text and image analyzers, fetch the HTML content of a given URL and analyze the HTML tags that hold images and text. You simply pass the nodes the keywords to check for.

Here we rely on the assumption that whoever built the webpage knows their SEO, so if we're looking for new mentions of the Bitcoin cryptocurrency, an image on the site should carry something like "bitcoin" in its alt attribute.
(Note: in our case the assumption holds 99.9% of the time.)
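For illustration, a rough sketch of that check (again, not the package's code, just the same idea with assumed inputs): fetch the page, then look for the keywords in the alt attribute of img tags and in the flattened body text.

```js
// Sketch of the keyword check; inputs (URL, keyword array) are assumptions.
const cheerio = require('cheerio');

async function analyzePage(url, keywords) {
  const html = await (await fetch(url)).text();
  const $ = cheerio.load(html);
  const wanted = keywords.map(k => k.toLowerCase());

  // Images: lean on the alt attribute carrying the keyword (the SEO assumption above).
  const imageHits = [];
  $('img').each((_, el) => {
    const alt = ($(el).attr('alt') || '').toLowerCase();
    const matched = wanted.filter(k => alt.includes(k));
    if (matched.length) imageHits.push({ src: $(el).attr('src'), alt, matched });
  });

  // Text: flatten the visible body text and check which keywords appear in it.
  const text = $('body').text().toLowerCase();
  const textHits = wanted.filter(k => text.includes(k));

  return { url, imageHits, textHits };
}

analyzePage('https://example.com', ['bitcoin']).then(r => console.log(r));
```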

My 2 cents

Let me know if you find it useful; the link to the flows section of the Node-RED website has a more detailed explanation.
I've found these nodes useful for:

  • compliance work - checking website content for certain keywords
  • creating alerts when a new keyword is mentioned on a website (although there might be a more efficient way of doing that).

Happy to add more features in the future,
Cheers


Nice! :clap:
I'll give it a try

Sounds like some useful additions to the Node-RED universe. Don't have an immediate need for this but going to keep it in mind.

One - somewhat random - thought: I quite often find myself wanting to know when a page was last updated, since a lot of the pages I'm interested in have a relatively limited lifespan. Having a way to identify that - perhaps from server metadata(?) - would be very nice.

Related to that would be a way to identify NEW pages since a previous run. That would be massive. Changed pages would also be nice, but new ones would be the most critical.
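Something along these lines is roughly what I have in mind for the "last updated" part, purely as a sketch (and of course the header isn't always present or meaningful):

```js
// Rough idea only: ask the server when a page last changed, via a HEAD request.
// Plenty of servers don't send Last-Modified, so this is best-effort at most.
async function lastUpdated(url) {
  const res = await fetch(url, { method: 'HEAD' });
  return {
    url,
    lastModified: res.headers.get('last-modified'), // may well be null
    etag: res.headers.get('etag')                   // handy for change detection too
  };
}

lastUpdated('https://example.com').then(r => console.log(r));
```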

I'm a guy who "develops nodes", so this is my point of view.

Considering the target audience of Node-RED (I primarily made this with "my user" in mind and only decided to publish it later), I would take a less technical, more logic-based approach to the two cases you mention (it would be way less efficient, though):

  • Identifying new pages - you can run the crawler every now and then, store the results in some sort of temporary storage, and compare the new results against the old. The output of the crawler is a string per link, so you can aggregate them into a JSON object or an Array<string>; it boils down to a list comparison (see the sketch after this list).
  • Page updates, on the other hand, are a slightly different topic. This package loads the HTML from a given page URL, which only works if the page is publicly available and not gated. It could be done easily enough, but again, I want to create building blocks and let people assemble the logic themselves, so it would come down to recurring checks and content comparison; the HTML could be cast into one big string and compared (a hacky way to do it, though).
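For the "new pages" case, a function node downstream of the crawler could do the comparison against flow context, roughly like this (just a sketch of the idea; the context key and the payload shape are my assumptions, e.g. the links first aggregated into an array with a join node):

```js
// Function node sketch: compare this run's link list with the previous run.
// Assumes msg.payload is an Array<string> of URLs aggregated from the crawler
// output, and uses flow context as the "temporary storage" mentioned above.
const previous = flow.get('knownLinks') || [];
const current = msg.payload;

const newLinks = current.filter(link => !previous.includes(link));

flow.set('knownLinks', current);

msg.payload = { newLinks, total: current.length };
return msg;
```

For page updates, the same pattern would work with a per-URL hash of the fetched HTML (e.g. crypto.createHash('sha256') over the big string) in place of the link list.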

These are really nice and valuable suggestions, thank you for that; I will add them to the list.

Cheers,
Luka

I'm all for you creating building blocks and really, that is good for Node-RED overall.

Re new pages, I was thinking that you don't need the page content as such, only the reachable URLs - but maybe you'd need to do what you suggest anyway, so perhaps nothing more to do there.

Though that actually leads on to another common requirement - finding broken URLs in a site, which is a very useful function as well. So a link-analyzer?

Yeah, that would make sense; it would be as simple as "if you can't load the HTML, let us know, it's probably broken".
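As a sketch, something like this would be the core of it:

```js
// Sketch of the "link-analyzer" idea: try to load each URL and flag anything
// that errors out or comes back with a 4xx/5xx status. A HEAD request is used
// here; some servers reject HEAD, in which case falling back to GET would be
// the next step.
async function checkLink(url) {
  try {
    const res = await fetch(url, { method: 'HEAD' });
    return { url, ok: res.ok, status: res.status };
  } catch (err) {
    return { url, ok: false, error: err.message };
  }
}

checkLink('https://example.com/maybe-broken').then(r => console.log(r));
```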
