Tags: crawling, HTML, puppeteer, cheerio, search by keyword
Hey guys, I'm introducing a web content analysis package for Node-RED.
I've recently been in a situation where we had to check a lot of websites for certain keywords (something to do with compliance). It was done by hand and it took a lot of time.
We came up with the idea to create a script that would do it for us. The task wasn't that hard: all we had to do was check the images on an affiliate's website, look at the content and evaluate it.
After going through a few thought cycles, we decided not to write this one specific script but rather to take a step back and divide the process into separate steps, such as the image analysis and the scrape / crawl itself. We also wanted to keep things open-ended: a lot of the data we were dealing with was flying around in all kinds of formats, so being able to extend the process (transform the data downstream etc.) was quite important.
Solution
I present my humble contribution (one of many) to the Node-RED community: the web content analysis package, featuring:
- page-finder - a node that crawls through the parent url and finds all the pages in child-parent relationship relative to the given url
- image-analyzer - a node that analyzes the images living on the given page url
- text-analyzer - a node that analyzes the text living on the given page url
How it works
Crawling through the website
The job wasn't that hard. The main idea is to use the page-finder to crawl through the url. The node then spits out the links it finds, one by one, downstream. I wanted this process to be as extendable and open-ended as possible, so this node might find use cases elsewhere too. The crawler stays within the bounds of the parent link, and the logic behind it is a bit dumb and inefficient.
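To make the "stay within the parent link" idea concrete, here is a rough sketch in Python using only the standard library. It is not the page-finder's actual code (the package itself runs on Node.js). The names crawl, LinkCollector and fetch are made up for illustration, and "within bounds" is assumed to mean a simple URL prefix check.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def crawl(parent_url, fetch):
    """Yield each page found under parent_url, one at a time.

    `fetch` is a hypothetical callable that takes a URL and returns its
    HTML (or None on failure); injecting it keeps the sketch offline.
    """
    seen = {parent_url}
    queue = deque([parent_url])
    while queue:
        url = queue.popleft()
        yield url  # emit the page "downstream" before fetching it
        html = fetch(url)
        if html is None:
            continue
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.hrefs:
            # Resolve relative links and drop any "#fragment" part.
            link, _ = urldefrag(urljoin(url, href))
            # Assumption: "in bounds" means the link starts with the parent URL.
            if link.startswith(parent_url) and link not in seen:
                seen.add(link)
                queue.append(link)


# A tiny in-memory "website" standing in for real HTTP requests.
site = {
    "https://example.com/shop/": '<a href="a.html">A</a> <a href="https://other.org/">out</a>',
    "https://example.com/shop/a.html": '<a href="/shop/">home</a> <a href="b.html#top">B</a>',
    "https://example.com/shop/b.html": "",
}
pages = list(crawl("https://example.com/shop/", site.get))
```

With this toy site, pages ends up holding the three shop URLs in the order they were discovered, and the link to other.org is never followed. A real crawler would also want rate limiting and error handling, which are left out here.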
Content Fetching & Analyzing
The other nodes, the text and image analyzers, fetch the HTML content of a provided url and analyze the HTML tags that hold images and text. You simply pass the keywords to the nodes for them to check for.
Here we rely on the assumption that the person who made the webpage knows their stuff when it comes to SEO, so if we're looking for new mentions of the Bitcoin cryptocurrency, an image on the website should hold something like "bitcoin" in the alt attribute of its <img> tag.
(note - in our case the assumption is true 99.9% of the time)
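As a rough illustration of the keyword check, the sketch below (again Python, standard library only) collects the alt texts of images and the visible text of a page, then reports which keywords show up where. This is an assumption about the general approach, not the analyzers' real implementation. ContentCollector and find_keywords are hypothetical names, and matching is assumed to be a case-insensitive substring test.

```python
from html.parser import HTMLParser


class ContentCollector(HTMLParser):
    """Gathers <img> alt texts and visible text from an HTML document."""

    SKIPPED = {"script", "style"}  # their contents are not visible text

    def __init__(self):
        super().__init__()
        self.alts = []
        self.text = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "img":
            alt = dict(attrs).get("alt")
            if alt:
                self.alts.append(alt)

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.text.append(data.strip())


def find_keywords(html, keywords):
    """Return which keywords appear in image alt texts and in page text."""
    collector = ContentCollector()
    collector.feed(html)
    collector.close()
    alts = " ".join(collector.alts).lower()
    text = " ".join(collector.text).lower()
    return {
        "images": [k for k in keywords if k.lower() in alts],
        "text": [k for k in keywords if k.lower() in text],
    }


page = (
    '<h1>Buy Ethereum today</h1>'
    '<img src="b.png" alt="Bitcoin logo">'
    '<script>var bitcoin = 1;</script>'
)
result = find_keywords(page, ["bitcoin", "ethereum"])
```

For this sample page, "bitcoin" is reported under images (from the alt text) and "ethereum" under text, while the mention inside the script block is ignored. Images without an alt attribute are simply invisible to a check like this, which is why the SEO assumption above matters.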
My 2 cents
Let me know if you find it useful. The provided link to the flows section of the Node-RED website has a more detailed explanation.
I found these nodes useful in:
- compliance business - checking for content
- creating alerts when a new keyword is mentioned on the website (although there might be a more efficient way of doing so).
Happy to add more features in the future,
Cheers