TLS Fingerprint Detection - Scraping Supermarket Special Offers

The URL https://sainsburys.co.uk/gol-ui/offers/half-price works in my browser, assuming the supermarket currently has any half-price offers.

When I try to access the page with an http-request node it returns error 403 Access Denied.
I do have a User-Agent header set up.
Presumably the site is using TLS fingerprint detection.


[{"id":"f09e83a54193a718","type":"inject","z":"c82a2bca89a97192","name":"","props":[{"p":"payload"}],"repeat":"","crontab":"","once":false,"onceDelay":0.1,"topic":"","payload":"go","payloadType":"str","x":150,"y":60,"wires":[["1d4688a192d47706"]]},{"id":"1d4688a192d47706","type":"http request","z":"c82a2bca89a97192","name":"","method":"GET","ret":"txt","paytoqs":"ignore","url":"https://sainsburys.co.uk/gol-ui/offers/half-price","tls":"","persist":false,"proxy":"","insecureHTTPParser":false,"authType":"","senderr":false,"headers":[{"keyType":"other","keyValue":"User-Agent","valueType":"other","valueValue":"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:132.0) Gecko/20100101 Firefox/139.0"}],"x":310,"y":60,"wires":[["3ff445451311b9b4"]]},{"id":"3ff445451311b9b4","type":"debug","z":"c82a2bca89a97192","name":"result","active":true,"tosidebar":true,"console":false,"tostatus":false,"complete":"true","targetType":"full","statusVal":"","statusType":"auto","x":470,"y":60,"wires":[]}]

Is there a way to evade this with the http request node?
An alternative approach?

Since there is a lot of AI scraping going on, websites protect themselves in many ways; browser/system fingerprinting is one of the things that makes scraping harder.

In this case it is not too bad: with puppeteer it seems to work OK with a rotating browser agent.
You will need to install chromium; the example assumes you are using Linux (note the executablePath).

In the function node, on the Setup tab, add the puppeteer and user-agents modules:

On the On Message tab:

// launch the chromium installed via apt (Linux path)
const browser = await puppeteer.launch({ headless: true, executablePath: '/usr/bin/chromium' });

try {
    // pick a random browser user agent for each request
    const UserAgent = userAgents;
    const browserAgent = new UserAgent().toString();

    const page = await browser.newPage();
    await page.setUserAgent(browserAgent);

    // networkidle2 resolves when there are no more than 2 network
    // connections for at least 500 ms; timeout is the navigation timeout
    await page.goto('https://sainsburys.co.uk/gol-ui/offers', {
        waitUntil: 'networkidle2',
        timeout: 15000
    });

    const html = await page.$eval('body', el => el.outerHTML);

    msg.payload = html;
    node.log(html); // capture full output in the node-red log
} finally {
    // always close the browser, even if navigation fails
    await browser.close();
}
return msg;

I pass it through an HTML node, selector: article
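The selector step can be sketched in plain JavaScript for anyone curious what the HTML node is doing. This regex-based version is only an illustration for flat, non-nested markup (the HTML node uses a proper parser, which you should prefer in a real flow):

```javascript
// Rough stand-in for the HTML node with selector "article": pull the
// inner content of each <article>...</article> block out of the page.
function extractArticles(html) {
    return [...html.matchAll(/<article\b[^>]*>([\s\S]*?)<\/article>/gi)]
        .map(m => m[1].trim());
}

const sample = '<body><article>Offer A</article><article>Offer B</article></body>';
console.log(extractArticles(sample)); // [ 'Offer A', 'Offer B' ]
```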

That looks promising, thanks @bakman2.

I am having problems that seem to be related to puppeteer though:

This is on a Raspberry Pi running the RPiOS Lite.
I installed chromium with sudo apt install chromium (164 packages!)

At first all of the inject node buttons were greyed out and running node-red-start appeared to hang while loading the latest version of puppeteer.
Running npm install --omit=dev --engine-strict puppeteer at the command line also hangs.

Now the inject buttons are back but it throws "TimeoutError: Timed out after 30000 ms while waiting for the WS endpoint URL to appear in stdout!"

I'm going to start afresh with a new SD card.
Do you have any suggestions about the best order of doing things - install Node-RED, apt install chromium, npm install puppeteer?

Update:
Burn RPiOS Lite 64 bit
sudo apt update && sudo apt -y full-upgrade
sudo apt install chromium
Run NR install script
Create function with puppeteer, deploy. Node-red hangs.
Reboot, cd .node-red
npm install --omit=dev --engine-strict --verbose puppeteer. Hangs at step npm info run puppeteer@24.10.2 postinstall node_modules/puppeteer node install.mjs

I conclude that I can't use chromium & puppeteer on RPiOS Lite.
Next try: desktop OS.

Did you try npm install puppeteer@latest? Is the --engine-strict flag required to force it?

I swapped the SD card into a Pi 3b for twice the memory. I did manage to install it there, but it was timing out requesting the web page.
I saw 13 Chromium processes running, clearly not ideal for a Pi Zero 2 with 4 CPU cores and 512MB memory.

So, abandoning the lesser Pis for this project, I installed chromium on a Pi 4B 2GB with Ethernet and Bookworm Lite, created the flow and ran it, with no separate npm install puppeteer.
It installed OK and the flow runs, though the first deploy was pretty slow while puppeteer was installed.
The flow from inject to the function completing takes > 20 seconds (the default timeout is 30s, so a bit close for comfort).

Sometimes chromium does not close properly, I noticed. I have an exec node with pkill chromium; this could free up some memory over time.

You could also try waitUntil: 'domcontentloaded' - this doesn't wait until all images are loaded. There are more options available, and different approaches to getting the content and interacting with the page.
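A minimal sketch of that alternative. The option names come from the puppeteer API, but treat the selector and timeout values as assumptions to tune for your own page:

```javascript
// 'domcontentloaded' resolves once the DOM is parsed, without waiting
// for images or trackers, so navigation returns much sooner than
// 'networkidle2'. Waiting for a specific selector afterwards makes
// sure the content you actually scrape has rendered.
const gotoOptions = { waitUntil: 'domcontentloaded', timeout: 15000 };

// In the function node, roughly:
//   await page.goto(url, gotoOptions);
//   await page.waitForSelector('article', { timeout: 15000 });
console.log(gotoOptions.waitUntil);
```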
