Is there any way to automatically authenticate myself in a site with a node?

Doomorelha · 28 June 2019 19:08

Hi guys! I'm building a bot in Node-RED that requests a web page over HTTP, looks at the page, retrieves some information from the HTML, and emails it to me whenever the HTML changes (like an update in a market shop). I've got this far.
My problem is that this site requires authentication, and I don't know how to do it. I already tried the "use authentication" option on the [http request] node, but nothing changes.

Is there anything I can do? Please, I've been racking my brain over this.
The site is www.autoavaliar.com.br

That's my flow, and that is the message that appears:

(Sorry for the bad English.)

TotallyInformation · 28 June 2019 22:10

Well the first thing to do is identify what type of authentication is needed. There are many ways to do authentication so we need some more clues first.

afelix · 29 June 2019 13:42

I took a quick look at that site, and it looks like an HTML form in the page, where info is entered and then POSTed back to log in, which sets a (session) cookie that is then needed to retrieve the following pages. So basically it's a scraping job with just a regular login system. I've done a lot of scraping jobs for all kinds of projects so far, several even work related. Is there a particular reason why you want the "get the output" part to be done in Node-RED elements rather than utilising another tool for that?

If I had to do this task in a work-related context I would see a couple of options, with a first step shared by all:
0. Reverse engineer the website and figure out exactly how the login system works, whether things are stored as cookies, and whether those cookies are easily transferred or you would have to simulate the login process on each attempt.
Options:

1. Write a custom script in a language of choice that handles the login process and gathers the value, then make it so that this script gets called and returns the value to Node-RED.
2. Figure out if you can send a POST request through the HTTP request node where cookies are shared with other nodes in the flow. I haven't played with that yet, so I can't give the answer to it (mostly because this wouldn't be my own preferred option). If yes, send a POST request to the address where the login form ends up, then share that session with the next node, which will be a GET to that b2b page, and from there the rest of your current flow follows.
3. Write the entire calling setup in a function node, where callbacks are used to forward each step of the HTTP processing until you get the data you need, then forward it to that switch node (though take a look at the RBE node as well, because I think it does exactly what you want to do with that switch).
4. Handle this entire flow outside of Node-RED (or perhaps partially in Node-RED, with just the timestamp node and executing your script).

As for the "use authentication" on the http request node, as far as I know that one is to pass a username/password over HTTP Basic Auth to a web page, which is not the case for the website you mentioned here. HTTP Basic Auth is (simplified) when the web browser gives you a popup that says "you need to login to open this page" and you have to enter a username and password in that popup before you can get control of your web browser back.
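To make the difference concrete, here is a minimal Python sketch of what HTTP Basic Auth boils down to: a single Authorization header containing the base64-encoded username:password pair. The function name and the credentials are placeholders I made up for illustration. A site with a login form, like this one, never looks at that header, which would explain why enabling the option changes nothing.

```python
import base64


def basic_auth_header(username: str, password: str) -> dict:
    """Build the header that HTTP Basic Auth attaches to a request."""
    # The credentials are joined with a colon and base64-encoded (not encrypted).
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return {"Authorization": f"Basic {token}"}


# Placeholder credentials, for illustration only.
print(basic_auth_header("user@example.com", "not-a-real-password"))
```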

(Note: I'm a Q/A engineer by trade, and automating websites and writing and running automated tests on websites was part of my job description. Any more questions, just ask.)

Doomorelha · 30 June 2019 16:42

I will try this first option, but I barely know how to code. So I downloaded a bot in Python that can log in by setting the URL and some HTML tags.
Unfortunately, it hasn't worked yet.
First, I don't know how to link the Python code to a Node-RED flow.
Second, I don't know if the code is right.

I stored the code at https://github.com/Doomorelha/ola so you guys can check it out.

afelix · 30 June 2019 16:51

I stopped reading when I noticed username and password hardcoded in your code. Consider that account hijacked/compromised from now on, change the password ASAP (it’s a weak one too), and never use this same email/password combination further on.

I’ll read the rest of your code in a moment. I’ve hundreds of hours of Selenium experience, consider yourself lucky.

afelix · 30 June 2019 18:52

Okay, I got my laptop now, and looking at the code I've mixed feelings about it. If you were to run it, it would indeed fill in the username and password, but it would crash upon clicking the button. I looked at the HTML code on the site, and it's programmed in a way that makes automation a bit harder, but not impossible. Since the form looked quite simple but already turns out to be a non-standard example, I'm going to walk you through reverse engineering a login system in order to automate it. Selenium is a decent choice, however it requires a browser to be present on the system you deploy this tool on, because it drives that browser. From a Python perspective, another option would be the requests library as-is.

As for running the code from a node-red flow, you need a way to communicate with your NR setup. My preferred option there is to use MQTT as I use that already in other projects and thus hook everything together. You could for example start it through an exec node, which just executes the script, then at the end of the script publish the results to an MQTT topic, which is then used as entry on the NR side, from where the flow continues.
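If you don't have an MQTT broker set up, a simpler variation (not the MQTT route described above) is to let the script print its result and have the exec node pick it up from standard output. Below is a minimal sketch of the script side, assuming the scraped value is a plain string. build_payload and the placeholder value are names I made up for illustration, and the actual login and scraping would go where the comment says. Publishing to MQTT instead would need a client library such as paho-mqtt.

```python
import json
import sys


def build_payload(value, source="scraper"):
    """Serialise the scraped value as a single line of JSON."""
    return json.dumps({"source": source, "value": value})


def main():
    # Hypothetical placeholder: the real script would log in and scrape here.
    scraped_value = "example value"
    # The exec node passes whatever the script writes to stdout on to the flow.
    sys.stdout.write(build_payload(scraped_value) + "\n")


if __name__ == "__main__":
    main()
```

On the Node-RED side, a json node placed after the exec node can turn that line back into an object.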

I guess this post is going to show me how many images I can put in a single post on here... The part below is a step-by-step guide to reverse engineering the login system of this site. I will be using your (kindly supplied) username/password combination, but black out the password and most of the email from any included screenshots. I'm using Firefox for this, but you could just as well use Chrome. It's what you prefer to use.

Step 1. Open the website you want to automate, and do all actions by hand while observing what happens. Here that is opening https://www.autoavaliar.com.br/, entering your credentials in the username/password fields, then pressing the "ENTRAR" button. It will sign in, and redirect you to a page on the b2b.autoavaliar.com.br subdomain. From there on, you can change your web address to the page you need. One of the things I observe here is that on the b2b part a lot of the page is blurry and only loaded later on. This suggests that the page is loaded asynchronously, and it might actually pull in the information from an (external) API. Something to come back to later.

Step 2. Log out, go back to the first page, and open the developer console of the browser. You're interested in 3 parts of the window: the inspector, the javascript console, and the network tab. These 3 together will give you most of the secrets this website has to offer. This was always the part of my job that I loved most (I got tasked to automate the test suites for web applications for our clients that were deemed impossible to automate. Coworkers had been trying for over a year to get it working, and would fail, so the task was put for the long term. Got them all automated, talk about an ego-boost).
Right click the username field of the login form on the page, tap "inspect element", and watch the inspector focus on the following:

This describes the login form. The HTML makes me cry, but that's beyond the scope of this post. <input ... name="email" ...> is the email address input, and <input ... name="password" ...> the password field. The button for logging in is the <button> below. I'm coming back to this at a later stage, for now move on to the next part. If I were to do this in Selenium, which is perfect for automated testing but not the quickest/most efficient solution, I would enter the username in the field that has name="email", the password in name="password", and then press the button that's inside the form with id="form_home_login". I'll add some sample Python code for that at the end.
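The same inspection can be done from a script. Here is a minimal sketch using Python's built-in html.parser that lists the named inputs inside a form with a given id. SAMPLE_HTML is a simplified stand-in I wrote for illustration, not the site's real markup, and FormInputCollector is a made-up helper name.

```python
from html.parser import HTMLParser


class FormInputCollector(HTMLParser):
    """Collect the name attribute of every <input> inside one form."""

    def __init__(self, form_id):
        super().__init__()
        self.form_id = form_id
        self.in_form = False
        self.input_names = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and attrs.get("id") == self.form_id:
            self.in_form = True
        elif tag == "input" and self.in_form and "name" in attrs:
            self.input_names.append(attrs["name"])

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False


# Simplified, hypothetical markup standing in for the real page.
SAMPLE_HTML = """
<form id="form_home_login">
  <input type="text" name="email">
  <input type="password" name="password">
  <button class="btn-login">ENTRAR</button>
</form>
"""

collector = FormInputCollector("form_home_login")
collector.feed(SAMPLE_HTML)
print(collector.input_names)  # ['email', 'password']
```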

Once you enter your credentials and press ENTRAR it will redirect to the b2b domain, as seen in the first step, but what exactly happens to get there? Time to do that and look at the network tab while doing so. I make sure to always have the "keep registrations" checkbox on so I can keep watching the requests when I'm redirected.

Several interesting requests are being done. Starting from the top, first a POST request is done to /auth/login on the regular www subdomain. This one has the username and password as keys in the body of the request, with the values from the form as values in the request body. The return type is a json file which simply looks like { "success": true } if you supplied the correct information. It doesn't appear to set useful cookies, so it will be ignored for now. The next one is a POST to the b2b subdomain. The status code is 302, which means that it will immediately be redirected elsewhere. The headers show it is redirected to the repasse-cotacao page at the bottom of this screenshot. The request body of this one is interesting, and might give us instructions on how to automate this form without having to use Selenium. An anonymised screenshot is visible below:
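As a rough illustration of that first request, the sketch below builds (but does not send) the POST to /auth/login with Python's standard library and checks the JSON reply. Two things here are assumptions on my side: that the body is form-encoded, and that the keys are called email and password like the input names in the form. Check both against the network tab before relying on them. The function names are made up.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

LOGIN_URL = "https://www.autoavaliar.com.br/auth/login"


def build_login_request(email, password):
    """Prepare the first login POST without sending it."""
    # Assumption: form-encoded body with the keys "email" and "password".
    body = urlencode({"email": email, "password": password}).encode("ascii")
    return Request(LOGIN_URL, data=body, method="POST")


def login_succeeded(response_text):
    """Return True only for a reply that looks like {"success": true}."""
    try:
        return json.loads(response_text).get("success") is True
    except (ValueError, AttributeError):
        # Not JSON, or JSON that is not an object.
        return False


request = build_login_request("user@example.com", "not-a-real-password")
print(request.get_method(), request.full_url)
```

Passing the prepared request to urllib.request.urlopen would actually send it.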

There's a bunch of seemingly random (hexadecimal if you look closely) characters that connect to the username, and another set like that with the password. Next there's a value rc, which to my trained eye looks like a CSRF token, an empty value ru, and a login which is probably the value of the login button. All in all, this appears to be another form that is being sent from that first page. The biggest clue I have here however is the switch to the b2b subdomain. Seeing how the previous address in the list didn't redirect, the browser has to know how to go there, and a reference to it has to exist in the JavaScript or HTML of the previous page. Since the values here do not match up to that form we saw before, it's time to take a closer look.

Step 3. Time to log out, and go back to the first page. Go back to the inspector, and search for b2b and try to find a form that has that dashboard page as action attribute. And bingo, the 10th (of 10) results has the form.

As can be seen by comparing both of the images, these keys appear to be static, but who knows, they might be generated from my user agent, IP address, browser info, and so on. Better safe than sorry: if you want to use these, make sure to scrape that information from the page initially and use them like that.

As expected, the rc value is empty, but it's needed to complete the process. Suggests it's indeed a CSRF token, that has to be requested first. Another thing to note is the name2 attribute (I'm crying internally, that's not valid HTML and should have been a data-name2 or something instead but I'm not going there), that has the name of the parameter in the original form that it corresponds to. There's probably javascript in place in between that moves these values to the right place. It might also have the rc value, so that's the next place to look.
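The name2 idea can be sketched in a few lines of Python. The parser below maps each name2 value to the real (hexadecimal) name of the same input, and uses that mapping to build the body of the second form. The markup and the hex keys in SAMPLE_HTML are invented for illustration, and I'm assuming name2 holds email and password. The rc, ru and login fields are deliberately left out because it is still unknown how rc gets its value.

```python
from html.parser import HTMLParser


class Name2Mapper(HTMLParser):
    """Map each input's name2 attribute to its real name attribute."""

    def __init__(self):
        super().__init__()
        self.mapping = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and attrs.get("name2"):
            self.mapping[attrs["name2"]] = attrs.get("name")


def build_b2b_payload(html, email, password):
    """Place the credentials under the scraped hexadecimal keys."""
    mapper = Name2Mapper()
    mapper.feed(html)
    return {
        mapper.mapping["email"]: email,
        mapper.mapping["password"]: password,
    }


# Hypothetical markup: the real keys must be scraped from the live page.
SAMPLE_HTML = """
<form id="form_b2b_login">
  <input type="hidden" name="abc123" name2="email">
  <input type="hidden" name="def456" name2="password">
  <input type="hidden" name="rc" value="">
</form>
"""

payload = build_b2b_payload(SAMPLE_HTML, "user@example.com", "not-a-real-password")
print(payload)
```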

Step 4. Switch to the Debugger tab of the developer console, and take a look at what sources are included. For each file you see, search for form_b2b_login, the name of the secondary form. There's a folder assets/js with inside a javascript file, minified. But, as seen below, it has the code we're looking for. Un-minify the code (https://beautifier.io/ is your friend) and continue below.

Seeing as there is a reference to the previously seen /auth/login too, it seems like we're in the right place.

This is where the magic truly happens. I did a quick search through this file for references to that rc value, but I can't find it yet after 3 scans through the code. What happens here is that if the login form on the page is filled in and the enter/return key is pressed in the password field, the form will submit. When that form submits, the email address and password are first retrieved from that form, then the function submitForm is called, which will POST the username and password to the defined action, "/auth/login". When that succeeds, the form form_b2b_login is looked up, the name2 attributes are used to find the relevant keys, and the email and password are filled in there. That form is then submitted for real, resulting in the browser posting the form we could see in the screenshot in step 2.

Step 5. At this point there are 2 options. One is to fall back to Selenium and use the simple way, as retrieving the value of rc might be hard to reverse engineer. The other one is to keep searching for the value of rc, figure out how it is added to the form, then switch to a library like requests and automate this without a browser.

That's it for the reverse engineering part so far. Once you can successfully replay the login, you can keep the session, go to the page you need, and get that information out.
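"Keeping the session" in practice means holding on to the cookies between requests. Here is a minimal standard-library sketch of that idea, with make_session as a made-up helper name. No request is sent here, and which cookies the site actually sets still has to be confirmed in the network tab. With the requests library, a requests.Session object plays the same role.

```python
from http.cookiejar import CookieJar
from urllib.request import HTTPCookieProcessor, build_opener


def make_session():
    """Create an opener that remembers cookies across requests."""
    jar = CookieJar()
    opener = build_opener(HTTPCookieProcessor(jar))
    return opener, jar


opener, jar = make_session()
# opener.open(<login request>) would store any cookies the server sets in
# jar, and later opener.open(<page URL>) calls would send them back.
print(len(jar))  # 0, nothing has been requested yet
```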

As promised, the quick-and-dirty working selenium code for just the initial form, which according to the javascript seen in step 4 will work.

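import os

from selenium import webdriver
from selenium.webdriver.common.by import By

# Read the credentials from the environment (placeholder variable names).
email = os.environ['AUTOAVALIAR_EMAIL']
password = os.environ['AUTOAVALIAR_PASSWORD']

browser = webdriver.Chrome()
browser.get('https://www.autoavaliar.com.br/')

email_field = browser.find_element(By.NAME, 'email')
email_field.send_keys(email)
password_field = browser.find_element(By.NAME, 'password')
password_field.send_keys(password)
browser.find_element(By.XPATH, '//form[@id="form_home_login"]/descendant::button[contains(@class, "btn-login")]').click()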

That should be enough to log in, but the code isn't too clean either. How to run it reliably is a different story. I usually keep Chrome drivers for all platforms in a folder called drivers next to the Selenium code I'm running, and use os.path.join to build the path to the right driver so that the correct one is used on each platform. Here's a snippet I add to all my current generation Selenium code to do so:

import sys
import os

# Absolute path of the directory this script lives in.
BASE_DIR = os.path.dirname(os.path.abspath(__file__))

if sys.platform.startswith('linux'):
    PLATFORM = 'linux'
    DRIVER_PATH = os.path.join(BASE_DIR, 'drivers', 'chromedriver_linux64')
elif sys.platform.startswith('win32'):
    PLATFORM = 'win32'
    DRIVER_PATH = os.path.join(BASE_DIR, 'drivers', 'chromedriver_win32.exe')
elif sys.platform.startswith('darwin'):
    PLATFORM = 'mac'
    DRIVER_PATH = os.path.join(BASE_DIR, 'drivers', 'chromedriver_mac64')
else:
    raise RuntimeError(f'Platform not supported: {sys.platform}')

I'm going to keep it at that for tonight. Good luck figuring out your next steps.
