Okay, I've got my laptop now, and looking at the code I have mixed feelings about it. If you were to run it, it would indeed fill in the username and password, but it would crash upon clicking the button. I looked at the HTML code on the site, and it's programmed in a way that makes automation a bit harder, but not impossible. The form looks quite simple, but it's already a non-standard example, so I'm going to walk you through reverse engineering a login system in order to automate it. Selenium is a decent choice, however it requires a browser to be present on the system you deploy this tool on, and it drives the site through that browser. From a Python perspective, another option would be the requests
library as-is.
As for running the code from a node-red flow, you need a way to communicate with your NR setup. My preferred option there is to use MQTT as I use that already in other projects and thus hook everything together. You could for example start it through an exec
node, which just executes the script, then at the end of the script publish the results to an MQTT topic, which is then used as entry on the NR side, from where the flow continues.
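To make the exec node plus MQTT idea a bit more concrete, here is a minimal sketch of what the end of such a script could look like. The topic name `autoavaliar/results`, the broker on `localhost`, and the function names are assumptions for illustration; `publish.single` is the one-shot helper from the paho-mqtt package, which would need to be installed separately.

```python
import json


def build_payload(results):
    """Serialise the script's results to a JSON string for MQTT."""
    return json.dumps(results, ensure_ascii=False)


def publish_results(results, topic="autoavaliar/results", send=None):
    """Publish the results to an MQTT topic at the end of the script.

    `send` is any callable taking (topic, payload). When omitted, the
    paho-mqtt helper publish.single is used against a local broker
    (hostname and topic are assumptions, adjust them to your setup).
    """
    payload = build_payload(results)
    if send is None:
        # Imported lazily so the rest of the script works without paho-mqtt.
        import paho.mqtt.publish as publish

        send = lambda t, p: publish.single(t, p, hostname="localhost")
    send(topic, payload)
    return payload
```

On the NR side, an mqtt in node subscribed to the same topic, followed by a json node, would then pick up the result and continue the flow.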
I guess this post is going to show me how many images I can put in a single post on here... The part below is a step by step on reverse engineering the login system of this site. I will be using your (kindly supplied) username/password combination, but black out the password and most of the email from any included screenshots. I'm using Firefox for this, but you could just as well use Chrome. Use whichever you prefer.
Step 1. Open the website you want to automate, and do all actions by hand while observing what happens. Here that means opening https://www.autoavaliar.com.br/, entering your credentials in the username/password fields, then pressing the "ENTRAR" button. It will sign in, and redirect you to a page on the b2b.autoavaliar.com.br subdomain. From there on, you can change your web address to the page you need. One of the things I observe here is that on the b2b part a lot of the page is blurry and only loaded later on. This suggests that the page is loaded asynchronously, and it might actually pull in the information from an (external) API. Something to come back to later.
Step 2. Log out, go back to the first page, and open the developer console of the browser. You're interested in 3 parts of the window: the inspector, the javascript console, and the network tab. These 3 together will give you most of the secrets this website has to offer. This was always the part of my job that I loved most (I got tasked to automate the test suites for web applications for our clients that were deemed impossible to automate. Coworkers had been trying for over a year to get it working, and would fail, so the task was put for the long term. Got them all automated, talk about an ego-boost).
Right click the username field of the login form on the page, tap "inspect element", and watch the inspector focus on the following:
This describes the login form. The HTML makes me cry, but that's beyond the scope of this post.
<input ... name="email" ...>
is the email address input, and
<input ... name="password" ...>
the password field. The button for logging in is the
<button>
below. I'm coming back to this at a later stage, for now move on to the next part. If I were to do this in Selenium, which is perfect for automated testing but not the quickest/most efficient solution, I would enter the username in the field that has
name="email"
, the password in
name="password"
, and then press the button that's inside the form with
id="form_home_login"
. I'll add some sample python code for that at the end.
Once you enter your credentials and press ENTRAR it will redirect to the b2b domain, as seen in the first step, but what exactly happens to get there? Time to do that and look at the network tab while doing so. I make sure to always have the "keep registrations" checkbox on so I can keep watching the requests when I'm redirected.
Several interesting requests are being done. Starting from the top, first a POST request is done to
/auth/login
on the regular
www
subdomain. This one has the username and password as keys in the body of the request, with the values from the form as values in the request body. The return type is a
json
file which simply looks like
{ "success": true }
if you supplied the correct information. It doesn't appear to set useful cookies, so it will be ignored for now. The next one is a POST to the b2b subdomain. The status code is 302, which means that it will immediately be redirected elsewhere. The headers show it is redirected to the repasse-cotacao page at the bottom of this screenshot. The request body of this one is interesting, and might give us instructions on how to automate this form without having to use Selenium. An anonymised screenshot is visible below:
There's a bunch of seemingly random (hexadecimal if you look closely) characters that connect to the username, and another set like that with the password. Next there's a value
rc
, which to my trained eye looks like a CSRF token, an empty value
ru
, and a
login
which is probably the value of the
login
button. All in all, this appears to be another form that is being sent from that first page. The biggest clue I have here, however, is the switch to the
b2b
subdomain. Seeing how the previous address in the list didn't redirect, the browser has to know how to go there, and a reference to it has to exist in the javascript or HTML of the previous page. Since the values here do not match up to that form we saw before, it's time to take a closer look.
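The first request described above, the POST to /auth/login, could be replayed roughly like this. It is only a sketch: the body keys `email` and `password` are an assumption based on the field names seen in the inspector (check the network tab for the exact keys), and `session` is anything that behaves like a `requests.Session`.

```python
LOGIN_URL = "https://www.autoavaliar.com.br/auth/login"


def login(session, email, password):
    """POST the credentials to /auth/login and report whether it succeeded.

    `session` is expected to behave like requests.Session: a .post()
    method returning a response object that has a .json() method.
    """
    # Assumption: the body keys match the form field names from the inspector.
    response = session.post(LOGIN_URL, data={"email": email, "password": password})
    body = response.json()
    # The site answers {"success": true} when the credentials are correct.
    return body.get("success") is True
```

With the requests library installed, calling `login(requests.Session(), email, password)` would be the way to try it, and the same session object can then be reused for the follow-up request to the b2b subdomain.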
Step 3. Time to log out, and go back to the first page. Go back to the inspector, and search for b2b
and try to find a form that has that dashboard page as action
attribute. And bingo, the 10th (of 10) result has the form.
As can be seen by comparing both of the images, these keys appear to be static, but who knows, they might be generated from my user agent, IP address, browser info, and so on. Better safe than sorry: if you want to use these, make sure to scrape them from the page first and use the scraped values.
As expected, the
rc
value is empty, but it's needed to complete the process. That suggests it's indeed a CSRF token that has to be requested first. Another thing to note is the
name2
attribute (I'm crying internally, that's not valid HTML and should have been a
data-name2
or something instead but I'm not going there), that has the name of the parameter in the original form that it corresponds to. There's probably javascript in place in between that moves these values to the right place. It might also have the
rc
value, so that's the next place to look.
Step 4. Switch to the Debugger tab of the developer console, and take a look at what sources are included. For each file you see, search for form_b2b_login
, the name of the secondary form. There's a folder assets/js
containing a minified javascript file. But, as seen below, it has the code we're looking for. Un-minify the code (https://beautifier.io/ is your friend) and continue below.
Seeing as there is a reference to the previously seen
/auth/login
too, it seems like we're in the right place.
This is where the magic truly happens. I did a quick search through this file for references to that
rc
value, but I can't find it yet after 3 scans through the code. What happens here is that if the login form on the page is filled in and the enter/return key is pressed in the password field, the form will submit. When that form submits, the email address and password are first retrieved from that form, then the function
submitForm
is called which will POST to the defined
action
, which is "/auth/login", sending the username and password. And when that happens successfully, the form
form_b2b_login
is then used, and the name2 attributes are accessed to find the relevant fields, to which the email and password are then added. That form is then submitted for real, resulting in the browser posting the form we could see in the screenshot in step 2.
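Translated to Python, the flow of that javascript could be sketched as follows. This is my reading of the minified code, not a tested replay: `by_name2` is assumed to map the name2 values ("email", "password") to the real hex field names, `defaults` holds the other fields of the hidden form (rc, ru, login), the body keys of the first POST are assumed, and `b2b_url` stands for the action of the hidden form. All function names are hypothetical.

```python
def build_b2b_payload(by_name2, defaults, email, password):
    """Mimic the page's javascript: copy the credentials into the hidden form.

    by_name2 maps a name2 value ("email", "password") to the real field name;
    defaults holds the remaining fields (rc, ru, login) as found on the page.
    """
    payload = dict(defaults)
    payload[by_name2["email"]] = email
    payload[by_name2["password"]] = password
    return payload


def replay_login(session, login_url, b2b_url, by_name2, defaults, email, password):
    """Replay both steps: the /auth/login check, then the hidden form POST."""
    # Assumption: the first POST uses the visible form's field names as keys.
    first = session.post(login_url, data={"email": email, "password": password})
    if first.json().get("success") is not True:
        return None  # wrong credentials: the page never submits the second form
    payload = build_b2b_payload(by_name2, defaults, email, password)
    return session.post(b2b_url, data=payload)
```

Note that rc is passed along exactly as it was found on the page, so as long as its real value is unknown the second request may well be rejected.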
Step 5. At this point there are 2 options. One is to fall back to Selenium and take the simple route, as retrieving the value of rc
might be hard to reverse engineer. The other one is to keep searching for the value of rc
, figure out how that one is added to the form, then switch to a library like requests
and automate this without a browser.
That's it for the reverse engineering part so far. Once you can successfully replay the login, you can keep the session, go to the page you need and get that information out.
As promised, the quick-and-dirty working selenium code for just the initial form, which according to the javascript seen in step 4 will work.
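from selenium import webdriver
from selenium.webdriver.common.by import By

# Placeholders: fill in your own credentials
USERNAME = 'you@example.com'
PASSWORD = 'your-password'

browser = webdriver.Chrome()
browser.get('https://www.autoavaliar.com.br/')
# The field names come from the login form inspected in step 2
email_field = browser.find_element(By.NAME, 'email')
email_field.send_keys(USERNAME)
password_field = browser.find_element(By.NAME, 'password')
password_field.send_keys(PASSWORD)
browser.find_element(By.XPATH, '//form[@id="form_home_login"]/descendant::button[contains(@class, "btn-login")]').click()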
That should be enough to log in, but the code isn't too clean either. How to call it properly is a different story. I usually keep the Chrome Drivers for all platforms in a folder called drivers
adjacent to the code I'm running in selenium, and use os.path.join
together with the absolute path of the directory the code is running in to make sure the driver for the correct platform is executed. Here's a snippet I add to all my current generation selenium code to do so:
import sys
import os

if sys.platform.startswith('linux'):
    PLATFORM = 'linux'
    DRIVER_PATH = os.path.join('drivers', 'chromedriver_linux64')
elif sys.platform.startswith('win32'):
    PLATFORM = 'win32'
    DRIVER_PATH = os.path.join('drivers', 'chromedriver_win32.exe')
elif sys.platform.startswith('darwin'):
    PLATFORM = 'mac'
    DRIVER_PATH = os.path.join('drivers', 'chromedriver_mac64')
else:
    raise RuntimeError(f'Platform not supported: {sys.platform}')
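If you want the path to be absolute no matter where the script is started from, a variation along these lines might work. It is a sketch only: the `driver_path` helper and the `DRIVER_NAMES` table are names I made up here, and the file names are the same ones used in the snippet above.

```python
import os
import sys

# Same driver file names as in the snippet above, keyed by platform prefix.
DRIVER_NAMES = {
    "linux": "chromedriver_linux64",
    "win32": "chromedriver_win32.exe",
    "darwin": "chromedriver_mac64",
}


def driver_path(base_dir, platform=sys.platform):
    """Return the absolute path of the driver binary for `platform`."""
    for prefix, name in DRIVER_NAMES.items():
        if platform.startswith(prefix):
            return os.path.join(os.path.abspath(base_dir), "drivers", name)
    raise RuntimeError(f"Platform not supported: {platform}")
```

Calling it as `driver_path(os.path.dirname(os.path.abspath(__file__)))` anchors the drivers folder next to the script itself. On Selenium 4 the result would then be handed over through `Service(executable_path=...)` from `selenium.webdriver.chrome.service`, while older versions took it as the `executable_path` argument of `webdriver.Chrome`.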
I'm going to keep it at that for tonight. Good luck figuring out your next steps.