Using Node-RED to grab data from Webpage (advice required)

I am looking to use Node-RED in conjunction with MQTT Explorer in order to parse through a webpage, grab the relevant information, and publish it to MQTT Explorer accordingly.

Being relatively new to Node-RED and an amateur with regard to HTML, I am hoping to get some advice on how to achieve this. Currently, my flow consists of: Inject --> Http request --> Write file (for viewing) --> Debug.

I am using the GET method, and I have also enabled 'Enable secure (SSL/TLS) connection' and checked 'Use key and certificates from local files', although I do not understand the purpose of the SSL/TLS option or how to configure its settings.

The URL I am trying to access is along the lines of: http://10.35.138.72/home.htm.

Any advice would be greatly appreciated. If you feel there are any details missing, I can provide further information.

To expand, there is a table on the webpage that I am attempting to access; however, when I make the GET request, the details of the table are not in the msg.payload output.

What does the table look like in the source html?

Hello Colin, I appreciate the reply.

To answer your request: now that I have looked deeper, I do not believe the table is present in the source HTML (attached below). From what I understand, graphs.js, which is referenced in the source HTML below, continually updates the table. So I do not believe I can simply get away with using the request node to access this information.

<!DOCTYPE html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="author" content="Nick Pod" />
<meta http-equiv="X-UA-Compatible" content="IE=edge" /> <!-- force IE into its most recent standards mode -->
<META HTTP-EQUIV="CACHE-CONTROL" CONTENT="NO-CACHE">
<title>Ingersoll Rand | Industrial Technologies Sector in the Americas</title>
<link rel="stylesheet" href="style.css" type="text/css" media="screen" />
<link rel="stylesheet" href="print.css" type="text/css" media="print" />

<!--[if (gte IE 8)&(lte IE 9)]>
	<link rel="stylesheet" href="ie8.css" type="text/css" media="screen" />
<![endif]-->
<!--[if IE 9]>
	<link rel="stylesheet" href="ie9.css" type="text/css" media="screen" />
<![endif]-->

<script type="text/javascript" src="s1.js">
<!-- basic encryption against man-in-the-middle //-->
</script>
<script type="text/javascript" src="common.js">
<!-- common scriptcode //-->
</script>
<script type="text/javascript" src="inspectionlog.js">
<!-- common scriptcode //-->
</script>
<script type="text/javascript" id="script_lang" src="lang/eng.js">
<!-- The source of this script gets modified in case of a language change -->
</script>
<script type="text/javascript" src="events.js">
<!-- handle the event log generation //-->
</script>
<script type="text/javascript" src="graphs.js">
<!-- handle the graphing generation //-->
</script>
<script type="text/javascript">
<!--
var prevOnLoad = window.onload;
window.onload = function() {
	if (prevOnLoad != null) {
		prevOnLoad();
	}
	var cookie_state = get_cookie('login_state');
	if(cookie_state == "1"){
		document.getElementById('homeID').style.display = "inline";
		getPage('Home');
	}else{
		location.href = "index.htm";
	}
	
};
//-->
</script>
</head>

<body onkeyup="javascript:KeyHandler(event.keyCode);">
<div id="homeID" style="display: none;">
<div id="edit">
	<div id="edit_pop">
		<form id="editForm" action="">
			<div id="edit_var"></div> <br />
			<br />
			<div id="min_max"> Min: <span id="min"></span> | Max: <span id="max"></span></div><br />
			<input type="hidden" id="edit_var_id" />
			<button type="sumbit" id="confirmVarEdit"><span id="txt_button_confirm">Confirm</span></button>
			<button type="button" id="cancelVarEdit" onclick="cancelEdit();"><span id="txt_button_cancel">Cancel</span></button>
		</form>
	</div>
</div>
<div id="folderBar">
	<img id="logo" src="img/spritesheet.png" alt="IR logo" />
	<ul id="tabbar">
		<li><a href="javascript:nav('Home');" id="tabHome" class="current"><img src="img/spritesheet.png" /></a></li>
		<li><a href="javascript:nav('Event');" id="tabEvent" class=""><img src="img/spritesheet.png" /></a></li>
		<li><a href="javascript:nav('PerformanceLog');" id="tabPerformanceLog" class=""><img src="img/spritesheet.png" /></a></li>
		<li><a href="javascript:nav('Graphing');" id="tabGraphing" class=""><img src="img/spritesheet.png" /></a></li>
		<li><a href="javascript:nav('Maintenance');" id="tabMaintenance" class=""><img src="img/spritesheet.png" /></a></li>
		<li><a href="javascript:nav('InspectionLog');" id="tabInspectionLog" class=""><img src="img/spritesheet.png" /></a></li>
		<li><a href="javascript:nav('CompInfo');" id="tabCompInfo" class=""><img src="img/spritesheet.png" /></a></li>
		<li><a href="javascript:nav('Account');" id="tabAccount" class=""><img src="img/spritesheet.png" /></a></li>
	</ul>
</div>

<div id="ctrlAllowed"></div>

<ul id="controlPanel">
	<li><a href="javascript:StartCmd();" id="cmdStart"><img id="startIcon" src="img/spritesheet.png" alt="" /><span id="txt_statusstart">Start</span></a></li>
	<li><a href="javascript:StopCmd();" id="cmdStop"><img id="stopIcon" src="img/spritesheet.png" alt="" /><span id="txt_statusstop">Stop</span></a></li>
	<li><a href="javascript:ResetCmd();" id="cmdReset"><img id="resetIcon" src="img/spritesheet.png" alt="" /><span id="txt_statusresetalarm">Reset Alarm</span></a></li>
	<!--<li><a href="javascript:LoadCmd();" id="cmdLoad"><img id="loadIcon" src="img/spritesheet.png" alt="" /><span id="txt_statusload">Load</span></a></li>//-->
	<!--<li><a href="javascript:UnloadCmd();" id="cmdUnload"><img id="unloadIcon" src="img/spritesheet.png" alt="" /><span id="txt_statusunload">Unload</span></a></li>//-->
</ul>

<div id="content">
	<div id="titleBarContainer">
		<div id="titleBar">
			<span id="title">Home</span>
			<div id="printAndCredentials">
            <img id="printIcon" src="img/spritesheet.png" style="cursor:pointer;" alt="Print" onclick="window.print();" />
				<div id="credentials">
					<span id="txt_username">Username:</span> <span id="un">...</span><span class="verticalSeperator">|</span>
					<button type="button" id="logoutButton" onclick="javascript:logout();"><span id="txt_button_logout">Logout</span></button>
					<br />
					<span id="txt_compressor">Compressor</span>: <span id="compressorname">...</span>
				</div>
			</div>
		</div>
	</div>
	<div id="page"> </div>
</div>

<div id="dashboard">
	<span id="status">STATUS</span>
	<div id="statusIcons">
		<img id="statuswarn" src="img/spritesheet.png" alt="warn" class="" />
		<img id="statusremote" src="img/spritesheet.png" alt="remote" class="" />
		<img id="statusservice" src="img/spritesheet.png" alt="service" class="" />
		<img id="statusload" src="img/spritesheet.png" alt="load" class="" />
	</div>
	<div id="statusVars">
		<span id="ID_machine_state_number" class="rf1"></span>
		<span id="ID_comm_control" class="rf1"></span>
		<span id="ID_remote_start_stop_enabled" class="rf1"></span>
		<span id="ID_status_flags" class="rf1"></span>
		<span id="ID_service_time_period" class="rf1"></span>
		<span id="ID_service_hours" class="rf1"></span>  
		<span id="ID_remote_web_enabled" class="rf1"></span>  	
	</div>
</div>
</div>
</body>
</html>

The quote below, from the following thread, seems to match my case:
Thread: Newbie and scrapping info from webpages - General - Node-RED Forum (nodered.org)

Quote from Knolleary:
The one caveat to all of this is that some web pages use javascript to dynamically generate their content after the browser has done the initial load. They use further HTTP requests under the covers to get more information and insert it into the page. That can make it trickier to grab the page and extract that content; the HTTP Request node would get the initial page - any javascript embedded on the page doesn't get run. In those cases, it is sometimes possible to dig in and identify what additional calls the page is making and to update the HTTP Request node to use those calls instead.

Sorry for the lengthy response. I am quite lost in the details currently.

Open the page in a new window/tab, right click > web inspector, go to the network tab and see if you see anything happening there when the page updates.

I did do this and confirmed that a getVar.cgi request is constantly being updated. I am hoping to locate the dynamic variables in order to pull the data into Node-RED. Any advice there?

getVar.cgi request is constantly being updated.

Right click on this getVar.cgi, copy as cURL and paste the output here

Hello again,

Here is the cURL:

curl "http://10.35.133.71/getVar.cgi" ^
  -H "Accept: */*" ^
  -H "Accept-Language: en-US,en;q=0.9" ^
  -H "Connection: keep-alive" ^
  -H "Content-type: application/x-www-form-urlencoded" ^
  -H "Cookie: lang=eng; unit=eng; login_state=1" ^
  -H "Origin: http://10.35.133.71" ^
  -H "Referer: http://10.35.133.71/home.htm" ^
  -H "User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36 Edg/114.0.1823.79" ^
  --data-raw "target_pressure&vsd_display_pct_capacity&immediate_stop_pressure&automatic_stop_pressure&package_discharge_pressure&sump_pressure&airend_discharge_temperature&injected_coolant_temperature&aftercooler_discharge_pressure&separator_pressure_drop&coolant_filter_pressure_drop&inlet_vacuum_pressure&remote_pressure&aftercooler_discharge_temperature&interstage_pressure&oil_cooler_out_temp&package_inlet_temp&power_on_hours&running_hours&umv_3301_kilowatts&umv_3301_motor_speed&avg_package_kW_hrs&avg_percent_capacity&energy_cost&energy_savings&lifetime_energy_savings&comm_control&remote_start_stop_enabled&PORO_enabled&PORO_on_time&low_ambient_temperature&scheduled_start_day&scheduled_start_hours&scheduled_start_minutes&scheduled_stop_day&scheduled_stop_hours&scheduled_stop_minutes&remote_pressure_enabled&ISC_Seq1&ISC_Seq2&ISC_Seq3&ISC_Seq4&machine_state_number&comm_control&remote_start_stop_enabled&status_flags&service_time_period&service_hours&remote_web_enabled&1689332930099" ^
  --compressed ^
  --insecure
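
For anyone reading the capture above: `--data-raw` makes curl send the string as a POST body, and the body here is just a list of variable names joined with `&`, followed by one all-digit item. A minimal sketch of pulling that apart is below. `parseDataRaw` is a made-up helper name, and reading the trailing number as a millisecond timestamp is an assumption (the value does decode to a date in July 2023). Whether the controller cares about the duplicated names is not known, so both the original list and a de-duplicated one are returned.

```javascript
// Split a getVar.cgi request body into variable names and the trailing number.
// Assumption: if the last "&"-separated item is all digits, it is a
// millisecond timestamp appended by the page's script.
function parseDataRaw(body) {
    const parts = body.split("&").filter((p) => p.length > 0);
    const last = parts[parts.length - 1];
    const hasStamp = /^\d+$/.test(last);
    const stamp = hasStamp ? Number(parts.pop()) : null;
    // The captured body repeats a few names (for example comm_control),
    // so offer a de-duplicated list alongside the original order.
    const unique = [...new Set(parts)];
    return { all: parts, unique, stamp };
}

// Shortened body for illustration; the real one is the full --data-raw string.
const sample = "target_pressure&sump_pressure&comm_control&comm_control&1689332930099";
const parsed = parseDataRaw(sample);
console.log(parsed.unique);
console.log(new Date(parsed.stamp).toISOString());
```

The `unique` list is a convenient starting point for deciding which values to forward over MQTT later.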

Admin edit: wrapped code in triple backticks. Please use the </> code button before pasting

I see that there is a cookie set for a login. Do you need to log in on this page? If yes, this example flow probably will not work.

[{"id":"027b9f40798de2ca","type":"inject","z":"e636d2d51696eb79","name":"","props":[{"p":"payload"},{"p":"topic","vt":"str"}],"repeat":"","crontab":"","once":false,"onceDelay":0.1,"topic":"","payload":"","payloadType":"date","x":300,"y":360,"wires":[["084f05dc358e2cb3"]]},{"id":"de4239bf40d4fbd5","type":"http request","z":"e636d2d51696eb79","name":"","method":"use","ret":"txt","paytoqs":"ignore","url":"","tls":"","persist":false,"proxy":"","insecureHTTPParser":false,"authType":"","senderr":false,"headers":[],"x":650,"y":360,"wires":[["3ff4a5f5eb98b50e"]]},{"id":"084f05dc358e2cb3","type":"function","z":"e636d2d51696eb79","name":"function 167","func":"\nconst time = Date.now()\nconst headers = []\n\nheaders['Content-type'] = \"application/x-www-form-urlencoded\"\nheaders['Cookie'] = \"lang=eng; unit=eng; login_state=1\"\nheaders['Referer'] = \"http://10.35.133.71/home.htm\"\n\nmsg.method = \"GET\"\nmsg.payload = `target_pressure&vsd_display_pct_capacity&immediate_stop_pressure&automatic_stop_pressure&package_discharge_pressure&sump_pressure&airend_discharge_temperature&injected_coolant_temperature&aftercooler_discharge_pressure&separator_pressure_drop&coolant_filter_pressure_drop&inlet_vacuum_pressure&remote_pressure&aftercooler_discharge_temperature&interstage_pressure&oil_cooler_out_temp&package_inlet_temp&power_on_hours&running_hours&umv_3301_kilowatts&umv_3301_motor_speed&avg_package_kW_hrs&avg_percent_capacity&energy_cost&energy_savings&lifetime_energy_savings&comm_control&remote_start_stop_enabled&PORO_enabled&PORO_on_time&low_ambient_temperature&scheduled_start_day&scheduled_start_hours&scheduled_start_minutes&scheduled_stop_day&scheduled_stop_hours&scheduled_stop_minutes&remote_pressure_enabled&ISC_Seq1&ISC_Seq2&ISC_Seq3&ISC_Seq4&machine_state_number&comm_control&remote_start_stop_enabled&status_flags&service_time_period&service_hours&remote_web_enabled&${time}`\nmsg.url = \"http://10.35.133.71/getVar.cgi\"\n\nreturn msg","outputs":1,"noerr":0,"initialize":"","finalize":"","libs":[],"x":470,"y":360,"wires":[["de4239bf40d4fbd5"]]},{"id":"3ff4a5f5eb98b50e","type":"debug","z":"e636d2d51696eb79","name":"debug 354","active":true,"tosidebar":true,"console":false,"tostatus":false,"complete":"false","statusVal":"","statusType":"auto","x":830,"y":360,"wires":[]}]

Hello @bakman2,

Appreciate the help. Unfortunately, yes, there is a login. I was solely looking at extracting the information first, and then I was going to tackle the next hurdle of getting past the login screen.

Are you familiar with that by any chance?

Thanks a lot.

So for the login you have to follow the same procedure: log out, then log in, capture the cURL, check which cookies exist, and translate the headers into a function or change node.
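
As a rough sketch of the "check which cookies exist" step: the http request node exposes cookies set by a response in `msg.responseCookies`, so a function node after the login request could fold them into the `Cookie` header for the next call. `cookieHeaderFrom` is a made-up helper name, and the shape of the input (values that are either plain strings or objects with a `value` field) is an assumption. It is also not known whether the device sets `login_state` with a Set-Cookie header or whether its own JavaScript sets it in the browser; this only covers the first case.

```javascript
// Turn cookies captured from a login response into a Cookie header value.
// Assumption: `responseCookies` is an object keyed by cookie name whose values
// are either strings or objects with a `value` field.
// `defaults` holds cookies that should always be sent (taken from the cURL).
function cookieHeaderFrom(responseCookies, defaults = {}) {
    const jar = { ...defaults };
    for (const [name, entry] of Object.entries(responseCookies || {})) {
        jar[name] = typeof entry === "object" && entry !== null ? entry.value : entry;
    }
    return Object.entries(jar)
        .map(([name, value]) => `${name}=${value}`)
        .join("; ");
}

const header = cookieHeaderFrom(
    { login_state: { value: "1", path: "/" } },
    { lang: "eng", unit: "eng" }
);
console.log(header); // lang=eng; unit=eng; login_state=1
```

In a function node the result would go into `msg.headers.Cookie` before the message is passed to the second http request node.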

Thanks a lot, I really appreciate the help.

When I log into the page and then execute the sample flow, I receive an empty string as the result.

I see in the network tab that the 'Payload' matches the 'data-raw' portion of the cURL, which is the 'msg.payload' in the sample flow's function node. Is the expected outcome of the sample flow to return what would be found in the network tab under the 'Response' tab? This is the data I am looking to manipulate.

To add, when I click on the getVar.cgi event in the network tab, these are the General headers:

Request URL: http://10.35.133.71/getVar.cgi
Request Method: POST
Status Code: 200 OK
Remote Address: 10.35.133.71:80
Referrer Policy: strict-origin-when-cross-origin
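
One thing stands out when comparing these headers with the sample flow: the browser sends a POST, while the function node sets `msg.method = "GET"`, and it fills a local `headers` array that is never copied onto `msg.headers`, so the Cookie and Content-type are not sent. That may be why the result is an empty string, though the login could also be a factor. A minimal sketch of a corrected function-node body is below; `buildGetVarRequest` is a made-up helper name, and the http request node is assumed to have its Method set to "- set by msg.method -", as in the sample flow.

```javascript
// Build the message for the http request node so it mirrors the captured cURL.
// `host` and `names` are supplied by the caller; nothing here is device-specific
// beyond what the cURL capture shows.
function buildGetVarRequest(msg, host, names) {
    msg.method = "POST"; // the network tab shows POST, not GET
    msg.url = `http://${host}/getVar.cgi`;
    // Headers must be attached to msg.headers (an object) to be sent.
    msg.headers = {
        "Content-type": "application/x-www-form-urlencoded",
        "Cookie": "lang=eng; unit=eng; login_state=1",
        "Referer": `http://${host}/home.htm`
    };
    // Variable names joined with "&", followed by the current time in ms.
    msg.payload = names.concat(String(Date.now())).join("&");
    return msg;
}

// Shortened name list for illustration.
const request = buildGetVarRequest({}, "10.35.133.71", [
    "target_pressure",
    "package_discharge_pressure"
]);
console.log(request.method, request.url);
console.log(request.payload);
```

Inside a function node the last line would simply be `return buildGetVarRequest(msg, "10.35.133.71", [...])` with the full list of names from the capture.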
