HTML is NOT parsing DIV tags consistently

Hello,

I have a flow that scrapes the U.S. Securities website page: "sec.report"

This web page contains a Company List table that contains the following columns:

  1. Company Name. ( Como Health Holdings, LLC )
  2. State Abbreviation Code. ( MO = Missouri )
  3. State of Incorporation Abbreviation Code. ( MO = Missouri )
  4. CIK - Central Index Key ( CIK = 0001816732 )
  5. Updated. ( Updated = 2020-07-02 )

NODE-RED FLOW:

  1. HTTP Request Node
  2. HTML Node with the following selector:
    Selector = div.table>div>div>div

RESULT:

  When Node-Red processed the web page, it did NOT generate a carriage
  return for the two state columns 2 and 3.

  The abbreviations have been appended to the end of the company name.

I have attached two screen prints:

  1. SEC Report - Node Red.png
  2. SEC Report - web page and inspect window.png

Does anyone know why the div tags are not generating carriage returns, and how to fix the string elements in the string array?

Ideally, I would like to convert the string elements into separate key=value pairs and create an object to replace the string element in the array.

Thanks
Vic

There is no need for the json node.

Try this flow (a bit of a rudimentary way of doing it, but it works):

[{"id":"6f07b3fa.94448c","type":"inject","z":"5de8b0ff.e7f7a","name":"","topic":"","payload":"","payloadType":"date","repeat":"","crontab":"","once":false,"onceDelay":0.1,"x":180,"y":384,"wires":[["94bfc134.ad2ce"]]},{"id":"94bfc134.ad2ce","type":"http request","z":"5de8b0ff.e7f7a","name":"","method":"GET","ret":"txt","paytoqs":false,"url":"https://sec.report","tls":"","persist":false,"proxy":"","authType":"","x":334,"y":384,"wires":[["5f1c3b20.c1cf2c"]]},{"id":"5f1c3b20.c1cf2c","type":"html","z":"5de8b0ff.e7f7a","name":"","property":"payload","outproperty":"payload","tag":"div.table .row .cell","ret":"text","as":"single","x":522,"y":384,"wires":[["f15144b9.0fd3a"]]},{"id":"87b90c36.4b8a78","type":"debug","z":"5de8b0ff.e7f7a","name":"","active":true,"tosidebar":true,"console":false,"tostatus":false,"complete":"false","x":862,"y":384,"wires":[]},{"id":"f15144b9.0fd3a","type":"function","z":"5de8b0ff.e7f7a","name":"","func":"m = msg.payload\no = []\nfor(x=0;x<m.length;x+=5){\n  \n  \n  company = m[x].trim()\n  state =m[x+1].trim()\n  inc = m[x+2].trim()\n  cik = m[x+3].trim()\n  updated = m[x+4].trim()\n \n  o.push({'company':company,'state':state,'inc':inc,'cik':cik,'updated':updated})\n  \n    \n}\no.shift()\nreturn {payload:o}","outputs":1,"noerr":0,"x":698,"y":384,"wires":[["87b90c36.4b8a78"]]}]

output

[
  {
    "company": "Culinary Craft Workshop, LLC",
    "state": "MD",
    "inc": "MD",
    "cik": "0001816758",
    "updated": "2020-07-02"
  },
  {
    "company": "Thomas Robert A",
    "state": "",
    "inc": "",
    "cik": "0001816757",
    "updated": "2020-07-02"
  },
  {
    "company": "Green Jeffrey D",
    "state": "",
    "inc": "",
    "cik": "0001816753",
    "updated": "2020-07-02"
  },
...
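
For anyone who wants to read the function node's logic without importing the flow, here is a rough, tidied-up sketch of the same idea. The helper name `cellsToRecords` and the `FIELDS` list are made up for illustration and are not part of the original flow. It assumes, as the original code does, that the HTML node delivers one flat array of cell texts with five cells per table row, and that the first five cells are the table's header row. Unlike the original loop, this sketch skips an incomplete trailing row instead of throwing on it.

```javascript
// Hypothetical field names, matching the keys used in the output above.
const FIELDS = ["company", "state", "inc", "cik", "updated"];

// Hypothetical helper: turns the flat array of cell texts into row objects.
function cellsToRecords(cells) {
  const records = [];
  // Walk the flat array one table row (five cells) at a time,
  // stopping before any incomplete trailing group.
  for (let i = 0; i + FIELDS.length <= cells.length; i += FIELDS.length) {
    const record = {};
    FIELDS.forEach((name, offset) => {
      // trim() strips surrounding whitespace and line breaks from the cell text.
      record[name] = cells[i + offset].trim();
    });
    records.push(record);
  }
  // Assumption: the first group of five is the header row, so drop it.
  records.shift();
  return records;
}

// Inside a Node-RED function node this could be used as:
//   msg.payload = cellsToRecords(msg.payload);
//   return msg;
```

Declaring the variables with `const` and `let` keeps them local to the function node instead of creating implicit globals, which is the main stylistic difference from the code embedded in the flow.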

@bakman2

:sunglasses: :partying_face:

This is Fantastic !

I am new to Node-Red and I have been trying to understand how to interact with the node flow and this is the best explanation that I have seen.

Just bypass the node processing with a function node.

I guess there are no For Loop Processing Nodes where you can specify the criteria and the for loop body statements ???

Thanks
nedstrader

@bakman2 would you say that there is something wrong with the HTML node where it is skipping those two div fields ?

Should we create a ticket ?

Should we create a ticket ?

No.

The empty divs are returned as line breaks.
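
A small illustration of why that is harmless once the cells are trimmed: a cell that contains only a line break still occupies its slot in the array, and `trim()` turns it into an empty string, so the five-cells-per-row grouping stays aligned. The sample values below are made up to mimic the shape of the data and are not copied from the page.

```javascript
// Made-up cell texts for one row where both state columns are empty.
const cells = ["Thomas Robert A", "\n", "\n", "0001816757", "2020-07-02"];

// trim() reduces the line-break-only cells to empty strings
// without changing the number of entries.
const cleaned = cells.map((text) => text.trim());

console.log(cleaned);
```

This matches the output above, where `state` and `inc` come back as `""` for rows without a state.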

I guess there are no For Loop Processing Nodes where you can specify the criteria and the for loop body statements ???

You could have used a split node as well, but because the data contains line breaks, it is awkward to handle with the split node.
