Read File node returns one line too many?

I have a file (on Linux) which contains 2 lines of text, 4 words, 20 characters

$ wc testdata
 2  4 20 testdata
$ cat testdata
A line
Another line

Each line is terminated by \n, there are no blank lines or spaces at the end:

$ od -c testdata
0000000   A       l   i   n   e  \n   A   n   o   t   h   e   r       l
0000020   i   n   e  \n
0000024

The Read File node, set to send a message per line, sends 3 messages:

A line

Another line

(empty string)

Is this spurious blank line the expected behaviour?

Yes - because you have a newline at the end of the file, so your file ends with a blank line.

No it doesn't. See the character count and the octal dump. There are two new line characters. There are two lines. The node outputs 3 messages.

Presumably the node identifies the end of line 1 by seeing the 1st newline character.
It identifies the end of line 2 by seeing the 2nd newline character.
Then it uses the 2nd newline character again to spit out a blank line.

I see a first line: A line, plus 2 \n newline characters === 3 lines.

Sorry, I disagree with that interpretation.
So does the Linux wc -l command which says the file has two lines.

Certainly it is possible, but not generally true, that the last line in a file has no \n.

I suggest that the node should identify a line by:
From the first character in the file, OR the first character after a \n or \r\n
To the next \n or \r\n, OR EOF
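That rule can be sketched as follows (a hypothetical helper, not the node's actual code; it treats \r\n the same as \n):

```javascript
// Proposed rule: a line runs from the start of the file (or the character
// after a terminator) to the next terminator or EOF. A terminator at EOF
// closes the last line; it does not open a new, empty one.
function splitLines(contents) {
  const parts = contents.split(/\r\n|\n/);
  // A trailing terminator leaves a final empty element: drop it.
  if (parts.length > 0 && parts[parts.length - 1] === "") {
    parts.pop();
  }
  return parts;
}

console.log(splitLines("A line\nAnother line\n")); // [ 'A line', 'Another line' ]
console.log(splitLines("A line\nAnother line"));   // [ 'A line', 'Another line' ]
```

Note that genuine blank lines in the middle of the file survive: splitLines("a\n\nb\n") still yields [ 'a', '', 'b' ]. Only the artificial final element is dropped.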

Edit - If you feed the Read File output to a Write File node, with "add new line to each payload", the output file is not identical to the input. It has \n\n at the end.

This is a very common thing...

And a selection of languages and command lines which don't count a final \n as another line (from a stackexchange answer a few years back)

$ cat -n testdata | tail -n 1 | cut -f 1
     2
$ awk 'END {print NR}' testdata
2
$ sed 's/.*//' testdata | uniq -c
      2
$ LINECT=0; while read -r LINE; do (( LINECT++ )); done < testdata ; echo $LINECT
2
$ perl -lne '}{ print $.' testdata
2
$ wc -l testdata
2 testdata

I don't disagree, but that's not how this is. Changing this behaviour could feasibly break existing flows.

Something I have always done throughout my years is to add a guard that checks if the line is empty (and disposes of it), so it doesn't really catch me out.

Let's put it this way (devil's advocate):

If you read the text file line by line into an array, then use array.join('\n'), it will be faithfully recreated (assuming Linux, or file systems / OSes that use \n). How else would you know there is a newline at the end? That newline might be important to some subsystems or applications.

I don't disagree, but that's not how this is.

That's fair enough.

Node-red has nodes to read a file line by line and write a file line by line.
Surely it should be possible to duplicate a file by reading it with the file read and writing it with the file write, without the need to consider it as an array or a string.

I don't think your example of splitting a string at new line characters is a valid way to count lines in a file.

For me it's a bug, and you can't really fail to fix a bug because people may be relying on it.

Oh, this was once recommended here as a way to obtain the last line of a file. Maybe it used to work differently?

I'll hush up now :face_with_open_eyes_and_hand_over_mouth:

errr - indeed that works ...

conway : /tmp $ wc a.txt
       2       4      20 a.txt
conway : /tmp $ wc a.new
       2       4      20 a.new

Indeed - if I have a file containing just A line with no \n - then wc -l says it has 0 lines... which is perverse...

The way I have always looked at it is "how many positions can the cursor be at in the file?"

In the case of your data - there are 3 positions the cursor can be
(screenshot: Windows Terminal showing the three cursor positions)

I don't consider it a bug but an implementation detail.

If others feel this is something that should be changed then it would (probably) not be until V4 since it is a breaking change.

I agree - my way of thinking is as a byte stream... but yes, it ends up the same as where the cursor is. You read the \n char, which moves the cursor to line 3 - then you close the file... The fact that there is nothing there doesn't mean you are not on line 3.

BUT - I can now - if I finally register the words read a file line by line and write a file line by line - see that in that case we do add an extra \n at the end of the file when it's not necessary. I then think it is something that is part of the file out node rather than the in node, as (per my streams thinking) at the point of the final \n in per-line mode we don't know that the next character doesn't exist.

So yes I think it could be accepted as an issue.

stops hushing, momentarily
I am glad you agree there is an issue, but I think you are wrong to ascribe it to the Write File node.

Another programmatic way to read a file a line at a time and count them: C getline()

$ cat try.c
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    FILE * fp;
    char * line = NULL;
    size_t len = 0;
    ssize_t read;
    int linesread = 0;

    fp = fopen("/home/pi/testdata", "r");
    if (fp == NULL)
        exit(EXIT_FAILURE);

    while ((read = getline(&line, &len, fp)) != -1) {
        printf("Retrieved line of length %zd:\n", read);
        printf("%s", line);
        linesread++;
    }
    printf("Read %i lines in total\n", linesread);

    fclose(fp);
    if (line)
        free(line);
    exit(EXIT_SUCCESS);
}
$ cc -o try try.c
$ ./try
Retrieved line of length 7:
A line
Retrieved line of length 13:
Another line
Read 2 lines in total

And you can remove the final new line from testdata, C still reads 2 lines

$ printf "%s\n%s" "A line with nl" "Another line no nl" > testdata
$ ./try
Retrieved line of length 15:
A line with nl
Retrieved line of length 18:
Another line no nlRead 2 lines in total

This is all Linux based; I don't know what the node[s] do on Windows, don't have NR installed.

I will try and raise an issue on github

Sorry - have to disagree on it being the in node... If you are setting it to return a line at a time - i.e. split on \n - then we do read it a char at a time and send the string as soon as we see the \n. So when we get to the next (non)character we need some way to say "oh, actually that was the end of the file" - so we need to send null. It needs to be different from a file whose final line doesn't end in \n.

When it gets to the out node we need to reconstruct it - and if the add newline box is checked we need to do that for every valid line - so all lines except the last null one - whereas currently we just add a newline regardless.
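A sketch of that write-side reconstruction (hypothetical - a null payload stands in for the end-of-file marker proposed above; this is not the node's actual code):

```javascript
// If the reader sends null after a file that ends in \n, the writer can
// restore the file exactly: join the real lines with \n and append a
// final \n only when the null sentinel is present.
function rebuild(messages) {
  const endedWithNewline = messages[messages.length - 1] === null;
  const lines = endedWithNewline ? messages.slice(0, -1) : messages;
  return lines.join("\n") + (endedWithNewline ? "\n" : "");
}

console.log(JSON.stringify(rebuild(["A line", "Another line", null])));
// "A line\nAnother line\n"
console.log(JSON.stringify(rebuild(["A line", "Another line no nl"])));
// "A line\nAnother line no nl" - no trailing \n
```

Either way a blank line in the middle of a file still round-trips, since it arrives as an ordinary empty-string payload rather than the null sentinel.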