Read File node returns one line too many?

I have a file (on Linux) which contains 2 lines of text, 4 words, 20 characters

$ wc testdata
 2  4 20 testdata
$ cat testdata
A line
Another line

Each line is terminated by \n, there are no blank lines or spaces at the end:

$ od -c testdata
0000000   A       l   i   n   e  \n   A   n   o   t   h   e   r       l
0000020   i   n   e  \n
0000024

The Read File node, set to send a message per line, sends 3 messages:

A line

Another line

(empty string)

Is this spurious blank line the expected behaviour?

Yes - because you have a newline at the end of the file, so your file ends with a blank line.

No it doesn't. See the character count and the octal dump. There are two new line characters. There are two lines. The node outputs 3 messages.

Presumably the node identifies the end of line 1 by seeing the 1st newline character.
It identifies the end of line 2 by seeing the 2nd newline character.
Then it uses the 2nd newline character again to spit out a blank line.

I see a first line: A line, plus 2 \n newline characters === 3 lines.

Sorry, I disagree with that interpretation.
So does the Linux wc -l command which says the file has two lines.

Certainly it is possible, but not generally true, that the last line in a file has no \n.

I suggest that the node should identify a line by:
From the first character in the file, OR the first character after a \n or \r\n
To the next \n or \r\n, OR EOF
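That rule can be sketched as follows (a hypothetical helper, not the node's actual code; it treats \r\n the same as \n):

```javascript
// Proposed rule: a line runs from the start of the file (or the character
// after a terminator) to the next terminator or EOF. A terminator at EOF
// closes the last line; it does not open a new, empty one.
function splitLines(contents) {
  const parts = contents.split(/\r\n|\n/);
  // A trailing terminator leaves a final empty element: drop it.
  if (parts.length > 0 && parts[parts.length - 1] === "") {
    parts.pop();
  }
  return parts;
}

console.log(splitLines("A line\nAnother line\n")); // [ 'A line', 'Another line' ]
console.log(splitLines("A line\nAnother line"));   // [ 'A line', 'Another line' ]
```

Note that genuine blank lines in the middle of the file survive: splitLines("a\n\nb\n") still yields [ 'a', '', 'b' ]. Only the artificial final element is dropped.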

Edit - If you feed the Read File output to a Write File node, with "add new line to each payload", the output file is not identical to the input. It has \n\n at the end.

This is a very common thing...

And a selection of languages and command lines which don't count a final \n as another line (from a stackexchange answer a few years back)

$ cat -n testdata | tail -n 1 | cut -f 1
     2
$ awk 'END {print NR}' testdata
2
$ sed 's/.*//' testdata | uniq -c
      2
$ LINECT=0; while read -r LINE; do (( LINECT++ )); done < testdata ; echo $LINECT
2
$ perl -lne '}{ print $.' testdata
2
$ wc -l testdata
2 testdata

I don't disagree, but that's not how this is. Changing this behaviour could feasibly break existing flows.

Something I have always done throughout my years is to add a guard that checks if the line is empty (and disposes of it), so it doesn't really catch me out.

Let's put it this way (devil's advocate):

If you read the text file line by line into an array, then use array.join('\n'), it will be faithfully recreated (assuming Linux, or file systems / OSes that use \n). How else would you know there is a newline at the end? That newline might be important to some subsystems or applications.

I don't disagree, but that's not how this is.

That's fair enough.

Node-red has nodes to read a file line by line and write a file line by line.
Surely it should be possible to duplicate a file by reading it with the file read and writing it with the file write, without the need to consider it as an array or a string.

I don't think your example of splitting a string at new line characters is a valid way to count lines in a file.

For me it's a bug, and you can't really fail to fix a bug because people may be relying on it.

Oh, this was once recommended here as a way to obtain the last line of a file. Maybe it used to work differently?

I'll hush up now :face_with_open_eyes_and_hand_over_mouth:

errr - indeed that works ...

conway : /tmp $ wc a.txt
       2       4      20 a.txt
conway : /tmp $ wc a.new
       2       4      20 a.new

Indeed - if I have a file containing just A line with no \n - then wc -l says it has 0 lines... which is perverse...

The way I have always looked at it is "how many positions can the cursor be at in the file?"

In the case of your data - there are 3 positions the cursor can be
(screenshot: Windows Terminal showing the three cursor positions)

I don't consider it a bug but an implementation detail.

If others feel this is something that should be changed then it would (probably) not be until V4 since it is a breaking change.

I agree - my way of thinking is as a byte stream... but yes, it ends up the same as where the cursor is. You read the \n char, which moves the cursor to line 3 - then you close the file... The fact that there is nothing there doesn't mean you are not on line 3.

BUT - I can now - if I finally register the words read a file line by line and write a file line by line - see that in that case we do add an extra \n at the end of the file when it's not necessary. I then think it is something that is part of the file out node rather than the in node, as (per my streams thinking) at the point of the final \n in per-line mode we don't know that the next character doesn't exist.

So yes I think it could be accepted as an issue.

stops hushing, momentarily
I am glad you agree there is an issue, but I think you are wrong to ascribe it to the Write File node.

Another programmatic way to read a file a line at a time and count them: C getline()

$ cat try.c
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    FILE * fp;
    char * line = NULL;
    size_t len = 0;
    ssize_t read;
    int linesread = 0;

    fp = fopen("/home/pi/testdata", "r");
    if (fp == NULL)
        exit(EXIT_FAILURE);

    while ((read = getline(&line, &len, fp)) != -1) {
        printf("Retrieved line of length %zd:\n", read);
        printf("%s", line);
        linesread++;
    }
    printf("Read %i lines in total\n", linesread);

    fclose(fp);
    if (line)
        free(line);
    exit(EXIT_SUCCESS);
}
$ cc -o try try.c
$ ./try
Retrieved line of length 7:
A line
Retrieved line of length 13:
Another line
Read 2 lines in total

And you can remove the final new line from testdata, C still reads 2 lines

$ printf "%s\n%s" "A line with nl" "Another line no nl" > testdata
$ ./try
Retrieved line of length 15:
A line with nl
Retrieved line of length 18:
Another line no nlRead 2 lines in total

This is all Linux based; I don't know what the node[s] do on Windows, don't have NR installed.

I will try and raise an issue on github

Sorry - have to disagree on it being the in node... If you are setting it to return a line at a time - i.e. split on \n - then we do read it a char at a time and send the string as soon as we see the \n. So when we get to the next (non)character we need some way to say "oh, actually that was the end of the file" - so we need to send null. It needs to be different from a file whose final line doesn't end in \n.

When it gets to the out node we need to reconstruct it - and if the add newline box is checked we need to do that for every valid line - so all lines except the last null one - whereas currently we just add a newline regardless.
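A sketch of that write-side reconstruction (hypothetical - a null payload stands in for the end-of-file marker proposed above; this is not the node's actual code):

```javascript
// If the reader sends null after a file that ends in \n, the writer can
// restore the file exactly: join the real lines with \n and append a
// final \n only when the null sentinel is present.
function rebuild(messages) {
  const endedWithNewline = messages[messages.length - 1] === null;
  const lines = endedWithNewline ? messages.slice(0, -1) : messages;
  return lines.join("\n") + (endedWithNewline ? "\n" : "");
}

console.log(JSON.stringify(rebuild(["A line", "Another line", null])));
// "A line\nAnother line\n"
console.log(JSON.stringify(rebuild(["A line", "Another line no nl"])));
// "A line\nAnother line no nl" - no trailing \n
```

Either way a blank line in the middle of a file still round-trips, since it arrives as an ordinary empty-string payload rather than the null sentinel.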