Windows/Linux file reading with regexp inconsistency?

In my program, I am reading an input text file using the following code:
file = fileread(GAFname);
fileText = regexp(file, '\r\n|\r|\n', 'split')';
The text file can be anything, in this case let's use:
A
B
C
D
E
The text file is created on Windows using Notepad++. When this code runs on Windows, it splits the file into a 5x1 cell array, as expected. But when the same file is transferred to a Linux workstation, the result is a 6x1 cell array whose last cell is empty.
Does anyone know whether this is related to regexp, to how the text file was created (i.e. Windows vs Linux in terms of \r and \n), or to something else? Could it be because the version of MATLAB on the Linux workstation is 2018b?
I tried the following: I copied all but the last line into a Linux text editor, typed the last line manually in that editor, and saved. That did not fix the problem, so maybe it is not due to Windows vs Linux text editors.
Thanks!
Jesse

 Accepted Answer

An empty cell at the end is expected if the file ends with a newline. It is the old question of whether newline is a line terminator or a line separator. regexp with 'split' treats every match as a separator, so a newline at the very end of the file separates the last line of text from an empty character vector that follows it.
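The effect can be reproduced without any file at all. The following is a minimal sketch, not the poster's actual data: the two character vectors are hypothetical stand-ins for what fileread might return, one ending in a newline and one not.

```matlab
% Hypothetical stand-ins for file contents returned by fileread.
withTrailing    = sprintf('A\nB\nC\n');   % text ends with a newline
withoutTrailing = sprintf('A\nB\nC');     % text ends right after the last letter

% Same pattern as in the question.
partsWith    = regexp(withTrailing,    '\r\n|\r|\n', 'split')';
partsWithout = regexp(withoutTrailing, '\r\n|\r|\n', 'split')';

% The trailing newline yields one extra, empty cell.
fprintf('with trailing newline:    %d cells\n', numel(partsWith));
fprintf('without trailing newline: %d cells\n', numel(partsWithout));
fprintf('last cell empty:          %d\n', isempty(partsWith{end}));
```

If the two platforms give different counts for what is supposed to be the same file, comparing the last few characters of file on each machine, for example with double(file(end-2:end)), would show whether a trailing newline is present in one copy and not the other.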

4 Comments

There is a newline character even when there isn't another line after the last one?
Much of the time, programs put a newline (and possibly a carriage return as well) at the end of the last line, just before the end of file. Some programs end the last line without a newline, so the end of file comes immediately after the text.
Both MS Windows and Unix say that it is acceptable for files to just end without a newline or carriage return, but it is more common for there to be a newline just before the end of file.
As far as regexp with 'split' is concerned, newline is a separator, so text followed by newline followed by end of file is a case of the newline "separating" the text from an empty line after it. It is correct for regexp with 'split' to leave an empty character vector at the end.
Because the empty vector might or might not be there, after I use regexp 'split' for this purpose, I use
if isempty(fileText{end}); fileText(end) = []; end
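As a rough illustration of that test, the sketch below runs it on two hypothetical inputs, one with Windows line endings and a final newline, one with Unix line endings and none. The sample strings and the variable names inputs and results are assumptions made for the example.

```matlab
% Hypothetical file contents: CRLF with a final line ending, LF without one.
inputs  = {sprintf('A\r\nB\r\nC\r\n'), sprintf('A\nB\nC')};
results = cell(size(inputs));

for k = 1:numel(inputs)
    fileText = regexp(inputs{k}, '\r\n|\r|\n', 'split')';
    % Drop the final cell only when it is empty, i.e. when the text
    % ended with a line ending.
    if isempty(fileText{end})
        fileText(end) = [];
    end
    results{k} = fileText;
end

% Both inputs now give the same 3x1 cell array.
disp(isequal(results{1}, results{2}));
```

Note that this removes at most one trailing empty cell. A file that deliberately ends with several blank lines keeps all but the last of them.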
By the way, you can use:
fileText = regexp(file, '\r?\n', 'split')';
instead of your
fileText = regexp(file, '\r\n|\r|\n', 'split')';
unless you need to handle old Mac OS files from before OS X, which used a lone carriage return as the line ending. On Windows and Unix the carriage return will not always be there, but the newline will always be there, and any carriage return will come before the newline (though if you need to process old WordStar or WordPerfect files, I would not promise that the two will always be in that order).
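To make the difference between the two patterns concrete, here is a small sketch that applies both to three hypothetical strings, one per line-ending convention. It assumes that classic Mac OS text files use a lone carriage return; the struct array samples and the matrix counts are names invented for this example.

```matlab
% Hypothetical samples, one per line-ending convention.
samples = struct( ...
    'name', {'CRLF (Windows)', 'LF (Unix)', 'CR only (classic Mac OS)'}, ...
    'text', {sprintf('A\r\nB\r\nC'), sprintf('A\nB\nC'), sprintf('A\rB\rC')});

counts = zeros(numel(samples), 2);   % columns: short pattern, long pattern
for k = 1:numel(samples)
    counts(k, 1) = numel(regexp(samples(k).text, '\r?\n',       'split'));
    counts(k, 2) = numel(regexp(samples(k).text, '\r\n|\r|\n', 'split'));
    fprintf('%s: short pattern %d pieces, long pattern %d pieces\n', ...
        samples(k).name, counts(k, 1), counts(k, 2));
end
```

The two patterns agree on the CRLF and LF samples. Only the CR-only sample differs: the short pattern finds no match there and returns the whole text as a single piece.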
Thank you very much for the detailed response!
I now understand this much better. Should I just stick with fopen and the other file I/O functions in general, or do they have similar problems?
Just assume that the file might or might not end in a linefeed no matter how you read it, and do the isempty test that I showed.
