read file with ascii characters and binary data on one single line

Hi,
I need to read data from a file that consists of approximately 1000 lines of ASCII header, followed by a line containing the delimiter "#!" and then the data (written as binary Int32). I can read the header using fgetl(fid) and also identify the delimiter, but when I try to continue reading from there using fread(fid,3,'int32'), the values are nonsense. Do I somehow lose the reading position?
If I cut the delimiter and the header of the file using a text editor, I can read the purely binary file with fread(fid,3,'int32'). But I would like to speed it up without manually manipulating the file. Any suggestions? Thanks a lot!
So far I'm trying this:
fn = 'test.nid';
fid = fopen(fn, 'r', 'ieee-be');
stillheader = 1;
k = 0; % number of lines read so far
while stillheader
    % store the current position in file
    headerendpos_ascii = ftell(fid);
    oneline = fgetl(fid);
    k = k+1;
    if isempty(strfind(oneline,'#!')) % it's not yet the delimiter
        % read relevant data from header ...
    else
        stillheader = 0;
    end
end
% now read the delimiter and the binary data
fseek(fid,headerendpos_ascii,'bof'); % move to beginning of line with delimiter
test = textscan(fid,'%c',2) % read the delimiter; now we are just past the delimiter #!
binarydata = fread(fid,200,'int32'); % read the first 200 values of the binary data
fclose(fid)

8 Comments

Are you accounting for the newline in the positioning? Looks to me like you moved to after the delimiter characters, but before the \n. What encoding does the file use--one or two?
I assume there is no newline after the delimiter, if I open it with a text editor the binary data starts on the same line.
The encoding is 'Windows-1252', did I understand your question correctly?
Usually, if a file contains binary, it needs to be all read as binary (even the text part). There are different ways to mark the end of text in a binary file (you can prepend the length of the text, you can mark the end with a \0, it can be fixed width, etc.). If you read it as text without knowing the way it's embedded, you run the risk of missing or skipping essential binary data. In particular, if the end of text is marked by a \0, you will read that as an invisible character as text.
Do you have the actual specification of the file format? Knowing how it is actually encoded we can tell you how to read reliably. Without that information, any solution you're given may work on your test file but break on some other file.
Can you attach a sample file?
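To make the point about end-of-text conventions concrete, here is a minimal sketch of the \0-terminated case. The byte stream is made up for illustration (it is not the .nid format); the idea is only that everything is read as bytes first and split afterwards.

```matlab
% Hypothetical byte stream: a NUL-terminated ASCII header followed by two int32 values.
header  = uint8('Points=2');
payload = typecast(int32([7 -1]), 'uint8');        % 8 bytes in native byte order
bytes   = [header, uint8(0), payload];

nulpos  = find(bytes == 0, 1, 'first');            % first NUL marks the end of the text
txt     = char(bytes(1:nulpos-1));                 % text part, converted from bytes
values  = typecast(bytes(nulpos+1:end), 'int32');  % binary part, reinterpreted as int32
```

A length-prefixed or fixed-width header would be split the same way, just with the boundary computed from the prefix or the known width instead of a search.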
Dear Guillaume,
Thanks for bringing up the point, I did not consider that there might be an end-of-the-line marker which I did not see. I'll have to check that.
I attached a .zip with an example file.
Best,
Beat
There is no special character (such as \0) at the end of the textual part, nor anything before it. I also doubt it's fixed size, so it does look like #! is indeed the marker for the end of the text. That's a bit dangerous (what if it appeared in the text?).
Have you got a specification for the format? Really, following the specs is the safest way to decode the file.
What should the first few values be (as int32) after the text in your example file?
Note that the number of bytes from the end of the #! to the end of file is not a multiple of 4, so we need to understand the format.
In the header I found:
"SaveMode=Binary
SaveBits=32
SaveSign=Signed"
and when reading the binary part (header and delimiter removed with notepad++) with fread(fid,3,'int32'), the output is correct. Therefore I'm quite sure that 'int32' is the correct format.
The beginning of the data part should be the three numbers 48053388, 49088328, 50666668
How did you find out about the number of bytes between #! and the end?
"The beginning of the data part should be the three numbers 48053388, 49088328, 50666668"
None of these numbers can be found anywhere in your attached file, encoded as int32 or uint32, in little-endian or big-endian. I don't see how you could ever get these numbers from the file.
Note that if you were editing the file with a text editor to crop the text, then your editor may well have changed the actual binary values. It is not safe to edit a binary file in a text editor.
I'll repeat my request for actual documentation of the format. It'd be a lot easier to understand how data is encoded.
"How did you find out about the number of bytes between #! and the end?"
fid = fopen('20190322_KBr_tipC5_07356.nid'); %open file in binary mode
bytes = fread(fid, [1 Inf], '*uint8'); %read the whole lot as bytes
fclose(fid);
binstart = strfind(bytes, '#!') + 2; %find location of #!. skip these two characters
numel(bytes) - binstart %number of bytes after the #!
ans =
327679
Not an even number.
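One caveat on the count, offered as a sketch rather than a verdict on the file: if binstart is the index of the first byte after the marker, the difference numel(bytes) - binstart leaves that byte out, so the inclusive count is one larger. For the figure printed above that would be 327680, which does divide by 4. The byte array below is synthetic and only illustrates the indexing.

```matlab
% Synthetic stand-in for the file: 6 header bytes, the marker, then 8 binary bytes.
bytes    = [uint8('header'), uint8('#!'), uint8(1:8)];
binstart = strfind(char(bytes), '#!') + 2;   % index of the first binary byte
nbin     = numel(bytes) - binstart + 1;      % inclusive count of bytes after '#!'
isWhole  = mod(nbin, 4) == 0;                % true if the bytes split into whole int32 values
```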
By the way:
>> strfind(bytes, typecast(int32(48053388), 'uint8')) %search 48053388 as int32, little endian
ans =
[]
>> strfind(bytes, typecast(uint32(48053388), 'uint8')) %search 48053388 as uint32, little endian
ans =
[]
>> strfind(bytes, fliplr(typecast(int32(48053388), 'uint8'))) %search 48053388 as int32, big endian
ans =
[]
>> strfind(bytes, fliplr(typecast(uint32(48053388), 'uint8'))) %search 48053388 as uint32, big endian
ans =
[]
Unfortunately, I cannot provide any documentation about the data file, I don't have any.
I agree, that working on such a file with the text editor is questionable. But on the other hand, like this I could import the data, plot it, and check that it is correct. The 3 numbers I provided above are part of the data set, which I checked like this.
Thanks a lot for your efforts and for the idea that the operation with the text editor could also have caused an error. I will use the byte-counting operations you provided above to investigate how the file changes its length during the whole operation.
I don't have time for this today, but I'll do it in the next days. Thanks a lot for your effort, I'll post it once I know more!


 Accepted Answer

As noted in the comment, it's a bizarre way to have done it, but the following seems to work...
>> l=fgetl(fid);while ~contains(l,'#!'), l=fgetl(fid);end % get to the delimiter line...
>> fseek(fid,-numel(l)+1,'cof'); % backup to just past the #!
>> fread(fid,3,'*int32')
ans =
3×1 int32 column vector
47465896
49587076
45505136
>>
These aren't quite the same values as the OP reports; probably he's looking at a different file than the one he posted.
NB: The position after the while loop depends on the data content: fgetl won't terminate until it finds a byte sequence that qualifies as a line terminator, and where that occurs depends on the actual data values. So, back up by the number of characters in the last read; the two terminator bytes offset the two indicator characters, and then add one. This depends on the Windows line-ending convention, which the file appears to follow.
It would be more robust to read character by character or, as I think G suggested, to read it all in as char() and do a text search for the magic character string, then decode the rest from that point.
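A sketch of that search-then-seek variant, under stated assumptions: the file below is a small stand-in written on the fly (not the real .nid file), the data is assumed to be little-endian int32, and the first '#!' in the file is assumed to be the marker. Because the offset comes from a byte search, it does not depend on where fgetl happens to stop.

```matlab
% Build a small stand-in file: CRLF text header, '#!' marker, then int32 data.
fn  = [tempname '.nid'];
fid = fopen(fn, 'w');
fwrite(fid, sprintf('Key=Value\r\n#!'), 'uint8');
fwrite(fid, int32([47465896 49587076 45505136]), 'int32', 0, 'ieee-le');
fclose(fid);

fid  = fopen(fn, 'r');                      % binary mode, no 't' flag
raw  = fread(fid, [1 Inf], '*uint8');       % whole file as bytes
pos  = strfind(char(raw), '#!');            % 1-based index of each '#!'
fseek(fid, pos(1) + 1, 'bof');              % 0-based offset of the first byte after the marker
vals = fread(fid, 3, 'int32=>int32', 0, 'ieee-le');
fclose(fid);
delete(fn);
```

The + 1 in the fseek call converts the 1-based index of '#' into the 0-based offset just past '!'.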

1 Comment

Great, this works! thanks a lot!
You're right, apparently I sent the three numbers of the scan file previous to the one, which I attached. Sorry for that!


More Answers (2)

Right, now that we've resolved that the published int32 values were incorrect, here is how I would parse the file.
fid = fopen('20190322_KBr_tipC5_07356.nid'); %open file in binary mode
bytes = fread(fid, [1 Inf], '*uint8'); %read the whole lot as bytes
fclose(fid);
binstart = strfind(bytes, '#!') + 2; %find location of #!. skip these two characters
text = char(bytes(1:binstart-3)); %header text, excluding the #! marker
bindata = typecast(bytes(binstart:numel(bytes)), 'int32');
Note that the above should be a lot faster than reading the file line by line.
Things to take into account:
  • It is assumed the text encoding is the same as the one used by MATLAB. Strange characters will appear if this is not the case. The file format specification would tell you what encoding is used. Modern software would most likely use Unicode.
  • It is assumed that numbers are encoded as 32-bit signed integers in little endian. Since the text header seems to specify the encoding, I assume it can vary. My method (and any of the other solutions) would completely break down the day you come across a file with a different encoding. You could at least check that with:
encodings = regexp(text, 'SaveMode=(\w+)\s+SaveBits=(\d+)\s+SaveSign=(\w+)\s+SaveOrder=(\w+)', 'tokens');
encodings = vertcat(encodings{:});
assert(all(strcmp(encodings(:, 1), 'Binary')), 'At least one of the channels is not encoded as binary');
assert(all(str2double(encodings(:, 2)) == 32), 'At least one of the channels is not encoded on 32 bits');
assert(all(strcmp(encodings(:, 3), 'Signed')), 'At least one of the channels is not signed');
assert(all(strcmp(encodings(:, 4), 'Intel')), 'At least one of the channels is not little endian');
  • To properly parse the file, you really need to parse the [Dataset] portions of the code and decode the binary according to the encodings above.
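As a rough sketch of what "decode according to the encodings" could look like for one dataset: the header values are mapped to a MATLAB class and a byte order before the bytes are reinterpreted. Only 'Binary', 32, 'Signed' and 'Intel' have been seen in this thread, so the handling of any other value below is a guess, and the input bytes are a synthetic stand-in.

```matlab
% Header fields as they would be parsed from the text (values seen in this thread).
saveBits  = 32;
saveSign  = 'Signed';
saveOrder = 'Intel';
raw       = typecast(int32([1 -2 3]), 'uint8');   % stand-in for one dataset's bytes

cls = sprintf('int%d', saveBits);                 % e.g. 'int32'
if ~strcmp(saveSign, 'Signed')
    cls = ['u' cls];                              % assumption: any other value means unsigned
end
data = typecast(raw, cls);                        % reinterpret in native (little-endian) order
if ~strcmp(saveOrder, 'Intel')
    data = swapbytes(data);                       % assumption: non-Intel means big endian
end
```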
edit: actually here is a better parser, based on the notes above, but still assuming fixed encoding
It is assumed that the order of the fields in the [Dataset] portion of the text is fixed (otherwise a more complex parsing of the text is required).
fid = fopen('20190322_KBr_tipC5_07356.nid'); %open file in binary mode
bytes = fread(fid, [1 Inf], '*uint8'); %read the whole lot as bytes
fclose(fid);
binstart = strfind(bytes, '#!') + 2; %find location of #!. skip these two characters
text = char(bytes(1:binstart-3)); %header text, excluding the #! marker
encodings = regexp(text, 'Points=(\d+)\s+Lines=(\d+).+?SaveMode=(\w+)\s+SaveBits=(\d+)\s+SaveSign=(\w+)\s+SaveOrder=(\w+)', 'tokens');
encodings = vertcat(encodings{:});
assert(all(strcmp(encodings(:, 3), 'Binary')), 'At least one of the channels is not encoded as binary');
assert(all(str2double(encodings(:, 4)) == 32), 'At least one of the channels is not encoded on 32 bits');
assert(all(strcmp(encodings(:, 5), 'Signed')), 'At least one of the channels is not signed');
assert(all(strcmp(encodings(:, 6), 'Intel')), 'At least one of the channels is not little endian');
datasetsizes = str2double(encodings(:, [1 2]));
assert(sum(prod(datasetsizes, 2)) == (numel(bytes) - binstart + 1)/4, 'File does not have the right length');
bindata = typecast(bytes(binstart:numel(bytes)), 'int32');
datasets = mat2cell(bindata, 1, prod(datasetsizes, 2))';
datasets = cellfun(@(v, s) reshape(v, s), datasets, num2cell(datasetsizes, 2), 'UniformOutput', false);

2 Comments

"To properly parse the file, you really need to parse the [Dataset] portions of the code and decode the binary according to the encodings above."
+1
I tested the solution presented above on different files of the same type and it works great except for a minor detail: sometimes the binary data coincidentally contains the same bytes as '#!', and then binstart contains more than one element. Adding an extra line to discard all elements except the first one solves the issue.
fid = fopen('20190322_KBr_tipC5_07356.nid'); %open file in binary mode
bytes = fread(fid, [1 Inf], '*uint8'); %read the whole lot as bytes
fclose(fid);
binstart = strfind(bytes, '#!') + 2; %find location of #!. skip these two characters
binstart=binstart(1); % discard further elements in binstart in case #! also appears in the binary data
text = char(bytes(1:binstart-3)); %header text, excluding the #! marker
encodings = regexp(text, 'Points=(\d+)\s+Lines=(\d+).+?SaveMode=(\w+)\s+SaveBits=(\d+)\s+SaveSign=(\w+)\s+SaveOrder=(\w+)', 'tokens');
encodings = vertcat(encodings{:});
assert(all(strcmp(encodings(:, 3), 'Binary')), 'At least one of the channels is not encoded as binary');
assert(all(str2double(encodings(:, 4)) == 32), 'At least one of the channels is not encoded on 32 bits');
assert(all(strcmp(encodings(:, 5), 'Signed')), 'At least one of the channels is not signed');
assert(all(strcmp(encodings(:, 6), 'Intel')), 'At least one of the channels is not little endian');
datasetsizes = str2double(encodings(:, [1 2]));
assert(sum(prod(datasetsizes, 2)) == (numel(bytes) - binstart + 1)/4, 'File does not have the right length');
bindata = typecast(bytes(binstart:numel(bytes)), 'int32');
datasets = mat2cell(bindata, 1, prod(datasetsizes, 2))';
datasets = cellfun(@(v, s) reshape(v, s), datasets, num2cell(datasetsizes, 2), 'UniformOutput', false);
Thanks a lot to all who contributed to the solution!


If you are on Windows and open the file with Notepad, a line feed (ASCII 10) is not enough to trigger a new line; you also need a carriage return (ASCII 13). Most other viewers (e.g. Notepad++) do not require CRLF and will also interpret LF as a newline.
As for the rest of your question, it might be easier to read everything as uint8 and then convert everything up to the first occurrence of [35 33] to text (a simple call to char should work with CP1252; for UTF-8 it is less trivial). Then you need to reinterpret the rest from uint8 to int32 with typecast.
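A minimal sketch of that idea, using a made-up byte stream rather than the real file. It uses native2unicode for the text part so the encoding is stated explicitly: char maps each byte to the code point of the same value, which agrees with CP1252 except for bytes 128 to 159, and for a UTF-8 header the same call with 'UTF-8' would be the equivalent step.

```matlab
% Made-up stream: 'Size=5' + micro sign (byte 181 in CP1252) + 'm', the marker, one int32.
bytes = [uint8([83 105 122 101 61 53 181 109]), uint8([35 33]), typecast(int32(42), 'uint8')];

stop = strfind(char(bytes), char([35 33]));                    % positions of '#!'
txt  = native2unicode(bytes(1:stop(1)-1), 'windows-1252');     % header bytes decoded as CP1252
data = typecast(bytes(stop(1)+2:end), 'int32');                % remaining bytes as int32
```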

9 Comments

I assume there is neither a line feed nor a carriage return, because I don't see a difference whether I use notepad++.
Thanks a lot for the suggestion to read it as a uint8, I'll try this! I'll let you know when I have news!
Best,
Beat
That's weird!!! The text portion of the file is conventional with two-character \n (0D 0A) and I was presuming that the delimiter line, being text, would also have the \n associated with it.
But, G, you're correct, it doesn't follow the pattern...
Maybe it's too nitpicky to comment on, but here it is nonetheless: \n is only x{0A}, the sequence you mention is the full CRLF, which in Matlab can be explicitly called with \r\n. If you use fopen with a t flag on a Windows machine Matlab will automatically convert \n to \r\n.
Yeah, it was in the ML convention to have \n "be whatever" to match the OS that I was implying in the context...but, strictly speaking, yes, you're correct and that was a needed inference by the reader, not expressed...
Mea Culpa! :)
No, certainly don't open the file with a 't' flag! Matlab may remove \r (byte value 13) from the binary stream as well. If it was a pure text file, yes, it would be better to use 't' mode.
I agree. In this case you absolutely should not use the text mode (I even suggest to read the text as binary in my answer). It is always good to remove these points of potential for misunderstanding. Personally I wouldn't use the t flag in any situation, since I prefer using \n on Windows as well, and I like my code to return a bit-perfect copy regardless of the OS (or even Matlab/Octave runtime).
So now it is my turn to say mea culpa ;)
Since Guillaume implemented my suggested solution, I think it is better to delete this answer (the balance tipping from 'preservation of discussion' to 'avoiding clutter'). If you disagree, please post a comment saying so.
I'm of the opinion that unless the answer is completely off base, it should be preserved. There's still some useful discussion here (about 't' mode).
Totally off topic here, but 't' mode is sometimes useful if you're going to use regexp on the text. For regexp a newline (for the dotexceptnewline option and the $ match) is always just \n, so removing the \r makes the search easier.
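For a file like this one, where 't' mode is off the table, a possible alternative is to read in binary mode and drop the \r characters from the header text only, before running regexp on it. A small sketch with a made-up header string (the field names are borrowed from the parser above):

```matlab
% Header text as it would come out of a binary-mode read, with CRLF line endings.
txt    = sprintf('Points=256\r\nLines=128\r\n');
clean  = strrep(txt, sprintf('\r'), '');                          % remove carriage returns only
tok    = regexp(clean, '^Lines=(\d+)$', 'tokens', 'lineanchors'); % ^ and $ match at line boundaries
nLines = str2double(tok{1}{1});
```

Since the stripping is applied to the extracted text and never to the byte array, any byte with value 13 in the binary part is left alone.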


Release

R2013b

Asked:

on 25 Mar 2019

Commented:

on 26 Mar 2019
