Why do I get "Out of memory." when reading only 16 chars?
Dear Matlab community,
I want to read specific parts from a large (> 20 GB) binary file. However, the command
tmpString=fread(fid,[1,16],'char=>char');
fails with "Out of memory." The command is applied very near to the beginning of the file (offset is 20 bytes).
Why do I get this error and how can I successfully read in my file?
Thank you for your suggestions,
Ed
17 Comments
Walter Roberson
on 25 Mar 2020
That is odd.
I wonder if it is trying to buffer the entire file?
Ohhh... Are you sure you want 20 characters and not 20 bytes? From R2020a, if you fopen without specifying an encoding and fread without specifying an encoding, then char for fread means UTF-8 decoded characters, which could be up to 4 bytes per character.
Ed Frank
on 25 Mar 2020
Walter Roberson
on 25 Mar 2020
Again, are you sure you want 20 UTF-8 characters and not 20 bytes, such as with 'uint8=>char'?
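A minimal sketch of the byte-wise read, assuming the field at offset 20 is 16 single-byte characters. The temporary file and its layout are made up for illustration and stand in for the real file:

```matlab
% Build a small stand-in file (hypothetical layout: 20 header bytes,
% then a 16-byte ASCII field, then more binary data).
fname = [tempname '.bin'];
fid = fopen(fname, 'w');
fwrite(fid, [zeros(1, 20, 'uint8'), uint8('HEADER_FIELD_016'), uint8(1:10)], 'uint8');
fclose(fid);

% Read exactly 16 bytes at offset 20 and convert each byte to one char.
% With a uint8 source, fread does not decode the data as text.
fid = fopen(fname, 'r');
fseek(fid, 20, 'bof');
tmpString = fread(fid, [1, 16], 'uint8=>char');
fclose(fid);
delete(fname);

disp(tmpString);
```

Compared with 'char=>char', the only change is the source type, so the number of bytes consumed is always exactly 16.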
Ed Frank
on 25 Mar 2020
Guillaume
on 25 Mar 2020
Which OS is this on? and which version of matlab?
"How did this work before R2020a?"
If not specified, matlab used your native encoding as set by your OS. Now it uses utf8 by default, which is much better (IMO).
"But does this affect my issue?"
Unlikely. But then as Walter said, what you're seeing is unexpected so who knows.
Ed Frank
on 25 Mar 2020
Walter Roberson
on 25 Mar 2020
I would experiment with fread uint8 for 80 bytes (the maximum valid to encode 20 utf8 characters) and use native2unicode and take the first 20 characters of the result.
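That experiment could look roughly like the following. It is only a sketch: the demo text and the temporary file are invented here, and it assumes the text really is UTF-8.

```matlab
% Hypothetical demo text containing a few multi-byte UTF-8 characters.
txt = ['Gr', char(246), char(223), 'e = 20 ', char(181), ...
       'm; padding text that runs well past twenty characters', ...
       repmat('.', 1, 40)];
fname = [tempname '.bin'];
fid = fopen(fname, 'w');
fwrite(fid, unicode2native(txt, 'UTF-8'), 'uint8');
fclose(fid);

% Read 80 raw bytes (20 characters at up to 4 bytes each) as a row vector.
fid = fopen(fname, 'r');
bytes = fread(fid, [1, 80], '*uint8');
fclose(fid);
delete(fname);

% Decode the bytes explicitly and keep only the first 20 characters.
decoded = native2unicode(bytes, 'UTF-8');
first20 = decoded(1:min(20, numel(decoded)));
disp(first20);
```

The 80-byte window may cut a multi-byte character at its end, but that cannot affect the first 20 decoded characters.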
It is really uncommon to specify utf8 for something identified as a binary file because binary files need to be particular about exact field widths. The exception would be for binary files in which fields are preceded by byte counts, in which case you would read the count and read that many uint8 and native2unicode that.
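For the length-prefixed case, a sketch might look like this. The layout (a 4-byte little-endian count followed by that many UTF-8 bytes) is an assumption for illustration, not something known about the file in question:

```matlab
% Hypothetical field: 4-byte little-endian byte count, then UTF-8 text.
payload = unicode2native(['caf', char(233), ' field'], 'UTF-8');
fname = [tempname '.bin'];
fid = fopen(fname, 'w');
fwrite(fid, numel(payload), 'uint32', 0, 'l');   % byte count prefix
fwrite(fid, payload, 'uint8');                   % the text bytes
fwrite(fid, uint8([1 2 3 4]), 'uint8');          % unrelated trailing data
fclose(fid);

% Read the count first, then exactly that many bytes, then decode.
fid = fopen(fname, 'r');
count = fread(fid, 1, 'uint32', 0, 'l');
fieldBytes = fread(fid, [1, count], '*uint8');
fclose(fid);
delete(fname);

fieldText = native2unicode(fieldBytes, 'UTF-8');
disp(fieldText);
```

Because the count is in bytes, the read stays aligned with the binary layout no matter how many characters the bytes decode to.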
Showing us the code before that line would help (from the point the file is open).
Walter may be onto something. Maybe the out of memory error comes from the unicode library parsing, if it's trying to decode text that is not UTF-8 as UTF-8. Do you know what the character encoding of your binary file is?
Do you get an out of memory error if you read 16 bytes, 32 bytes or 80 bytes as suggested by Walter, i.e. if you replace the above line by:
curloc = ftell(fid);                 % remember the current file position
fread(fid, 16, '*uint8');
disp('16 bytes read successfully');
fseek(fid, curloc, 'bof');           % rewind to the same position
fread(fid, 32, '*uint8');
disp('32 bytes read successfully');
fseek(fid, curloc, 'bof');
fread(fid, 80, '*uint8');
disp('80 bytes read successfully');
keyboard;                            % pause so the workspace can be inspected
Ed Frank
on 25 Mar 2020
Guillaume
on 25 Mar 2020
"Your test code really works"
Assuming that there's nothing confidential in there, can you post the full output of the 80 bytes read? If there's a bug with unicode decoding, I'd like to let mathworks know:
So, replace the offending line by
bytes = fread(fid, 80, '*uint8');
fprintf('%d ', bytes);
fprintf('\n');
and post/attach the contents of bytes.
Walter Roberson
on 25 Mar 2020
I just tested with that sequence of bytes on R2020a on MacOS High Sierra and encountered no difficulty. Do you encounter the same problem if you write just those bytes to a file?
Guillaume
on 26 Mar 2020
No issue for me either on Win10 with just that sequence. Can the whole file be made available somewhere?
Whatever the encoding, the fact that there are 0s intermixed with non-zeros in the first 16 bytes would indicate that it's not just text stored at that offset.
Ed Frank
on 26 Mar 2020
Guillaume
on 26 Mar 2020
"However what I could think of is that Matlab tries to guess the encoding"
No, as Walter mentioned, as of R2020a, if you don't specify the encoding, it's UTF-8. Prior to R2020a it was whatever native encoding your system uses. Matlab never tried to guess the encoding.
As Walter said, it's very unlikely that your file uses UTF-8 encoding for text (unless the text is prefixed by length information). It possibly doesn't use your native encoding either. When reading text from a binary file, it's always safer to read it as bytes (uint8) and then convert to the correct encoding with native2unicode.
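To illustrate why the encoding has to be chosen deliberately, here is a small in-memory sketch. The byte values are made up for the example and are not taken from the file discussed here:

```matlab
% Five raw bytes, as they might come back from fread(fid, 5, '*uint8').
raw = uint8([71 114 246 223 101]);

% Interpreted as ISO-8859-1, every byte is exactly one character.
asLatin1 = native2unicode(raw, 'ISO-8859-1');

% The same text needs 7 bytes in UTF-8, so the two encodings disagree
% about how many bytes a 5-character field occupies.
utf8Bytes = unicode2native(asLatin1, 'UTF-8');

fprintf('%d raw bytes, %d chars as ISO-8859-1, %d bytes as UTF-8\n', ...
        numel(raw), numel(asLatin1), numel(utf8Bytes));
```

Reading a fixed number of bytes and decoding afterwards keeps the field width under your control, whichever encoding turns out to be the right one.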
Walter Roberson
on 22 Apr 2020
It turns out that in R2020a, fopen now tries to do encoding detection: https://www.mathworks.com/help/matlab/ref/fopen.html#btrnibn-1-encodingIn
Historically, encoding detection for text being read by readtable() used to examine the first 10 kilobytes of the file; matters might be different for fopen()
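If detection at fopen is the trigger, one thing to try is passing an encoding explicitly as the fourth fopen argument. This is a sketch only: the single-byte encoding 'ISO-8859-1' and the stand-in file are assumptions, and nothing in this thread confirms that this avoids the error on the real file.

```matlab
% Stand-in file: 20 header bytes followed by a 16-byte ASCII field.
fname = [tempname '.bin'];
fid = fopen(fname, 'w');
fwrite(fid, [zeros(1, 20, 'uint8'), uint8('HEADER_FIELD_016')], 'uint8');
fclose(fid);

% Open with native byte order and an explicit single-byte encoding,
% so no encoding has to be inferred from the file contents.
fid = fopen(fname, 'r', 'n', 'ISO-8859-1');
[~, ~, ~, enc] = fopen(fid);          % query the encoding in effect
fseek(fid, 20, 'bof');
tmpString = fread(fid, [1, 16], 'char=>char');
fclose(fid);
delete(fname);

fprintf('encoding: %s, text: %s\n', enc, tmpString);
```

With a single-byte encoding, 'char=>char' consumes one byte per character, which matches what a fixed-width binary field needs.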
Ed Frank: if you are still interested, could you try starting by reading (say) one character, timing the fopen() and the first short fread(), and then the fread() of the next 79, to see whether the long time is spent in fopen(), in the first fread() of character data, or whether the position somehow triggers the delay?
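The timing test could be set up along these lines. The small ASCII file here is only a placeholder so the snippet runs on its own; on the real file you would replace fname with the actual path and skip the file creation and delete lines.

```matlab
% Placeholder file of 200 ASCII bytes.
fname = [tempname '.bin'];
fid = fopen(fname, 'w');
fwrite(fid, uint8(repmat('A', 1, 200)), 'uint8');
fclose(fid);

% Time each step separately to see where the delay occurs.
tic; fid = fopen(fname, 'r');                       tOpen  = toc;
tic; firstChar = fread(fid, 1, 'char=>char');       tFirst = toc;
tic; nextChars = fread(fid, [1, 79], 'char=>char'); tNext  = toc;
fclose(fid);
delete(fname);

fprintf('fopen: %.4f s, first fread: %.4f s, next 79: %.4f s\n', ...
        tOpen, tFirst, tNext);
```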
Guillaume
on 29 Apr 2020
Sorry, I've been too busy to follow the answers lately, but I have had conversations with MathWorks about text file parsing, and R2020a does indeed perform automatic character set detection, which is most likely the issue here. I'll post the details I've got from MathWorks support in an answer.
Accepted Answer