Why is extractFileText Much Slower than fileread for Text Files?

I'm fine with using fileread, but am curious why extractFileText is so much slower:
>> tic,for i=1:500,s1=extractFileText('sonnets.txt');end,toc
Elapsed time is 13.239366 seconds.
>> tic,for i=1:500,s2=string(fileread('sonnets.txt'));end,toc
Elapsed time is 0.401361 seconds.
>> isequal(s1,s2)
ans =
logical
1

Accepted Answer

Rik on 18 Dec 2020
fileread is very basic: it can only handle plain text files, and it will often not work for files that are encoded with UTF-8. That is why my readfile function isn't just function str=readfile(fn),str=cellstr(fileread(fn));end, although for simple files the two will probably be mostly equivalent.
extractFileText, on the other hand, has many more options. This makes the function very versatile (you can use it on pdf, doc, etc.), but it also means it has to check for all of those cases. It will probably handle different encodings better as well. Anything that makes a function more versatile tends to cost time.
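To illustrate the encoding point, here is a minimal sketch of reading a plain text file with an explicit encoding through fopen and fread. The function name readUtf8Text is hypothetical, and the choice of UTF-8 is an assumption about the file; this is not how extractFileText or the readfile function mentioned above work internally.

```matlab
function str = readUtf8Text(fn)
% readUtf8Text  Hypothetical helper: read a plain text file as UTF-8.
% Assumes the file really is UTF-8 encoded; no encoding detection is done.
fid = fopen(fn, 'r', 'n', 'UTF-8');   % 4th argument sets the character encoding
if fid < 0
    error('readUtf8Text:cannotOpen', 'Could not open file: %s', fn);
end
closer = onCleanup(@() fclose(fid));  % close the file even if fread errors
str = fread(fid, [1 Inf], '*char');   % decode the whole file into a row char vector
end
```

Calling s = string(readUtf8Text('sonnets.txt')) should give the same kind of result as the fileread line in the question, only with the encoding stated explicitly instead of left to the default. A leading byte-order mark, if the file has one, may still need to be stripped by hand.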
  2 Comments
Paul on 18 Dec 2020
That's about what I figured. I guess I was just surprised that it couldn't more quickly determine that the input is a plain text file.
On a side note, you mention that extractFileText can read many different file types. However, I can't find a list of supported file types on its doc page. All it shows is examples with .pdf and .txt files. I'm surprised that the doc page doesn't have the list of supported file types front and center.
Rik on 18 Dec 2020
I agree that would make more sense. I only know that it works with .doc files as well because of the comment that support for them will be removed in the future.
Just a note: I have not dug into what exactly is happening in that function (especially as I don't have the required toolbox), so my answer is mostly an educated guess.

Release

R2019a
