Improving speed of readtable

I have a large array stored in a .dat file (see Example.dat attached) and I need to import the array into MATLAB.
At the moment I am using the following approach to load the table and convert it to an array.
Example_Table = readtable("Example.dat");
Example_Array = table2array(Example_Table);
This process is, however, taking much longer than I would expect, given that I have a reasonably powerful PC.
I suspect that the issue is related to the array having a large number of zero entries.
The results of Run & Time are shown below.
It is clear that pretty much all of the time is spent reading the table, not converting it to an array.
The timing profile of table.readTextFile>textscanReadData is shown below, where all of the time appears to be spent on the TreatAsEmpty command (because of the many zero entries?).
Below is a snapshot of the CPU and RAM usage while the table is being read.
It is clear that a lot of computational power is going unused, so it should be possible to speed this process up one way or another.
How can I make this process run faster?
I have to read in lots of data like this and it is a very frustrating process.
Thanks in advance!

Accepted Answer

Walter Roberson on 14 Apr 2021
"Where all of the time is spent on the TreatAsEmpty command (because of having many zero entries?)"
No, that is not what is happening.
What is happening is that MathWorks coded the internal call to textscan() split across two lines:
data = textscan(fid, format, a bunch of stuff here, ...
'TreatAsEmpty', treatasempty, a bunch more stuff here)
The time for the overall call is accounted to the last line of the call, as the textscan() call itself cannot start until all of the parameters have been executed.
For this purpose, "executed" for something like
treatasempty = [];
textscan('TreatAsEmpty', treatasempty)
would consist of parsing the character vector 'TreatAsEmpty', creating a temporary (unnamed) expression block for it, and pushing that onto the parameter list; and then parsing the variable name treatasempty, locating the variable in scope, and pushing its (named) expression block onto the parameter list. Those operations might not take long, but they take some time, and that time is spent preparing to call textscan() without yet having called it. The time to parse the parameters and get ready for the call is what is shown on line 554, the data = textscan( part.
'TreatAsEmpty' is not a command in this context: it is just a literal constant to be prepared and passed to the function.
The timing you are seeing for line 555 is the time spent executing textscan() itself.
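To make the attribution concrete, here is a minimal sketch (not from the original answer; the '%f' format and the 'NA' placeholder are made up for illustration). Run it from a script or function so the profiler has line numbers to charge against:
% Sketch: a textscan() call continued across two lines with '...'.
% The profiler charges argument preparation to the first line of the
% statement and the execution of textscan() itself to the last line,
% even though 'TreatAsEmpty' does no work of its own.
profile on
fid = fopen('Example.dat', 'r');
data = textscan(fid, '%f', ...              % argument prep charged here
    'TreatAsEmpty', {'NA'});                % textscan() execution charged here
fclose(fid);
profile viewer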
  7 Comments
Walter Roberson on 14 Apr 2021
If you have a matrix of data, D, then
tic
[r, c, s] = find(D);                        % row indices, column indices, nonzero values
as_rows = [r, c, s].';                      % one triplet per column: 3 rows, nnz(D) columns
dlmwrite('Example_sparse.csv', as_rows, 'precision', 16);   % keep full double precision
toc
About 1.3 seconds to write.
Restore with:
tic
S = fileread('Example_sparse.csv');         % slurp the whole file as text
rcs = str2num(S);                           % parse into a 3 x nnz matrix of triplets
Drestored = sparse(rcs(1,:), rcs(2,:), rcs(3,:));   % rebuild as a sparse matrix
time_via_sparse = toc
About 1.5 seconds to read.
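One caveat worth adding as an editorial note (not part of the original comment): sparse(r, c, s) infers the matrix size from the largest indices present, so trailing all-zero rows or columns of D are silently dropped. A sketch that also records the size, by prepending one extra [m; n; 0] triplet:
[m, n] = size(D);
[r, c, s] = find(D);
as_rows = [[m; n; 0], [r, c, s].'];         % first column stores the matrix size
dlmwrite('Example_sparse.csv', as_rows, 'precision', 16);
% restore, passing explicit dimensions to sparse()
rcs = str2num(fileread('Example_sparse.csv'));
Drestored = sparse(rcs(1,2:end), rcs(2,2:end), rcs(3,2:end), rcs(1,1), rcs(2,1));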
Daniel van Huyssteen on 14 Apr 2021
This works great! Thanks! :D


More Answers (1)

Bjorn Gustavsson on 14 Apr 2021
It is a rather large data file to read. You might reduce the read time if you use load instead of readtable: that should avoid all sorts of overhead associated with readtable's capacity to handle many different data formats.
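For a purely numeric file, the load route is a one-liner; a minimal sketch (assuming Example.dat is a plain rectangular table of delimited numbers):
Example_Array = load('Example.dat', '-ascii');   % plain double matrix, no table machinery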
If you have the capacity to modify the data format of your files, that might be a far more successful way forward given very sparse data: you might be better off saving the non-zero components together with their row and column indices, and handling that when reading the data, instead of saving a large number of zeros. But maybe you're given the data and have to shovel zeros around...
HTH
  2 Comments
Daniel van Huyssteen on 14 Apr 2021
Thanks for the answer.
Unfortunately I'm not able to manipulate the writing/storage of the original data.
Curiously, using load instead of readtable takes even longer!
Bjorn Gustavsson on 14 Apr 2021
That's a double bummer. I'm really surprised that load takes longer; I would've bet good money that the more general capacity of readtable would cost time. Then perhaps you can save overall processing time by following Walter's suggestion of converting the data files to a sparse format. You might be able to bulk-process all the data files overnight, when it doesn't test your patience...
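A sketch of such an overnight bulk conversion (the *.dat pattern and the _sparse.csv naming are hypothetical; the triplet format reuses Walter's accepted answer):
files = dir('*.dat');
for k = 1:numel(files)
    D = load(files(k).name, '-ascii');      % read the dense original once (readtable would also do)
    [r, c, s] = find(D);                    % nonzero triplets
    [~, base] = fileparts(files(k).name);
    dlmwrite([base '_sparse.csv'], [r, c, s].', 'precision', 16);
end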


Release

R2018a
