Program architecture for handling large .csv files
So I am a bit concerned. I just spent a week writing a MATLAB program that takes in a large amount of data from a .csv file.
When I started the project I opted for the table variable type because of its ease of use with functions like readtable().
However, I am now stuck with an unusable program. It takes hours for MATLAB to process my .csv files.
It appears MATLAB has not yet optimized performance for the table data type, with no fix in sight.
I have considered the following options:
1: Re-write the entire program using arrays.
Problem: the data in my file is all hex values. I will need to convert them to decimal numbers using the hex2dec function, which only works with a char data type. So using an array of doubles is out of the question. Not sure where to go here.
2: Try to re-write the program using the Parallel Computing Toolbox.
Peter Perkins on 28 Jul 2021
Edited: Peter Perkins on 28 Jul 2021
With no code at all to go on, it's pretty hard to give specific advice.
Hex issues aside, the first advice I would give would be to write vectorized code. The fact that you have 50k calls to subscripting suggests that you are doing scalar operations in a tight loop. That's not the best way to write code in MATLAB. Again, not much information to go on, so hard to say. So many calls to brace subscripting suggests that you are doing assignments to one variable using braces, which, as Walter points out, is slower than using dot (because it has to do a lot more in general). That's an unfortunate difference that is not highlighted in the doc, but perhaps should be. In any case, no code to go on, so ...
Walter is correct that since 18b, performance, especially for assignments into large tables, has improved quite a bit.
If you can't vectorize, and you can't upgrade to a newer version, it's probably not necessary to "re-write the entire program using arrays". It's usually possible to focus only on the tight loop and "hoist" some of the variables out of the table for that part of the code, then put them back in. Use tables for the organization and convenience they provide, use raw numeric in small doses for performance. Tables have a lot going on, and will never be as fast as double arrays. That doesn't mean you should avoid them.
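As a rough illustration of the vectorizing and "hoisting" advice above, here is a minimal sketch. The table, its variable names (HexText, Value, RunningSum), and the running-sum loop are all made up for the example, since the original code was not posted.

```matlab
% Hypothetical data: a text column of hex values and a numeric result column.
T = table({'1A'; 'FF'; '10'}, zeros(3,1), 'VariableNames', {'HexText', 'Value'});

% Slow pattern to avoid: scalar brace subscripting inside a tight loop, e.g.
%   for i = 1:height(T), T{i,'Value'} = hex2dec(T{i,'HexText'}); end

% Vectorized alternative: convert the whole column with one hex2dec call
% and assign it with dot subscripting.
T.Value = hex2dec(T.HexText);

% If a loop really is needed, hoist the column out of the table, loop over
% the plain variables, and put the result back into the table once.
hexText = T.HexText;                 % hoisted cell array of char vectors
running = zeros(height(T), 1);       % preallocated plain double array
acc = 0;
for i = 1:height(T)
    acc = acc + hex2dec(hexText{i});
    running(i) = acc;
end
T.RunningSum = running;              % single assignment back into the table
```

The table is still used for organization; only the inner loop touches raw arrays.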
It really helps to provide concrete code examples.
Walter Roberson on 27 Jul 2021
In some cases, depending on the format of the file, you have some options of how to proceed:
- readtable() and readmatrix() and readcell() all permit a 'Format' option, using the same format specifications as are used by textscan() -- including the potential to use the %x format (possibly with a length specification, if your fields are fixed width).
- you could use textscan() directly, since you are working with text files
- you could use lower-level I/O commands, including fscanf() or fgetl() with sscanf(), depending how complicated your files are. If your format is complicated enough that you effectively need to read one line at a time, then this might be lower performance
- When your file is not super-complicated but does have different sections, there are surprisingly high performance gains to be had by reading the entire file in as text, using regexp() to break it up into subsections, and then running textscan() or sscanf() on the subsections. Those gains are relative to looping over the file and testing each line.
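The last option above can be sketched roughly as follows. The section headers ([A], [B]), the two-values-per-row layout, and the file name in the comment are assumptions made up for the illustration; the real file format was not shown.

```matlab
% In practice the text would come from the file in one read, e.g.
%   txt = fileread('data.csv');      % hypothetical file name
% A small inline sample is used here so the sketch is self-contained.
txt = sprintf('[A]\n1A,FF\n10,0B\n[B]\n7F,80\n');

% Split the text on the (assumed) section headers and drop the empty
% piece that precedes the first header.
sections = regexp(txt, '\[\w+\]\s*', 'split');
sections = sections(~cellfun(@isempty, sections));

% Parse each section with a single sscanf call; %x reads hexadecimal integers.
data = cell(size(sections));
for k = 1:numel(sections)
    vals = sscanf(strrep(sections{k}, ',', ' '), '%x');  % commas -> spaces
    data{k} = reshape(vals, 2, []).';   % assumed: two values per row
end
```

The loop runs once per section rather than once per line, which is where the savings would come from.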
Jeremy Hughes on 28 Jul 2021
Edited: Jeremy Hughes on 29 Jul 2021
First, it would help to see an example file, and some sample code that demonstrates the problem.
Here's the best I can say with what I see.
If the entire variable is in hex format, you can use import options to do that conversion quickly on import.
detectImportOptions will detect a value such as 0x1a as hex, but it will not detect hex values that lack the 0x prefix. Those can still be read as hex if you ask for it explicitly.
opts = detectImportOptions(filename, "Delimiter", ",");
% Columns to read as hex, given by index or by name (example values only):
% varNamesOrNumbers = ["Var1","Var3"];
varNamesOrNumbers = [1 3 5];
opts = setvaropts(opts, varNamesOrNumbers, "NumberSystem", "hex", "Type", "auto");
% You can also improve reading performance by selecting only the columns
% you would like (this is optional)
opts.SelectedVariableNames = opts.VariableNames([1 3 5 7 9]);
T = readtable(filename, opts);