speed up renamecats/categorical multiple columns
    5 views (last 30 days)
  
       Show older comments
    
I have a huge csv file of about 16GB which over 9k columns. Each column is initially filled with some codes (either integer or string), and I have a code book with code and meaning for each column. What I'm trying to do is to translate the table and finally have a table that has readable texts instead of codes.
I can use either categorical or renamecats to "translate" them, but the issue is that it takes substentially long time to loop through these columns. I'm thinking if there is a way to speed this up.
See below an example
tbl = table(["a1", "b2", "c3", "d4", "e5"]', ...
    ["123", "234", "345", "456", "567"]', ...
    'VariableNames', {'A', 'B'});
dictionary.A = table(["a1", "b2", "c3", "d4", "e5"]', ...
    ["apple", "banana", "cat", "dog", "elephont"]', ...
    'VariableNames', {'Code', 'Meaning'});
dictionary.B = table(["123", "234", "345", "456", "567"]', ...
    ["East", "West", "North", "South", "Middle"]', ...
    'VariableNames', {'Code', 'Meaning'});
Vars   = tbl.Properties.VariableNames;
for iC = 1:width(tbl)
    tbl.(iC) = categorical(tbl.(iC), dictionary.(Vars{iC}).Code, ...
        dictionary.(Vars{iC}).Meaning);
end
Is that possible to avoid this loop, or any suggestions to speed this up (considering that I have over 500k rows and 9k columns).
Thank you!
0 Comments
Answers (1)
  Campion Loong
    
 on 9 Oct 2020
        Hi Peng,
It seems you have the Dictionary code book to boot, and you already know which sets of code go wtih which field/name in the Dictionary (i.e. you can designate "VariableNames" in the first table(...) call).
In this case, why not create the table with categorical to begin with:
tbl = table(categorical(["a1"; "b2"; "c3"; "d4"; "e5"],      dictionary.A.Code, dictionary.A.Meaning),...
            categorical(["123"; "234"; "345"; "456"; "567"], dictionary.B.Code, dictionary.B.Meaning),...
            'VariableNames', {'A', 'B'});
There is no loop, faster and much more readable.
3 Comments
  Campion Loong
    
 on 9 Oct 2020
				If you have thousands of columns, are you actually reading it from a file or a source somewhere? I struggle to imagine that could be manageable if you're making the first table call manually on thousands of columns.
If you are reading or importing, check out ImportOptions -- it gives you much more flexibility before actually reading the data in:
See Also
Categories
				Find more on Large Files and Big Data in Help Center and File Exchange
			
	Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!
