Mass replace text, then split the replaced text into multiple columns - from multiple csv files
1 view (last 30 days)
Show older comments
I have multiple csv files (Azure.csv, Blue.csv,...until 50th file) which has roughly this kind of data inside it:
ID, Duration, Team A, TeamAComposition, Team B, TeamBComposition
120404 3442 Azure Ae Bee Cee Dee Ee Blue Ef Gee Heij Ai Jei
302033 233 Azure Ae Bee Cee Dee Zet Cyan El Em En Ou Pee
...and so on
I would like to replace the team composition data into integers, so for example "Ae" will be replaced into "1", "Bee" will be "2", "Cee" into "3" and so on, and if possible convert those into 5 different new columns. So the new table is roughly like this
ID, Duration, Team A, AComp1, AComp2, AComp3, AComp4, AComp5, Team B, BComp1, BComp2, BComp3, BComp4, BComp5,
120404 3442 Azure 1 2 3 4 5 Blue 6 7 8 9 10
How can I do this? Thanks in advance.
6 Comments
Guillaume
on 2 Mar 2018
A comma or any other separator not present in the names would remove any ambiguity indeed. It's possible to do the splitting without that separator but if there's a chance of the problem above, then an explicit separator is safer.
Accepted Answer
Guillaume
on 5 Mar 2018
Solution for when heroes are just separated by a space:
In its own m file:
function varargout = split_heros(heroes, varargin)
%split multiple scalar strings (varargin) into heroes id
varargout = cell(1, numel(varargin)*5);
for col = 1:numel(varargin)
ishero = ~cellfun(@isempty, regexp(varargin{col}, heroes.regex));
varargout(5*col-4 : 5*col) = num2cell(heroes.hero_id(ishero));
end
end
To create the table:
heronames = readtable('hero_names.csv');
heronames.regex = strcat(heronames.localized_name, '( |$)'); %regular expression to help splitting list of names
%to wrap in a loop for each of the 50 files:
intable = readtable('DC vs EG.csv');
outnames = [sprintfc('AComp%d', 1:5), sprintfc('BComp%d', 1:5)]; %using undocumented sprintfc. in R2016b or later replace with compose
outtable = [intable(:, [1:6, 9, 11]), ...
rowfun(@(varargin) split_heros(heronames, varargin{:}), intable, 'InputVariables', {'TeamAHeroes', 'TeamBHeroes'}, 'NumOutputs', 10, 'OutputVariableNames', outnames)]
Solution for when heroes are separated by a comma:
In its own m file:
function varargout = split_heros(heroes, varargin)
[~, row] = ismember(strsplit(strjoin([varargin{:}], ','), ','), heroes.localized_name);
varargout = num2cell(heroes.hero_id(row));
end
To create the table:
heronames = readtable('hero_names.csv');
%to wrap in a loop for each of the 50 files, same code as before
intable = readtable('DC vs EG.csv');
outnames = [sprintfc('AComp%d', 1:5), sprintfc('BComp%d', 1:5)]; %using undocumented sprintfc. in R2016b or later replace with compose
outtable = [intable(:, [1:6, 9, 11]), ...
rowfun(@(varargin) split_heros(heronames, varargin{:}), intable, 'InputVariables', {'TeamAHeroes', 'TeamBHeroes'}, 'NumOutputs', 10, 'OutputVariableNames', outnames)]
6 Comments
Guillaume
on 7 Mar 2018
If you're getting an "In an assignment A(:) = B, ..." error most likely it's because for some reason the code didn't identify exactly 5 heros in a team. It either found more or less. As said, I didn't put any error checking. You could add the line
assert(sum(ishero) == 5, '%d heros found in team %d', sum(ishero), col);
after the ishero = ... line in split_heros.m to make the error message clearer.
When I test your 'CDEC vs VG.csv' file, I get a problem on the first row of the table because Barathrum is not in your 'hero_name.csv'. I suspect a similar problem happens for row 38 (where you're getting the error).
Note that the check for hero names that I've implemented for the 1st case is fairly crude. It also assumes that the hero names do not contain any of the following characters: ()[]{}|^$.+*\?, that is any of the characters that have special meaning in regular expressions.
If you want to check which heros are identified for a particular team of a particular row, e.g. team B of row 38:
heroes = intable{38, 'TeamBHeroes'}
ishero = ~cellfun(@isempty, regexp(heroes, heronames.regex));
foundheros = heronames(ishero, :)
which in my case shows that Monkey King is not found (since not in heronames)
More Answers (0)
See Also
Categories
Find more on Call Java from MATLAB in Help Center and File Exchange
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!