How to delete similar data from a big data file?
I have a big data file consisting of 100000 rows and 5 columns. Within this file there are rows that are similar to each other, so I need to keep only one row from each group of similar rows and delete the others. For example, if 5 rows have the same data, I need a line of code to choose the row with the maximum value among the five and delete the other 4; and if all five are identical, delete 4 so that only one remains.
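A small illustration of the behavior being asked for (the column roles are assumed here; in the answers below, the first columns identify "similar" rows and the last column holds the value to maximize):

```matlab
% hypothetical 4-row illustration, not the real 100000x5 file:
% column 1 is the key that makes rows "similar", column 2 the value
data = [1 10;
        1 30;   % same key as the rows around it: keep this one (max value)
        1 20;
        2  5];
% desired result: one row per key, the row with the largest value
% [1 30;
%  2  5]
```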
6 Comments
per isakson
on 13 Jan 2020
You received two good answers. However, one question: does the order of the rows in the result matter?
Your sample data, in more readable form, is:
%%
data = [
1970 30 30 4
1971 30.5 30.5 4.2
1970 31 32 4.2
1972 32 33 5
1972 32 33 5
1973 32.1 32.2 5
1974 34 35 4.5
];
Sindar
on 13 Jan 2020
Ah, yes, I forgot to mention that both my answers will sort by the first three columns (year, then x, then y).
Answers (2)
Sindar
on 12 Jan 2020
Alright, here's my solution. Its speed depends on how often you expect duplicates (controlled in the test by test_vals). I ran a test with very few duplicates (test_vals = 200, only 600/1e5 duplicate rows) and it took 8 s. With more duplicates it runs faster. There are probably ways to speed up the idx search, but it seems fast enough.
% setting up some test data, skip for your real data
test_rows = 1e5;
test_vals = 5;
% make a test matrix with random integers 1-test_vals in first three columns, random number in last
data = [randi(test_vals,test_rows,3) rand(test_rows,1)];
% find duplicate data in the first three columns
% ia lists the row # in data of the first row in each duplicate
% ic lists the duplicate-group of each row in data
[~,ia,ic] = unique(data(:,1:3),'rows');
% preallocate the list of rows to keep
idx = zeros(length(ia),1);
% loop over the duplicate-groups
for ind=1:length(ia)
% skip ahead if no duplicates (comment out if you expect few unique rows)
if nnz(ic==ind)==1
idx(ind)=ia(ind);
else
% data(:,4).*(ic==ind) takes the z-column, and zeros it except for the current duplicate-group
% then, stores the index of the max z row in idx
[~,idx(ind)] = max(data(:,4).*(ic==ind));
end
end
% select only the max-z rows for each duplicate-group
clean_data = data(idx,:);
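As a side note (not part of the thread): the same result can be had without the loop. The sketch below assumes, as above, that the first three columns are the key and column 4 is z; sorting z in descending order within each key lets unique keep the max-z row, since unique keeps the first occurrence by default.

```matlab
% sort ascending on the key columns, descending on z (note the -4)
sorted = sortrows(data, [1 2 3 -4]);
% unique keeps the first occurrence of each key, i.e. the max-z row
[~, ia2] = unique(sorted(:,1:3), 'rows');   % occurrence 'first' is the default
clean_data2 = sorted(ia2, :);
```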
3 Comments
Sindar
on 13 Jan 2020
Actually, it can't. We need the index into the whole array, not into the subset that logical indexing gives you. There may be a better way than mine, but that isn't it.
per isakson
on 13 Jan 2020
Edited: per isakson
on 13 Jan 2020
Yes indeed, that's wrong. Isn't it even syntactically incorrect? It ought to have thrown an error.
I should have written
data( ic==ind, 4 )
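One way to recover the whole-array index that the loop above needs, as a sketch: restrict the rows with find first, then map the subset index back through it.

```matlab
% whole-array row numbers belonging to the current duplicate-group
rows = find(ic==ind);
% index of the max z *within the subset*
[~, k] = max(data(rows, 4));
% map the subset index back to a whole-array row number
idx(ind) = rows(k);
```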
Sindar
on 12 Jan 2020
Edited: per isakson
on 13 Jan 2020
Looking into it a bit (I have a similar problem), it seems there is a way using the "Split-Apply-Combine Workflow". It is faster for many duplicate groups but slower for few (in my in-memory tests), but I think it should scale much better to distributed-memory and/or parallel computation.
% lists the duplicate-group of each row in data, equal to ic, above
G = findgroups(data(:,1),data(:,2),data(:,3));
% for each group, apply the maxrow function below:
% it takes the 4th column of data (your z's), finds the index of the max,
% then returns that group's full max-z row
clean_data = splitapply(@(x1) maxrow(x1(:,4),x1),data,G);
% return the row of full_data corresponding to the index of the maximum of sort_vec
function x = maxrow(sort_vec,full_data)
[~,idx] = max(sort_vec);
x = full_data(idx,:);
end