FILTER BIG DATA SET

4 views (last 30 days)
SAPTORSHEE KANTO
SAPTORSHEE KANTO on 30 Aug 2024
Edited: Subhajyoti on 30 Aug 2024
Dear All,
I have Survey data for six years each year containing 26 variables and more than 2 million rows [2533835x28 table].
I would like to filter the entire dataset by using some entries from the variable (c33) for the values
[1111 1112 1113 1114 1115 1116 1117 1118 1119 1121 1122 1123 1124 1131 1132 1133 1134 1135 1136 1137 1139 1140 1150 1161 1162 1169 1191 1193 1199 1210 1221 1222 1223 1224 1225 1229 1231 1232 1233 1239 1241 1242 1243 1249 1251 1252 1259 1261 1262 1269 1271 1272 1273 1279 1281 1282 1283 1284 1285 1286 1287 1291 1292 1293 1299 1301 1302 1309 1411 1412 1413 1420 1430 1441 1442 1450 1461 1462 1463 1491 1492 1493 1499 1500 1611 1612 1619 1620 1631 1632 1633 1639 1640 1700 2101 2102 2109 2201 2202 2203 2209 2301 2302 2303 2309 2401 2402 3111 3112 3113 3121 3122 3211 3212 3214 3215 3219 3221 3222 3223 3229 ].
Can anyone guide me how to filter the data.
  2 Comments
KSSV
KSSV on 30 Aug 2024
What do you mean by fitler the data? You may use logical indexing like ==, >, < etc.
SAPTORSHEE KANTO
SAPTORSHEE KANTO on 30 Aug 2024
I don't know how to do that. I know it in excel but since this does not fit in excel. I am unable to do it.

Sign in to comment.

Answers (2)

Star Strider
Star Strider on 30 Aug 2024
Your qquestion is a bit ambiguous.
If you want to match thee elements of the data you posted to elements of your matrix, one option is to use the ismember function (since they all appear to be integers, ii they are actually floating-point numbers instead, use ismembertol wiith a simiiilar calling syntax).
Try something like this —
V = [1111 1112 1113 1114 1115 1116 1117 1118 1119 1121 1122 1123 1124 1131 1132 1133 1134 1135 1136 1137 1139 1140 1150 1161 1162 1169 1191 1193 1199 1210 1221 1222 1223 1224 1225 1229 1231 1232 1233 1239 1241 1242 1243 1249 1251 1252 1259 1261 1262 1269 1271 1272 1273 1279 1281 1282 1283 1284 1285 1286 1287 1291 1292 1293 1299 1301 1302 1309 1411 1412 1413 1420 1430 1441 1442 1450 1461 1462 1463 1491 1492 1493 1499 1500 1611 1612 1619 1620 1631 1632 1633 1639 1640 1700 2101 2102 2109 2201 2202 2203 2209 2301 2302 2303 2309 2401 2402 3111 3112 3113 3121 3122 3211 3212 3214 3215 3219 3221 3222 3223 3229 ];
size(V)
ans = 1x2
1 121
<mw-icon class=""></mw-icon>
<mw-icon class=""></mw-icon>
A = array2table(randi([1000 3300], 10, 12)) % Create Data (Matrix Of Random Integers)
A = 10x12 table
Var1 Var2 Var3 Var4 Var5 Var6 Var7 Var8 Var9 Var10 Var11 Var12 ____ ____ ____ ____ ____ ____ ____ ____ ____ _____ _____ _____ 2110 2585 1198 2064 3155 2709 1449 2709 2238 1018 2600 2290 2201 1032 1176 2623 1915 2429 1906 1978 1789 2992 1423 3089 1608 2024 2759 1469 2524 2230 2074 1745 1802 1093 1149 1781 2954 2179 1207 2837 1068 2004 3191 1506 1410 1159 2053 1094 1922 2337 1159 1111 1555 2747 2276 1487 2496 2767 2842 2015 2290 2044 2748 2057 1115 2614 1242 2159 2649 1123 3133 1659 2065 1184 3041 1659 3145 2259 1259 1947 2734 1185 1779 1964 1690 1685 2973 1169 3085 2625 2779 1734 3182 2406 2794 1391 2462 1847 2635 1632 3193 1399 2061 1030 2919 2101 2077 2694 3283 2817 3089 3193 1296 2292 1883 1880 1996 3289 3125 3191
Aa = table2array(A);
Lm = ismember(Aa, V) % Logical MAtrix Of Locations
Lm = 10x12 logical array
0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[r,c] = find(Lm); % Return Numeric Indices
rc = [r c] % Row & Column Indices Of Matching Values
rc = 9x2
2 1 5 4 8 4 9 4 6 5 6 7 7 7 6 10 9 10
<mw-icon class=""></mw-icon>
<mw-icon class=""></mw-icon>
.

Subhajyoti
Subhajyoti on 30 Aug 2024
Edited: Subhajyoti on 30 Aug 2024
To filter your dataset in MATLAB based on specific entries in a particular variable, you can use logical indexing.
Here, in the following code, I have generated a dummy data-table, and performed filtering operations on numerical and string data types.
% Create a data table of size 10x4
% Columns x1, x2, x3, x4 of data type string, double, double, boolean
num_rows = 10000000;
t = table;
% random data
x1 = string(randi([1, 10], num_rows, 1));
x2 = randi([1, 10], num_rows, 1);
x3 = randi([1, 10], num_rows, 1);
x4 = randi([0, 1], num_rows, 1);
% assign data to table
t.x1 = x1;
t.x2 = x2;
t.x3 = x3;
t.x4 = logical(x4);
  • Use logical indexing to filter data where 'x3' is less than 5:
tic
%---------------------------------------%
filtered_data1 = t(t.x3 < 5, :);
%---------------------------------------%
toc
Elapsed time is 0.322336 seconds.
disp("Time taken to filter data using logical indexing: " + toc + " seconds")
Time taken to filter data using logical indexing: 0.3353 seconds
  • Use 'ismember' to filter data in 'x1' which are member of given array
tic
%---------------------------------------%
filterValues = ["1", "2", "3", "4", "5"];
filtered_data2 = t(ismember(t.x1, filterValues), :);
%---------------------------------------%
toc
Elapsed time is 4.224324 seconds.
disp("Time taken to filter data using ismember: " + toc + " seconds")
Time taken to filter data using ismember: 4.227 seconds
  • For complex numeric conditional operations, converting it to array using 'table2array' function can sometime speed-up operations.
c33Array = table2array(t(:,'x3'));
% Check if data-squared is less than 27
filter = c33Array.^2 < 27;
filtered_data5 = t(filter, :);
You may go through the following MathWorks documentation links to learn more about ‘table’ in MATLAB:
I hope this helps.

Tags

Products


Release

R2024a

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!