Removing Error Data from Table

10 views (last 30 days)
Hi,
I am working on a project that involves plotting Temperature vs. Sensor Data on a scatter plot. My goal is to calculate the coefficients of a trend line for future predictions. However, there are instances where my sensor outputs false results. Instead of manually removing this data, I am in search of an outlier function that can automatically eliminate these anomalies and provide coefficients. These coefficients would be useful for defining the boundaries of valid data in future datasets.
I am aware of the standard outlier removal functions. However, the scatter plot I've attached here illustrates my challenge. You can observe the general pattern of valid data, interspersed with some errors. Although there's no trend line in the plot, it's possible to visualize how it would look and identify the inaccurate data points.
My question is: Is there a way to configure an outlier function to take into account both temperature and sensor output in identifying false results? Additionally, can this function reference previous data to aid in this evaluation? I have posted the plots below and also attached data samples of some sensor output for your reference.

Accepted Answer

Star Strider
Star Strider on 5 Dec 2023
Edited: Star Strider on 5 Dec 2023
This is definitely an interesting problem, and one I have thought about at other times.
I am not certain that I have completely solved it, since it still may require a bit of interactivity, depending on the data. For these two data sets, it seems to work reasonably well with these parameters. The scatterhist calls are just there as a way to illustrate the data, and are not absolutely necessary.
Try this —
LD = load('scatterData.mat')
LD = struct with fields:
dataTable4: [8321×2 table] dataTable5: [8498×2 table]
% dataTable4 ————————————————————————————————————————————————————————————————————————————————
T4 = LD.dataTable4;
T4 = rmmissing(T4)
T4 = 8135×2 table
Temperature NO2_WE_Voltage ___________ ______________ 10.769 249.58 10.669 249.71 10.991 248.95 11.243 248.65 11.301 248.27 11.414 248.4 11.414 248.22 11.283 247.82 11.258 248.17 11.27 248.1 11.215 248.12 11.228 248.07 11.414 248 11.706 247.52 12.028 246.84 12.801 245.61
VN4 = T4.Properties.VariableNames;
figure
hsh4 = scatterhist(T4{:,1}, T4{:,2}, 'NBins',50, 'Marker','.')
hsh4 =
1×3 Axes array: Axes Axes Axes
idx4 = clusterdata(table2array(T4), 'MaxClust',15, 'Criterion','distance');
[Uidx,~,uix] = unique(idx4); % Unique Cluster Numbers & Indices
ClustCounts = accumarray(idx4, 1); % Count Cluster Members
T4_ClusterCounts = table(Uidx, ClustCounts) % Results Summary Table
T4_ClusterCounts = 15×2 table
Uidx ClustCounts ____ ___________ 1 3 2 136 3 1 4 1 5 1 6 1 7 3526 8 1 9 4 10 2173 11 2273 12 1 13 11 14 2 15 1
[MaxCounts,MaxCidx] = max((T4_ClusterCounts{:,2})) % Largest Cluster & Index
MaxCounts = 3526
MaxCidx = 7
cm = colormap(turbo(numel(unique(idx4))));
hold on
Ax = hsh4(1);
% get(Ax)
Ax.UserData
ans = 1×4 cell array
{'gscatter'} {8135×1 double} {8135×1 double} {0×0 double}
scatter(Ax.UserData{2}, Ax.UserData{3}, 10, idx4)
colormap(turbo(numel(unique(idx4))))
hold off
colorbar
Lv4 = idx4 == MaxCidx;
figure
scatter(Ax.UserData{2}(Lv4), Ax.UserData{3}(Lv4), 10, '.')
grid
% dataTable5 ————————————————————————————————————————————————————————————————————————————————
T5 = LD.dataTable5;
T5 = rmmissing(T5)
T5 = 7487×2 table
Temperature NO2_WE_Voltage ___________ ______________ 12.411 252.73 12.255 252.96 12.255 253.31 12.27 253.18 12.368 252.81 12.356 252.68 12.338 253.21 12.368 252.3 12.338 252.78 12.27 252.33 12.169 252.96 12.283 252.43 12.451 252.63 12.69 251.52 13.224 251.09 13.831 249.76
VN5 = T5.Properties.VariableNames;
figure
hsh5 = scatterhist(T5{:,1}, T5{:,2}, 'NBins',50, 'Marker','.')
hsh5 =
1×3 Axes array: Axes Axes Axes
idx5 = clusterdata(table2array(T5), 'MaxClust',10, 'Criterion','distance');
[Uidx,~,uix] = unique(idx5);
ClustCounts = accumarray(idx5, 1);
T5_ClusterCounts = table(Uidx, ClustCounts)
T5_ClusterCounts = 10×2 table
Uidx ClustCounts ____ ___________ 1 2 2 7411 3 1 4 64 5 1 6 1 7 1 8 3 9 2 10 1
[MaxCounts,MaxCidx] = max((T5_ClusterCounts{:,2}))
MaxCounts = 7411
MaxCidx = 2
cm = colormap(turbo(numel(unique(idx5))));
hold on
Ax = hsh5(1);
% get(Ax)
Ax.UserData
ans = 1×4 cell array
{'gscatter'} {7487×1 double} {7487×1 double} {0×0 double}
scatter(Ax.UserData{2}, Ax.UserData{3}, 10, idx5)
colormap(turbo(numel(unique(idx5))))
hold off
colorbar
Lv5 = idx5 == MaxCidx;
figure
scatter(Ax.UserData{2}(Lv5), Ax.UserData{3}(Lv5), 10, '.')
grid
The clusterdata function works here with 'MaxClust' (the maximum number of clusters) set at 10, however that may not generalise to all data sets, and more clusters may be necessary in those instances. That is where the ‘interactivity’ requirement may arise.
It would likely be possible to create a function from this code (with or without the plots, however I believe the plots are necessary to illustrate the data and how the code works).
I have been working on this for a few hours and will ponder that in the morning.
EDIT — (5 Dec 2023 at 04:35)
Streamlined code, added more explanation text.
.
  11 Comments

Sign in to comment.

More Answers (0)

Tags

Products

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!