How do I identify close matched regions from this vector?

Question

Faraz on 20 Mar 2014

Direct link to this question

https://au.mathworks.com/matlabcentral/answers/122469-how-do-i-identify-close-matched-regions-from-this-vector

Commented: Faraz on 28 Mar 2014

Hello,

I have this 1D vector, posted on Pastebin.

Just by looking at it I can tell that it contains 4 different regions: 0 to 122, 375 to 563, 1145 to 1292, and 1697 to 2242. This is based on how far the values "jump" from one region to the next.

Is there any way in Matlab in which I can identify these regions from this vector?

Thank you

Answer 1

Star Strider on 20 Mar 2014

Direct link to this answer

https://au.mathworks.com/matlabcentral/answers/122469-how-do-i-identify-close-matched-regions-from-this-vector#answer_129463

Edited: Star Strider on 21 Mar 2014

There may be more efficient ways to do what you want, but this at least seems to work:

v = [ 9 18 21 58 59 60 63 66 69 70 72 74 ...
dv = diff([0 v]);                                       % Create difference vector
dvs = [mean(dv)  std(dv)];                              % Determine mean & std
dvd = [0 find(dv > 1.96*dvs(2)+dvs(1)) length(v)];      % Use ‘dvs’ to detect discontinuities & create index reference vector
for k1 = 1:length(dvd)-1
    vs{k1} = v(dvd(k1)+1:dvd(k1+1)-1);                  % Create cell array of regions-of-interest
    vi{k1} = [dvd(k1)+1 dvd(k1+1)-1];                   % Create reference array of start-end indices for ‘vs’ regions
end

The v vector is your data, the vi cell array contains the beginning and end indices of your segments-of-interest, and the vs cell array contains the data in your segments-of-interest. A cell array is necessary for vs because the vectors are of different lengths. (See the documentation for cell2mat to convert cell arrays back into doubles.)
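Because the data line in the posted code is truncated, here is a minimal sketch of the same approach on made-up data, so it can be run end to end. The vector `v` below is hypothetical (it is not the Pastebin data), and the preallocation lines are an addition; the detection criterion and the indexing are the ones from the answer.

```matlab
% Hypothetical data: three runs of consecutive integers separated by jumps of about 100.
v = [1:10, 111:120, 221:230];

dv  = diff([0 v]);                      % differences, padded so dv(k) lines up with v(k)
thr = mean(dv) + 1.96*std(dv);          % same criterion as in the answer
dvd = [0 find(dv > thr) length(v)];     % 0, positions of the jumps, last index

vs = cell(1, length(dvd)-1);            % preallocate the cell arrays
vi = cell(1, length(dvd)-1);
for k1 = 1:length(dvd)-1
    vi{k1} = [dvd(k1)+1, dvd(k1+1)-1];  % start and end index of segment k1
    vs{k1} = v(vi{k1}(1):vi{k1}(2));    % data in segment k1
end

disp(vertcat(vi{:}))                    % one row of [start end] per segment
```

With this indexing, the elements sitting at the detected positions themselves (and the last element of `v`) are left out of every segment, which matches the stated intent of keeping only the data between the discontinuities.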

9 Comments

Star Strider on 21 Mar 2014

Edited: Star Strider on 21 Mar 2014

I'm pleased that it does what you want!

The diff function takes a vector of length N and produces a vector of length N-1. Adding a leading zero makes the result equal in length to the argument, preserves the first entry in the argument vector, and also makes the indices of the result, dv here, match the indices of argument vector v. That makes the programming easier.
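A small sketch of the length and index alignment described above, using a made-up five-element vector (the values are only for illustration):

```matlab
v  = [3 5 6 20 22];        % hypothetical data with one jump, between 6 and 20

d1 = diff(v);              % N-1 differences
d2 = diff([0 v]);          % N differences; d2(k) = v(k) - v(k-1), with d2(1) = v(1)

[~, idx] = max(d2);        % index of the largest difference in the padded vector
fprintf('length(v) = %d, length(d1) = %d, length(d2) = %d\n', ...
        numel(v), numel(d1), numel(d2));
fprintf('largest step is at index %d, where v = %d\n', idx, v(idx));
```

In the padded version the index of the large difference is also the index of the first value after the jump, so no offset bookkeeping is needed when indexing back into `v`.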

The dvd vector stores the starting and ending indices of your segments-of-interest, offset by -1. This is necessary because of the way I defined the discontinuities in your vector, so that vs and vi would contain only the information between the discontinuities without including them.

The search criterion, (dv > 1.96*dvs(2)+dvs(1)) uses the 95% confidence intervals to detect and include the discontinuities. [I have a background in biostatistics, and using a statistical criterion (1) makes sense mathematically, and (2) makes it applicable to different data vectors without needing to modify the criterion.]

I reposted the code in my answer with comments. (Nothing else changed.)

Star Strider on 22 Mar 2014

Edited: Star Strider on 22 Mar 2014

My pleasure.

All estimated parameters (such as the mean) derived from real-world measurements contain uncertainties. A confidence interval is a range, computed from the sample, that is expected to contain the true value of the parameter with a stated probability (here 95%). Strictly, the 95% confidence interval for the mean is the mean ± 1.96 standard errors (the standard deviation divided by the square root of the sample size). The limits I used are wider: assuming the data are normally distributed (are described by the normal distribution), about 95% of the individual values lie within ±1.96 standard deviations of the mean. The Wikipedia article on Confidence Interval explains it more extensively.

So to get the upper 95% limit (all the discontinuity differences are above that), I multiplied the standard deviation by 1.96 and added it to the mean.

The 1.96 number comes from the properties of the normal distribution. (See norminv for more information.) In my code, I chose the 95% limits to be sure I only detected the extreme differences that indicated the discontinuities. 95% is the conventional choice. A 99% limit would likely not make any difference in this particular application. It might actually be less accurate, because it is more likely that the larger 99% limits would not catch some discontinuities.
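For readers who want to see where 1.96 comes from, here is a minimal sketch. It assumes norminv (Statistics and Machine Learning Toolbox) may not be installed, so it uses the base-MATLAB identity norminv(p) = -sqrt(2)*erfcinv(2*p) instead. The data vector is hypothetical.

```matlab
% Two-sided 95% and 99% multipliers of the standard normal distribution.
z95 = -sqrt(2)*erfcinv(2*0.975);    % about 1.960
z99 = -sqrt(2)*erfcinv(2*0.995);    % about 2.576

% Effect on the detection threshold for some made-up data.
v   = [1:10, 111:120, 221:230];     % hypothetical: jumps of about 100 between runs
dv  = diff([0 v]);
thr95 = mean(dv) + z95*std(dv);
thr99 = mean(dv) + z99*std(dv);

fprintf('z95 = %.3f, z99 = %.3f\n', z95, z99);
fprintf('thr95 = %.1f, thr99 = %.1f\n', thr95, thr99);
fprintf('jumps found: %d (95%%), %d (99%%)\n', nnz(dv > thr95), nnz(dv > thr99));
```

The 99% threshold is always the higher of the two, so it can only find the same discontinuities or fewer.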

Faraz on 28 Mar 2014

@StarStrider

I am happy that the confusion here has been resolved.

Statistics is not my strong point, so I have been reading in detail anything I can find related to your provided solution and the explanation you provided above.

I want to fully understand the working of the solution you provided. Based on what I have read and understood this is my understanding of your explanation (Please note I have no base in stats at all so I may be going very basic here).

So to begin with, a 95% confidence interval is basically saying that I am 95% sure that the true mean of a distribution lies within the limits [a, b].

And this can be calculated with the formula:

x ± 1.96*(std/sqrt(n))

where x is the mean, std is the standard deviation and n is the sample size.

And this confidence interval applies to a normal distribution only.
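To make the distinction concrete, here is a small sketch on hypothetical data comparing the formula above (a confidence interval for the mean) with the limit that actually appears in the answer's code, which is mean + 1.96*std without the sqrt(n) term. The variable names `ciMean` and `spread` are made up for this illustration.

```matlab
v  = [1:10, 111:120, 221:230];      % hypothetical data, not the Pastebin vector
dv = diff([0 v]);                   % difference vector, as in the answer

n = numel(dv);
m = mean(dv);
s = std(dv);

ciMean = m + [-1 1]*1.96*s/sqrt(n); % 95% confidence interval for the mean
spread = m + [-1 1]*1.96*s;         % range covering ~95% of values, if normal

fprintf('CI for the mean : [%.2f, %.2f]\n', ciMean);
fprintf('mean +/- 1.96 SD: [%.2f, %.2f]\n', spread);
fprintf('threshold used in the answer: %.2f\n', spread(2));
```

The second interval is wider by a factor of sqrt(n), and its upper end is the quantity compared against dv in the dvd line.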

So now that I have my basics done, let me try to explain your solution.

You saw my vector and the samples and found them to be normally distributed samples (between the jumps). Knowing that these are normal distributions simply finding their confidence intervals will give you the start and end point of a normally distributed sample.

And that is what you did in this line:

dvd = [0 find(dv > 1.96*dvs(2)+dvs(1)) length(v)];

where find returns the indices of the points where the confidence interval condition holds true. But why the difference vector, dv? And why use it in the condition?

This is based on what I read today so I may be way off. Sorry for such a long read but am I on the right path? Is my understanding correct.

Statistics is really fascinating me as to how intelligently one can make assumptions based on a sample and find them to be true.

Thank you.

P.S. Did you get my email? Is the email system even working?

Star Strider on 28 Mar 2014

Edited: Star Strider on 28 Mar 2014

Thank you, Image Analyst.

Faraz, I don’t respond to MATLAB Answers e-mail for the reasons Image Analyst listed (and others). I believe everything should be kept posted here for the sake of continuity. I check my profile page to see if there has been any activity in anything I’ve answered.

The difference vector does two things: it (1) detects the approximate slope of the data between the discontinuities, thereby removing the offset, and (2) creates ‘spikes’ at the discontinuities, making the discontinuities much easier to detect. The 95% confidence limits are simply an adaptive way of making the code work for a large number of different data sets. In the dvd statement, my code looks for the beginning and the end of each segment as defined by the discontinuities. The lower limits are offset by 1, so I started with 0. The code therefore picks data from index 1 to the first discontinuity, continues from one discontinuity to the next, and finishes with the stretch from the last discontinuity to the end of the vector.

This works for your data because they are not noisy, and is not a general solution. Noisy data or data with significant variations between the step discontinuities would pose different problems.
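A small sketch of one way the criterion can fall short, using made-up step sizes: because the threshold is built from the mean and standard deviation of all differences, one very large jump inflates the standard deviation and can hide a smaller one.

```matlab
% Hypothetical differences: steps of 1, one moderate jump (30), one large jump (200).
dv = [1 1 1 1 30 1 1 1 1 200 1 1 1 1];
v  = cumsum(dv);                    % data vector whose padded differences are dv

thr   = mean(dv) + 1.96*std(dv);    % criterion from the answer
found = find(dv > thr);

fprintf('threshold = %.1f\n', thr);
fprintf('jumps detected at indices: %s\n', mat2str(found));
```

Here the jump of 30 at index 5 is well above the typical step of 1 but stays below the threshold, so only the jump at index 10 is reported.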

Probability and statistics are fascinating areas, and I wish I knew more about them than I do. I certainly suggest you take courses in them.

Faraz on 28 Mar 2014

@StarStrider and @ImageAnalyst

Thanks a lot, you guys were very helpful in your explanations. I believe I have a firm grip of the solution now.

Duly noted and agreed with the email thing, will post here from now on.

Answer 2

Image Analyst on 20 Mar 2014

Direct link to this answer

https://au.mathworks.com/matlabcentral/answers/122469-how-do-i-identify-close-matched-regions-from-this-vector#answer_129465

What do you mean by identify? It seems like you just did when you described the range of values that each takes. Do you want the indexes of each class? Do you want those intensity ranges specified automatically depending on each vector, like as if you use kmeans() or something? You (or someone) tagged it with image Processing so do you want to do connected components analysis (useful if some regions in an intensity range are not touching each other)?

14 Comments

Faraz on 25 Mar 2014

I don't know why but I believe this question of mine along with the previous one has generated some confusion. I'll try to explain it as best I can here.

I'll start from my previous question: http://www.mathworks.co.uk/matlabcentral/answers/121944-why-is-the-entire-image-copied-instead-of-the-axes-only

What I have is a plot of areas

     stats = regionprops(label, 'Area');
     area = [stats.Area];
     figure(1), plot(area(:),'.');

The above code generates this plot

In the plot, my region of interest are the blue clusters. I want to extract the y-axis start and end point of each cluster automatically or via command line.

For example, in the image I showed there are 5 regions I am interested in, their y-axis start and end values being (roughly) 0 - 150, 300 - 600, 900 - 1400, 1600 - 2500 and 2800 - 4000 respectively.

This was my final goal and was what I really wanted to do. And this is where I am getting these numbers from.

In the previous question that I linked above, it was discussed that there was no practical way to extract these limits from the plot figure automatically, so after plotting them I added these lines to the code:

     for k = 1: length(area)
         test(area(k)) = 255;
     end

What this did was create a "test" image and mark white my regions of interest. I now had my clusters in image format or better yet in vector format.

I clearly had a lot of "0" regions between my clusters. I removed those by using the "find" command in Matlab.

And the "find" command is what produced the vector that I presented here.

From the vector, once the zeros are removed, one can see that at some points the numbers abruptly jump by a large amount. Each jump corresponds to the run of 0's that find() skipped over. Thus, by identifying these jumps, the beginning and end of each cluster can be found.
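The steps described above can be sketched on a tiny made-up example. The `area` values here are hypothetical stand-ins for the regionprops output, and the indexed assignment is a vectorized equivalent of the for loop:

```matlab
area = [3 5 4 20 22 21 5];      % hypothetical blob areas (positive integers)

test = zeros(1, max(area));     % "test" vector, one slot per possible area value
test(area) = 255;               % mark every area value that occurs

v = find(test);                 % positions of the marks = sorted distinct areas
gaps = diff(v);                 % a large gap means a run of zeros was skipped

disp(v)
disp(gaps)
```

For positive integer areas, `find(test)` returns the same values as `unique(area)`, so the intermediate image is not strictly needed to obtain the vector.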

I asked for an automatic way of finding these clusters, which Star Strider provided. It works fairly well, but in some cases the start point of a region is higher than the end point, which does not make sense and breaks the program execution, but oh well.
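One way a start index larger than the end index can arise, assuming the indexing from the answer is used unchanged, is when two detected jumps sit at adjacent positions (an isolated value between two gaps). The sketch below reproduces that on hypothetical data and shows a variant, not from the thread, that assigns every element to a segment so that start <= end always holds.

```matlab
v = [1:10, 111, 211:220, 321:330];     % hypothetical: 111 is an isolated value

% Indexing as in the answer: the segment between adjacent jumps is "inverted".
dv   = diff([0 v]);
dvd  = [0 find(dv > mean(dv) + 1.96*std(dv)) length(v)];
orig = [dvd(1:end-1)+1; dvd(2:end)-1]';   % rows of [start end]

% Variant: each jump position starts a new segment, nothing is skipped.
d      = diff(v);                          % unpadded, so v(1) cannot look like a jump
jumps  = find(d > mean(d) + 1.96*std(d)) + 1;
starts = [1, jumps];
ends   = [jumps-1, numel(v)];

disp(orig)                                 % contains the row [12 11]
disp([starts; ends]')                      % every row has start <= end
```

In the variant a single isolated value becomes its own one-element segment rather than an empty range, which may or may not be what is wanted for a given data set.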

So I hope now I managed to explain the challenge I faced properly. please do comment if this explanation did lack in some way.

Although ‘That was perfect and exactly what I was looking for.’ it remains unaccepted. Oh, well...

oops sorry, fixed that now :)

Image Analyst on 26 Mar 2014

Star I think you see now that he's just plotting blob areas:

     stats = regionprops(label, 'Area');
     area = [stats.Area];
     figure(1), plot(area(:),'.');

So it's basically a plot of blob area vs. the blob label (ID) number. To look at that plot/scatterplot and see clusters is not meaningful. The clusters are totally artificial because the "x" axis is the label number. Labels are assigned by going down your image. First start at column 1 and see if you hit a blob. Then do a region growing to classify that whole blob as blob #1. Then continue on down, moving over column by column, finding blobs and assigning new label numbers as you encounter them. So I hope you see that the label ID number of a blob is rather arbitrary and not meaningful as far as clusters go. For example, let's take your image and turn it 90 degrees, then label and find areas and plot them. You're going to have different label numbers assigned to each blob, and the area vs. label number plot will be COMPLETELY DIFFERENT, even though the image has the very same blobs in it.

As an analogy, let's say you were measuring the heights of two classes of people: young children (who will be short) and adults (tall). So we have two clusters of heights, short heights and tall heights. Now let's stand them all in a line with all the children on the left and all the adults on the right. Let's give them an ID number (label) that corresponds to their position in the line. Now plot the heights in a scatterplot. You'll see a cluster of points in the lower left of the plot (this represents the children) and a cluster of points in the upper right representing adults. You get two clusters. Interleave them (kid, adult, kid, adult, etc.) and you again get two clusters. But the clusters are along the height axis; any clusters appearing along the label axis are artificial. Now rearrange them again to have half the kids, half the adults, the other half of the kids, and the other half of the adults. Now the plot seems to show 4 clusters even though the set of heights didn't change! But you still have 2 clusters along heights, and any clusters you see along the label axis are not real. Here's some code to illustrate that:

clc;    % Clear the command window.
close all;  % Close all figures (except those of imtool.)
clear;  % Erase all existing variables. Or clearvars if you want.
workspace;  % Make sure the workspace panel is showing.
format long g;
format compact;
fontSize = 20;
c = [3,4,3,3.2,3.5,3.1];  % Child heights
a = [5.9, 6, 5.8, 6.1, 6.05, 5.95];  % Adult heights
% Stand children on left, adults on right.
allPersons = [c,a];
IDNumbers = 1 : length(allPersons);
subplot(3,1,1);
plot(IDNumbers, allPersons, 'b*', 'LineWidth', 4);
grid on;
% Enlarge figure to full screen.
set(gcf, 'Units', 'Normalized', 'OuterPosition', [0 0 1 1]);
title('Now you see 2 clusters along height and label.', 'fontSize', fontSize);
% Now interleave children and adults.
for k = 1 : 6
  allPersons(2*k-1) = c(k);
  allPersons(2*k) = a(k);
end
% Now it's kid, adult, kid, adult, kid, etc.
% Now plot
subplot(3,1,2);
plot(IDNumbers, allPersons, 'b*', 'LineWidth', 4);
grid on;
title('Now you see 2 clusters along height ONLY, not along label.', 'fontSize', fontSize);
% Now have 4 groups: 3 children, 3 adults, 3 children, 3 adults.
allPersons = [c(1:3), a(1:3), c(4:end), a(4:end)];
% Now plot
subplot(3,1,3);
plot(IDNumbers, allPersons, 'b*', 'LineWidth', 4);
grid on;
% Now you seem to see 4 clusters along label, but there are still only 2 along height.
title('Now you see 4 clusters along label, but still only 2 along height.', 'fontSize', fontSize);

I hope this illustrates why finding clusters on an area vs label number plot is meaningless. You'd be better off just doing unsupervised clustering on the areas. Or pick a certain number of clusters and use kmeans.
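As a rough illustration of clustering on the values alone, here is a sketch that uses the heights from the demo above. It is a deliberately simple stand-in for kmeans (which needs the Statistics and Machine Learning Toolbox): it sorts the values and splits them at the largest gap, which only makes sense when exactly two well-separated groups are assumed. The names `order1`, `order2` and `cut` are made up for this example.

```matlab
c = [3,4,3,3.2,3.5,3.1];                 % child heights (from the demo above)
a = [5.9, 6, 5.8, 6.1, 6.05, 5.95];      % adult heights (from the demo above)

order1 = [c, a];                                   % children first, then adults
order2 = [c(1:3), a(1:3), c(4:end), a(4:end)];     % the "4 group" arrangement

s1 = sort(order1);                       % sorting discards the label order
s2 = sort(order2);

[~, k] = max(diff(s1));                  % largest gap between neighbouring values
cut = (s1(k) + s1(k+1))/2;               % split point halfway across that gap

fprintf('same sorted values for both orderings: %d\n', isequal(s1, s2));
fprintf('cut = %.2f, tall group size = %d and %d\n', ...
        cut, nnz(order1 > cut), nnz(order2 > cut));
```

Both arrangements give the same split, because the result depends only on the set of values and not on the order in which they were labeled.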

Faraz on 26 Mar 2014

@ImageAnalyst. Yes, I completely agree with what you are saying, and that is why I never gave any importance to the x-axis, as I knew those are just ID numbers of the areas. My interest was always in the y-axis, and even in your example above, if I only take the y-axis, ignoring the x-axis, I get the same result in all cases.

For example if I edit your code by just changing the plot command to this:

clc;    % Clear the command window.
close all;  % Close all figures (except those of imtool.)
clear;  % Erase all existing variables. Or clearvars if you want.
workspace;  % Make sure the workspace panel is showing.
format long g;
format compact;
fontSize = 20;
c = [3,4,3,3.2,3.5,3.1] % child heights
a = [5.9, 6, 5.8, 6.1, 6.05, 5.95];
% Stand children on left, adults on right.
allPersons = [c,a];
IDNumbers = 1 : length(allPersons);
subplot(3,1,1);
plot(IDNumbers, allPersons, 'b*', 'LineWidth', 4);
grid on;
% Enlarge figure to full screen.
set(gcf, 'Units', 'Normalized', 'OuterPosition', [0 0 1 1]);
title('Now you see 2 clusters along height and label.', 'fontSize', fontSize);
% Now interleave children and adults.
for k = 1 : 6
  allPersons(2*k-1) = c(k);
  allPersons(2*k) = a(k);
end
% Now it's kid, adult, kid, adult, kid, etc.
% Now plot
subplot(3,1,2);
plot(IDNumbers, allPersons, 'b*', 'LineWidth', 4);
grid on;
title('Now you see 2 clusters along height ONLY, not along label.', 'fontSize', fontSize);
% Now have 4 groups: 3 children, 3 adults, 3 children, 3 adults.
allPersons = [c(1:3), a(1:3), c(4:end), a(4:end)];
% Now plot
subplot(3,1,3);
plot(IDNumbers, allPersons, 'b*', 'LineWidth', 4);
grid on;
% Now you see 4 clusters along height and along label.
title('Now you see 4 clusters along height and along label.', 'fontSize', fontSize);

I get this result: (I am sorry, I did not edit the titles)

I get 4 blobs/clusters/regions of interest in all cases.

Even in my previous question I made it clear that I only wanted the y-axis and was not interested in the x-value. Quoting from there:

This is the plot I have, the y-axis represents the object areas. This is the important axis that I want to transfer to the new image.

So I apologize again for not being clear and more descriptive from the get go and allowing for this confusion to settle in. I just wanted to implement a very simple method I thought about, that is all.

Star Strider on 26 Mar 2014

I had to go read about regionprops and related functions, since I don’t do much image processing. (I intend to explore the File Exchange for demos and tutorials, but not just now.)

I feel as though I managed to jump into the middle of something here without knowing the wider context, and I’m still not certain I do. I always assume that Faraz and others who post here have designed their studies carefully, knowing how they intend to acquire and analyse their data, and post here for help in dealing with unanticipated problems. (Too often that is not the situation, and people decide to design their studies after they have gathered their data, but I did not get that impression here.)

I’m glad I could help, and I apologise for contributing to any confusion.

Image Analyst on 26 Mar 2014

You've been very helpful. It can get confusing when people don't give the whole context so we don't know what the big picture is. Even more frustrating is when you know the big picture but people are dead set on going down a path that you know is a dead end, and when you suggest a workable approach they continue to try their dead end approach.
