**You are now following this question**

- You will see updates in your followed content feed.
- You may receive emails, depending on your communication preferences.

# How do I identify close matched regions from this vector?

4 views (last 30 days)

Show older comments

Hello,

Just by looking at it I can tell that I have 4 different regions, 0 to 122 then 375 to 563, 1145 to 1292 and 1697 to 2242. This is based on how much one region "jumps" to another.

Is there any way in Matlab in which I can identify these regions from this vector?

Thank you

##### 0 Comments

### Accepted Answer

Star Strider
on 20 Mar 2014

Edited: Star Strider
on 21 Mar 2014

There may be more efficient ways to do what you want, but this at least seems to work:

v = [ 9 18 21 58 59 60 63 66 69 70 72 74 ...

dv = diff([0 v]); % Create difference vector

dvs = [mean(dv) std(dv)]; % Determine mean & std

dvd = [0 find(dv > 1.96*dvs(2)+dvs(1)) length(v)]; % Use ‘dvs’ to detect discontinuities & create index reference vector

for k1 = 1:length(dvd)-1

vs{k1} = v(dvd(k1)+1:dvd(k1+1)-1); % Create cell array of regions-of-interest

vi{k1} = [dvd(k1)+1 dvd(k1+1)-1]; % Create reference array of start-end indices for ‘vs’ regions

end

The v vector are your data, the vi cell array contains the beginning and end indices of your segments-of-interest, and the vs cell array contains the data in your segments-of-interest. A cell array is necessary for vs because the vectors are of different lengths. (See the documentation for cell2mat to convert cell arrays back into doubles.)

##### 9 Comments

Faraz
on 21 Mar 2014

That was perfect and exactly what I was looking for. However I do not know what is going on, why did you diff([0 v]) rather than just diff(v) also whats going on in dvd?

Can I ask you to please comment your code so that I know whats going on. Thank you

Star Strider
on 21 Mar 2014

Edited: Star Strider
on 21 Mar 2014

I'm pleased that it does what you want!

The diff function takes a vector of length N and produces a vector of length N-1. Adding a leading zero makes the result equal in length to the argument, preserves the first entry in the argument vector, and also makes the indices of the result, dv here, match the indices of argument vector v. That makes the programming easier.

The dvd vector stores the starting and ending indices of your segments-of-interest, offset by -1. This is necessary because of the way I defined the discontinuities in your vector, so that vs and vi would contain only the information between the discontinuities without including them.

The search criterion, (dv > 1.96*dvs(2)+dvs(1)) uses the 95% confidence intervals to detect and include the discontinuities. [I have a background in biostatistics, and using a statistical criterion (1) makes sense mathematically, and (2) makes it applicable to different data vectors without needing to modify the criterion.]

I reposted the code in my answer with comments. (Nothing else changed.)

Faraz
on 22 Mar 2014

Thanks a lot Star Rider for that explanation. And I'm sorry if I'm asking too much but can you please explain the search criterion, you mentioned it uses the confidence interval (95%). I tried to learn about it by reading this: http://www.stats.gla.ac.uk/steps/glossary/confidence_intervals.html. But nothing that I read there helped me understand. In the link they mentioned that we can use a 95 % or a 99% confidence level? and this might be a stupid question but would 99% produce better results?

Also why use this 1.96 multiplier and whats the role of standard deviation and mean with confidence level?

Thank you again and sorry for asking so much :)

Star Strider
on 22 Mar 2014

Edited: Star Strider
on 22 Mar 2014

My pleasure.

All estimated parameters (such as the mean) derived from real-world measurements contain uncertainties. The confidence interval is the probability that the true mean is within those limits. So assuming the data are normally distributed (are described by the normal distribution), a 95% confidence interval means that there is a 95% probability that the true mean will be within ±1.96 standard deviations of the mean. The Wikipedia article on Confidence Interval explains it more extensively.

Considering a normally-distributed data set, the 95% confidence intervals are within ±1.96 standard deviations of the mean. So to get the upper 95% confidence limit (all the discontinuity differences are above that), I multiplied the standard deviation by 1.96 and added it to the mean.

The 1.96 number comes from the properties of the normal distribution. (See norminv for more information.) In my code, I chose the 95% confidence interval to be sure I only detected the extreme differences that indicated the discontinuities. The 95% confidence interval is the standard. A 99% confidence interval would likely not make any difference in this particular application. It might actually be less accurate, because it is more likely that the larger 99% confidence limits would not catch some discontinuities.

Image Analyst
on 22 Mar 2014

Faraz
on 28 Mar 2014

@StarStrider

I happy that the confusion here has been resolved.

Statistics is not my strong point, so I have been reading in detail anything I can find related to your provided solution and the explanation you provided above.

I want to fully understand the working of the solution you provided. Based on what I have read and understood this is my understanding of your explanation (Please note I have no base in stats at all so I may be going very basic here).

So to begin with, a 95% confidence interval is basically saying that I am 95% sure that given mean of a probability is within these limits [a, b].

And this can be calculated with the formula:

x +- 1.96 (std/sqrt(n))

where x is the mean, std is teh standard deviation and n is the sample size.

And this confidence interval applies to a normal distribution only.

So now that I have my basics done, let me try to explain your solution.

You saw my vector and the samples and found them to be normally distributed samples (between the jumps). Knowing that these are normal distributions simply finding their confidence intervals will give you the start and end point of a normally distributed sample.

And that is what you did in this line:

dvd = [0 find(dv > 1.96*dvs(2)+dvs(1)) length(v)];

where find, finds the indexes of the points where the confidence interval equation holds true, but why the differece vector, dv? and why use it in the condition statement?

This is based on what I read today so I may be way off. Sorry for such a long read but am I on the right path? Is my understanding correct.

Statistics is really fascinating me as to how intelligently one can make assumptions based on a sample and find them to be true.

Thank you P.S. did you get me email? is the email system even working?

Image Analyst
on 28 Mar 2014

Star Strider
on 28 Mar 2014

Edited: Star Strider
on 28 Mar 2014

Thank you, Image Analyst.

Faraz, I don’t respond to MATLAB Answers e-mail for the reasons Image Analyst listed (and others). I believe everything should be kept posted here for the sake of continuity. I check my profile page to see if there has been any activity in anything I’ve answered.

The difference vector does two things: it (1) detects the approximate slope of the data between the discontinuities thereby removing the offset, and (2) creates ‘spikes’ at the discontinuities, making the discontinuities much easier to detect. The 95% confidence limits are simply an adaptive way of making the code work for a large number of different data sets. In the dvd statement, my code looks for the beginning of each segment and the end as defined by the discontinuities. The lower limits are offset by 1, so I started with 0, so the code picks data from index 1 to the first discontinuity, continues to the last discontinuity to the end of the vector.

This works for your data because they are not noisy, and is not a general solution. Noisy data or data with significant variations between the step discontinuities would pose different problems.

Probability and statistics are fascinating areas, and I wish I knew more about them than I do. I certainly suggest you take courses in them.

Faraz
on 28 Mar 2014

@StarStrider and @ImageAnalyst

Thanks a lot, you guys were very helpful in your explanations. I believe I have a firm grip of the solution now.

Duly noted and agreed with the email thing, will post here from now on.

### More Answers (1)

Image Analyst
on 20 Mar 2014

##### 14 Comments

Image Analyst
on 21 Mar 2014

Edited: Image Analyst
on 21 Mar 2014

Why not just do

v1 = v(v<122);

inRange = v >= 375 & v <= 563;

v2 = v(inRange);

inRange = v >= 1145 & v <= 1292 ;

v3 = v(inRange);

inRange = v >= 1697 & v <= 2242;

v4 = v(inRange);

You're just extracting values in each range into their own new vector.

Is that what you want?

If you also want the locations of those in terms of what indexes they reside at you can get them from regionprops():

measurements = regionprops(logical(inRange), 'PixelIdxList');

indexes = [measurements.PixelIdxList];

Faraz
on 24 Mar 2014

Sorry about my late reply. I work on weekends.

Thank you for your answer, these regions are actually produced by the process we discussed in an earlier question. Where you suggested that rather than going via plot, I should work at pixel level directly. http://www.mathworks.co.uk/matlabcentral/answers/121944-why-is-the-entire-image-copied-instead-of-the-axes-only

This vector I posted in pastebin is a result of that. Would smoothing improve my answer?

Image Analyst
on 24 Mar 2014

Star Strider
on 24 Mar 2014

I got the impression that Faraz had extracted the data of interest (I didn’t know it was from an image), it gave the vector he posted, and Faraz wants a programmatic way of extracting the data between the discontinuities. That’s what I provided.

I assume the data came from something like:

h1l = findobj('type','line');

x1d = get(h1l,'XData');

y1d = get(h1l,'YData');

but I actually have no idea.

I’m now totally lost.

Image Analyst
on 24 Mar 2014

Star Strider
on 24 Mar 2014

Edited: Star Strider
on 24 Mar 2014

I certainly agree with your approach. Start with the image and get the necessary data. I had no idea the vector here represented image data, since the context wasn’t familiar to me. (I read many of your answers out of interest, but thought this was removed from image processing.) I get the impression no suggestion would have gone over well.

At least I have a ready-to-post code snippet if a real data discontinuity problem presents itself!

Although ‘That was perfect and exactly what I was looking for.’ it remains unaccepted. Oh, well...

Faraz
on 25 Mar 2014

I don't know why but I believe this question of mine along with the previous one has generated some confusion. I'll try to explain it as best I can here.

Ill start from my previous question: http://www.mathworks.co.uk/matlabcentral/answers/121944-why-is-the-entire-image-copied-instead-of-the-axes-only

What I have is a plot of areas

stats = regionprops(label, 'Area');

area = [stats.Area];

figure(1), plot(area(:),'.');

The above code generates this plot

In the plot, my region of interest are the blue clusters. I want to extract the y-axis start and end point of each cluster automatically or via command line.

For example in the image I showed there are 5 regions I am interested in, their y- axis start and end values being (roughly); 0 - 150, 300 - 600, 900 - 1400, 1600 - 2500 and 2800 - 4000 respectively.

This was my final goal and was what I really wanted to do. And this is where I am getting these numbers from.

In the previous question taht i linked above it was discussed that there was no practical way to extract these limits from the plot figure automatically, so after plotting them I added these lines to the code:

for k = 1: length(area)

test(area(k)) = 255;

end

What this did was create a "test" image and mark white my regions of interest. I now had my clusters in image format or better yet in vector format.

I clearly had a lot of "0" regions between my clusters. I removed those by using the "find" command in Matlab.

And the "find" command is what produced the vector that I presented here.

From the vector, once the zeros are removed one can see that the numbers at some point abruptly jump a large value (distance). This represents the 0's that were removed by find(). And thus by identifying this jump the cluster end and beginning can be found.

I asked for an autmatic way of finding these clusters, which Star Strider provided. It works fairly well but in some cases the start point of a region is higher than the end point. Which does not make sense and breaks the program execution, but oh well.

So I hope now I managed to explain the challenge I faced properly. please do comment if this explanation did lack in some way.

Although ‘That was perfect and exactly what I was looking for.’ it remains unaccepted. Oh, well...

oops sorry, fixed that now :)

Star Strider
on 25 Mar 2014

Edited: Star Strider
on 25 Mar 2014

I’m still not quite sure I understand, but now I assume you are starting from a plot someone else provided (and not generated by MATLAB, in which situation you could get the information in it from a ‘.fig’ file), and you want to get the data from it.

I assumed your data always moved in one direction, because that’s what I observed. If you are having problems because the start is higher than the end in some segments, experiment with changing the find criteria in the dvd line to something like:

find( (dv > dvs(1)+1.96*dvs(2)) | (dv < dvs(1)-1.96*dvs(2)) )

That enlarges the criteria to the ±95% confidence limits instead of simply the upper limit. I can’t guarantee that will work since I don’t have access to your other data, but it’s a start. (Note that I changed the code slightly.)

And Thanks!

Faraz
on 26 Mar 2014

Image Analyst
on 26 Mar 2014

Star I think you see now that he's just plotting blob areas:

stats = regionprops(label, 'Area');

area = [stats.Area];

figure(1), plot(area(:),'.');

So it's basically a plot of blob area vs. the blob label (ID) number. To look at that plot/scatterplot and see clusters is not meaningful. The clusters are totally artificial because the "x" axis is the label number. Labels are assigned by going down your image. first start at column 1 and see if you hit a blob. Then do a region growing to classify that whole blob as blob #1. Then continue on down the row, moving over column by column finding blobs and assigning new label numbers as you encounter them. So I hope you see that the label ID number of a blob is rather arbitrary and not meaningful as far as clusters go. For example, let's take your image and turn it 90 degrees, then label and find areas and plot them. You're going to have different label numbers assigned to each blob, and the area vs. label number plot will be COMPLETELY DIFFERENT , even though the image has the very same blobs in it.

As an analogy let's say you were measuring the heights of two classes of people: young children (who will be short) and adults (tall). So we have two clusters of heights, short heights and tall heights. Now let's stand them all in a line with all the children on the left and all the adults on the right. Let's give them an ID number (label) that corresponds to their position in the line. Now plot the heights in a scatterplot. You'll see a cluster of points in the lower left of the plot (this represents the children) and a cluster of points in the upper right representing adults. You get two clusters. Interleave them (kid, adult, kid, adult, etc.) and you again get two clusters. But the clusters are along the height axis, any clusters appearing along the label axis are artificial. Now rearrange them again to have half the kids, half the adult, half the kids, and half the adults. Now the plot seems to show 4 clusters even though the set of heights didn't change! But you still have 2 clusters along heights and any clusters you see along the label axis are not real. Here's some code to illustrate that:

clc; % Clear the command window.

close all; % Close all figures (except those of imtool.)

clear; % Erase all existing variables. Or clearvars if you want.

workspace; % Make sure the workspace panel is showing.

format long g;

format compact;

fontSize = 20;

c = [3,4,3,3.2,3.5,3.1] % child heights

a = [5.9, 6, 5.8, 6.1, 6.05, 5.95];

% Stand children on left, adulat on right.

allPersons = [c,a];

IDNumbers = 1 : length(allPersons);

subplot(3,1,1);

plot(IDNumbers, allPersons, 'b*', 'LineWidth', 4);

grid on;

% Enlarge figure to full screen.

set(gcf, 'Units', 'Normalized', 'OuterPosition', [0 0 1 1]);

title('Now you see 2 clusters along height and label.', 'fontSize', fontSize);

% Now interleave children and adults.

for k = 1 : 6

allPersons(2*k-1) = c(k);

allPersons(2*k) = a(k);

end

% Now it's kid, adult, kid, adult, kid, etc.

% Now plot

subplot(3,1,2);

plot(IDNumbers, allPersons, 'b*', 'LineWidth', 4);

grid on;

title('Now you see 2 clusters along height ONLY, not along label.', 'fontSize', fontSize);

% Now have 4 groups, 3 chilred, 3 adults, 3 children, 3 adults.

allPersons = [c(1:3), a(1:3), c(4:end), a(4:end)];

% Now plot

subplot(3,1,3);

plot(IDNumbers, allPersons, 'b*', 'LineWidth', 4);

grid on;

% Now you see 4 clusters along height and along label.

title('Now you see 4 clusters along height and along label.', 'fontSize', fontSize);

I hope this illustrates why finding clusters on an area vs label number plot is meaningless. You'd be better off just doing unsupervised clustering on the areas. Or pick a certain number of clusters and use kmeans.

Faraz
on 26 Mar 2014

@ImageAnalyst. yes I completely agree with what you are saying and that is why I never gave any importance to the x-axis. As I knew they are just ID numbers of the areas. My interest was always in the y-axis and evenin your example above, if I only take the y-axis, ignoring the x-axis I get the same result in all cases.

For example if I edit your code by just changing the plot command to this:

clc; % Clear the command window.

close all; % Close all figures (except those of imtool.)

clear; % Erase all existing variables. Or clearvars if you want.

workspace; % Make sure the workspace panel is showing.

format long g;

format compact;

fontSize = 20;

c = [3,4,3,3.2,3.5,3.1] % child heights

a = [5.9, 6, 5.8, 6.1, 6.05, 5.95];

% Stand children on left, adulat on right.

allPersons = [c,a];

IDNumbers = 1 : length(allPersons);

subplot(3,1,1);

plot(IDNumbers, allPersons, 'b*', 'LineWidth', 4);

grid on;

% Enlarge figure to full screen.

set(gcf, 'Units', 'Normalized', 'OuterPosition', [0 0 1 1]);

title('Now you see 2 clusters along height and label.', 'fontSize', fontSize);

% Now interleave children and adults.

for k = 1 : 6

allPersons(2*k-1) = c(k);

allPersons(2*k) = a(k);

end

% Now it's kid, adult, kid, adult, kid, etc.

% Now plot

subplot(3,1,2);

plot(IDNumbers, allPersons, 'b*', 'LineWidth', 4);

grid on;

title('Now you see 2 clusters along height ONLY, not along label.', 'fontSize', fontSize);

% Now have 4 groups, 3 chilred, 3 adults, 3 children, 3 adults.

allPersons = [c(1:3), a(1:3), c(4:end), a(4:end)];

% Now plot

subplot(3,1,3);

plot(IDNumbers, allPersons, 'b*', 'LineWidth', 4);

grid on;

% Now you see 4 clusters along height and along label.

title('Now you see 4 clusters along height and along label.', 'fontSize', fontSize);

I get this result: (I am sorry, I did not edit the titles)

I get 4 blobs/clusters/regions of interest in all cases.

Even in my previous question I made it clear that I only wanted the y-axis and was not interested in the x-value. Quoting from there:

This is the plot I have, the y-axis represents the object areas. This is the important axis that I want to transfer to the new image.

So I apologize again for not being clear and more descriptive from the get go and allowing for this confusion to settle in. I just wanted to implement a very simple method I thought about, that is all.

Image Analyst
on 26 Mar 2014

Star Strider
on 26 Mar 2014

I had to go read about regionprops and related functions, since I don’t do much image processing. (I intend to explore the File Exchange for demos and tutorials, but not just now.)

I feel as though I managed to jump into the middle of something here without knowing the wider context, and I’m still not certain I do. I always assume that Faraz and others who post here have designed their studies carefully, knowing how they intend to acquire and analyse their data, and post here for help in dealing with unanticipated problems. (Too often that is not the situation, and people decide to design their studies after they have gathered their data, but I did not get that impression here.)

I’m glad I could help, and I apologise for contributing to any confusion.

Image Analyst
on 26 Mar 2014

### See Also

### Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!**An Error Occurred**

Unable to complete the action because of changes made to the page. Reload the page to see its updated state.

Select a Web Site

Choose a web site to get translated content where available and see local events and offers. Based on your location, we recommend that you select: .

You can also select a web site from the following list:

## How to Get Best Site Performance

Select the China site (in Chinese or English) for best site performance. Other MathWorks country sites are not optimized for visits from your location.

### Americas

- América Latina (Español)
- Canada (English)
- United States (English)

### Europe

- Belgium (English)
- Denmark (English)
- Deutschland (Deutsch)
- España (Español)
- Finland (English)
- France (Français)
- Ireland (English)
- Italia (Italiano)
- Luxembourg (English)

- Netherlands (English)
- Norway (English)
- Österreich (Deutsch)
- Portugal (English)
- Sweden (English)
- Switzerland
- United Kingdom (English)

### Asia Pacific

- Australia (English)
- India (English)
- New Zealand (English)
- 中国
- 日本Japanese (日本語)
- 한국Korean (한국어)