18 views (last 30 days)

I need to fit a line to some data, plotted as weighted scatter diagram below. The larger point has bigger weight. I used 'fit' function and the results is shown by blue line in the figure. However, I want to find the Green line, where I beleve is the expected behaviour.

The code and data is attached. I appreciate any idea that help me to reach to fit the green line.

Image Analyst
on 9 Jul 2020

I'd first of all filter out known bad data, data that you know should not be included in the fit. I'd make a tentative fit for the green line, like going between the points (20,3000) and (60,11000). Then for any data where the y value is more than 4000 above the line or 4000 below the line, it's throw those out. Then I'd fit a line through what remains. Something like this untested code:

coefficients = polyfit([20, 60], [3000, 11000], 1);

keepers = true(1, length(x));

for k = 1 : length(x)

% Get the point on the green line for this particular x value.

yFit = polyval(coefficients, x(k));

% If it's farther away than 4000, mark it for deletion.

if abs(yFit - y(k)) > 4000

keepers(k) = false;

end

end

% Extract only the good data.

goodx = x(keepers);

goody = y(keepers)

% Now find a fit for only the good data.

goodCoefficients = polyfit(goodx, goody, 1);

yFit = polyval(goodCoefficients, x);

% Plot it.

plot(x, yFit, 'g-', 'LineWidth', 3);

It would be even better to use RANSAC. If you have the Computer Vision Toolbox, you can use fitPolynomialRANSAC(). This type of fit will ignore the other major clump of data between x=40 to 60 and y = 5000 to 6000 and give you a fit for just the larger, longer elliptical cluster that you want.

John D'Errico
on 9 Jul 2020

Edited: John D'Errico
on 9 Jul 2020

You "want" the green line. I want the Buffalo Bills to win the Super bowl. Probably not gonna happen in either case. ;-)

You want to solve a weighted linear least squares problem, but you don't really understand linear least squares. And that is the fundamental problem.

plot(WX,WY,'.')

opts = fitoptions( 'Method', 'LinearLeastSquares' );

opts.Weights = W;

f = fit(WX,WY,'poly1',opts);

hold on

plot(f)

Linear least squares looks ONLY at errors in the y variable. Large positve or negative residuls, thus points that fall far above or below the line will have more importance in the fit. They will drag the curve around. In this case, it is a straight line.

Now, consider the green line that you "want" to see produce. Do you see a problem in this context? Down near x==0, ALL of the errors will be above the line, at lest if the green curve were the one we expected. Near x == 100, ALL of the errors will be BELOW the line.

In both cases, the line will be drawn into a position so the slope is reduced. While you see data that makes you WANT something, this is something the data tells me cannot happen. What you WANT is not relevant, because the tool you are using looks only at errors in the y variable. Sadly, I am pretty sure the tool cannot read your mind, nor does it really care about what you want to see happen. Computers do what they are programmed to do.

Worse, we have another problem that is just as serious. your two variables have wildly different variation.

stdx = std(WX)

stdx =

15.2824457612656

stdy = std(WY)

stdy =

1649.76961766046

We can also see that in the axis scaling.

We can see the vast difference by forcing the two axes to have the same scaling.

axis equal

Can the problem be solved to do what you want to see? For that, you are probably thinking in terms of what is called the total least squares problem. And of course, you have weights, so that will force me to actually write code. And since the two variables have hugely different variations, we need to deal with that too. Bah, humbug. I hate writing code. It makes me think.

The trick for total least squares is to use the SVD to do the work, sometimes called orthogonal regression. Other people call this principal component regression. As you can see here I computed weighted means. Subtract them off the data variables, then I weighted that matrix using W, and rescaled the variables to have unit variances. SVD does the hard work though.

mux = sum(WX.*W)./sum(W);

muy = sum(WY.*W)./sum(W);

A = [(WX - mux).*W/stdx,(WY - muy).*W/stdy];

[U,S,V] = svd(A,0);

S

S =

240.408960729392 0

0 167.877291251367

>> V

V =

0.246669782511878 -0.969099591577431

0.969099591577431 0.246669782511878

We choose the singular vector with the SMALLER singular value. That the two singular values are so close in magnitude tells me this regression is poorly posed, in the sense that the data ended up as a vaguely circular point cloud. Seriously, any line is arguably almost as good as any other in this case.

The weighted orthogonal regression line becomes...

V(1,2)*(x - mux)/stdx + V(2,2)*(y - muy)/stdy = 0

I'll use MATLAB to display the equation in a standard form, mainly becaue I am feeling too lazy to do basic algebra. Just too hot out today.

syms x y

y = vpa(solve(V(1,2)*(x - mux)/stdx + V(2,2)*(y - muy)/stdy,y),5)

y =

424.11*x - 11482.0

Now replot things, and see what happens.

plot(f)

hold on

plot(WX,WY,'b.')

H = fplot(matlabFunction(y),[35,55]);

H.Color = 'g';

H.LineWidth = 3;

So the total least squares regression, using weights and a rescaling of the variables to have unit variances works. Again, it was very close. I could almost have chosen the regression line orthogonal to the one we got. Your data is NOT well posed for regression, as an almost circular point cloud.

The Rolling Stones said it for me. "You can't always get what you want." Of course, you can freely decide the answer is exactly what you want to see. It is a scheme that works well for many politicians these days. :( But today, you got lucky. Que sera, sera... (I know somebody said that, but who? One "day" I'll remember.)

Image Analyst
on 10 Jul 2020

John D'Errico
on 11 Jul 2020

Opportunities for recent engineering grads.

Apply Today
## 1 Comment

## Direct link to this comment

https://au.mathworks.com/matlabcentral/answers/562220-fit-a-curve-on-scatter-data-main-behaviuor#comment_931796

⋮## Direct link to this comment

https://au.mathworks.com/matlabcentral/answers/562220-fit-a-curve-on-scatter-data-main-behaviuor#comment_931796

Sign in to comment.