Mean and 3-sgima for Lognormal distributions

12 views (last 30 days)
Hello,
I am trying to plot the lognormal distribution over 10 iterations and would like to see the mean and 3 sigma outliers. Somehow, doing this for lognormal plots does not look easy. Also, is it possible that I can skip the histogram in my plots and only look at the lognormal curves? This is the code snippet I am using. Thanks!
figure(113);
for i = 1:10
norm=histfit(Current_c(i,:),10,'lognormal'); %%Current_c is a 10*100 matrix %%
[muHat, sigmaHat] = lognfit(Current_c(i,:));
% Plot bounds at +- 3 * sigma.
lowBound = muHat - 3 * sigmaHat;
highBound = muHat + 3 * sigmaHat;
yl = ylim;
%line([lowBound, lowBound], yl, 'Color', [0, .6, 0], 'LineWidth', 3);
%line([highBound, highBound], yl, 'Color', [0, .6, 0], 'LineWidth', 3);
line([muHat, muHat], yl, 'Color', [0, .6, 0], 'LineWidth', 3);
grid on;
set(gcf, 'Toolbar', 'none', 'Menu', 'none');
% Give a name to the title bar.
set(gcf, 'Name', 'Line segmentation', 'NumberTitle', 'Off')
hold on;
end

Accepted Answer

John BG
John BG on 1 Jun 2016
Edited: John BG on 1 Jun 2016
ok no worries
to visualize the mean, sigma and 3 sigma on the curve of the probability distribution that approximates your data you have to consider that the bins of the histogram and too wide to directly get accurate marks on the curve, let me explain:
If you try
find(y==my)
=
Empty matrix: 1-by-0
no match
But you can find the histogram bin where the mean is contained in:
find(y>my)
=
Columns 1 through 14
3 4 5 6 7 8 9 10 11 12 13 14 15 16
Columns 15 through 28
17 18 19 20 21 22 23 24 25 26 27 28 29 30
Columns 29 through 31
31 32 33
find(y<=my)
=
Columns 1 through 14
1 2 34 35 36 37 38 39 40 41 42 43 44 45
Columns 15 through 28
46 47 48 49 50 51 52 53 54 55 56 57 58 59
Columns 29 through 42
60 61 62 63 64 65 66 67 68 69 70 71 72 73
Columns 43 through 56
74 75 76 77 78 79 80 81 82 83 84 85 86 87
Columns 57 through 69
88 89 90 91 92 93 94 95 96 97 98 99 100
Let be ny the reference vector of y:
ny=[1:1:length(y)]
then the mean value is somewhere within the interval
[33 34]
So, to accurately plot my and 3*sy you first have to decide how accurate you need be.
While the y step is, and it varies along the curve
abs(y(33)-y(34))
ans =
0.258998840336178
the x step is constant
abs(x(33)-x(34))
=
0.011562155305573
mean(diff(x))
= 0.011562155305573
max(diff(x))
= 0.011562155305573
min(diff(x))
= 0.011562155305573
So if you decide that 2 decimals precision (on y) is enough, we have to refine x small enough so that at least one point of y interpolated has the value my truncated after 2nd decimal:
3.02xxxx
for 0.2589 go down to 0.0001 the y step has to be fractioned at least
0.2589/dy=0.0001 hence dy=258.9 , let's take dy=259
the angle
alpha=atand(abs(y(33)-y(34))/abs(x(33)-x(34)))
=
87.443914593184061
since
dx=.0001/atand(alpha)
=
1.119259323130124e-06
x_step=abs(x(33)-x(34))/dx
=
1.033018449490165e+04
make it
x_step=10331
then
x2=linspace(x(1),x(end),x_step);
y2=interp1(x,y,x2);
overlap, check both y and y2 are the same:
figure(1);plot(x,y,x2,y2);grid on;grid minor
where in y2 is my located?
find(y2>my)
=
..
..
3396 3397 3398 3399 3400 3401 3402
Columns 3228 through 3234
3403 3404 3405 3406 3407 3408 3409
Columns 3235 through 3238
3410 3411 3412 3413
x_mean=max(find(y2>my))
the actual mean value is going to be approximated with y2(3413) = 3.021925056586602
and this error is acceptable
abs(my-y2(3413))
=
1.358549030059386e-04
put what looks like the mean on the curve:
hold on;plot(x2(x_mean),y2(x_mean),'bo')
. . but when checking
sum(y)
=
3.021789201683596e+02
length(y)
=
100
mean(y)
=
3.021789201683596
and already found mean located between [33 34]
the upper tail cannot even accommodate half single sigma (34.1%)
sum(y([33:end]))/sum(y)*100
=
13.483260209259832
Let's find mu summing samples:
pc_target=50
n=2
pc=sum(y([1:n]))/sum(y)*100
while pc<pc_target
n=n+1
pc=sum(y([1:n]))/sum(y)*100
end
..
..
n = 14
pc = 42.855348381237192
n = 15
pc = 46.872419796584815
n = 16
pc = 50.701800361715335
So it turns out mu is within [15 16]
the x boundaries for y~mu
x([15 16])
=
0.201607447590281 0.213169602895854
Using y2 for more detail
pc_target=50
n=2
pc=sum(y2([1:n]))/sum(y2)*100
while pc<pc_target
n=n+1
pc=sum(y2([1:n]))/sum(y2)*100
end
..
..
n = 1600
pc = 49.982072607170089
n = 1601
pc = 50.018166984408886
for y2, mu is within [1600 1601]
n_mu=1600
and x2 boundaries
x2([1600 1601])
ans =
0.216920307874458 0.217031116526467
n_mu=1600
the upper y2 boundary for 1 sigma (+34.1%)
pc_target=34.1
n=1
pc=sum(y2([1600:1600+n]))/sum(y2)*100
while pc<pc_target
n=n+1
pc=sum(y2([1600:1600+n]))/sum(y2)*100
end
..
..
n =
1475
pc =
34.089262560230956
n =
1476
pc =
34.101853576106748
the numeral distance for +1sigma is
n_up_1sigma=1475;
And the location on x2 of +1sigma is:
x2_up_1sigma=x2(1600+n_up_1sigma)
x2_up_1sigma =
0.380363069587555
the lower y2 boundary for 1 sigma (-34.1%)
pc_target=34.1
n=1
pc=sum(y2([1600-n:1600]))/sum(y2)*100
while pc<pc_target
n=n+1
pc=sum(y2([1600-n:1600]))/sum(y2)*100
end
..
..
n=
842
pc =
34.078560325462846
n =
843
pc =
34.117107246027864
the numeral distance for -1sigma is
n_down_1sigma=842;
And the location on x2 of -1sigma is:
x2_down_1sigma=x2(1600-n_down_1sigma)
=
0.123619422882982
Repeating for 3 sigma, understanding that all +- sigma interval cover 89%, then up it is:
pc_target=44.5
n=1
pc=sum(y2([1600:1600+n]))/sum(y2)*100
while pc<pc_target
n=n+1
pc=sum(y2([1600:1600+n]))/sum(y2)*100
end
..
..
n =
2836
pc =
44.496938162892050
n =
2837
pc =
44.501155111466957
n_up_3sigma=2836;
x2(n_up_3sigma)
=
0.353879801757433
and down we go:
pc_target=44.5
n=1
pc=sum(y2([1600-n:1600]))/sum(y2)*100
while pc<pc_target
n=n+1
pc=sum(y2([1600-n:1600]))/sum(y2)*100
end
..
..
n =
1161
pc =
44.484883787669325
n =
1162
pc =
44.509871734900422
n_down_3sigma=1161;
x2(n_down_3sigma)
ans =
0.168275309642560
and the plot you asked for:
plot(x2,y2,x2(1600),y2(1600),'ro',...
x2(n_mu+n_up_1sigma),y2(n_mu+n_up_1sigma),'rd',x2(n_mu-n_down_1sigma),y2(n_mu-n_down_1sigma),'rd',...
x2(n_mu+n_up_3sigma),y2(n_mu+n_up_3sigma),'gd',x2(n_mu-n_down_3sigma),y2(n_mu-n_down_3sigma),'gd')
grid on;grid minor;
additional comments:
1. do not mix lognormal mu sigma and normal mu sigma parameters: mind the notation gap. Your x is a reference vector and your y is the actual distribution.
However, in
Y=lognpdf(X,mu,sigma)
MATLAB function lognpdf calculates the lognormal distribution Y out of the normal distribution X, where X has mean mu and standard variance sigma. In lognpdf help page, both Y and X are random variables, data, each with their respective reference vectors.
As explained in lognpdf help page, the 'mu' and 'sigma' of Y are related to those belonging to X by:
m=exp(mu+sigma^2/2)
v=exp(2*mu+sigma^2)*(exp(sigma^2)-1)
and
mu=log(m^2/(v+m^2)^.5)
sigma=(log(v/m^2)+1)^.5
2.- while this one got it right http://keisan.casio.com/exec/system/1180573214
this one may be having same problem of displaced mean for assymetric distribution https://667fdf68dbe4f0e67f750fedc96d9bd17eba3000.googledrive.com/host/0B2GQktu-wcTiVFpoRThHN3Z3cUE/
bankers seem to have a lot of interest on lognormal distributions
would you mind to both vote on the thumbs-up and
mark my answer as accepted?
thanks in advance
John
  1 Comment
Jan
Jan on 24 Feb 2017
Edited: Jan on 25 Feb 2017
John BG has accepted this answer by his own. According to his comment it is not clear, if it was useful.

Sign in to comment.

More Answers (2)

John BG
John BG on 23 May 2016
Edited: John BG on 1 Jun 2016
histfit with option dist does not really mean that the data has a lognormal distribution.
You expect it to be lognormal, so you tell histfit to fit a lognormal curve as reasonably tight as possible, over the bins of the histogram of the data you feed histfit with.
So, you feed your data in histfit, and with
rng default;
b = betarnd(3,10,100,1);
h1=histfit(b,20,'lognormal')
b_histogram=h1(1)
b_pdf_lognormal_fit=h1(2)
you catch the curve of the probability density function that histfit has applied
b_histogram contains the histogram, but you want the pdf:
b_pdf_lognormal_fit =
Line with properties:
Color: [1.00 0 0]
LineStyle: '-'
LineWidth: 2.00
Marker: 'none'
MarkerSize: 6.00
MarkerFaceColor: 'none'
XData: [1x100 double]
YData: [1x100 double]
ZData: [1x0 double]
Show all properties
AlignVertexCenters: 'off'
Annotation: [1x1 matlab.graphics.eventdata.Annotation]
BeingDeleted: 'off'
BusyAction: 'queue'
ButtonDownFcn: ''
Children: [0x0 GraphicsPlaceholder]
Clipping: 'on'
Color: [1.00 0 0]
CreateFcn: ''
DeleteFcn: ''
DisplayName: ''
HandleVisibility: 'on'
HitTest: 'on'
Interruptible: 'on'
LineJoin: 'round'
LineStyle: '-'
LineWidth: 2.00
Marker: 'none'
MarkerEdgeColor: 'auto'
MarkerFaceColor: 'none'
MarkerSize: 6.00
Parent: [1x1 Axes]
PickableParts: 'visible'
Selected: 'off'
SelectionHighlight: 'on'
Tag: ''
Type: 'line'
UIContextMenu: [0x0 GraphicsPlaceholder]
UserData: []
Visible: 'on'
XData: [1x100 double]
XDataMode: 'manual'
XDataSource: ''
YData: [1x100 double]
YDataSource: ''
ZData: [1x0 double]
ZDataSource: ''
let's focus on the probability distribution:
x=b_pdf_lognormal_fit.XData;
y=b_pdf_lognormal_fit.YData;
figure(2);
plot(x,y);
grid on;
the mean value is located somewhere between the 33rd and 34th elements of y
mean(y)
=
3.021789201683596
find(y>mean(y))
ans =
Columns 1 through 13
3 4 5 6 7 8 9 10 11 12 13 14 15
Columns 14 through 26
16 17 18 19 20 21 22 23 24 25 26 27 28
Columns 27 through 31
29 30 31 32 33
There are standard expressions for mean and variance of basic probability distributions. For lognormal I found, X be the data
  • mean: E(X)=expected(mu+sigma^2/2)
  • variance: V(X)=expected(2*mu+sigma^2/2)*(expected(sigma^2-1))
but I am not sure about this expression of variance, and since we have already obtained the pdf curve, we don't need to resort to cooking recipes, just apply
mean (1st statistical moment )
my=mean(y)
=
3.021789201683596
variance (2nd statistical moment)
vy=var(y)
=
17.421143231291438
standard deviation, the sigma you are after, from
is:
sy=(mean((y-my).^2)).^.5
=
4.152942547035599
standard deviation (your sigma): .. the standard deviation represents the root mean square value of the random variable around its mean ..
you may want to find the corresponding x coordinates, I can assist on this too.
If you find this answer of any help solving your question,
please click on the thumbs-up vote link,
thanks in advance
John
  2 Comments
Kash022
Kash022 on 1 Jun 2016
@John: thanks a lot for the great response! Yes, the curves now seem to come out much better...however, a couple of things, 1. Yes, I would like to know the corresponding x values as they will tell me the actual mean of my variable and the 3-sgima distributions.. 2. It would be great if we could see these (mean and 3sigma) on the histogram plot! Thanks!
Kash022
Kash022 on 1 Jun 2016
Edited: Kash022 on 1 Jun 2016
@John..please have a look at the attached image showing the lognormal plot for my distribution..for this I need the mean and 3 sigma values and if possible, it could be shown on the graph. Thanks! Please note...x axis has value in the xE-10 range

Sign in to comment.


Kash022
Kash022 on 2 Jun 2016
Sorry @John..your answer is way too big and confusing..honestly now I have no idea what to plot..would have been great if your answer would have been concise and to the point rather than so many lines where I seem to be lost..thanks for the effort anyways!
  4 Comments
John BG
John BG on 15 Jun 2016

I keep a copy of script lines only for some answers, for this one I didn't.

The key lines, for any probability distribution y over x, are the following ones:

1. find mu, where 50% probability is located:
      pc_target=50
      n=2
      pc=sum(y([1:n]))/sum(y)*100
      while pc<pc_target
      	n=n+1
      	pc=sum(y([1:n]))/sum(y)*100
      end

mind the gap: n is the vector reference for both y and x. You want x be the reference of y, so be it, but to integrate it is easier to use n= 1 2 3 .. do you understand this?

the loop stops on n=1601, mu is located on y(1600) that happens for x(1600)

2. for instance to find -sigma, -34% below mu:
      pc_target=34.1
      n=1
      pc=sum(y2([1600-n:1600]))/sum(y2)*100
      while pc<pc_target
      	n=n+1
      	pc=sum(y2([1600-n:1600]))/sum(y2)*100
      end

this loop stops when the integration value is 34% below the previously found n=1600 the 50%

I named the variable after 'percentage target'

Regards

John BG

jgb2012@sky.com

John BG
John BG on 12 Mar 2017
Hi Kash022
I have abridged the above explanation in a function that may be of use, attached to this comment: lognormal_pdf_123sigma_locations.m
call example, replace b with data:
b = betarnd(3,10,100,1); % simulated data, replace betarand() with real data
[mu0,sm1,sp1,sm2,sp2,sm3,sp3,y2,ny2]=lognormal_pdf_123sigma_locations(b)
this produces
1.
mu0: mean
sm1: sigma1 minus
sp1: sigma1 plus
sm2: sigma2 minus
sp2: sigma2 plus
sm3: sigma3 minus
sp3: sigma3 plus
the lognormal probability distribution density approximation to the data, the 10^-10 can be scaled, do not pass the Armstrong order of magnitude to this function, scale it accordingly so that the x, the reference vector has a similar range to that of a betarnd(3,10,100,1).
2.
couple figures showing data, pdf approximation, and the location of the requested points:
.
.
3.
Copy of the function
function [mu0,sm1,sp1,sm2,sp2,sm3,sp3,y2,ny2]=lognormal_pdf_123sigma_locations(b)
% mu0: mean
% sm1: sigma 1 minus
% sp1: sigma 1 plus
% sm2: sigma 2 minus
% sp2: sigma2 plus
% sm3: sigma 3 minus
% sp3: sigma 3 plus
% author: John Bofarull Guix, jgb2012@sky.com
% b = betarnd(3,10,100,1); % simulated data, replace betarand() with real data
figure(1);
% h1=histfit(b,20,'lognormal') % test
h1=histfit(b,20,'lognormal')
b_histogram=h1(1);grid on;
b_pdf_lognormal_fit=h1(2)
y=b_pdf_lognormal_fit.YData;
sumy=sum(y);y=y/sumy;
ny=b_pdf_lognormal_fit.XData;
N=100;
y2=interp(y,N);y2=y2/sum(y2);
ny2=[1:1:numel(y)]*N;
pc_target=50; % mu: distribution mean value
n=2
pc=sum(y2([1:n])) *100
while pc<pc_target
n=n+1;pc=sum(y2([1:n])) *100;
end
mu=n;
figure(2);plot(y2);hold all
figure(2);plot(mu,y2(n),'bo');grid on
pc_target=34; [s1_min,s1_plus]=go_get_it(pc_target,mu,y2); % +- sigma
figure(2);plot(s1_min,y2(s1_min),'ro');
figure(2);plot(s1_plus,y2(s1_plus),'ro');
pc_target=47.5; [s2_min,s2_plus]=go_get_it(pc_target,mu,y2); % +-2*sigma
figure(2);plot(s2_min,y2(s2_min),'ro')
figure(2);plot(s2_plus,y2(s2_plus),'ro');
pc_target=49.7;[s3_min,s3_plus]=go_get_it(pc_target,mu,y2); % +-3*sigma
figure(2);plot(s3_min,y2(s3_min),'ro');
figure(2);plot(s3_plus,y2(s3_plus),'ro');
mu0=mu;
sm1=s1_min;
sp1=s1_plus;
sm2=s2_min;
sp2=s2_plus;
sm3=s3_min;
sp3=s3_plus;
function [sdown,sup]=go_get_it(pc_target,mu,y2)
% support function finds t1 t2 such sum(y([t1:mu]))=sum(y([mu:t2]))
n=1;pc=sum(y2([mu-n:mu])) *100; % search - s
while pc<pc_target
n=n+1;pc=sum(y2([mu-n:mu])) *100;
end
sdown=mu-n;
% figure(h);
% plot(h,sdown,y2(sdown),'ro');hold all;
n=1;pc=sum(y2([mu:mu+n])) *100; % search + s
while pc<pc_target
n=n+1;pc=sum(y2([mu:mu+n])) *100 ;
end
sup=n+mu;
% figure(2);
% plot(h,sup,y2(sup),'ro');
end
end
4.
Explanation:
in last year conversation the generation of the pdf curve was limited to the same x range of the generated data, that introduced error caused by step not small enough.
Reproducing the same code used earlier on
pc_target=50; % mu: distribution mean value
n=2
pc=sum(y([1:n])) *100
while pc<pc_target
n=n+1;pc=sum(y([1:n])) *100;
end
mu=n;
figure(2);plot(y);hold all
figure(2);plot(mu,y(n),'bo');grid on
pc_target=34; % -sigma
n=1;
pc=sum(y([mu-n:mu])) *100;
while pc<pc_target
n=n+1;pc=sum(y([mu-n:mu])) *100;
end
sigma1_min=mu-n;
figure(2);plot(sigma1_min,y(sigma1_min),'ro');
pc_target=34 ; % +sigma
n=1;
pc=sum(y([mu:mu+n])) *100
while pc<pc_target
n=n+1;pc=sum(y([mu:mu+n])) *100 ;
end
sigma1_plus=n+mu;
figure(2);plot(sigma1_plus,y(sigma1_plus),'ro');
pc_target=47.5; % -2*sigma
n=1;
pc=sum(y([mu-n:mu])) *100;
while pc<pc_target
n=n+1;pc=sum(y([mu-n:mu])) *100;
end
sigma2_min=mu-n;
figure(2);plot(sigma2_min,y(sigma2_min),'go');
pc_target=47.5 ; % +2*sigma
n=1;
pc=sum(y([mu:mu+n])) *100
while pc<pc_target
n=n+1;pc=sum(y([mu:mu+n])) *100 ;
end
sigma2_plus=n+mu;
figure(2);plot(sigma2_plus,y(sigma2_plus),'go');
pc_target=49.7; % -3*sigma
n=1;
pc=sum(y([mu-n:mu])) *100;
while pc<pc_target
n=n+1;pc=sum(y([mu-n:mu])) *100;
end
sigma3_min=mu-n;
figure(2);plot(sigma3_min,y(sigma3_min),'ko');
pc_target=49.7 ; % +3*sigma
n=1;
pc=sum(y([mu:mu+n])) *100
while pc<pc_target
n=n+1;pc=sum(y([mu:mu+n])) *100 ;
end
sigma3_plus=n+mu;
figure(2);plot(sigma3_plus,y(sigma3_plus),'ko');
.
the points are not wrongly located from the data point of view, the data is not accurate enough, and the rectangles approximating the integration are too wide.
.
.
checks
sum(y([mu:sigma1_plus]))
sum(y([sigma1_min:mu]))
sum(y([mu:sigma2_plus]))
sum(y([sigma2_min:mu]))
sum(y([mu:sigma3_plus]))
sum(y([sigma3_min:mu]))
sum(y([sigma3_plus:end]))
sum(y([1:sigma3_min]))
ans =
0.343910033120536
ans =
0.376779316893812
ans =
0.475739109483067
ans =
0.486722066501548
ans =
0.499152952111189
ans =
0.498547671119994
ans =
0.034710670665522
ans =
0.020295937115606
the first sum seems ok, 34% but the second should be 34% too, the 4th should be 47% and the last 2 tails should be 2.1% according to
.
.
hope it helps, and having a look to the other question you mentioned
John BG

Sign in to comment.

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!