# How to use chi2gof within CUPID

Sim on 22 Jun 2023
Commented: Sim on 26 Jun 2023
[The same question on the CUPID GitHub]
Here are two examples of using MATLAB's chi-square goodness-of-fit test function (chi2gof):
First (comparing two frequency distributions):
Population = [996, 749, 370, 53, 9, 3, 1, 0];
Sample = [647, 486, 100, 22, 0, 0, 0, 0];
Population2 = [996, 749, 370, sum(Population(4:8))];
Sample2 = [647, 486, 100, sum(Sample(4:8))];
x = [];
for i = 1:length(Sample2)
x = [x,i*ones(1,Sample2(i))];
end
edges = .5+(0:length(Sample2));
[h,p,k] = chi2gof(x,'Expected',Population2,'Edges',edges)
Second (fit a distribution to data):
bins = 0:5;
obsCounts = [6 16 10 12 4 2];
n = sum(obsCounts);
pd = fitdist(bins','Poisson','Frequency',obsCounts');
expCounts = n * pdf(pd,bins);
[h,p,st] = chi2gof(bins,'Ctrs',bins,...
'Frequency',obsCounts, ...
'Expected',expCounts,...
'NParams',1)
But how can I use the chi2gof function within CUPID?
Below is an example where I would like to use MATLAB's chi2gof function:
% (1) create a "truncated dataset"
pd = makedist('Weibull','a',3,'b',5);
t = truncate(pd,3,inf);
data_trunc = random(t,10000,1);
% (2) fit a distribution (in this case the "Weibull2") to the "truncated dataset"
fittedDist = TruncatedXlow(Weibull2(2,2),3);
% (3) estimate the Weibull parameters by maximum likelihood, allowing for the truncation.
fittedDist.EstML(data_trunc);
% (4) plot both the "truncated dataset" (as a histogram) and the "fitted distribution"
% (in this case the "Weibull2" with parameters estimated by maximum likelihood)
figure
xgrid = linspace(0,100,1000)';
histogram(data_trunc,100,'Normalization','pdf','facecolor','blue')
line(xgrid,fittedDist.PDF(xgrid),'Linewidth',2,'color','red')
xlim([2.5 6])

Jeff Miller on 23 Jun 2023
Yes, that is correct. The successive bin probabilities are the differences of the successive CDF values, and the expected number is the total N times the bin probability--just as you have computed it.
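Written out, the computation described in this comment is the usual Pearson chi-square setup. The symbols below are introduced here for illustration and are not CUPID or chi2gof names: F is the CDF of the fitted (truncated) distribution, e_1 < ... < e_{k+1} are the bin edges, N is the sample size, O_i and E_i are the observed and expected counts in bin i, and m is the number of parameters estimated from the data.

```latex
\begin{align}
  E_i &= N \,\bigl[ F(e_{i+1}) - F(e_i) \bigr], \qquad i = 1, \dots, k \\
  \chi^2 &= \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}, \qquad
  \mathrm{df} = k - 1 - m
\end{align}
```

Here k is the number of bins that remain after chi2gof pools bins whose expected count falls below its 'EMin' threshold (5 by default), which is why the reported df can be smaller than the number of bins requested.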
Sim on 23 Jun 2023
Thanks a lot @Jeff Miller, very kind!! :-)
Sim on 26 Jun 2023
I accepted @Jeff Miller's answer
"Yes, that is correct. The successive bin probabilities are the differences of the successive CDF values, and the expected number is the total N times the bin probability--just as you have computed it."
since it confirms what I showed in my Answer (please see my two examples called "Test 1" and "Test 2"):
"I might have found a solution that makes sense to me and gives me what I would expect, even though I am not 100% sure it is correct... maybe, experts of CUPID and chi2gof might tell me if this is correct.... Test 1.... Test 2....."

Sim on 22 Jun 2023
Edited: Sim on 22 Jun 2023
I might have found a solution that makes sense to me and gives me what I would expect, even though I am not 100% sure it is correct... maybe, experts of CUPID and chi2gof might tell me if this is correct:
Test 1: I produce an artificial set of data following a distribution (A) and fit those data with the same distribution (A)
% (1) create a "truncated dataset"
pd = makedist('Exponential','mu',1); % <-- dataset following a distribution (A)
whereToTruncate = 2;
t = truncate(pd,whereToTruncate,inf);
data_trunc = random(t,10000,1);
% (2) fit a distribution to the "truncated dataset"
fittedDist = TruncatedXlow(Exponential(1),whereToTruncate); % <-- fitting distribution (A)
% (3) estimate the distribution parameters by maximum likelihood, allowing for the truncation.
fittedDist.EstML(data_trunc);
% (4) plot both the "truncated dataset" (as a histogram) and the "fitted distribution"
figure
xgrid = linspace(0,10,1000)';
num_bins = 50;
hold on
histogram(data_trunc,num_bins,'Normalization','pdf','facecolor','blue')
line(xgrid,fittedDist.PDF(xgrid),'Linewidth',2,'color','red')
hold off
xlim([0 7])
% (5) calculate the Chi-square goodness-of-fit test (chi2gof)
bin_edges = linspace(min(data_trunc), max(data_trunc), num_bins+1);
expected_values = numel(data_trunc) * diff(fittedDist.CDF(bin_edges));
[h,p,st] = chi2gof(data_trunc, 'Expected', expected_values) % Output Test 1
h =
     0
p =
    0.55248
st =
  struct with fields:
    chi2stat: 21.469
          df: 23
       edges: [2.0001 2.2661 2.5321 2.7982 3.0642 3.3302 3.5963 3.8623 4.1283 4.3944 4.6604 4.9264 5.1925 5.4585 5.7245 5.9906 ]
           O: [2368 1798 1344 1107 810 594 442 333 294 212 165 116 113 68 53 37 33 28 15 15 18 11 5 21]
           E: [2348.7 1797.1 1375 1052 804.95 615.89 471.24 360.56 275.87 211.08 161.5 123.57 94.548 72.341 55.351 42.35 32.404 ]
Test 2: I produce an artificial set of data following a distribution (A) and fit those data with a different distribution (B)
% (1) create a "truncated dataset"
pd = makedist('Exponential','mu',1); % <-- dataset following a distribution (A)
whereToTruncate = 2;
t = truncate(pd,whereToTruncate,inf);
data_trunc = random(t,10000,1);
% (2) fit a distribution to the "truncated dataset"
fittedDist = TruncatedXlow(Normal(0,1),whereToTruncate); % <-- fitting distribution (B)
% (3) estimate the distribution parameters by maximum likelihood, allowing for the truncation.
fittedDist.EstML(data_trunc);
% (4) plot both the "truncated dataset" (as a histogram) and the "fitted distribution"
figure
xgrid = linspace(0,10,1000)';
num_bins = 50;
hold on
histogram(data_trunc,num_bins,'Normalization','pdf','facecolor','blue')
line(xgrid,fittedDist.PDF(xgrid),'Linewidth',2,'color','red')
hold off
xlim([0 7])
% (5) calculate the Chi-square goodness-of-fit test (chi2gof)
bin_edges = linspace(min(data_trunc), max(data_trunc), num_bins+1);
expected_values = numel(data_trunc) * diff(fittedDist.CDF(bin_edges));
[h,p,st] = chi2gof(data_trunc, 'Expected', expected_values) % Output Test 2
h =
     1
p =
    6.4417e-116
st =
  struct with fields:
    chi2stat: 628.59
          df: 26
       edges: [2.0001 2.1895 2.3789 2.5682 2.7576 2.947 3.1364 3.3258 3.5152 3.7046 3.8939 4.0833 4.2727 4.4621 4.6515 4.8409 ]
           O: [1742 1409 1198 959 798 699 561 463 391 295 266 205 162 135 114 102 86 73 56 51 39 30 22 18 16 20 90]
           E: [1386.2 1248.4 1114.2 985.49 863.77 750.27 645.8 550.88 465.67 390.1 323.84 266.42 217.2 175.48 140.5 111.47 87.65 ]
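
A possible variation on step (5), offered as a sketch rather than a confirmed recipe: pass the same bin_edges to chi2gof through 'Edges', so the observed counts use exactly the bins from which expected_values was computed, and pass 'NParams' so the degrees of freedom account for the estimated parameters (1 for the exponential in Test 1, 2 for the normal in Test 2). To stay runnable without CUPID, the snippet assumes only the Statistics and Machine Learning Toolbox and replaces fittedDist.EstML with the closed-form estimate for a lower-truncated exponential; mu_hat and fitted are names introduced here for illustration.

```matlab
% Sketch: chi2gof with explicit bin edges for a lower-truncated exponential.
% Assumption: Statistics and Machine Learning Toolbox only (no CUPID objects).
rng(1);                                   % reproducible sample
whereToTruncate = 2;
pd = makedist('Exponential','mu',1);
t = truncate(pd, whereToTruncate, inf);
data_trunc = random(t, 10000, 1);

% For an exponential truncated from below, the ML estimate of mu is
% mean(x) minus the cutoff (memoryless property), so no optimizer is needed.
mu_hat = mean(data_trunc) - whereToTruncate;
fitted = truncate(makedist('Exponential','mu',mu_hat), whereToTruncate, inf);

% Same binning as in the tests above: equal-width bins from min to max.
num_bins = 50;
bin_edges = linspace(min(data_trunc), max(data_trunc), num_bins+1);

% Expected count per bin = N times the difference of successive CDF values.
expected_values = numel(data_trunc) * diff(cdf(fitted, bin_edges));

% Pass the edges explicitly and declare one estimated parameter.
[h,p,st] = chi2gof(data_trunc, 'Edges', bin_edges, ...
    'Expected', expected_values, 'NParams', 1);
```

With CUPID, the line computing expected_values would use fittedDist.CDF(bin_edges) as in the tests above; the chi2gof call itself would stay the same.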