Transforming a right skewed data set to normal

I am attempting to fit an ARIMA model to a set of data. The issue is that I cannot get a good fit because the data follow a Weibull distribution, and when I attempt to transform the data so that they follow a normal distribution, a second peak emerges. So far I have tried a square root, cube root, natural log, log10, log2, and log(x/(1-x)). Figure 1 is the raw data before any transform.

Answers (2)

Adam Danz on 19 Mar 2019
Edited: Adam Danz on 19 Mar 2019
Have you tried fitting the data to a Weibull distribution? MATLAB's wblfit() returns the maximum likelihood estimates of the parameters that best fit the underlying Weibull distribution of your data.
You could then use wblpdf() to plot the results and compare them to your data's distribution.
[Updated] Here's a demo
% Create data
data = wblrnd(8,2,1000,1);
% Do the fitting
[parmhat, parmci] = wblfit(data);
% Plot fitting
figure
h = histogram(data);
hold on
% Calculate pdf and scale it to your data
Y = wblpdf(sort(data),parmhat(1), parmhat(2));
yScaled = Y * (1/max(Y)) * max(h.Values);
% Plot scaled pdf (the pdf should overlap with the hist)
plot(sort(data), yScaled, 'r-', 'LineWidth', 3)
legend('Your data', 'scaled pdf')
[Figure: histogram of the example data with the scaled Weibull pdf overlaid]
  19 Comments
Adam Danz on 19 Mar 2019
Edited: Adam Danz on 19 Mar 2019
The updated distribution doesn't look as much like a Weibull distribution as the mistaken one did. If your data should come from a Weibull distribution because of the principles behind your data collection, then you can use these methods to do the fitting. But the updated plot doesn't look like a Weibull distribution. It doesn't look normal either, due to the rightward tail. The skew doesn't look strong enough to be fixed by a log transform either, but you could at least try it (with low expectations).
Your original question asks how to make a bimodal distribution more like a normal distribution and that question made sense with the original example distribution which appeared to be Weibullian with 2 peaks. But the updated example isn't Weibullian nor does it have two peaks. So I've lost track of the goal. If you want your data to be more normal and less skewed, I'm sure there's a complicated transformation that could be created but what's the goal? Any distribution can be transformed into another but that usually results in uninterpretable data.
Michael Mueller on 19 Mar 2019
The goal is to take the current data set and make it normal. Applying any form of transform (log, sqrt, cube root, etc.) has created a bimodal distribution with varying degrees of skewness. The issue is that whatever I do to the data to make it normal, I need to be able to undo on the predicted values produced by the ARIMA model.



Jeff Miller on 20 Mar 2019
One very general two-step approach is to
  1. convert the original scores to percentiles within the original distribution
  2. replace each original score with the standard normal (z) score having the same percentile.
The ARIMA model will then predict z scores, and you can convert back to the original scores by reversing the steps (i.e., find the percentile of the predicted z score and then find the original score at that percentile).
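For concreteness, here is a minimal sketch of that round trip (assuming the Statistics and Machine Learning Toolbox; the Weibull sample and the zPred value are placeholders, not taken from the thread):
% Example right-skewed data (placeholder)
x = wblrnd(8, 2, 1000, 1);
% Step 1: percentile of each score within the original distribution
[xSorted, order] = sort(x);
n = numel(x);
pSorted = ((1:n)' - 0.5) / n;   % percentiles strictly inside (0,1)
p = zeros(n, 1);
p(order) = pSorted;
% Step 2: standard normal score with the same percentile
z = norminv(p);
% ... fit the ARIMA model to z; suppose it predicts zPred ...
zPred = 0.5;                    % placeholder predicted z score
% Reverse the steps: percentile of the predicted z, then the original-scale
% value at that percentile, read off the empirical distribution of x
pPred = normcdf(zPred);
xPred = interp1(pSorted, xSorted, pPred, 'linear', 'extrap');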
  2 Comments
Michael Mueller on 20 Mar 2019
Edited: Michael Mueller on 20 Mar 2019
I attempted this this morning. I obtained the percentile values as well as the z-scores; however, when I go to create my ARIMA model I get a warning message:
Warning: Error in calculation of parameter covariance matrix. Matrix of NaN's returned.
> In arima/estimate (line 1137)
The current code used to generate the transform is as follows:
Test = percentile(Final_test,Final_test);
z = @(Test) -sqrt(2) * erfcinv(Test*2);
Zs = z(Test);
Where percentile is a user-defined function:
function x = percentile(datas, value)
perc = prctile(datas, 1:100);
x = zeros(length(value), 1);
for ii = 1:length(value)
    [c, index] = min(abs(perc' - value(ii)));
    x(ii) = (index + 1)./100;
end
end
Here Final_test is a 52561 x 1 double containing 52477 real values and 84 NaNs. Zs is also 52561 x 1 with 52477 real values and 84 NaNs. Of the 52477 real values, 26281 are negative and 632 are Inf. The values of Final_test corresponding to the 632 Inf values vary without repeating, all lying above a certain minimum specific to the data set, which I'll call "X".
The values of inf in Zs correspond to a value in Test of 1.
Using:
check = Final_test(Test == 1);
verify = find(Final_test >= min(check));
Here check contains the values of Final_test corresponding to Inf in Zs, which are also the values of Final_test corresponding to values of 1 in Test.
The output of verify is a 715 x 1 double, which makes me think there are values greater than X that produce values less than 1 in Test, and therefore z-scores in Zs that are not equal to Inf.
I am using a temporary workaround of replacing all values of 1 in Test with 0.9999, but is there a better, more accurate workaround?
Also, for the inverse, I am using normcdf to get the percentile and multiplying by 100. My thought to finish the transform was to use prctile(X,p), where p is the percentile and X is the data set. Should I be using the original data set for X, or should I be using a different function altogether?
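(For reference, a minimal sketch of the inverse mapping described here, consistent with the reversal step in the answer above; zPred is a placeholder predicted value, and prctile expects percentiles on a 0-100 scale and ignores NaNs:)
zPred = 0.5;                          % placeholder predicted z score
pPred = normcdf(zPred) * 100;         % percentile on a 0-100 scale
xPred = prctile(Final_test, pPred);   % original-scale value at that percentile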
Jeff Miller on 20 Mar 2019
I don't really understand this very well, but here are some more comments that might help.
First, you have to get rid of the NaNs before you even start, don't you? I'm not too familiar with ARIMA models, but I wouldn't think they would allow NaNs. And even if they did, I'm not sure how they would fit into a normal distribution.
Second, once you get Final_testWithoutNans, I think you should get the percentile scores for each data value more precisely. Use this technique at stackoverflow to rank the values in Final_testWithoutNans (use the method that allows for ties if you have them). Then divide the ranks by numel(Final_testWithoutNans)+1 to get the percentile values of each point. The +1 avoids values of 1, which gives you those pesky infinite z values, and it's the right thing to do anyway.
After you've got the percentile values this way, you can convert them to z scores with norminv or erfinv. At this point you might take a look at the histogram of those z scores and make sure it looks normal. It must if there are no ties, but if there are lots of ties it might not. Anyway, if this plot of z's doesn't look normal (e.g., you might have a whole bunch of scores tied at the maximum value, which would never happen in a normal distribution), then you can be sure that you will never find any other transformation of your original data that does look normal.
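A short sketch of those steps, using tiedrank as one way to get ranks that allow for ties (standing in for the linked ranking technique, which isn't shown here) and the variable names from the thread:
% Drop the NaNs first
Final_testWithoutNans = Final_test(~isnan(Final_test));
n = numel(Final_testWithoutNans);
% Rank the values (ties get the average rank), then divide by n+1 so the
% percentiles stay strictly inside (0,1) and no infinite z scores appear
ranks = tiedrank(Final_testWithoutNans);
p = ranks / (n + 1);
z = norminv(p);
% Sanity check: this should look roughly normal unless there are many ties
histogram(z)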


Release

R2018b
