Transforming a right skewed data set to normal
I am attempting to fit an ARIMA model to a set of data. The issue is that I cannot get a good fit because the data set follows a Weibull distribution, and when I transform the data so that it follows a normal distribution, a second peak emerges. So far I have tried a square root, cube root, natural log, log10, log2, and log(x/(1-x)). Figure 1 is the raw data before any transform.
Answers (2)
Adam Danz
on 19 Mar 2019
Edited: Adam Danz
on 19 Mar 2019
Have you tried fitting the data to a Weibull distribution? MATLAB's wblfit() returns the maximum likelihood estimates of the parameters of the Weibull distribution that best fits your data.
[Updated] Here's a demo
% Create example data (scale = 8, shape = 2)
data = wblrnd(8,2,1000,1);
% Fit a Weibull distribution; parmhat = [scale, shape], parmci = 95% confidence intervals
[parmhat, parmci] = wblfit(data);
% Plot a histogram of the data
figure
h = histogram(data);
hold on
% Evaluate the fitted pdf at the sorted data and scale its peak to the tallest bin
x = sort(data);
Y = wblpdf(x, parmhat(1), parmhat(2));
yScaled = Y * (1/max(Y)) * max(h.Values);
% Plot the scaled pdf (it should overlap with the histogram)
plot(x, yScaled, 'r-', 'LineWidth', 3)
legend('Your data', 'scaled pdf')
19 Comments
Adam Danz
on 19 Mar 2019
Edited: Adam Danz
on 19 Mar 2019
The updated distribution doesn't look as much like a Weibull distribution as the mistaken one did. If your data should come from a Weibull distribution because of the principles behind your data collection, then you can use these methods to do the fitting. But the updated plot doesn't look like a Weibull distribution. It doesn't look normal either, because of the rightward tail. The skew doesn't look strong enough to be fixed by a log transform, but you could at least try it (with low expectations).
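One quick way to judge whether a log transform helps is to compare the sample skewness before and after it. The sketch below is only an illustration: the data are simulated from a lognormal distribution (an assumption, not the poster's data), and it relies on lognrnd and skewness from the Statistics and Machine Learning Toolbox.

```matlab
% Sketch: compare sample skewness before and after a log transform.
% x is simulated here purely for illustration; substitute your own positive data.
rng(1);                          % make the example reproducible
x = lognrnd(0, 0.5, 1000, 1);    % hypothetical right-skewed, strictly positive data
skewRaw = skewness(x);           % positive for a rightward tail
skewLog = skewness(log(x));      % close to 0 if the log removes the skew
fprintf('skewness raw: %.2f, after log: %.2f\n', skewRaw, skewLog);
```

If the raw skew is only mild, the log can overshoot and leave a leftward tail instead, which would show up as a clearly negative skewLog.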
Your original question asks how to make a bimodal distribution more like a normal distribution, and that question made sense with the original example distribution, which appeared to be Weibullian with two peaks. But the updated example isn't Weibullian, nor does it have two peaks, so I've lost track of the goal. If you want your data to be more normal and less skewed, I'm sure a complicated transformation could be created, but what's the goal? Any distribution can be transformed into another, but that usually results in uninterpretable data.
Jeff Miller
on 20 Mar 2019
One very general two-step approach is to
- convert the original scores to percentiles within the original distribution
- replace each original score with the standard normal (z) score having the same percentile.
The ARIMA model will then predict z scores, and you can convert back to the original scores by reversing the steps (i.e., find the percentile of the predicted z score and then find the original score at that percentile).
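The two steps and their reversal can be sketched as follows. This is a minimal illustration, not a definitive implementation: the data x and the predictions zPred are made up, the ARIMA fit itself is omitted, and the reverse step uses the empirical quantiles of x, which only approximately inverts the forward mapping. It assumes tiedrank, norminv, normcdf and quantile from the Statistics and Machine Learning Toolbox.

```matlab
% Sketch of the two-step percentile transform and its reversal.
% x is simulated here; replace it with your own NaN-free series.
rng(0);
x = wblrnd(8, 2, 500, 1);        % hypothetical right-skewed data

% Step 1: percentile of each score within the sample (rank / (n+1))
n = numel(x);
p = tiedrank(x) / (n + 1);

% Step 2: standard normal (z) score having the same percentile
z = norminv(p);

% ... fit the ARIMA model to z and obtain predicted z scores ...
zPred = [-1.5; 0; 0.7];          % placeholder predictions, for illustration only

% Reverse: percentile of each predicted z, then the original score at that percentile
pPred = normcdf(zPred);
xPred = quantile(x, pPred);      % empirical quantiles of the original data
```

Because xPred is read off the empirical distribution of x, predictions can never fall outside the observed range of the data.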
2 Comments
Jeff Miller
on 20 Mar 2019
I don't really understand this very well, but here are some more comments that might help.
First, you have to get rid of the NaNs before you even start, don't you? I'm not too familiar with ARIMA models, but I wouldn't think they would allow NaNs. And even if they did, I'm not sure how NaNs would fit into a normal distribution.
Second, once you get Final_testWithoutNans, I think you should get the percentile scores for each data value more precisely. Use this technique at Stack Overflow to rank the values in Final_testWithoutNans (use the method that allows for ties if you have them). Then divide the ranks by numel(Final_testWithoutNans)+1 to get the percentile values of each point. The +1 avoids percentile values of 1, which give you those pesky infinite z values, and it's the right thing to do anyway.
After you've got the percentile values this way you can convert those back to z scores with norminv or erfinv. At this point you might take a look at the histogram of those z scores and make sure it looks normal. It must if you have no ties, but if there are lots of ties it might not. Anyway, if this plot of z's doesn't look normal (e.g., you might have a whole bunch of scores tied at the maximum value, which would never happen in a normal distribution), then you can be sure that you will never find any other transformation of your original data that does look normal.
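The steps described in this comment might look roughly like the sketch below. The input vector Final_test is a small made-up stand-in (only the name Final_testWithoutNans appears in the thread), and tiedrank from the Statistics and Machine Learning Toolbox is used here in place of the linked ranking technique; it gives tied values their average rank.

```matlab
% Sketch: NaN removal, tie-aware ranks, and z scores via norminv or erfinv.
% Final_test is a made-up vector standing in for the real series.
Final_test = [3.1; NaN; 0.4; 2.2; 2.2; NaN; 7.9; 1.0];

% Drop the NaNs before doing anything else
Final_testWithoutNans = Final_test(~isnan(Final_test));

% Rank the values; tied values share their average rank
r = tiedrank(Final_testWithoutNans);

% Percentiles strictly between 0 and 1, so no infinite z values
p = r / (numel(Final_testWithoutNans) + 1);

% z scores by two equivalent routes
zNorminv = norminv(p);                  % Statistics and Machine Learning Toolbox
zErfinv  = sqrt(2) * erfinv(2*p - 1);   % base MATLAB
```

Calling histogram(zNorminv) on the real data then gives the visual check for normality; a spike at one end would point to many tied values.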