# Evaluation Criteria for Missing Data Imputation Techniques

Tiago Dias on 28 Jun 2018
Answered: Tiago Dias on 5 Jul 2018
Hello,
I have 5 methods for missing data imputation, because my original data set has missing values (it is industrial data). To perform a PCA and get positive eigenvalues, I need the covariance matrix to be positive definite.
I used the 5 methods to impute the missing data, so now I have 5 new matrices X_imputed.
Question: How can I measure the performance of each one? What criteria should I use?
I read about calculating the RMSE, but the formula takes the square root of the mean squared difference between Xi observed and Xi imputed. The authors can do that calculation because their initial X is complete and they introduce a percentage of missing data (MD) themselves. My problem is that I already start with missing data.

Jeff Miller on 4 Jul 2018
You can't evaluate the performance of the different imputation methods with respect to your actual data set, for exactly the reason you mention. You can only compare their performance across simulations where you know the true value of each missing point (i.e., your simulation pretends that some simulated points are missing). Such a simulation would require very detailed assumptions about the multivariate situation that your data came from, including the reasons why some points are missing.
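A minimal sketch of such a simulation, under strong assumptions: the data are synthetic correlated Gaussians, values go missing completely at random, and the imputation is a plain column-mean fill that only stands in for whichever method is being scored. The names `Xtrue`, `mask`, `missFrac` and `Ximp` are illustrative, not from the original post.

```matlab
% Score one imputation method on simulated data where the truth is known.
rng(0);                                   % reproducible random numbers
n = 200; p = 4;
A = [1 0.5 0 0; 0 1 0.5 0; 0 0 1 0.5; 0 0 0 1];
Xtrue = randn(n, p) * A;                  % complete synthetic data (assumption)

missFrac = 0.10;                          % fraction of entries to hide (assumption)
mask = rand(n, p) < missFrac;             % true where a value is hidden (MCAR assumption)
Xmiss = Xtrue;
Xmiss(mask) = NaN;

% Placeholder imputation: fill each hidden entry with its column mean.
colMeans = mean(Xmiss, 1, 'omitnan');
fillVals = repmat(colMeans, n, 1);
Ximp = Xmiss;
Ximp(mask) = fillVals(mask);

% RMSE over the hidden entries only, since that is where imputation acted.
rmse = sqrt(mean((Xtrue(mask) - Ximp(mask)).^2));
fprintf('RMSE on hidden entries: %.4f\n', rmse);
```

Repeating this for each of the 5 methods with the same `mask`, and over several random masks, would give comparable scores. They are only as trustworthy as the assumptions about the data and the missingness mechanism.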
It might be better to perform the PCA without imputing any missing data (check the pca documentation). Did you try this?
coeff = pca(X,'Rows','pairwise');
This essentially computes each entry in the covariance matrix using whichever of your original data rows/cases have values for both relevant variables.
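To see what a pairwise estimate looks like and whether it meets the positive-definite requirement, here is a rough sketch using the base MATLAB `cov` function with the `'partialrows'` option. It is similar in spirit to the pairwise option of `pca`, but I am not claiming it is exactly what `pca` does internally. The data and missing pattern are made up. A pairwise covariance matrix is not guaranteed to be positive definite, so the eigenvalues are checked explicitly.

```matlab
% Pairwise covariance on data with NaNs, then a positive-definiteness check.
rng(1);
X = randn(50, 3);                         % made-up data (assumption)
X(rand(size(X)) < 0.2) = NaN;             % made-up missing pattern (assumption)

% Each entry C(i,j) uses only the rows where both columns i and j are observed.
C = cov(X, 'partialrows');

% If two variables never occur together, C contains NaN and eig cannot be used.
if any(isnan(C(:)))
    error('Some covariances cannot be estimated from the available rows.');
end

ev = eig((C + C') / 2);                   % symmetrize to guard against round-off
isPosDef = all(ev > 0);
fprintf('Smallest eigenvalue: %.4g, positive definite: %d\n', min(ev), isPosDef);
```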
##### 2 Comments
Jeff Miller on 5 Jul 2018
Sorry, I do not know whether your suggestion is reasonable or not.
If the data do not even allow the covariances to be estimated, then you probably don't have enough data to decide which is the best imputation method or to do PCA afterwards.
Can you select out a subset of the variables for which you can get a complete set of covariances? You might just do PCA on this subset.

Tiago Dias on 5 Jul 2018
I can't really make a subset, because all variables have missing data. But I found an article where they compute the residuals X (with MD) - Ximputed only for the entries (i,j) that are observed in X, so I will go that way.
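A rough sketch of that residual calculation, assuming the imputation method returns a full reconstruction `Xhat` that can differ from X at the observed entries as well (for example a model-based fit). If a method leaves the observed values untouched, these residuals are all zero and say nothing about the method. The column-mean reconstruction below is only a hypothetical placeholder, and the data are made up.

```matlab
% Residuals between X and a reconstruction, restricted to observed entries.
rng(2);
X = randn(30, 3);                         % made-up data (assumption)
X(rand(size(X)) < 0.15) = NaN;            % made-up missing pattern (assumption)
obs = ~isnan(X);                          % true where X has a value

% Placeholder reconstruction: every entry replaced by its column mean.
Xhat = repmat(mean(X, 1, 'omitnan'), size(X, 1), 1);

R = X - Xhat;                             % NaN wherever X is missing
rmseObs = sqrt(mean(R(obs).^2));          % RMSE over the observed (i,j) only
fprintf('RMSE on observed entries: %.4f\n', rmseObs);
```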