A Study on Missing Data Management
Keywords:
UCI database, Missing At Random (MAR), Missing Completely At Random (MCAR), Missing Not At Random (MNAR), Multiple Imputation, Expectation Maximization with Bootstrap approach (EMB), Root Mean Square Error (RMSE)
Abstract
Missing data, a persistent problem in most scientific research, should be handled very carefully, because data play a vital role in every analysis. Mishandling missing values may distort the analysis or generate biased results. Valid and reliable models require good data preparation. Methodologists have proposed dozens of techniques to address the problem, and an appropriate method should be chosen for each particular study in order to achieve an efficient and valid analysis. In this study we discuss different methods of handling missing data and compare three imputation methods: Arithmetic Mean Imputation, Regression Imputation and Multiple Imputation using the EMB algorithm. The methods are applied to three data sets from the UCI repository under the assumption of MAR, with Root Mean Square Error (RMSE) as the evaluation criterion.
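As a rough illustration of the kind of comparison described above, the following sketch scores arithmetic mean imputation and single regression imputation by RMSE on synthetic data with MAR missingness. Everything in it is an assumption made for illustration: the synthetic data, the missingness probabilities, and the helper names (`make_data`, `mar_mask`, `mean_impute`, `regression_impute`) are hypothetical and are not taken from the study. Multiple Imputation with the EMB algorithm is not covered here.

```python
import math
import random


def rmse(true_vals, imputed_vals):
    """Root mean square error between the true and the imputed values."""
    n = len(true_vals)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(true_vals, imputed_vals)) / n)


def make_data(n=500, seed=0):
    """Synthetic data (assumption): y depends linearly on x plus Gaussian noise."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    y = [2.0 * xi + rng.gauss(0.0, 0.5) for xi in x]
    return x, y


def mar_mask(x, seed=1):
    """MAR: the chance that y is missing depends only on the fully observed x."""
    rng = random.Random(seed)
    return [rng.random() < (0.6 if xi > 0 else 0.1) for xi in x]


def mean_impute(y, missing):
    """Replace every missing y with the mean of the observed y values."""
    observed = [yi for yi, m in zip(y, missing) if not m]
    mean = sum(observed) / len(observed)
    return [mean for m in missing if m]


def regression_impute(x, y, missing):
    """Fit y = a + b*x by least squares on complete cases, then predict missing y."""
    xs = [xi for xi, m in zip(x, missing) if not m]
    ys = [yi for yi, m in zip(y, missing) if not m]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    slope = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / sum((a - mx) ** 2 for a in xs)
    intercept = my - slope * mx
    return [intercept + slope * xi for xi, m in zip(x, missing) if m]


x, y = make_data()
missing = mar_mask(x)
truth = [yi for yi, m in zip(y, missing) if m]  # known only because the data are simulated

rmse_mean = rmse(truth, mean_impute(y, missing))
rmse_reg = rmse(truth, regression_impute(x, y, missing))
print(f"RMSE, mean imputation:       {rmse_mean:.3f}")
print(f"RMSE, regression imputation: {rmse_reg:.3f}")
```

In this synthetic setting the regression imputer should score a lower RMSE than the mean imputer, because the missingness depends on x and the observed mean of y is therefore shifted away from the values that are missing. This says nothing about how the methods rank on the UCI data sets used in the study.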
License

This work is licensed under a Creative Commons Attribution 4.0 International License.