A Study on Missing Data Management

Authors

  • Mitra M Department of Computer Science and Application, University of North Bengal, Raja Rammohunpur, India
  • RK Samanta Department of Computer Science and Application, University of North Bengal, Raja Rammohunpur, India

Keywords:

UCI database, Missing At Random (MAR), Missing Completely At Random (MCAR), Missing Not At Random (MNAR), Multiple Imputation, Expectation Maximization with Bootstrap approach (EMB), Root Mean Square Error (RMSE)

Abstract

Missing data, a persistent problem in most scientific research, must be handled carefully because the role of data is vital in every analysis. Mishandling missing values may distort the analysis or produce biased results. Valid and reliable models require good data preparation. Methodologists have proposed dozens of techniques to address the problem, and an appropriate method should be chosen for each study in order to achieve an efficient and valid analysis. In this study we discuss different methods for handling missing data and compare three imputation methods: Arithmetic Mean Imputation, Regression Imputation, and Multiple Imputation using the EMB algorithm. The methods are applied to three data sets from the UCI repository under the MAR assumption, with Root Mean Square Error (RMSE) as the evaluation criterion.
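
The comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' code and does not use the UCI data sets: it generates synthetic data, removes values under an assumed MAR mechanism (missingness in `y` depends only on the fully observed `x`), and scores Arithmetic Mean Imputation and Regression Imputation by RMSE on the removed entries. All names (`rmse`, `mean_impute`, `regression_impute`), the sample size, and the logistic missingness model are assumptions made for illustration. Multiple Imputation with EMB is not shown.

```python
import numpy as np


def rmse(true_values, imputed_values):
    """Root mean square error between true and imputed entries."""
    true_values = np.asarray(true_values, dtype=float)
    imputed_values = np.asarray(imputed_values, dtype=float)
    return float(np.sqrt(np.mean((true_values - imputed_values) ** 2)))


def mean_impute(y, missing):
    """Fill missing entries of y with the mean of the observed entries."""
    filled = y.copy()
    filled[missing] = y[~missing].mean()
    return filled


def regression_impute(x, y, missing):
    """Fill missing entries of y with predictions from a least-squares
    line fitted on the observed (x, y) pairs."""
    slope, intercept = np.polyfit(x[~missing], y[~missing], deg=1)
    filled = y.copy()
    filled[missing] = slope * x[missing] + intercept
    return filled


# Synthetic data (an assumption, standing in for a real data set).
rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
y_true = 2.0 * x + rng.normal(scale=0.5, size=n)

# Assumed MAR mechanism: the chance that y is missing depends only on x,
# which is always observed.
p_missing = 1.0 / (1.0 + np.exp(-x))
missing = rng.random(n) < p_missing
y_obs = y_true.copy()
y_obs[missing] = np.nan

# Score each method only on the entries that were removed.
rmse_mean = rmse(y_true[missing], mean_impute(y_obs, missing)[missing])
rmse_reg = rmse(y_true[missing], regression_impute(x, y_obs, missing)[missing])

print(f"Mean imputation RMSE:       {rmse_mean:.3f}")
print(f"Regression imputation RMSE: {rmse_reg:.3f}")
```

Because the true values are known before deletion, RMSE can be computed directly on the imputed positions. In this synthetic setting regression imputation benefits from the strong linear relation between `x` and `y`; results on real data depend on how well the observed variables predict the missing ones.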

Published

2025-11-11

How to Cite

[1]
M. Mitra and R. Samanta, “A Study on Missing Data Management”, Int. J. Comp. Sci. Eng., vol. 5, no. 2, pp. 30–341, Nov. 2025.