A Study on Missing Data Management

Authors

Mitra M Department of Computer Science and Application, University of North Bengal, Raja Rammohunpur, India
RK Samanta Department of Computer Science and Application, University of North Bengal, Raja Rammohunpur, India

Keywords:

UCI database, Missing At Random (MAR), Missing Completely At Random (MCAR), Missing Not At Random (MNAR), Multiple Imputation, Expectation Maximization with Bootstrap approach (EMB), Root Mean Square Error (RMSE)

Abstract

Missing data, a persistent problem in most scientific research, must be handled carefully because data play a vital role in every analysis. Mishandling missing values may distort the analysis or produce biased results. Valid and reliable models require good data preparation. Methodologists have proposed dozens of techniques to address the problem, and an appropriate method should be chosen for each study in order to achieve an efficient and valid analysis. In this study we discuss different methods for handling missing data and compare three imputation methods: Arithmetic Mean Imputation, Regression Imputation, and Multiple Imputation using the EMB algorithm. The methods are applied to three data sets from the UCI repository under the MAR assumption, with Root Mean Square Error (RMSE) as the evaluation criterion.
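
As a rough illustration of the kind of comparison described in the abstract, the following Python sketch imputes one variable by arithmetic mean and by simple linear regression, then scores each against the known true values with RMSE. Everything here is an assumption made for illustration: the data are synthetic rather than taken from the UCI repository, the MAR mechanism (missingness in y driven by a logistic function of the observed x) is one possible choice, and the helper names rmse, mean_impute and regression_impute are hypothetical. Multiple imputation with EMB is not implemented in this sketch.

```python
import numpy as np


def rmse(true_values, imputed_values):
    """Root mean square error between true and imputed values."""
    diff = np.asarray(true_values, dtype=float) - np.asarray(imputed_values, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))


def mean_impute(y, missing):
    """Fill missing entries of y with the mean of the observed entries."""
    filled = y.copy()
    filled[missing] = y[~missing].mean()
    return filled


def regression_impute(x, y, missing):
    """Fill missing entries of y with predictions from a least-squares
    line fitted on the observed (x, y) pairs."""
    slope, intercept = np.polyfit(x[~missing], y[~missing], deg=1)
    filled = y.copy()
    filled[missing] = slope * x[missing] + intercept
    return filled


# Synthetic data (assumption): y depends linearly on x plus noise.
rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
y_true = 2.0 * x + rng.normal(scale=0.5, size=n)

# MAR mechanism (assumption): the chance that y is missing depends
# only on the fully observed x, not on y itself.
p_missing = 1.0 / (1.0 + np.exp(-x))
missing = rng.random(n) < p_missing

y_obs = y_true.copy()
y_obs[missing] = np.nan

y_mean = mean_impute(y_obs, missing)
y_reg = regression_impute(x, y_obs, missing)

# Score each method only on the entries that were actually imputed.
rmse_mean = rmse(y_true[missing], y_mean[missing])
rmse_reg = rmse(y_true[missing], y_reg[missing])
print(f"mean imputation RMSE:       {rmse_mean:.3f}")
print(f"regression imputation RMSE: {rmse_reg:.3f}")
```

Because the true values are known only in a simulation like this one, RMSE can be computed directly on the imputed entries; with real data the usual approach is to delete known values artificially and score the methods on those.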

Published

2025-11-11

How to Cite

[1]

M. Mitra and R. Samanta, “A Study on Missing Data Management”, Int. J. Comp. Sci. Eng., vol. 5, no. 2, pp. 30–341, Nov. 2025.

Issue

Vol. 5 No. 2 (2017): IJCSE February Edition

Section

Survey Article

License

This work is licensed under a Creative Commons Attribution 4.0 International License.

Authors contributing to this journal agree to publish their articles under the Creative Commons Attribution 4.0 International License, allowing third parties to share their work (copy, distribute, transmit) and to adapt it, under the condition that the authors are given credit and that in the event of reuse or distribution, the terms of this license are made clear.
