Tackling Imbalance Datasets: Methods, Techniques & Comparisons

Authors

DOI:

https://doi.org/10.26438/ijcse/v11i5.612

Keywords:

Multiclass, Classification, Imbalance, Prediction, Majority, Minority, Synthetic Minority Over-sampling Technique (SMOTE), Simplified Swarm Optimization (SSO), Particle Swarm Optimization (PSO)

Abstract

Learning from data, including its duplication and extraction, has remained a focus of extensive research for many years. A classification dataset with skewed class proportions is referred to as imbalanced. The term originated in discussions of skewed distributions in binary tasks. Imbalanced datasets have an uneven distribution of observations across the target classes: one class has a very large number of observations while the other has far fewer. The emergence of the big data era, along with the growth of machine learning and data mining (data science), has deepened the study of learning from imbalanced datasets and brought new challenges with it. Data-level and algorithm-level methods are widely used and continue to improve, while hybrid approaches have grown in popularity because they combine the strengths of both earlier approaches and reduce their weaknesses. To advance the field of addressing imbalanced datasets and to compare existing approaches and methodologies, this paper discusses the open questions and challenges that remain to be resolved, and suggests potential directions for further investigation. The main problem with an imbalanced class distribution is that poor training practice biases the model in favour of the majority class. Deep learning and machine learning algorithms are often trained on datasets in which some categories are underrepresented. Conventional methods recommend undersampling the majority class and oversampling the minority class before the learning stage. This paper surveys traditional and contemporary strategies for addressing the issue, including learning modules that use carefully constructed representations of majority and minority samples.
The work of several researchers is compiled in a logical manner, and the numerous opportunities and difficulties for future research in the field are discussed.
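
The conventional data-level remedy mentioned in the abstract, undersampling the majority class and oversampling the minority class before learning, can be illustrated with a minimal sketch. The helper names (`resample_to_size`, `balance`), the toy data, and the target size of 50 samples per class are assumptions made for illustration only; they are not taken from the paper, and random duplication is used here in place of more elaborate techniques such as SMOTE.

```python
import random
from collections import Counter


def resample_to_size(rows, target_size, rng):
    """Return exactly target_size rows drawn from one class.

    Shrinking picks a random subset without replacement (undersampling);
    growing keeps every original row and appends random duplicates
    (random oversampling).
    """
    if target_size <= len(rows):
        return rng.sample(rows, target_size)
    extra = [rng.choice(rows) for _ in range(target_size - len(rows))]
    return rows + extra


def balance(features, labels, target_size, seed=0):
    """Resample every class to target_size samples (hypothetical helper)."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(features, labels):
        by_class.setdefault(y, []).append(x)

    new_features, new_labels = [], []
    for y, rows in by_class.items():
        picked = resample_to_size(rows, target_size, rng)
        new_features.extend(picked)
        new_labels.extend([y] * len(picked))
    return new_features, new_labels


# Toy imbalanced data: 90 majority samples (label 0), 10 minority samples (label 1).
features = [[float(i)] for i in range(100)]
labels = [0] * 90 + [1] * 10

bal_x, bal_y = balance(features, labels, target_size=50)
counts = Counter(bal_y)
print(counts)
```

After resampling, both classes contain the same number of samples, so a classifier trained on the result no longer sees the 9:1 skew. Random duplication adds no new information and can encourage overfitting to the minority samples, which is one motivation for synthetic approaches such as SMOTE listed in the keywords.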

Published

2023-05-31

How to Cite

[1]
S. Kumar, D. Ahuja, and S. Kumar, “Tackling Imbalance Datasets: Methods, Techniques & Comparisons”, Int. J. Comp. Sci. Eng., vol. 11, no. 5, pp. 6–12, May 2023.

Issue

Section

Research Article