Comparative Study of Machine Learning Algorithms for Document Classification

Authors

Jain R, School of Computer Science and IT, Devi Ahilya University, Indore, India
Thakur A, School of Computer Science and IT, Devi Ahilya University, Indore, India

DOI:

https://doi.org/10.26438/ijcse/v7i6.11891191

Keywords:

Text Classification, Naïve Bayes, Random Forest, Machine Learning

Abstract

Text classification is the task of assigning predefined classes to free text. Text classifiers can be used to organize, structure, and categorize almost any kind of text. In this work we use the random forest and naïve Bayes algorithms to perform a document classification task. We train the machine learning models to infer the class of each document. Working on large data sets of movie reviews, the chosen models predict whether a review is positive or negative, and we then analyse and compare the metrics derived from each model's confusion matrix: precision, recall, F1-score, and support. An important observation is that, for the same input data, random forest gives better results than the naïve Bayes algorithm. As the training data grows, however, naïve Bayes performs as well as random forest.
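The comparison metrics named in the abstract can be illustrated with a minimal sketch. It assumes two class labels, "pos" and "neg", and uses made-up toy predictions rather than any data or results from the paper; the function name per_class_metrics is hypothetical and chosen only for illustration.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred, labels=("neg", "pos")):
    """Compute precision, recall, F1-score and support for each class."""
    # Confusion counts: pairs[(t, p)] is the number of documents whose
    # true label is t and whose predicted label is p.
    pairs = Counter(zip(y_true, y_pred))
    metrics = {}
    for label in labels:
        others = [other for other in labels if other != label]
        tp = pairs[(label, label)]                        # correctly predicted as label
        fp = sum(pairs[(other, label)] for other in others)  # wrongly predicted as label
        fn = sum(pairs[(label, other)] for other in others)  # label missed by the model
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics[label] = {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "support": tp + fn,  # number of true documents of this class
        }
    return metrics


# Toy labels (assumed for illustration only, not taken from the paper).
y_true = ["pos", "pos", "pos", "neg", "neg", "neg", "neg", "pos"]
y_pred = ["pos", "neg", "pos", "neg", "pos", "neg", "neg", "pos"]

report = per_class_metrics(y_true, y_pred)
for label, values in report.items():
    print(label, values)
```

Running the same function on the predictions of each classifier gives per-class figures that can be compared side by side, which is the kind of comparison the abstract describes.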

Published

2019-06-30

How to Cite

R. Jain and A. Thakur, “Comparative Study of Machine Learning Algorithms for Document Classification”, Int. J. Comp. Sci. Eng., vol. 7, no. 6, pp. 1189–1191, Jun. 2019.

Issue

Vol. 7 No. 6 (2019): IJCSE June Edition

Section

Research Article

License

This work is licensed under a Creative Commons Attribution 4.0 International License.

Authors contributing to this journal agree to publish their articles under the Creative Commons Attribution 4.0 International License, allowing third parties to share their work (copy, distribute, transmit) and to adapt it, under the condition that the authors are given credit and that in the event of reuse or distribution, the terms of this license are made clear.
