Comparative Study of Machine Learning Algorithms for Document Classification

Authors

  • Jain R, School of Computer Science and IT, Devi Ahilya University, Indore, India
  • Thakur A, School of Computer Science and IT, Devi Ahilya University, Indore, India

DOI:

https://doi.org/10.26438/ijcse/v7i6.11891191

Keywords:

Text Classification, Naïve Bayes, Random Forest, Machine Learning

Abstract

Text classification is the task of assigning predefined classes to free text. Text classifiers can be used to organize, structure, and categorize almost any kind of text. In this work we use the random forest and naïve Bayes algorithms to perform a document classification task. We train the machine learning models to infer the class of each document. Working on large data sets of movie reviews, the models predict whether each review is positive or negative; we then analyse and compare the metrics derived from each model's confusion matrix: precision, recall, F1-score and support. An important observation is that, for the same input data, random forest gives better results than naïve Bayes. As the training data grows, however, naïve Bayes performs as well as random forest.
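
The evaluation metrics named in the abstract can be illustrated with a minimal sketch in plain Python. It assumes binary labels "pos" and "neg" and shows how precision, recall, F1-score and support follow from the confusion-matrix counts for one class. The function names (confusion_counts, class_report) and the label lists are hypothetical and chosen for illustration only; they are not the authors' code and the numbers are not results from the paper.

```python
from collections import Counter


def confusion_counts(y_true, y_pred, positive="pos"):
    """Return (TP, FP, FN, TN), treating `positive` as the positive class."""
    counts = Counter()
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            counts["tp" if truth == positive else "fp"] += 1
        else:
            counts["fn" if truth == positive else "tn"] += 1
    return counts["tp"], counts["fp"], counts["fn"], counts["tn"]


def class_report(y_true, y_pred, positive="pos"):
    """Precision, recall, F1-score and support for the chosen class."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred, positive)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    # Support is the number of true examples of the class.
    return {"precision": precision, "recall": recall,
            "f1": f1, "support": tp + fn}


# Made-up labels for illustration only (assumption, not data from the study).
y_true = ["pos", "pos", "pos", "neg", "neg", "neg", "neg", "pos"]
y_pred = ["pos", "neg", "pos", "neg", "pos", "neg", "neg", "pos"]

report = class_report(y_true, y_pred, positive="pos")
print(report)
```

Running class_report once per classifier, on the same held-out reviews, gives the per-class figures that a side-by-side comparison of two models would rest on.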

Published

2019-06-30

How to Cite

[1]
R. Jain and A. Thakur, “Comparative Study of Machine Learning Algorithms for Document Classification”, Int. J. Comp. Sci. Eng., vol. 7, no. 6, pp. 1189–1191, Jun. 2019.

Section

Research Article