Degraded Bangla Character Recognition by k-NN Classifier
Keywords:
Degraded document recognition, Bangla document analysis, K-Means, k-nearest neighbour
Abstract
Digitizing degraded Bangla documents by Optical Character Recognition (OCR) is an active research area. Many historical documents, particularly those from the 1960s and 1970s, are deteriorating day by day for lack of preservation, and their content needs to be recovered. In this article, we present our recent study on the recognition of degraded printed document images in Bangla, the 7th most popular language in the world. The proposed approach takes low-quality degraded images as input and outputs the recognized characters. First, the scanned document image is preprocessed to improve its quality. The approach is analytic: segmentation is carried out line by line, then word by word, and finally character by character. The database used is the ISIDDI database, which contains 535 historical pages in TIF and JPG formats with different fonts, sizes, formats and, most importantly, different levels of degradation. After segmentation, we manually identified 320 classes of segmented symbols and divided the whole character dataset into a training set (70%) and a test set (30%). For the training samples of the 320 classes, we computed the Histogram of Oriented Gradients (HOG) feature. Applying the K-means clustering algorithm to these features, we generated clusters for the 320 classes and labeled them according to their classes. For each character of the test set, the HOG feature is computed again, and the k-nearest neighbour algorithm assigns the character to the class at minimum distance among the 320 classes. The classification accuracy obtained on the test set is encouraging: from the confusion matrix, we achieved 82.80% character (symbol) level accuracy on the 320 classes.
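The clustering and classification steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the HOG descriptors have already been computed, uses toy 2-D vectors in their place, and the names `build_prototypes` and `classify`, the Euclidean distance, the two clusters per class and the class labels "ka" and "kha" are all hypothetical choices made for illustration.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def kmeans(vectors, k, iterations=20, seed=0):
    """Plain Lloyd's k-means; returns at most k centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, min(k, len(vectors)))
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for v in vectors:
            nearest = min(range(len(centroids)),
                          key=lambda i: sq_dist(v, centroids[i]))
            groups[nearest].append(v)
        # Keep the old centroid when a cluster loses all its members.
        centroids = [mean_vector(g) if g else c
                     for g, c in zip(groups, centroids)]
    return centroids


def build_prototypes(training_set, clusters_per_class=2):
    """Cluster each class separately and label every centroid with its class."""
    prototypes = []
    for label, vectors in training_set.items():
        for centroid in kmeans(vectors, clusters_per_class):
            prototypes.append((centroid, label))
    return prototypes


def classify(feature, prototypes, k=1):
    """k-NN vote over labeled prototypes; ties go to the closest neighbour."""
    ranked = sorted(prototypes, key=lambda p: sq_dist(feature, p[0]))[:k]
    votes = {}
    for _, label in ranked:
        votes[label] = votes.get(label, 0) + 1
    best = max(votes.values())
    for _, label in ranked:  # ranked is ordered by increasing distance
        if votes[label] == best:
            return label


# Toy 2-D vectors standing in for HOG descriptors of two symbol classes
# (assumption: real descriptors would be much higher-dimensional).
training_set = {
    "ka": [[0.0, 0.1], [0.1, 0.0], [0.2, 0.1], [0.1, 0.2]],
    "kha": [[1.0, 0.9], [0.9, 1.0], [1.1, 1.0], [1.0, 1.1]],
}
prototypes = build_prototypes(training_set)
predicted = classify([0.15, 0.05], prototypes)
```

With `k=1`, a test descriptor is simply assigned the class of its nearest cluster centre, which matches the minimum-distance assignment the abstract describes. Comparing against a few labeled centroids per class instead of every training sample keeps the number of distance computations small.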
License

This work is licensed under a Creative Commons Attribution 4.0 International License.
Authors contributing to this journal agree to publish their articles under the Creative Commons Attribution 4.0 International License, allowing third parties to share their work (copy, distribute, transmit) and to adapt it, under the condition that the authors are given credit and that in the event of reuse or distribution, the terms of this license are made clear.
