Developing & Deploying Algorithms for Information Extraction using Classification Measures for Named Entity Recognition
DOI: https://doi.org/10.26438/ijcse/v6i10.235248
Keywords: Information extraction, Natural language processing, Named entity recognition, Conditional random fields
Abstract
The web is full of content that is either completely or semi-unstructured, and retrieving essential data from it is difficult, so the concept of information extraction (IE), keeping the necessary parameters in view, becomes highly important. This paper presents a comparative study of how the problem of information extraction can be handled for a dataset by taking the first step towards IE: named entity recognition (NER). Various classifiers and techniques for NER, and the impact of a processing pipeline on some of them, are discussed. Based on the results, and taking response time into account, conditional random fields prove to be the best technique for NER, with an average recall and precision of 0.97 each, helping to predict efficiently whether a given word is part of a named entity. Automating the search for clinical-trial candidates in clinical databases is currently one of the most important areas of concern in medical science, and this paper provides an approach for choosing a technique according to these parameters, along with the results of a novel algorithmic approach.
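The abstract reports average per-token precision and recall of 0.97 for the CRF classifier. As a minimal sketch of how such token-level scores are computed for NER, the snippet below evaluates hypothetical gold and predicted BIO tags (the tag sequences and the `token_precision_recall` helper are invented for illustration and are not taken from the paper):

```python
# Token-level precision/recall for NER, treating any non-"O" tag
# as "part of a named entity" — the binary decision the paper
# evaluates. Gold/pred sequences below are invented examples.

gold = ["B-PER", "O", "O", "B-LOC", "I-LOC", "O"]
pred = ["B-PER", "O", "B-LOC", "B-LOC", "I-LOC", "O"]

def token_precision_recall(gold, pred, positive=lambda t: t != "O"):
    """Precision/recall over tokens labeled as entity parts."""
    tp = sum(1 for g, p in zip(gold, pred) if positive(g) and positive(p))
    fp = sum(1 for g, p in zip(gold, pred) if not positive(g) and positive(p))
    fn = sum(1 for g, p in zip(gold, pred) if positive(g) and not positive(p))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

p, r = token_precision_recall(gold, pred)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.75 recall=1.00
```

In practice a library such as scikit-learn (`sklearn.metrics.precision_score`/`recall_score`), which the paper's own references point to for confusion matrices and precision-recall curves, would be used instead of this hand-rolled helper.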
License

This work is licensed under a Creative Commons Attribution 4.0 International License.
Authors contributing to this journal agree to publish their articles under the Creative Commons Attribution 4.0 International License, allowing third parties to share their work (copy, distribute, transmit) and to adapt it, under the condition that the authors are given credit and that in the event of reuse or distribution, the terms of this license are made clear.
