Extracting Tasks of Text Files using Dictionary Based Approach for Classification and Indexing

Authors

  • Rayate P PG Student, Computer Engineering Department, Bharati Vidyapeeth Deemed University College of Engineering Pune, India
  • Thakore DS H.O.D., Computer Engineering Department, Bharati Vidyapeeth Deemed University College of Engineering Pune, India

Keywords:

Natural language processing, text mining, part-of-speech tagging, text files, machine learning techniques, WordNet library

Abstract

In software documentation, product knowledge and software requirements are important for improving product quality. Developers in the maintenance stage cannot read the whole documentation of a large corpus. They need to retrieve the relevant documentation entities (development, design, testing, etc.) in a short period of time. Software documentation records this important information, but a gap exists between the information developers want and what the documentation readily provides. The gap becomes apparent whenever developers try to find accurate information in the correct form at the right time. To address this problem, we describe an approach for extracting relevant tasks from documentation under four categories of software entities (documentation, development, testing, and other). The main idea is to extract tasks from the software documentation using Natural Language Processing (NLP), so that developers can easily get the required data through a customized portal; the category of each task can then be generated easily from existing applications. The machine learning approach is based on a supervised learning technique, with a training dataset of text files prepared through text mining. Our approach uses the WordNet library to identify relevant tasks by calculating the frequency of each word, which lets developers discover word usage in a piece of software, and it assigns a part-of-speech (POS) tag to each word. The results report how many sentences, tokens, and tasks appear in a document and whether each extracted task is relevant. The approach also reduces the gap between the information developers want and the software documentation. Developer feedback is used to improve the performance of the system. The results are presented through the customized portal, which helps developers get information in a short period of time. Based on developer feedback given in the form of comments, the system extracts tasks with 80% precision.
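The pipeline outlined in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract names WordNet and a POS tagger, whereas this sketch substitutes a small hand-written verb lexicon (the hypothetical `TASK_VERBS`) and naive regular-expression splitting so that it runs without external data. The function names, the sample text, and the feedback format (one boolean per extracted task) are likewise assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical lexicon mapping an action verb to a task category.
# It stands in for the WordNet lookup and POS tagging named in the
# abstract; the entries are illustrative, not taken from the paper.
TASK_VERBS = {
    "install": "development",
    "configure": "development",
    "implement": "development",
    "test": "testing",
    "verify": "testing",
    "document": "documentation",
    "describe": "documentation",
}


def split_sentences(text):
    """Split after '.', '!' or '?' followed by whitespace (naive heuristic)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Return lowercase alphabetic tokens; punctuation and digits are dropped."""
    return re.findall(r"[a-z]+", sentence.lower())


def extract_tasks(text):
    """Count sentences and tokens, tally word frequencies, collect tasks."""
    sentences = split_sentences(text)
    frequencies = Counter()
    tasks = []
    for sentence in sentences:
        tokens = tokenize(sentence)
        frequencies.update(tokens)
        # A sentence counts as a candidate task if it contains a lexicon
        # verb; the first match decides the category.
        for token in tokens:
            if token in TASK_VERBS:
                tasks.append((TASK_VERBS[token], sentence))
                break
    return {
        "sentences": len(sentences),
        "tokens": sum(frequencies.values()),
        "frequencies": frequencies,
        "tasks": tasks,
    }


def precision(feedback):
    """Fraction of extracted tasks that developers marked as relevant."""
    return sum(feedback) / len(feedback) if feedback else 0.0


sample = (
    "Install the server package. The release was delayed. "
    "Verify the login page after each build."
)
report = extract_tasks(sample)
print(report["sentences"], report["tokens"], report["tasks"])
print(precision([True, True, True, True, False]))
```

In this sketch, precision is the share of extracted tasks that developers judged relevant, so four relevant tasks out of five extracted would correspond to the 80% figure. A faithful reproduction would replace the lexicon lookup with WordNet queries and a trained POS tagger.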

Published

2025-11-11

How to Cite

[1]
P. Rayate and D. S. Thakore, “Extracting Tasks of Text Files using Dictionary Based Approach for Classification and Indexing”, Int. J. Comp. Sci. Eng., vol. 4, no. 7, pp. 44–50, Nov. 2025.

Section

Research Article