Significance of learning methods for mining of real time data streams
DOI:
https://doi.org/10.26438/ijcse/v6i3.188209
Keywords:
Data Mining
Abstract
Stream data is now, more than ever, highly distributed, loosely structured, increasingly large in volume, and changing over time. Broadly speaking, two trends stand out: first, the volume of data is increasing exponentially each year; second, new data is generated at high speed, represents distinct concepts, and changes over time. Stream data is generated by a number of sources. Data streaming applications typically deal with large amounts of data over an extended period of time. However, in most cases the user is interested only in recent data rather than the whole data set. Furthermore, stream data tends to exhibit concept drift, i.e., the data evolves over time. As a result, algorithms that give the whole data set the same importance produce distorted results, because the majority of the processed data is no longer valid. Sometimes the nature of a data stream itself requires giving up a certain amount of precision, because its high volume could not be processed otherwise and one would end up with no information at all. If the data distribution is stable, mining a data stream is largely the same as mining a large data set, since statistically it is easy to mine a sufficient sample. The goals of mining data streams are to find and understand changes and to maintain an updated model. For evolving data, two classes of problems are of particular interest: model maintenance and change detection. The goal of model maintenance is to maintain a data mining model under inserts and deletes of blocks of data; in this setting, older data is available if necessary. Change detection concerns quantifying the difference between two sets of data and determining when the change is statistically significant. Data streams can be seen as stochastic processes in which events occur continuously and independently of one another [1]. Querying data streams is quite different from querying in the conventional relational model.
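The change detection problem described above (quantify the difference between two sets of data, then decide whether it is statistically significant) can be illustrated with a minimal sketch. It is not the method of this paper: the Welch-style statistic, the function names `drift_score` and `detect_changes`, the window size, and the threshold are all assumptions chosen for illustration. A fixed reference window is compared with a sliding window of recent values, and when a change is flagged the recent window becomes the new reference.

```python
import math
import random
from collections import deque
from statistics import mean, variance


def drift_score(reference, recent):
    """Welch-style statistic: difference of window means in standard errors."""
    se = math.sqrt(variance(reference) / len(reference)
                   + variance(recent) / len(recent))
    diff = abs(mean(reference) - mean(recent))
    if se == 0.0:
        # Both windows are constant: any difference counts as a change.
        return 0.0 if diff == 0.0 else math.inf
    return diff / se


def detect_changes(stream, window=100, threshold=4.0):
    """Yield the index of each item at which a change is flagged.

    `window` (>= 2) and `threshold` are illustrative defaults, not tuned values.
    """
    reference = deque(maxlen=window)   # older data, assumed to follow one concept
    recent = deque(maxlen=window)      # sliding window over the newest items
    for i, x in enumerate(stream):
        if len(reference) < window:
            reference.append(x)
            continue
        recent.append(x)
        if len(recent) == window and drift_score(reference, recent) > threshold:
            yield i
            # Treat the recent data as the new reference and start over.
            reference = deque(recent, maxlen=window)
            recent = deque(maxlen=window)


# Usage: a synthetic stream whose mean shifts from 0 to 3 at index 300.
random.seed(0)
values = [random.gauss(0.0, 1.0) for _ in range(300)]
values += [random.gauss(3.0, 1.0) for _ in range(300)]
print(list(detect_changes(values)))
```

Because the test is repeated at every arrival, the threshold is deliberately conservative; a production detector would need to account for repeated testing more carefully.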
A key idea is that operating on the data stream model does not preclude the use of data in conventional stored relations; the stream data itself, however, may be transient.
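As a hedged illustration of this point, the sketch below combines a transient stream with a conventional stored relation: each arriving event is looked up in the relation, and only a bounded buffer of recent values is retained. The names `windowed_join`, `sensors`, and `events`, and the window size, are hypothetical and do not come from the paper.

```python
from collections import deque


def windowed_join(stream, relation, window=3):
    """Join each (key, value) event with a stored relation and yield the
    label together with the mean of the last `window` values for that label."""
    recent = {}  # label -> bounded buffer; older stream items are discarded
    for key, value in stream:
        label = relation.get(key, "unknown")   # lookup in the stored relation
        buf = recent.setdefault(label, deque(maxlen=window))
        buf.append(value)
        yield label, sum(buf) / len(buf)


# Hypothetical stored relation and a one-pass (transient) event stream.
sensors = {"s1": "boiler", "s2": "pump"}
events = iter([("s1", 10.0), ("s2", 4.0), ("s1", 20.0), ("s1", 30.0), ("s1", 40.0)])
for label, avg in windowed_join(events, sensors):
    print(label, avg)
```

The stored relation can be consulted at any time, whereas each stream item is seen once and survives only as long as it stays inside the bounded window.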
References
G. Hulten, L. Spencer, and P. Domingos, "Mining Time-Changing Data Streams," Proc. Seventh ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD '01), pp. 97-106, 2001.
D. J. Newman, S. Hettich, C. L. Blake, and C. J. Merz. UCI repository of machine learning databases, 1998.
C.X. Ling and V.S. Sheng. Cost-sensitive learning and the class imbalance problem. Encyclopedia of Machine Learning, 2008.
Pedro Domingos and Geoff Hulten, "Mining High Speed Data Streams", KDD-00, in Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, USA, 2000, pp. 71-80.
Leo Breiman. Random Forests. Machine Learning, 45(1):5-32, 2001.
J. C. Schlimmer and R. H. Granger. Beyond incremental processing: Tracking concept drift. In Proceedings of the Fifth National Conference on Artificial Intelligence, pages 502-507. AAAI Press, Menlo Park, CA, 1986.
N. Japkowicz and S. Stephen. The class imbalance problem: A systematic study. Intelligent Data Analysis, 6(5):429-449, 2002.
W. Nick Street and Yong Seog Kim. A Streaming Ensemble Algorithm (SEA) for Large-Scale Classification. KDD-01. San Francisco, CA.
Dougherty, J., R. Kohavi, and M. Sahami. Supervised and unsupervised discretization of continuous features. In Proceedings of International Conference on Machine Learning (ICML-1995), 1995.
A. Tsymbal. “The problem of concept drift: definitions and related work”, Technical Report TCD-CS-2004-15, Computer Science Department, Trinity College Dublin, Ireland. 2004.
E Padmalatha, C R K Reddy and Padmaja B Rani. Article: Ensemble Classification for Drifting Concept. International Journal of Computer Applications 80(11):33-36, October 2013.
E. Padmalatha, C. R. K. Reddy, and B. Padmaja Rani, "Classification of Concept Drift Data Streams", in Proceedings of the Fifth International Conference on Information Science and Applications (ICISA 2014), IEEE, pp. 291-295, 2014.
Periasamy Vivekanandan and Raju Nedunchezhian, “Mining data streams with concept drifts using genetic algorithm”, Artificial Intelligence Review, Vol. 36, Issue 3, pp 163-178, Springer, October 2011.
Basheer M. Al-Maqaleh and Hamid Shahbazkia, “A Genetic Algorithm for Discovering Classification Rules in Data Mining”, International Journal of Computer Applications (0975-8887), Vol. 41-No. 18, March 2012.
Syed Shaheena and Shaik Habeeb, “Classification Rule Discovery Using Genetic Algorithm-Based Approach”, NIMRA Institute, Department of CSE, IJCTT Journal, Vol. 4, Issue 8, pp 2710-2715, August 2013.
E Padmalatha, C R K Reddy and Padmaja B Rani. Article: Classification of Concept-Drifting Data Streams using Optimized Genetic Algorithm. International Journal of Computer Applications 125(15):1-6, September 2015.
Wei Liu, Sanjay Chawla, David A. Cieslak, and Nitesh V. Chawla, "A Robust Decision Tree Algorithm for Imbalanced Data Sets", 2010.
Xu-Ying Liu, Jianxin Wu, and Zhi-Hua Zhou, "Exploratory Undersampling for Class-Imbalance Learning", IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, Vol. 39, No. 2, pp. 539-550, April 2009.
J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2000.
Junfeng Pan, Qiang Yang, Yiming Yang, Lei Li, Frances Tianyi Li, and George Wenmin Li, "Cost-Sensitive Data Preprocessing for Mining Customer Relationship Management Databases", A Technical Report, January/February 2007.
J. Read, B. Pfahringer, G. Holmes, and E. Frank. Classifier chains for multi-label classification. In Proceedings of ECML PKDD ’09, pages 254–269, 2009.
Macskassy, S.A. and Provost, F.J., "Confidence Bands for ROC Curves," CeDER Working Paper 02-04, Stern School of Business, New York University, NY, NY 10012. Jan 2004.
License

This work is licensed under a Creative Commons Attribution 4.0 International License.
Authors contributing to this journal agree to publish their articles under the Creative Commons Attribution 4.0 International License, allowing third parties to share their work (copy, distribute, transmit) and to adapt it, under the condition that the authors are given credit and that in the event of reuse or distribution, the terms of this license are made clear.
