Query Execution Performance Analysis of Big Data Using Hive and Pig of Hadoop
Keywords:
Big Data, Hive, Pig, Performance Analysis, Data Processing, Query Execution Time
Abstract
Cloud platforms require an efficient computational infrastructure. On such platforms a huge amount of data is generated in a fraction of a second, so traditional computing techniques are not sufficient. Big Data technology addresses this computing demand and also allows storage to scale according to the application’s needs. Big Data is a new-generation storage infrastructure (hardware and software). In this paper, the Big Data environment is investigated and a comparative study is performed of two of the most frequently used data retrieval techniques, Pig and Hive from the Hadoop ecosystem, both of which provide efficient data processing. For the comparison, Hadoop storage is prepared first, and Pig and Hive are then configured on top of the MapReduce framework. To evaluate query execution efficiency in terms of processing time, a list of similar queries is prepared and each query is run as an experiment on both techniques. The results show that, for the selected new_songs dataset, the query processing time of Hive is lower than that of Pig. However, the two data models serve different goals, so both technologies are suited to different kinds of computer configuration.
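The timing procedure described in the abstract can be illustrated with a minimal sketch. It is not the authors' harness: the function names `time_command` and `compare`, the repeat count, and the two query strings in `JOBS` are hypothetical, and the table name and load path `new_songs` are assumed from the dataset name only. The sketch measures wall-clock time for each command and averages it over a few runs.

```python
import subprocess
import sys
import time
from statistics import mean


def time_command(cmd, repeats=3):
    """Run `cmd` (an argv list) `repeats` times; return wall-clock durations in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        # Output is discarded: only the elapsed time matters here.
        subprocess.run(cmd, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        durations.append(time.perf_counter() - start)
    return durations


def compare(jobs, repeats=3):
    """Return {label: mean duration in seconds} for a {label: argv list} mapping."""
    return {label: mean(time_command(cmd, repeats)) for label, cmd in jobs.items()}


# Hypothetical pair of similar queries (a row count). The table name and load
# path are assumptions; running them requires a configured Hive and Pig setup.
JOBS = {
    "hive": ["hive", "-e", "SELECT COUNT(*) FROM new_songs;"],
    "pig": ["pig", "-e",
            "songs = LOAD 'new_songs'; g = GROUP songs ALL; "
            "c = FOREACH g GENERATE COUNT(songs); DUMP c;"],
}
```

On a machine where both tools are installed, `compare(JOBS)` would return the mean time per tool for this one query; repeating it over a list of query pairs gives the kind of per-query comparison the abstract reports.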
References
Bharath Vissapragada, “Optimizing SQL Query Execution over Map-Reduce,” M.S. thesis, Dept Comp. Sc., Center for Data Engineering International Institute of Information Technology, Hyderabad, India, September 2014.
Ammar Fuad, Alva Erwin, and Heru Purnomo Ipung, “Processing Performance on Apache Pig, Apache Hive and MySQL Cluster,” International Conference on Information, Communication Technology and System, IEEE, 2014.
F. Provost, T. Fawcett, “Data Science and its relationship to Big Data and data-driven decision making,” University of Massachusetts Amherst, DOI: 10.1089/big.2013.1508, March 2013.
Changqing Ji, Yu Li, Wenming Qiu, Uchechukwu Awada, and Keqiu Li, “Big Data Processing in Cloud Computing Environments,” International Symposium on Pervasive Systems, Algorithms and Networks, IEEE, Dalian, China, 2012.
Apache Hadoop, Available: http://wiki.apache.org/hadoop.
Munesh Kataria and Pooja Mittal, “Big Data and Hadoop with Components like Flume, Pig, Hive and Jaql,” IJCSMC, Vol. 3, July 2014, pp. 759-765.
Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Ning Zhang, Suresh Antony, Hao Liu and Raghotham Murthy, “Hive – A Petabyte Scale Data Warehouse Using Hadoop,” ICDE Conference, IEEE, 2010.
Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Prasad Chakka, Suresh Anthony, Hao Liu, Pete Wyckoff and Raghotham Murthy, “Hive – A Warehousing Solution Over a Map-Reduce Framework,” VLDB, ACM, Lyon, France, August 2009, pp. 24-28.
Anja Gruenheid, Edward Omiecinski, and Leo Mark, “Query Optimization Using Column Statistics in Hive,” IDEAS, ACM, Lisbon, Portugal, September 2011, pp. 21-23.
Meng-Ju Hsieh, Chao-Rui Chang, Li-Yung Ho, Jan-Jan Wu, and Pangfeng Liu, “SQLMR: A Scalable Database Management System for Cloud Computing,” DBLP, January 2011.
Avrilia Floratou, Umar Farooq Minhas, and Fatma Ozcan, “SQL-on-Hadoop: Full Circle Back to Shared-Nothing Database Architectures,” Proceedings of the VLDB Endowment, Vol. 7, No. 12, 2014.
Rakesh Kumar, Neha Gupta, Shilpi Charu, Somya Bansal, and Kusum Yadav, “Comparison of SQL with HiveQL,” International Journal for Research in Technological Studies, Vol. 1, Issue 9, August 2014.
Sai Prasad Potharaju, Shanmuk Srinivas, Ravi Kumar Tirandasu, “Case Study of Hive Using Hadoop,” DBLP, Volume-1, Issue-3, 2014.
Madhuri Srinivas Palle, Konisa Jyothsna and B. Anusha, “Analyzing Failures of a Semi-Structured Supercomputer Log File Efficiently by Using Pig on Hadoop,” International Journal of Computer Science and Engineering, Volume-2, Issue-1, 2014.
Tak Lon Wu, Abhilash Koppula, and Judy Qiu, “Integrating Pig with Harp to Support Iterative Applications with Fast Cache and Customized Communication,” ACM, 2014.
Gang Zhao, “A Query Processing Framework based on Hadoop,” International Journal of Database Theory and Application, Vol.7, No.4, 2014, pp. 261-272.
James M. Harris and Cynthia Z.F. Clark, “Strengthening Methodological Architecture with Multiple Frames and Data Sources,” Proceedings 59th ISI World Statistics Congress, Hong Kong, August 2013.
J. Christy Jackson, V. Vijaya kumar, Md. Abdul Quadir, and C. Bharathi, “Survey on Programming Models and Environments for Cluster, Cloud, and Grid Computing that defends Big Data,” 2nd International Symposium on Big Data and Cloud Computing (ISBCC’15), ELSEVIER, 2015.
Dataset that is used in this project, Available: https://github.com/jasondbaker/seis734.
Radhiya A. Arsekar, Ankita V. Chikhale, Vaibhav T. Kamble and Vinayak N. Malavade, “Comparative Study of MapReduce and Pig in Big Data,” International Journal of Current Engineering and Technology, Vol.5, No.2, April 2015.
License

This work is licensed under a Creative Commons Attribution 4.0 International License.
Authors contributing to this journal agree to publish their articles under the Creative Commons Attribution 4.0 International License, allowing third parties to share their work (copy, distribute, transmit) and to adapt it, under the condition that the authors are given credit and that in the event of reuse or distribution, the terms of this license are made clear.
