Performance Comparison of Map Reduce and Apache Spark on Hadoop for Big Data Analysis
Keywords:
Big Data, Hadoop, HDFS, Map Reduce, Apache Spark
Abstract
With the steady advancement of the internet and IT, tremendous growth in data has been observed. Data created at a very fast pace, referred to as big data, is a trending term these days. Big data has fascinated computer science enthusiasts around the world and has gained even more prominence in the last few years. This paper compares Hadoop Map Reduce and the more recently introduced Apache Spark, both of which are frameworks for analyzing big data. Although both are built for big data processing, their performance varies significantly depending on the application under consideration. The two frameworks are compared, and their performance is evaluated by running the word count algorithm on various datasets in both the Hadoop Map Reduce and Apache Spark environments. The system that performs better is then used to analyze the research dataset of a university.
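For readers unfamiliar with the benchmark, the word count algorithm can be sketched in plain Python as three phases that mirror the map, shuffle, and reduce steps of the Map Reduce model. This is a minimal single-machine illustration and not the authors' implementation; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) and the lower-casing and whitespace-splitting rules are assumptions made for the example.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every whitespace-separated token.

    Lower-casing is an assumption of this sketch, not taken from the paper.
    """
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle step: group all emitted counts by their word key."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped


def reduce_phase(grouped):
    """Reduce step: sum the grouped counts for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over an iterable of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return reduce_phase(shuffle_phase(pairs))


if __name__ == "__main__":
    sample = ["big data big", "Data spark"]
    print(word_count(sample))
```

In Hadoop Map Reduce the map and reduce steps run as separate tasks, with intermediate results written to disk between them. In Spark the same computation is commonly expressed as a chain of RDD operations such as `flatMap`, `map`, and `reduceByKey`, with intermediate data kept in memory where possible.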
License

This work is licensed under a Creative Commons Attribution 4.0 International License.
Authors contributing to this journal agree to publish their articles under the Creative Commons Attribution 4.0 International License, allowing third parties to share their work (copy, distribute, transmit) and to adapt it, under the condition that the authors are given credit and that in the event of reuse or distribution, the terms of this license are made clear.
