Straggler Problem – Tail Latency in Distributed Networks
DOI: https://doi.org/10.26438/ijcse/v7i8.168178

Keywords: distributed network, latency, straggler detection, data clusters, slowest-performing straggler

Abstract
Distributed processing frameworks split a data-intensive computation job into multiple smaller tasks, which are then executed in parallel on commodity clusters to achieve faster job completion. A natural consequence of such a parallel execution model is that slow-running tasks, commonly called stragglers, can delay overall job completion. Stragglers take longer to complete their tasks than their peers, for reasons such as load imbalance, I/O blocking, garbage collection, and hardware heterogeneity. Straggler tasks remain a major hurdle to faster completion of data-intensive applications running on modern data-processing frameworks. The trouble with stragglers is that when parallel computations are followed by synchronization steps such as reductions, all parallel tasks must wait for one another, so the parallel runtime is dominated by the slowest-performing straggler. In a large-scale distributed system comprising a group of worker nodes, this performance bottleneck is caused by the unpredictable latency of waiting for the slowest nodes (the stragglers) to finish their tasks. Such stragglers increase the average job duration by 52% in the data clusters of Facebook and Bing, even though these companies use state-of-the-art straggler mitigation techniques [1]. This is because current mitigation techniques all involve an element of waiting and speculation: they rely on a wait-speculate-execute mechanism, leading to delayed straggler detection and inefficient resource utilization. Hence, full cloning of small jobs, avoiding waiting and speculation altogether, has been proposed in a system called Dolly. Dolly consumes extra resources due to this replication.
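The dominance of the slowest task, and the benefit of Dolly-style full cloning, can be illustrated with a minimal simulation. This sketch is not taken from any of the cited systems; the latency distribution, task and clone counts, and the helper name `task_latencies` are assumptions chosen for illustration only.

```python
import random

random.seed(0)

def task_latencies(n_tasks, n_clones):
    """Draw simulated runtimes: n_clones independent runs per task."""
    def one_run():
        # Roughly 1s of work per task, but ~5% of runs straggle at 10x
        # (a crude stand-in for GC pauses, I/O blocking, bad hardware).
        base = random.uniform(0.9, 1.1)
        return base * (10 if random.random() < 0.05 else 1)
    return [[one_run() for _ in range(n_clones)] for _ in range(n_tasks)]

lat = task_latencies(n_tasks=50, n_clones=3)

# No mitigation: the job finishes only when its slowest task does,
# so one straggler dominates the whole parallel runtime.
baseline = max(runs[0] for runs in lat)

# Dolly-style full cloning: launch every clone up front and keep the
# earliest finisher of each task; no waiting or speculation involved.
cloned = max(min(runs) for runs in lat)

print(f"baseline job time: {baseline:.2f}s, with cloning: {cloned:.2f}s")
```

Because each task's completion time becomes the minimum over its clones, the cloned job can never be slower than the unmitigated one; the cost, as the abstract notes, is the extra resources consumed by the redundant clones.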
References
[1] S. Venkataraman, A. Panda, M. J. Franklin, and I. Stoica, “The Power of Choice in Data-Aware Cluster Scheduling,” Proc. 11th USENIX Symp. Oper. Syst. Des. Implement. (OSDI), 2014.
[2] D. Ford et al., “Availability in Globally Distributed Storage Systems,” 9th USENIX Symp. Oper. Syst. Des. Implement., pp. 61–74, 2010.
[3] X. Tian, R. Han, L. Wang, J. Zhan, and G. Lu, “Latency critical big data computing in finance,” J. Financ. Data Sci., vol. 1, no. 1, pp. 33–41, 2015.
[4] J. Dean and S. Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” Proc. 6th USENIX Symp. Oper. Syst. Des. Implement. (OSDI), pp. 137–149, 2004.
[5] J. Dean and L. A. Barroso, “The tail at scale,” Commun. ACM, vol. 56, no. 2, p. 74, 2013.
[6] W. D. Gray and D. A. Boehm-Davis, “Milliseconds matter: An introduction to microstrategies and to their use in describing and predicting interactive behavior,” J. Exp. Psychol. Appl., vol. 6, no. 4, pp. 322–335, 2000.
[7] M. Kambadur, T. Moseley, R. Hank, and M. A. Kim, “Measuring interference between live datacenter applications,” Int. Conf. High Perform. Comput. Networking, Storage Anal. SC, no. 3, 2012.
[8] G. Ananthanarayanan, A. Ghodsi, S. Shenker, and I. Stoica, “Effective Straggler Mitigation: Attack of the Clones,” Proc. 10th USENIX Symp. Networked Syst. Des. Implement. (NSDI), 2013.
[9] G. Ananthanarayanan et al., “Reining in the Outliers in Map-Reduce Clusters using Mantri,” Proc. 9th USENIX Symp. Oper. Syst. Des. Implement. (OSDI), pp. 265–278, 2010.
[10] E. Krevat, J. Tucek, and G. R. Ganger, “Disks are like snowflakes: no two are alike,” Proc. 13th USENIX Conf. Hot Top. Oper. Syst., p. 5, 2011.
[11] A. Tumanov, R. H. Katz, M. A. Kozuch, C. Reiss, and G. R. Ganger, “Heterogeneity and dynamicity of clouds at scale,” pp. 1–13, 2012.
[12] P. Beckman, K. Iskra, K. Yoshii, and S. Coghlan, “The influence of operating systems on the performance of collective operations at extreme scale,” Proc. - IEEE Int. Conf. Clust. Comput. ICCC, 2006.
[13] F. Petrini, D. J. Kerbyson, and S. Pakin, “The Case of the Missing Supercomputer Performance,” Proc. ACM/IEEE Conf. Supercomputing (SC), 2003.
[14] K. B. Ferreira, P. G. Bridges, R. Brightwell, and K. T. Pedretti, “The impact of system design parameters on application noise sensitivity,” Cluster Comput., vol. 16, no. 1, pp. 117–129, 2013.
[15] C. Curino, D. E. Difallah, C. Douglas, and S. Krishnan, “Reservation-based Scheduling: If You’re Late Don’t Blame Us!,” Proc. ACM Symp. Cloud Comput. (SoCC), 2014.
[16] A. D. Ferguson, P. Bodik, E. Boutin, and R. Fonseca, “Jockey : Guaranteed Job Latency in Data Parallel Clusters,” Proc. 8th ACM Eur. Conf. Comput. Syst. - EuroSys ’12, pp. 99–112, 2012.
[17] B. Hindman et al., “Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center,” Proc. 8th USENIX Symp. Networked Syst. Des. Implement. (NSDI), 2011.
[18] C. R. Lumb and R. Golding, “D-SPTF: Decentralized Request Distribution in Brick-based Storage Systems,” ACM SIGOPS Oper. Syst. Rev., vol. 38, p. 37, 2004.
[19] D. Shue, M. J. Freedman, and A. Shaikh, “Performance Isolation and Fairness for Multi-Tenant Cloud Storage,” Proc. 10th USENIX Symp. Oper. Syst. Des. Implement. (OSDI), 2012.
[20] T. Zhu, A. Tumanov, M. A. Kozuch, M. Harchol-Balter, and G. R. Ganger, “PriorityMeister: Tail Latency QoS for Shared Networked Storage,” Symp. Cloud Comput., pp. 1–14, 2014.
[21] M. Capitão, “Mediator Framework for Inserting Data into Hadoop,” Jan. 2015.
[22] “Apache Hadoop.” [Online]. Available: http://hadoop.apache.org/. [Accessed: 08-Mar-2019].
[23] “NameNode and DataNode – Hadoop In Real World.” [Online]. Available: http://www.hadoopinrealworld.com/namenode-and-datanode/. [Accessed: 08-Mar-2019].
[24] “What is Hadoop Distributed File System (HDFS)? - Definition from WhatIs.com.” [Online]. Available: https://searchdatamanagement.techtarget.com/definition/Hadoop-Distributed-File-System-HDFS. [Accessed: 10-Jan-2019].
[25] “20 Essential Hadoop Tools for Crunching Big Data – Data Science IO – Medium.” [Online]. Available: https://medium.com/data-science-io/20-essential-hadoop-tools-for-crunching-big-data-efbc8b5c77ce. [Accessed: 15-Jan-2019].
[26] J. Xie et al., “Improving MapReduce performance through data placement in heterogeneous Hadoop clusters,” Proc. IEEE Int. Symp. Parallel Distrib. Process. Workshops (IPDPSW), pp. 1–9, 2010.
[27] “Hadoop Soup: 01/10/14.” [Online]. Available: http://dailyhadoopsoup.blogspot.com/2014_01_10_archive.html. [Accessed: 10-Sep-2018].
[28] “20 essential Hadoop tools for crunching Big Data.” [Online]. Available: https://bigdata-madesimple.com/20-essential-hadoop-tools-for-crunching-big-data/. [Accessed: 08-Sep-2018].
[29] “Apache Spark Introduction.” [Online]. Available: https://www.tutorialspoint.com/apache_spark/apache_spark_introduction.htm. [Accessed: 08-Aug-2018].
[30] “Home - Apache Hive - Apache Software Foundation.” [Online]. Available: https://cwiki.apache.org/confluence/display/HIVE. [Accessed: 08-Aug-2018].
[31] “What is Hive? Architecture & Modes.” [Online]. Available: https://www.guru99.com/introduction-hive.html. [Accessed: 08-Jul-2018].
[32] “Impala Hadoop Tutorial.” [Online]. Available: https://www.dezyre.com/hadoop-tutorial/hadoop-impala-tutorial. [Accessed: 20-Nov-2018].
[33] “Cloudera Impala Overview | 5.3.x | Cloudera Documentation.” [Online]. Available: https://www.cloudera.com/documentation/enterprise/5-3-x/topics/impala_intro.html. [Accessed: 04-Oct-2018].
[34] “Big Data: How to manage Hadoop.” [Online]. Available: https://www.cleverism.com/how-to-manage-hadoop-big-data/. [Accessed: 20-Dec-2018].
[35] “Introduction to batch processing - MapReduce - Data, what now?” [Online]. Available: https://datawhatnow.com/batch-processing-mapreduce/. [Accessed: 05-Jan-2019].
[36] A. Gupta, “HIVE: Processing Structured Data in Hadoop,” Aug. 2018.
[37] “Why is Impala faster than Hive? - Quora.” [Online]. Available: https://www.quora.com/Why-is-Impala-faster-than-Hive. [Accessed: 30-Sep-2018].
[38] “What is the advantages of Hadoop and Big data? - Quora.” [Online]. Available: https://www.quora.com/What-is-the-advantages-of-Hadoop-and-Big-data. [Accessed: 12-Jan-2019].
[39] “Advantages of Hadoop MapReduce Programming.” [Online]. Available: https://www.tutorialspoint.com/articles/advantages-of-hadoop-mapreduce-programming. [Accessed: 10-Dec-2018].
This work is licensed under a Creative Commons Attribution 4.0 International License.