Understanding Spark Performance in Hybrid and Multi-Site Clouds

Abstract: Hybrid multi-site big data analytics, which combines on-premise with off-premise resources, has recently gained popularity as a way to process large amounts of data on demand without additional capital investment to increase the size of a single datacenter. However, making the most of hybrid setups for big data analytics is challenging because on-premise resources communicate with off-premise resources at significantly lower throughput and higher latency. Understanding the impact of this limitation is not trivial, especially in the context of modern big data analytics frameworks that introduce complex communication patterns and are optimized to overlap communication with computation in order to hide data transfer latencies. This paper contributes a work-in-progress study that aims to identify and explain this impact in relation to the known behavior on a single cloud. To this end, it analyzes a representative big data workload on a hybrid Spark setup. Unlike previous experience that emphasized the low end-impact of network communication in Spark, we found significant overhead in the shuffle phase when the bandwidth between the on-premise and off-premise resources is sufficiently small.
Document type:
Conference paper
BDAC-15 - 6th International Workshop on Big Data Analytics: Challenges and Opportunities (in conjunction with SC15), Nov 2015, Austin, TX, United States

https://hal.inria.fr/hal-01239140
Contributor: Alexandru Costan
Submitted on: Monday, 14 December 2015 - 14:14:39
Last modified on: Friday, 1 December 2017 - 01:22:21
Document(s) archived on: Saturday, 29 April 2017 - 10:15:35

File

main (1).pdf
Files produced by the author(s)

License

Public domain

Identifiers

  • HAL Id: hal-01239140, version 1

Citation

Roxana-Ioana Roman, Bogdan Nicolae, Alexandru Costan, Gabriel Antoniu. Understanding Spark Performance in Hybrid and Multi-Site Clouds. BDAC-15 - 6th International Workshop on Big Data Analytics: Challenges and Opportunities (in conjunction with SC15), Nov 2015, Austin, TX, United States. 〈hal-01239140〉

Metrics

Record views

711

File downloads

545