T. Akidau, R. Bradshaw, C. Chambers, S. Chernyak, R. J. Fernández-Moctezuma et al., The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-scale, Unbounded, Out-of-order Data Processing, Proc. VLDB Endow, vol.8, 2015.

A. Alexandrov, R. Bergmann, S. Ewen, J. Freytag, F. Hueske et al., The Stratosphere Platform for Big Data Analytics, The VLDB Journal, vol.23, issue.6, pp.939-964, 2014.

A. Alexandrov, S. Ewen, M. Heimel, F. Hueske, O. Kao et al., MapReduce and PACT: Comparing Data Parallel Programming Models, Proceedings of the 14th Conference on Database Systems for Business, Technology, and Web (BTW), 2011.

G. Ananthanarayanan, M. C.-C. Hung, X. Ren, I. Stoica, A. Wierman et al., GRASS: Trimming Stragglers in Approximation Analytics, Proceedings of the 11th USENIX Conference on Networked Systems Design and Implementation. NSDI'14, 2014.

A. Arasu, S. Babu, and J. Widom, The CQL Continuous Query Language: Semantic Foundations and Query Execution, The VLDB Journal, vol.15, issue.2, 2006.
DOI : 10.1007/s00778-004-0147-z

A. Arasu and J. Widom, Resource Sharing in Continuous Sliding-window Aggregates, Proceedings of the Thirtieth International Conference on Very Large Data Bases, vol.30, pp.336-347, 2004.
DOI : 10.1016/b978-012088469-8.50032-2
URL : http://www.vldb.org/conf/2004/RS9P2.PDF

M. Armbrust, R. S. Xin, C. Lian, Y. Huai, D. Liu et al., Spark SQL: Relational Data Processing in Spark, Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. SIGMOD '15, 2015.

Apache Arrow, 2018.

Apache Avro, 2018.

Big Data Digest: How many Hadoops do we really need?, 2018.

R. Bolze, F. Cappello, E. Caron, M. Dayde, F. Desprez et al., Grid'5000: A Large Scale And Highly Reconfigurable Experimental Grid Testbed, International Journal of High Performance Computing Applications, vol.20, issue.4, pp.481-494, 2006.
URL : https://hal.archives-ouvertes.fr/hal-00684943

I. Botan, G. Alonso, P. M. Fischer, D. Kossmann, and N. Tatbul, Flexible and Scalable Storage Management for Data-intensive Stream Processing, Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology. EDBT '09. Saint, pp.934-945, 2009.
DOI : 10.1145/1516360.1516467

P. Carbone, J. Traub, A. Katsifodimos, S. Haridi, and V. Markl, Cutty: Aggregate Sharing for User-Defined Windows, Proceedings of the 25th ACM International on Conference on Information and Knowledge Management. CIKM '16, 2016.

G. J. Chen, J. L. Wiener, S. Iyer, A. Jaiswal, R. Lei et al., Realtime Data Processing at Facebook, Proceedings of the 2016 International Conference on Management of Data. SIGMOD '16, pp.1087-1098, 2016.

F. J. Clemente-Castelló, B. Nicolae, K. Katrinis, M. M. Rafique, R. Mayo et al., Enabling Big Data Analytics in the Hybrid Cloud Using Iterative MapReduce, Proceedings of the 8th International Conference on Utility and Cloud Computing. UCC '15, pp.290-299, 2015.

C. Cranor, T. Johnson, O. Spataschek, and V. Shkapenyuk, Gigascope: A Stream Database for Network Applications, Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data. SIGMOD '03, pp.647-651, 2003.

HPC-Big Data convergence at processing level by bridging in situ/in transit processing with Big Data analytics, 2018.

J. Dean and S. Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, Commun. ACM, vol.51, issue.1, pp.107-113, 2008.

P. J. Desnoyers and P. Shenoy, Hyperion: High Volume Stream Archival for Retrospective Querying, Proceedings of the 2007 USENIX Annual Technical Conference. ATC'07, vol.4, pp.1-4, 2007.

B. Dong, Q. Zheng, F. Tian, K. Chao, R. Ma et al., An Optimized Approach for Storing and Accessing Small Files on Cloud Storage, J. Netw. Comput. Appl, vol.35, pp.1847-1862, 2012.

M. Dorier, G. Antoniu, F. Cappello, M. Snir, R. Sisneros et al., Damaris: Addressing Performance Variability in Data Management for Post-Petascale Simulations, ACM Trans. Parallel Comput, vol.3, issue.3, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01353890

Apache Druid, 2018.

C. Dubnicki, L. Gryz, L. Heldt, M. Kaczmarczyk, W. Kilian et al., HYDRAstor: a Scalable Secondary Storage, FAST '09: Proceedings of the 7th Conference on File and Storage Technologies, pp.197-210, 2009.

S. Ewen, K. Tzoumas, M. Kaufmann, and V. Markl, Spinning Fast Iterative Data Flows, Proc. VLDB Endow, vol.5, issue.11, 2012.
DOI : 10.14778/2350229.2350245

Facebook, 2018.

Apache Flink, 2018.

Flink Large State Use Case, 2018.

Flink Windows, 2018.

M. Graph, 2018.

B. Gedik, Partitioning Functions for Stateful Data Parallelism in Stream Processing, The VLDB Journal, vol.23, pp.517-539, 2014.

S. Ghemawat, H. Gobioff, and S. Leung, The Google File System, Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles. SOSP '03, pp.29-43, 2003.
DOI : 10.1145/945445.945450

J. E. Gonzalez, R. S. Xin, A. Dave, D. Crankshaw, M. J. Franklin et al., GraphX: Graph Processing in a Distributed Dataflow Framework, Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation. OSDI'14, 2014.

Google Algorithms and Theory.

Grid5000, 2018.

J. Gubbi, R. Buyya, S. Marusic, and M. Palaniswami, Internet of Things (IoT): A Vision, Architectural Elements, and Future Directions, Future Gener. Comput. Syst, vol.29, pp.1645-1660, 2013.

Apache Hadoop, 2018.

Hardware Trends in Keynote.

Apache Kudu, 2018.

A. Heise, Meteor/Sopremo: An Extensible Query Language and Operator Model, Proceedings of the Int. Workshop on End-to-End Management of Big Data (BigData) in conjunction with VLDB, 2012.

H. Suite, 2018.

P. Hunt, M. Konar, F. P. Junqueira, and B. Reed, ZooKeeper: Wait-free Coordination for Internet-scale Systems, Proceedings of the 2010 USENIX Conference on USENIX Annual Technical Conference. USENIXATC'10, pp.11-11, 2010.

J. Hwang, M. Balazinska, A. Rasin, U. Cetintemel, M. Stonebraker et al., High-Availability Algorithms for Distributed Stream Processing, Proceedings of the 21st International Conference on Data Engineering. ICDE '05, pp.779-790, 2005.

Introducing Spark Datasets.

N. Jain, S. Mishra, A. Srinivasan, J. Gehrke, J. Widom et al., Towards a Streaming SQL Standard, Proc. VLDB Endow, vol.1, 2008.

J. Kreps, N. Narkhede, and J. Rao, Kafka: A Distributed Messaging System for Log Processing, Proceedings of the 6th International Workshop on Networking Meets Databases. NetDB'11, 2011.

F. P. Junqueira, I. Kelly, and B. Reed, Durability with BookKeeper, SIGOPS Oper. Syst. Rev, vol.47, issue.1, pp.9-15, 2013.

Apache Kafka, 2018.

A. Kejriwal, A. Gopalan, A. Gupta, Z. Jia, S. Yang et al., SLIK: Scalable Low-latency Indexes for a Key-value Store, Proceedings of the 2016 USENIX Conference on Usenix Annual Technical Conference. USENIX ATC '16, 2016.

Keys to Understanding Amazon's Algorithms.

Amazon Kinesis, 2018.

A. Klimovic, Y. Wang, P. Stuedi, A. Trivedi, J. Pfefferle et al., Pocket: Elastic Ephemeral Storage for Serverless Analytics, 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), Carlsbad, CA, pp.427-444, 2018.

S. Krishnamurthy, C. Wu, and M. Franklin, On-the-fly Sharing for Streamed Aggregation, Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data. SIGMOD '06, pp.623-634, 2006.

Kryo, 2018.

Apache Kudu, 2018.

C. Kulkarni, A. Kesavan, R. Ricci, and R. Stutsman, Beyond Simple Request Processing with RAMCloud, IEEE Data Eng, 2017.

C. Kulkarni, S. Moore, M. Naqvi, T. Zhang, R. Ricci et al., Splinter: Bare-Metal Extensions for Multi-Tenant Low-Latency Storage, 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), 2018.

S. Kulkarni, N. Bhagat, M. Fu, V. Kedigehalli, C. Kellogg et al., Twitter Heron: Stream Processing at Scale, Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. SIGMOD '15, 2015.

Y. Kwon, M. Balazinska, B. Howe, and J. Rolia, SkewTune: Mitigating Skew in Mapreduce Applications, Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data. SIGMOD '12, pp.25-36, 2012.

A. Lakshman and P. Malik, Cassandra: A Decentralized Structured Storage System, SIGOPS Oper. Syst. Rev, vol.44, 2010.

C. Lee, S. J. Park, A. Kejriwal, S. Matsushita, and J. Ousterhout, Implementing Linearizability at Large Scale and Low Latency, 25th SOSP, 2015.

Large Hadron Collider, 2018.

H. Li, A. Ghodsi, M. Zaharia, S. Shenker, and I. Stoica, Tachyon: Reliable, Memory Speed Storage for Cluster Computing Frameworks, Proceedings of the ACM Symposium on Cloud Computing. SOCC, vol.6, pp.1-6, 2014.

J. Li, D. Maier, K. Tufte, V. Papadimos, and P. A. Tucker, No Pane, No Gain: Efficient Evaluation of Sliding-window Aggregates over Data Streams, SIGMOD Rec, vol.34, issue.1, pp.39-44, 2005.

J. Li, D. Maier, K. Tufte, V. Papadimos, and P. A. Tucker, Semantics and Evaluation Techniques for Window Aggregates in Data Streams, Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data. SIGMOD '05, pp.311-322, 2005.

B. Lohrmann, D. Warneke, and O. Kao, Nephele Streaming: Stream Processing Under QoS Constraints at Scale, Cluster Computing, vol.17, 2014.

M. Danelutto, P. Kilpatrick, and G. Mencagli, State Access Patterns in Stream Parallel Computations, International Journal of High Performance Computing Applications (IJHPCA).

D. Maier, J. Li, P. Tucker, K. Tufte, and V. Papadimos, Semantics of Data Streams and Operators, Proceedings of the 10th International Conference on Database Theory. ICDT'05, pp.37-52, 2005.

M. Streams, 2018.

P. Matri, Týr: Storage-Based HPC and Big Data Convergence Using Transactional Blobs, 2018.

M. R. N. Mendes, P. Bizarro, and P. Marques, Overcoming Memory Limitations in High-throughput Event-based Applications, Proceedings of the 4th ACM/SPEC International Conference on Performance Engineering. ICPE '13, 2013.

. Messaging and . Both, , 2018.

H. Miao, H. Park, M. Jeon, G. Pekhimenko, K. S. McKinley et al., StreamBox: Modern Stream Processing on a Multicore Machine, USENIX ATC, 2017.

M. Cherniack, H. Balakrishnan, M. Balazinska, D. Carney, U. Çetintemel et al., Scalable Distributed Stream Processing, First Biennial Conference on Innovative Data Systems Research, 2003.

T. M. Mitchell, Machine Learning and Data Mining, Commun. ACM, vol.42, pp.30-36, 1999.

P. Moritz, R. Nishihara, S. Wang, A. Tumanov, R. Liaw et al., Ray: A Distributed Framework for Emerging AI Applications, 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), Carlsbad, CA, 2018.

New directions for Apache Spark in 2015.

S. Niazi, M. Ismail, S. Haridi, J. Dowling, S. Grohsschmiedt et al., HopsFS: Scaling Hierarchical File System Metadata Using NewSQL Databases, Proceedings of the 15th USENIX Conference on File and Storage Technologies. FAST'17, 2017.

B. Nicolae, BlobSeer: Towards efficient data storage management for large-scale, distributed systems, PhD thesis, Université Rennes 1, 2010.
URL : https://hal.archives-ouvertes.fr/tel-00552271

B. Nicolae, Leveraging naturally distributed data redundancy to reduce collective I/O replication overhead, IPDPS '15: 29th IEEE International Parallel and Distributed Processing Symposium, pp.1023-1032, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01115700

B. Nicolae, Towards Scalable Checkpoint Restart: A Collective Inline Memory Contents Deduplication Proposal, IPDPS '13: The 27th IEEE International Parallel and Distributed Processing Symposium, pp.19-28, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00781532

D. Ongaro, S. M. Rumble, R. Stutsman, J. Ousterhout, and M. Rosenblum, Fast Crash Recovery in RAMCloud, 23rd SOSP, 2011.

Apache ORC, 2018.

J. Ousterhout, Always Measure One Level Deeper, Commun. ACM, vol.61, pp.74-83, 2018.

J. Ousterhout, A. Gopalan, A. Gupta, A. Kejriwal, C. Lee et al., The RAMCloud Storage System, ACM Trans. Comput. Syst, vol.33, issue.3, 2015.

K. Ousterhout, R. Rasti, S. Ratnasamy, S. Shenker, and B. Chun, Making Sense of Performance in Data Analytics Frameworks, Proceedings of the 12th USENIX Conference on Networked Systems Design and Implementation. NSDI'15, 2015.

Apache Parquet, 2018.

Pravega.

Apache Pulsar, 2018.

H. Qin, Q. Li, J. Speiser, P. Kraft, and J. Ousterhout, Arachne: Core-Aware Thread Management, 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), 2018.

Redis, 2018.

J. Shi, Y. Qiu, U. F. Minhas, L. Jiao, C. Wang et al., Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics, Proc. VLDB Endow, vol.8, issue.13, 2015.

K. Shvachko, H. Kuang, S. Radia, and R. Chansler, The Hadoop Distributed File System, Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST). MSST '10, pp.1-10, 2010.
DOI : 10.1109/msst.2010.5496972

S. Guo, R. Dhamankar, and L. Stewart, DistributedLog: A High Performance Replicated Log Service, IEEE 33rd International Conference on Data Engineering. ICDE'17, 2017.

Small graph.

Apache Spark, 2018.

Apache Impala, 2018.

Y. Taleb, Optimizing Distributed In-memory Storage Systems: Fault-tolerance, Performance, Energy Efficiency, PhD thesis, ENS Rennes, 2018.
URL : https://hal.archives-ouvertes.fr/tel-01891897

Y. Taleb, R. Stutsman, G. Antoniu, and T. Cortes, Tailwind: Fast and Atomic RDMA-based Replication, ATC '18-USENIX Annual Technical Conference, pp.1-13, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01676502

K. Tangwongsan, M. Hirzel, S. Schneider, and K. Wu, General Incremental Sliding-window Aggregation, Proc. VLDB Endow, vol.8, issue.7, 2015.
DOI : 10.14778/2752939.2752940

Hadoop TeraGen for TeraSort, 2018.

T. Sort, 2018.

The world beyond batch: Streaming 101.

The world beyond batch: Streaming 102.

Q. To, J. Soto, and V. Markl, A Survey of State Management in Big Data Processing Systems, 2017.

R. Tudoran, A. Costan, G. Antoniu, and H. Soncu, TomusBlobs: Towards Communication-Efficient Storage for MapReduce Applications in Azure, Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, pp.427-434, 2012.
DOI : 10.1109/ccgrid.2012.104
URL : https://hal.archives-ouvertes.fr/hal-00670725

Twitter.

L. G. Valiant, A Bridging Model for Parallel Computation, Commun. ACM, vol.33, pp.103-111, 1990.
DOI : 10.1145/79173.79181

S. Venkataraman, A. Panda, K. Ousterhout, M. Armbrust, A. Ghodsi et al., Drizzle: Fast and Adaptable Stream Processing at Scale, 26th SOSP, pp.374-389, 2017.

P. Viotti and M. Vukolić, Consistency in Non-Transactional Distributed Storage Systems, ACM Comput. Surv, vol.49, issue.1, 2016.

D. Warneke and O. Kao, Nephele: Efficient Parallel Data Processing in the Cloud, Proceedings of the 2nd Workshop on Many-Task Computing on Grids and Supercomputers, 2009.

Large graph, 2018.

M. Wiesmann, F. Pedone, A. Schiper, B. Kemme, and G. Alonso, Understanding Replication in Databases and Distributed Systems, Proceedings of the 20th International Conference on Distributed Computing Systems (ICDCS 2000). ICDCS '00, Washington, DC, USA, 2000.

F. Yang, E. Tschetter, X. Léauté, N. Ray, G. Merlino et al., Druid: A Real-time Analytical Data Store, Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data. SIGMOD '14, pp.157-168, 2014.

L. Yang, J. Cao, Y. Yuan, T. Li, A. Han et al., A Framework for Partitioning and Execution of Data Stream Applications in Mobile Cloud Computing, SIGMETRICS Perform. Eval. Rev, vol.40, 2013.

E. Yildirim and T. Kosar, Network-aware end-to-end data throughput optimization, Proceedings of the first international workshop on Network-aware data management, 2011.

M. Zaharia, D. Borthakur, J. S. Sarma, K. Elmeleegy, S. Shenker et al., Delay Scheduling: A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling, Proceedings of the 5th European Conference on Computer Systems. EuroSys '10, pp.265-278, 2010.

M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma et al., Resilient Distributed Datasets: A Fault-tolerant Abstraction for In-memory Cluster Computing, Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation. NSDI'12, pp.2-2, 2012.

M. Zaharia, T. Das, H. Li, T. Hunter, S. Shenker et al., Discretized Streams: Fault-tolerant Streaming Computation at Scale, Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles. SOSP '13. Farmington, pp.423-438, 2013.

B. Zhu, K. Li, and H. Patterson, Avoiding the disk bottleneck in the data domain deduplication file system, FAST'08: Proceedings of the 6th USENIX Conference on File and Storage Technologies, vol.18, p.14, 2008.

... with the same goal, has more recently proposed native closed-loop iteration operators [24] and an automatic cost-based optimizer that can reorder operators and better support streaming execution by reducing latency.

Getting the most out of these frameworks is a considerable challenge, because execution efficiency depends heavily on tuning complex parameter configurations, which in turn requires a fine-grained understanding of the underlying architectural choices. To this end, we propose a methodology that correlates parameter tuning and the operators' execution plan with resource usage. We analyze the performance of Spark and Flink with several workloads representative of both batch and iterative processing, on platforms of up to 100 nodes. The main conclusion of this analysis is that neither framework outperforms the other for all data types, sizes, and job patterns. We examine in depth how the results correlate with the operators.

Memory management plays a crucial role in the execution of a workload, especially for datasets larger than the available memory. For example, Flink's aggregation component (a sort-based combiner) appears to be more efficient than Spark's because it relies on custom memory management and type-oriented data serialization.

During our experiments, we observed that, unlike Spark, Flink does not accumulate many objects on the heap; it stores them in a dedicated off-heap memory region to avoid memory problems. This leads to a hybrid on-heap and off-heap memory configuration that is difficult to tune. Ideally, the memory fractions should be set automatically by the system and adjusted dynamically at runtime. In Flink, most operators are implemented so that they can survive with very little memory, spilling to disk if necessary. We also observed that, although Spark can serialize data to disk, significant parts of the data must reside in the Java virtual machine (JVM) heap for several operations, which leads to storing them as objects on the heap. This approach has notable drawbacks.

O.-C. Marcu, A. Costan, G. Antoniu, M. S. Pérez-Hernández, B. Nicolae et al., KerA: Scalable Data Ingestion for Stream Processing, IEEE International Conference on Distributed Computing Systems, 2018.

O.-C. Marcu, A. Costan, G. Antoniu, and M. S. Pérez-Hernández, Spark versus Flink: Understanding Performance in Big Data Analytics Frameworks, IEEE International Conference on Cluster Computing, 2016.

Publications in international workshops

O.-C. Marcu, A. Costan, G. Antoniu, M. S. Pérez-Hernández, R. Tudoran et al., Towards a Unified Storage and Ingestion Architecture for Stream Processing, Second Workshop on Real-time & Stream Analytics in Big Data, colocated with the 2017 IEEE International Conference on Big Data, 2017.

O.-C. Marcu, R. Tudoran, B. Nicolae, A. Costan, G. Antoniu et al., Exploring Shared State in Key-Value Store for Window-Based Multi-Pattern Streaming Analytics, Workshop on the Integration of Extreme Scale Computing and Big Data Management and Analytics in conjunction with IEEE/ACM CCGrid, 2017.