Swift: Reliable and Low-Latency Data Processing at Cloud Scale

Abstract

Nowadays, it is a rapidly rising demand yet challenging issue to run large-scale applications on shared infrastructures such as data centers and clouds with low execution latency and high resource utilization. This paper reports our experience with Swift, a system capable of efficiently running real-time and interactive data processing jobs at cloud scale. Taking directed acyclic graph (DAG) as the job model, Swift achieves the design goal by three new mechanisms: 1) fine-grained scheduling that can efficiently partition a job into graphlets (i.e., sub-graphs) based on new shuffle heuristics and that does scheduling in the unit of graphlet, thus avoiding resource fragmentation and waste, 2) adaptive memory-based in-network shuffling that reduces IO overhead and data transfer time by doing shuffle in memory and allowing jobs to select the most efficient way to fulfill shuffling, and 3) lightweight fault tolerance and recovery that only prolong the whole job execution time slightly with the help of timely failure detection and fine-grained failure recovery. Experimental results show that Swift can achieve an average speedup of 2.11× on TPC-H, and 14.18× on Terasort when compared with Spark. Swift has been deployed in production, supporting as many as 140,000 executors and processing millions of jobs per day. Experiments with production traces show that Swift outperforms JetScope and Bubble Execution by 2.44× and 1.23× respectively.

Publication
2021 IEEE 37th International Conference on Data Engineering