Measuring and Optimizing Distributed Array Programs

Abstract

Array-based distributed computing frameworks are increasingly popular because they suit many machine learning and data mining algorithms. However, most of these frameworks execute each primitive in isolation and in the exact order the programmer specifies, which leaves substantial room for optimization. In this paper, we propose Kasen, a novel array-based programming model that distinguishes itself from models in the existing literature by defining a strict computation and communication model. This strictness makes it easy to analyze a program's behavior and measure its performance. Building on it, we design an optimizer that automatically applies high-level optimizations to programs as written by programmers. In our evaluation, Kasen's optimizer significantly reduces memory reads and writes, buffer allocations, and network traffic, yielding speedups of up to 5.82x.

Publication
42nd International Conference on Very Large Data Bases