Abstract:
Spark is a leading platform for distributed large-scale data processing. It is designed around two main features: (1) an in-memory data engine that makes it faster than other systems (e.g., Hadoop MapReduce), and (2) a distributed programming model with an extensible, easy-to-use API available in Scala, Java, R, and Python. Despite these features, writing efficient and complex Spark applications is still error-prone and time-consuming, and it requires a clear and deep understanding of the inner workings of Spark. For instance, (1) Spark does not support the composition of independently developed Spark applications; (2) it lacks automatic persistence (caching) of distributed datasets for reuse across several operations; and (3) the same task can be implemented in several different ways, with significantly different execution times. The contribution of the thesis is twofold. First, we propose a component-based framework for composing independently developed Spark applications. The framework takes as input a set of Spark sub-applications equipped with input-output interfaces for exchanging datasets, and a configuration file defining the dependencies between these interfaces. It then automatically merges them into a single monolithic Spark application. We support our framework with several automatic persistence strategies to optimize the execution of the resulting Spark application. Second, we present TaBOS, a transformation-based optimizer for Spark programs. TaBOS takes a Spark program and generates a state space of semantically equivalent programs by applying a set of rewrite rules. Each rewrite rule replaces a fragment of the program with a new one that aims to improve performance while preserving the program's semantics. From the generated state space, TaBOS selects one optimal program based on a predefined strategy. We introduce several selection strategies (e.g., applying the maximum number of transformations, choosing a program with the minimum number of heavy operations, and prune-search techniques) for identifying an optimal program.
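The rewrite-and-select idea described above can be illustrated with a small sketch. This is not TaBOS itself: the program representation (a tuple of operation names), the three rewrite rules, the set of operations treated as heavy, and all function names are assumptions chosen for illustration. The sketch enumerates the state space of programs reachable through the rules and then picks one with the fewest heavy operations.

```python
from collections import deque

# Hypothetical rewrite rules: (pattern, replacement) over operation names.
# Each rule is assumed to preserve semantics, e.g. two consecutive maps
# can be fused into one map that applies the composed function.
RULES = [
    (("map", "map"), ("map",)),
    (("distinct", "distinct"), ("distinct",)),
    (("groupByKey", "mapValues_sum"), ("reduceByKey_sum",)),
]

# Assumption: operations that trigger a shuffle are counted as "heavy".
HEAVY = {"distinct", "groupByKey", "reduceByKey_sum"}


def rewrites(program):
    """Yield every program obtained by applying one rule at one position."""
    for pattern, replacement in RULES:
        n = len(pattern)
        for i in range(len(program) - n + 1):
            if program[i:i + n] == pattern:
                yield program[:i] + replacement + program[i + n:]


def state_space(program):
    """Breadth-first enumeration of all programs reachable via the rules."""
    seen = {program}
    queue = deque([program])
    while queue:
        current = queue.popleft()
        for candidate in rewrites(current):
            if candidate not in seen:
                seen.add(candidate)
                queue.append(candidate)
    return seen


def heavy_count(program):
    """Number of heavy (shuffle) operations in the program."""
    return sum(op in HEAVY for op in program)


def select_min_heavy(programs):
    """Pick a program with the fewest heavy operations; ties go to the shortest."""
    return min(programs, key=lambda p: (heavy_count(p), len(p), p))


original = ("map", "map", "distinct", "distinct", "groupByKey", "mapValues_sum")
space = state_space(original)
best = select_min_heavy(space)
print(len(space), best)
```

Because every rule in this sketch shortens the program, the enumeration always terminates. A rule set that can grow or reorder programs would need a bound or a pruning step, which is where prune-search strategies would come in.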
Description:
Thesis. M.S. American University of Beirut. Department of Computer Science, 2018. T:6716. Advisor : Dr. Mohamad Jaber, Assistant Professor, Computer Science ; Committee members : Dr. Paul Attie, Professor, Computer Science ; Dr. Mohamed Nassar, Assistant Professor, Computer Science.
Includes bibliographical references (leaves 67-69)