dc.contributor.author |
Shmeiss, Zeinab Hasan |
dc.date.accessioned |
2018-10-11T11:37:00Z |
dc.date.available |
2018-10-11T11:37:00Z |
dc.date.issued |
2018 |
dc.date.submitted |
2018 |
dc.identifier.other |
b21047510 |
dc.identifier.uri |
http://hdl.handle.net/10938/21387 |
dc.description |
Thesis. M.S. American University of Beirut. Department of Computer Science, 2018. T:6716. Advisor: Dr. Mohamad Jaber, Assistant Professor, Computer Science; Committee members: Dr. Paul Attie, Professor, Computer Science; Dr. Mohamed Nassar, Assistant Professor, Computer Science. |
dc.description |
Includes bibliographical references (leaves 67-69) |
dc.description.abstract |
Spark is the leading platform for distributed large-scale data processing. It is designed with two main features: (1) an in-memory data engine that makes it significantly faster than other systems (e.g., Hadoop MapReduce), and (2) a distributed programming model with an extensible, easy-to-use API supported by Scala, Java, R, and Python. Despite these features, writing efficient and complex Spark applications is still error-prone and time-consuming, and requires a clear and deep understanding of the inner workings of Spark. For instance, (1) Spark does not support composition of independently developed Spark applications; (2) it lacks automatic persisting (caching) of distributed datasets for reuse across several operations; and (3) the same task can be implemented in several different ways, with significantly different execution times. The contribution of the thesis is twofold. First, we propose a component-based framework for composing independently developed Spark applications. The framework takes as input a set of sub-Spark applications embedded with input-output interfaces for exchanging datasets, and a configuration file defining the dependencies between these interfaces. It then automatically merges them into a single monolithic Spark application. We support our framework with several automatic persisting strategies to optimize the execution of the produced Spark application. Second, we present TaBOS, a transformation-based optimizer for Spark programs. TaBOS takes a Spark program and generates a state space of semantically equivalent programs by applying a set of rewrite rules. A single rewrite rule replaces a fragment of the program with a new one, with the aim of improving performance while preserving the program's semantics. From the generated state space, TaBOS selects one optimal program based on a predefined strategy. We introduce several selection strategies (e.g., applying the maximum number of transformations, selecting the program with the minimum number of heavy operations, and prune-search techniques) for identifying an optimal program. |
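[Editor's illustration] The rewrite-rule idea in the abstract can be sketched with a toy model (pure Python, not the thesis implementation; the operation names, the single rule, and the cost measure below are illustrative assumptions): a program is a sequence of operations, a rule replaces a fragment with a semantically equivalent one, the state space is the closure under rule application, and a selection strategy picks the program with the fewest heavy (shuffle-like) operations.

```python
# Toy sketch of a transformation-based optimizer in the style the
# abstract describes. Programs are tuples of operation names; "heavy"
# stands in for shuffle-inducing Spark operations.
HEAVY = {"groupByKey"}

# One illustrative rewrite rule: a groupByKey followed by a per-key sum
# is equivalent to the cheaper reduceByKey (a well-known Spark idiom).
RULES = [
    (("groupByKey", "mapValues_sum"), ("reduceByKey_add",)),
]

def apply_rules(program):
    """Yield every program reachable by a single rule application."""
    for pattern, replacement in RULES:
        n = len(pattern)
        for i in range(len(program) - n + 1):
            if program[i:i + n] == pattern:
                yield program[:i] + replacement + program[i + n:]

def state_space(program):
    """Exhaustively apply rules, collecting all equivalent programs."""
    seen, frontier = {program}, [program]
    while frontier:
        p = frontier.pop()
        for q in apply_rules(p):
            if q not in seen:
                seen.add(q)
                frontier.append(q)
    return seen

def heavy_ops(program):
    """Cost measure for the 'minimum heavy operations' strategy."""
    return sum(op in HEAVY for op in program)

prog = ("map", "groupByKey", "mapValues_sum", "collect")
space = state_space(prog)
best = min(space, key=heavy_ops)  # selection strategy: fewest shuffles
```

Here `best` is `("map", "reduceByKey_add", "collect")`: the shuffle-heavy fragment has been rewritten away while the overall result of the computation is preserved.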
dc.format.extent |
1 online resource (x, 69 leaves) : illustrations |
dc.language.iso |
eng |
dc.subject.classification |
T:006716 |
dc.subject.lcsh |
SPARK (Computer program language); Big data; Software engineering; Electronic data processing -- Distributed processing. |
dc.title |
Component and transformation based frameworks for building and optimizing Spark programs |
dc.type |
Thesis |
dc.contributor.department |
Faculty of Arts and Sciences. Department of Computer Science |
dc.contributor.institution |
American University of Beirut. |