AUB ScholarWorks

Component and transformation based frameworks for building and optimizing Spark programs

Show simple item record

dc.contributor.author Shmeiss, Zeinab Hasan
dc.date.accessioned 2018-10-11T11:37:00Z
dc.date.available 2018-10-11T11:37:00Z
dc.date.issued 2018
dc.date.submitted 2018
dc.identifier.other b21047510
dc.identifier.uri http://hdl.handle.net/10938/21387
dc.description Thesis. M.S. American University of Beirut. Department of Computer Science, 2018. T:6716. Advisor: Dr. Mohamad Jaber, Assistant Professor, Computer Science; Committee members: Dr. Paul Attie, Professor, Computer Science; Dr. Mohamed Nassar, Assistant Professor, Computer Science.
dc.description Includes bibliographical references (leaves 67-69)
dc.description.abstract Spark is the leading platform for distributed large-scale data processing. It is designed with two main features: (1) an in-memory data engine that makes it significantly faster than disk-based systems (e.g., Hadoop MapReduce), and (2) a distributed programming model with an extensible, easy-to-use API supported in Scala, Java, R, and Python. Despite these features, writing efficient and complex Spark applications remains error-prone and time-consuming, and requires a deep understanding of the inner workings of Spark. For instance: (1) Spark does not support composition of independently developed Spark applications; (2) it lacks automatic persisting (caching) of distributed datasets that are reused across several operations; and (3) the same task can be implemented in several different ways, with significantly different execution times. The contribution of this thesis is twofold. First, we propose a component-based framework for composing independently developed Spark applications. The framework takes as input a set of sub-Spark applications, each embedded with input-output interfaces for exchanging datasets, and a configuration file defining the dependencies between these interfaces; it then automatically merges them into a single monolithic Spark application. We complement the framework with several automatic persisting strategies to optimize the execution of the produced Spark application. Second, we present TaBOS, a transformation-based optimizer for Spark programs. TaBOS takes a Spark program and generates a state space of semantically equivalent programs by applying a set of rewrite rules. Each rewrite rule replaces a fragment of the program with a new one that aims at performance optimization while preserving the program's semantics. From the generated state space, TaBOS selects one optimal program according to a predefined strategy. We introduce several selection strategies (e.g., applying the maximum number of transformations, selecting the program with the minimum number of heavy operations, prune-search techniques) for identifying an optimal program.
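The abstract's transformation-based approach can be illustrated with a minimal, self-contained sketch. All names below (the rule set, the operator names, the `select_min_heavy` strategy) are hypothetical and chosen for illustration only; they are not TaBOS's actual API or rule catalogue. The sketch models a Spark program as a flat list of operator names, applies rewrite rules to generate a state space of equivalent programs, and selects the one with the fewest heavy (shuffle-inducing) operations — one of the selection strategies the abstract mentions. A well-known example of such a rule is replacing `groupByKey` followed by a per-key sum with the cheaper `reduceByKey`.

```python
# Hypothetical sketch of a TaBOS-style rewrite pass. A "program" is a list
# of operator names; a rewrite rule replaces a matching fragment with a
# semantically equivalent (and hopefully cheaper) one.

HEAVY_OPS = {"groupByKey", "join", "distinct"}  # shuffle-inducing operators

# Each rule: (fragment_to_match, replacement).
# groupByKey + per-key sum is equivalent to reduceByKey with a sum combiner;
# the map/filter swap is valid only when the filter predicate does not depend
# on the map's output (assumed here purely for illustration).
RULES = [
    (("groupByKey", "mapValues_sum"), ("reduceByKey_sum",)),
    (("map", "filter"), ("filter", "map")),
]

def apply_rule(program, rule):
    """Yield every program obtained by one application of `rule`."""
    frag, repl = rule
    n = len(frag)
    for i in range(len(program) - n + 1):
        if tuple(program[i:i + n]) == frag:
            yield program[:i] + list(repl) + program[i + n:]

def state_space(program):
    """Exhaustively generate semantically equivalent programs (BFS)."""
    seen = {tuple(program)}
    frontier = [program]
    while frontier:
        nxt = []
        for p in frontier:
            for rule in RULES:
                for q in apply_rule(p, rule):
                    if tuple(q) not in seen:
                        seen.add(tuple(q))
                        nxt.append(q)
        frontier = nxt
    return [list(p) for p in seen]

def select_min_heavy(programs):
    """Selection strategy: minimize the number of heavy (shuffle) operations."""
    return min(programs, key=lambda p: sum(op in HEAVY_OPS for op in p))

prog = ["map", "groupByKey", "mapValues_sum"]
best = select_min_heavy(state_space(prog))
print(best)  # → ['map', 'reduceByKey_sum']
```

In this toy state space, the rewritten program eliminates the `groupByKey` shuffle, which is exactly the kind of execution-time difference between semantically equivalent programs that the abstract describes. The thesis's actual optimizer of course operates on real Spark programs, not on lists of operator names.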
dc.format.extent 1 online resource (x, 69 leaves) : illustrations
dc.language.iso eng
dc.subject.classification T:006716
dc.subject.lcsh SPARK (Computer program language); Big data; Software engineering; Electronic data processing -- Distributed processing.
dc.title Component and transformation based frameworks for building and optimizing Spark programs
dc.type Thesis
dc.contributor.department Faculty of Arts and Sciences. Department of Computer Science.
dc.contributor.institution American University of Beirut.

