Component- and transformation-based frameworks for building and optimizing Spark programs

dc.contributor.authorShmeiss, Zeinab Hasan,
dc.contributor.departmentFaculty of Arts and Sciences. Department of Computer Science.
dc.contributor.institutionAmerican University of Beirut.
dc.date2018
dc.date.accessioned2018-10-11T11:37:00Z
dc.date.available2018-10-11T11:37:00Z
dc.date.issued2018
dc.date.submitted2018
dc.descriptionThesis. M.S. American University of Beirut, Department of Computer Science, 2018. T:6716. Advisor: Dr. Mohamad Jaber, Assistant Professor, Computer Science; Committee members: Dr. Paul Attie, Professor, Computer Science; Dr. Mohamed Nassar, Assistant Professor, Computer Science.
dc.descriptionIncludes bibliographical references (leaves 67-69)
dc.description.abstractSpark is the leading platform for distributed large-scale data processing. It is designed with two main features: (1) an in-memory data engine that makes it markedly faster than other systems (e.g., Hadoop MapReduce), and (2) a distributed programming model with an extensible, easy-to-use API supported in Scala, Java, R, and Python. Despite these features, writing efficient and complex Spark applications remains error-prone and time-consuming, and requires a clear and deep understanding of the inner workings of Spark. For instance: (1) Spark does not support composition of independently developed Spark applications; (2) it lacks automatic persisting (caching) of distributed datasets for reuse across several operations; and (3) the same task can be implemented in several different ways, with significantly different execution times. The contribution of this thesis is twofold. First, we propose a component-based framework for composing independently developed Spark applications. The framework takes as input a set of sub-Spark applications equipped with input-output interfaces for exchanging datasets, together with a configuration file defining the dependencies between these interfaces, and automatically merges them into a single monolithic Spark application. We support the framework with several automatic persisting strategies to optimize the execution of the produced Spark application. Second, we present TaBOS, a transformation-based optimizer for Spark programs. TaBOS takes a Spark program and generates a state space of semantically equivalent programs by applying a set of rewrite rules. A single rewrite rule replaces a fragment of the program with a new one, aiming at performance optimization while preserving the program's semantics. From the generated state space, TaBOS selects an optimal program based on a predefined strategy. We introduce several selection strategies for identifying an optimal program, e.g., applying the maximum number of transformations, selecting the program with the minimum number of heavy operations, or prune-search techniques.
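To illustrate the persisting problem the abstract describes, the following sketch (not taken from the thesis; the path and application name are invented for illustration) shows the manual `persist` call that Spark requires today. Without it, the filtered dataset is recomputed from scratch for each of the two actions; the first framework aims to insert such caching automatically.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Hypothetical example: `logs` feeds two actions, so without an explicit
// persist Spark would recompute the textFile + filter lineage twice.
val spark = SparkSession.builder.appName("persist-sketch").getOrCreate()

val errors = spark.sparkContext
  .textFile("hdfs:///logs/app.txt")        // illustrative path, not from the thesis
  .filter(_.contains("ERROR"))
  .persist(StorageLevel.MEMORY_AND_DISK)   // the manual step the framework automates

val total = errors.count()                 // first action: materializes and caches `errors`
val byHost = errors
  .map(line => (line.split(" ")(0), 1))
  .reduceByKey(_ + _)                      // second action reuses the cached partitions
  .collect()

errors.unpersist()
```

Choosing where to persist (and with which storage level) by hand requires knowing every downstream use of a dataset, which is exactly what becomes hard once applications are composed from independently developed parts.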
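As an example of the kind of semantics-preserving rewrite a TaBOS-style rule can apply (this particular rule is a well-known Spark idiom used here for illustration, not a rule quoted from the thesis), the two fragments below compute the same per-key sums but differ sharply in shuffle cost:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("rewrite-sketch").getOrCreate()
val pairs = spark.sparkContext.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// Fragment A: groupByKey shuffles every (key, value) pair across the
// network before the per-key sums are computed.
val sumsA = pairs.groupByKey().mapValues(_.sum)

// Fragment B: semantically equivalent, but reduceByKey combines values
// on each partition first, so far less data crosses the network.
val sumsB = pairs.reduceByKey(_ + _)
```

A rewrite rule replacing fragment A with fragment B preserves the program's output while typically reducing execution time, which is the kind of heavy-operation reduction the selection strategies score.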
dc.format.extent1 online resource (x, 69 leaves) : illustrations
dc.identifier.otherb21047510
dc.identifier.urihttp://hdl.handle.net/10938/21387
dc.language.isoen
dc.subject.classificationT:006716
dc.subject.lcshSPARK (Computer program language); Big data; Software engineering; Electronic data processing -- Distributed processing.
dc.titleComponent- and transformation-based frameworks for building and optimizing Spark programs
dc.typeThesis

Files

Original bundle

Name: t-6716.pdf
Size: 1.79 MB
Format: Adobe Portable Document Format