A rewrite-based optimizer for Spark
Publisher
Elsevier B.V.
Abstract
Spark is the leading platform for distributed large-scale data processing. Spark's Application Programming Interface (API) offers powerful, easy-to-use distributed abstractions similar to those of functional programming (e.g., map, filter, reduce) in several languages. However, writing efficient Spark applications is still error-prone and time-consuming, and it requires a clear and deep understanding of the inner workings of Spark. For instance, the same task can be implemented in several different ways, yet the execution time can vary drastically between them. To address this, we introduce TaBOS, a rewrite-based optimizer for Spark programs. TaBOS takes a Spark job and automatically generates a state space of equivalent optimized jobs using a set of semantics-preserving rewrite rules. Then, from the generated state space, it selects one optimal program based on a predefined strategy. We introduce several selection strategies (e.g., the job with the maximum number of applied rewrite rules, or the job with the minimum number of heavy operations) for identifying an optimal job from the generated state space. We evaluate the effectiveness, robustness, and speedup gain of our solutions using several case studies. © 2019
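The generate-then-select idea described in the abstract can be illustrated with a small toy model. The sketch below is not TaBOS itself: the job representation (a tuple of operator labels), the two rewrite rules, the set of "heavy" operations, and every function name are assumptions made only for illustration. It models a job as a sequence of operators, applies the rules exhaustively to build the state space, and then picks a job with either of the two strategies the abstract mentions.

```python
from collections import deque

# Assumption: operators treated as "heavy" (shuffle-inducing) in this toy model.
HEAVY_OPS = {"groupByKey", "sortByKey", "join", "distinct"}


def rule_group_then_sum(job):
    """Hypothetical rule: groupByKey + mapValues(sum) -> reduceByKey(add)."""
    out = []
    for i in range(len(job) - 1):
        if job[i] == "groupByKey" and job[i + 1] == "mapValues(sum)":
            out.append(job[:i] + ("reduceByKey(add)",) + job[i + 2:])
    return out


def rule_fuse_filters(job):
    """Hypothetical rule: filter(p) + filter(q) -> filter(p and q)."""
    out = []
    for i in range(len(job) - 1):
        a, b = job[i], job[i + 1]
        if a.startswith("filter(") and b.startswith("filter("):
            fused = "filter(" + a[7:-1] + " and " + b[7:-1] + ")"
            out.append(job[:i] + (fused,) + job[i + 2:])
    return out


def generate_state_space(job, rules):
    """Breadth-first exploration; maps each reachable job to the
    minimum number of rewrites needed to reach it from the input job."""
    seen = {job: 0}
    queue = deque([job])
    while queue:
        current = queue.popleft()
        for rule in rules:
            for rewritten in rule(current):
                if rewritten not in seen:
                    seen[rewritten] = seen[current] + 1
                    queue.append(rewritten)
    return seen


def heavy_count(job):
    """Number of operators in the job that belong to HEAVY_OPS."""
    return sum(1 for op in job if op in HEAVY_OPS)


def select_min_heavy(space):
    """Strategy: fewest heavy operations; ties broken by more rewrites."""
    return min(space, key=lambda j: (heavy_count(j), -space[j]))


def select_max_rewrites(space):
    """Strategy: the job reached with the most applied rewrites."""
    return max(space, key=lambda j: space[j])


RULES = [rule_group_then_sum, rule_fuse_filters]
job = ("textFile", "map(parse)", "filter(p)", "filter(q)",
       "groupByKey", "mapValues(sum)")
space = generate_state_space(job, RULES)
print(select_min_heavy(space))
print(select_max_rewrites(space))
```

In this toy input both strategies agree, because the only heavy operator is removed by one of the rewrites. On real jobs the strategies can disagree, which is presumably why more than one is proposed; a rewrite-count strategy also implicitly assumes that every rule application is an improvement.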
Keywords
Big data analytics, Optimization, Source-to-source, Spark, Application programming interfaces (api), Big data, Boron compounds, Data analytics, Data handling, Electric sparks, Functional programming, Semantics, Tantalum compounds, Case-studies, Error prones, Large-scale data processing, Optimizers, Rewrite rules, Sulfur compounds