dc.description.abstract |
Dynamic parallelism on GPUs provides the means for the GPU to generate
work for itself instead of relying on the CPU: a thread running on the
GPU can launch grids of threads that also run on the GPU. This mechanism is
particularly useful for applications where the required parallelism is
dynamic and not known until execution time. However, multiple performance
issues arise when using dynamic parallelism. First, the massive number of
small launches incurs substantial launch overhead. Second, the large number
of launches is bottlenecked by the limited number of kernels that can execute
simultaneously. Third, the small grids occupying the GPU leave the device
underutilized.
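To make the mechanism concrete, the following is a minimal sketch, with hypothetical kernel names; real CUDA dynamic parallelism additionally requires compute capability 3.5 or higher and compilation with -rdc=true. Each parent thread launches a child grid sized to its own dynamically discovered amount of work, which is the pattern this thesis optimizes:

```cuda
// Minimal sketch of CUDA dynamic parallelism (hypothetical kernel names).
__global__ void childKernel(int parent, int workSize) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < workSize) {
        // ... process element i of the parent's nested work ...
    }
}

__global__ void parentKernel(const int* workSizes, int n) {
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p < n) {
        int w = workSizes[p];  // nested parallelism known only at run time
        if (w > 0) {
            // One small launch per parent thread: the source of the launch
            // overhead, pending-launch limits, and underutilization above.
            childKernel<<<(w + 255) / 256, 256>>>(p, w);
        }
    }
}
```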
In this thesis, we propose a framework that optimizes dynamic parallelism
performance by applying three key compiler optimization techniques:
thresholding, coarsening, and aggregation. Thresholding serializes the child
kernel's work when the benefit of dynamic parallelism is likely to be
cancelled out by the launch overhead.
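A minimal sketch of the idea, reusing the parentKernel and childKernel from the previous sketch and an illustrative hand-picked cutoff (the actual compiler pass derives its own threshold):

```cuda
#define THRESHOLD 128  // illustrative cutoff; a real pass would tune this

__global__ void parentKernelThresholded(const int* workSizes, int n) {
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p < n) {
        int w = workSizes[p];
        if (w > THRESHOLD) {
            // Enough nested work to amortize the launch overhead.
            childKernel<<<(w + 255) / 256, 256>>>(p, w);
        } else {
            // Too little work: serialize it in the parent thread and
            // avoid the launch entirely.
            for (int i = 0; i < w; ++i) {
                // ... process element i of parent p's nested work ...
            }
        }
    }
}
```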
Coarsening allows a single child thread block to sequentially execute the
work of multiple other child thread blocks. Aggregation consolidates multiple
child grids into a single aggregated grid.
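The following sketch illustrates the combined effect of coarsening and aggregation under simplifying assumptions: the ChildTask record, kernel names, and the fixed coarsening factor are illustrative, and each child's work is assumed to fit in one 256-thread block; the actual passes transform code more generally.

```cuda
#define COARSEN_FACTOR 4   // illustrative coarsening factor

// Illustrative task record: one logical child block per parent.
struct ChildTask { int parent; int workSize; };

// Parents append their would-be launches to a buffer instead of launching.
__global__ void parentKernelAggregated(const int* workSizes, int n,
                                       ChildTask* tasks, int* numTasks) {
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p < n && workSizes[p] > 0) {
        int slot = atomicAdd(numTasks, 1);
        tasks[slot] = { p, workSizes[p] };
    }
}

// One aggregated grid replaces many small child grids, and each physical
// block sequentially executes the work of COARSEN_FACTOR logical blocks.
__global__ void aggregatedChildKernel(const ChildTask* tasks, int numTasks) {
    for (int c = 0; c < COARSEN_FACTOR; ++c) {
        int t = blockIdx.x * COARSEN_FACTOR + c;  // logical child block id
        if (t < numTasks && threadIdx.x < tasks[t].workSize) {
            // ... process element threadIdx.x of tasks[t].parent's work ...
        }
    }
}
```

The host, or a single device thread, then launches aggregatedChildKernel once with ceil(numTasks / COARSEN_FACTOR) blocks of 256 threads, replacing many small launches with one.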
We automate these optimizations as separate compiler passes, analyze and
evaluate the interactions between them, and combine them in a single compiler
flow. Our evaluation on datasets with high parallelism irregularity shows
that when our compiler framework is applied to applications with nested
parallelism, it achieves, on average, a 43.0x speedup over applications that
use dynamic parallelism, an 8.7x speedup over applications that do not use
dynamic parallelism, and a 3.6x speedup over applications that use dynamic
parallelism with aggregation only. Our evaluation also shows that even with
all optimizations applied, dynamic parallelism still performs significantly
worse than the alternative without it on datasets with low irregularity and
low parallelism requirements. |