Recent trends in computing have produced accelerators (e.g., GPUs and Cell processors) with more computational power and memory bandwidth than CPUs. Using accelerators often reduces program runtime, but requires architecture-specific code. The OpenCL programming language and its associated programming model address this problem by enabling a single source code to run on both accelerators and CPUs. Computationally intense tasks are written as kernels and run on accelerators, while control logic is handled by a CPU. However, for small kernels the invocation cost is significant, and kernels that use the same data require repeated data transfers. To reduce both the invocation cost and the amount of data transferred, kernels can be fused. Too much fusion, however, can cause capacity misses in local stores and registers. Manually creating efficient fused kernels is time-consuming: dependence analysis between kernels is tedious and error-prone, and the optimal amount of fusion is machine-dependent.
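The idea of fusion can be illustrated with a minimal sketch in plain Python rather than OpenCL. The names `scale_kernel`, `add_kernel`, `unfused`, and `fused` are hypothetical and do not come from the tool presented in the talk; each function stands in for one kernel launch, and the list `tmp` stands in for an intermediate buffer that would otherwise be written out and read back between launches.

```python
def scale_kernel(a, factor):
    """Stand-in for a first kernel: one launch, writes an intermediate buffer."""
    return [x * factor for x in a]


def add_kernel(a, b):
    """Stand-in for a second kernel: one launch, reads the intermediate buffer."""
    return [x + y for x, y in zip(a, b)]


def unfused(a, b, factor):
    """Two launches; the intermediate result is materialized in between."""
    tmp = scale_kernel(a, factor)  # launch 1: produce tmp
    return add_kernel(tmp, b)      # launch 2: consume tmp


def fused(a, b, factor):
    """One launch; each intermediate value is used as soon as it is computed."""
    return [x * factor + y for x, y in zip(a, b)]


if __name__ == "__main__":
    a, b = [1, 2, 3], [10, 20, 30]
    # Both versions compute the same result; only the launch count and the
    # intermediate storage differ.
    print(unfused(a, b, 2) == fused(a, b, 2))
```

The fused version saves one invocation and the intermediate buffer, but each fused work-item holds more live values at once, which is where the capacity pressure on registers and local stores comes from.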
In this talk, we present a tool that automates the fusion of OpenCL kernels. We describe how the tool removes the need for manual dependence analysis when creating fused kernels, and we explain how search can be added to find the amount of fusion that yields the smallest kernel runtimes. Throughout the talk, an elementary multi-physics simulation is used as a motivating example.
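One way to picture a search over the amount of fusion is the sketch below. It is not the tool's actual search strategy: it assumes the kernels form a simple chain in which only adjacent kernels may be fused, it enumerates every contiguous grouping, and it ranks groupings with a made-up cost model. The function names and the parameters `launch_cost`, `capacity`, and `spill_cost` are all hypothetical; a real search would time the generated kernels on the target machine instead.

```python
from itertools import combinations


def contiguous_groupings(kernels):
    """Yield every way to split a kernel chain into contiguous fused groups."""
    n = len(kernels)
    for num_cuts in range(n):
        for cuts in combinations(range(1, n), num_cuts):
            bounds = (0, *cuts, n)
            yield [kernels[i:j] for i, j in zip(bounds, bounds[1:])]


def modeled_cost(grouping, launch_cost=10.0, capacity=2, spill_cost=25.0):
    """Hypothetical cost model (an assumption, not a measurement).

    Every fused group pays one launch; a group larger than `capacity`
    pays a penalty per extra kernel, standing in for capacity misses.
    """
    cost = 0.0
    for group in grouping:
        cost += launch_cost
        cost += spill_cost * max(0, len(group) - capacity)
    return cost


def best_grouping(kernels, cost=modeled_cost):
    """Return the grouping with the lowest cost under the given cost function."""
    return min(contiguous_groupings(kernels), key=cost)


if __name__ == "__main__":
    chain = ["k1", "k2", "k3", "k4"]
    print(best_grouping(chain))
```

Because the penalty depends on `capacity`, changing that parameter changes the preferred grouping, which mirrors the observation that the optimal amount of fusion is machine-dependent. Exhaustive enumeration grows as 2^(n-1) for a chain of n kernels, so it is only practical for short chains.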