December 16, 2020: pocl v1.6 released

Release Highlights

Support for Clang/LLVM 11

LLVM 10 remains officially supported, and versions down to 6.0 might still work.

Enhanced OpenCL debugging usage

See the documentation for instructions.

Improved CUDA performance and features

In PoCL v1.6, the CUDA backend gained several performance improvements. Runs of the SHOC benchmark suite (now continually tested) show that these optimizations resulted in much better performance than a prior benchmark run, particularly for benchmarks involving local memory such as FFT and GEMM. PoCL now often attains performance competitive with Nvidia's proprietary OpenCL driver. We welcome contributions that identify and remove the root causes of any remaining problem areas, as well as contributions that improve feature coverage for the OpenCL 1.2/3.0 standards.

In particular, the following optimizations and improvements landed in the CUDA backend:

Use 32-bit pointer arithmetic for local memory #822
Use static CUDA memory blocks for OpenCL's constant __local blocks. Previous versions of PoCL used a single dynamic shared CUDA memory block for both OpenCL's constant __local blocks and __local function arguments, which resulted in poor SASS code generation due to a pointer aliasing issue. #838, #846, #824
Use a higher unroll threshold in LLVM #826
Implement more special functions #836
Improve clEnqueueFillBuffer #834
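
To make the two kinds of local memory mentioned in the list above concrete, here is a minimal sketch. It reads "constant __local blocks" as __local arrays whose size is fixed at compile time, which is an interpretation rather than something this post spells out. The kernels are kept as Python strings so the block runs without an OpenCL installation. The kernel names, the work-group size of 64, and the helper reverse_within_groups are all hypothetical and are not taken from PoCL or SHOC.

```python
# Illustrative OpenCL C kernels stored as plain strings. Nothing here calls
# into OpenCL; the strings could be handed to any OpenCL host API for building.
# All names below are hypothetical and chosen only for this sketch.

# Kind 1: a __local array whose size is known at compile time.
STATIC_LOCAL_KERNEL = r"""
__kernel void reverse_static(__global const float *in, __global float *out)
{
    __local float tile[64];              /* assumes a work-group size of 64 */
    size_t lid  = get_local_id(0);
    size_t base = get_group_id(0) * 64;
    tile[lid] = in[base + lid];
    barrier(CLK_LOCAL_MEM_FENCE);
    out[base + lid] = tile[63 - lid];
}
"""

# Kind 2: a __local pointer passed as a kernel argument. The host chooses the
# size at launch time, e.g. clSetKernelArg(kernel, 2, nbytes, NULL).
DYNAMIC_LOCAL_KERNEL = r"""
__kernel void reverse_dynamic(__global const float *in, __global float *out,
                              __local float *tile)
{
    size_t lid  = get_local_id(0);
    size_t n    = get_local_size(0);
    size_t base = get_group_id(0) * n;
    tile[lid] = in[base + lid];
    barrier(CLK_LOCAL_MEM_FENCE);
    out[base + lid] = tile[n - 1 - lid];
}
"""


def reverse_within_groups(data, group_size):
    """Pure-Python reference for what both kernels compute:
    reverse the elements inside each work-group-sized chunk."""
    if group_size <= 0 or len(data) % group_size != 0:
        raise ValueError("len(data) must be a positive multiple of group_size")
    out = []
    for start in range(0, len(data), group_size):
        out.extend(reversed(data[start:start + group_size]))
    return out


if __name__ == "__main__":
    print(reverse_within_groups(list(range(8)), 4))
```

Both kernels stage data through local memory and produce the same result. They differ only in whether the size of the local buffer is visible to the compiler, which is the distinction the static versus dynamic CUDA memory block item is about.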

PowerPC support

PoCL v1.6 brings back support for PowerPC 8/9. The internal test suite passes fully on the pthread device, and the CUDA device test suite pass rate is the same as the pass rate for CUDA on an x86_64 machine. Since IBM's OpenCL CPU implementation is deprecated, PoCL now fills the gap of running OpenCL codes on PowerPC machines. This was tested on a PowerPC node with a Tesla V100 on Lawrence Livermore National Laboratory's Lassen supercomputer.

Improved packaging support

In previous PoCL releases, distributing a PoCL binary built with support for various devices required that the build machine and the host machine have the same support for those devices. With the PoCL v1.6 release, PoCL can be compiled with device drivers enabled at build time, and it will then check these devices for availability at run time. This has enabled the conda package manager to distribute PoCL binary packages with CUDA support for Linux-x86_64 and Linux-ppc64le. Pre-built packages of PoCL are available from the conda-forge community package repository for Linux-x86_64, Linux-ppc64le, Linux-aarch64 and Darwin-x86_64 via the Conda user-level package manager.

A more detailed changelog can be found here.

Acknowledgments

The CUDA improvements, PowerPC support and packaging support described in this post were made by Isuru Fernando and Matt Wala with assistance from Nick Christensen and Andreas Klöckner, all part of the Department of Computer Science at the University of Illinois at Urbana-Champaign. The work was partially supported through awards OAC-1931577 and SHF-1911019 from the US National Science Foundation, as well as award DE-NA0003963 from the US Department of Energy.

The Customized Parallel Computing (CPC) research group of Tampere University, Finland, leads the development of PoCL on the side and for the needs of their research projects. This project has received funding from the ECSEL Joint Undertaking (JU) under grant agreement No 783162 (FitOptiVis). The JU receives support from the European Union’s Horizon 2020 research and innovation programme and Netherlands, Czech Republic, Finland, Spain, Italy. It was also supported by the European Union's Horizon 2020 research and innovation programme under Grant Agreement No 871738 (CPSoSaware) and the HSA Foundation.

Download.
