Release Notes for PoCL 7.1

This is mostly a bug-fix/maintenance release after the large 7.0 one.

The most notable improvements are in solidifying and documenting the RISC-V port. For example, 99% of chipStar's internal tests now pass when run through PoCL on a Milk-V Jupiter board, with RVV vector emission confirmed.

Release highlights

  • Support for LLVM 21 for the CPU devices and Level Zero devices.

  • Support for cl_khr_icd v2.0.0, cl_khr_spirv_queries and SPV_KHR_expect_assume.

  • Various stability and ease-of-setup improvements to the Windows port, for example no longer requiring MS Visual Studio Build Tools for linking CPU device kernels.

Notable user facing changes

  • Reduced the overhead of clEnqueueNDRangeKernel() calls in cases where there are several hundred SVM/USM allocations. For example, when running resnet50 inference on OpenVINO, the call time dropped from the previous ~20 us to a few microseconds.

  • Improved the error message printed when a recursive function is encountered: it now names the offending function in addition to the function where the recursion was encountered, and C++ function names are demangled.

  • Windows builds no longer require MS Visual Studio Build Tools for linking CPU device kernels. This only works with 1) a static LLVM built with lld-link and 2) PoCL built with the MSVC compiler for an x86(-64) target. The only remaining runtime dependency is the MSVC runtime library.

Notable bugfixes

  • Multiple fixes to the fine-grain sub-buffer migration code.

Driver-specific features

  • Implemented version 1.0.0 of the cl_khr_spirv_queries extension for drivers that support SPIR-V.

CPU drivers

  • Report SPIR-V 1.3 and 1.4 support when using LLVM 18.

  • Support SPV_KHR_expect_assume.

Level Zero driver

  • Various bugfixes.

  • Enable SPV_INTEL_memory_access_aliasing.

CUDA driver

  • Reimplement support for global offsets and work dim.

  • Implement {read_|write_|}mem_fence() and get_{global|local}_linear_id().

  • Note that the CUDA driver does not support LLVM 21 due to a bug in upstream Clang. Users must use LLVM 17 to 20 with CUDA.
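
For readers unfamiliar with how global offsets interact with the linear-ID built-ins listed above, the following is a minimal host-side sketch of the values that get_global_linear_id() and get_local_linear_id() are defined to return in OpenCL C. It is not PoCL's CUDA driver code; the work_item_t struct and the model_* function names are hypothetical and exist only for this illustration.

```c
#include <stddef.h>

/* Hypothetical model of one work-item's position in an NDRange.
 * None of these names come from PoCL; they are for illustration only. */
typedef struct {
    unsigned work_dim;        /* number of dimensions in use: 1, 2 or 3 */
    size_t global_id[3];      /* as returned by get_global_id(), offset included */
    size_t global_offset[3];  /* as passed to clEnqueueNDRangeKernel() */
    size_t global_size[3];
    size_t local_id[3];
    size_t local_size[3];
} work_item_t;

/* Models get_global_linear_id(): the global offset is subtracted per
 * dimension, then the IDs are flattened with dimension 0 varying fastest. */
size_t model_global_linear_id(const work_item_t *wi)
{
    size_t linear = 0;
    size_t stride = 1;
    for (unsigned d = 0; d < wi->work_dim; ++d) {
        linear += (wi->global_id[d] - wi->global_offset[d]) * stride;
        stride *= wi->global_size[d];
    }
    return linear;
}

/* Models get_local_linear_id(): same flattening inside one work-group;
 * local IDs carry no offset. */
size_t model_local_linear_id(const work_item_t *wi)
{
    size_t linear = 0;
    size_t stride = 1;
    for (unsigned d = 0; d < wi->work_dim; ++d) {
        linear += wi->local_id[d] * stride;
        stride *= wi->local_size[d];
    }
    return linear;
}
```

As an example, in a 2D range of size 4x3 launched with global offset (10, 20), the work-item whose get_global_id() values are (12, 21) has a global linear ID of (21 - 20) * 4 + (12 - 10) = 6.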

Experimental and work-in-progress

  • Expanded the existing defined built-in kernels, introduced a minimal set of new ones, and implemented them on level0/npu to support llama.cpp single-batch inference on Intel NPU devices. Tested with ~1B-parameter fp16 variants of the Gemma 3.1, Qwen 3 and Llama 3.2 models using the experimental branch https://github.com/linehill/llama.cpp/tree/opencl-dbk.

Deprecation/feature removal notices

  • Support for LLVM versions older than 17.0 was removed.