Usage

Basic usage of pocl should be as easy as using any other OpenCL implementation.

While it is possible to link against pocl directly, the recommended way is to use the ICD interface.

Linking your program with pocl through an ICD loader

You can link your OpenCL program against an ICD loader. If your ICD loader is correctly configured to load pocl, your program will be able to use pocl. See the section below for more information about ICD and ICD loaders.

Example of compiling an OpenCL host program using the free ocl-icd loader:

gcc example1.c -o example `pkg-config --libs --cflags OpenCL`

Example of compiling an OpenCL host program using the AMD ICD loader (no pkg-config support):

gcc example1.c -o example -lOpenCL

Installable client driver (ICD)

pocl is built with the ICD extensions of OpenCL by default. This allows you to have several OpenCL implementations concurrently on your computer, and select the one to use at runtime by selecting the corresponding cl_platform. ICD support can be disabled by adding the flag:

-DENABLE_ICD=OFF

to the CMake invocation.

The ocl-icd ICD loader lets you use the OCL_ICD_VENDORS environment variable to specify a (non-standard) replacement for the /etc/OpenCL/vendors directory.

An ICD loader is an OpenCL library acting as a “proxy” to one of the various OpenCL implementations installed in the system. pocl does not provide an ICD loader itself, but NVIDIA, AMD, Intel, Khronos, and the free ocl-icd project each provide one.
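As a rough sketch of how OCL_ICD_VENDORS can be used, the snippet below builds a private vendors directory that lists only pocl. It assumes the usual ICD convention that each .icd file contains the path of one implementation library. The library path and the file name pocl.icd are hypothetical and must be adapted to your installation.

```shell
# Assumed install location of the pocl library; adjust to your system.
pocl_lib="/usr/local/lib/libpocl.so"

# Private vendors directory holding a single .icd file.
vendors_dir="$(mktemp -d)"
printf '%s\n' "$pocl_lib" > "$vendors_dir/pocl.icd"

# Ask ocl-icd to look here instead of /etc/OpenCL/vendors.
export OCL_ICD_VENDORS="$vendors_dir"
```

Programs started from this shell and linked against ocl-icd should then see only the implementations listed in that directory.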

Linking your program directly with pocl

Passing the appropriate linker flags is enough to use pocl in your program. However, please bear in mind that:

The pkg-config tool is used to locate the libraries and headers in the installation directory.

Example of compiling an OpenCL host program against pocl using pkg-config:

gcc example1.c -o example `pkg-config --libs --cflags pocl`

In this link mode, your program will always require the pocl OpenCL library. It won’t be able to run with another OpenCL implementation without recompilation.

Tuning pocl behavior with ENV variables

The behavior of pocl can be controlled with multiple environment variables listed below. The variables are helpful both when using and when developing pocl.

  • POCL_AFFINITY

    Linux-only, specific to ‘cpu’ driver. If set to 1, each thread of the driver sets its affinity to its index. This may be useful with very long running kernels, or when using subdevices. Defaults to 0 (most people don’t need this).

  • POCL_BINARY_SPECIALIZE_WG

    By default the PoCL program binaries store generic kernel binaries which can be executed across any grid dimensions. This configuration variable can be used to also include specialized work-group functions in the binaries, by defining a comma separated list of strings that describe the specialized versions. The strings adhere to the directory names in the PoCL cache from which the binaries are captured.

    Example:

    POCL_BINARY_SPECIALIZE_WG='2-1-1,0-0-0-goffs0,13-1-1-smallgrid,128-2-1-goffs0-smallgrid' poclcc [...]
    

    This makes poclcc generate a binary which contains the generic work-group function binary, a work-group function that is specialized for a local size of 2x1x1, another with a generic local size but specialized for a global offset at the origin, one with a local size of 13x1x1 that is specialized for a “small grid” (size defined by the device driver), and finally one that is specialized for a local size of 128x2x1, a global offset at the origin and a small grid.

  • POCL_BITCODE_FINALIZER

    Defines a custom command that can manipulate the final kernel work-group function bitcode produced after all LLVM optimizations and before entering code generation. This can be useful, for example, to add instrumentation to the LLVM bitcode before proceeding to the backend.

    Example:

    POCL_BITCODE_FINALIZER='verificarlo %(bc) --emit-llvm -o %(bc)' examples/example1/example1
    

    This results in running the above command with the ‘%(bc)’ strings replaced with the path of the final bitcode’s temporary file. Note that the modified bitcode must be written over the same file for code generation to pick it up.

    Please note that setting this variable doesn’t force regeneration of the kernel binaries if they are found in the kernel compiler cache. You can either use POCL_KERNEL_CACHE=0 to disable the kernel cache, or wipe the kernel cache directory manually to force a kernel binary rebuild.

  • POCL_BUILDING

If set, the pocl helper scripts, kernel library and headers are searched first from the pocl build directory. Only has an effect if ENABLE_POCL_BUILDING was enabled at build time (it is by default).

  • POCL_CACHE_DIR

If this is set to an existing directory, pocl uses it as the cache directory for all compilation results. This allows reusing compilation results between pocl invocations. If this env is not set, then the default cache directory will be used, which is $XDG_CACHE_HOME/pocl/kcache (if set) or $HOME/.cache/pocl/kcache/ on Unix-like systems.
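Because the variable only takes effect when the directory already exists, a setup script would typically create it first. A minimal sketch, using a hypothetical per-project location:

```shell
# Hypothetical cache location for one project; any writable path works.
cache_dir="${TMPDIR:-/tmp}/my-project-pocl-cache"

# The directory must exist before pocl starts.
mkdir -p "$cache_dir"
export POCL_CACHE_DIR="$cache_dir"
```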

  • POCL_CPU_LOCAL_MEM_SIZE

Set the local memory size of the CPU devices (cpu, cpu-minimal, cpu-tbb) to the given amount in bytes instead of the default one.

  • POCL_CPU_MAX_CU_COUNT

The maximum number of threads created for work group execution in the ‘cpu’ device driver. The default is to determine this from the number of hardware threads available in the CPU.

  • POCL_CPU_VENDOR_ID_OVERRIDE

Overrides the vendor id reported by PoCL for the CPU drivers. For example, setting the vendor id to be 32902 (0x8086) and setting the driver version using POCL_DRIVER_VERSION_OVERRIDE to “2023.16.7.0.21_160000” (or such) can be used to convince binary-distributed DPC++ compilers to compile and run SYCL programs on the PoCL-CPU driver.
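The DPC++ scenario described above could be set up roughly as follows. The vendor id is passed in decimal because the text gives the decimal form first; whether a hexadecimal string is also accepted is not stated here, so the sketch avoids relying on it.

```shell
# 0x8086 converted to decimal by shell arithmetic, i.e. 32902.
export POCL_CPU_VENDOR_ID_OVERRIDE="$((0x8086))"

# Driver version string taken from the example in the text.
export POCL_DRIVER_VERSION_OVERRIDE="2023.16.7.0.21_160000"
```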

  • POCL_DEBUG

Enables debug messages to stderr. This will be mostly messages from error condition checks in OpenCL API calls and Event/API timing information. Useful to e.g. distinguish between various reasons a call could return CL_INVALID_VALUE. If clock_gettime is available, messages will include a timestamp.

The old way (setting POCL_DEBUG to 1) has been updated to support categories. Using this limits the amount of debug messages produced. Current options are: ‘error’, ‘warning’, ‘general’, ‘memory’, ‘llvm’, ‘events’, ‘cache’, ‘locking’, ‘refcounts’, ‘timing’, ‘hsa’, ‘tce’, ‘cuda’, ‘vulkan’, ‘proxy’ and ‘all’. Note: setting POCL_DEBUG to 1 still works and equals error+warning+general.
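Since the messages go to stderr, it is often convenient to capture them in a file. The helper below is only an illustrative sketch: the function name is made up, and it passes a single category to keep clear of any assumption about how several categories are separated.

```shell
# Run a command with one POCL_DEBUG category and save stderr to a log file.
# usage: run_with_pocl_debug <category> <logfile> <command> [args...]
run_with_pocl_debug() {
    category="$1"
    logfile="$2"
    shift 2
    POCL_DEBUG="$category" "$@" 2> "$logfile"
}

# Example (./example is a hypothetical OpenCL program):
#   run_with_pocl_debug timing pocl_timing.log ./example
```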

  • POCL_DEBUG_LLVM_PASSES

When set to 1, enables debug output from LLVM passes during optimization.

  • POCL_DEVICES and POCL_x_PARAMETERS

POCL_DEVICES is a space separated list of the device instances to be enabled. This environment variable is used for the following devices:

  • cpu-minimal – A minimalistic example device driver for executing kernels on the host CPU. No multithreading.

  • cpu – Execution of OpenCL kernels on the host CPU using (by default) all available CPU threads via pthread library.

  • cpu-tbb – Uses the Intel Threading Building Blocks (or oneTBB) library for task scheduling on the host CPU.

  • cuda – An experimental driver that uses libcuda to execute on NVIDIA GPUs.

  • hsa – Uses HSA Runtime API to control HSA-compliant kernel agents that support HSAIL finalization (deprecated).

  • vulkan – An experimental driver that uses Vulkan and SPIR-V for executing on Vulkan supported devices.

  • ttasim – Device that simulates a TTA device using the TCE’s ttasim library. Enabled only if TCE libraries installed.

  • level0 – An experimental driver that uses libze to execute on Intel GPUs.

If POCL_DEVICES is not set, one cpu device will be used. To specify parameters for drivers, the POCL_<drivername><instance>_PARAMETERS environment variable can be specified (where drivername is in uppercase). Example:

export POCL_DEVICES="cpu ttasim ttasim"
export POCL_TTASIM0_PARAMETERS="/path/to/my/machine0.adf"
export POCL_TTASIM1_PARAMETERS="/path/to/my/machine1.adf"

This creates three devices: one ‘cpu’ device with multithreading and two TTA devices simulated with ttasim. Each ttasim device gets, as a parameter, the path to the architecture description file of the TTA to simulate. POCL_TTASIM0_PARAMETERS is passed to the first ttasim driver instantiated and POCL_TTASIM1_PARAMETERS to the second one.
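The naming rule for the parameter variables can be expressed as a small helper, which may be handy in scripts that configure many instances. This is only a sketch of the rule as stated above (uppercase driver name followed by the instance index); the function name is hypothetical.

```shell
# Print the parameter variable name for a driver instance.
# usage: pocl_param_var <drivername> <instance-index>
pocl_param_var() {
    driver_upper="$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')"
    printf 'POCL_%s%s_PARAMETERS\n' "$driver_upper" "$2"
}

pocl_param_var ttasim 0   # prints POCL_TTASIM0_PARAMETERS
```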

  • POCL_DISCOVERY

Used to enable or disable device discovery. See Dynamic Device Management and Network Discovery for details on discovery.

  • POCL_DRIVER_VERSION_OVERRIDE

    Can be used to override the driver version reported by PoCL. See POCL_CPU_VENDOR_ID_OVERRIDE for an example use case.

  • POCL_EXTRA_BUILD_FLAGS

Adds the contents of the environment variable to all clBuildProgram() calls. E.g. POCL_EXTRA_BUILD_FLAGS="-g -cl-opt-disable" can be used to force debug data into all built kernels, which helps when debugging kernel issues with tools such as gdb or valgrind.

  • POCL_IGNORE_CL_STD

Ignores any -cl-std options passed to clBuildProgram(). This is useful for force-running programs that set the version to 2.x even though they do not need all of its features, some of which the targeted 3.x driver might not implement.

  • POCL_KERNEL_CACHE

If this is set to 0 at runtime, kernel compilation files will be deleted at clReleaseProgram(). Note that it’s currently not possible for pocl to avoid interacting with LLVM via on-disk files, so pocl requires some disk space at least temporarily (at runtime).

  • POCL_LEAVE_KERNEL_COMPILER_TEMP_FILES

If this is set to 1, the kernel compiler cache/temporary directory that contains all the intermediate compiler files is left as it is. This is handy for debugging.

  • POCL_LEVEL0_JIT

Sets up Just-In-Time compilation in the Level0 driver (see Level Zero driver for details). Accepted values: {0,1,auto}

  • 0 = always disable JIT

  • 1 = always use JIT

  • auto (default) = guess based on the program’s kernel count and SPIR-V size.

  • POCL_LEVEL0_LINK_OPT

If set to a non-empty string, runs llvm-opt with this option after the linking step, before converting to SPIR-V and handing over to the Level Zero driver. Default: empty.

  • POCL_LLVM_VERIFY

    If enabled, some drivers (CUDA, CPU, Level0) use an extra step of verification of LLVM modules at certain stages (program.bc always, kernel bitcode (parallel.bc) only with some drivers). Defaults to 0 if CMAKE_BUILD_TYPE=Debug and 1 otherwise.

  • POCL_MAX_WORK_GROUP_SIZE

Forces the maximum WG size returned by the device or kernel work group queries to be at most this number. For certain devices, this can only be lower than their hardware limits.

  • POCL_MAX_COMPUTE_UNITS

Limits the maximum number of Compute Units for drivers which support limiting the CU count. The default is for each driver to determine the CU count based on hardware properties. If both this and driver specific env var are specified, the driver specific variable takes precedence.

  • POCL_MEMORY_LIMIT

Integer option, unit: gigabytes. Limits the total global memory size reported by pocl for the CPU devices (this will also affect local/constant/max-alloc-size numbers, since these are derived from global mem size).

  • POCL_OFFLINE_COMPILE

Bool. When enabled (set to 1), some drivers will create virtual devices which are only good for creating pocl binaries. Requires those drivers to be compiled with support for compilation for those devices.

  • POCL_PATH_XXX

String. These variables can be used to override the path to executables that pocl uses during compilation, linking, etc. By default, they are set to the paths configured during the build.

The following variables are available:

  • POCL_PATH_CLANG – Path to the clang executable.

  • POCL_PATH_LLVM_LINK – Path to the llvm-link executable.

  • POCL_PATH_LLVM_OPT – Path to the llvm-opt executable.

  • POCL_PATH_LLVM_LLC – Path to the llc executable.

  • POCL_PATH_LLVM_SPIRV – Path to the llvm-spirv executable.

  • POCL_PATH_SPIRV_LINK – Path to the spirv-link executable.

  • POCL_ARGS_XXX

String. These variables can be used to pass additional arguments to executables that pocl invokes during compilation, linking, etc. Multiple arguments can be passed by separating them with a semicolon.

The following variables are available:

  • POCL_ARGS_CLANG – Additional arguments to pass to clang.
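A minimal sketch of how these two variable families might be combined in a shell session. The helper name and the particular clang arguments are illustrative assumptions. Only the semicolon separator comes from the description above.

```shell
# Join arguments with ';', the separator described for POCL_ARGS_XXX.
join_semicolon() {
    ( IFS=';'; printf '%s\n' "$*" )
}

# Point pocl at the first clang on PATH, if there is one. Otherwise the
# path configured at build time stays in effect.
if clang_path="$(command -v clang)"; then
    export POCL_PATH_CLANG="$clang_path"
fi

# Hypothetical extra clang arguments, for illustration only.
export POCL_ARGS_CLANG="$(join_semicolon -v -Wno-everything)"
```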

  • POCL_PLATFORM_NAME_OVERRIDE

Overrides the platform name reported by PoCL. For example, setting the platform name to “PoCL (Intel OpenCL compat)” allows running oneDNN applications, which otherwise fail to create a device if ‘Intel’ and ‘OpenCL’ are not in the platform string.

  • POCL_PREGION_VALUE_REMAT

Controls the CPU kernel compiler’s value rematerialization, an optimization where a value is recomputed in the parallel region that uses it instead of being stored to the work-item context. Enabled by default.

  • POCL_REMOTE_XXX

These variables are used to configure different aspects of the remote driver and daemon. See the remote driver documentation for details.

  • POCL_REMOTE_SEARCH_DOMAINS – To specify DNS domains for unicast-DNS-SD based discovery queries.

  • POCL_REMOTE_DHT_PORT – To specify a port for the DHT node to operate on.

  • POCL_REMOTE_DHT_BOOTSTRAP – To specify a bootstrap node to connect to an existing DHT network.

  • POCL_REMOTE_DHT_KEY – To specify the common key for server and client nodes to use when publishing or listening.

  • POCL_SIGUSR2_HANDLER

When set to 1 (default 0), pocl installs a SIGUSR2 handler that will print some debugging information. Currently it prints the count of live cl_* objects by type (buffers, events, etc).

  • POCL_STARTUP_DELAY

    Default 0. If set to an integer N > 0, libpocl pauses once for N seconds while it is loading. Useful e.g. for setting up an LTTng tracing session.

  • POCL_TBB_DEV_PER_NUMA_NODE can be set to either 0 or 1 (default). If set to 1, the PoCL TBB driver creates a separate OpenCL device for each NUMA node.

  • POCL_TBB_GRAIN_SIZE can be set to specify a grain size for all dimensions. More information can be found in the TBB documentation.

  • POCL_TBB_PARTITIONER can be set to one of affinity, auto, simple, or static to select a partitioner. If no partitioner is selected, the TBB library will select the auto partitioner by default. More information can be found in the TBB documentation.

  • POCL_TRACING, POCL_TRACING_OPT and POCL_TRACING_FILTER

If POCL_TRACING is set to a tracer name, all events are traced automatically. Depending on the backend, traces may be output in different formats and collected in different ways. POCL_TRACING_FILTER is a comma-separated list of strings indicating which event statuses should be traced. For instance, to trace complete and running events, POCL_TRACING_FILTER should be set to “complete,running”. The default behavior is to trace all events.

  • cq – Dumps simple per-kernel execution time statistics at program exit, collected from command queue start and finish time stamps. Useful for quick and easy profiling with accurate kernel execution time stamps produced per device. Currently only tracks kernel timings, and POCL_TRACING_FILTER has no effect.

  • text – Basic text logger for each event’s state. Use POCL_TRACING_OPT=<file> to set the output file. If not specified, it defaults to pocl_trace_event.log.

  • lttng – LTTNG tracepoint support. Requires pocl to be built with -DENABLE_LTTNG=YES.

    When activated, a lttng session must be started. The following tracepoints are available:

    • pocl_trace:ndrange_kernel -> Kernel execution

    • pocl_trace:read_buffer -> Read buffer

    • pocl_trace:write_buffer -> Write buffer

    • pocl_trace:copy_buffer -> Copy buffer

    • pocl_trace:map -> Map image/buffer

    • pocl_trace:command -> other commands

    For more information, please see lttng documentation: http://lttng.org/docs/#doc-tracing-your-own-user-application
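Putting the three tracing variables together, a session using the text tracer might look like the sketch below. The output path is a hypothetical choice, and the program to trace is left as a comment because it depends on your application.

```shell
# Select the text tracer and choose where the log is written.
export POCL_TRACING=text
export POCL_TRACING_OPT="${TMPDIR:-/tmp}/pocl_trace_event.log"   # hypothetical path

# Record only events in the complete and running states.
export POCL_TRACING_FILTER="complete,running"

# Then run the OpenCL program to trace, e.g.:
#   ./example
```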

  • POCL_VECTORIZER_REMARKS

When set to 1, prints out remarks produced by the loop vectorizer of LLVM during kernel compilation.

  • POCL_VECTORIZER_FORCE_VECTOR_WIDTH

Forces the LLVM loop vectorizer to use the specified vector width (expressed as a number of loop iterations), overriding the default value determined by the vectorizer’s cost model. The same vector width will be used by all loops in all kernels. Setting the vector width to 1 disables vectorization. If the requested vector width is higher than the machine’s native vector width, the vectorizer will also unroll the loop.

  • POCL_VECTORIZER_PREFER_VECTOR_WIDTH

Override the preferred vector width (expressed as a number of bits) for x86 targets. When set, the LLVM loop vectorizer will generate code using vector instructions with the specified number of bits. When not set, the LLVM loop vectorizer may limit itself to using 256-bit vector instructions on some targets to avoid frequency penalties.

Note

POCL_VECTORIZER_FORCE_VECTOR_WIDTH and POCL_VECTORIZER_PREFER_VECTOR_WIDTH can be used together. For example, setting POCL_VECTORIZER_FORCE_VECTOR_WIDTH=16 POCL_VECTORIZER_PREFER_VECTOR_WIDTH=512 will force the LLVM loop vectorizer to use a vector width of 16 and generate 512-bit vector instructions.

  • POCL_VULKAN_VALIDATE

When set to 1, and the Vulkan implementation has the validation layers, enables the validation layers in the driver. You will also need POCL_DEBUG=vulkan or POCL_DEBUG=all to see the output printed.

  • POCL_WORK_GROUP_METHOD

The kernel compiler method to produce the work group functions from multiple work items. Legal values:

  • auto – Choose the best available method depending on the kernel and the work group size. Currently always defaults to loopvec.

  • cbs – Use continuation-based synchronization to execute work-items on non-SPMD devices. CBS is expected to work for kernels that ‘loops’ does not support. For most other kernels it is expected to perform slightly worse. Also enables the LLVM LoopVectorizer.

    An in-depth explanation of the implementation of CBS and how it compares to the other approaches can be found in this thesis: https://joameyer.de/hipsycl/Thesis_JoachimMeyer.pdf

  • loops – Create parallel for-loops that execute the work items. The loops will be unrolled a certain number of times, the maximum of which can be controlled with the POCL_WILOOPS_MAX_UNROLL_COUNT=N environment variable (default is to not perform unrolling).

  • loopvec – Create parallel work-item for-loops (see ‘loops’) and execute the standard LLVM vectorizers. LLVM loop unrolling is disabled and the unrolling decisions are left to the generic loop vectorizer.
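To compare the methods on a given workload, one option is to run the same program once per method. The helper below is an illustrative sketch with a made-up function name. It only sets the variable for each run and does not measure anything itself.

```shell
# Run a command once for each explicit work-group method.
# usage: for_each_wg_method <command> [args...]
for_each_wg_method() {
    for method in loops loopvec cbs; do
        echo "== POCL_WORK_GROUP_METHOD=$method =="
        POCL_WORK_GROUP_METHOD="$method" "$@"
    done
}

# Example (./example is a hypothetical OpenCL program):
#   for_each_wg_method ./example
```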

  • POCL_WORK_GROUP_SPECIALIZATION

    By default, PoCL specializes work-groups at kernel command launch time to optimize execution performance, at the cost of caching variations of the kernels for the different specialization values.

    The kernel command parameters PoCL currently specializes with include the local size, global offset zero or non-zero and maximum grid size. The specialization can be disabled by setting this environment variable to 0.

Setting up and running PoCL on MacOS

Note about the kernel compiler

Clang/LLVM is included with Xcode, but at least the default installation lacks development headers/libraries and llvm-config. As a result, this version cannot be used as a kernel compiler for PoCL.

The simplest way to install llvm is through Homebrew:

brew install llvm
export PATH=/opt/homebrew/opt/llvm/bin:$PATH

Then, ensure that LLVM is correctly set up for PoCL:

which clang
/opt/homebrew/opt/llvm/bin/clang
llvm-config --version
19.1.7

Alternatively, you can compile LLVM from source (Example is for an ARM Mac):

git clone https://github.com/llvm/llvm-project
cd llvm-project
mkdir build && cd build
cmake -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DLLVM_ENABLE_PROJECTS="clang;llvm" \
  -DCMAKE_INSTALL_PREFIX=<path-to-installation-directory> -DLLVM_TARGETS_TO_BUILD="AArch64" ../llvm

ninja install

Installing PoCL on MacOS using pre-built binaries

Homebrew

PoCL with the CPU driver, which supports Intel and Apple Silicon chips, is available on Homebrew and can be installed with:

brew install pocl

Note that this installs an ICD loader from KhronosGroup, and the built-in OpenCL implementation will be invisible when your application is linked to this loader.

Conda

PoCL with the CPU driver, which supports Intel and Apple Silicon chips, is available in the conda-forge distribution. If you do not have a conda-forge installation yet, install Mambaforge first:

curl -L -O "https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-$(uname)-$(uname -m).sh"
bash Mambaforge-$(uname)-$(uname -m).sh

To install the CPU driver:

mamba install pocl

Note that this installs an ICD loader from KhronosGroup, and the built-in OpenCL implementation will be invisible when your application is linked to this loader. To make both pocl and the built-in OpenCL implementation visible, do:

mamba install pocl ocl_icd_wrapper_apple

Building PoCL from source on MacOS

Ensure that all required dependencies are installed. Clang/LLVM must be properly set up (see above).

Get the sources:

git clone git@github.com:pocl/pocl.git
cd pocl
mkdir build && cd build

For a standard build without the ICD loader

cmake .. -G Ninja -DENABLE_ICD=OFF -DCMAKE_INSTALL_PREFIX=<path-to-installation-directory>
ninja install

This will install libOpenCL.dylib to:

<path-to-installation-directory>/lib

Usage:

To override the MacOS OpenCL framework:

export LIBRARY_PATH=<path-to-installation-directory>/lib:$LIBRARY_PATH
export DYLD_LIBRARY_PATH=<path-to-installation-directory>/lib:$DYLD_LIBRARY_PATH
clang <program-source>.c -lOpenCL

# Use PoCL's debugging functionality to ensure it runs through PoCL.
POCL_DEBUG=all ./a.out

For a build with the ICD loader

# If not installed:
brew install ocl-icd
brew install opencl-headers

# These should enable PoCL to automatically detect the ICD loader.
export PKG_CONFIG_PATH="/opt/homebrew/opt/opencl-headers/share/pkgconfig":$PKG_CONFIG_PATH
export PKG_CONFIG_PATH="/opt/homebrew/opt/ocl-icd/lib/pkgconfig":$PKG_CONFIG_PATH
export CPATH=/opt/homebrew/opt/ocl-icd/include:$CPATH

cmake .. -G Ninja -DCMAKE_INSTALL_PREFIX=<path-to-installation-directory>
ninja install

This will install libpocl.dylib to:

<path-to-installation-directory>/lib

Make it visible to the ICD loader by setting:

export OCL_ICD_VENDORS=<path-to-installation-directory>/etc/OpenCL/vendors

Usage:

To override the MacOS OpenCL framework:

export LIBRARY_PATH=/opt/homebrew/opt/ocl-icd/lib:$LIBRARY_PATH
clang <program-source>.c -lOpenCL

# Use PoCL's debugging functionality to ensure it runs through PoCL.
POCL_DEBUG=all ./a.out

Using PoCL as the OpenCL backend for DPC++

SYCL is a programming model that enables single-source C++ development for heterogeneous computing. Compared to OpenCL, SYCL operates at a higher level of abstraction, and implementations can use varying backends for device offloading (e.g., OpenCL, Level Zero, and CUDA). It is worth noting that a SYCL implementation is not required to support OpenCL as a backend.

DPC++ is Intel’s implementation of SYCL that supports OpenCL. When the OpenCL backend is utilized, the DPC++ runtime translates SYCL API calls into corresponding OpenCL API calls and forwards them to the OpenCL runtime.

The toolchain flow, when PoCL is used as the OpenCL backend for DPC++, is as follows:

  • The DPC++ Clang++ frontend compiles the SYCL kernel into LLVM IR.

  • llvm-spirv is used to translate LLVM IR to SPIR-V.

  • SPIR-V is ingested by PoCL, where it is translated back into LLVM IR.

  • PoCL applies additional transformations to the LLVM IR.

  • If using a CPU driver, PoCL leverages llc (LLVM backend) to lower the kernel to machine code.

It should be pointed out that there are two versions of DPC++:

  • the Intel(R) oneAPI DPC++/C++ Compiler

  • the oneAPI DPC++/C++ Compiler.

The former is proprietary and thus distributed in binary form, whereas the latter is open-source.

This page covers the following steps:

  • How to obtain, install, and set up DPC++ (the proprietary or the open-source version)

  • How to build PoCL to support DPC++.

  • Verification with an example program.

Intel(R) oneAPI DPC++/C++ Compiler installation

DPC++ is available in various bundles. Installing the oneAPI Base Toolkit is the simplest way to install DPC++ and its dependencies.

Choose a suitable installer from:

https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit-download.html

Run the installer. The page above provides corresponding instructions for the selected installer. Pay attention to the default installation path and choose a suitable one if necessary.

The oneAPI Base Toolkit includes various components, some of which are not needed to run SYCL applications with PoCL.

For a minimal setup, pick:

  • Intel oneAPI DPC++ Library

  • Intel oneAPI DPC++/C++ compiler

  • Intel Distribution for GDB (Required by the compiler)

  • Intel oneAPI Threading Building Blocks (Required by the compiler)

  • Intel oneAPI Math Kernel Library (Useful, but not required here)

After installation, run the initialization script to set the environment variables:

source <path-to-oneapi-installation>/setvars.sh

Important

setvars.sh must be run in every new shell session unless added to .bashrc (or an equivalent).

Now DPC++ should be set up. This can be verified by checking the available SYCL backends (in this example, Intel OpenCL was detected):

sycl-ls
[opencl:cpu][opencl:0] Intel(R) OpenCL, AMD Ryzen Threadripper 2990WX 32-Core Processor OpenCL 3.0 (Build 0) [2024.18.12.0.05_160000]
[opencl:gpu][opencl:1] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A770 Graphics OpenCL 3.0 NEO  [23.43.027642]

The initialization script also adds the compiler to the PATH:

icpx --version
Intel(R) oneAPI DPC++/C++ Compiler 2025.0.4 (2025.0.4.20241205)

oneAPI DPC++/C++ Compiler installation

The sources for the open-source DPC++ compiler can be obtained from the DPC++ repository.

Official detailed build instructions are available in the documentation of that repository. The build process is managed using two Python scripts, configure.py and compile.py, which handle most of the heavy lifting. configure.py is essentially a wrapper for CMake, so checking its contents can provide further details.

For a basic setup, run:

git clone git@github.com:intel/llvm
cd llvm
python3 ./buildbot/configure.py -o <path-to-dpcpp-installation>
python3 ./buildbot/compile.py -o <path-to-dpcpp-installation> -j <number-of-threads>

After building, export the compiler and SYCL runtime library paths:

export PATH=<path-to-dpcpp-installation>/bin:$PATH
export LD_LIBRARY_PATH=<path-to-dpcpp-installation>/lib:$LD_LIBRARY_PATH

Note: The open-source DPC++ compiler driver is clang++, not icpx.

which clang++
<path-to-dpcpp-installation>/bin/clang++

Building PoCL for DPC++

PoCL doesn’t normally require llvm-spirv, but in this case, it is a strict dependency because PoCL needs to convert the SPIR-V produced by DPC++ back to LLVM IR.

You must check out and build a version of llvm-spirv that corresponds to the LLVM version PoCL uses as its kernel compiler. For example, if the PoCL kernel compiler uses LLVM 18, then llvm-spirv should be checked out from the llvm_release_180 branch.
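The branch name can be derived from the LLVM version in a script. The sketch below generalizes from the single example above and assumes that the branches follow the llvm_release_<major>0 pattern. Verify that the resulting branch exists before relying on it. The function name is hypothetical.

```shell
# Map an LLVM version string to the matching llvm-spirv branch name.
# Assumption: branches are named llvm_release_<major>0, as in the
# LLVM 18 -> llvm_release_180 example.
spirv_branch_for() {
    major="${1%%.*}"
    printf 'llvm_release_%s0\n' "$major"
}

spirv_branch_for 18.1.8   # prints llvm_release_180

# With the kernel compiler's llvm-config on PATH, this prints the branch
# for that LLVM installation.
if command -v llvm-config >/dev/null 2>&1; then
    spirv_branch_for "$(llvm-config --version)"
fi
```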

Note

DPC++ ships with its own llvm-spirv, which is typically based on the latest release. However, this version is intended for internal usage by DPC++ and cannot be used by PoCL.

Warning

Although the versions of llvm-spirv used by DPC++ and PoCL do not have to be an exact match, it is recommended to use versions that are reasonably close to each other.

Example PoCL build:

git clone git@github.com:pocl/pocl.git
cd pocl
mkdir build && cd build
cmake .. -G Ninja -DCMAKE_INSTALL_PREFIX=<path-to-installation-directory> -DLLVM_SPIRV=<path-to-llvm-spirv> -DWITH_LLVM_CONFIG=<path-to-llvm-config>
ninja install

To make PoCL visible to the ICD loader, either register the PoCL ICD (https://github.com/KhronosGroup/OpenCL-ICD-Loader#registering-icds) or set the OCL_ICD_FILENAMES or OCL_ICD_VENDORS environment variables. OCL_ICD_VENDORS only works on Linux/Android, whereas OCL_ICD_FILENAMES works on all platforms (see https://github.com/KhronosGroup/OpenCL-ICD-Loader#table-of-debug-environment-variables for more information).

On Linux:

export OCL_ICD_VENDORS=<path-to-pocl-installation>/etc/OpenCL/vendors

Compiling with DPC++ using PoCL as the backend

If using proprietary DPC++, there is one additional step. By default PoCL is blocked by the DPC++ runtime. To enable PoCL, we need to set the SYCL_DEVICE_ALLOWLIST environment variable. This variable is a comma-separated list of parameters that the DPC++ runtime uses to select allowed devices. It can be used quite flexibly. For example, to select only CPU devices:

export SYCL_DEVICE_ALLOWLIST="DeviceType:cpu"

To allow all available devices, use:

export SYCL_DEVICE_ALLOWLIST=""

To select only PoCL, you can use the PoCL vendor ID:

export SYCL_DEVICE_ALLOWLIST="DeviceVendorId:0x10006"

For more information about how to use the DPC++ environment variables, see:

https://github.com/intel/llvm/blob/sycl/sycl/doc/EnvironmentVariables.md

Now that everything is set up, verify that PoCL is detected:

sycl-ls
[opencl:cpu][opencl:0] Portable Computing Language, cpu-znver1-AMD Ryzen Threadripper 2990WX 32-Core Processor OpenCL 3.0 PoCL HSTR: cpu-x86_64-pc-linux-gnu-znver1
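In a setup script, the same check can be automated by searching the sycl-ls output for the PoCL platform name shown above. This is a sketch only: the helper name is made up, and the match string is taken from the sample output, so it may need adjusting if the platform name is overridden.

```shell
# Succeeds if the device listing read from stdin contains a PoCL platform.
lists_pocl() {
    grep -q 'Portable Computing Language'
}

# Only run the check when sycl-ls is actually available.
if command -v sycl-ls >/dev/null 2>&1; then
    if sycl-ls | lists_pocl; then
        echo "PoCL is visible to DPC++"
    else
        echo "PoCL not found; check OCL_ICD_VENDORS and SYCL_DEVICE_ALLOWLIST" >&2
    fi
fi
```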

Below is a simple SYCL program to test the setup. It selects the device automatically, so first restrict the selection to OpenCL CPU devices, which drops any GPUs from the list:

export ONEAPI_DEVICE_SELECTOR=opencl:cpu

Then save the following program as hello_nd_range.cpp:

// hello_nd_range.cpp
#include <sycl/sycl.hpp>
#include <iostream>

#define SUB_GROUP_SIZE 2

using namespace sycl;

int main() {
    constexpr int global_size = 8;
    constexpr int local_size = 4;

    queue q;
    {
        q.submit([&](handler &h) {

            std::cout << "One dimensional nd_range with global_size: " << global_size << ", local_size: " << local_size << ", sg_size: " << SUB_GROUP_SIZE << "\n";

            range<1> global(global_size);
            range<1> local(local_size);
            nd_range<1> range(global, local);

            h.parallel_for(range, [=](nd_item<1> idx) [[sycl::reqd_sub_group_size(SUB_GROUP_SIZE)]] {

                int workgroup_id_x = idx.get_group(0);
                int global_id_x = idx.get_global_id(0);
                int local_id_x = idx.get_local_id(0);
                int sg_local_id = idx.get_sub_group().get_local_id();
                int sg_id = idx.get_sub_group().get_group_id();
                sycl::ext::oneapi::experimental::printf("hello from: (global_id %d) (local_id: %d) (wg_id: %d) (sg_id: %d) (sg_local id: %d)\n",global_id_x, local_id_x,workgroup_id_x, sg_id, sg_local_id);
            });
        }).wait();
    }
    return 0;
}

Compile and run (use icpx for proprietary version, and clang++ for open-source version):

clang++ hello_nd_range.cpp -fsycl -o hello
./hello

One dimensional nd_range with global_size: 8, local_size: 4, sg_size: 2
hello from: (global_id 0) (local_id: 0) (wg_id: 0) (sg_id: 0) (sg_local id: 0)
hello from: (global_id 1) (local_id: 1) (wg_id: 0) (sg_id: 0) (sg_local id: 1)
hello from: (global_id 2) (local_id: 2) (wg_id: 0) (sg_id: 1) (sg_local id: 0)
hello from: (global_id 3) (local_id: 3) (wg_id: 0) (sg_id: 1) (sg_local id: 1)
hello from: (global_id 4) (local_id: 0) (wg_id: 1) (sg_id: 0) (sg_local id: 0)
hello from: (global_id 5) (local_id: 1) (wg_id: 1) (sg_id: 0) (sg_local id: 1)
hello from: (global_id 6) (local_id: 2) (wg_id: 1) (sg_id: 1) (sg_local id: 0)
hello from: (global_id 7) (local_id: 3) (wg_id: 1) (sg_id: 1) (sg_local id: 1)