Fixed-Function Accelerators

The almaif driver allows easy integration of custom fixed-function accelerators through a standardized hardware interface and a standardized procedure for enqueuing commands. More information about the interface can be found in the publications listed at the end of this page.

Interface

The control register interface for the fixed-function accelerators is quite simple. The address space of the device is split into four regions. The sizes and starting addresses of the regions are advertised in the control region by the accelerator.

High bits  Address space
---------  ------------------
00         Control registers
01         Instruction memory
10         Data memory
11         Parameter memory
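As a rough illustration of the table above, the region can be read from the two most significant bits of a device-local address. The Python sketch below assumes a hypothetical 16-bit address width; the real width depends on the accelerator, and the name region_of is made up for this example.

```python
# Region names indexed by the two high address bits, as listed in the table.
REGIONS = (
    "control registers",
    "instruction memory",
    "data memory",
    "parameter memory",
)


def region_of(address: int, address_width: int = 16) -> str:
    """Return the region selected by the two most significant address bits.

    address_width is an assumption of this sketch; use the width of the
    actual device address space.
    """
    if not 0 <= address < (1 << address_width):
        raise ValueError("address does not fit in the given width")
    high_bits = address >> (address_width - 2)
    return REGIONS[high_bits]


print(region_of(0x0000))
print(region_of(0x4000))
print(region_of(0xC123))
```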

The control registers are also used to control the execution of the accelerator:

Offset  Name                          Description
------  ----------------------------  -----------
0x000   STATUS                        Status of the accelerator. Bit 0 is high when the execution is stalled due to any reason, bit 1 is high when the external stall signal is active, and bit 2 is high when the accelerator reset is active.
0x200   COMMAND                       Command register to control execution. Writing 1 to this register resets the accelerator, writing 2 lifts reset and external stall, and writing 4 enables the external stall signal, pausing execution.
0x300   DEVICE_CLASS                  Device class (vendor ID) of the accelerator. Currently unused by the driver.
0x304   DEVICE_ID                     Device ID of the accelerator. Currently unused by the driver.
0x308   INTERFACE_TYPE                Version number of the interface. This describes interface version 3.
0x30C   CORE_COUNT                    Core count of the accelerator. Multicore devices are currently not supported.
0x310   CTRL_SIZE                     Size of control memory (this register space) in bytes. Must be at least 1024.
0x314   IMEM_SIZE                     Size of the instruction memory in bytes.
0x318   IMEM_STARTING_ADDRESS (64b)   Starting address of the instruction memory.
0x320   CQMEM_SIZE (64b)              Size of the command queue memory in bytes. The cq region includes a ring buffer of 64B packets and the 64B queue header. Therefore the CQMEM_SIZE is (queue_length + 1) * 64.
0x328   CQMEM_STARTING_ADDRESS (64b)  Starting address of the command queue memory.
0x330   BUFFERMEM_SIZE                Size of the data memory reserved for on-chip buffers.
0x338   BUFFERMEM_STARTING_ADDRESS    Starting address of the buffer memory.
0x340   FEATURE_FLAGS (64b)           Bitmap of various features.

FEATURE_FLAGS bit 0: HAS_MASTER_INTERFACE. If set to 1, the accelerator can access addresses outside its AlmaIF address space through a master interface. In that case the X_STARTING_ADDRESS values above are absolute addresses. Pointers to data buffers, completion signals etc. are also given to the accelerator as absolute addresses, and the accelerator must decode each address to determine whether it points to its own buffer memory region or to external devices or memories. If set to 0, data buffers, completion signals etc. are given as pointers relative to the beginning of the buffer memory region, and the device is assumed to be unable to access external devices or memories.

FEATURE_FLAGS bits 1-63: Reserved, should be set to 0.
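The register semantics above can be sketched in a few lines of Python. The offsets, command values and bit positions come from the register description; the constant names CMD_RESET, CMD_CONTINUE and CMD_BREAK, the helper functions, and the example buffer address are hypothetical and only serve the illustration.

```python
# Register offsets from the register description above.
STATUS = 0x000
COMMAND = 0x200
FEATURE_FLAGS = 0x340

# Values that can be written to COMMAND (names are illustrative).
CMD_RESET = 1     # reset the accelerator
CMD_CONTINUE = 2  # lift reset and external stall
CMD_BREAK = 4     # enable the external stall signal, pausing execution

HAS_MASTER_INTERFACE = 1 << 0  # FEATURE_FLAGS bit 0


def decode_status(value: int) -> dict:
    """Split a STATUS register value into its three documented bits."""
    return {
        "stalled": bool(value & 0b001),
        "external_stall": bool(value & 0b010),
        "in_reset": bool(value & 0b100),
    }


def device_pointer(buffer_offset: int, buffermem_start: int,
                   feature_flags: int) -> int:
    """Address to hand to the accelerator for data inside buffer memory.

    With HAS_MASTER_INTERFACE set, pointers are absolute; otherwise they
    are relative to the beginning of the buffer memory region.
    """
    if feature_flags & HAS_MASTER_INTERFACE:
        return buffermem_start + buffer_offset
    return buffer_offset


print(decode_status(0b011))
# 0x40020000 is a made-up buffer memory start address.
print(hex(device_pointer(0x100, 0x40020000, HAS_MASTER_INTERFACE)))
print(hex(device_pointer(0x100, 0x40020000, 0)))
```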

The other three regions are used as follows:

  • The instruction memory can be used to configure the accelerator. PoCL looks for a <device_name>.img binary file and, if it exists, writes it to this region at initialization time. For compiled kernels, PoCL overwrites this region with the new program.

  • The command queue memory stores an AQL queue, as defined by the HSA Runtime Programmer’s Reference Manual. Its write and read indexes are exposed in a 64B header at the beginning of the region, and the queue occupies all of the region that remains after the header.

  • The buffer memory stores data and argument buffers as well as completion signals for the kernels.
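The relationship between the command queue region size and the queue length, CQMEM_SIZE = (queue_length + 1) * 64, can be sketched in Python as follows. Wrapping the index with a modulo is ordinary ring-buffer practice and is an assumption here, as are the function names.

```python
PACKET_SIZE = 64  # bytes per AQL packet
HEADER_SIZE = 64  # bytes of queue header at the start of the region


def queue_length(cqmem_size: int) -> int:
    """Number of packet slots that fit after the 64B header."""
    if cqmem_size < HEADER_SIZE + PACKET_SIZE or cqmem_size % PACKET_SIZE:
        raise ValueError("region must hold the header plus whole packets")
    return cqmem_size // PACKET_SIZE - 1


def packet_offset(index: int, cqmem_size: int) -> int:
    """Byte offset, from the start of the region, of the slot for index."""
    return HEADER_SIZE + (index % queue_length(cqmem_size)) * PACKET_SIZE


print(queue_length(1024))
print(packet_offset(0, 1024))
print(packet_offset(16, 1024))
```

For a 1024-byte region this gives 15 slots, so index 15 wraps back to the first slot right after the header.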

As a practical example, enqueuing a kernel dispatch packet proceeds as follows:

  • The driver allocates and populates the OpenCL buffers and the argument buffer for the kernel, as well as space for a 32-bit completion signal.

  • The driver writes the kernel packet to the device. Its position in the queue depends on the value of the write index. The completion signal address, the argument buffer address and the pointers to buffer arguments are given as physical addresses in the accelerator’s address space. The kernel object field holds the kernel ID listed in the built-in kernel table in the Usage section below.

  • The driver sets the packet header and increments the queue write index.

  • The device executes the kernel. If the completion signal address is not 0, the device writes a 1 to that address on success or a 2 on failure.

  • The driver sees the completion signal change, and can consider the command completed.
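The steps above can be walked through with a small software model in Python. This is a sketch under several assumptions, not the driver's code: the queue and buffer memory are plain byte arrays, addresses are offsets relative to buffer memory, the packet layout follows the 64-byte HSA kernel dispatch packet, the header is initially written with the HSA "invalid" packet type, and ToyQueue, enqueue and toy_device_step are hypothetical names.

```python
import struct

PACKET_SIZE = 64
AQL_PACKET_INVALID = 1          # HSA packet type: slot not ready
AQL_PACKET_KERNEL_DISPATCH = 2  # HSA packet type: kernel dispatch

# HSA kernel dispatch packet: header, setup, workgroup size x/y/z, reserved,
# grid size x/y/z, private and group segment sizes, kernel_object,
# kernarg_address, reserved, completion_signal. 64 bytes in total.
DISPATCH = struct.Struct("<6H5I4Q")


class ToyQueue:
    """Software stand-in for the command queue and the buffer memory."""

    def __init__(self, length: int):
        self.length = length
        self.slots = bytearray(length * PACKET_SIZE)
        self.write_index = 0
        self.read_index = 0
        self.buffermem = bytearray(256)


def enqueue(queue, kernel_id, kernarg_addr, signal_addr, work_items):
    slot = (queue.write_index % queue.length) * PACKET_SIZE
    # Write the packet body first, with a header the device ignores.
    DISPATCH.pack_into(queue.slots, slot,
                       AQL_PACKET_INVALID, 1, 1, 1, 1, 0,
                       work_items, 1, 1, 0, 0,
                       kernel_id, kernarg_addr, 0, signal_addr)
    # Then set the real header and publish the packet.
    struct.pack_into("<H", queue.slots, slot, AQL_PACKET_KERNEL_DISPATCH)
    queue.write_index += 1


def toy_device_step(queue):
    """Pretend accelerator: consume one packet and report success."""
    if queue.read_index == queue.write_index:
        return
    slot = (queue.read_index % queue.length) * PACKET_SIZE
    fields = DISPATCH.unpack_from(queue.slots, slot)
    signal_addr = fields[14]
    if signal_addr != 0:
        struct.pack_into("<I", queue.buffermem, signal_addr, 1)  # 1 = success
    queue.read_index += 1


queue = ToyQueue(length=4)
SIGNAL = 16  # offset of the 32-bit completion signal in buffer memory
enqueue(queue, kernel_id=1, kernarg_addr=32, signal_addr=SIGNAL, work_items=8)
toy_device_step(queue)
(status,) = struct.unpack_from("<I", queue.buffermem, SIGNAL)
print("completion signal:", status)
```

Writing the header last means the device never sees a half-written packet, which is the reason the packet body and the header are set in separate steps.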

Usage

To enable this driver, add -DENABLE_ALMAIF_DEVICE=1 to the cmake arguments. On small FPGA SoCs and other relatively low-performance hosts, you may wish to follow the LLVM-less build instructions (recommended for the Zynq-7020 SoC).

The fixed-function accelerators need to be told what kernel to execute. For this, the almaif driver has a list of builtin kernels that can be referred to in the clCreateProgramWithBuiltInKernels call:

Kernel name                 Kernel ID  Function
--------------------------  ---------  --------
pocl.copy.i8                0          Copies from argument 0 to argument 1 as many bytes as there are work items
pocl.add.i32                1          32-bit element-wise addition on arrays pointed to by arguments 0 and 1, stored in an array pointed to by argument 2
pocl.mul.i32                2          As pocl.add.i32, but with 32-bit multiplication
Online compiler available.  65535      Special flag to communicate that device supports compiled kernels.

The full list of currently supported built-in kernels is maintained in lib/CL/devices/builtin_kernels.{cc,hh}
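As a small Python illustration of the table above, the built-in kernel names can be mapped to their IDs and joined into the semicolon-separated string that clCreateProgramWithBuiltInKernels expects as its kernel_names argument. The dictionary only covers the three kernels listed here, and the helper name is hypothetical.

```python
# Kernel names and IDs copied from the table above.
BUILTIN_KERNEL_IDS = {
    "pocl.copy.i8": 0,
    "pocl.add.i32": 1,
    "pocl.mul.i32": 2,
}


def kernel_names_argument(names):
    """Build the semicolon-separated kernel_names string for OpenCL."""
    unknown = [name for name in names if name not in BUILTIN_KERNEL_IDS]
    if unknown:
        raise KeyError("unknown built-in kernels: " + ", ".join(unknown))
    return ";".join(names)


wanted = ["pocl.add.i32", "pocl.mul.i32"]
print(kernel_names_argument(wanted))
print([BUILTIN_KERNEL_IDS[name] for name in wanted])
```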

To execute the tests in examples/accel, which generate both TTA and high-level synthesis (HLS) based accelerators for the PYNQ-Z1 device, you need to set a few variables in the CMake configuration. First, set the CMake variable VIVADO_PATH to point to the directory containing the ‘vivado’ executable (e.g. Xilinx/Vivado/2021.2/bin/).

1. If you have the OpenASIP/TCEMC toolset installed, you can set ENABLE_TCE to 1 to enable RTL and firmware generation of various OpenASIP TTA cores with different memory configurations. You can then simulate them with the ttasim instruction-set simulator by running ../tools/scripts/run_almaif_tests from the build directory.

2. If you have Vitis HLS installed, set VITIS_HLS_PATH to point to the directory containing the vitis_hls executable. This enables the generation of a fixed-function accelerator from a C description.

The bitstreams themselves are not built automatically by the PoCL build process, but with a separate ‘make bitstreams’ command. This generates the bitstreams in the build/examples/accel/bitstreams and build/examples/accel/hls/bitstreams directories. Once the bitstreams have been built, build PoCL on the PYNQ-Z1 device (there you don’t need to set ENABLE_TCE or VIVADO/VITIS_HLS_PATH). Copy the bitstreams directories (and, in the case of TTA, also the firmware_imgs directory and example0_*.poclbins) to their corresponding PoCL build directories on the PYNQ. Finally, run ../tools/scripts/run_almaif_tests to run the test program on the FPGA device.

Driver arguments are used to tell pocl where the accelerator is and what functions it supports. To run examples manually, after programming the FPGA, execute:

POCL_DEVICES=almaif POCL_ALMAIF0_PARAMETERS=0x40000000,<device_name>,1,2 ./accel_example

The environment variables define an accelerator with base physical address of 0x4000_0000 that can execute pocl.add.i32 and pocl.mul.i32. If the device requires firmware to be loaded in, pocl will attempt to load it from <device_name>.img. When running the example, verify that the address given in the parameter matches the base address of the accelerator.
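The parameter string follows the pattern base address, device name, then the kernel IDs. A minimal Python sketch for assembling it is shown below; the helper name is hypothetical, and "dummy" is used as the device name as in the HLS example later on this page.

```python
def almaif_parameters(base_address: int, device_name: str, kernel_ids) -> str:
    """Join base address, device name and kernel IDs with commas."""
    fields = [hex(base_address), device_name]
    fields.extend(str(kernel_id) for kernel_id in kernel_ids)
    return ",".join(fields)


env = {
    "POCL_DEVICES": "almaif",
    # IDs 1 and 2 are pocl.add.i32 and pocl.mul.i32.
    "POCL_ALMAIF0_PARAMETERS": almaif_parameters(0x40000000, "dummy", [1, 2]),
}
print(env["POCL_ALMAIF0_PARAMETERS"])
```

A dictionary like this can be merged with os.environ and passed as the env argument of subprocess.run when launching the example program from a script.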

Because the driver requires write access to /dev/mem for memory mapping, you may need to execute the application with elevated privileges. Note that sudo resets your environment variables by default. You can either assign the variables in the same command or use sudo with the --preserve-env switch.

The driver supports instruction-set simulation for TTA devices. To enable it, set the base address to 0xB, and set the <device_name> to point to a TTA device’s .adf-file and compiled firmware binary (.tpef-file). PoCL will then start up the simulation with <device_name>.adf and, if it exists, <device_name>.tpef.

Alternatively, the accelerator can be emulated in software by setting the base physical address to 0xE. This directs the driver to use a software emulation function from almaif/EmulationDevice.cc instead. No changes to the OpenCL host program source (e.g. accel_example.cpp) are needed when switching between emulation, instruction-set simulation and FPGA execution.
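The role of the special base address values can be summarized with a short Python sketch. It only restates the values documented on this page (0xB, 0xE, and 0xF from the bitstream database section below); the function name and the returned descriptions are illustrative, and the real driver's parsing may differ.

```python
# Special values of the first POCL_ALMAIF0_PARAMETERS field.
SPECIAL_BASE_ADDRESSES = {
    0xB: "ttasim instruction-set simulation",
    0xE: "software emulation",
    0xF: "bitstream database",
}


def backend_for(parameters: str) -> str:
    """Describe which backend a parameter string selects."""
    base = int(parameters.split(",")[0], 16)
    if base in SPECIAL_BASE_ADDRESSES:
        return SPECIAL_BASE_ADDRESSES[base]
    return "memory-mapped device at " + hex(base)


print(backend_for("0xE,dummy,1,2"))
print(backend_for("0x40000000,dummy,1,2"))
```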

Standalone mode

The almaif driver’s OpenASIP backend supports a standalone mode meant for executing kernels without the host runtime. The standalone mode generates a C program that contains the input data and pre-initialized command structures needed to run a single kernel with either ttasim or RTL simulation. The mode is enabled with the environment variable POCL_ALMAIF_STANDALONE=1. It writes helper scripts to the working directory and the standalone C program to the kernel cache directory.

Example usage of the mode can be found in examples/accel/CMakeLists.txt, which generates standalone tests that use both ttasim and an RTL simulator (ghdl) to run the example0 kernel on various TTA configurations.

Using a bitstream database

You can use the AlmaIF driver with the cross-vendor bitstream databases generated with the AFOCL project. That project generates a directory-based database with JSON-based metadata. The database contains the bitstreams and firmware files necessary to implement the set of built-in kernels defined in the JSON file.

The bitstream database device will report all the built-in kernel implementations it can find in the database through the CL_DEVICE_BUILT_IN_KERNELS query of clGetDeviceInfo. The bitstream database device (“0xF”) automatically fetches the bitstream from the database and reconfigures the FPGA when the user enqueues a built-in kernel for execution. The user therefore does not need to handle the bitstream binaries, since the OpenCL implementation reconfigures the FPGA behind the scenes.

To use AFOCL-databases in PoCL, it is enough to point the Almaif-driver to the database with the env variable:

POCL_DEVICES=almaif POCL_ALMAIF0_PARAMETERS=0xF,<path/to/afocl-db> ./accel_example

At the moment, the public AlmaIF driver and AFOCL support only the Xilinx Alveo U280 device, but adding support for other Alveo devices should be easy. In the AFOCL publication the methodology was also demonstrated with Intel Arria 10, but the code for that is not yet upstreamed. The driver is built to hide vendor-specific details from the end user, with different AlmaIFDevice backends taking care of them. For more information about the bitstream database, see our AFOCL publication (2023).

Wrapping a new hardware component

This section walks through adding a new implementation for an existing built-in kernel. The component can be any hardware component, as long as it supports the AlmaIF interface specification described above. The following section presents one example method, generating the accelerator with HLS. Other methods of generating the accelerator exist; the only requirement is that the result implements the AlmaIF specification.

High-level synthesis

A template for an HLS accelerator is in the file examples/accel/hls/poclAccel.cpp. It can be built with ‘make hls_vecadd_bs’, which generates the bitstream file in examples/accel/hls/bitstreams/. To enable the target, you need to set the CMake variables VITIS_HLS_PATH and VIVADO_PATH to point to the directories containing the ‘vitis_hls’ and ‘vivado’ binaries, respectively.

The build process of HLS accelerator consists of two parts:

1. Generating accelerator RTL from C++ input (With Vitis HLS using script generate_hls_core.tcl)

2. Generating block design with the accelerator and block memory for AlmaIF regions (With Vivado using script generate_hls_project.tcl)

To run the vector addition on the HLS-generated core, the bitstream needs to be copied to the PYNQ-Z1. The generate_hls_project.tcl file sets the base address of the accelerator to the physical address 0x40000000. This base address is given to PoCL through an environment variable:

export POCL_DEVICES=almaif
export POCL_ALMAIF0_PARAMETERS="0x40000000,dummy,1,2"

The bitstream can be loaded onto the FPGA in various ways. The PYNQ-Z1 image includes a Python library for this, which can be used with the following one-liner:

sudo -E python -c "from pynq import Overlay;Overlay('examples/accel/hls/bitstreams/vecadd_1.bit')"

After that, it’s possible to run the examples/accel/accel_example program.

Using this work

If you are utilizing, further developing or comparing to the AlmaIF driver of PoCL in your academic work, please cite the following publication:

@ARTICLE{leppanen2023,
  TITLE = {Efficient {OpenCL} system integration of non-blocking {FPGA} accelerators},
  JOURNAL = {Microprocessors and Microsystems},
  VOLUME = {97},
  PAGES = {104772},
  YEAR = {2023},
  ISSN = {0141-9331},
  DOI = {10.1016/j.micpro.2023.104772},
  AUTHOR = {Topi Leppänen and Atro Lotvonen and Panagiotis Mousouliotis and Joonas Multanen and Georgios Keramidas and Pekka Jääskeläinen},
}

The other relevant publications:

@INPROCEEDINGS{afocl2023,
  AUTHOR={Leppänen, Topi and Multanen, Joonas and Leppänen, Leevi and Jääskeläinen, Pekka},
  TITLE={{AFOCL}: Portable {OpenCL} Programming of {FPGAs} via Automated Built-in Kernel Management},
  BOOKTITLE={2023 IEEE Nordic Circuits and Systems Conference ({NorCAS})},
  YEAR={2023},
  PAGES={1-7},
  DOI={10.1109/NorCAS58970.2023.10305457}
}

@ARTICLE{leppanen2022,
  AUTHOR={Leppänen, Topi and Lotvonen, Atro and Jääskeläinen, Pekka},
  TITLE={Cross-vendor programming abstraction for diverse heterogeneous platforms},
  JOURNAL={Frontiers in Computer Science},
  VOLUME={4},
  YEAR={2022},
  URL={https://www.frontiersin.org/articles/10.3389/fcomp.2022.945652},
  DOI={10.3389/fcomp.2022.945652},
  ISSN={2624-9898},
}

@INPROCEEDINGS{leppanen2021,
  AUTHOR={Leppänen, Topi and Mousouliotis, Panagiotis and Keramidas, Georgios and Multanen, Joonas and Jääskeläinen, Pekka},
  BOOKTITLE={2021 IEEE Nordic Circuits and Systems Conference (NorCAS)},
  TITLE={Unified OpenCL Integration Methodology for FPGA Designs},
  YEAR={2021},
  PAGES={1-7},
  DOI={10.1109/NorCAS53631.2021.9599861}
}