September 4th, 2023: No-MPI OpenCL-Only Distributed Computing With PoCL-Remote

PoCL now has a new backend that transparently offloads OpenCL tasks to other nodes on the network, which makes it possible to distribute computation without MPI or similar APIs. Since the standard OpenCL API suffices, offloading is performed identically whether the devices are local or remote, which makes the backend useful for selective/adaptive edge offloading and other use cases.

The driver is now considered ready for out-of-lab testing and has been integrated into the main branch for the upcoming v5.0 release. The work was primarily carried out by Michal Babej, Jan Solanti and Pekka Jääskeläinen in the Customized Parallel Computing group at Tampere University within multiple European and national research projects.

PoCL-Remote (PoCL-R for short) follows a client-server architecture and consists of two parts: the remote driver in the PoCL library and the pocld daemon.

The daemon can be run on any machine that is reachable via TCP/IP and has an OpenCL implementation available. Any OpenCL implementation/driver/device works on the server side, not only PoCL-based drivers. The client library connects to the configured daemon(s) and reports the OpenCL devices available on them through clGetDeviceIDs just as if they were local to the client.

In contrast to existing networked offloading solutions for OpenCL, PoCL-Remote makes use of PoCL's memory management infrastructure to keep track of memory objects and only copy them around when actually necessary. When a memory object migration is needed, the most efficient route for the transfer is automatically chosen:

If both the source and destination device are part of the same native OpenCL context on the machine running the daemon, the migration is delegated to the underlying native driver.
If the devices are part of separate OpenCL platforms, the memory contents are manually copied through host memory within the daemon.
For copies between two daemons, direct peer-to-peer connections are utilised in order to minimize the amount of traffic to and from the client.
Should all else fail, the memory contents are downloaded into the client's memory and uploaded to the destination from there.
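
The route selection above can be read as a simple decision cascade. The following is a minimal sketch of that cascade in plain Python, not the driver's actual code: the `RemoteDevice` fields, the route names and the `p2p_links` set are all hypothetical names made up for illustration. The text does not say what happens for two contexts on the same platform, so the sketch assumes they are staged through host memory as well.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RemoteDevice:
    daemon: str    # hypothetical name of the pocld instance hosting the device
    platform: str  # native OpenCL platform the device belongs to on that host
    context: int   # hypothetical id of the native OpenCL context on that host


def choose_migration_route(src, dst, p2p_links):
    """Pick a route for moving a buffer from src to dst, cheapest first.

    p2p_links is a set of frozenset({daemon_a, daemon_b}) pairs that can
    reach each other directly (an assumption of this sketch).
    """
    if src.daemon == dst.daemon:
        if (src.platform, src.context) == (dst.platform, dst.context):
            return "native-driver"  # same native context: delegate to driver
        # Separate platforms: staged through host memory inside the daemon.
        # Assumption: same platform but different context is handled alike.
        return "daemon-host-copy"
    if frozenset((src.daemon, dst.daemon)) in p2p_links:
        return "peer-to-peer"       # the two daemons exchange data directly
    return "via-client"             # last resort: download, then upload


gpu0 = RemoteDevice("node-a", "vendor-x", 0)
gpu1 = RemoteDevice("node-a", "vendor-x", 0)
cpu = RemoteDevice("node-a", "vendor-y", 0)
far = RemoteDevice("node-b", "vendor-x", 0)
links = {frozenset(("node-a", "node-b"))}

for dst in (gpu1, cpu, far):
    print(dst.daemon, dst.platform, "->", choose_migration_route(gpu0, dst, links))
```

Passing an empty `p2p_links` set for the last device falls through to the client-mediated route, which corresponds to the "should all else fail" case.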

Similarly, synchronisation between commands is done with OpenCL events within a daemon, and with OpenCL user events that are signaled in peer-to-peer fashion for dependencies between daemons. In addition to automatically choosing the most efficient route for communication, PoCL-Remote can use size buffers, as specified in the cl_pocl_content_size extension, to transfer only the meaningful portion of buffers with variable-sized content such as compressed image or video data.
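
To illustrate the idea behind content-size-aware transfers, here is a toy sketch in plain Python. It does not use the cl_pocl_content_size API itself: the function `bytes_to_send` is a hypothetical helper, and the content size is passed as a plain integer rather than through a separate size buffer.

```python
from typing import Optional


def bytes_to_send(buffer: bytes, content_size: Optional[int] = None) -> bytes:
    """Return the part of a buffer that has to cross the network.

    Without a known content size the whole allocation is sent; with one,
    only the meaningful prefix is.
    """
    if content_size is None:
        return buffer
    if not 0 <= content_size <= len(buffer):
        raise ValueError("content size must lie within the buffer")
    return buffer[:content_size]


frame_buffer = bytes(1024 * 1024)  # 1 MiB allocated for a compressed frame
compressed_len = 20_000            # bytes the encoder actually produced
sent = bytes_to_send(frame_buffer, compressed_len)
print(f"sent {len(sent)} of {len(frame_buffer)} bytes")
# prints "sent 20000 of 1048576 bytes"
```

The saving grows with the gap between the worst-case allocation and the typical payload, which is why the text singles out compressed image and video data.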

Early versions of PoCL-R, or experiments using it, have been presented at IWOCL '20, SAMOS 2021 and IWOCL '23. There is also a full-length journal article under review which describes the published version (including, for example, its RDMA support). A preprint of it is available on arXiv.

More information, including instructions for building and using the PoCL-Remote backend, can be found in the user manual.

Status and Maturity

The backend passes most of PoCL's basic test suite and has been successfully used to run complex applications such as FluidX3D and various computer vision and machine learning demos. The actual usable set of features is naturally also dependent on the native driver controlled by the daemon.

PoCL-R has mainly been tested within the research group, but it has also been integrated into proper demonstrators, so it can be considered to be at TRL 4 to TRL 5 on the EU technology readiness scale.

In terms of performance, moving data across the network of course carries a major penalty. It is mostly noticeable in explicit buffer transfers (Write-/Copy-/ReadBuffer) and when migrating a buffer from one device to another. These can easily become a bottleneck with other drivers as well, so designing applications with that in mind is advisable in general.

It is worth noting that PoCL-R will leave buffers resident on devices after use, so unchanged buffers do not need to be transferred again on next use. This means that static buffers such as neural network coefficients only need to be uploaded once during launch and afterwards inference can be performed repeatedly without this initial buffer transfer cost.
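
The effect of leaving buffers resident can be shown with a small bookkeeping model. This is a toy sketch in plain Python, not PoCL's internal memory management: the `ResidencyTracker` class, its method names and the buffer sizes are hypothetical and chosen only to make the arithmetic visible.

```python
class ResidencyTracker:
    """Toy model: remember which devices hold a valid copy of each buffer."""

    def __init__(self):
        self._resident = {}      # buffer name -> set of devices with valid data
        self.bytes_uploaded = 0  # total bytes sent over the (imaginary) network

    def ensure_on(self, name, size, device):
        """Upload the buffer only if the device lacks a valid copy."""
        holders = self._resident.setdefault(name, set())
        if device in holders:
            return False         # already resident: nothing to transfer
        self.bytes_uploaded += size
        holders.add(device)
        return True

    def invalidate(self, name):
        """The client rewrote the buffer, so all device copies are stale."""
        self._resident[name] = set()


tracker = ResidencyTracker()
WEIGHTS, FRAME = 100_000_000, 1_000_000  # assumed sizes in bytes

for _ in range(10):                      # ten inference rounds
    tracker.invalidate("input")          # a new input frame every round
    tracker.ensure_on("input", FRAME, "remote-gpu")
    tracker.ensure_on("weights", WEIGHTS, "remote-gpu")

print(tracker.bytes_uploaded)
# prints 110000000: the weights once, plus ten input frames
```

Without residency tracking the same loop would move the weights in every round, ten times the 100 MB instead of once.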

In multi-server setups, the cost of server-to-server transfers can be mitigated somewhat by building PoCL with RDMA support enabled, provided the networking hardware supports RDMA.

Known Limitations (as of 2023-10-03)

There is no traffic encryption or user authentication on the daemon side, making PoCL-R currently not suitable for use outside of closed private networks/clusters.
PoCL must be built with LOADABLE_DRIVERS=OFF, else initialisation of the remote backend fails.
While printf does somewhat work, it will likely behave differently from what applications expect.

Contributing

We welcome contributions in the form of good-quality bug reports and pull requests. However, because only a limited number of developers work on the project, we cannot commit to rapid support for issues that do not affect us. If you're interested in improving PoCL-R but aren't sure what to work on, please ask on the mailing list or the discussion forum for more information.
