SYCL is an open industry standard, not a programming language itself, developed by the Khronos Group to facilitate data parallelism in C++ for heterogeneous computing systems. This means you can write single-source C++ code that utilizes the power of various processors like CPUs, GPUs, and FPGAs in a single program. This improves code maintainability and portability across different hardware platforms.
Core characteristics of SYCL:
- Single-source programming: Write code once and target diverse hardware architectures without rewriting for each device.
- Data parallelism: Leverage parallel execution for data-intensive tasks, significantly boosting performance.
- Abstraction layer: SYCL hides the underlying hardware complexities, providing a unified programming model for different devices.
- Memory management: SYCL offers various memory models (e.g., unified shared memory) to efficiently manage data movement between host and devices.
Popular SYCL implementations
- DPC++ (Data Parallel C++): Part of Intel’s oneAPI suite and also available as an open source compiler, DPC++ leverages LLVM/Clang for broad CPU and GPU support across Intel, NVIDIA, and AMD. In this post we will use DPC++ from oneAPI.
- ComputeCPP: Open-source and multi-backend, ComputeCPP excels on Intel and AMD hardware with good CPU and GPU coverage. NVIDIA GPU support is under development.
- hipSYCL: Primarily focused on NVIDIA GPUs with AMD support, hipSYCL offers multiple backends and experimental Intel GPU compatibility.
Additional technical points to consider
- Performance portability: While SYCL promotes code portability, achieving optimal performance on each hardware platform might require specific tuning.
- Learning curve: Compared to lower-level hardware-specific languages, SYCL has a gentler learning curve with support to modern C++ syntax but still requires understanding parallel programming concepts.
- Ecosystem: The SYCL ecosystem is evolving, with growing adoption and toolchain improvements.
- SYCL and IntelGPU in HPC: Aurora, the upcoming exascale supercomputer, will harness the power of SYCL and oneAPI on Intel GPUs to unlock groundbreaking scientific discoveries.
SYCL 2020
SYCL 2020 marked a significant milestone in the evolution of data parallelism for heterogeneous systems. DPC++ embraces the cutting-edge features introduced in the SYCL 2020 specification, empowering developers with:
- Unified Shared Memory (USM): Streamlined memory management with seamless data access between host and devices.
- Sub-groups: Fine-grained control over parallel execution within work groups for enhanced performance and algorithm optimization.
- C++17 Syntax Simplifications (CTAD): Class template argument deduction for cleaner, more concise code.
- Anonymous Lambdas: Direct use of lambda expressions without explicit naming, promoting code readability and flexibility.
These advancements collectively elevate the developer experience and performance capabilities of SYCL 2020, solidifying its position as a powerful and user-friendly tool for heterogeneous programming.
Key attributes of SYCL (DPC++)
- Single source: Supports containing both host and device code in a single source file. We can also keep them in separate files, but it is not required.
- Host: CPU. Need to have C++17 support to support DPC++ and SYCL programs.
- Device: GPU/FPGA/DSP/ASIC/CPU. Used for acceleration offloading.
- Sharing devices: As GPU device can be used by other programs, the performance of DPC++ program may be impacted.
- Kernel code: Code that runs on accelerated devices.
- List of features that are supported:
- Lambdas
- Operator overloading
- Templates
- Classes
- Static polymorphism
- Read only values of non-global host variables via lambda captures
- The list of features not supported by kernel code are:
- Dynamic polymorphism
- Dynamic memory allocations (no new/delete operators)
- Static variables
- Function pointers
- Runtime type information
- Exception handling
- No virtual member functions / variadic function can be called from kernel code
- Recursion is not allowed within kernel code
Abstractions for Memory management in SYCL (DPC++)
- Unified Shared Memory (USM)
- Pointer based approach
- Easier integration with existing C++ code that works on pointers
- If we prefer to think about dependencies as performing one computation before another using an in-order queue, use USM
- Allocation types:
- device: Locates in device memory and cannot be accessed by host. Use
malloc_device
- host: Locates in host memory and accessible from both host and device. Use
malloc_host
- shared: Location shared and migrated automatically, accessible from both host and device. Use
malloc_shared
- Supports both explicit and implicit data movement based on allocation type
- Explicit: Occurs in
device
allocation using handler’smemcpy()
to transfer data from host to device and vice versa - Implicit: Occurs in
host
andshared
allocations. No need to usememcpy()
operation
- Buffers
- Represented by the
buffer
template class - Describe 1D/2D/3D arrays
- Buffers follow implicit data movement strategy
- If we prefer to think about data dependencies between kernels, choose buffers
- Buffer object can be a scalar data type(
int
,float
,double
),vector
+ data type or a user defined class or structure - Data structures in buffers must be C++ trivially copyable without the need of copy constructor
- Can be accessed from host and device
- Buffers represent data objects rather than specific memory addresses thus cannot be directly accessed like arrays
- Buffers are not directly accessible, but can be accessed by
accessor
objects - Buffer access mode:
read
: Read only accesswrite
: Write only access. Previous content not disgardedread_write
: Read and write access
- Represented by the
- Images
- Specific type of buffers for image processing
- Supports image formats, reading of images using sampler objects
C++ Lambda Functions in SYCL
- Lambda functions are introduced in
C++11
- Used to create anonymous function objects, can also be used with named variables
Lambda expression has the following syntax:
[ capture-list ] ( params ) -> ret { body }
capture-list
:- Lambda expression starts with a square bracket
- Denotes how to capture variables that are used within the lambda but not passed as parameters
- Comma separated list of captures
- Variables from the surrounding scope mentioned in this list are available in the lambda
body
- Controls the visibility and lifetime of variables captured by lambda
- Global variables are not captured in lambda expression
- Non-global static variables can be used in a kernel but only if they are
const
- Capture a variable by value by listing the variable name in the
capture-list
. Even the value is modified in thebody
, it does not affect the original value outside the lambda. - Capture a variable by reference by listing the variable name prefixing with ampersand (
&variable
). If this variable is modified in thebody
, the original variable is also modified. [=]
shorthand to capture all variables used in the body by value and current object by reference[&]
shorthand to capture all variables used in the body by reference and current object by reference[]
captures nothing- SYCL uses
[=]
as kernel does not support capturing variable by reference
params
:- List of function parameters similar to named function
- In SYCL, it can be unique 1D id, or 2D/3D id
ret
:- Defines the return type of the lambda expression
- If
-> ret
is not specified, it is inferred from the return statement in thebody
- Return with no value implies
void
return type - SYCL kernels must always have a
void
return type thus it is not specified in SYCL kernel
body
:- Contains function body
- SYCL kernel body does not have any
return
statement
Use DPC++ using Docker in interactive mode
Step 1: Pull Docker image
Pull Docker image for Intel oneAPI Basekit, which includes essential development tools. The process may take several minutes due to the image’s size (around a few gigabytes).:
docker pull intel/oneapi-basekit:devel-ubuntu22.04
Step 2: Prepare local environment
- Open a terminal in a local directory where you plan to execute a sample SYCL program (e.g.,
vector_add.cpp
). Place your SYCL program in this directory. A sample program is provided below for reference:
#include <sycl/sycl.hpp> using namespace sycl; void initialize_array(int n, double *a, double *b, double *c_result) { for (int i = 0; i < n; i++) { a[i] = 1.2; b[i] = 2.8; c_result[i] = a[i] + b[i]; } } void vector_add(queue &Q, int n, double *a, double *b, double *c) { buffer<double, 1> a_buffer(a, range<1>(n)); buffer<double, 1> b_buffer(b, range<1>(n)); buffer<double, 1> c_buffer(c, range<1>(n)); auto task_add = Q.submit([&](sycl::handler &cgh) { accessor a_accessor(a_buffer, cgh, read_only); accessor b_accessor(b_buffer, cgh, read_only); accessor c_accessor(c_buffer, cgh, write_only, no_init); cgh.parallel_for(range<1>(n), [=](id<1> idx) { c_accessor[idx] = a_accessor[idx] + b_accessor[idx]; }); }); task_add.wait(); } int main() { queue Q; std::cout << "Running on device: " << Q.get_device().get_info<info::device::name>() << std::endl; int n = 10000; double a[n]; double b[n]; double c[n]; double c_result[n]; initialize_array(n, a, b, c_result); vector_add(Q, n, a, b, c); int mismatch = 0; for (int i = 0; i < n; i++) { if (c_result[i] != c[i]) { mismatch++; } } if (mismatch == 0) { std::cout << "Verification passed" << std::endl; } else { std::cout << "Verification failed. Mismatch found: " << mismatch << std::endl; } return 0; }
Step 3: Mount the current directory to docker run
Mount the current directory into the Docker container to execute the SYCL program.
The following command runs a container named inteldpc
based on the intel/oneapi-basekit:devel-ubuntu22.04
image, mounting the current local directory to the /sycl_demo
directory inside the container in interactive mode.
docker run -v $(pwd):/sycl_demo -it --name=inteldpc intel/oneapi-basekit:devel-ubuntu22.04
Step 4: List of available devices and compiler version in Docker system
Within the Docker container, list the available devices in the system using
sycl-ls
:root@5d3cac1a5b20:/# sycl-ls [opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device 1.2 [2022.15.12.0.01_081451] [opencl:cpu:1] Intel(R) OpenCL, 13th Gen Intel(R) Core(TM) i9-13900H 3.0 [2022.15.12.0.01_081451]
Check the version of Intel LLVM-based compilers Intel® oneAPI DPC++/C++ Compiler:
root@5d3cac1a5b20:/# icpx --version Intel(R) oneAPI DPC++/C++ Compiler 2023.0.0 (2023.0.0.20221201) Target: x86_64-unknown-linux-gnu Thread model: posix InstalledDir: /opt/intel/oneapi/compiler/2023.0.0/linux/bin-llvm Configuration file: /opt/intel/oneapi/compiler/2023.0.0/linux/bin-llvm/../bin/icpx.cfg
Step 5: Compile and execute SYCL program
Navigate to the mounted directory within the Docker container:
root@5d3cac1a5b20:/# cd sycl_demo
Use the Intel oneAPI DPC++ compiler (
icpx
) to compile the SYCL program and run the executable:root@5d3cac1a5b20:/sycl_demo# icpx -fsycl vector_add.cpp root@5d3cac1a5b20:/sycl_demo# ./a.out
Upon successful execution, the program will display information about the running device and verification status:
Running on device: 13th Gen Intel(R) Core(TM) i9-13900H Verification passed
If you exit the docker interactive mode and wants to turn on the interactive mode again you need to delete the container first:
docker rm inteldpc
Then follow the above steps again to compile and run SYCL program in docker interactive mode.
(Optional) Step 6: Compile and execute SYCL program using Makefile
It is easier to use the following
Makefile
to compile and run the program in the container interactive mode:CXX = icpx CXXFLAGS = -fsycl -std=c++17 -w SRCS := $(wildcard *.cpp) PRGS := $(patsubst %.cpp,%,$(SRCS)) RUN := $(addprefix run_, $(PRGS)) .PHONY: run_all run_all: $(sort $(RUN)) .PHONY: all all: $(PRGS) .PHONY: all %: %.cpp $(CXX) $(CXXFLAGS) -o $@ $^ run_%:% ./$^ .PHONY: clean clean: rm -f $(PRGS)
Compile and run the program
vector_add.cpp
usingmake
command:root@5d3cac1a5b20:/sycl_demo# make vector_add icpx -fsycl -std=c++17 -w -o vector_add vector_add.cpp root@5d3cac1a5b20:/sycl_demo# make run_vector_add ./vector_add Running on device: 13th Gen Intel(R) Core(TM) i9-13900H Verification passed
Some Useful Docker Commands
List and delete local Docker image:
➜ ~ docker image ls REPOSITORY TAG IMAGE ID CREATED SIZE ghcr.io/intel/llvm/ubuntu2204_build 7ed894ab0acc8ff09262113fdb08940d22654a30 82f0019ffd14 17 hours ago 13.4GB # Delete a specific image using its IMAGE ID ➜ ~ sudo docker image remove 82f0019ffd14 -f Deleted: sha256:82f0019ffd14bdfc269d5f01fc28eda497fc339d367448086f4441eeae634e7a
List and delete Docker container:
# List all Docker containers, including stopped ones docker ps -a # Delete a specific container using its NAME or CONTAINER ID docker rm <NAME>
See disk usage in Docker:
➜ ~ docker system df TYPE TOTAL ACTIVE SIZE RECLAIMABLE Images 5 5 16.69GB 74.82MB (0%) Containers 7 0 20.95kB 20.95kB (100%) Local Volumes 2 2 50.18MB 0B (0%) Build Cache 0 0 0B 0B
Prune unused Docker data, including stopped containers, networks, and dangling images:
➜ ~ docker system prune WARNING! This will remove: - all stopped containers - all networks not used by at least one container - all dangling images - all dangling build cache Are you sure you want to continue? [y/N] y Deleted Containers: accd2f72561b96e4cebd297440083053fb7b420503a2acca1b363878c88cfbb6 76439ff0db662783f66edf94941a4afbfb9518cebbe4378df02c550a82705919 700aa68c01881bbdc1e7c382cd5d0de98ed95a61bb668a23a0c8bb14a556b4b0 d4857a1ac199c765539af1e1fb722267da66bc82ab7f2d5d8cd606e797645b76 b44844e8146badf9846d2d444b43f9db459428a826946bcbc3104f3bf2b7a6e4 4eff17fd7646f3b7f50e4e0b3bce0b22452e746662150cacc0fd0e5141a5eda7 83cf07d8e33fa4a585dd1bed239f80105b0c0e5850c6ba8d6075d7ce62fb59ab Deleted Networks: outline_default Total reclaimed space: 20.95kB
References
- Reinders, J., Ashbaugh, B., Brodman, J., Kinsner, M., Pennycook, J., & Tian, X. (2021). Data parallel C++: mastering DPC++ for programming of heterogeneous systems using C++ and SYCL (p. 548). Springer Nature.
- SYCL Overview by Thomas Applencourt
- SYCL 2020 API Reference Guide
- Docker Containers BKMs
- Docker, WSL, and oneAPI — A Quick How-To Guide
- OneAPI Samples
- Aurora Exascale Supercomputer
Advertisement