Introduction to SYCL and DPC++

SYCL is an open industry standard, not a programming language itself, developed by the Khronos Group to facilitate data parallelism in C++ for heterogeneous computing systems. This means you can write single-source C++ code that utilizes the power of various processors like CPUs, GPUs, and FPGAs in a single program. This improves code maintainability and portability across different hardware platforms.

Core characteristics of SYCL:

Single-source programming: Write code once and target diverse hardware architectures without rewriting for each device.
Data parallelism: Leverage parallel execution for data-intensive tasks, significantly boosting performance.
Abstraction layer: SYCL hides the underlying hardware complexities, providing a unified programming model for different devices.
Memory management: SYCL offers various memory models (e.g., unified shared memory) to efficiently manage data movement between host and devices.

Popular SYCL implementations

DPC++ (Data Parallel C++): Part of Intel’s oneAPI suite and also available as an open source compiler, DPC++ leverages LLVM/Clang for broad CPU and GPU support across Intel, NVIDIA, and AMD. In this post we will use DPC++ from oneAPI.
ComputeCPP: Open-source and multi-backend, ComputeCPP excels on Intel and AMD hardware with good CPU and GPU coverage. NVIDIA GPU support is under development.
hipSYCL: Primarily focused on NVIDIA GPUs with AMD support, hipSYCL offers multiple backends and experimental Intel GPU compatibility.

Additional technical points to consider

Performance portability: While SYCL promotes code portability, achieving optimal performance on each hardware platform might require specific tuning.
Learning curve: Compared to lower-level hardware-specific languages, SYCL has a gentler learning curve with support to modern C++ syntax but still requires understanding parallel programming concepts.
Ecosystem: The SYCL ecosystem is evolving, with growing adoption and toolchain improvements.
SYCL and IntelGPU in HPC: Aurora, the upcoming exascale supercomputer, will harness the power of SYCL and oneAPI on Intel GPUs to unlock groundbreaking scientific discoveries.

SYCL 2020

SYCL 2020 marked a significant milestone in the evolution of data parallelism for heterogeneous systems. DPC++ embraces the cutting-edge features introduced in the SYCL 2020 specification, empowering developers with:

Unified Shared Memory (USM): Streamlined memory management with seamless data access between host and devices.
Sub-groups: Fine-grained control over parallel execution within work groups for enhanced performance and algorithm optimization.
C++17 Syntax Simplifications (CTAD): Class template argument deduction for cleaner, more concise code.
Anonymous Lambdas: Direct use of lambda expressions without explicit naming, promoting code readability and flexibility.

These advancements collectively elevate the developer experience and performance capabilities of SYCL 2020, solidifying its position as a powerful and user-friendly tool for heterogeneous programming.

Key attributes of SYCL (DPC++)

Single source: Supports containing both host and device code in a single source file. We can also keep them in separate files, but it is not required.
Host: CPU. Need to have C++17 support to support DPC++ and SYCL programs.
Device: GPU/FPGA/DSP/ASIC/CPU. Used for acceleration offloading.
Sharing devices: As GPU device can be used by other programs, the performance of DPC++ program may be impacted.
Kernel code: Code that runs on accelerated devices.
List of features that are supported:
- Lambdas
- Operator overloading
- Templates
- Classes
- Static polymorphism
- Read only values of non-global host variables via lambda captures
The list of features not supported by kernel code are:
- Dynamic polymorphism
- Dynamic memory allocations (no new/delete operators)
- Static variables
- Function pointers
- Runtime type information
- Exception handling
- No virtual member functions / variadic function can be called from kernel code
- Recursion is not allowed within kernel code

Abstractions for Memory management in SYCL (DPC++)

Unified Shared Memory (USM)
- Pointer based approach
- Easier integration with existing C++ code that works on pointers
- If we prefer to think about dependencies as performing one computation before another using an in-order queue, use USM
- Allocation types:
- device: Locates in device memory and cannot be accessed by host. Use malloc_device
- host: Locates in host memory and accessible from both host and device. Use malloc_host
- shared: Location shared and migrated automatically, accessible from both host and device. Use malloc_shared
- Supports both explicit and implicit data movement based on allocation type
- Explicit: Occurs in device allocation using handler’s memcpy() to transfer data from host to device and vice versa
- Implicit: Occurs in host and shared allocations. No need to use memcpy() operation
Buffers
- Represented by the buffer template class
- Describe 1D/2D/3D arrays
- Buffers follow implicit data movement strategy
- If we prefer to think about data dependencies between kernels, choose buffers
- Buffer object can be a scalar data type(int, float, double), vector+ data type or a user defined class or structure
- Data structures in buffers must be C++ trivially copyable without the need of copy constructor
- Can be accessed from host and device
- Buffers represent data objects rather than specific memory addresses thus cannot be directly accessed like arrays
- Buffers are not directly accessible, but can be accessed by accessor objects
- Buffer access mode:
- read: Read only access
- write: Write only access. Previous content not disgarded
- read_write: Read and write access
Images
- Specific type of buffers for image processing
- Supports image formats, reading of images using sampler objects

C++ Lambda Functions in SYCL

Lambda functions are introduced in C++11
Used to create anonymous function objects, can also be used with named variables

Lambda expression has the following syntax:

[ capture-list ] ( params ) -> ret { body }

capture-list:
- Lambda expression starts with a square bracket
- Denotes how to capture variables that are used within the lambda but not passed as parameters
- Comma separated list of captures
- Variables from the surrounding scope mentioned in this list are available in the lambda body
- Controls the visibility and lifetime of variables captured by lambda
- Global variables are not captured in lambda expression
- Non-global static variables can be used in a kernel but only if they are const
- Capture a variable by value by listing the variable name in the capture-list. Even the value is modified in the body, it does not affect the original value outside the lambda.
- Capture a variable by reference by listing the variable name prefixing with ampersand (&variable). If this variable is modified in the body, the original variable is also modified.
- [=] shorthand to capture all variables used in the body by value and current object by reference
- [&] shorthand to capture all variables used in the body by reference and current object by reference
- [] captures nothing
- SYCL uses [=] as kernel does not support capturing variable by reference
params:
- List of function parameters similar to named function
- In SYCL, it can be unique 1D id, or 2D/3D id
ret:
- Defines the return type of the lambda expression
- If -> ret is not specified, it is inferred from the return statement in the body
- Return with no value implies void return type
- SYCL kernels must always have a void return type thus it is not specified in SYCL kernel
body:
- Contains function body
- SYCL kernel body does not have any return statement

Use DPC++ using Docker in interactive mode

alt Use DPC++ using Docker in Interactive mode

Step 1: Pull Docker image

Pull Docker image for Intel oneAPI Basekit, which includes essential development tools. The process may take several minutes due to the image’s size (around a few gigabytes).:

docker pull intel/oneapi-basekit:devel-ubuntu22.04

Step 2: Prepare local environment

Open a terminal in a local directory where you plan to execute a sample SYCL program (e.g., vector_add.cpp).

Place your SYCL program in this directory. A sample program is provided below for reference:

#include <sycl/sycl.hpp>

using namespace sycl;

void initialize_array(int n, double *a, double *b, double *c_result) {
for (int i = 0; i < n; i++) {
    a[i] = 1.2;
    b[i] = 2.8;
    c_result[i] = a[i] + b[i];
}
}

void vector_add(queue &Q, int n, double *a, double *b, double *c) {
buffer<double, 1> a_buffer(a, range<1>(n));
buffer<double, 1> b_buffer(b, range<1>(n));
buffer<double, 1> c_buffer(c, range<1>(n));
auto task_add = Q.submit([&](sycl::handler &cgh) {
    accessor a_accessor(a_buffer, cgh, read_only);
    accessor b_accessor(b_buffer, cgh, read_only);
    accessor c_accessor(c_buffer, cgh, write_only, no_init);
    cgh.parallel_for(range<1>(n), [=](id<1> idx) {
        c_accessor[idx] = a_accessor[idx] + b_accessor[idx];
    });
});
task_add.wait();
}

int main() {
queue Q;
std::cout << "Running on device: "
          << Q.get_device().get_info<info::device::name>()
          << std::endl;
int n = 10000;
double a[n];
double b[n];
double c[n];
double c_result[n];

initialize_array(n, a, b, c_result);
vector_add(Q, n, a, b, c);

int mismatch = 0;
for (int i = 0; i < n; i++) {
    if (c_result[i] != c[i]) {
        mismatch++;
    }
}
if (mismatch == 0) {
    std::cout << "Verification passed" << std::endl;
} else {
    std::cout << "Verification failed. Mismatch found: " << mismatch << std::endl;
}

return 0;
}

Step 3: Mount the current directory to docker run

Mount the current directory into the Docker container to execute the SYCL program. The following command runs a container named inteldpc based on the intel/oneapi-basekit:devel-ubuntu22.04 image, mounting the current local directory to the /sycl_demo directory inside the container in interactive mode.

docker run -v $(pwd):/sycl_demo -it --name=inteldpc intel/oneapi-basekit:devel-ubuntu22.04

Step 4: List of available devices and compiler version in Docker system

Within the Docker container, list the available devices in the system using sycl-ls:

root@5d3cac1a5b20:/# sycl-ls
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device 1.2 [2022.15.12.0.01_081451]
[opencl:cpu:1] Intel(R) OpenCL, 13th Gen Intel(R) Core(TM) i9-13900H 3.0 [2022.15.12.0.01_081451]

Check the version of Intel LLVM-based compilers Intel® oneAPI DPC++/C++ Compiler:

root@5d3cac1a5b20:/# icpx --version
Intel(R) oneAPI DPC++/C++ Compiler 2023.0.0 (2023.0.0.20221201)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/intel/oneapi/compiler/2023.0.0/linux/bin-llvm
Configuration file: /opt/intel/oneapi/compiler/2023.0.0/linux/bin-llvm/../bin/icpx.cfg

Step 5: Compile and execute SYCL program

Navigate to the mounted directory within the Docker container:
```
root@5d3cac1a5b20:/# cd sycl_demo
```

Use the Intel oneAPI DPC++ compiler (icpx) to compile the SYCL program and run the executable:

root@5d3cac1a5b20:/sycl_demo# icpx -fsycl vector_add.cpp
root@5d3cac1a5b20:/sycl_demo# ./a.out

Upon successful execution, the program will display information about the running device and verification status:
```
Running on device: 13th Gen Intel(R) Core(TM) i9-13900H
Verification passed
```
If you exit the docker interactive mode and wants to turn on the interactive mode again you need to delete the container first:
```
docker rm inteldpc
```
Then follow the above steps again to compile and run SYCL program in docker interactive mode.

(Optional) Step 6: Compile and execute SYCL program using Makefile

It is easier to use the following Makefile to compile and run the program in the container interactive mode:

CXX = icpx
CXXFLAGS = -fsycl -std=c++17 -w
SRCS := $(wildcard *.cpp)
PRGS := $(patsubst %.cpp,%,$(SRCS))
RUN :=  $(addprefix run_, $(PRGS))

.PHONY: run_all
run_all: $(sort $(RUN))

.PHONY: all
all: $(PRGS)

.PHONY: all
%: %.cpp
	$(CXX) $(CXXFLAGS) -o $@ $^

run_%:%
	./$^

.PHONY: clean
clean:
	rm -f $(PRGS)

Compile and run the program vector_add.cpp using make command:

root@5d3cac1a5b20:/sycl_demo# make vector_add
icpx -fsycl -std=c++17 -w -o vector_add vector_add.cpp
root@5d3cac1a5b20:/sycl_demo# make run_vector_add
./vector_add
Running on device: 13th Gen Intel(R) Core(TM) i9-13900H
Verification passed

Some Useful Docker Commands

List and delete local Docker image:

➜  ~ docker image ls
REPOSITORY                            TAG                                        IMAGE ID       CREATED        SIZE
ghcr.io/intel/llvm/ubuntu2204_build   7ed894ab0acc8ff09262113fdb08940d22654a30   82f0019ffd14   17 hours ago   13.4GB
# Delete a specific image using its IMAGE ID
➜  ~ sudo docker image remove 82f0019ffd14 -f
Deleted: sha256:82f0019ffd14bdfc269d5f01fc28eda497fc339d367448086f4441eeae634e7a

List and delete Docker container:

# List all Docker containers, including stopped ones
docker ps -a

# Delete a specific container using its NAME or CONTAINER ID
docker rm <NAME>

See disk usage in Docker:

➜  ~ docker system df
TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          5         5         16.69GB   74.82MB (0%)
Containers      7         0         20.95kB   20.95kB (100%)
Local Volumes   2         2         50.18MB   0B (0%)
Build Cache     0         0         0B        0B

Prune unused Docker data, including stopped containers, networks, and dangling images:

➜  ~ docker system prune
WARNING! This will remove:
- all stopped containers
- all networks not used by at least one container
- all dangling images
- all dangling build cache

Are you sure you want to continue? [y/N] y
Deleted Containers:
accd2f72561b96e4cebd297440083053fb7b420503a2acca1b363878c88cfbb6
76439ff0db662783f66edf94941a4afbfb9518cebbe4378df02c550a82705919
700aa68c01881bbdc1e7c382cd5d0de98ed95a61bb668a23a0c8bb14a556b4b0
d4857a1ac199c765539af1e1fb722267da66bc82ab7f2d5d8cd606e797645b76
b44844e8146badf9846d2d444b43f9db459428a826946bcbc3104f3bf2b7a6e4
4eff17fd7646f3b7f50e4e0b3bce0b22452e746662150cacc0fd0e5141a5eda7
83cf07d8e33fa4a585dd1bed239f80105b0c0e5850c6ba8d6075d7ce62fb59ab

Deleted Networks:
outline_default

Total reclaimed space: 20.95kB

By Ahmedur Rahman Shovon

Popular SYCL implementations

Additional technical points to consider

SYCL 2020

Key attributes of SYCL (DPC++)

Abstractions for Memory management in SYCL (DPC++)

C++ Lambda Functions in SYCL

Use DPC++ using Docker in interactive mode

Step 1: Pull Docker image

Step 2: Prepare local environment

Step 3: Mount the current directory to docker run

Step 4: List of available devices and compiler version in Docker system

Step 5: Compile and execute SYCL program

(Optional) Step 6: Compile and execute SYCL program using Makefile

Some Useful Docker Commands

References

Citation

Related contents in this website

cuDF vs Pandas dataframe performance comparison

How to install NVIDIA driver and CUDA on Ubuntu 23.10

Run a Python script inside a virtual environment in the background

Previous post

Next post

Ahmedur Rahman Shovon

Popular SYCL implementations

Additional technical points to consider

SYCL 2020

Key attributes of SYCL (DPC++)

Abstractions for Memory management in SYCL (DPC++)

C++ Lambda Functions in SYCL

Use DPC++ using Docker in interactive mode

Step 1: Pull Docker image

Step 2: Prepare local environment

Step 3: Mount the current directory to docker run

Step 4: List of available devices and compiler version in Docker system

Step 5: Compile and execute SYCL program

(Optional) Step 6: Compile and execute SYCL program using Makefile

Some Useful Docker Commands

References

Citation

APA Style

MLA Style

BibTeX entry

Related contents in this website

cuDF vs Pandas dataframe performance comparison

How to install NVIDIA driver and CUDA on Ubuntu 23.10

Run a Python script inside a virtual environment in the background

Previous post

Next post

Ahmedur Rahman Shovon