SYCL is an open industry standard, not a programming language itself, developed by the Khronos Group to facilitate data parallelism in C++ for heterogeneous computing systems. This means you can write single-source C++ code that utilizes the power of various processors like CPUs, GPUs, and FPGAs in a single program. This improves code maintainability and portability across different hardware platforms.

Core characteristics of SYCL:

  • Single-source programming: Write code once and target diverse hardware architectures without rewriting for each device.
  • Data parallelism: Leverage parallel execution for data-intensive tasks, significantly boosting performance.
  • Abstraction layer: SYCL hides the underlying hardware complexities, providing a unified programming model for different devices.
  • Memory management: SYCL offers various memory models (e.g., unified shared memory) to efficiently manage data movement between host and devices.
  • DPC++ (Data Parallel C++): Part of Intel’s oneAPI suite and also available as an open source compiler, DPC++ leverages LLVM/Clang for broad CPU and GPU support across Intel, NVIDIA, and AMD. In this post we will use DPC++ from oneAPI.
  • ComputeCPP: Open-source and multi-backend, ComputeCPP excels on Intel and AMD hardware with good CPU and GPU coverage. NVIDIA GPU support is under development.
  • hipSYCL: Primarily focused on NVIDIA GPUs with AMD support, hipSYCL offers multiple backends and experimental Intel GPU compatibility.

Additional technical points to consider

  • Performance portability: While SYCL promotes code portability, achieving optimal performance on each hardware platform might require specific tuning.
  • Learning curve: Compared to lower-level hardware-specific languages, SYCL has a gentler learning curve with support to modern C++ syntax but still requires understanding parallel programming concepts.
  • Ecosystem: The SYCL ecosystem is evolving, with growing adoption and toolchain improvements.
  • SYCL and IntelGPU in HPC: Aurora, the upcoming exascale supercomputer, will harness the power of SYCL and oneAPI on Intel GPUs to unlock groundbreaking scientific discoveries.

SYCL 2020

SYCL 2020 marked a significant milestone in the evolution of data parallelism for heterogeneous systems. DPC++ embraces the cutting-edge features introduced in the SYCL 2020 specification, empowering developers with:

  • Unified Shared Memory (USM): Streamlined memory management with seamless data access between host and devices.
  • Sub-groups: Fine-grained control over parallel execution within work groups for enhanced performance and algorithm optimization.
  • C++17 Syntax Simplifications (CTAD): Class template argument deduction for cleaner, more concise code.
  • Anonymous Lambdas: Direct use of lambda expressions without explicit naming, promoting code readability and flexibility.

These advancements collectively elevate the developer experience and performance capabilities of SYCL 2020, solidifying its position as a powerful and user-friendly tool for heterogeneous programming.

Key attributes of SYCL (DPC++)

  • Single source: Supports containing both host and device code in a single source file. We can also keep them in separate files, but it is not required.
  • Host: CPU. Need to have C++17 support to support DPC++ and SYCL programs.
  • Device: GPU/FPGA/DSP/ASIC/CPU. Used for acceleration offloading.
  • Sharing devices: As GPU device can be used by other programs, the performance of DPC++ program may be impacted.
  • Kernel code: Code that runs on accelerated devices.
  • List of features that are supported:
    • Lambdas
    • Operator overloading
    • Templates
    • Classes
    • Static polymorphism
    • Read only values of non-global host variables via lambda captures
  • The list of features not supported by kernel code are:
    • Dynamic polymorphism
    • Dynamic memory allocations (no new/delete operators)
    • Static variables
    • Function pointers
    • Runtime type information
    • Exception handling
    • No virtual member functions / variadic function can be called from kernel code
    • Recursion is not allowed within kernel code

Abstractions for Memory management in SYCL (DPC++)

  • Unified Shared Memory (USM)
    • Pointer based approach
    • Easier integration with existing C++ code that works on pointers
    • If we prefer to think about dependencies as performing one computation before another using an in-order queue, use USM
    • Allocation types:
    • device: Locates in device memory and cannot be accessed by host. Use malloc_device
    • host: Locates in host memory and accessible from both host and device. Use malloc_host
    • shared: Location shared and migrated automatically, accessible from both host and device. Use malloc_shared
    • Supports both explicit and implicit data movement based on allocation type
    • Explicit: Occurs in device allocation using handler’s memcpy() to transfer data from host to device and vice versa
    • Implicit: Occurs in host and shared allocations. No need to use memcpy() operation
  • Buffers
    • Represented by the buffer template class
    • Describe 1D/2D/3D arrays
    • Buffers follow implicit data movement strategy
    • If we prefer to think about data dependencies between kernels, choose buffers
    • Buffer object can be a scalar data type(int, float, double), vector+ data type or a user defined class or structure
    • Data structures in buffers must be C++ trivially copyable without the need of copy constructor
    • Can be accessed from host and device
    • Buffers represent data objects rather than specific memory addresses thus cannot be directly accessed like arrays
    • Buffers are not directly accessible, but can be accessed by accessor objects
    • Buffer access mode:
    • read: Read only access
    • write: Write only access. Previous content not disgarded
    • read_write: Read and write access
  • Images
    • Specific type of buffers for image processing
    • Supports image formats, reading of images using sampler objects

C++ Lambda Functions in SYCL

  • Lambda functions are introduced in C++11
  • Used to create anonymous function objects, can also be used with named variables
  • Lambda expression has the following syntax:

    [ capture-list ] ( params ) -> ret { body }
    
  • capture-list:

    • Lambda expression starts with a square bracket
    • Denotes how to capture variables that are used within the lambda but not passed as parameters
    • Comma separated list of captures
    • Variables from the surrounding scope mentioned in this list are available in the lambda body
    • Controls the visibility and lifetime of variables captured by lambda
    • Global variables are not captured in lambda expression
    • Non-global static variables can be used in a kernel but only if they are const
    • Capture a variable by value by listing the variable name in the capture-list. Even the value is modified in the body, it does not affect the original value outside the lambda.
    • Capture a variable by reference by listing the variable name prefixing with ampersand (&variable). If this variable is modified in the body, the original variable is also modified.
    • [=] shorthand to capture all variables used in the body by value and current object by reference
    • [&] shorthand to capture all variables used in the body by reference and current object by reference
    • [] captures nothing
    • SYCL uses [=] as kernel does not support capturing variable by reference
  • params:

    • List of function parameters similar to named function
    • In SYCL, it can be unique 1D id, or 2D/3D id
  • ret:

    • Defines the return type of the lambda expression
    • If -> ret is not specified, it is inferred from the return statement in the body
    • Return with no value implies void return type
    • SYCL kernels must always have a void return type thus it is not specified in SYCL kernel
  • body:

    • Contains function body
    • SYCL kernel body does not have any return statement

Use DPC++ using Docker in interactive mode

alt Use DPC++ using Docker in Interactive mode

Step 1: Pull Docker image

Pull Docker image for Intel oneAPI Basekit, which includes essential development tools. The process may take several minutes due to the image’s size (around a few gigabytes).:

docker pull intel/oneapi-basekit:devel-ubuntu22.04

Step 2: Prepare local environment

  • Open a terminal in a local directory where you plan to execute a sample SYCL program (e.g., vector_add.cpp).
  • Place your SYCL program in this directory. A sample program is provided below for reference:

    #include <sycl/sycl.hpp>
    
    using namespace sycl;
    
    void initialize_array(int n, double *a, double *b, double *c_result) {
    for (int i = 0; i < n; i++) {
        a[i] = 1.2;
        b[i] = 2.8;
        c_result[i] = a[i] + b[i];
    }
    }
    
    void vector_add(queue &Q, int n, double *a, double *b, double *c) {
    buffer<double, 1> a_buffer(a, range<1>(n));
    buffer<double, 1> b_buffer(b, range<1>(n));
    buffer<double, 1> c_buffer(c, range<1>(n));
    auto task_add = Q.submit([&](sycl::handler &cgh) {
        accessor a_accessor(a_buffer, cgh, read_only);
        accessor b_accessor(b_buffer, cgh, read_only);
        accessor c_accessor(c_buffer, cgh, write_only, no_init);
        cgh.parallel_for(range<1>(n), [=](id<1> idx) {
            c_accessor[idx] = a_accessor[idx] + b_accessor[idx];
        });
    });
    task_add.wait();
    }
    
    int main() {
    queue Q;
    std::cout << "Running on device: "
              << Q.get_device().get_info<info::device::name>()
              << std::endl;
    int n = 10000;
    double a[n];
    double b[n];
    double c[n];
    double c_result[n];
    
    initialize_array(n, a, b, c_result);
    vector_add(Q, n, a, b, c);
    
    int mismatch = 0;
    for (int i = 0; i < n; i++) {
        if (c_result[i] != c[i]) {
            mismatch++;
        }
    }
    if (mismatch == 0) {
        std::cout << "Verification passed" << std::endl;
    } else {
        std::cout << "Verification failed. Mismatch found: " << mismatch << std::endl;
    }
    
    return 0;
    }
    

Step 3: Mount the current directory to docker run

Mount the current directory into the Docker container to execute the SYCL program. The following command runs a container named inteldpc based on the intel/oneapi-basekit:devel-ubuntu22.04 image, mounting the current local directory to the /sycl_demo directory inside the container in interactive mode.

docker run -v $(pwd):/sycl_demo -it --name=inteldpc intel/oneapi-basekit:devel-ubuntu22.04

Step 4: List of available devices and compiler version in Docker system

  • Within the Docker container, list the available devices in the system using sycl-ls:

    root@5d3cac1a5b20:/# sycl-ls
    [opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device 1.2 [2022.15.12.0.01_081451]
    [opencl:cpu:1] Intel(R) OpenCL, 13th Gen Intel(R) Core(TM) i9-13900H 3.0 [2022.15.12.0.01_081451]
    
  • Check the version of Intel LLVM-based compilers Intel® oneAPI DPC++/C++ Compiler:

    root@5d3cac1a5b20:/# icpx --version
    Intel(R) oneAPI DPC++/C++ Compiler 2023.0.0 (2023.0.0.20221201)
    Target: x86_64-unknown-linux-gnu
    Thread model: posix
    InstalledDir: /opt/intel/oneapi/compiler/2023.0.0/linux/bin-llvm
    Configuration file: /opt/intel/oneapi/compiler/2023.0.0/linux/bin-llvm/../bin/icpx.cfg
    

Step 5: Compile and execute SYCL program

  • Navigate to the mounted directory within the Docker container:

    root@5d3cac1a5b20:/# cd sycl_demo
    
  • Use the Intel oneAPI DPC++ compiler (icpx) to compile the SYCL program and run the executable:

    root@5d3cac1a5b20:/sycl_demo# icpx -fsycl vector_add.cpp
    root@5d3cac1a5b20:/sycl_demo# ./a.out
    
  • Upon successful execution, the program will display information about the running device and verification status:

    Running on device: 13th Gen Intel(R) Core(TM) i9-13900H
    Verification passed
    
  • If you exit the docker interactive mode and wants to turn on the interactive mode again you need to delete the container first:

    docker rm inteldpc
    

    Then follow the above steps again to compile and run SYCL program in docker interactive mode.

(Optional) Step 6: Compile and execute SYCL program using Makefile

  • It is easier to use the following Makefile to compile and run the program in the container interactive mode:

    CXX = icpx
    CXXFLAGS = -fsycl -std=c++17 -w
    SRCS := $(wildcard *.cpp)
    PRGS := $(patsubst %.cpp,%,$(SRCS))
    RUN :=  $(addprefix run_, $(PRGS))
    
    .PHONY: run_all
    run_all: $(sort $(RUN))
    
    .PHONY: all
    all: $(PRGS)
    
    .PHONY: all
    %: %.cpp
    	$(CXX) $(CXXFLAGS) -o $@ $^
    
    run_%:%
    	./$^
    
    .PHONY: clean
    clean:
    	rm -f $(PRGS)
    
  • Compile and run the program vector_add.cpp using make command:

    root@5d3cac1a5b20:/sycl_demo# make vector_add
    icpx -fsycl -std=c++17 -w -o vector_add vector_add.cpp
    root@5d3cac1a5b20:/sycl_demo# make run_vector_add
    ./vector_add
    Running on device: 13th Gen Intel(R) Core(TM) i9-13900H
    Verification passed
    

Some Useful Docker Commands

  • List and delete local Docker image:

    ➜  ~ docker image ls
    REPOSITORY                            TAG                                        IMAGE ID       CREATED        SIZE
    ghcr.io/intel/llvm/ubuntu2204_build   7ed894ab0acc8ff09262113fdb08940d22654a30   82f0019ffd14   17 hours ago   13.4GB
    # Delete a specific image using its IMAGE ID
    ➜  ~ sudo docker image remove 82f0019ffd14 -f
    Deleted: sha256:82f0019ffd14bdfc269d5f01fc28eda497fc339d367448086f4441eeae634e7a
    
  • List and delete Docker container:

    # List all Docker containers, including stopped ones
    docker ps -a
    
    # Delete a specific container using its NAME or CONTAINER ID
    docker rm <NAME>
    
  • See disk usage in Docker:

    ➜  ~ docker system df
    TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
    Images          5         5         16.69GB   74.82MB (0%)
    Containers      7         0         20.95kB   20.95kB (100%)
    Local Volumes   2         2         50.18MB   0B (0%)
    Build Cache     0         0         0B        0B
    
  • Prune unused Docker data, including stopped containers, networks, and dangling images:

    ➜  ~ docker system prune
    WARNING! This will remove:
    - all stopped containers
    - all networks not used by at least one container
    - all dangling images
    - all dangling build cache
    
    Are you sure you want to continue? [y/N] y
    Deleted Containers:
    accd2f72561b96e4cebd297440083053fb7b420503a2acca1b363878c88cfbb6
    76439ff0db662783f66edf94941a4afbfb9518cebbe4378df02c550a82705919
    700aa68c01881bbdc1e7c382cd5d0de98ed95a61bb668a23a0c8bb14a556b4b0
    d4857a1ac199c765539af1e1fb722267da66bc82ab7f2d5d8cd606e797645b76
    b44844e8146badf9846d2d444b43f9db459428a826946bcbc3104f3bf2b7a6e4
    4eff17fd7646f3b7f50e4e0b3bce0b22452e746662150cacc0fd0e5141a5eda7
    83cf07d8e33fa4a585dd1bed239f80105b0c0e5850c6ba8d6075d7ce62fb59ab
    
    Deleted Networks:
    outline_default
    
    Total reclaimed space: 20.95kB
    

References

Advertisement

Citation

Click to select citation style

Shovon, A. R. (2023, December 26). Introduction to SYCL and DPC++. Ahmedur Rahman Shovon. Retrieved December 3, 2024, from https://arshovon.com/blog/sycl-docker-interactive/

Shovon, Ahmedur Rahman. “Introduction to SYCL and DPC++.” Ahmedur Rahman Shovon, 26 Dec. 2023. Web. 3 Dec. 2024. https://arshovon.com/blog/sycl-docker-interactive/.

@misc{ shovon_2023,
    author = "Shovon, Ahmedur Rahman",
    title = "Introduction to SYCL and DPC++",
    year = "2023",
    url = "https://arshovon.com/blog/sycl-docker-interactive/",
    note = "[Online; accessed 3-December-2024]"
}
Related contents in this website