hipSYCL - Implementation of SYCL 1.2.1 over AMD HIP/NVIDIA CUDA

  •        28

The goal of the hipSYCL project is to develop a SYCL 1.2.1 implementation that is built upon NVIDIA CUDA/AMD HIP. hipSYCL provides a SYCL interface to NVIDIA CUDA and AMD HIP. hipSYCL applications are then compiled with the regular vendor compilers (nvcc for nvidia and hcc for AMD), and hence can enjoy full vendor support.




Related Projects

Arraymancer - A fast, ergonomic and portable tensor library in Nim with a deep learning focus for CPU, GPU, OpenCL and embedded devices

  •    Nim

Arraymancer is a tensor (N-dimensional array) project in Nim. The main focus is providing a fast and ergonomic CPU, Cuda and OpenCL ndarray library on which to build a scientific computing and in particular a deep learning ecosystem. The library is inspired by Numpy and PyTorch. The library provides ergonomics very similar to Numpy, Julia and Matlab but is fully parallel and significantly faster than those libraries. It is also faster than C-based Torch.

vexcl - VexCL is a C++ vector expression template library for OpenCL/CUDA

  •    C++

VexCL is a vector expression template library for OpenCL/CUDA. It has been created for ease of GPGPU development with C++. VexCL strives to reduce amount of boilerplate code needed to develop GPGPU applications. The library provides convenient and intuitive notation for vector arithmetic, reduction, sparse matrix-vector products, etc. Multi-device and even multi-platform computations are supported. The source code of the library is distributed under very permissive MIT license.

collenchyma - Extendable HPC-Framework for CUDA, OpenCL and common CPU

  •    Rust

Collenchyma is an extensible, pluggable, backend-agnostic framework for parallel, high-performance computations on CUDA, OpenCL and common host CPU. It is fast, easy to build and provides an extensible Rust struct to execute operations on almost any machine, even if it does not have CUDA or OpenCL capable devices. Collenchyma's abstracts over the different computation languages (Native, OpenCL, Cuda) and let's you run highly-performant code, thanks to easy parallelization, on servers, desktops or mobiles without the need to adapt your code for the machine you deploy to. Collenchyma does not require OpenCL or Cuda on the machine and automatically falls back to the native host CPU, making your application highly flexible and fast to build.

neanderthal - Fast Clojure Matrix Library

  •    Clojure

Neanderthal is a Clojure library for fast matrix and linear algebra computations based on the highly optimized native libraries of BLAS and LAPACK computation routines for both CPU and GPU.. Read the documentation at Neanderthal Web Site.

SyclParallelSTL - Open Source Parallel STL implementation

  •    C++

This project features an implementation of the Parallel STL library using the Khronos SYCL standard. SYCL is a royalty-free, cross-platform C++ abstraction layer that builds on top of OpenCL. SYCL enables single-source development of OpenCL applications in C++ whilst enabling traditional host compilers to produce standard C++ code.

qcgpu-rust - A High Performance, Hardware accelerated, Quantum computer simulator in Rust

  •    Rust

The goal of QCGPU is to provide a library for the simulation of quantum computers that is fast, efficient and portable. QCGPU is written in Rust and uses OpenCL to run code on the CPU, GPU or any other OpenCL supported devices. This library is meant to be used both independently and alongside established tools for example compilers or more general and high level frameworks. If you are interested in using QCGPU with IBM's QISKit framework or QISKit ACQUA, please see the repository qiskit-addon-qcgpu.

gunrock - High-Performance Graph Primitives on GPUs

  •    Cuda

Gunrock is a CUDA library for graph-processing designed specifically for the GPU. It uses a high-level, bulk-synchronous, data-centric abstraction focused on operations on a vertex or edge frontier. Gunrock achieves a balance between performance and expressiveness by coupling high performance GPU computing primitives and optimization strategies with a high-level programming model that allows programmers to quickly develop new graph primitives with small code size and minimal GPU programming knowledge. For more details, please visit our website, read Why Gunrock, our TOPC 2017 paper Gunrock: GPU Graph Analytics, look at our results, and find more details in our publications. See Release Notes to keep up with the our latest changes.

ArrayFire - Parallel Computing Library

  •    C++

ArrayFire is a high performance software library for parallel computing with an easy-to-use API. Its array based function set makes parallel programming simple. ArrayFire's multiple backends (CUDA, OpenCL and native CPU) make it platform independent and highly portable. A few lines of code in ArrayFire can replace dozens of lines of parallel computing code, saving you valuable time and lowering development costs.

compute - A C++ GPU Computing Library for OpenCL

  •    C++

Boost.Compute is a GPU/parallel-computing library for C++ based on OpenCL. The core library is a thin C++ wrapper over the OpenCL API and provides access to compute devices, contexts, command queues and memory buffers.

Chlorine - Dead Simple OpenCL

  •    C++

Chlorine is the easiest way to interact with OpenCL compatible devices. It is a header-only C++11 library that allows you to write cross-platform code that runs on GPUs without ever touching the complicated OpenCL API, leaving you free to write code that matters: kernels that process data. Chlorine is composed of just two headers: chlorine.hpp, and its dependency, the OpenCL C++ Bindings. To integrate Chlorine into your own project, install OpenCL; then add chlorine/include to your include paths and link with OpenCL. Chlorine also requires a compiler with C++11 support. An example of how to use Chlorine is below, or read a more detailed walkthrough if you prefer.

coriander - Build NVIDIA® CUDA™ code for OpenCL™ 1.2 devices

  •    LLVM

Build applications written in NVIDIA® CUDA™ code for OpenCL™ 1.2 devices. Other systems should work too, ideally. You will need at a minimum at least one OpenCL-enabled GPU, and appropriate OpenCL drivers installed, for the GPU. Both linux and Mac systems stand a reasonable chance of working ok.

cutlass - CUDA Templates for Linear Algebra Subroutines

  •    C++

CUTLASS 1.0 is a collection of CUDA C++ template abstractions for implementing high-performance matrix-multiplication (GEMM) at all levels and scales within CUDA. It incorporates strategies for hierarchical decomposition and data movement similar to those used to implement cuBLAS. CUTLASS decomposes these "moving parts" into reusable, modular software components abstracted by C++ template classes. These thread-wide, warp-wide, block-wide, and device-wide primitives can be specialized and tuned via custom tiling sizes, data types, and other algorithmic policy. The resulting flexibility simplifies their use as building blocks within custom kernels and applications. To support a wide variety of applications, CUTLASS provides extensive support for mixed-precision computations, providing specialized data-movement and multiply-accumulate abstractions for 8-bit integer, half-precision floating point (FP16), single-precision floating point (FP32), and double-precision floating point (FP64) types. Furthermore, CUTLASS demonstrates CUDA's WMMA API for targeting the programmable, high-throughput Tensor Cores provided by NVIDIA's Volta architecture and beyond.

lwjgl3 - LWJGL is a Java library that enables cross-platform access to popular native APIs useful in the development of graphics (OpenGL), audio (OpenAL) and parallel computing (OpenCL) applications

  •    Kotlin

LWJGL (https://www.lwjgl.org) is a Java library that enables cross-platform access to popular native APIs useful in the development of graphics (OpenGL/Vulkan), audio (OpenAL) and parallel computing (OpenCL) applications. This access is direct and high-performance, yet also wrapped in a type-safe and user-friendly layer, appropriate for the Java ecosystem.LWJGL is an enabling technology and provides low-level access. It is not a framework and does not provide higher-level utilities than what the native libraries expose. As such, novice programmers are encouraged to try one of the frameworks or game engines that make use of LWJGL, before working directly with the library.

chainer - A flexible framework of neural networks for deep learning

  •    Python

Chainer is a Python-based deep learning framework aiming at flexibility. It provides automatic differentiation APIs based on the define-by-run approach (a.k.a. dynamic computational graphs) as well as object-oriented high-level APIs to build and train neural networks. It also supports CUDA/cuDNN using CuPy for high performance training and inference. For more details of Chainer, see the documents and resources listed above and join the community in Forum, Slack, and Twitter. The stable version of current Chainer is separated in here: v3.


  •    DotNet

managedCUDA makes the CUDA Driver API available in .net applications written in C#, Visual Basic or any other .net language. It also includes classes for an easy handling and interop with CUDA, i.e. build-in CUDA types like float3.

accelerate - Embedded language for high-performance array computations

  •    Haskell

Data.Array.Accelerate defines an embedded language of array computations for high-performance computing in Haskell. Computations on multi-dimensional, regular arrays are expressed in the form of parameterised collective operations (such as maps, reductions, and permutations). These computations are online-compiled and executed on a range of architectures. Chapter 6 of Simon Marlow's book Parallel and Concurrent Programming in Haskell contains a tutorial introduction to Accelerate.

scikit-cuda - Python interface to GPU-powered libraries

  •    Python

scikit-cuda provides Python interfaces to many of the functions in the CUDA device/runtime, CUBLAS, CUFFT, and CUSOLVER libraries distributed as part of NVIDIA's CUDA Programming Toolkit, as well as interfaces to select functions in the CULA Dense Toolkit. Both low-level wrapper functions similar to their C counterparts and high-level functions comparable to those in NumPy and Scipy are provided. Package documentation is available at http://scikit-cuda.readthedocs.org/. Many of the high-level functions have examples in their docstrings. More illustrations of how to use both the wrappers and high-level functions can be found in the demos/ and tests/ subdirectories.

Permutations with CUDA and OpenCL


Finding massive permutations on GPU with CUDA and OpenCL

SublimeAStyleFormatter - SublimeAStyleFormatter is a code formatter/beautifier for Sublime Text 2 & 3

  •    Python

SublimeAStyleFormatter is a simple code formatter plugin for Sublime Text. It provides ability to format C, C++, Cuda-C++, OpenCL, Arduino, C#, and Java files. NOTE: Syntax files required to be installed separately for Cuda-C++ and OpenCL.