cu - package cu provides an idiomatic interface to the CUDA Driver API.

  •        21

Package cu interfaces with the CUDA Driver API. It was directly inspired by Arne Vansteenkiste's cu package. The main reason for writing this package, rather than just using the already-excellent cu package, was error handling: this package returns errors instead of panicking.



Related Projects


  •    DotNet

managedCUDA makes the CUDA Driver API available in .NET applications written in C#, Visual Basic or any other .NET language. It also includes classes for easy handling of and interop with CUDA, i.e. built-in CUDA types like float3.

CUDA driver API


Making the CUDA driver API as simple to use as the runtime API. Almost.

cutlass - CUDA Templates for Linear Algebra Subroutines

  •    C++

CUTLASS 1.0 is a collection of CUDA C++ template abstractions for implementing high-performance matrix-multiplication (GEMM) at all levels and scales within CUDA. It incorporates strategies for hierarchical decomposition and data movement similar to those used to implement cuBLAS. CUTLASS decomposes these "moving parts" into reusable, modular software components abstracted by C++ template classes. These thread-wide, warp-wide, block-wide, and device-wide primitives can be specialized and tuned via custom tiling sizes, data types, and other algorithmic policy. The resulting flexibility simplifies their use as building blocks within custom kernels and applications. To support a wide variety of applications, CUTLASS provides extensive support for mixed-precision computations, providing specialized data-movement and multiply-accumulate abstractions for 8-bit integer, half-precision floating point (FP16), single-precision floating point (FP32), and double-precision floating point (FP64) types. Furthermore, CUTLASS demonstrates CUDA's WMMA API for targeting the programmable, high-throughput Tensor Cores provided by NVIDIA's Volta architecture and beyond.
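The idea of decomposing a matrix multiplication into tiles can be sketched on the CPU. The following Go example is not CUTLASS code and does not reflect its API; the function tiledGEMM and its tile parameter are assumptions made for illustration, showing only how the work is split into blocks that a GPU implementation would map onto thread blocks, warps and threads.

```go
package main

import "fmt"

// minInt returns the smaller of two ints.
func minInt(a, b int) int {
	if a < b {
		return a
	}
	return b
}

// tiledGEMM computes C = A*B for row-major matrices, where A is m×k and
// B is k×n. The three outer loops walk over tile×tile blocks; the three
// inner loops do the multiply-accumulate work inside one block.
func tiledGEMM(a, b []float64, m, n, k, tile int) []float64 {
	c := make([]float64, m*n)
	for i0 := 0; i0 < m; i0 += tile {
		for j0 := 0; j0 < n; j0 += tile {
			for p0 := 0; p0 < k; p0 += tile {
				for i := i0; i < minInt(i0+tile, m); i++ {
					for j := j0; j < minInt(j0+tile, n); j++ {
						sum := c[i*n+j]
						for p := p0; p < minInt(p0+tile, k); p++ {
							sum += a[i*k+p] * b[p*n+j]
						}
						c[i*n+j] = sum
					}
				}
			}
		}
	}
	return c
}

func main() {
	a := []float64{1, 2, 3, 4, 5, 6}    // 2×3
	b := []float64{7, 8, 9, 10, 11, 12} // 3×2
	fmt.Println(tiledGEMM(a, b, 2, 2, 3, 2)) // prints [58 64 139 154]
}
```

The tile size changes only the order in which partial products are accumulated, so it can be tuned for the memory hierarchy of the target.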

cupy - NumPy-like API accelerated with CUDA

  •    Python

CuPy is an implementation of a NumPy-compatible multi-dimensional array on CUDA. CuPy consists of the core multi-dimensional array class, cupy.ndarray, and many functions on it. It supports a subset of the numpy.ndarray interface. For detailed instructions on installing CuPy, see the installation guide.

scikit-cuda - Python interface to GPU-powered libraries

  •    Python

scikit-cuda provides Python interfaces to many of the functions in the CUDA device/runtime, CUBLAS, CUFFT, and CUSOLVER libraries distributed as part of NVIDIA's CUDA Programming Toolkit, as well as interfaces to select functions in the CULA Dense Toolkit. Both low-level wrapper functions similar to their C counterparts and high-level functions comparable to those in NumPy and Scipy are provided. Package documentation is available online. Many of the high-level functions have examples in their docstrings. More illustrations of how to use both the wrappers and high-level functions can be found in the demos/ and tests/ subdirectories.

collenchyma - Extendable HPC-Framework for CUDA, OpenCL and common CPU

  •    Rust

Collenchyma is an extensible, pluggable, backend-agnostic framework for parallel, high-performance computations on CUDA, OpenCL and common host CPU. It is fast, easy to build and provides an extensible Rust struct to execute operations on almost any machine, even if it does not have CUDA or OpenCL capable devices. Collenchyma abstracts over the different computation languages (Native, OpenCL, Cuda) and lets you run highly performant code, thanks to easy parallelization, on servers, desktops or mobiles without the need to adapt your code for the machine you deploy to. Collenchyma does not require OpenCL or Cuda on the machine and automatically falls back to the native host CPU, making your application highly flexible and fast to build.

CUDA VS Wizard

  •    Javascript

A VS Project Wizard for CUDA. After you install the CUDA VS Wizard, you can see the CUDAWinApp in your Visual Studio installed templates Category. Then it's easy to create a new CUDA project in VS. It can support Windows 32bit & 64bit system, VS2005 & V

tensorflow_tutorials - From the basics to slightly more interesting applications of Tensorflow

  •    Jupyter

You can find python source code under the python directory, and associated notebooks under notebooks. For Ubuntu users using python3.4+ w/ CUDA 7.5 and cuDNN 7.0, you can find compiled wheels under the wheels directory. To install, use e.g. pip3 install tensorflow-0.8.0rc0-py3-none-any.whl, and be sure to add export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/cuda/lib64" to your .bashrc. Note that this still requires you to install CUDA 7.5 and cuDNN 7.0 under /usr/local/cuda.

CudaSift - A CUDA implementation of SIFT for NVidia GPUs (1.6 ms on a GTX 1060)

  •    Cuda

This is the fourth version of a SIFT (Scale Invariant Feature Transform) implementation using CUDA for GPUs from NVidia. The first version is from 2007 and GPUs have evolved since then. This version is slightly more precise and considerably faster than the previous versions and has been optimized for Kepler and later generations of GPUs. On a GTX 1060 GPU the code takes about 1.6 ms on a 1280x960 pixel image and 2.4 ms on a 1920x1080 pixel image. There is also code for brute-force matching of features that takes about 2.2 ms for two sets of around 1900 SIFT features each.

cuda-convnet2 - Automatically exported from

  •    Cuda

Automatically exported from

ArrayFire - Parallel Computing Library

  •    C++

ArrayFire is a high performance software library for parallel computing with an easy-to-use API. Its array based function set makes parallel programming simple. ArrayFire's multiple backends (CUDA, OpenCL and native CPU) make it platform independent and highly portable. A few lines of code in ArrayFire can replace dozens of lines of parallel computing code, saving you valuable time and lowering development costs.

node-cuda - NVIDIA CUDA™ bindings for Node.js

  •    C++

NVIDIA CUDA™ bindings for Node.js

cutorch - A CUDA backend for Torch7

  •    Cuda

Cutorch provides a CUDA backend for torch7. Note: these are currently limited to copying/conversion, and several indexing and shaping operations (e.g. narrow, select, unfold, transpose).

vexcl - VexCL is a C++ vector expression template library for OpenCL/CUDA

  •    C++

VexCL is a vector expression template library for OpenCL/CUDA. It has been created for ease of GPGPU development with C++. VexCL strives to reduce the amount of boilerplate code needed to develop GPGPU applications. The library provides convenient and intuitive notation for vector arithmetic, reduction, sparse matrix-vector products, etc. Multi-device and even multi-platform computations are supported. The source code of the library is distributed under the very permissive MIT license.

kmcuda - Large scale K-means and K-nn implementation on NVIDIA GPU / CUDA

  •    Jupyter

K-means implementation is based on "Yinyang K-Means: A Drop-In Replacement of the Classic K-Means with Consistent Speedup". While it introduces some overhead and many conditional clauses which are bad for CUDA, it still shows 1.6-2x speedup against the Lloyd algorithm. K-nearest neighbors employ the same triangle inequality idea and require precalculated centroids and cluster assignments, similar to the flattened ball tree. Technically, this project is a shared library which exports two functions defined in kmcuda.h: kmeans_cuda and knn_cuda. It has built-in Python3 and R native extension support, so you can from libKMCUDA import kmeans_cuda or dyn.load("").
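The triangle inequality idea mentioned above can be sketched in a few lines of Go. This is neither kmcuda's code nor the full Yinyang algorithm; it shows only the basic pruning lemma that bound-based k-means variants build on: if d(c_best, c_j) >= 2·d(x, c_best), then centroid c_j cannot be closer to x than c_best, so d(x, c_j) need not be computed. The names dist and nearest are hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// dist returns the Euclidean distance between two points of equal dimension.
func dist(a, b []float64) float64 {
	var s float64
	for i := range a {
		d := a[i] - b[i]
		s += d * d
	}
	return math.Sqrt(s)
}

// nearest returns the index of the centroid closest to x and the number of
// point-to-centroid distance computations that were skipped by pruning.
// Centroid-to-centroid distances are computed inline for brevity; a real
// implementation would precompute them once per iteration.
func nearest(x []float64, centroids [][]float64) (best int, skipped int) {
	bestDist := dist(x, centroids[0])
	for j := 1; j < len(centroids); j++ {
		// Triangle inequality: d(x, cj) >= d(cbest, cj) - d(x, cbest).
		// If d(cbest, cj) >= 2*d(x, cbest), then d(x, cj) >= d(x, cbest).
		if dist(centroids[best], centroids[j]) >= 2*bestDist {
			skipped++
			continue
		}
		if d := dist(x, centroids[j]); d < bestDist {
			best, bestDist = j, d
		}
	}
	return best, skipped
}

func main() {
	centroids := [][]float64{{0, 0}, {1.5, 0}, {10, 0}}
	best, skipped := nearest([]float64{1, 0}, centroids)
	fmt.Println(best, skipped) // prints 1 1
}
```

In the example, the distance to the far centroid at (10, 0) is never computed, because its distance from the current best centroid already rules it out.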

cnn-benchmarks - Benchmarks for popular CNN models

  •    Python

Benchmarks for popular convolutional neural network models on CPU and different GPUs, with and without cuDNN. All benchmarks were run in Torch. The GTX 1080 and Maxwell Titan X benchmarks were run on a machine with dual Intel Xeon E5-2630 v3 processors (8 cores each plus hyperthreading means 32 threads) and 64GB RAM running Ubuntu 14.04 with the CUDA 8.0 Release Candidate. The Pascal Titan X benchmarks were run on a machine with an Intel Core i5-6500 CPU and 16GB RAM running Ubuntu 16.04 with the CUDA 8.0 Release Candidate. The GTX 1080 Ti benchmarks were run on a machine with an Intel Core i7-7700 CPU and 64GB RAM running Ubuntu 16.04 with the CUDA 8.0 release.

cudahandbook - Source code that accompanies The CUDA Handbook.

  •    Cuda

Source code that accompanies The CUDA Handbook.

gunrock - High-Performance Graph Primitives on GPUs

  •    Cuda

Gunrock is a CUDA library for graph-processing designed specifically for the GPU. It uses a high-level, bulk-synchronous, data-centric abstraction focused on operations on a vertex or edge frontier. Gunrock achieves a balance between performance and expressiveness by coupling high performance GPU computing primitives and optimization strategies with a high-level programming model that allows programmers to quickly develop new graph primitives with small code size and minimal GPU programming knowledge. For more details, please visit our website, read Why Gunrock, our TOPC 2017 paper Gunrock: GPU Graph Analytics, look at our results, and find more details in our publications. See Release Notes to keep up with our latest changes.

dockerfiles - Compilation of Dockerfiles with automated builds enabled on the Docker Registry

  •    Dockerfile

Compilation of Dockerfiles with automated builds enabled on the Docker Hub. Not suitable for production environments. These images are under continuous development, so breaking changes may be introduced. Nearly all images are based on Ubuntu Core 14.04 LTS, built with minimising size/layers and best practices in mind. Dependencies are indicated left to right e.g. cuda-vnc is VNC built on top of CUDA. Explicit dependencies are excluded.

gorgonia - Gorgonia is a library that helps facilitate machine learning in Go.

  •    Go

Gorgonia is a library that helps facilitate machine learning in Go. Write and evaluate mathematical equations involving multidimensional arrays easily. If this sounds like Theano or TensorFlow, it's because the idea is quite similar. Specifically, the library is pretty low-level, like Theano, but has higher goals like TensorFlow. The main reason to use Gorgonia is developer comfort. If you're using a Go stack extensively, you can now create production-ready machine learning systems in an environment that you are already familiar and comfortable with.