ssd-gpu-dma - Build userspace NVMe drivers and storage applications with CUDA support

  •        100

This library is a userspace API implemented in C for writing custom NVM Express (NVMe) drivers and high-performance storage applications. The API provides simple semantics and functions which a userspace program can use to control or manage one or more NVMe disk controllers. The API is in essence similar to SPDK, in that it moves driver code to userspace and relies on hardware polling rather than being interrupt driven. By mapping userspace memory directly, libnvm eliminates the cost of context switching into kernel space and enables zero-copy access from userspace. This greatly reduces the latency of IO operations compared to accessing storage devices through normal file system abstractions provided by the Linux kernel.
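The polling model mentioned above can be illustrated with a toy completion queue. This is only a sketch in plain Python, not libnvm's API: the class, its fields, and the `device_post` helper are hypothetical names invented for the example. It assumes the NVMe convention that each completion entry carries a phase bit which flips every time the queue wraps, so the host can detect new entries by reading memory instead of waiting for an interrupt.

```python
class CompletionQueue:
    """Toy ring buffer standing in for an NVMe completion queue (hypothetical)."""

    def __init__(self, size):
        self.entries = [{"phase": 0, "command_id": None} for _ in range(size)]
        self.head = 0            # next slot the host will inspect
        self.phase = 1           # phase value that marks a fresh entry
        self._tail = 0           # device-side write position
        self._device_phase = 1   # phase value the device currently writes

    def device_post(self, command_id):
        """Simulate the controller writing one completion entry."""
        self.entries[self._tail] = {"phase": self._device_phase,
                                    "command_id": command_id}
        self._tail += 1
        if self._tail == len(self.entries):
            self._tail = 0
            self._device_phase ^= 1  # phase flips on every wrap

    def poll(self):
        """Return a completed command id, or None if nothing is ready."""
        entry = self.entries[self.head]
        if entry["phase"] != self.phase:
            return None              # slot still holds a stale entry
        self.head += 1
        if self.head == len(self.entries):
            self.head = 0
            self.phase ^= 1          # expect the flipped phase after a wrap
        return entry["command_id"]


cq = CompletionQueue(size=2)
cq.device_post(7)
while (done := cq.poll()) is None:   # busy-poll instead of sleeping on an interrupt
    pass
print("completed command", done)
```

The sketch ignores queue overrun and doorbell updates; it only shows why a polling host never needs to enter the kernel to learn that a command has finished.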



Related Projects

CudaSift - A CUDA implementation of SIFT for NVidia GPUs (1.6 ms on a GTX 1060)

  •    Cuda

This is the fourth version of a SIFT (Scale Invariant Feature Transform) implementation using CUDA for GPUs from NVidia. The first version is from 2007 and GPUs have evolved since then. This version is slightly more precise and considerably faster than the previous versions and has been optimized for Kepler and later generations of GPUs. On a GTX 1060 GPU the code takes about 1.6 ms on a 1280x960 pixel image and 2.4 ms on a 1920x1080 pixel image. There is also code for brute-force matching of features that takes about 2.2 ms for two sets of around 1900 SIFT features each.

coriander - Build NVIDIA® CUDA™ code for OpenCL™ 1.2 devices

  •    LLVM

Build applications written in NVIDIA® CUDA™ code for OpenCL™ 1.2 devices. Other systems should work too, ideally. At a minimum you will need one OpenCL-enabled GPU with appropriate OpenCL drivers installed. Both Linux and Mac systems stand a reasonable chance of working.

gunrock - High-Performance Graph Primitives on GPUs

  •    Cuda

Gunrock is a CUDA library for graph processing designed specifically for the GPU. It uses a high-level, bulk-synchronous, data-centric abstraction focused on operations on a vertex or edge frontier. Gunrock achieves a balance between performance and expressiveness by coupling high-performance GPU computing primitives and optimization strategies with a high-level programming model that allows programmers to quickly develop new graph primitives with small code size and minimal GPU programming knowledge. For more details, please visit our website, read Why Gunrock, our TOPC 2017 paper Gunrock: GPU Graph Analytics, look at our results, and find more details in our publications. See Release Notes to keep up with our latest changes.
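To make the "frontier" idea concrete, here is a minimal sketch of a level-synchronous breadth-first search. It is plain sequential Python, not Gunrock code and not its API; the function name and the dictionary-based graph are assumptions made for the example. Each loop iteration processes the whole current frontier as one bulk step and produces the next frontier, which is the pattern a GPU library would run in parallel.

```python
def bfs_levels(adjacency, source):
    """Level-synchronous BFS: expand the whole frontier in one bulk step."""
    depth = {source: 0}
    frontier = [source]
    level = 0
    while frontier:
        level += 1
        next_frontier = []
        # Expand: visit every edge leaving the current frontier.
        for u in frontier:
            for v in adjacency.get(u, []):
                # Filter: keep only vertices that have not been seen before.
                if v not in depth:
                    depth[v] = level
                    next_frontier.append(v)
        frontier = next_frontier
    return depth


graph = {0: [1, 2], 1: [3], 2: [3], 3: [4], 4: []}
print(bfs_levels(graph, 0))  # {0: 0, 1: 1, 2: 1, 3: 2, 4: 3}
```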

kmcuda - Large scale K-means and K-nn implementation on NVIDIA GPU / CUDA

  •    Jupyter

K-means implementation is based on "Yinyang K-Means: A Drop-In Replacement of the Classic K-Means with Consistent Speedup". While it introduces some overhead and many conditional clauses which are bad for CUDA, it still shows 1.6-2x speedup against the Lloyd algorithm. K-nearest neighbors employ the same triangle inequality idea and require precalculated centroids and cluster assignments, similar to the flattened ball tree. Technically, this project is a shared library which exports two functions defined in kmcuda.h: kmeans_cuda and knn_cuda. It has built-in Python3 and R native extension support, so you can from libKMCUDA import kmeans_cuda or dyn.load("").
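The triangle inequality idea can be sketched as follows. This is not the Yinyang algorithm and not kmcuda's code; it is a much simpler, assumed illustration of the underlying bound: if the distance between the current best centroid and another centroid is at least twice the point's distance to the best one, the other centroid cannot be closer, so its distance never has to be computed. All function names are hypothetical.

```python
import math


def dist(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign_with_pruning(points, centroids):
    """Assign each point to its nearest centroid, skipping provably farther ones."""
    k = len(centroids)
    # Centroid-to-centroid distances are computed once per iteration.
    cc = [[dist(centroids[i], centroids[j]) for j in range(k)] for i in range(k)]
    assignments, skipped = [], 0
    for p in points:
        best, best_d = 0, dist(p, centroids[0])
        for j in range(1, k):
            # d(p, c_j) >= d(c_best, c_j) - d(p, c_best) >= d(p, c_best)
            if cc[best][j] >= 2.0 * best_d:
                skipped += 1
                continue
            d = dist(p, centroids[j])
            if d < best_d:
                best, best_d = j, d
        assignments.append(best)
    return assignments, skipped


points = [(0.0, 0.0), (0.1, 0.0), (10.0, 10.0), (10.2, 9.9)]
centroids = [(0.0, 0.0), (10.0, 10.0)]
print(assign_with_pruning(points, centroids))
```

The pruned assignment always matches a brute-force nearest-centroid search; only the number of distance evaluations changes.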

TinyNvidiaUpdateChecker - Check for NVIDIA GPU driver updates!

  •    CSharp

This application has a simple concept: when launched, it checks for new driver updates for your NVIDIA GPU. With this you no longer need to waste your time searching for new releases. HTML Agility Pack will automatically install when attempting to debug the project (make sure you're running the latest version of VS2017), or you may manually install it by doing the following: Open up your Package Manager Console and type in Install-Package HtmlAgilityPack.


  •    DotNet

managedCUDA makes the CUDA Driver API available in .NET applications written in C#, Visual Basic or any other .NET language. It also includes classes for easy handling of and interop with CUDA, e.g. built-in CUDA types like float3.

Arraymancer - A fast, ergonomic and portable tensor library in Nim with a deep learning focus for CPU, GPU, OpenCL and embedded devices

  •    Nim

Arraymancer is a tensor (N-dimensional array) project in Nim. The main focus is providing a fast and ergonomic CPU, Cuda and OpenCL ndarray library on which to build a scientific computing and in particular a deep learning ecosystem. The library is inspired by Numpy and PyTorch. The library provides ergonomics very similar to Numpy, Julia and Matlab but is fully parallel and significantly faster than those libraries. It is also faster than C-based Torch.



Before you begin, you may need to disable nouveau, the open-source NVIDIA driver that ships with Ubuntu. If the nouveau driver is still loaded, do not proceed with the installation guide; troubleshoot why it is still loaded first.
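One way to check whether nouveau is still loaded is to look at the kernel's module list. The sketch below is an assumption about how such a check could be scripted, not part of the guide itself: it relies on the Linux `/proc/modules` format, where the first field of each line is the module name, and the helper name is hypothetical.

```python
def is_module_loaded(proc_modules_text, module_name):
    """Return True if module_name appears as a loaded module in /proc/modules text."""
    for line in proc_modules_text.splitlines():
        fields = line.split()
        # The first whitespace-separated field is the module name.
        if fields and fields[0] == module_name:
            return True
    return False


# Sample text in the /proc/modules layout (values are made up for illustration).
sample = (
    "nouveau 1941504 1 - Live 0x0000000000000000\n"
    "ttm 106496 1 nouveau, Live 0x0000000000000000\n"
)
print(is_module_loaded(sample, "nouveau"))
```

On a real Linux machine the text would come from `open("/proc/modules").read()`.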

fatcache - Memcache on SSD

  •    C

fatcache is memcache on SSD. Think of fatcache as a cache for your big data. There are two ways to think of SSDs in system design. One is to think of an SSD as an extension of disk, where it plays the role of making disks fast; the other is to think of it as an extension of memory, where it plays the role of making memory fat. The latter makes sense when persistence (non-volatility) is unnecessary and data is accessed over the network. Even though memory is a thousand times faster than SSD, network-connected SSD-backed memory makes sense if we design the system so that network latencies dominate the SSD latencies by a large factor.

emu - a language for programming GPUs, with a focus on ergonomics first and performance second

  •    Rust

⚠ Please note that while Emu 0.2.0 is quite usable, it suffers from 2 key issues. Firstly, it does nothing to minimize CPU-GPU data transfer, and secondly, its compiler is not well-tested. These can be reasons not to use Emu 0.2.0. A new version of Emu is in the works, however, with significant improvements in the language, compiler, and compile-time checker. This new version of Emu should be released some time in Q4 of 2019. But unlike OpenCL/CUDA/Halide/Futhark, Emu is embedded in Rust. This lets it take advantage of the ecosystem in ways...

scikit-cuda - Python interface to GPU-powered libraries

  •    Python

scikit-cuda provides Python interfaces to many of the functions in the CUDA device/runtime, CUBLAS, CUFFT, and CUSOLVER libraries distributed as part of NVIDIA's CUDA Programming Toolkit, as well as interfaces to select functions in the CULA Dense Toolkit. Both low-level wrapper functions similar to their C counterparts and high-level functions comparable to those in NumPy and SciPy are provided. Package documentation is available online. Many of the high-level functions have examples in their docstrings. More illustrations of how to use both the wrappers and high-level functions can be found in the demos/ and tests/ subdirectories.

imagenet-multiGPU.torch - an imagenet example in torch.

  •    Lua

If your ImageNet dataset is on HDD or a slow SSD, run this command to resize all the images such that the smaller dimension is 256 and the aspect ratio is intact. This helps with loading the data from disk faster. To run the training, simply run main.lua. By default, the script runs 1-GPU AlexNet with the CuDNN backend and 2 data-loader threads.
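The resize rule described above (smaller dimension becomes 256, aspect ratio preserved) comes down to a single scale factor. The sketch below only computes the target size; it is an illustrative helper with a hypothetical name, not the command the project uses, and it assumes rounding to the nearest pixel.

```python
def resized_dimensions(width, height, short_side=256):
    """Return (new_width, new_height) with the shorter side scaled to short_side."""
    scale = short_side / min(width, height)
    if width <= height:
        # Width is the shorter side: pin it and scale the height.
        return short_side, max(1, round(height * scale))
    # Height is the shorter side: pin it and scale the width.
    return max(1, round(width * scale)), short_side


print(resized_dimensions(500, 375))    # (341, 256)
print(resized_dimensions(1920, 1080))  # (455, 256)
```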

neanderthal - Fast Clojure Matrix Library

  •    Clojure

Neanderthal is a Clojure library for fast matrix and linear algebra computations based on the highly optimized native libraries of BLAS and LAPACK computation routines for both CPU and GPU. Read the documentation at the Neanderthal web site.

gloo - Collective communications library with various primitives for multi-machine training.

  •    C++

Gloo is a collective communications library. It comes with a number of collective algorithms useful for machine learning applications. These include a barrier, broadcast, and allreduce. Transport of data between participating machines is abstracted so that IP can be used at all times, or InfiniBand (or RoCE) when available. In the latter case, if the InfiniBand transport is used, GPUDirect can be used to accelerate cross-machine GPU-to-GPU memory transfers.
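For readers new to collectives, the sketch below shows what broadcast and allreduce compute, using in-process lists to stand in for the buffers held by each participating machine. It says nothing about how Gloo implements or transports them, and the function names are hypothetical, not Gloo's API.

```python
def broadcast(buffers, root):
    """Every rank ends up with a copy of the root rank's buffer."""
    return [list(buffers[root]) for _ in buffers]


def allreduce_sum(buffers):
    """Every rank ends up with the element-wise sum over all ranks' buffers."""
    total = [sum(values) for values in zip(*buffers)]
    return [list(total) for _ in buffers]


# Three simulated ranks, each holding a two-element gradient.
gradients = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(allreduce_sum(gradients))  # [[9.0, 12.0], [9.0, 12.0], [9.0, 12.0]]
print(broadcast(gradients, root=1))  # [[3.0, 4.0], [3.0, 4.0], [3.0, 4.0]]
```

In data-parallel training, an allreduce of this kind is what lets every worker apply the same summed gradient after each step.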

Kubernetes-GPU-Guide - This guide should help fellow researchers and hobbyists to easily automate and accelerate their deep learning training with their own Kubernetes GPU cluster

  •    Shell

This guide should help fellow researchers and hobbyists to easily automate and accelerate their deep learning training with their own Kubernetes GPU cluster. To that end, I will explain how to easily set up a GPU cluster on multiple Ubuntu 16.04 bare metal servers and provide some useful scripts and .yaml files that do the entire setup for you. By the way: If you need a Kubernetes GPU cluster for other reasons, this guide might be helpful to you as well.



ttgLib is a C++ library for creating parallel, resource-intensive programs for hybrid architectures such as CPU+GPU. The library provides the ttg::pipeline parallel primitive, which distributes load intelligently across different computing APIs such as OpenMP, Intel TBB, NVidia CUDA and OpenCL.

nvidia-docker - Build and run Docker containers leveraging NVIDIA GPUs

  •    Makefile

The full documentation and frequently asked questions are available on the repository wiki. An introduction to the NVIDIA Container Runtime is also covered in our blog post.

xmrig-nvidia - Monero (XMR) NVIDIA miner

  •    C++

⚠️ You must update miners to version 2.5 before April 6 due to the Monero PoW change. XMRig is a high-performance Monero (XMR) NVIDIA miner with official full Windows support.

Deep-Learning-Boot-Camp - A community run, 5-day PyTorch Deep Learning Bootcamp

  •    Jupyter

Tel-Aviv Deep Learning Bootcamp is an intensive (and free!) 5-day program intended to teach you all about deep learning. It is a nonprofit program focused on advancing data science education and fostering entrepreneurship. The Bootcamp is a prominent venue for graduate students, researchers, and data science professionals. It offers a chance to study the essential and innovative aspects of deep learning. Participation is via a donation to the A.L.S ASSOCIATION for promoting research of the Amyotrophic Lateral Sclerosis (ALS) disease.