tvm - bring deep learning workloads to bare metal

  •        93

TVM is a Tensor intermediate representation(IR) stack for deep learning systems. It is designed to close the gap between the productivity-focused deep learning frameworks, and the performance- and efficiency-focused hardware backends. TVM works with deep learning frameworks to provide end to end compilation to different backends. Checkout our announcement for more details.© Contributors, 2017. Licensed under an Apache-2.0 license.



Related Projects

nnvm - Bring deep learning to bare metal

  •    C++

The following code snippet demonstrates the general workflow of nnvm compiler.Licensed under an Apache-2.0 license.

Arraymancer - A fast, ergonomic and portable tensor library in Nim with a deep learning focus for CPU, GPU, OpenCL and embedded devices

  •    Nim

Arraymancer is a tensor (N-dimensional array) project in Nim. The main focus is providing a fast and ergonomic CPU, Cuda and OpenCL ndarray library on which to build a scientific computing and in particular a deep learning ecosystem. The library is inspired by Numpy and PyTorch. The library provides ergonomics very similar to Numpy, Julia and Matlab but is fully parallel and significantly faster than those libraries. It is also faster than C-based Torch.

mshadow - Matrix Shadow:Lightweight CPU/GPU Matrix and Tensor Template Library in C++/CUDA for (Deep) Machine Learning

  •    C++

MShadow is a lightweight CPU/GPU Matrix/Tensor Template Library in C++/CUDA. The goal of mshadow is to support efficient, device invariant and simple tensor library for machine learning project that aims for maximum performance and control, while also emphasize simplicity.MShadow also provides interface that allows writing Multi-GPU and distributed deep learning programs in an easy and unified way.

PyTorch - Tensors and Dynamic neural networks in Python with strong GPU acceleration

  •    Python

PyTorch is a deep learning framework that puts Python first. It is a python package that provides Tensor computation (like numpy) with strong GPU acceleration, Deep Neural Networks built on a tape-based autograd system. You can reuse your favorite python packages such as numpy, scipy and Cython to extend PyTorch when needed.

Amazon DSSTNE: Deep Scalable Sparse Tensor Network Engine

  •    C++

DSSTNE (pronounced "Destiny") is an open source software library for training and deploying recommendation models with sparse inputs, fully connected hidden layers, and sparse outputs. Models with weight matrices that are too large for a single GPU can still be trained on a single host. DSSTNE has been used at Amazon to generate personalized product recommendations for our customers at Amazon's scale.

ROCm - ROCm - Open Source Platform for HPC and Ultrascale GPU Computing


The ROCm Platform brings a rich foundation to advanced computing by seamlessly integrating the CPU and GPU with the goal of solving real-world problems. This software enables the high-performance operation of AMD GPUs for computationally oriented tasks in the Linux operating system. ROCm is focused on using AMD GPUs to accelerate computational tasks such as machine learning, engineering workloads, and scientific computing. In order to focus our development efforts on these domains of interest, ROCm supports a targeted set of hardware configurations which are detailed further in this section.

swift - Swift for TensorFlow project home page


Swift for TensorFlow is a new way to develop machine learning models. It gives you the power of TensorFlow directly integrated into the Swift programming language. With Swift, you can write the following imperative code, and Swift automatically turns it into a single TensorFlow Graph and runs it with the full performance of TensorFlow Sessions on CPU, GPU and TPU. Swift combines the flexibility of Eager Execution with the high performance of Graphs and Sessions. Behind the scenes, Swift analyzes your Tensor code and automatically builds graphs for you. Swift also catches type errors and shape mismatches before running your code, and has Automatic Differentiation built right in. We believe that machine learning tools are so important that they deserve a first-class language and a compiler.

Kubernetes-GPU-Guide - This guide should help fellow researchers and hobbyists to easily automate and accelerate there deep leaning training with their own Kubernetes GPU cluster

  •    Shell

This guide should help fellow researchers and hobbyists to easily automate and accelerate there deep leaning training with their own Kubernetes GPU cluster. Therefore I will explain how to easily setup a GPU cluster on multiple Ubuntu 16.04 bare metal servers and provide some useful scripts and .yaml files that do the entire setup for you. By the way: If you need a Kubernetes GPU-cluster for other reasons, this guide might be helpful to you as well.

hcc - HCC is an Open Source, Optimizing C++ Compiler for Heterogeneous Compute currently for the ROCm GPU Computing Platform

  •    C++

This repository hosts the HCC compiler implementation project. The goal is to implement a compiler that takes a program that conforms to a parallel programming standard such as HC, C++ 17 ParallelSTL and transforms it into the AMD GCN ISA. AMD is deprecating HCC to put more focus on HIP development and on other languages supporting heterogeneous compute. We will no longer develop any new feature in HCC and we will stop maintaining HCC after its final release, which is planned for June 2019. If your application was developed with the hc C++ API, we would encourage you to transition it to other languages supported by AMD, such as HIP or OpenCL. HIP and hc language share the same compiler technology, so many hc kernel language features (including inline assembly) are also available through the HIP compilation path.



testify is a testing automation framework for embeded systems. It contains a DSL-Tata and its compiler, a virtual machine TVM and a Notepad++ based IDE which makes the Tata programming a piece of cake.

emu - a language for programming GPUs, with a focus on ergonomics first and performance second

  •    Rust

⚠ Please note that while Emu 0.2.0 is quite usable, it suffers from 2 key issues. It firstly does nothing to minimize CPU-GPU data transfer and secondly it's compiler is not well-tested. These can be reasons not to use Emu 0.2.0. A new version of Emu is in the works, however, with significant improvements in the language, compiler, and compile-time checker. This new version of Emu should be released some time in Q4 of 2019. But unlike OpenCL/CUDA/Halide/Futhark, Emu is embedded in Rust. This lets it take advantage of the ecosystem in ways...

ngraph - nGraph is an open source C++ library, compiler and runtime for Deep Learning frameworks

  •    C++

Welcome to the open-source repository for the Intel® nGraph™ Library. Our code base provides a Compiler and runtime suite of tools (APIs) designed to give developers maximum flexibility for their software design, allowing them to create or customize a scalable solution using any framework while also avoiding device-level hardware lock-in that is so common with many AI vendors. A neural network model compiled with nGraph can run on any of our currently-supported backends, and it will be able to run on any backends we support in the future with minimal disruption to your model. With nGraph, you can co-evolve your software and hardware's capabilities to stay at the forefront of your industry. The nGraph Compiler is Intel's graph compiler for Artificial Neural Networks. Documentation in this repo describes how you can program any framework to run training and inference computations on a variety of Backends including Intel® Architecture Processors (CPUs), Intel® Nervana™ Neural Network Processors (NNPs), cuDNN-compatible graphics cards (GPUs), custom VPUs like Movidius, and many others. The default CPU Backend also provides an interactive Interpreter mode that can be used to zero in on a DL model and create custom nGraph optimizations that can be used to further accelerate training or inference, in whatever scenario you need.

DeepLearning.scala - A simple library for creating complex neural networks

  •    Scala

DeepLearning.scala is a simple library for creating complex neural networks from object-oriented and functional programming constructs. Like other deep learning toolkits, DeepLearning.scala allows you to build neural networks from mathematical formulas. It supports floats, doubles, GPU-accelerated N-dimensional arrays, and calculates derivatives of the weights in the formulas.

glsl-optimizer - GLSL optimizer based on Mesa's GLSL compiler

  •    C++

A C++ library that takes GLSL shaders, does some GPU-independent optimizations on them and outputs GLSL or Metal source back. Optimizations are function inlining, dead code removal, copy propagation, constant folding, constant propagation, arithmetic optimizations and so on. Apparently quite a few mobile platforms are pretty bad at optimizing shaders; and unfortunately they also lack offline shader compilers. So using a GLSL optimizer offline before can make the shader run much faster on a platform like that. See performance numbers in this blog post.

GPUImage3 - GPUImage 3 is a BSD-licensed Swift framework for GPU-accelerated video and image processing using Metal

  •    Swift

GPUImage 3 is the third generation of the GPUImage framework, an open source project for performing GPU-accelerated image and video processing on Mac and iOS. The original GPUImage framework was written in Objective-C and targeted Mac and iOS, the second iteration rewritten in Swift using OpenGL to target Mac, iOS, and Linux, and now this third generation is redesigned to use Metal in place of OpenGL. The objective of the framework is to make it as easy as possible to set up and perform realtime video processing or machine vision against image or video sources. Previous iterations of this framework wrapped OpenGL (ES), hiding much of the boilerplate code required to render images on the GPU using custom vertex and fragment shaders. This version of the framework replaces OpenGL (ES) with Metal. Largely driven by Apple's deprecation of OpenGL (ES) on their platforms in favor of Metal, it will allow for exploring performance optimizations over OpenGL and a tighter integration with Metal-based frameworks and operations.

DIGITS - Deep Learning GPU Training System

  •    HTML

DIGITS (the Deep Learning GPU Training System) is a webapp for training deep learning models. The currently supported frameworks are: Caffe, Torch, and Tensorflow. Once you have installed DIGITS, visit docs/ for an introductory walkthrough.

chainer - A flexible framework of neural networks for deep learning

  •    Python

Chainer is a Python-based deep learning framework aiming at flexibility. It provides automatic differentiation APIs based on the define-by-run approach (a.k.a. dynamic computational graphs) as well as object-oriented high-level APIs to build and train neural networks. It also supports CUDA/cuDNN using CuPy for high performance training and inference. For more details of Chainer, see the documents and resources listed above and join the community in Forum, Slack, and Twitter. The stable version of current Chainer is separated in here: v3.

jetson-reinforcement - Deep reinforcement learning GPU libraries for NVIDIA Jetson with PyTorch, OpenAI Gym, and Gazebo robotics simulator

  •    C++

In this tutorial, we'll be creating artificially intelligent agents that learn from interacting with their environment, gathering experience, and a system of rewards with deep reinforcement learning (deep RL). Using end-to-end neural networks that translate raw pixels into actions, RL-trained agents are capable of exhibiting intuitive behaviors and performing complex tasks. Ultimately, our aim will be to train reinforcement learning agents from virtual robotic simulation in 3D and transfer the agent to a real-world robot. Reinforcement learners choose the best action for the agent to perform based on environmental state (like camera inputs) and rewards that provide feedback to the agent about it's performance. Reinforcement learning can learn to behave optimally in it's environment given a policy, or task - like obtaining the reward.

Flux.jl - Relax! Flux is the ML library that doesn't make you tensor

  •    Julia

Flux is an elegant approach to machine learning. It's a 100% pure-Julia stack, and provides lightweight abstractions on top of Julia's native GPU and AD support. Flux makes the easy things easy while remaining fully hackable. See the documentation or the model zoo for examples.