Umpire - An application-focused API for memory management on NUMA & GPU architectures

  •    C++

Umpire is a resource management library that allows the discovery, provision, and management of memory on next-generation architectures. It is built with CMake, and more advanced configuration is available through standard CMake variables.

https://github.com/LLNL/Umpire
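
A minimal sketch of the application-facing API, assuming Umpire's documented ResourceManager/Allocator interface (the "HOST" resource name is one of the library's built-in resources; treat this as an illustration, not a definitive reference):

    #include "umpire/ResourceManager.hpp"
    #include "umpire/Allocator.hpp"

    int main() {
      // the ResourceManager is Umpire's singleton entry point for
      // discovering and handing out memory resources
      auto& rm = umpire::ResourceManager::getInstance();

      // "HOST" is a built-in resource; on GPU builds, "DEVICE" targets
      // GPU memory through the same interface
      umpire::Allocator alloc = rm.getAllocator("HOST");

      // allocate and release 1024 doubles through one application-focused
      // API, regardless of where the memory actually lives
      double* data = static_cast<double*>(alloc.allocate(1024 * sizeof(double)));
      alloc.deallocate(data);
      return 0;
    }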

Related Projects

compute - A C++ GPU Computing Library for OpenCL

  •    C++

Boost.Compute is a GPU/parallel-computing library for C++ based on OpenCL. The core library is a thin C++ wrapper over the OpenCL API and provides access to compute devices, contexts, command queues and memory buffers.
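
A short sketch of that thin wrapper in use, close to the library's canonical transform example (exact headers assumed): a device, context, and command queue are set up, data is moved into a device memory buffer, and a parallel transform runs on the compute device.

    #include <vector>
    #include <boost/compute/core.hpp>
    #include <boost/compute/algorithm/copy.hpp>
    #include <boost/compute/algorithm/transform.hpp>
    #include <boost/compute/container/vector.hpp>
    #include <boost/compute/functional/math.hpp>

    namespace compute = boost::compute;

    int main() {
        // get the default compute device and set up a context and queue
        compute::device device = compute::system::default_device();
        compute::context context(device);
        compute::command_queue queue(context, device);

        std::vector<float> host_data(16, 2.0f);

        // memory buffer on the device
        compute::vector<float> device_data(host_data.size(), context);
        compute::copy(host_data.begin(), host_data.end(),
                      device_data.begin(), queue);

        // run sqrt over the buffer on the compute device
        compute::transform(device_data.begin(), device_data.end(),
                           device_data.begin(), compute::sqrt<float>(), queue);

        compute::copy(device_data.begin(), device_data.end(),
                      host_data.begin(), queue);
        return 0;
    }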

AresDB - A GPU-powered real-time analytics storage and query engine

  •    Go

AresDB is a GPU-powered real-time analytics storage and query engine. It features low query latency, high data freshness, and highly efficient in-memory and on-disk storage management.

aws-parallelcluster - AWS ParallelCluster is an AWS supported Open Source cluster management tool to deploy and manage HPC clusters in the AWS cloud

  •    Python

AWS ParallelCluster is an AWS supported Open Source cluster management tool that makes it easy for you to deploy and manage High Performance Computing (HPC) clusters in the AWS cloud. Built on the Open Source CfnCluster project, AWS ParallelCluster enables you to quickly build an HPC compute environment in AWS. It automatically sets up the required compute resources and a shared filesystem, and offers a variety of batch schedulers such as AWS Batch, SGE, Torque, and Slurm. AWS ParallelCluster facilitates both quick start proof of concepts (POCs) and production deployments. You can build higher level workflows, such as a Genomics portal that automates the entire DNA sequencing workflow, on top of AWS ParallelCluster. For more information, see the Getting Started Guide.

Knet.jl - Koç University deep learning framework.

  •    Julia

Knet uses dynamic computational graphs generated at runtime for automatic differentiation of (almost) any Julia code. This allows machine learning models to be implemented by defining just the forward calculation (i.e., the computation from parameters and data to loss) using the full power and expressivity of Julia. The implementation can use helper functions, loops, conditionals, recursion, closures, tuples and dictionaries, array indexing, concatenation and other high-level language features, some of which are often missing in the restricted modeling languages of static computational graph systems like Theano, Torch, Caffe and Tensorflow. GPU operation is supported by simply using the KnetArray type instead of the regular Array for parameters and data.

Knet builds its dynamic computational graph by recording primitive operations during the forward calculation. Only pointers to inputs and outputs are recorded for efficiency, so array overwriting is not supported during the forward and backward passes; this encourages a clean functional programming style. High performance is achieved using custom memory management and efficient GPU kernels. See Under the hood for more details.
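Knet itself is Julia, but the recording idea is language-agnostic. Here is a minimal C++ tape sketch of the same concept (an illustration only, not Knet's implementation): each primitive op records a pullback during the forward pass, and the backward pass replays them in reverse.

    #include <functional>
    #include <vector>
    #include <cstdio>

    struct Var { double value; double grad = 0.0; };

    // the tape: one recorded pullback per primitive operation
    struct Tape { std::vector<std::function<void()>> pullbacks; };

    // primitive op: computes a*b and records how to route gradients back
    Var* mul(Tape& tape, Var* a, Var* b) {
        Var* out = new Var{a->value * b->value};
        tape.pullbacks.push_back([=] {
            a->grad += out->grad * b->value;  // d(ab)/da = b
            b->grad += out->grad * a->value;  // d(ab)/db = a
        });
        return out;
    }

    Var* add(Tape& tape, Var* a, Var* b) {
        Var* out = new Var{a->value + b->value};
        tape.pullbacks.push_back([=] {
            a->grad += out->grad;
            b->grad += out->grad;
        });
        return out;
    }

    int main() {
        Tape tape;
        Var w{3.0}, x{2.0}, b{1.0};
        // forward pass: loss = w*x + b; the "graph" is whatever code ran
        Var* loss = add(tape, mul(tape, &w, &x), &b);

        // backward pass: replay recorded pullbacks in reverse order
        loss->grad = 1.0;
        for (auto it = tape.pullbacks.rbegin(); it != tape.pullbacks.rend(); ++it)
            (*it)();
        std::printf("dloss/dw = %g, dloss/dx = %g\n", w.grad, x.grad);  // 2, 3
        return 0;
    }

Note how the pullbacks only capture pointers to inputs and outputs, which is why overwriting an input array between forward and backward passes would corrupt the gradient, exactly the restriction the Knet description mentions.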

HPC with GPUs

  •    

High-performance computing (HPC) in Windows environments integrated with Graphics Processing Units (GPUs).


HPC with GPUs applied to CG

  •    

High-performance computing (HPC) in Windows environments integrated with Graphics Processing Units (GPUs), applied to Computer Graphics (CG).

nvParse - Fast, gpu-based CSV parser

  •    Cuda

Conventional CSV parsing is a sequential, single-threaded operation, so it doesn't take advantage of multiple cores of modern CPUs; nvParse does the work on the GPU instead, in essentially three lines of code. The first line counts the number of lines in a buffer (assuming the file has been read into memory and copied to the GPU buffer d_readbuff). The second line creates a vector in GPU memory that will hold the positions of the newline characters. The last line compares each character in the buffer to the newline character and, if a match is found, copies the position of the character to the dev_pos vector.
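A hedged reconstruction of those three lines using Thrust primitives that match the description (the names d_readbuff and dev_pos come from the text above; everything else is an assumption):

    #include <thrust/count.h>
    #include <thrust/copy.h>
    #include <thrust/device_ptr.h>
    #include <thrust/device_vector.h>
    #include <thrust/iterator/counting_iterator.h>

    // predicate: is this character a newline?
    struct is_newline {
        __host__ __device__ bool operator()(const char c) const {
            return c == '\n';
        }
    };

    // d_readbuff: device buffer holding the raw file, file_size bytes
    void index_lines(char* d_readbuff, int file_size) {
        thrust::device_ptr<char> buf(d_readbuff);

        // 1. count the number of lines in the buffer
        int cnt = static_cast<int>(thrust::count(buf, buf + file_size, '\n'));

        // 2. vector in GPU memory for the positions of newline characters
        thrust::device_vector<int> dev_pos(cnt);

        // 3. compare each character to '\n' and, on a match, copy its
        //    position (the counting iterator supplies positions; the
        //    buffer is the stencil the predicate is applied to)
        thrust::copy_if(thrust::make_counting_iterator(0),
                        thrust::make_counting_iterator(file_size),
                        buf, dev_pos.begin(), is_newline());
    }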

blt - Acquia's toolset for automating Drupal 8 development, testing, and deployment.

  •    PHP

BLT (Build and Launch Tool) provides an automation layer for testing, building, and launching Drupal 8 applications. See INSTALL.md for a list of prerequisites and links to instructions for creating new projects, adding BLT to existing projects, and updating BLT.

futhark - 💥💻💥 A data-parallel functional programming language

  •    Haskell

Futhark is a purely functional data-parallel programming language. Its optimising compiler compiles it to GPU code that is typically very performant. The language and compiler are developed at DIKU at the University of Copenhagen, originally as part of the HIPERFIT centre. Although still under heavy development, Futhark is already useful for practical high-performance programming. For more information, see the website.

ROCm - Open Source Platform for HPC and Ultrascale GPU Computing

  •    

The ROCm Platform brings a rich foundation to advanced computing by seamlessly integrating the CPU and GPU with the goal of solving real-world problems. This software enables the high-performance operation of AMD GPUs for computationally oriented tasks in the Linux operating system. ROCm is focused on using AMD GPUs to accelerate computational tasks such as machine learning, engineering workloads, and scientific computing. In order to focus development efforts on these domains of interest, ROCm supports a targeted set of hardware configurations, which are detailed in the ROCm documentation.

persistent-rnn - Fast Recurrent Networks Library

  •    C++

A fast implementation of recurrent neural network layers in CUDA. On a GPU, the largest source of on-chip memory is distributed among the individual register files of thousands of threads. For example, the NVIDIA TitanX GPU has 6.3 MB of register file memory, enough to store a recurrent layer with approximately 1200 activations: such a layer has about 1200² ≈ 1.44 million recurrent weights, or roughly 5.8 MB in single precision. Persistent kernels exploit this register file memory to cache the recurrent weights and reuse them over multiple timesteps.
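A rough CUDA sketch of the persistent-kernel idea, under strong simplifying assumptions (a single block, one thread per hidden unit, compile-time layer size so the per-thread weight row can stay in registers); this illustrates the weight-reuse structure only, not persistent-rnn's actual kernels:

    #include <cuda_runtime.h>
    #include <math.h>

    constexpr int N = 128;  // hidden units; one thread each, single block

    __global__ void rnn_persistent(const float* __restrict__ W,  // N x N
                                   float* h,                     // state, N
                                   int timesteps) {
        const int row = threadIdx.x;

        // cache this thread's row of the recurrent weight matrix once;
        // with a compile-time bound and full unrolling, the compiler can
        // keep it in registers instead of reloading from DRAM each step
        float w[N];
        #pragma unroll
        for (int j = 0; j < N; ++j) w[j] = W[row * N + j];

        __shared__ float state[N];
        state[row] = h[row];
        __syncthreads();

        // reuse the cached weights across all timesteps
        for (int t = 0; t < timesteps; ++t) {
            float acc = 0.0f;
            #pragma unroll
            for (int j = 0; j < N; ++j) acc += w[j] * state[j];
            __syncthreads();          // all threads have read the old state
            state[row] = tanhf(acc);  // write the new state
            __syncthreads();
        }
        h[row] = state[row];
    }

    // launch: rnn_persistent<<<1, N>>>(d_W, d_h, T);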

gradient-checkpointing - Make huge neural nets fit in memory

  •    Python

Training very deep neural networks requires a lot of memory. Using the tools in this package, developed jointly by Tim Salimans and Yaroslav Bulatov, you can trade some of this memory usage for computation to make your model fit into memory more easily. For feed-forward models we were able to fit more than 10x larger models onto our GPU, at only a 20% increase in computation time.

The memory-intensive part of training deep neural networks is computing the gradient of the loss by backpropagation. By checkpointing nodes in the computation graph defined by your model, and recomputing the parts of the graph between those nodes during backpropagation, it is possible to calculate this gradient at reduced memory cost. When training deep feed-forward neural networks of n layers, we can reduce the memory consumption to O(sqrt(n)) in this way, at the cost of performing one additional forward pass (see e.g. Training Deep Nets with Sublinear Memory Cost, by Chen et al. (2016)). This repository provides an implementation of this functionality in Tensorflow, using the Tensorflow graph editor to automatically rewrite the computation graph of the backward pass.
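The sqrt(n) figure can be recovered with a short back-of-the-envelope argument (notation mine, not from the repository): if a checkpoint is stored every k layers, the backward pass must hold the n/k stored checkpoints plus the k recomputed activations of the segment currently being processed, so peak memory is roughly

    M(k) ≈ n/k + k

which is minimized at k = sqrt(n), giving M = O(sqrt(n)), while the recomputation adds about one extra forward pass in total.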

scala-offheap - Experimental type-safe off-heap memory for Scala.

  •    Scala

Garbage collection is the standard memory management paradigm on the JVM. In theory, it lets one completely forget about the hurdles of memory management and delegate all of it to the underlying runtime. In practice, GC often leads to scalability issues on large heaps and latency-sensitive workloads. The goal of this project is to expose a completely different memory management paradigm to the developers: explicitly annotated region-based memory. This paradigm gives more control over memory management without the need to micro-manage allocations.

SystemMonitor - iOS application providing you all information about your device - hardware, operating system, processor, memory, GPU, network interface, storage and battery, including OpenGL powered visual representation in real time

  •    Objective-C

iOS application that gives you detailed information about your device: hardware, operating system, processor, memory, GPU, network interfaces, storage and battery, including an OpenGL-powered visual representation in real time.

memreduct - Lightweight real-time memory management application to monitor and clean system memory on your computer

  •    C++

Lightweight real-time memory management application to monitor and clean system memory on your computer. The program uses undocumented internal system features (the Native API) to clear the system cache (system working set, working set, standby page lists, modified page lists), with variable results of roughly 10-50%. The application is compatible with Windows XP SP3 and later operating systems, but some features are available only since Windows Vista.

GlideBitmapPool - Glide Bitmap Pool is a memory management library for reusing the bitmap memory

  •    Java

Glide Bitmap Pool is a memory management library for reusing bitmap memory. Because it reuses bitmap memory, the garbage collector is not invoked again and again, which keeps the application running smoothly. It uses inBitmap while decoding bitmaps on supported Android versions, and the version-specific use-cases have been handled for better optimization. Glide Bitmap Pool can be included in any Android or Java application.

cymem - 💥 Cython memory pool for RAII-style memory management

  •    Python

cymem provides two small memory-management helpers for Cython. They make it easy to tie memory to a Python object's life-cycle, so that the memory is freed when the object is garbage collected. The Pool object saves the memory addresses internally, and frees them when the object is garbage collected. Typically you'll attach the Pool to some cdef'd class. This is particularly handy for deeply nested structs, which have complicated initialization functions. Just pass the Pool object into the initializer, and you don't have to worry about freeing your struct at all — all of the calls to Pool.alloc will be automatically freed when the Pool expires.
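cymem itself is Cython, but the ownership pattern is simple enough to sketch in C++ as an analogue of the same idea (not cymem's API): a pool that records every allocation it hands out and frees all of them when the pool itself dies, the way cymem's Pool frees everything when its owning Python object is garbage collected.

    #include <cstdlib>
    #include <vector>

    class Pool {
        std::vector<void*> addresses_;  // saved addresses, as cymem's
                                        // Pool saves them internally
    public:
        void* alloc(size_t number, size_t elem_size) {
            void* p = calloc(number, elem_size);  // zero-initialized
            addresses_.push_back(p);
            return p;
        }
        ~Pool() {                       // runs when the pool goes out of
            for (void* p : addresses_)  // scope: every alloc() is freed,
                free(p);                // so nested structs need no
        }                               // individual cleanup
    };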

CuBLAS.Net

  •    

A wrapper for NVIDIA's cuBLAS (CUDA Basic Linear Algebra Subprograms) for the CLR.

nnabla - Neural Network Libraries

  •    C++

Neural Network Libraries is a deep learning framework that is intended to be used for research, development and production. We aim to have it running everywhere: desktop PCs, HPC clusters, embedded devices and production servers. Installing the base package provides the CPU version of Neural Network Libraries; GPU acceleration can be added by installing the CUDA extension with pip install nnabla-ext-cuda.

efficient_densenet_pytorch - A memory-efficient implementation of DenseNets

  •    Python

A PyTorch implementation of DenseNets, optimized to save GPU memory. While DenseNets are fairly easy to implement in deep learning frameworks, most implementations (such as the original) tend to be memory-hungry. In particular, because each layer's input is the concatenation of all preceding feature maps, the number of intermediate feature maps generated by the batch normalization and concatenation operations grows quadratically with network depth. It is worth emphasizing that this is not a property inherent to DenseNets, but rather to the implementation.