
mkl-dnn - Intel(R) Math Kernel Library for Deep Neural Networks (Intel(R) MKL-DNN)

  •    C++

The Intel MKL-DNN repository has migrated to https://github.com/intel/mkl-dnn. The old address will continue to be available and will redirect to the new repo; please update your links. Intel(R) Math Kernel Library for Deep Neural Networks (Intel(R) MKL-DNN) is an open source performance library for deep learning applications. The library accelerates deep learning applications and frameworks on Intel(R) architecture. Intel(R) MKL-DNN contains vectorized and threaded building blocks that you can use to implement deep neural networks (DNN) with C and C++ interfaces.

Simd - C++ image processing library using SIMD: SSE, SSE2, SSE3, SSSE3, SSE4

  •    C++

The Simd Library is a free, open source image processing library designed for C and C++ programmers. It provides many useful high-performance algorithms for image processing, such as pixel format conversion, image scaling and filtering, extraction of statistical information from images, motion detection, object detection (HAAR and LBP classifier cascades) and classification, and neural networks. The algorithms are optimized using different SIMD CPU extensions. In particular, the library supports the following extensions: SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX, AVX2 and AVX-512 for x86/x64, VMX (Altivec) and VSX (Power7) for PowerPC (big-endian), and NEON for ARM.
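
The library exposes its own high-level C++ API, which is not shown here. Purely as an illustration of the kind of primitive it vectorizes (for example, the image statistics behind motion detection), here is a minimal raw-SSE2 sketch of a sum-of-absolute-differences kernel; this is an assumed example, not the library's code:

    // Sum of absolute differences between two 8-bit images, the kind of
    // statistics primitive used for motion detection. Tail bytes (n not a
    // multiple of 16) are omitted for brevity. Build with: g++ -msse2
    #include <emmintrin.h>  // SSE2
    #include <cstdint>
    #include <cstddef>

    uint64_t sad_sse2(const uint8_t* a, const uint8_t* b, std::size_t n) {
        __m128i acc = _mm_setzero_si128();
        for (std::size_t i = 0; i + 16 <= n; i += 16) {
            __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + i));
            __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b + i));
            // _mm_sad_epu8 yields two 16-bit sums of absolute differences,
            // one in each 64-bit lane; accumulate them as 64-bit counters.
            acc = _mm_add_epi64(acc, _mm_sad_epu8(va, vb));
        }
        return static_cast<uint64_t>(_mm_cvtsi128_si64(acc)) +
               static_cast<uint64_t>(_mm_cvtsi128_si64(_mm_srli_si128(acc, 8)));
    }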

libsimdpp - Portable header-only zero-overhead C++ low level SIMD library

  •    C++

libsimdpp is a portable, header-only, zero-overhead C++ low-level SIMD library. The library presents a single interface over the SIMD instruction sets present in x86, ARM, PowerPC and MIPS architectures. On architectures that support several SIMD instruction sets, the library allows the same source files to be compiled once for each instruction set and then hooked into an internal or third-party dynamic dispatch mechanism. This allows the capabilities of the processor to be queried at runtime and the most efficient implementation to be selected. The library sits in the middle between programming directly with SIMD intrinsics and higher-level SIMD libraries: as much control as possible is given to the developer, so that it is possible to predict exactly what code the compiler will generate.
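
libsimdpp ships its own dispatcher machinery; the sketch below only illustrates the underlying runtime-dispatch idea, using GCC/Clang's __builtin_cpu_supports. The kernel bodies here are scalar stand-ins; in real use each would be the same source compiled once per instruction set:

    #include <cstddef>
    #include <cstdio>

    // Stand-ins for per-ISA builds of the same kernel; in practice each
    // would be compiled with different -m flags and actually vectorized.
    static void add_avx2(float* d, const float* a, const float* b, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i) d[i] = a[i] + b[i];
    }
    static void add_sse2(float* d, const float* a, const float* b, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i) d[i] = a[i] + b[i];
    }

    using kernel = void (*)(float*, const float*, const float*, std::size_t);

    // Query CPU capabilities at runtime and pick the best variant.
    static kernel select_kernel() {
        if (__builtin_cpu_supports("avx2")) return add_avx2;
        return add_sse2;  // safe baseline on any x86-64 CPU
    }

    int main() {
        float a[4] = {1, 2, 3, 4}, b[4] = {10, 20, 30, 40}, d[4];
        select_kernel()(d, a, b, 4);
        std::printf("%g %g %g %g\n", d[0], d[1], d[2], d[3]);
    }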

Vc - SIMD Vector Classes for C++

  •    C++

Recent generations of CPUs, and GPUs in particular, require data-parallel code for full efficiency. Data parallelism requires that the same sequence of operations is applied to different input data. CPUs and GPUs can thus reduce the hardware needed for instruction decoding and scheduling in favor of more arithmetic and logic units, which execute the same instructions synchronously. On CPU architectures this is implemented via SIMD registers and instructions. A single SIMD register can store N values and a single SIMD instruction can execute N operations on those values. On GPU architectures N threads run in perfect sync, fed by a single instruction decoder/scheduler. Each thread has local memory and a given index to calculate the offsets in memory for loads and stores. Current C++ compilers can transform scalar code into SIMD instructions automatically (auto-vectorization). However, the compiler must reconstruct an intrinsic property of the algorithm that was lost when the developer wrote a purely scalar implementation in C++. Consequently, C++ compilers cannot vectorize any given code to its most efficient data-parallel variant. Larger data-parallel loops in particular, spanning multiple functions or even translation units, will often not be transformed into efficient SIMD code.
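
To make the "N operations per instruction" point concrete, here is a scalar loop next to a hand-written SSE version. Plain intrinsics are used for illustration; Vc wraps such registers in vector types like Vc::float_v:

    // Scalar vs. SIMD: one SSE instruction adds four floats at once.
    #include <xmmintrin.h>  // SSE
    #include <cstddef>

    void add_scalar(float* d, const float* a, const float* b, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
            d[i] = a[i] + b[i];          // one addition per iteration
    }

    void add_sse(float* d, const float* a, const float* b, std::size_t n) {
        std::size_t i = 0;
        for (; i + 4 <= n; i += 4) {     // N = 4 for 128-bit registers
            __m128 va = _mm_loadu_ps(a + i);
            __m128 vb = _mm_loadu_ps(b + i);
            _mm_storeu_ps(d + i, _mm_add_ps(va, vb));  // four adds at once
        }
        for (; i < n; ++i)               // scalar tail
            d[i] = a[i] + b[i];
    }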

xsimd - Modern, portable C++ wrappers for SIMD intrinsics and parallelized, optimized math implementations

  •    C++

SIMD (Single Instruction, Multiple Data) is a feature of microprocessors that has been available for many years. SIMD instructions perform a single operation on a batch of values at once, and thus provide a way to significantly accelerate code execution. However, these instructions differ between microprocessor vendors and compilers. xsimd provides a unified means for library authors to use these features. Namely, it enables manipulation of batches of numbers with the same arithmetic operators as for single values. It also provides accelerated implementations of common mathematical functions operating on batches.
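
A minimal sketch of that batch style, assuming xsimd 8.x or later (older releases spelled the type batch<float, N>):

    #include <xsimd/xsimd.hpp>
    #include <cmath>
    #include <cstddef>

    // Multiply the square root of each element by k, one batch at a time.
    void scaled_sqrt(float* dst, const float* src, std::size_t n, float k) {
        using batch = xsimd::batch<float>;
        constexpr std::size_t w = batch::size;  // lanes per register on this target
        std::size_t i = 0;
        for (; i + w <= n; i += w) {
            batch x = batch::load_unaligned(src + i);
            batch y = xsimd::sqrt(x) * k;       // same operators as for scalars
            y.store_unaligned(dst + i);
        }
        for (; i < n; ++i)                      // scalar remainder
            dst[i] = std::sqrt(src[i]) * k;
    }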

base64simd - Base64 coding and decoding with SIMD instructions (SSE/AVX2/AVX512F/AVX512BW/AVX512VBMI/ARM Neon)

  •    C++

This repository contains code for encoding and decoding base64 using SIMD instructions. Depending on the CPU architecture, vectorized encoding is faster than scalar versions by a factor of 2 to 4; decoding is 2 to 2.7 times faster. Daniel Lemire and I also wrote the paper "Faster Base64 Encoding and Decoding Using AVX2 Instructions", published in ACM Transactions on the Web.
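
For contrast with the vectorized versions, the transformation being accelerated is the classic 3-bytes-to-4-characters step; a plain scalar reference encoder (not the repository's code) looks like this:

    #include <cstdint>
    #include <cstddef>
    #include <string>

    static const char tbl[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

    std::string base64_encode(const uint8_t* in, std::size_t n) {
        std::string out;
        std::size_t i = 0;
        for (; i + 3 <= n; i += 3) {
            // Pack 3 bytes into 24 bits, emit four 6-bit indices.
            uint32_t v = (in[i] << 16) | (in[i + 1] << 8) | in[i + 2];
            out += tbl[(v >> 18) & 0x3f];
            out += tbl[(v >> 12) & 0x3f];
            out += tbl[(v >> 6) & 0x3f];
            out += tbl[v & 0x3f];
        }
        if (i < n) {  // 1 or 2 trailing bytes: pad with '='
            uint32_t v = in[i] << 16;
            if (i + 1 < n) v |= in[i + 1] << 8;
            out += tbl[(v >> 18) & 0x3f];
            out += tbl[(v >> 12) & 0x3f];
            out += (i + 1 < n) ? tbl[(v >> 6) & 0x3f] : '=';
            out += '=';
        }
        return out;
    }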

parsing-int-series - Parse multiple decimal integers separated by an arbitrary number of delimiters

  •    C++

The parsers extract integer numbers from strings. A number can be preceded by a sign character. Numbers are separated by arbitrary sequences of separator characters; all other characters are invalid, and the parsers detect them and raise an exception. Requires a C++11 compiler (tested with GCC 7.3) and Python 2.7.
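
A scalar reference implementation of that contract could look like the following; the repository's point is doing the character classification with SIMD instead:

    #include <cctype>
    #include <cstdint>
    #include <stdexcept>
    #include <string>
    #include <vector>

    std::vector<int64_t> parse_ints(const std::string& s, char sep = ',') {
        std::vector<int64_t> out;
        std::size_t i = 0;
        while (i < s.size()) {
            if (s[i] == sep) { ++i; continue; }  // skip separator runs
            int64_t sign = 1;
            if (s[i] == '+' || s[i] == '-') {    // optional sign character
                if (s[i] == '-') sign = -1;
                ++i;
            }
            if (i >= s.size() || !std::isdigit((unsigned char)s[i]))
                throw std::runtime_error("invalid character");
            int64_t v = 0;
            while (i < s.size() && std::isdigit((unsigned char)s[i]))
                v = v * 10 + (s[i++] - '0');
            out.push_back(sign * v);
        }
        return out;
    }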

sse-popcount - SIMD (SSE) population count --- http://0x80.pl/articles/sse-popcount.html

  •    C++

Daniel Lemire, Nathan Kurz and I published the article "Faster Population Counts Using AVX2 Instructions". The subdirectory original contains code from 2008 --- it is 32-bit and GCC-centric. The root directory contains fresh C++11 code, written with intrinsics and tested on 64-bit machines.
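
The lookup-based SSSE3 kernel discussed in the article is essentially the following (a sketch; the repository benchmarks many variants, up to an AVX2 Harley-Seal scheme):

    #include <tmmintrin.h>  // SSSE3
    #include <cstdint>
    #include <cstddef>

    // Population count of a buffer; n assumed a multiple of 16 for brevity.
    uint64_t popcnt_ssse3(const uint8_t* data, std::size_t n) {
        // Popcounts of the nibble values 0..15, used as a shuffle table.
        const __m128i lookup   = _mm_setr_epi8(0,1,1,2, 1,2,2,3,
                                               1,2,2,3, 2,3,3,4);
        const __m128i low_mask = _mm_set1_epi8(0x0f);
        __m128i acc = _mm_setzero_si128();
        for (std::size_t i = 0; i < n; i += 16) {
            __m128i v  = _mm_loadu_si128(reinterpret_cast<const __m128i*>(data + i));
            __m128i lo = _mm_and_si128(v, low_mask);
            __m128i hi = _mm_and_si128(_mm_srli_epi16(v, 4), low_mask);
            // Per-byte popcounts via two 16-entry table lookups.
            __m128i cnt = _mm_add_epi8(_mm_shuffle_epi8(lookup, lo),
                                       _mm_shuffle_epi8(lookup, hi));
            // Horizontal sum of the bytes into two 64-bit lanes.
            acc = _mm_add_epi64(acc, _mm_sad_epu8(cnt, _mm_setzero_si128()));
        }
        return static_cast<uint64_t>(_mm_cvtsi128_si64(acc)) +
               static_cast<uint64_t>(_mm_cvtsi128_si64(_mm_srli_si128(acc, 8)));
    }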


sse4-strstr - SIMD (SWAR/SSE/SSE4/AVX2/AVX512F/ARM Neon) implementations of a modified Karp-Rabin algorithm

  •    C++

Sample programs for the article "SIMD-friendly algorithms for substring searching" (http://0x80.pl/articles/simd-strfind.html). The root directory contains C++11 procedures implemented with intrinsics for SSE, SSE4, AVX2, AVX512F, AVX512BW and ARM Neon (both ARMv7 and ARMv8).
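
The core trick from the article (compare the needle's first and last bytes across a whole block, then verify the few surviving candidates) can be sketched in SSE2 as follows; this is an illustration, not the repository's exact code:

    #include <emmintrin.h>  // SSE2
    #include <cstring>
    #include <cstddef>

    // Returns the index of needle in s, or (size_t)-1. Assumes k >= 2;
    // the last <16 candidate positions are left to a scalar fallback.
    std::size_t strstr_sse2(const char* s, std::size_t n,
                            const char* needle, std::size_t k) {
        const __m128i first = _mm_set1_epi8(needle[0]);
        const __m128i last  = _mm_set1_epi8(needle[k - 1]);
        for (std::size_t i = 0; i + k + 16 <= n + 1; i += 16) {
            __m128i bf = _mm_loadu_si128(reinterpret_cast<const __m128i*>(s + i));
            __m128i bl = _mm_loadu_si128(
                reinterpret_cast<const __m128i*>(s + i + k - 1));
            // A bit is set where both the first and last byte match.
            __m128i eq = _mm_and_si128(_mm_cmpeq_epi8(bf, first),
                                       _mm_cmpeq_epi8(bl, last));
            unsigned mask = static_cast<unsigned>(_mm_movemask_epi8(eq));
            while (mask) {
                unsigned bit = __builtin_ctz(mask);   // candidate position
                if (std::memcmp(s + i + bit + 1, needle + 1, k - 2) == 0)
                    return i + bit;                   // verify the middle
                mask &= mask - 1;                     // clear lowest set bit
            }
        }
        return static_cast<std::size_t>(-1);
    }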

vpu-count - Information about AVX-512 support on recent Intel processors

  •    C

The intended user of this project is the developer of a numerical library that needs to know the throughput of floating-point FMA (fused multiply-add) on Xeon processors that support AVX-512. This project does not attempt to address the throughput of other operations, such as shuffles, permutations, or non-floating-point instructions. The project provides example code to show a user how to determine the number of AVX-512 FMAs in an Intel processor based on the Skylake microarchitecture. We encourage potential users to adapt the code to their usage, as some of the code included is rather pedantic (although this doesn't add significant runtime overhead unless debug printing is enabled).
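
The model-specific FMA-counting logic is the repository's value-add and is too involved to reproduce here; the sketch below shows only the first step it builds on, detecting AVX-512F itself via CPUID leaf 7 (GCC/Clang on x86-64):

    #include <cpuid.h>
    #include <cstdio>

    // Check CPUID.(EAX=7,ECX=0):EBX bit 16 (AVX512F). A complete check
    // would also verify OS support for ZMM state via XGETBV.
    bool has_avx512f() {
        unsigned eax, ebx, ecx, edx;
        if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
            return false;  // leaf 7 not supported
        return (ebx >> 16) & 1;
    }

    int main() {
        std::printf("AVX-512F: %s\n", has_avx512f() ? "yes" : "no");
    }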

UnivDisasm - x86 Disassembler and Analyzer

  •    Pascal

UnivDisasm is a powerful x86 disassembler and opcode analyzer library. In the examples, the left side shows the instruction opcodes and the right side shows the full instruction strings decoded by UnivDisasm.

avx512counters - AVX-512 hardware counters collector written in Go, based on Go toolchain

  •    Go

This program uses the Go 1.11 assembler's AVX-512 support, an extensive end-to-end test suite, and the Linux perf tool to build a CSV file that records relevant hardware counter values for every available AVX-512 instruction form.