Vespa is an engine for low-latency computation over large data sets. It stores and indexes your data so that queries, selection and processing over the data can be performed at serving time. Vespa is the serving platform for Yahoo.com, Yahoo News, Yahoo Sports, Yahoo Finance, Yahoo Gemini, and Flickr.
searchengine search-engine big-data data-processing machine-learning real-time
IPython Notebook(s) demonstrating deep learning functionality. IPython Notebook(s) demonstrating scikit-learn functionality.
machine-learning deep-learning data-science big-data aws tensorflow theano caffe scikit-learn kaggle spark mapreduce hadoop matplotlib pandas numpy scipy keras
This is a collection of IPython/Jupyter notebooks intended to train the reader on Apache Spark concepts, from basic to advanced, using the Python language. If you prefer R to Python, you may want to have a look at our R on Apache Spark (SparkR) notebooks instead. Additionally, if you are interested in an introduction to basic Data Science Engineering, you might find this series of tutorials interesting. There we explain different concepts and applications using Python and R.
spark pyspark data-analysis mllib ipython-notebook notebook ipython data-science machine-learning big-data bigdata
CatBoost is a machine learning method based on gradient boosting over decision trees. All CatBoost documentation is available here.
machine-learning decision-trees gradient-boosting gbm gbdt r kaggle gpu-computing catboost tutorial categorical-features distributed gpu coreml opensource data-science big-data
spaCy is a library for advanced Natural Language Processing in Python and Cython. It's built on the very latest research, and was designed from day one to be used in real products. spaCy comes with pre-trained statistical models and word vectors, and currently supports tokenization for 20+ languages. It features the fastest syntactic parser in the world, convolutional neural network models for tagging, parsing and named entity recognition, and easy deep learning integration. It's commercial open-source software, released under the MIT license. 💫 Version 2.0 out now! Check out the new features here.
natural-language-processing data-science big-data machine-learning cython nlp artificial-intelligence ai spacy nlp-library neural-network neural-networks deep-learning
Datumbox is an open-source Machine Learning Framework written in Java which allows the rapid development of Machine Learning and Statistical applications.
machine-learning big-data statistics nlp data-science
The facets project contains two visualizations for understanding and analyzing machine learning datasets: Facets Overview and Facets Dive. The visualizations are implemented as Polymer web components, backed by TypeScript code, and can be easily embedded into Jupyter notebooks or webpages.
machine-learning data-visualization visualization big-data
This repo contains the projects, additional information and code to support my blog: sciblog. You can find a list of all the posts I have made in this file.
sciblog sciblog-support machine-learning artificial-intelligence deep-learning neural-networks examples code-examples programming-exercise data-science big-data analytics
High Performance Analytics Toolkit (HPAT) scales analytics/ML codes in Python to bare-metal cluster/cloud performance automatically. It compiles a subset of Python (Pandas/Numpy) to efficient parallel binaries with MPI, requiring only minimal code changes. HPAT is orders of magnitude faster than alternatives like Apache Spark. HPAT's documentation can be found here.
big-data parallel-computing compilers machine-learning numpy pandas
High performance distributed data processing and machine learning. Skale provides a high-level API in JavaScript and an optimized parallel execution engine on top of Node.js.
nodejs cluster aws-s3 azure-storage parquet machine-learning skale big-data etl distributed data-processing cloud s3 azure parallel hpc
The Data Science Virtual Machine (DSVM) is a customized VM image on Microsoft’s Azure cloud built specifically for doing data science. It has many popular data science and other tools pre-installed and pre-configured to jump-start building intelligent applications for advanced analytics. It is available on Windows Server 2016, Windows Server 2012, and Linux. We offer the Linux edition of the DSVM on either Ubuntu 16.04 LTS or OpenLogic 7.2 CentOS-based Linux distributions. You can try the Data Science VM for free for 30 days (with $200 credits) with a free Azure Trial. The Linux (Ubuntu-based) DSVM also provides a test drive through a button on the product page. The Test Drive will provide full access to your own instance of the VM with just a free Microsoft account (no Azure subscription or credit card needed). On this repo, we will feature tools, tips and extensions (see below) to the Data Science VM. We invite the DSVM user community to contribute any useful tools, scripts, or extensions you may have written to enhance the user experience on the DSVM.
data-science data-analysis machine-learning deep-learning azure big-data ai ml dsvm r sqlserver
The server components for the AcousticBrainz project. Full installation instructions are available in the INSTALL.md file. After installing, continue with the following steps.
big-data music web acousticbrainz-server machine-learning
This is the main repository for the web scale data mining project, which took place in summer 2014 as a research project. One of the results is the set of visualized topics, which were learned autonomously from terabytes of raw HTML data.
eth machine-learning data-mining spark hadoop big-data topic-modeling web-scale
If you want to install and run everything on your computer, here are the best tutorials I've found for getting Python and Spark running locally. To visualize the decision trees in Jupyter, you will need to install Graphviz as well as its Python package.
big-data machine-learning jupyter-notebook graphviz data-exploration pyspark mllib
Neural network implementation with backpropagation. It uses map reduce to distribute the computation of the cost function and its gradients. It also implements stochastic/step/batch gradient descent for optimizing the cost function.
nodejs neural-network big-data machine-learning
CLgen is an open source application for generating runnable programs using deep learning. CLgen learns to program using neural networks which model the semantics and usage from large volumes of program fragments, generating many-core OpenCL programs that are representative of, but distinct from, the programs it learns from. See the online documentation for instructions on how to download and install CLgen.
deep-learning gpu machine-learning neural-network opencl benchmarking synthetic-programs big-data lstm
Vertica-ML-Python is a Python library that exposes scikit-like functionality for conducting data science projects on data stored in Vertica, thus taking advantage of Vertica’s speed and built-in analytics and machine learning capabilities. It supports the entire data science life cycle, uses a ‘pipeline’ mechanism to sequentialize data transformation operations (called Resilient Vertica Dataset), and offers multiple graphical rendering possibilities.
vertica machine-learning big-data data-visualization preparation data-science python-library
Please submit issues, questions and PRs in the new location. The current repo is not maintained. The repository has been moved for several reasons, mainly to improve the integration with Sparkling Water and for stability reasons.
h2o spark machine-learning sparklyr deep-learning data-science big-data r water
The GDLibrary is a pure-MATLAB library of unconstrained optimization algorithms. It solves an unconstrained minimization problem of the form min f(x). Note that the SGDLibrary internally contains this GDLibrary.
optimization optimization-algorithms machine-learning machine-learning-algorithms big-data gradient-descent gradient logistic-regression newton linear-regression svm lasso matrix-completion rosenbrock-problem softmax-regression multinomial-regression statistical-learning classification
The SGDLibrary is a pure-MATLAB library of stochastic optimization algorithms. It solves an unconstrained minimization problem of the form min f(x) = sum_i f_i(x). The SGDLibrary also runs on GNU Octave (free software compatible with many MATLAB scripts). Note that this SGDLibrary internally contains the GDLibrary.
optimization optimization-algorithms machine-learning machine-learning-algorithms stochastic-optimization-algorithms stochastic-gradient-descent big-data gradient-descent-algorithm gradient logistic-regression sgd variance-reduction newtons-method linear-regression classification online-learning quasi-newton
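To make the finite-sum form min f(x) = sum_i f_i(x) concrete, here is a minimal sketch of plain stochastic gradient descent on a least-squares problem, where f_i(x) = 0.5 * (a_i . x - b_i)^2. It is written in Python rather than MATLAB and is not the SGDLibrary API: the function name sgd_least_squares, the step size, the epoch count and the synthetic data are all assumptions made for illustration.

```python
import random


def sgd_least_squares(rows, targets, step=0.01, epochs=200, seed=0):
    """Minimize sum_i 0.5 * (rows[i] . x - targets[i])**2 with plain SGD.

    Hypothetical helper for illustration; not part of any library.
    """
    rng = random.Random(seed)
    x = [0.0] * len(rows[0])
    order = list(range(len(rows)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the terms f_i in a fresh random order
        for i in order:
            # Residual of the i-th term: rows[i] . x - targets[i]
            residual = sum(a * xj for a, xj in zip(rows[i], x)) - targets[i]
            # The gradient of f_i is residual * rows[i]; take one step along it.
            x = [xj - step * residual * a for xj, a in zip(x, rows[i])]
    return x


# Synthetic, noise-free data (an assumption): targets come from a known x_true.
data_rng = random.Random(1)
x_true = [1.0, -2.0, 0.5]
rows = [[data_rng.gauss(0.0, 1.0) for _ in x_true] for _ in range(200)]
targets = [sum(a * t for a, t in zip(row, x_true)) for row in rows]

x_hat = sgd_least_squares(rows, targets)
print(x_hat)
```

Each update touches a single term f_i instead of the full sum, which is what separates the stochastic methods from full gradient descent on min f(x). A constant step size works in this sketch only because the data are noise-free; with noisy targets a decaying step size or a variance-reduced method would typically be needed.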