
spark-py-notebooks - Apache Spark & Python (pySpark) tutorials for Big Data Analysis and Machine Learning as IPython / Jupyter notebooks

  •    Jupyter

This is a collection of IPython/Jupyter notebooks intended to train the reader on different Apache Spark concepts, from basic to advanced, using the Python language. If Python is not your language and you prefer R, you may want to have a look at our R on Apache Spark (SparkR) notebooks instead. Additionally, if you are interested in being introduced to some basic Data Science Engineering, you might find this series of tutorials interesting; there we explain different concepts and applications using Python and R.
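
As a taste of the basics the early notebooks cover, here is a minimal pySpark sketch (an illustration, not code taken from the notebooks) that builds an RDD locally and runs a couple of transformations before an action:

    from pyspark import SparkContext

    sc = SparkContext("local[2]", "intro-example")

    numbers = sc.parallelize(range(1, 101))            # distribute a Python range
    squares = numbers.map(lambda x: x * x)             # lazy transformation
    even_squares = squares.filter(lambda x: x % 2 == 0)

    print(even_squares.count())                        # actions trigger computation
    print(even_squares.take(5))

    sc.stop()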

awesome-scalability - Scalable, Available, Stable, Performant, and Intelligent System Design Patterns

  •    

An updated and curated list of readings to illustrate best practices and patterns in building scalable, available, stable, performant, and intelligent large-scale systems. Concepts are explained in articles by prominent engineers and credible references. Case studies are taken from battle-tested systems that serve millions to billions of users. Understand whether you have a scalability problem (fast for a single user but slow under heavy load) or a performance problem (slow even for a single user) by reviewing some design principles and checking how scalability and performance problems are solved at tech companies. The intelligence section was created for those who work with data and machine learning at big (data) and deep (learning) scale.

Stream-Framework - Stream Framework is a Python library, which allows you to build news feed, activity streams and notification systems using Cassandra and/or Redis

  •    Python

Stream Framework is a Python library which allows you to build activity streams and newsfeeds using Cassandra and/or Redis. If you're not using Python, have a look at Stream, which supports Node, Ruby, PHP, Python, Go, Scala, Java and REST. Stream Framework's authors also offer a web service for building scalable newsfeeds and activity streams at Stream. It allows you to create your feeds by talking to a beautiful and easy-to-use REST API. There are clients available for Node, Ruby, PHP, Python, Go, Scala and Java. The Get Started page explains the API and concepts in a few clicks. It's a lot easier to use, free for up to 3 million feed updates, and saves you the hassle of maintaining Cassandra, Redis, Faye, RabbitMQ and Celery workers.
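
To make the newsfeed idea concrete, the sketch below shows the fan-out-on-write pattern that such feed systems implement, written against plain redis-py rather than Stream Framework's actual API; the key layout and function names are assumptions for illustration, and a local Redis instance is assumed.

    import json
    import time

    import redis

    r = redis.Redis()  # assumes a Redis server on localhost

    def add_activity(actor_id, verb, object_id, followers):
        # Fan-out-on-write: push the new activity onto every follower's feed.
        activity = json.dumps({
            "actor": actor_id,
            "verb": verb,
            "object": object_id,
            "time": time.time(),
        })
        pipe = r.pipeline()
        for follower_id in followers:
            key = "feed:%d" % follower_id
            pipe.lpush(key, activity)   # newest activity first
            pipe.ltrim(key, 0, 999)     # cap feed length
        pipe.execute()

    def read_feed(user_id, limit=10):
        return [json.loads(a) for a in r.lrange("feed:%d" % user_id, 0, limit - 1)]

    add_activity(actor_id=13, verb="pin", object_id=42, followers=[1, 2, 3])
    print(read_feed(1))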

spark-movie-lens - An on-line movie recommender using Spark, Python Flask, and the MovieLens dataset

  •    Jupyter

This Apache Spark tutorial will guide you step by step into how to use the MovieLens dataset to build a movie recommender using collaborative filtering with Spark's Alternating Least Squares implementation. It is organised in two parts. The first one is about getting and parsing movies and ratings data into Spark RDDs. The second is about building and using the recommender and persisting it for later use in our on-line recommender system. This tutorial can be used independently to build a movie recommender model based on the MovieLens dataset. Most of the code in the first part, about how to use ALS with the public MovieLens dataset, comes from my solution to one of the exercises proposed in CS100.1x Introduction to Big Data with Apache Spark by Anthony D. Joseph on edX, which has also been publicly available since 2014 at Spark Summit. Starting from there, I've made minor modifications to use a larger dataset, added code on how to store and reload the model for later use, and finally added a web service using Flask.
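
For orientation, here is a minimal sketch of collaborative filtering with the RDD-based MLlib ALS API that the tutorial builds on; the file path, column layout, and parameter values are illustrative assumptions, not taken from the notebooks:

    from pyspark import SparkContext
    from pyspark.mllib.recommendation import ALS, Rating

    sc = SparkContext("local[2]", "movielens-als")

    # Assumed ratings.csv lines: userId,movieId,rating,timestamp
    raw = sc.textFile("ratings.csv")
    header = raw.first()
    ratings = (raw.filter(lambda line: line != header)
                  .map(lambda line: line.split(","))
                  .map(lambda f: Rating(int(f[0]), int(f[1]), float(f[2]))))

    # Train the ALS model and query it.
    model = ALS.train(ratings, rank=8, iterations=10, lambda_=0.1)
    print(model.predict(1, 50))            # predicted rating of movie 50 by user 1
    print(model.recommendProducts(1, 5))   # top-5 recommendations for user 1

    sc.stop()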

hpat - A compiler-based big data framework in Python

  •    Python

High Performance Analytics Toolkit (HPAT) scales analytics/ML codes in Python to bare-metal cluster/cloud performance automatically. It compiles a subset of Python (Pandas/Numpy) to efficient parallel binaries with MPI, requiring only minimal code changes. HPAT is orders of magnitude faster than alternatives like Apache Spark. HPAT's documentation can be found here.
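
As a rough sketch of the programming model, the snippet below decorates a Pandas computation so HPAT can compile and parallelize it; the @hpat.jit decorator follows the project's documented usage, while the file name and columns are hypothetical:

    import pandas as pd
    import hpat

    @hpat.jit  # compiles the function to an efficient parallel binary
    def mean_fare():
        # 'data.csv' and its 'fare'/'passengers' columns are assumptions.
        df = pd.read_csv("data.csv")
        return (df.fare * df.passengers).sum() / df.passengers.sum()

    if __name__ == "__main__":
        print(mean_fare())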

cogcomp-nlp - CogComp's Natural Language Processing libraries and Demos:

  •    Java

This project collects a number of core libraries for Natural Language Processing (NLP) developed by the Cognitive Computation Group. Each library contains a detailed readme and instructions on how to use it. In addition, the Javadoc for the whole project is available here.

azure-arch-enterprise-bi-and-reporting - A reference implementation of an end to end Azure infrastructure for scalable enterprise BI and Reporting

  •    PowerShell

Once deployed, the TRI can be used as-is or customized to fit the application's needs using the technical documentation that is provided with it. This enables the customer to build a solution that delivers the business goals on top of a robust and functional infrastructure. It is recommended that the TRI be reviewed and deployed by someone familiar with operational concepts of data warehousing, business intelligence, and analytics. Knowledge of Azure is a plus, but not mandatory. The technical guides provide pointers to Azure documentation for all the resources employed in this TRI.

tf_G - My Final Degree Project called "Algorithms for Big Data: Graphs and PageRank"

  •    TeX

This work studies a set of techniques and strategies in algorithm design whose purpose is the efficient resolution of problems on massive data sets, a field known as Algorithms for Big Data. In particular, it studies streaming algorithms, which form the basis of data structures that use sublinear, o(n), space, known as sketches. It also delves into graph problems under the semi-streaming model. Next, the PageRank algorithm is analyzed as a concrete case study. Finally, development has begun on a library for solving graph problems, implemented on top of TensorFlow, the platform for intensive mathematical computation.
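
To make the case study concrete, here is a worked PageRank power iteration on a tiny directed graph in plain NumPy; this only illustrates the algorithm, not tf_G's TensorFlow-based API, and the graph and damping factor are arbitrary:

    import numpy as np

    # Toy directed graph: an edge (i, j) sends rank from node i to node j.
    edges = [(0, 1), (0, 2), (1, 2), (2, 0), (3, 2)]
    n = 4
    damping = 0.85

    # Build the column-stochastic transition matrix.
    out_degree = np.zeros(n)
    for i, _ in edges:
        out_degree[i] += 1
    M = np.zeros((n, n))
    for i, j in edges:
        M[j, i] = 1.0 / out_degree[i]

    # Power iteration: repeatedly apply the damped transition matrix.
    rank = np.full(n, 1.0 / n)
    for _ in range(100):
        rank = (1 - damping) / n + damping * (M @ rank)

    print(rank / rank.sum())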

amazon-kinesis-client-nodejs - Amazon Kinesis Client Library for Node.js

  •    Javascript

This package provides an interface to the Amazon Kinesis Client Library (KCL) MultiLangDaemon for the Node.js framework. Developers can use the KCL to build distributed applications that process streaming data reliably at scale. The KCL takes care of many of the complex tasks associated with distributed computing, such as load-balancing across multiple instances, responding to instance failures, checkpointing processed records, and reacting to changes in stream volume.
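
The sketch below illustrates, in Python rather than the Node.js interface this package actually exposes, the general shape of a KCL record processor: the daemon initializes it for a shard, hands it batches of records, and lets it checkpoint progress. All names and the toy driver are hypothetical, purely to show the contract.

    class ToyRecordProcessor:
        # Conceptual shape of a KCL record processor, not this package's API.
        def initialize(self, shard_id):
            self.shard_id = shard_id

        def process_records(self, records, checkpoint):
            for record in records:
                print("shard %s got: %s" % (self.shard_id, record))
            checkpoint()  # mark progress so a restarted worker resumes here

        def shutdown(self, checkpoint, reason):
            if reason == "TERMINATE":
                checkpoint()  # final checkpoint when the shard is closed

    # Tiny stand-in for the MultiLangDaemon driving the processor.
    processor = ToyRecordProcessor()
    processor.initialize("shardId-000000000000")
    processor.process_records(["event-1", "event-2"], checkpoint=lambda: print("checkpointed"))
    processor.shutdown(checkpoint=lambda: print("final checkpoint"), reason="TERMINATE")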

ethz-web-scale-data-mining-project - ETH Zurich - Web Scale Data Processing and Mining Project

  •    HTML

This is the main repository for the web-scale data mining project, which took place in summer 2014 as a research project. One of the results is a set of visualized topics that were learned autonomously from terabytes of raw HTML data.

spark-r-notebooks - R on Apache Spark (SparkR) tutorials for Big Data analysis and Machine Learning as IPython / Jupyter notebooks

  •    Jupyter

This is a collection of Jupyter notebooks intended to train the reader on different Apache Spark concepts, from basic to advanced, using the R language. If you are interested in being introduced to some basic Data Science Engineering concepts and applications, you might find this series of tutorials interesting; there we explain different concepts and applications using Python and R. Additionally, if you are interested in using Python with Spark, you can have a look at our pySpark notebooks.

shiny-guide - an intelligent big data healthcare application for tongue diagnosis aid.

  •    Javascript

An iOS app for tongue diagnosis in traditional Chinese medicine (TCM). OpenCV, an image-analysis library, is used to process the tongue images in a Visual C++ environment. First, OpenCV algorithms convert the color tongue image into a grayscale image, and the grayscale image into a binary image. Further algorithms then extract the tongue region, and a set of features is extracted from it. Combining these features with traditional Chinese medicine knowledge, a diagnosis can be given.
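
As an illustration of the first two steps of that pipeline (color to grayscale to binary), here is a short sketch using OpenCV's Python bindings rather than the Visual C++ environment the project describes; the input file name is hypothetical:

    import cv2

    image = cv2.imread("tongue.jpg")                  # hypothetical color tongue image
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)    # color -> grayscale
    # Otsu's method chooses the threshold automatically; the later steps
    # (extracting the tongue region and its features) would start from this mask.
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    cv2.imwrite("tongue_gray.jpg", gray)
    cv2.imwrite("tongue_binary.jpg", binary)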

dvid - Distributed, Versioned, Image-oriented Dataservice

  •    Go

Status: in production use at Janelia. See the DVID wiki for outside-lab use of DVID and for more information, including installation instructions and examples of use.

warp - Convert and analyze large data sets at light speed, on Mac and iOS.

  •    Swift

Warp allows you to convert and analyze (very) large databases with ease at the speed of light. In Warp, you work on a small subset of the data, after which Warp repeats your actions on the entire dataset. Unlike most data analysis apps, you do not have to type any code in Warp. Effortlessly juggle data between files and databases by simply dragging and dropping: load CSV files into MySQL or transfer a PostgreSQL table to a RethinkDB table by just dragging one onto the other.

NodeNeuralNetwork - Nodejs implementation of Neural Network

  •    Javascript

Neural network implementation with backpropagation. It uses map-reduce to distribute the computation of the cost function and its gradients. It also implements stochastic/step/batch gradient descent for optimizing the cost function.
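
To make the optimization step concrete, here is a small worked example of batch gradient descent on a cost function, in NumPy rather than the library's Node.js code: a logistic-regression "one-layer network" trained on synthetic data, all of it illustrative rather than drawn from the repository:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))                   # synthetic inputs
    y = (X[:, 0] + X[:, 1] > 0).astype(float)       # synthetic labels

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    w = np.zeros(2)
    b = 0.0
    learning_rate = 0.1

    for epoch in range(500):
        p = sigmoid(X @ w + b)                      # forward pass
        grad_w = X.T @ (p - y) / len(y)             # gradient of the cross-entropy cost
        grad_b = np.mean(p - y)
        w -= learning_rate * grad_w                 # one batch gradient descent step
        b -= learning_rate * grad_b

    accuracy = np.mean((sigmoid(X @ w + b) > 0.5) == y)
    print("training accuracy: %.2f" % accuracy)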

big-data-upf - RECSM-UPF Summer School: Social Media and Big Data Research

  •    HTML

Citizens across the globe spend an increasing proportion of their daily lives online. Their activities leave behind granular, time-stamped footprints of human behavior and personal interactions that represent a new and exciting source of data to study standing questions about political and social behavior. At the same time, the volume and heterogeneity of web data present unprecedented methodological challenges. The goal of this course is to introduce participants to new computational methods and tools required to explore and analyze Big Data from online sources using the R programming language. We will focus in particular on data collected from social networking sites, such as Facebook and Twitter, whose use is becoming widespread in the social sciences. There are two ways you can follow the course and run the code contained in this GitHub repository. The recommended method is to connect to the provided RStudio server where all the R packages have already been installed, and all the R code is available. To access the server, visit bigdata.pablobarbera.com and log in with the information provided during class.