AzureDSVM - AzureDSVM is an R package that offers convenient harness of Azure DSVM, remote execution of scalable and elastic data science work, and monitoring of on-demand resource consumption


AzureDSVM (Azure Data Science Virtual Machine) is an R package for data scientists working with the Azure compute platform. It complements the underlying AzureSMR package for controlling Azure Data Science Virtual Machines. The Azure Data Science Virtual Machine (DSVM) is a powerful data science development environment with pre-installed tools and packages for convenient data wrangling, model building, and service deployment.



Related Projects

DataScienceVM - Tools and Docs on the Azure Data Science Virtual Machine (

The Data Science Virtual Machine (DSVM) is a customized VM image on Microsoft’s Azure cloud built specifically for doing data science. It has many popular data science and other tools pre-installed and pre-configured to jump-start building intelligent applications for advanced analytics. It is available on Windows Server 2016, Windows Server 2012, and Linux. The Linux edition of the DSVM is offered on either Ubuntu 16.04 LTS or the CentOS-based OpenLogic 7.2 distribution. You can try the Data Science VM for free for 30 days (with $200 in credits) with a free Azure trial. The Linux (Ubuntu-based) DSVM also provides a test drive through a button on the product page. The test drive gives you full access to your own instance of the VM with just a free Microsoft account (no Azure subscription or credit card needed). In this repo, we feature tools, tips, and extensions (see below) for the Data Science VM. We invite the DSVM user community to contribute any useful tools, scripts, or extensions written to enhance the user experience on the DSVM.


This repository contains walkthroughs, templates, and documentation related to Machine Learning and Data Science services and platforms on Azure. Services and platforms include Data Science Virtual Machine, Azure ML, HDInsight, Microsoft R Server, SQL Server, Azure Data Lake, etc. There are also materials from tutorials we have delivered at KDD, Strata, etc., using the above services and platforms.

Azure-TDSP-ProjectTemplate - Data science project template repository with standardized directory structure and document templates to support efficient project execution and collaboration

This is a general project directory structure for the Team Data Science Process (TDSP) developed by Microsoft. It also contains templates for various documents that are recommended as part of executing a data science project with TDSP. TDSP is an agile, iterative data science methodology to improve collaboration and team learning. It is supported through a lifecycle definition, a standard project structure, artifact templates, and tools for productive data science.

MMLSpark - Microsoft Machine Learning for Apache Spark

MMLSpark provides a number of deep learning and data science tools for Apache Spark, including seamless integration of Spark Machine Learning pipelines with Microsoft Cognitive Toolkit (CNTK) and OpenCV, enabling you to quickly create powerful, highly scalable predictive and analytical models for large image and text datasets. MMLSpark requires Scala 2.11, Spark 2.1+, and either Python 2.7 or Python 3.5+. See the API documentation for Scala and for PySpark.

Azure-TDSP-Utilities - Utilities and scripts developed as part of Microsoft's Team Data Science Process for productive data science

This repository contains the data science utilities developed as part of the Team Data Science Process (TDSP) from Microsoft. Shared data science utilities are a key component of TDSP and can make the execution of data science projects more efficient.

Customer-Churn-Demo-MRS-Spark-HDI - This demo demonstrates how to use Microsoft R Server, Azure HDInsight with R on Linux, Azure Machine Learning, Spark, Scala, and Hive to build an end-to-end, cloud solution for Retail Customer Churn

This demo shows how to use Microsoft R Server, Azure HDInsight with R on Linux, Azure Machine Learning, Spark, Scala, Hive, etc. to build an end-to-end cloud solution for Retail Customer Churn. The demo attempts to simulate the real-world use case of data placement/storage, feature engineering, model retraining, prediction, and visualization. An Azure subscription: Before you begin, you must have an Azure subscription that has access to Azure HDInsight, Azure Blob Storage, etc. See Get Azure free trial for more information.

cortana-intelligence-personalized-offers - Generate real-time personalized offers on a retail website to engage more closely with customers

In today’s highly competitive and connected environment, modern businesses can no longer survive with generic, static online content. Furthermore, marketing strategies using traditional tools are often expensive, hard to implement, and do not produce the desired return on investment. These systems often fail to take full advantage of the data collected to create a more personalized experience for the user. Surfacing offers that are customized for the user has become essential to build customer loyalty and remain profitable. On a retail website, customers desire intelligent systems that provide offers and content based on their unique interests and preferences. Today’s digital marketing teams can build this intelligence using the data generated from all types of user interactions. By analyzing massive amounts of data, marketers have the unique opportunity to deliver highly relevant and personalized offers to each user. However, building a reliable and scalable big data infrastructure and developing sophisticated machine learning models that personalize to each user is not trivial. Cortana Intelligence provides advanced analytics tools through Microsoft Azure (data ingestion, data storage, data processing, and advanced analytics components), all of the essential elements for building a personalized offers solution. This solution combines several Azure services to provide powerful advantages. Event Hubs collects real-time consumption data. Stream Analytics aggregates the streaming data and updates the data used in making personalized offers to the customer. Azure DocumentDB stores the customer, product, and offer information. Azure Storage is used to manage the queues that simulate user interaction. Azure Functions are used as a coordinator for the user simulation and as the central portion of the solution for generating personalized offers. Azure Machine Learning implements and executes the product recommendations, and when no user history is available, Azure Redis Cache is used to provide pre-computed product recommendations for the customer. PowerBI visualizes the activity of the system with the data from DocumentDB.

PySpark-Predictive-Maintenance - Predictive Maintenance using Pyspark

Predictive maintenance is one of the most common machine learning use cases. With the latest advancements in information technology, the volume of stored data in this domain is growing faster than ever before, which makes it necessary to leverage big data analytics to efficiently transform large amounts of data into business intelligence. Microsoft has published a series of learning materials, including blogs, solution templates, modeling guides, and sample tutorials, in the domain of predictive maintenance. This tutorial extends those materials with a detailed step-by-step process that uses PySpark, the Spark Python API, to demonstrate how to approach predictive maintenance for big data scenarios. The tutorial covers typical data science steps such as data ingestion, cleansing, feature engineering, and model development. The input data is simulated to reflect features that are generic to most predictive maintenance scenarios. So that the tutorial can be completed quickly, the simulated data is only around 1.3 GB, but the same PySpark framework can be applied to a much larger data set. The data is hosted on a publicly accessible Azure Blob Storage container and can be downloaded from here. In this tutorial, we import the data directly from the blob storage.

acceleratoRs - R based data science solution accelerator suite that provides templates for prototyping, reporting, and presenting data science analytics of specific domains

acceleratoRs are a collection of lightweight, R-based data science solutions that offer a quick start for data scientists to experiment, prototype, and present their data analytics in specific domains. Each of the accelerators shared in this repo is structured following the project template of the Microsoft Team Data Science Process, in a simplified and accelerator-friendly version. The analytics are scripted in R markdown (notebook) and can conveniently yield outputs in various formats (ipynb, PDF, html, etc.).

KDD2017R - Tutorial on Scaling R at KDD 2017

The first section of this half-day tutorial covers in-database advanced analytics in SQL Server 2016 with Microsoft R. You will be using a Jupyter notebook with an R kernel running on an Azure virtual machine. The Jupyter notebook will connect to a SQL Server hosted on another Azure virtual machine. Both the Jupyter Notebook server and SQL Server virtual machines have been created for you. You will need the information on the paper slip handed out to you when you enter the tutorial room. Since multiple users will be using the same Jupyter Notebook server (10 servers created) and the same SQL Server (5 servers created), please follow the steps below as closely as you can to minimize interference with other users on the same machine. Step 1. Open https://<ip address>:9999 in a browser and ignore the security warnings.

LearnAnalytics-mr4ds - R and Microsoft R Workflows for Data Science

Welcome to the Microsoft R for Data Science Course Repository. You can find the latest materials from the workshop here, and links to course materials from prior iterations of the course can be found in the version pane. While this course is intended for data scientists and analysts interested in the Microsoft R programming stack (i.e., Microsoft employees in the Algorithms and Data Science group), other programmers might find the material useful as well. Please refer to the course syllabus for the full syllabus. The goal of this course is to cover the following modules, although some of the later modules may be replaced by a hackathon or office hours.

doAzureParallel - An R package that allows users to submit parallel workloads in Azure

The doAzureParallel package is a parallel backend for the widely popular foreach package. With doAzureParallel, each iteration of the foreach loop runs in parallel on an Azure Virtual Machine (VM), allowing users to scale up their R jobs to tens or hundreds of machines. doAzureParallel is built to support the foreach parallel computing package. The foreach package supports parallel execution: it can execute multiple processes across some parallel backend. With just a few lines of code, the doAzureParallel package helps create a cluster in Azure, register it as a parallel backend, and seamlessly connect to the foreach package.
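The register-a-backend pattern described above can be illustrated with a small language-neutral analogy. The sketch below is in Python and is not doAzureParallel's or foreach's API: `register_backend`, `foreach`, and `simulate` are hypothetical names, and a local thread pool stands in for the Azure cluster.

```python
from concurrent.futures import ThreadPoolExecutor

# Currently registered parallel backend (None means run sequentially).
_backend = None


def register_backend(executor):
    """Register an executor as the backend used by foreach (hypothetical helper)."""
    global _backend
    _backend = executor


def foreach(items, body):
    """Apply body to every item, in parallel if a backend is registered."""
    if _backend is None:
        return [body(x) for x in items]
    # Executor.map preserves input order, like collecting loop results in order.
    return list(_backend.map(body, items))


def simulate(seed):
    """Stand-in for one loop iteration's work."""
    return seed * seed


# A local thread pool plays the role of the remote cluster in this sketch.
with ThreadPoolExecutor(max_workers=4) as pool:
    register_backend(pool)
    results = foreach(range(5), simulate)
register_backend(None)

print(results)
```

The loop body stays the same whether or not a backend is registered, which is the property that lets a loop move from one machine to many without being rewritten.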

LearnAnalytics-mrs-spark - Course Materials for Microsoft R Server on Spark

This is a one-day workshop on using R Server on Spark. Students can find course modules as rmarkdown documents in the Student-Resources directory. Instructions on how to deliver the course can be found in the Instructor-Resources directory. Students are usually expected to have already completed the Microsoft R for Data Science Course. We will run our R scripts using the RStudio IDE. To launch RStudio in your browser, from the cluster overview in the Azure portal, click "R Server dashboards" and then "R Studio server". At the first login screen, enter "admin" and the password you supplied. At the second login screen, enter "sshuser" and the password you supplied.

vscode-tools-for-ai - VS Code Tools for AI is an extension to build, test, and deploy Deep Learning / AI solutions

Visual Studio Code Tools for AI is an extension to build, test, and deploy Deep Learning / AI solutions. It integrates with Azure Machine Learning for robust experimentation capabilities, including but not limited to submitting data preparation and model training jobs transparently to different compute targets. Additionally, it provides support for custom metrics and run history tracking, enabling data science reproducibility and auditing. Enterprise-ready collaboration lets you work securely on projects with other people. Get started with deep learning using Microsoft Cognitive Toolkit (CNTK), Google TensorFlow, or other deep-learning frameworks today.

connectthedots - Connect tiny devices to Microsoft Azure services to build IoT solutions

connectthedots is an open source project created by Microsoft to help you get tiny devices connected to Microsoft Azure IoT and to implement great IoT solutions that take advantage of Microsoft Azure advanced analytics services such as Azure Stream Analytics and Azure Machine Learning. The project is built on the assumption that the sensors get the raw data and format it into a JSON string. That string is then sent to Azure IoT Hub, from which a web app gathers the data and displays it as a chart. Other optional functions of the Azure cloud include detecting and displaying alerts and averages, although these are not required.
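As a rough illustration of that data flow, the sketch below formats sensor readings as JSON strings and then computes an average and a simple alert from them. It runs locally and does not call Azure IoT Hub. The field names (`deviceId`, `measure`, `value`, `unit`), the threshold, and both helper functions are assumptions made for illustration, not the project's actual message schema.

```python
import json
from statistics import mean

# Hypothetical alert threshold; the project does not specify one here.
ALERT_THRESHOLD = 30.0


def format_reading(device_id, measure, value, unit):
    """Serialize one raw sensor reading as a JSON string (assumed schema)."""
    return json.dumps(
        {"deviceId": device_id, "measure": measure, "value": value, "unit": unit}
    )


def summarize(messages, threshold=ALERT_THRESHOLD):
    """Parse JSON messages and compute the average plus readings above threshold."""
    readings = [json.loads(m) for m in messages]
    values = [r["value"] for r in readings]
    alerts = [r for r in readings if r["value"] > threshold]
    return {"average": mean(values), "alerts": alerts}


# Three simulated temperature readings from one device.
messages = [
    format_reading("sensor-1", "temperature", v, "C") for v in (20.0, 22.0, 36.0)
]
summary = summarize(messages)

print(summary["average"])
print([a["value"] for a in summary["alerts"]])
```

In a real deployment the strings would be sent to the hub, and the aggregation would run in a cloud service rather than in the same process.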

BigDataR_Examples - Data Science and Machine Learning Examples for Data Science Linux

Data Science and Machine Learning Examples for Data Science Linux

Microsoft-TDSP - Repository for Microsoft Team Data Science Process containing documents and scripts

For the execution of data science projects, TDSP provides guidelines on how to structure collaborative teams and tasks and how to run projects using Agile planning and version control. To perform certain stages of a data science project in an efficient and semi-automated manner, TDSP also provides data exploration and (semi-)automated modeling tools in R and Python. These also produce standardized reports or artifacts.

workshops-norwich-2013-09 - Workshop page for Data Science in R at Norwich

Welcome to the GitHub repository for the rOpenSci workshop on data science using R. For this workshop you will be using the current version of R (3.0.1) along with the RStudio integrated development environment (IDE). We will provide a hosted version of both for this workshop. Note: The server is not always on, but if you have an Amazon account you can spin up an instance of our machine image at any time. Email us if you're interested.

featran - A Scala feature transformation library for data science and machine learning

Featran, also known as Featran77 or F77 (get it?), is a Scala library for feature transformation. It aims to simplify the time-consuming task of feature engineering in data science and machine learning processes. It supports various collection types for feature extraction and output formats for feature representation. We can implement this in a naive way using reduce and map.
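The excerpt does not say which transformation "this" refers to. As a hedged sketch, assume it is a typical feature transformation such as min-max scaling: a reduce pass collects the summary statistics, and a map pass applies them. The example is plain Python, not Scala, and does not use Featran's API; `min_max_scale` is a hypothetical helper.

```python
from functools import reduce


def min_max_scale(values):
    """Naive min-max scaling: reduce to find (min, max), then map to rescale."""
    # Reduce pass: fold the data into a (min, max) pair.
    lo, hi = reduce(
        lambda acc, x: (min(acc[0], x), max(acc[1], x)),
        values,
        (float("inf"), float("-inf")),
    )
    span = hi - lo
    # Map pass: rescale each value into [0, 1]; constant input maps to 0.0.
    return list(map(lambda x: (x - lo) / span if span else 0.0, values))


scaled = min_max_scale([10.0, 20.0, 40.0])
print(scaled)
```

The naive version needs a separate reduce and map per feature, which is the kind of bookkeeping a feature transformation library aims to take over.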