| title | Data Science Deep Dive: Using the RevoScaleR Packages \| Microsoft Docs |
|---|---|
| ms.date | 09/27/2016 |
| ms.prod | sql-server-2016 |
| ms.topic | article |
| ms.assetid | c2efb3f2-cad5-4188-b889-15d68b742ef5 |
| caps.latest.revision | 18 |
| author | jeannt |
| ms.author | jeannt |
| manager | jhubbard |
This tutorial is an introduction to the enhanced R packages provided in [!INCLUDErsql_productname]. You will learn how to use the scalable enterprise framework for execution of R packages in [!INCLUDEmsCoName] [!INCLUDEssCurrent]. By using these scalable R functions, a data scientist can build custom R solutions that run in either local or server contexts, to support high-performance big data analytics.
In this tutorial you'll learn how to move data between [!INCLUDEssNoVersion] and your R workstation, analyze and plot data, and create and deploy models.
To illustrate the flexibility and processing power of the ScaleR packages, in this tutorial you'll move data and swap compute contexts frequently.
- Data is initially obtained from CSV files or XDF files. You'll import the data into [!INCLUDEssNoVersion] using the functions in the RevoScaleR package.
- Model training and scoring will be performed in the [!INCLUDEssNoVersion] compute context. You'll create new [!INCLUDEssNoVersion] tables, using the rx functions, to save your scoring results.
- You'll create plots both on the server and in the local compute context.
- To train the model, you will use data already stored in a [!INCLUDEssNoVersion] database. All computations will be performed on the [!INCLUDEssNoVersion] instance.
- You'll extract a subset of data and save it as an XDF file for re-use in analysis on your local workstation.
- New data used during the scoring process is extracted from the [!INCLUDEssNoVersion] database using an ODBC connection. All computations are performed on the local workstation.
- Finally, you'll perform a simulation based on a custom R function, using the server compute context.
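The pattern the steps above rely on can be sketched with the RevoScaleR functions; the connection string below is a hypothetical placeholder, and the dice-roll function is an arbitrary stand-in for a custom simulation, not code from this tutorial:

```r
# A minimal sketch of the compute-context pattern used throughout this
# tutorial. The connection string is a hypothetical placeholder.
library(RevoScaleR)

connStr <- "Driver=SQL Server;Server=myServer;Database=myDB;Trusted_Connection=Yes"

# Define a SQL Server compute context; rx functions then run on the server
sqlCC <- RxInSqlServer(connectionString = connStr, wait = TRUE)
rxSetComputeContext(sqlCC)

# An arbitrary custom R function, executed remotely by wrapping it in rxExec
rollDice <- function() sum(sample(1:6, 2, replace = TRUE))
runs <- rxExec(rollDice, timesToRun = 20)

# Switch back to the local workstation for plotting or XDF work
rxSetComputeContext("local")
```

Switching the compute context changes where the rx functions execute, without changing the analysis code itself.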
This tutorial takes about an hour to complete, not including setup.
This tutorial is intended for data scientists or for people who are already somewhat familiar with R and data science tasks including exploration, statistical analysis, and model tuning. However, all the code is provided, so you can easily run the code and follow along, assuming you have the required server and client environments.
You should also be comfortable with [!INCLUDEtsql] syntax and know how to access a [!INCLUDEssNoVersion] database using [!INCLUDEssManStudioFull] or other database tools, such as Visual Studio.
> [!TIP]
> Save your R workspace between lessons, so that you can easily pick up where you left off.
- **Database server with support for R**

  Install [!INCLUDEssCurrent] and enable SQL Server R Services (In-Database). This process is described in SQL Server 2016 Books Online.
- **Database permissions**

  To run the queries used to train the model, you must have db_datareader privileges on the database where the data is stored.
- **Data science workstation**

  You must install the RevoScaleR packages. The easiest way to do this is to install Microsoft R Server (Standalone) or Microsoft R Client. For more information, see Set Up a Data Science Client.

  > [!NOTE]
  > Other versions of Revolution R Enterprise or Revolution R Open are not supported. An open-source distribution of R, such as R 3.2.2, will not work for this tutorial, because only the ScaleR functions can use remote compute contexts.
- **Additional R packages**

  For this tutorial, you will need to install the following packages: dplyr, ggplot2, ggthemes, reshape2, and e1071. Instructions are provided as part of the tutorial.

  All packages must also be installed on the [!INCLUDEssNoVersion] instance where training will be performed. It is important that the packages be installed in the R package library used by SQL Server, not in a user library. If you do not have permission to install packages in that library, ask a database administrator to add them.
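On the client, the installation step could look like the following sketch; the server-side note in the comment describes the usual layout and may vary by instance:

```r
# Install the supporting open-source packages on the R client.
# Run R with sufficient rights if the library folder requires elevation.
pkgs <- c("dplyr", "ggplot2", "ggthemes", "reshape2", "e1071")
install.packages(pkgs)

# On the SQL Server instance, the same packages must go into the R library
# used by R Services (not a per-user library). A DBA can run install.packages
# from the instance's own R executable so the packages land in that library.
```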
For more information, see Prerequisites for Data Science Walkthroughs (SQL Server R Services).
In general, before you begin to write and run R scripts in your local development environment, you should always take a minute to plan your data usage, and determine where to run each part of the solution for the best performance.
In this tutorial, you'll get experience with the high performance functions for analyzing data, building models, and creating plots that are included with [!INCLUDErsql_productname]. We'll refer to these functions in general as ScaleR, or Microsoft R, to distinguish them from functions derived from other open source R packages. For more information about how Microsoft R differs from open source R, see this Getting Started guide.
A key benefit of using the ScaleR functions is that they support the use of either local or server data sources, and local or remote compute contexts. Therefore, as you work through this tutorial, consider the data strategies you might need to adopt for your own solutions.
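This symmetry between data sources can be sketched as follows; the table name, file path, and connection string are hypothetical placeholders, not objects created in this tutorial:

```r
# Sketch: the same rx analysis functions accept either a SQL Server table
# or a local XDF file as a data source. All names below are hypothetical.
library(RevoScaleR)

connStr <- "Driver=SQL Server;Server=myServer;Database=myDB;Trusted_Connection=Yes"

sqlData <- RxSqlServerData(connectionString = connStr, table = "MyTrainingData")
xdfData <- RxXdfData("C:/data/MyTrainingData.xdf")

# rxSummary works unchanged against either source
rxSummary(~ ., data = sqlData)
rxSummary(~ ., data = xdfData)
```

Because the data source is just an argument, the same analysis script can move between workstation and server with minimal change.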
- What type of analysis do you want to perform? Is it something for your use alone, or will you be sharing the models, results, or charts with others?

  In this tutorial, you'll learn how to move results back and forth between your development environment and the server to facilitate sharing and analysis.
- Does the R package you need support remote execution? All of the functions in the ScaleR packages provided by [!INCLUDErsql_productname] can run in remote compute contexts and can potentially use parallel execution. In contrast, functions from third-party packages might be limited to single-threaded execution. In this tutorial, you'll learn how to switch between local and remote compute contexts to take advantage of server resources when needed. You'll also learn how to wrap standard R functions in rxExec to support remote execution of arbitrary R functions.
- Where is your data, and what are its characteristics? If your data resides locally, you must decide whether to upload all the new data to [!INCLUDEssNoVersion], or to train locally and save only the model to the database. However, when you deploy to production, you might need to train on enterprise data, and use ETL processes to clean and load the data.
- Similar questions apply to scoring data. Will you create the data pipeline for scoring data on your workstation, or will you use enterprise data sources? Should the data cleansing and preparation be performed as part of ETL processes, or are you performing a one-time experiment?

  In this tutorial, you'll learn how to efficiently and securely move data between your local R environment and SQL Server.
- Which compute context should you use? You might want to train your model locally on sampled data and then switch to using server data for testing and production.

  In this tutorial, you'll learn how to move data between SQL Server and R, how to work with data in XDF files, and how to process data in chunks using the ScaleR functions.
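The chunked XDF processing mentioned above can be sketched as follows; the file paths and chunk size are hypothetical placeholders:

```r
# Sketch: import a CSV into the XDF format in chunks, so the data never
# has to fit in memory all at once. File paths are hypothetical.
library(RevoScaleR)

rxImport(inData = "C:/data/input.csv",
         outFile = "C:/data/input.xdf",
         rowsPerRead = 50000,   # read and write 50,000 rows per chunk
         overwrite = TRUE)

# Downstream rx functions also read the XDF file chunk by chunk
rxGetInfo("C:/data/input.xdf", getVarInfo = TRUE)
```

This is what lets the ScaleR functions scale past available RAM on both the workstation and the server.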
Lesson 1: Work with SQL Server Data using R (Data Science Deep Dive)