Commit 5a6ea6f

Merge pull request #2319 from jeannt/nativescore
revoscalepy code change
2 parents 4ee9b8b + be2440f commit 5a6ea6f

8 files changed

Lines changed: 418 additions & 420 deletions

docs/advanced-analytics/real-time-scoring.md

Lines changed: 112 additions & 91 deletions
Large diffs are not rendered by default.

docs/advanced-analytics/tutorials/rtsql-create-a-predictive-model-r.md

Lines changed: 28 additions & 27 deletions
Original file line number | Diff line number | Diff line change
@@ -1,7 +1,7 @@
11
---
22
title: "Create a Predictive Model (R in T-SQL Tutorial) | Microsoft Docs"
33
ms.custom: ""
4-
ms.date: "03/10/2017"
4+
ms.date: "07/03/2017"
55
ms.prod: "sql-server-2016"
66
ms.reviewer: ""
77
ms.suite: ""
@@ -18,36 +18,40 @@ author: "jeannt"
1818
ms.author: "jeannt"
1919
manager: "jhubbard"
2020
---
21-
# Create a Predictive Model (R in T-SQL Tutorial)
21+
# Create an R Model using SQL
22+
2223
In this step, you'll learn how to train a model using R, and then save the model to a table in SQL Server. The model is a simple regression model that predicts the stopping distance of a car based on speed. You'll use the `cars` dataset already included with R, because it is small and easy to understand.
2324

2425
## Create the source data
2526

26-
First, create a table to save the training data.
27+
First, create a table to save the training data.
2728

28-
```
29+
```sql
2930
CREATE TABLE CarSpeed ([speed] int not null, [distance] int not null)
3031
INSERT INTO CarSpeed
3132
EXEC sp_execute_external_script
3233
@language = N'R'
33-
, @script = N'car_speed <- cars;'
34+
, @script = N'car_speed <- cars;'
3435
, @input_data_1 = N''
3536
, @output_data_1_name = N'car_speed'
3637
```
3738

38-
+ If you want to use temporary tables, be aware that some R clients will disconnect sessions between batches.
39+
+ Some people like to use temporary tables, but be aware that some R clients disconnect sessions between batches.
40+
3941
+ Many datasets, small and large, are included with the R runtime. To get a list of datasets installed with R, type `library(help="datasets")` from an R command prompt.
4042

4143
## Create a regression model
4244

43-
The car speed data contains two columns, both numeric, `dist` and`speed`. There are multiple observations of some speeds. From this data, you will create a linear regression model that describes some relationship between car speed and the distance required to stop a car.
45+
The car speed data contains two columns, both numeric, `dist` and `speed`. There are multiple observations of some speeds. From this data, you will create a linear regression model that describes some relationship between car speed and the distance required to stop a car.
4446

4547
The requirements of a linear model are simple:
48+
4649
+ Define a formula that describes the relationship between the dependent variable `distance` and the independent variable `speed`
47-
+ Provide input data to use in training the model
4850

49-
If you need a refresher on linear models, see this tutorial, which describes the process of fitting a linear models using rxLInMod: [Fitting Linear Models](https://msdn.microsoft.com/microsoft-r/scaler-user-guide-linear-model).
51+
+ Provide input data to use in training the model
5052

53+
> [!TIP]
54+
> If you need a refresher on linear models, we recommend this tutorial, which describes the process of fitting a model using rxLinMod: [Fitting Linear Models](https://docs.microsoft.com/r-server/r/how-to-revoscaler-linear-model)
5155
5256
To actually build the model, you define the formula inside your R code, and pass the data as an input parameter.
5357

@@ -68,21 +72,20 @@ BEGIN
6872
END;
6973
GO
7074
```
75+
7176
+ The first argument to rxLinMod is the *formula* parameter, which defines distance as dependent on speed.
7277
+ The input data is stored in the variable `CarsData`, which is populated by the SQL query. If you don't assign a specific name to your input data, the default variable name would be _InputDataSet_.
7378

74-
7579
## Create a table for storing the model
7680

77-
Now you'll store the model so you can retrain or use it for prediction.
78-
79-
The output of an R package that creates a model is usually a **binary object**. Therefore, the table where you store the model must provide a column of **varbinary** type.
81+
Next, store the model so you can retrain or use it for prediction. The output of an R package that creates a model is usually a **binary object**. Therefore, the table where you store the model must provide a column of **varbinary** type.
8082

8183
```sql
8284
CREATE TABLE stopping_distance_models (
8385
model_name varchar(30) not null default('default model') primary key,
8486
model varbinary(max) not null);
8587
```
88+
8689
## Save the model
8790

8891
To save the model, run the following Transact-SQL statement to call the stored procedure, generate the model, and save it to a table.
@@ -92,27 +95,27 @@ INSERT INTO stopping_distance_models (model)
9295
EXEC generate_linear_model;
9396
```
9497

95-
Note that if you run this a second time, you'll get this error:
98+
Note that if you run this code a second time, you get this error:
9699

97-
*Violation of PRIMARY KEY constraint...Cannot insert duplicate key in object dbo.stopping_distance_models*
100+
```
101+
Violation of PRIMARY KEY constraint...Cannot insert duplicate key in object dbo.stopping_distance_models
102+
```
98103

99104
One option for avoiding this error is to update the name for each new model. For example, you could change the name to something more descriptive, and include the model type, the day you created it, and so forth.
100105

101106
```sql
102-
UPDATE stopping_distance_models
107+
UPDATE stopping_distance_models
103108
SET model_name = 'rxLinMod ' + format(getdate(), 'yyyy.MM.HH.mm', 'en-gb')
104109
WHERE model_name = 'default model'
105110
```
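
Another way to avoid the duplicate-key error is to supply a unique name at insert time instead of renaming the row afterwards. The following is only a sketch under stated assumptions: the table variable `@m` is a hypothetical helper that is not part of the tutorial, and the format string adds a day component (`dd`) that the `UPDATE` example does not use.

```sql
-- Hypothetical helper: a table variable that captures the procedure's single result set.
DECLARE @m TABLE (model varbinary(max));

-- Run the stored procedure from the tutorial and capture the serialized model.
INSERT INTO @m
EXEC generate_linear_model;

-- Insert the model under an explicit, timestamped name so that the
-- 'default model' primary key value is never reused.
-- The generated name is 25 characters long, which fits in varchar(30).
INSERT INTO stopping_distance_models (model_name, model)
SELECT 'rxLinMod ' + format(getdate(), 'yyyy.MM.dd.HH.mm', 'en-gb'), model
FROM @m;
```

Because the name includes the minute, two runs within the same minute would still collide. Treat this as an illustration of the naming idea and not as a complete versioning scheme.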
106111

107-
108112
## Output additional variables
109113

110-
Generally, the output of R from the stored procedure [sp_execute_external_script](../../relational-databases/system-stored-procedures/sp-execute-external-script-transact-sql.md) is limited to a single data frame. (This limitation might be removed in future.)
111-
112-
However, you can return outputs of other types, such as scalars, in addition to the data frame.
114+
Generally, the output of R from the stored procedure [sp_execute_external_script](../../relational-databases/system-stored-procedures/sp-execute-external-script-transact-sql.md) is limited to a single data frame. (This limitation might be removed in future.)
113115

114-
For example, suppose you want to train a model but immediately view a table of coefficients from the model. You could create the table of coefficients as the main result set, and output the trained model in a SQL variable. You could immediately re-use the model by callings variable, or you could save it to a table as shown here.
116+
However, you can return outputs of other types, such as scalars, in addition to the data frame.
115117

118+
For example, suppose you want to train a model but immediately view a table of coefficients from the model. You could create the table of coefficients as the main result set, and output the trained model in a SQL variable. You could immediately re-use the model by calling its variable, or you could save it to a table as shown here.
116119

117120
```sql
118121
DECLARE @model varbinary(max), @modelname varchar(30)
@@ -137,20 +140,18 @@ VALUES (' latest model', @model)
137140

138141
![rslq_basictut_coefficients](media/rslq-basictut-coefficients.PNG)
139142

140-
141143
### Summary
142144

143-
Remember these rules for working with SQL parameters and R variables in sp_execute_external_script:
145+
Remember these rules for working with SQL parameters and R variables in `sp_execute_external_script`:
144146

145-
+ All SQL parameters mapped to R script must be listed by name in the _@params_ argument of sp_execute_external_script.
147+
+ All SQL parameters mapped to R script must be listed by name in the _@params_ argument.
146148
+ To output one of these parameters, add the OUTPUT keyword in the _@params_ list.
147149
+ After listing the mapped parameters, provide the mapping, line by line, of SQL parameters to R variables, immediately after the _@params_ list.
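
The rules above can be sketched with a scalar output. This is a minimal illustration and not part of the tutorial: the SQL variable `@rowcount`, the mapped parameter `@n`, and the result set column names are assumptions chosen for the example. It reuses the `CarSpeed` table created earlier.

```sql
-- SQL variable that receives the scalar computed in R (hypothetical name).
DECLARE @rowcount int;

EXEC sp_execute_external_script
    @language = N'R'
    -- The R variable n corresponds to the SQL parameter @n declared in @params.
  , @script = N'n <- nrow(InputDataSet); OutputDataSet <- InputDataSet;'
  , @input_data_1 = N'SELECT [speed], [distance] FROM CarSpeed'
    -- Rules 1 and 2: list the parameter by name and mark it OUTPUT.
  , @params = N'@n int OUTPUT'
    -- Rule 3: map the listed parameter to a SQL variable right after @params.
  , @n = @rowcount OUTPUT
WITH RESULT SETS (([speed] int, [distance] int));

-- The scalar is available to later statements in the same batch.
SELECT @rowcount AS row_count;
```

The data frame still comes back as the main result set, while the scalar travels through the output parameter.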
148150

149-
150-
## Next Step
151+
## Next lesson
151152

152153
Now that you have a model, in the final step, you'll learn how to generate predictions from it and plot the results.
153154

154-
[Predict and Plot from Model](../../advanced-analytics/r-services/predict-and-plot-from-model-r-in-t-sql-tutorial.md)
155+
[Predict and Plot from Model](/rtsql-predict-and-plot-from-model.md)
155156

156157

docs/advanced-analytics/tutorials/rtsql-predict-and-plot-from-model.md

Lines changed: 20 additions & 28 deletions
Original file line numberDiff line numberDiff line change
@@ -1,7 +1,7 @@
11
---
22
title: "Predict and Plot from Model (R in T-SQL Tutorial) | Microsoft Docs"
33
ms.custom: ""
4-
ms.date: "06/29/2017"
4+
ms.date: "07/03/2017"
55
ms.prod: "sql-server-2016"
66
ms.reviewer: ""
77
ms.suite: ""
@@ -18,9 +18,9 @@ author: "jeannt"
1818
ms.author: "jeannt"
1919
manager: "jhubbard"
2020
---
21-
# Predict and Plot from Model (R in T-SQL Tutorial)
21+
# Use SQL to Predict and Plot from an R Model
2222

23-
To score new data, you'll get one of the trained models from the table, and then call a new set of data on which to base predictions.
23+
To perform _scoring_ using new data, you'll get one of the trained models from the table, and then call a new set of data on which to base predictions. Scoring is a term sometimes used in data science to mean generating predictions, probabilities, or other values based on new data fed into a trained model.
2424

2525
## Create the table of new speeds
2626

@@ -38,12 +38,11 @@ VALUES (40), (50), (60), (70), (80), (90), (100)
3838

3939
## Predict stopping distance
4040

41-
4241
By now, your table might contain multiple R models, all built using different parameters or algorithms, or trained on different subsets of data.
4342

4443
![rsql_basictut_listofmodels](media/rsql-basictut-listofmodels.png)
4544

46-
To get predictions based on a specific model, you must write a SQL script that does the following:
45+
To get predictions based on one specific model, you must write a SQL script that does the following:
4746

4847
1. Gets the model you want
4948
2. Gets the new input data
@@ -69,26 +68,23 @@ EXEC sp_execute_external_script
6968
WITH RESULT SETS (([new_speed] INT, [predicted_distance] INT))
7069
```
7170

72-
**Notes**
73-
7471
+ Use a SELECT statement to get a single model from the table, and pass it as an input parameter.
7572
+ After retrieving the model from the table, call the `unserialize` function on the model.
7673
+ Apply the `rxPredict` function with appropriate arguments to the model, and provide the new input data.
77-
+ We used the `str` function while testing to check the schema of data being returned from R. You can always remove the statement later.
78-
+ You can add column names to the output data frame as part of your R script, but here we just used the WITH RESULTS clause.
79-
+ To return columns from the original dataset together with the prediction results, concatenate the source column with the predicted values column as part of your R script, and then return the data frame to SQL Server.
74+
+ In the example, the `str` function is added during the testing phase, to check the schema of data being returned from R. You can remove the statement later.
75+
+ The column names used in the R script are not necessarily passed to the stored procedure output. Here we've used the WITH RESULT SETS clause to define some new column names.
8076

8177
**Results**
8278

8379
![rsql_basictut_scoringresults_smalldata](media/rsql-basictut-scoringresults-smalldata.PNG)
8480

8581
## Perform scoring in parallel
8682

87-
The predictions came back fairly fast on this tiny data set. But suppose you needed to make lots of predictions very fast? There are many ways to speed up operations in SQL Server, more so if the operations are parallelizable. For scoring in particular, one easy way is to add the *@parallel* parameter to `sp_execute_external_script` and set the value to **1**.
83+
The predictions came back fairly fast on this tiny data set. But suppose you needed to make lots of predictions very fast? There are many ways to speed up operations in SQL Server, more so if the operations can be processed in parallel. For scoring in particular, one easy way is to add the *@parallel* parameter to `sp_execute_external_script` and set the value to **1**.
8884

8985
Let's assume that you have obtained a much bigger table of possible car speeds, including hundreds of thousands of values. There are many sample T-SQL scripts from the community to help you generate number tables, so we won't reproduce those here. Let's just say that you have a column containing many integers, and want to use that as input for `speed` in the model.
9086

91-
To do this, just run the same prediction query, but substitute the larger dataset, and add the _@parallel = 1_ parameter.
87+
To do this, just run the same prediction query, but substitute the larger dataset, and add the `@parallel = 1` argument.
9288

9389
```sql
9490
DECLARE @speedmodel varbinary(max) = (select model from [dbo].[stopping_distance_models] where model_name = 'default model');
@@ -108,15 +104,13 @@ EXEC sp_execute_external_script
108104
WITH RESULT SETS (([new_speed] INT, [predicted_distance] INT))
109105
```
110106

111-
**Notes**
112-
113-
+ Parallel execution provides benefits only when working with very large data. Moreover, the SQL query that gets your data must be capable of generating a parallel query plan.
107+
+ Parallel execution generally provides benefits only when working with very large data. The SQL database engine might decide that parallel execution is not needed. Moreover, the SQL query that gets your data must be capable of generating a parallel query plan.
114108

115109
+ When using the option for parallel execution, you **must** specify the output results schema in advance, by using the WITH RESULT SETS clause. Specifying the output schema in advance allows SQL Server to aggregate the results of multiple parallel datasets, which otherwise might have unknown schemas.
116110

117111
+ If you are *training* a model instead of *scoring*, this parameter often won't have an effect. Depending on the model type, model creation might require that all the rows be read before summaries can be created.
118112

119-
To get the benefits of parallel processing when you train your model, we recommend that you use one of the **RevoScaleR** algorithms. These algorithms are designed to distribute processing automatically, even if you don't specify <code>@parallel =1</code> in the call to `sp_execute_external_script`. For guidance on how to get the best performance with RevoScaleR algorithms, see [ScaleR Distributed Computing](https://docs.microsoft.com/r-server/r/how-to-revoscaler-distributed-computing).
113+
+ To get the benefits of parallel processing when you train your model, we recommend that you use one of the **RevoScaleR** algorithms. These algorithms are designed to distribute processing automatically, even if you don't specify `@parallel = 1` in the call to `sp_execute_external_script`. For guidance on how to get the best performance with RevoScaleR algorithms, see [Distributed and parallel computing with ScaleR in Microsoft R](https://docs.microsoft.com/r-server/r/how-to-revoscaler-distributed-computing).
120114

121115
## Create an R plot of the model
122116

@@ -129,10 +123,10 @@ The following example demonstrates how to create a simple graphic using a plotti
129123
```sql
130124
EXECUTE sp_execute_external_script
131125
@language = N'R'
132-
, @script = N'
126+
, @script = N'
133127
imageDir <- ''C:\\temp\\plots'';
134128
image_filename = tempfile(pattern = "plot_", tmpdir = imageDir, fileext = ".jpg")
135-
print(image_filename);
129+
print(image_filename);
136130
jpeg(filename=image_filename, width=600, height = 800);
137131
print(plot(distance~speed, data=InputDataSet, xlab="Speed", ylab="Stopping distance", main = "1920 Car Safety"));
138132
abline(lm(distance~speed, data = InputDataSet));
@@ -143,12 +137,10 @@ The following example demonstrates how to create a simple graphic using a plotti
143137
WITH RESULT SETS ((plot varbinary(max)));
144138
```
145139

146-
**Notes**
147-
148140
+ The `tempfile` function returns a string that can be used as a file name, but the file is not actually generated yet.
149-
+ For arguments to `tempfile`, you can specify a prefix and file extension, as well as a tmpdir. To verify the file name and path, we printed a message using `str()`.
141+
+ For arguments to `tempfile`, you can specify a prefix and file extension, as well as a tmpdir. To verify the file name and path, print the value using `print()`.
150142
+ The `jpeg` function creates an R device with the specified parameters.
151-
+ After you have created the plot, you can add more visual features to it. In this case, a regression line was added using `abline`.
143+
+ After you create the plot, you can add more visual features to it. In this case, a regression line is added using `abline`.
152144
+ When you are done adding plot features, you must close the graphics device using the `dev.off()` function.
153145
+ The `readBin` function takes a file to read, a format specification, and the number of records. The **rb** keyword indicates that the file is binary rather than containing text.
154146

@@ -158,12 +150,12 @@ The following example demonstrates how to create a simple graphic using a plotti
158150

159151
If you want to do some more elaborate plots, using some of the great graphics packages for R, we recommend these articles. Both require the popular **ggplot2** package.
160152

161-
+ [Loan Classification using SQL Server 2016 R Services](https://blogs.msdn.microsoft.com/microsoftrservertigerteam/2016/09/27/loan-classification-using-sql-server-2016-r-services/): End-to-end scenario based on insurance data. Also requires the **reshape** package.
162-
+ [Create Graphs and Plots Using R](https://msdn.microsoft.com/library/mt629162.aspx): Lesson 2 in an end-to-end solution, based on the NYC taxi data.
153+
+ [Loan Classification using SQL Server 2016 R Services](https://blogs.msdn.microsoft.com/microsoftrservertigerteam/2016/09/27/loan-classification-using-sql-server-2016-r-services/): End-to-end scenario based on insurance data. Requires the **reshape** package.
154+
+ [Create Graphs and Plots Using R](/walkthrough-create-graphs-and-plots-using-r.md)
163155

164-
## Conclusion
156+
## Conclusions
165157

166-
Integration of R with SQL Server makes it easier to deploy R solutions at scale, leveraging the best features of R and relational databases, for high-performance data handling and rapid R analytics.
158+
Integration of R with SQL Server makes it easier to deploy R solutions at scale, leveraging the best features of R and relational databases, for high-performance data handling and rapid R analytics. See these additional resources for more R samples:
167159

168160
+ [SQL Server R tutorials](/sql-server-r-tutorials.md)
169161

@@ -175,8 +167,8 @@ Integration of R with SQL Server makes it easier to deploy R solutions at scale,
175167

176168
+ [Tutorials and sample data for Microsoft R](https://docs.microsoft.com/r-server/r/tutorial-introduction)
177169

178-
Learn how to use the new RevoScaleR packages ot create models and transform data.
170+
Learn how to use the new RevoScaleR packages to create models and transform data.
179171

180172
+ [Get Started with MicrosoftML](https://docs.microsoft.com/r-server/r/concept-what-is-the-microsoftml-package)
181173

182-
Learn how to use the fast, scalable machine learning algorithms from Microsoft Research.
174+
Learn more about the fast, scalable machine learning algorithms from Microsoft Research.
