docs/advanced-analytics/tutorials/rtsql-create-a-predictive-model-r.md
+28 -27 (28 additions & 27 deletions)
@@ -1,7 +1,7 @@
1
1
---
2
2
title: "Create a Predictive Model (R in T-SQL Tutorial) | Microsoft Docs"
3
3
ms.custom: ""
4
-
ms.date: "03/10/2017"
4
+
ms.date: "07/03/2017"
5
5
ms.prod: "sql-server-2016"
6
6
ms.reviewer: ""
7
7
ms.suite: ""
@@ -18,36 +18,40 @@ author: "jeannt"
18
18
ms.author: "jeannt"
19
19
manager: "jhubbard"
20
20
---
21
-
# Create a Predictive Model (R in T-SQL Tutorial)
21
+
# Create an R Model using SQL
22
+
22
23
In this step, you'll learn how to train a model using R, and then save the model to a table in SQL Server. The model is a simple regression model that predicts the stopping distance of a car based on speed. You'll use the `cars` dataset already included with R, because it is small and easy to understand.
+ If you want to use temporary tables, be aware that some R clients will disconnect sessions between batches.
39
+
+ Some people like to use temporary tables, but be aware that some R clients disconnect sessions between batches.
40
+
39
41
+ Many datasets, small and large, are included with the R runtime. To get a list of datasets installed with R, type `library(help="datasets")` from an R command prompt.
40
42
41
43
## Create a regression model
42
44
43
-
The car speed data contains two columns, both numeric, `dist` and`speed`. There are multiple observations of some speeds. From this data, you will create a linear regression model that describes some relationship between car speed and the distance required to stop a car.
45
+
The car speed data contains two columns, both numeric, `dist` and `speed`. There are multiple observations of some speeds. From this data, you will create a linear regression model that describes some relationship between car speed and the distance required to stop a car.
44
46
45
47
The requirements of a linear model are simple:
48
+
46
49
+ Define a formula that describes the relationship between the dependent variable `distance` and the independent variable `speed`
47
-
+ Provide input data to use in training the model
48
50
49
-
If you need a refresher on linear models, see this tutorial, which describes the process of fitting a linear models using rxLInMod: [Fitting Linear Models](https://msdn.microsoft.com/microsoft-r/scaler-user-guide-linear-model).
51
+
+ Provide input data to use in training the model
50
52
53
+
> [!TIP]
54
+
> If you need a refresher on linear models, we recommend this tutorial, which describes the process of fitting a model using rxLinMod: [Fitting Linear Models](https://docs.microsoft.com/r-server/r/how-to-revoscaler-linear-model)
51
55
52
56
To actually build the model, you define the formula inside your R code, and pass the data as an input parameter.
53
57
@@ -68,21 +72,20 @@ BEGIN
68
72
END;
69
73
GO
70
74
```
75
+
71
76
+ The first argument to rxLinMod is the *formula* parameter, which defines distance as dependent on speed.
72
77
+ The input data is stored in the variable `CarsData`, which is populated by the SQL query. If you don't assign a specific name to your input data, the default variable name would be _InputDataSet_.
73
78
74
-
75
79
## Create a table for storing the model
76
80
77
-
Now you'll store the model so you can retrain or use it for prediction.
78
-
79
-
The output of an R package that creates a model is usually a **binary object**. Therefore, the table where you store the model must provide a column of **varbinary** type.
81
+
Next, store the model so you can retrain or use it for prediction. The output of an R package that creates a model is usually a **binary object**. Therefore, the table where you store the model must provide a column of **varbinary** type.
80
82
81
83
```sql
82
84
CREATE TABLE stopping_distance_models (
83
85
model_name varchar(30) not null default('default model') primary key,
84
86
model varbinary(max) not null);
85
87
```
88
+
86
89
## Save the model
87
90
88
91
To save the model, run the following Transact-SQL statement to call the stored procedure, generate the model, and save it to a table.
@@ -92,27 +95,27 @@ INSERT INTO stopping_distance_models (model)
92
95
EXEC generate_linear_model;
93
96
```
94
97
95
-
Note that if you run this a second time, you'll get this error:
98
+
Note that if you run this code a second time, you get this error:
96
99
97
-
*Violation of PRIMARY KEY constraint...Cannot insert duplicate key in object dbo.stopping_distance_models*
100
+
```
101
+
Violation of PRIMARY KEY constraint...Cannot insert duplicate key in object dbo.stopping_distance_models
102
+
```
98
103
99
104
One option for avoiding this error is to update the name for each new model. For example, you could change the name to something more descriptive, and include the model type, the day you created it, and so forth.
100
105
101
106
```sql
102
-
UPDATE stopping_distance_models
107
+
UPDATE stopping_distance_models
103
108
SET model_name ='rxLinMod '+ format(getdate(), 'yyyy.MM.HH.mm', 'en-gb')
104
109
WHERE model_name ='default model'
105
110
```
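Another way to avoid the duplicate key, sketched here as an assumption rather than as part of the tutorial, is to capture the procedure output first and then insert it with an explicit name. The table variable `@trained` and the name format are illustrative; `generate_linear_model` and `stopping_distance_models` are the objects used above.

```sql
-- Sketch only: @trained and the naming pattern are illustrative choices.
DECLARE @trained TABLE (model varbinary(max) NOT NULL);

-- Capture the serialized model returned by the stored procedure
INSERT INTO @trained (model)
EXEC generate_linear_model;

-- Store it under a timestamped name (28 characters, within varchar(30))
INSERT INTO stopping_distance_models (model_name, model)
SELECT 'rxLinMod ' + FORMAT(GETDATE(), 'yyyy.MM.dd HH:mm:ss', 'en-gb'), model
FROM @trained;
```

Because the name is only precise to the second, two runs within the same second would still collide on the primary key.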
106
111
107
-
108
112
## Output additional variables
109
113
110
-
Generally, the output of R from the stored procedure [sp_execute_external_script](../../relational-databases/system-stored-procedures/sp-execute-external-script-transact-sql.md) is limited to a single data frame. (This limitation might be removed in future.)
111
-
112
-
However, you can return outputs of other types, such as scalars, in addition to the data frame.
114
+
Generally, the output of R from the stored procedure [sp_execute_external_script](../../relational-databases/system-stored-procedures/sp-execute-external-script-transact-sql.md) is limited to a single data frame. (This limitation might be removed in the future.)
113
115
114
-
For example, suppose you want to train a model but immediately view a table of coefficients from the model. You could create the table of coefficients as the main result set, and output the trained model in a SQL variable. You could immediately re-use the model by callings variable, or you could save it to a table as shown here.
116
+
However, you can return outputs of other types, such as scalars, in addition to the data frame.
115
117
118
+
For example, suppose you want to train a model but immediately view a table of coefficients from the model. You could create the table of coefficients as the main result set, and output the trained model in a SQL variable. You could immediately re-use the model by calling the variable, or you could save it to a table as shown here.
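A minimal sketch of this pattern follows. It assumes a source table named `CarSpeed` with `speed` and `distance` columns, which does not appear in this diff, and the names `modelbin`, `@model`, and `latest model` are illustrative. The coefficients come back as the result set, while the serialized model is returned through an OUTPUT parameter and then saved to the model table.

```sql
-- Sketch only: CarSpeed and the variable names are assumptions.
DECLARE @model varbinary(max);

EXEC sp_execute_external_script
    @language = N'R',
    @script = N'
        speedmodel <- rxLinMod(distance ~ speed, CarsData);
        # Serialize the model so it can be returned as varbinary
        modelbin <- serialize(speedmodel, NULL);
        # Main result set: one row per coefficient
        OutputDataSet <- data.frame(coefficients(speedmodel));',
    @input_data_1 = N'SELECT [speed], [distance] FROM CarSpeed',
    @input_data_1_name = N'CarsData',
    @params = N'@modelbin varbinary(max) OUTPUT',
    @modelbin = @model OUTPUT
WITH RESULT SETS (([Coefficient] float NOT NULL));

-- Save the returned model under a name other than the default
INSERT INTO stopping_distance_models (model_name, model)
VALUES ('latest model', @model);
```

The R variable named in `@params` (without the `@`) must match the variable assigned in the script, which is how the binary model reaches the T-SQL variable.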
docs/advanced-analytics/tutorials/rtsql-predict-and-plot-from-model.md
+20 -28 (20 additions & 28 deletions)
@@ -1,7 +1,7 @@
1
1
---
2
2
title: "Predict and Plot from Model (R in T-SQL Tutorial) | Microsoft Docs"
3
3
ms.custom: ""
4
-
ms.date: "06/29/2017"
4
+
ms.date: "07/03/2017"
5
5
ms.prod: "sql-server-2016"
6
6
ms.reviewer: ""
7
7
ms.suite: ""
@@ -18,9 +18,9 @@ author: "jeannt"
18
18
ms.author: "jeannt"
19
19
manager: "jhubbard"
20
20
---
21
-
# Predict and Plot from Model (R in T-SQL Tutorial)
21
+
# Use SQL to Predict and Plot from an R Model
22
22
23
-
To score new data, you'll get one of the trained models from the table, and then call a new set of data on which to base predictions.
23
+
To perform _scoring_ using new data, you'll get one of the trained models from the table, and then call a new set of data on which to base predictions. Scoring is a term sometimes used in data science to mean generating predictions, probabilities, or other values based on new data fed into a trained model.
WITH RESULT SETS (([new_speed] INT, [predicted_distance] INT))
70
69
```
71
70
72
-
**Notes**
73
-
74
71
+ Use a SELECT statement to get a single model from the table, and pass it as an input parameter.
75
72
+ After retrieving the model from the table, call the `unserialize` function on the model.
76
73
+ Apply the `rxPredict` function with appropriate arguments to the model, and provide the new input data.
77
-
+ We used the `str` function while testing to check the schema of data being returned from R. You can always remove the statement later.
78
-
+ You can add column names to the output data frame as part of your R script, but here we just used the WITH RESULTS clause.
79
-
+ To return columns from the original dataset together with the prediction results, concatenate the source column with the predicted values column as part of your R script, and then return the data frame to SQL Server.
74
+
+ In the example, the `str` function is added during the testing phase, to check the schema of data being returned from R. You can remove the statement later.
75
+
+ The column names used in the R script are not necessarily passed to the stored procedure output. Here we've used the WITH RESULT SETS clause to define some new column names.
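The scoring steps listed above can be sketched as follows. The hunk does not show the full query, so this is an illustration rather than the file's exact code: the input table `NewCarSpeed` and the R variable names are assumptions, while the model table, the `default model` name, and the result-set schema come from the surrounding text.

```sql
-- Sketch only: NewCarSpeed and the R variable names are assumptions.
DECLARE @speedmodel varbinary(max) =
    (SELECT model FROM [dbo].[stopping_distance_models]
     WHERE model_name = 'default model');

EXEC sp_execute_external_script
    @language = N'R',
    @script = N'
        # Restore the R model object from its binary form
        current_model <- unserialize(as.raw(speedmodel));
        new <- data.frame(NewCarData);
        predicted.distance <- rxPredict(current_model, new);
        # Inspect the schema while testing; remove later
        str(predicted.distance);
        OutputDataSet <- cbind(new, ceiling(predicted.distance));',
    @input_data_1 = N'SELECT speed FROM [dbo].[NewCarSpeed]',
    @input_data_1_name = N'NewCarData',
    @params = N'@speedmodel varbinary(max)',
    @speedmodel = @speedmodel
WITH RESULT SETS (([new_speed] INT, [predicted_distance] INT));
```

Binding the source column to the predictions with `cbind` is what lets the result set return the input speed next to each predicted distance.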
The predictions came back fairly fast on this tiny data set. But suppose you needed to make lots of predictions very fast? There are many ways to speed up operations in SQL Server, more so if the operations are parallelizable. For scoring in particular, one easy way is to add the *@parallel* parameter to `sp_execute_external_script` and set the value to **1**.
83
+
The predictions came back fairly fast on this tiny data set. But suppose you needed to make lots of predictions very fast? There are many ways to speed up operations in SQL Server, more so if the operations can be processed in parallel. For scoring in particular, one easy way is to add the *@parallel* parameter to `sp_execute_external_script` and set the value to **1**.
88
84
89
85
Let's assume that you have obtained a much bigger table of possible car speeds, including hundreds of thousands of values. There are many sample T-SQL scripts from the community to help you generate number tables, so we won't reproduce those here. Let's just say that you have a column containing many integers, and want to use that as input for `speed` in the model.
90
86
91
-
To do this, just run the same prediction query, but substitute the larger dataset, and add the _@parallel = 1_ parameter.
87
+
To do this, just run the same prediction query, but substitute the larger dataset, and add the `@parallel = 1` argument.
92
88
93
89
```sql
94
90
DECLARE @speedmodel varbinary(max) = (select model from [dbo].[stopping_distance_models] where model_name ='default model');
WITH RESULT SETS (([new_speed] INT, [predicted_distance] INT))
109
105
```
110
106
111
-
**Notes**
112
-
113
-
+ Parallel execution provides benefits only when working with very large data. Moreover, the SQL query that gets your data must be capable of generating a parallel query plan.
107
+
+ Parallel execution generally provides benefits only when working with very large data. The SQL database engine might decide that parallel execution is not needed. Moreover, the SQL query that gets your data must be capable of generating a parallel query plan.
114
108
115
109
+ When using the option for parallel execution, you **must** specify the output results schema in advance, by using the WITH RESULT SETS clause. Specifying the output schema in advance allows SQL Server to aggregate the results of multiple parallel datasets, which otherwise might have unknown schemas.
116
110
117
111
+ If you are *training* a model instead of *scoring*, this parameter often won't have an effect. Depending on the model type, model creation might require that all the rows be read before summaries can be created.
118
112
119
-
To get the benefits of parallel processing when you train your model, we recommend that you use one of the **RevoScaleR** algorithms. These algorithms are designed to distribute processing automatically, even if you don't specify <code>@parallel =1</code> in the call to `sp_execute_external_script`. For guidance on how to get the best performance with RevoScaleR algorithms, see [ScaleR Distributed Computing](https://docs.microsoft.com/r-server/r/how-to-revoscaler-distributed-computing).
113
+
+To get the benefits of parallel processing when you train your model, we recommend that you use one of the **RevoScaleR** algorithms. These algorithms are designed to distribute processing automatically, even if you don't specify <code>@parallel =1</code> in the call to `sp_execute_external_script`. For guidance on how to get the best performance with RevoScaleR algorithms, see [Distributed and parallel computing with ScaleR in Microsoft R](https://docs.microsoft.com/r-server/r/how-to-revoscaler-distributed-computing).
120
114
121
115
## Create an R plot of the model
122
116
@@ -129,10 +123,10 @@ The following example demonstrates how to create a simple graphic using a plotti
print(plot(distance~speed, data=InputDataSet, xlab="Speed", ylab="Stopping distance", main = "1920 Car Safety"));
138
132
abline(lm(distance~speed, data = InputDataSet));
@@ -143,12 +137,10 @@ The following example demonstrates how to create a simple graphic using a plotti
143
137
WITH RESULT SETS ((plot varbinary(max)));
144
138
```
145
139
146
-
**Notes**
147
-
148
140
+ The `tempfile` function returns a string that can be used as a file name, but the file is not actually generated yet.
149
-
+ For arguments to `tempfile`, you can specify a prefix and file extension, as well as a tmpdir. To verify the file name and path, we printed a message using `str()`.
141
+
+ For arguments to `tempfile`, you can specify a prefix and file extension, as well as a tmpdir. To verify the file name and path, print a message using `str()`.
150
142
+ The `jpeg` function creates an R device with the specified parameters.
151
-
+ After you have created the plot, you can add more visual features to it. In this case, a regression line was added using `abline`.
143
+
+ After you create the plot, you can add more visual features to it. In this case, a regression line is added using `abline`.
152
144
+ When you are done adding plot features, you must close the graphics device using the `dev.off()` function.
153
145
+ The `readBin` function takes a file to read, a format specification, and the number of records. The **rb** keyword indicates that the file is binary rather than containing text.
154
146
@@ -158,12 +150,12 @@ The following example demonstrates how to create a simple graphic using a plotti
158
150
159
151
If you want to do some more elaborate plots, using some of the great graphics packages for R, we recommend these articles. Both require the popular **ggplot2** package.
160
152
161
-
+[Loan Classification using SQL Server 2016 R Services](https://blogs.msdn.microsoft.com/microsoftrservertigerteam/2016/09/27/loan-classification-using-sql-server-2016-r-services/): End-to-end scenario based on insurance data. Also requires the **reshape** package.
162
-
+[Create Graphs and Plots Using R](https://msdn.microsoft.com/library/mt629162.aspx): Lesson 2 in an end-to-end solution, based on the NYC taxi data.
153
+
+[Loan Classification using SQL Server 2016 R Services](https://blogs.msdn.microsoft.com/microsoftrservertigerteam/2016/09/27/loan-classification-using-sql-server-2016-r-services/): End-to-end scenario based on insurance data. Requires the **reshape** package.
154
+
+[Create Graphs and Plots Using R](/walkthrough-create-graphs-and-plots-using-r.md)
163
155
164
-
## Conclusion
156
+
## Conclusions
165
157
166
-
Integration of R with SQL Server makes it easier to deploy R solutions at scale, leveraging the best features of R and relational databases, for high-performance data handling and rapid R analytics.
158
+
Integration of R with SQL Server makes it easier to deploy R solutions at scale, leveraging the best features of R and relational databases, for high-performance data handling and rapid R analytics. See these additional resources for more R samples:
167
159
168
160
+[SQL Server R tutorials](/sql-server-r-tutorials.md)
169
161
@@ -175,8 +167,8 @@ Integration of R with SQL Server makes it easier to deploy R solutions at scale,
175
167
176
168
+[Tutorials and sample data for Microsoft R](https://docs.microsoft.com/r-server/r/tutorial-introduction)
177
169
178
-
Learn how to use the new RevoScaleR packages ot create models and transform data.
170
+
Learn how to use the new RevoScaleR packages to create models and transform data.
179
171
180
172
+[Get Started with MicrosoftML](https://docs.microsoft.com/r-server/r/concept-what-is-the-microsoftml-package)
181
173
182
-
Learn how to use the fast, scalable machine learning algorithms from Microsoft Research.
174
+
Learn more about the fast, scalable machine learning algorithms from Microsoft Research.