| title | Lesson 3 Train and save a model using R and T-SQL (SQL Server Machine Learning) | Microsoft Docs |
|---|---|
| description | Tutorial showing how to embed R in SQL Server stored procedures and T-SQL functions |
| ms.prod | sql |
| ms.technology | machine-learning |
| ms.date | 10/29/2018 |
| ms.topic | tutorial |
| author | HeidiSteen |
| ms.author | heidist |
| manager | cgronlun |
[!INCLUDEappliesto-ss-xxxx-xxxx-xxx-md-winonly]
This article is part of a tutorial for SQL developers on how to use R in SQL Server.
In this lesson, you'll learn how to train a machine learning model by using R. You'll train the model using the data features you created in the previous lesson, and then save the trained model in a [!INCLUDEssNoVersion] table. In this case, the R packages are already installed with [!INCLUDErsql_productname], so everything can be done from SQL.
When calling R from T-SQL, you use the system stored procedure, sp_execute_external_script. However, for processes that you repeat often, such as retraining a model, it is easier to encapsulate the call to sp_execute_exernal_script in another stored procedure.
-
In [!INCLUDEssManStudio], open a new Query window.
-
Run the following statement to create the stored procedure RxTrainLogitModel. This stored procedure defines the input data and uses rxLogit from RevoScaleR to create a logistic regression model.
CREATE PROCEDURE [dbo].[RxTrainLogitModel] AS BEGIN DECLARE @inquery nvarchar(max) = N' select tipped, fare_amount, passenger_count,trip_time_in_secs,trip_distance, pickup_datetime, dropoff_datetime, dbo.fnCalculateDistance(pickup_latitude, pickup_longitude, dropoff_latitude, dropoff_longitude) as direct_distance from nyctaxi_sample tablesample (70 percent) repeatable (98052) ' -- Insert the trained model into a database table INSERT INTO nyc_taxi_models EXEC sp_execute_external_script @language = N'R', @script = N' ## Create model logitObj <- rxLogit(tipped ~ passenger_count + trip_distance + trip_time_in_secs + direct_distance, data = InputDataSet) summary(logitObj) ## Serialize model and put it in data frame trained_model <- data.frame(model=as.raw(serialize(logitObj, NULL))); ', @input_data_1 = @inquery, @output_data_1_name = N'trained_model' ; END GO
-To ensure that some data is left over to test the model, 70% of the data are randomly selected from the taxi data table for training purposes.
-
The SELECT query uses the custom scalar function fnCalculateDistance to calculate the direct distance between the pick-up and drop-off locations. The results of the query are stored in the default R input variable,
InputDataset. -
The R script calls the rxLogit function, which is one of the enhanced R functions included with [!INCLUDErsql_productname], to create the logistic regression model.
The binary variable tipped is used as the label or outcome column, and the model is fit using these feature columns: passenger_count, trip_distance, trip_time_in_secs, and direct_distance.
-
The trained model, saved in the R variable
logitObj, is serialized and put in a data frame for output to [!INCLUDEssNoVersion]. That output is inserted into the database table nyc_taxi_models, so that you can use it for future predictions.
-
Because the stored procedure already includes a definition of the input data, you don't need to provide an input query.
-
To generate the R model, call the stored procedure without any other parameters:
EXEC RxTrainLogitModel
-
Watch the Messages window of [!INCLUDEssManStudio] for messages that would be piped to R's stdout stream, like this message:
"STDOUT message(s) from external script: Rows Read: 1193025, Total Rows Processed: 1193025, Total Chunk Time: 0.093 seconds"
You might also see messages specific to the individual function,
rxLogit, displaying the variables and test metrics generated as part of model creation. -
When the statement has completed, open the table nyc_taxi_models. Processing of the data and fitting the model might take a while.
You can see that one new row has been added, which contains the serialized model in the column model.
model ------ 0x580A00000002000302020....
In the next step you'll use the trained model to generate predictions.
Lesson 4: Predict potential outcomes using an R model in a stored procedure