| title | R + T-SQL tutorial: Train model |
|---|---|
| description | Tutorial showing how to train, serialize, and save an R model using SQL Server stored procedures and T-SQL functions. |
| ms.prod | sql |
| ms.technology | machine-learning |
| ms.date | 11/16/2018 |
| ms.topic | tutorial |
| author | dphansen |
| ms.author | davidph |
| ms.custom | seo-lt-2019 |
| monikerRange | >=sql-server-2016||>=sql-server-linux-ver15||=sqlallproducts-allversions |
[!INCLUDEappliesto-ss-xxxx-xxxx-xxx-md]
This article is part of a tutorial for SQL developers on how to use R in SQL Server.
In this lesson, you'll learn how to train a machine learning model by using R. You'll train the model using the data features you created in the previous lesson, and then save the trained model in a [!INCLUDEssNoVersion] table. In this case, the R packages are already installed with [!INCLUDErsql_productname], so everything can be done from SQL.
When calling R from T-SQL, you use the system stored procedure, sp_execute_external_script. However, for processes that you repeat often, such as retraining a model, it is easier to encapsulate the call to sp_execute_exernal_script in another stored procedure.
-
In [!INCLUDEssManStudio], open a new Query window.
-
Run the following statement to create the stored procedure RxTrainLogitModel. This stored procedure defines the input data and uses rxLogit from RevoScaleR to create a logistic regression model.
CREATE PROCEDURE [dbo].[RxTrainLogitModel] (@trained_model varbinary(max) OUTPUT) AS BEGIN DECLARE @inquery nvarchar(max) = N' select tipped, fare_amount, passenger_count,trip_time_in_secs,trip_distance, pickup_datetime, dropoff_datetime, dbo.fnCalculateDistance(pickup_latitude, pickup_longitude, dropoff_latitude, dropoff_longitude) as direct_distance from nyctaxi_sample tablesample (70 percent) repeatable (98052) ' EXEC sp_execute_external_script @language = N'R', @script = N' ## Create model logitObj <- rxLogit(tipped ~ passenger_count + trip_distance + trip_time_in_secs + direct_distance, data = InputDataSet) summary(logitObj) ## Serialize model trained_model <- as.raw(serialize(logitObj, NULL)); ', @input_data_1 = @inquery, @params = N'@trained_model varbinary(max) OUTPUT', @trained_model = @trained_model OUTPUT; END GO
-
To ensure that some data is left over to test the model, 70% of the data are randomly selected from the taxi data table for training purposes.
-
The SELECT query uses the custom scalar function fnCalculateDistance to calculate the direct distance between the pick-up and drop-off locations. The results of the query are stored in the default R input variable,
InputDataset. -
The R script calls the rxLogit function, which is one of the enhanced R functions included with [!INCLUDErsql_productname], to create the logistic regression model.
The binary variable tipped is used as the label or outcome column, and the model is fit using these feature columns: passenger_count, trip_distance, trip_time_in_secs, and direct_distance.
-
The trained model, saved in the R variable
logitObj, is serialized and returned as an output parameter.
-
Because the stored procedure already includes a definition of the input data, you don't need to provide an input query.
-
To train and deploy the R model, call the stored procedure and insert it into the database table nyc_taxi_models, so that you can use it for future predictions:
DECLARE @model VARBINARY(MAX); EXEC RxTrainLogitModel @model OUTPUT; INSERT INTO nyc_taxi_models (name, model) VALUES('RxTrainLogit_model', @model);
-
Watch the Messages window of [!INCLUDEssManStudio] for messages that would be piped to R's stdout stream, like this message:
"STDOUT message(s) from external script: Rows Read: 1193025, Total Rows Processed: 1193025, Total Chunk Time: 0.093 seconds"
You might also see messages specific to the individual function,
rxLogit, displaying the variables and test metrics generated as part of model creation. -
When the statement has completed, open the table nyc_taxi_models. Processing of the data and fitting the model might take a while.
You can see that one new row has been added, which contains the serialized model in the column model and the model name RxTrainLogit_model in the column name.
model name ---------------------------- ------------------ 0x580A00000002000302020.... RxTrainLogit_model
In the next step you'll use the trained model to generate predictions.
Lesson 4: Predict potential outcomes using an R model in a stored procedure