Skip to content

Latest commit

 

History

History
156 lines (114 loc) · 6.41 KB

File metadata and controls

156 lines (114 loc) · 6.41 KB
title Create a Predictive Model (R in T-SQL Tutorial) | Microsoft Docs
ms.custom
ms.date 03/10/2017
ms.prod sql-server-2016
ms.reviewer
ms.suite
ms.technology
r-services
ms.tgt_pltfrm
ms.topic article
dev_langs
R
SQL
ms.assetid 6eb78a80-5791-438f-9ca6-d142ab5d9bb1
caps.latest.revision 11
author jeannt
ms.author jeannt
manager jhubbard

Create a Predictive Model (R in T-SQL Tutorial)

In this step, you'll learn how to train a model using R, and then save the model to a table in SQL Server. The model is a simple regression model that predicts the stopping distance of a car based on speed. You'll use the cars dataset already included with R, because it is small and easy to understand.

Create the source data

First, create a table to save the training data.

CREATE TABLE CarSpeed ([speed] int not null, [distance] int not null)
INSERT INTO CarSpeed
EXEC sp_execute_external_script
        @language = N'R'
 	    , @script = N'car_speed <- cars;'
        , @input_data_1 = N''
        , @output_data_1_name = N'car_speed'
  • If you want to use temporary tables, be aware that some R clients will disconnect sessions between batches.
  • Many datasets, small and large, are included with the R runtime. To get a list of datasets installed with R, type library(help="datasets") from an R command prompt.

Create a regression model

The car speed data contains two columns, both numeric, dist andspeed. There are multiple observations of some speeds. From this data, you will create a linear regression model that describes some relationship between car speed and the distance required to stop a car.

The requirements of a linear model are simple:

  • Define a formula that describes the relationship between the dependent variable speed and the independent variable distance
  • Provide input data to use in training the model

If you need a refresher on linear models, see this tutorial, which describes the process of fitting a linear models using rxLInMod: Fitting Linear Models.

To actually build the model, you define the formula inside your R code, and pass the data as an input parameter.

DROP PROCEDURE IF EXISTS generate_linear_model;
GO
CREATE PROCEDURE generate_linear_model
AS
BEGIN
    EXEC sp_execute_external_script
    @language = N'R'
    , @script = N'lrmodel <- rxLinMod(formula = distance ~ speed, data = CarsData);
        trained_model <- data.frame(payload = as.raw(serialize(lrmodel, connection=NULL)));'
    , @input_data_1 = N'SELECT [speed], [distance] FROM CarSpeed'
    , @input_data_1_name = N'CarsData'
    , @output_data_1_name = N'trained_model'
    WITH RESULT SETS ((model varbinary(max)));
END;
GO
  • The first argument to rxLinMod is the formula parameter, which defines distance as dependent on speed.
  • The input data is stored in the variable CarsData, which is populated by the SQL query. If you don't assign a specific name to your input data, the default variable name would be InputDataSet.

Create a table for storing the model

Now you'll store the model so you can retrain or use it for prediction.

The output of an R package that creates a model is usually a binary object. Therefore, the table where you store the model must provide a column of varbinary type.

CREATE TABLE stopping_distance_models (
	model_name varchar(30) not null default('default model') primary key,
	model varbinary(max) not null);

Save the model

To save the model, run the following Transact-SQL statement to call the stored procedure, generate the model, and save it to a table.

INSERT INTO stopping_distance_models (model)
EXEC generate_linear_model;

Note that if you run this a second time, you'll get this error:

Violation of PRIMARY KEY constraint...Cannot insert duplicate key in object dbo.stopping_distance_models

One option for avoiding this error is to update the name for each new model. For example, you could change the name to something more descriptive, and include the model type, the day you created it, and so forth.

UPDATE stopping_distance_models 
SET model_name = 'rxLinMod ' + format(getdate(), 'yyyy.MM.HH.mm', 'en-gb')
WHERE model_name = 'default model'

Output additional variables

Generally, the output of R from the stored procedure sp_execute_external_script is limited to a single data frame. (This limitation might be removed in future.)

However, you can return outputs of other types, such as scalars, in addition to the data frame.

For example, suppose you want to train a model but immediately view a table of coefficients from the model. You could create the table of coefficients as the main result set, and output the trained model in a SQL variable. You could immediately re-use the model by callings variable, or you could save it to a table as shown here.

DECLARE @model varbinary(max), @modelname varchar(30)
EXEC sp_execute_external_script
    @language = N'R'
    , @script = N'
        speedmodel <- rxLinMod(distance ~ speed, CarsData)
        modelbin <- serialize(speedmodel, NULL)
        OutputDataSet <- data.frame(coefficients(speedmodel));'
    , @input_data_1 = N'SELECT [speed], [distance] FROM CarSpeed'
    , @input_data_1_name = N'CarsData'
    , @params = N'@modelbin varbinary(max) OUTPUT'
    , @modelbin = @model OUTPUT
    WITH RESULT SETS (([Coefficient] float not null));

-- Save the generated model
INSERT INTO [dbo].[stopping_distance_models] (model_name, model)
VALUES (' latest model', @model)

Results

rslq_basictut_coefficients

Summary

Remember these rules for working with SQL parameters and R variables in sp_execute_external_script:

  • All SQL parameters mapped to R script must be listed by name in the @params argument of sp_execute_external_script.
  • To output one of these parameters, add the OUTPUT keyword in the @params list.
  • After listing the mapped parameters, provide the mapping, line by line, of SQL parameters to R variables, immediately after the @params list.

Next Step

Now that you have a model, in the final step, you'll learn how to generate predictions from it and plot the results.

Predict and Plot from Model