| title | Lesson 3: Create Data Features (Data Science End-to-End Walkthrough) | Microsoft Docs | |
|---|---|---|
| ms.custom | ||
| ms.date | 11/22/2016 | |
| ms.prod | sql-server-2016 | |
| ms.reviewer | ||
| ms.suite | ||
| ms.technology |
|
|
| ms.tgt_pltfrm | ||
| ms.topic | article | |
| applies_to |
|
|
| dev_langs |
|
|
| ms.assetid | 4981d4eb-0874-4fe9-82e1-edf99890e27a | |
| caps.latest.revision | 21 | |
| author | jeannt | |
| ms.author | jeannt | |
| manager | jhubbard |
Data engineering is an important part of machine learning. Data often needs to be transformed before you can use it for predictive modeling. If the data does not have the features you need, you can engineer them from existing values.
For this modeling task, rather than using the raw latitude and longitude values of the pickup and drop-off location, you'd like to have the distance in miles between the two locations. To create this feature, you'll compute the direct linear distance between two points, by using the haversine formula.
You'll compare two different methods for creating a feature from data:
- Using R and the rxDataStep function
- Using a custom function in [!INCLUDEtsql]
For both methods, the result of the code is a [!INCLUDEssNoVersion] data source object, featureDataSource, that includes the new numeric feature, direct_distance.
The R language is well-known for its rich and varied statistical libraries, but you still might need to create custom data transformations.
- You'll create a new R function, ComputeDist, to calculate the linear distance between two points specified by latitude and longitude values.
- You'll call the function to transform the data in the [!INCLUDEssNoVersion] data object you created earlier, an save it in a new data source, featureDataSource.
-
Run the following code to create a custom R function, ComputeDist. It takes in two pairs of latitude and longitude values, and calculates the linear distance between them. The function returns a distance in miles.
env <- new.env() env$ComputeDist <- function(pickup_long, pickup_lat, dropoff_long, dropoff_lat){ R <- 6371/1.609344 #radius in mile delta_lat <- dropoff_lat - pickup_lat delta_long <- dropoff_long - pickup_long degrees_to_radians = pi/180.0 a1 <- sin(delta_lat/2*degrees_to_radians) a2 <- as.numeric(a1)^2 a3 <- cos(pickup_lat*degrees_to_radians) a4 <- cos(dropoff_lat*degrees_to_radians) a5 <- sin(delta_long/2*degrees_to_radians) a6 <- as.numeric(a5)^2 a <- a2+a3*a4*a6 c <- 2*atan2(sqrt(a),sqrt(1-a)) d <- R*c return (d) }
- The first line defines a new environment. In R, an environment can be used to encapsulate name spaces in packages and such.
- You can use the
search()function to view the environments in your workspace. To view the objects in a specific environment, typels(<envname>). - The lines beginning with
$env.ComputeDistancecontain the code that defines the haversine formula, which calculates the great-circle distance between two points on a sphere.
Having defined the function, you will apply it to the data to create a new feature column, direct_distance.
-
Create a data source to work with by using the RxSqlServerData constructor.
featureDataSource = RxSqlServerData(table = "features", colClasses = c(pickup_longitude = "numeric", pickup_latitude = "numeric", dropoff_longitude = "numeric", dropoff_latitude = "numeric", passenger_count = "numeric", trip_distance = "numeric", trip_time_in_secs = "numeric", direct_distance = "numeric"), connectionString = connStr)
-
Call the rxDataStep function to apply the
env$ComputeDistfunction to the specified data.start.time <- proc.time() rxDataStep(inData = inDataSource, outFile = featureDataSource, overwrite = TRUE, varsToKeep=c("tipped", "fare_amount", passenger_count", "trip_time_in_secs", "trip_distance", "pickup_datetime", "dropoff_datetime", "pickup_longitude", "pickup_latitude", "dropoff_longitude", "dropoff_latitude") , transforms = list(direct_distance=ComputeDist(pickup_longitude, pickup_latitude, dropoff_longitude, dropoff_latitude)), transformEnvir = env, rowsPerRead=500, reportProgress = 3) used.time <- proc.time() - start.time print(paste("It takes CPU Time=", round(used.time[1]+used.time[2],2)," seconds, Elapsed Time=", round(used.time[3],2), " seconds to generate features.", sep=""))
- The rxDataStep function can modify data in place. The arguments include a character vector of columns to pass through (varsToKeep), and a list that defines transformations.
- Any columns that are transformed are automatically output and therefore do not need to be included in the varsToKeep argument.
- Alternatively, you can specify that all columns in the source be included except for the specified variables, by using the varsToDrop argument.
-
Call rxGetVarInfo to inspect the schema of the new data source:
rxGetVarInfo(data = featureDataSource)
Results
"It takes CPU Time=0.74 seconds, Elapsed Time=35.75 seconds to generate features."
Var 1: tipped, Type: integer
Var 2: fare_amount, Type: numeric
Var 3: passenger_count, Type: numeric
Var 4: trip_time_in_secs, Type: numeric
Var 5: trip_distance, Type: numeric
Var 6: pickup_datetime, Type: character
Var 7: dropoff_datetime, Type: character
Var 8: pickup_longitude, Type: numeric
Var 9: pickup_latitude, Type: numeric
Var 10: dropoff_longitude, Type: numeric
Var 11: dropoff_latitude, Type: numeric
Var 12: direct_distance, Type: numeric
Now you'll create a custom SQL function, ComputeDist, to do the same thing as the R function you just created. The custom SQL function ComputeDist operates on an existing RxSqlServerData data object to create the new distance features from the existing latitude and longitude values.
You'll save the results of the transformation to a [!INCLUDEssNoVersion] data object, featureDataSource, just as you did using R.
-
Define a new custom SQL function, named fnCalculateDistance.
CREATE FUNCTION [dbo].[fnCalculateDistance] (@Lat1 float, @Long1 float, @Lat2 float, @Long2 float) -- User-defined function calculates the direct distance between two geographical coordinates. RETURNS float AS BEGIN DECLARE @distance decimal(28, 10) -- Convert to radians SET @Lat1 = @Lat1 / 57.2958 SET @Long1 = @Long1 / 57.2958 SET @Lat2 = @Lat2 / 57.2958 SET @Long2 = @Long2 / 57.2958 -- Calculate distance SET @distance = (SIN(@Lat1) * SIN(@Lat2)) + (COS(@Lat1) * COS(@Lat2) * COS(@Long2 - @Long1)) --Convert to miles IF @distance <> 0 BEGIN SET @distance = 3958.75 * ATAN(SQRT(1 - POWER(@distance, 2)) / @distance); END RETURN @distance END
- The code for this user-defined SQL function is provided as part of the PowerShell script you ran to create and configure the database. The function should already exist in your database. If it does not exist, use SQL Server Management Studio to generate the function in the same database where the taxi data is stored.
-
Run the following [!INCLUDEtsql] statement from any application that supports [!INCLUDEtsql], just to see how the function works.
SELECT tipped, fare_amount, passenger_count,trip_time_in_secs,trip_distance, pickup_datetime, dropoff_datetime, dbo.fnCalculateDistance(pickup_latitude, pickup_longitude, dropoff_latitude, dropoff_longitude) as direct_distance, pickup_latitude, pickup_longitude, dropoff_latitude, dropoff_longitude FROM nyctaxi_sample
-
To use the custom SQL function in R code, save the feature engineering query in an R variable.
featureEngineeringQuery = "SELECT tipped, fare_amount, passenger_count, trip_time_in_secs,trip_distance, pickup_datetime, dropoff_datetime, dbo.fnCalculateDistance(pickup_latitude, pickup_longitude, dropoff_latitude, dropoff_longitude) as direct_distance, pickup_latitude, pickup_longitude, dropoff_latitude, dropoff_longitude FROM nyctaxi_joined_1_percent tablesample (1 percent) repeatable (98052)"
[!TIP] This query is slightly different from the [!INCLUDEtsql] query used earlier. It has been modified to get a smaller sample of data, to make this walkthrough faster.
-
Use the following lines of code to call the [!INCLUDEtsql] function from your R environment and apply it to the data defined in featureEngineeringQuery.
featureDataSource = RxSqlServerData(sqlQuery = featureEngineeringQuery, colClasses = c(pickup_longitude = "numeric", pickup_latitude = "numeric", dropoff_longitude = "numeric", dropoff_latitude = "numeric", passenger_count = "numeric", trip_distance = "numeric", trip_time_in_secs = "numeric", direct_distance = "numeric"), connectionString = connStr)
-
Now that the new feature is created, call rxGetVarsInfo to create a summary of the feature table.
rxGetVarInfo(data = featureDataSource)
As it turns out, for this particular task, the [!INCLUDEtsql] function approach is faster than the custom R function. Therefore, you'll use the [!INCLUDEtsql] function for these calculations in subsequent steps.
Proceed to the next lesson to learn how to build a predictive model using this data and save the model to a [!INCLUDEssNoVersion] table.
Tip
Very often, feature engineering using [!INCLUDEtsql] will be faster than R. For example, T-SQL includes windowing and ranking functions that are extremely fast in tasks that data scientists frequently perform in R, such as rolling moving averages and ntiles. Choose the most efficient method based on your data and task.
Lesson 4: Build and Save the Model (Data Science End-to-End Walkthrough)
Lesson 2: View and Explore the Data (Data Science End-to-End Walkthrough)