| title | In-database Python analytics for SQL developers | Microsoft Docs | ||
|---|---|---|---|
| ms.custom | |||
| ms.date | 10/13/2017 | ||
| ms.reviewer | |||
| ms.suite | sql | ||
| ms.prod | machine-learning-services | ||
| ms.prod_service | machine-learning-services | ||
| ms.component | |||
| ms.technology | |||
| ms.tgt_pltfrm | |||
| ms.topic | tutorial | ||
| applies_to | |||
| dev_langs | |||
| ms.author | heidist | ||
| author | HeidiSteen | ||
| manager | cgronlun |
Applies to: SQL Server (Windows only)
This walkthrough gives SQL programmers hands-on experience building a machine learning solution in Python that runs in SQL Server. You'll learn how to add Python code to stored procedures, and how to run those stored procedures to train models and generate predictions.
[!NOTE] Prefer R? A companion tutorial provides a similar solution in R and can be run in either SQL Server 2016 or SQL Server 2017.
Building a machine learning solution is a complex process that can involve multiple tools and the coordination of subject matter experts across several phases:
- obtaining and cleaning data
- exploring the data and building features useful for modeling
- training and tuning the model
- deploying the model to production
The focus of this walkthrough is on building and deploying a solution using SQL Server.
The data is from the well-known NYC Taxi data set. To keep this walkthrough quick and easy, the data is sampled. You'll create a binary classification model that predicts whether a particular trip is likely to result in a tip, based on columns such as the time of day, distance, and pick-up location.
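As a minimal sketch of what "binary classification" means for this data, the snippet below derives a 0/1 label from a tip amount. The column names (`tip_amount`, `trip_distance`, `pickup_hour`) and the sample rows are hypothetical placeholders chosen for illustration, not the actual schema of the sampled data set.

```python
def label_tipped(trips):
    """Return copies of the trips with a binary 'tipped' label added.

    A trip is labeled 1 when its tip amount is greater than zero, else 0.
    """
    return [{**trip, "tipped": 1 if trip["tip_amount"] > 0 else 0} for trip in trips]


# Hypothetical sample rows; the real data set has more columns and rows.
sample_trips = [
    {"trip_distance": 1.2, "pickup_hour": 8, "tip_amount": 0.0},
    {"trip_distance": 5.4, "pickup_hour": 18, "tip_amount": 2.5},
    {"trip_distance": 0.6, "pickup_hour": 23, "tip_amount": 1.0},
]

labeled = label_tipped(sample_trips)

# Share of trips that received a tip: the quantity a classifier tries to explain.
tip_rate = sum(trip["tipped"] for trip in labeled) / len(labeled)
```

The model built later in the walkthrough predicts this kind of label from the other columns, rather than predicting the tip amount itself.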
All tasks can be done using Transact-SQL stored procedures in the familiar environment of Management Studio.
- Step 1: Download the sample data

  Download the sample dataset and all script files to a local computer.

- Step 2: Import data to SQL Server using PowerShell

  Execute a PowerShell script that creates a database and a table on the specified instance, and loads the sample data to the table.

- Step 3: Explore and visualize the data using Python

  Perform basic data exploration and visualization by calling Python from Transact-SQL stored procedures.

- Step 4: Create data features using Python in T-SQL

  Create new data features using custom SQL functions.

- Step 5: Train and save a Python model using T-SQL

  Build and save the machine learning model, using Python in stored procedures.

  This walkthrough demonstrates how to perform a binary classification task; you could also use the data to build models for regression or multiclass classification.

- Step 6: Operationalize the Python model

  After the model has been saved to the database, call the model for prediction using Transact-SQL.
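The train, save, and predict flow in steps 5 and 6 can be sketched in plain Python. This is an illustration under stated assumptions, not the code provided with the walkthrough: the "model" is a hand-written one-feature decision stump, the function names `train_stump` and `predict` and the column names are hypothetical, and `pickle` is assumed as the serialization mechanism because it produces bytes that could be stored in a binary column.

```python
import pickle


def train_stump(rows):
    """Fit a one-feature decision stump on trip distance.

    Picks the threshold that classifies the most training rows correctly,
    predicting a tip when trip_distance >= threshold.
    """
    best_threshold, best_correct = None, -1
    for candidate in sorted({row["trip_distance"] for row in rows}):
        correct = sum(
            (row["trip_distance"] >= candidate) == bool(row["tipped"]) for row in rows
        )
        if correct > best_correct:
            best_threshold, best_correct = candidate, correct
    return {"feature": "trip_distance", "threshold": best_threshold}


def predict(model, row):
    """Return 1 if the model predicts a tip for this row, else 0."""
    return 1 if row[model["feature"]] >= model["threshold"] else 0


# Hypothetical training rows standing in for the sampled taxi data.
training_rows = [
    {"trip_distance": 0.5, "tipped": 0},
    {"trip_distance": 1.0, "tipped": 0},
    {"trip_distance": 2.5, "tipped": 1},
    {"trip_distance": 6.0, "tipped": 1},
]

# Step 5 (sketch): train, then serialize the model to bytes for storage.
model = train_stump(training_rows)
model_bytes = pickle.dumps(model)

# Step 6 (sketch): load the stored bytes and score a new row.
restored = pickle.loads(model_bytes)
prediction = predict(restored, {"trip_distance": 3.0})
```

Separating training from scoring this way means the stored bytes are the only thing the prediction step needs, which is what allows the model to be called later from a stored procedure.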
- Install an instance of SQL Server 2017 with Machine Learning Services and Python enabled. For more information, see Install SQL Server 2017 Machine Learning Services (In-Database).
- The login that you use for this walkthrough must have permissions to create databases and other objects, to upload data, select data, and run stored procedures.
You should be familiar with fundamental database operations, such as creating databases and tables, importing data into tables, and creating SQL queries.
An experienced SQL programmer should be able to complete this walkthrough by using Transact-SQL in SQL Server Management Studio or by running the provided PowerShell scripts.
Python: Basic knowledge is useful but not required. All Python code is provided.
PowerShell: Some knowledge is helpful.
This tutorial assumes that you've reached the deployment phase: you have been given clean data, complete T-SQL code for feature engineering, and working Python code. You can therefore complete the tutorial using SQL Server Management Studio, or any other tool that supports running SQL statements.
If you want to develop and test your own Python code, or debug a Python solution, we recommend using a dedicated development environment:
- Visual Studio 2017 supports both R and Python. We recommend the Data Science workload, which also supports R and F#.
- If you have an earlier version of Visual Studio, Python Extensions for Visual Studio makes it easy to manage multiple Python environments.
- PyCharm is a popular IDE among Python developers.
[!NOTE] In general, avoid writing or testing new Python code in SQL Server Management Studio. If the code that you embed in a stored procedure has any problems, the information returned from the stored procedure is usually inadequate to diagnose the cause of the error.
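One way to act on this advice is to exercise a function in a local Python environment and capture the full traceback before embedding the code in a stored procedure. The sketch below uses only the standard library; `run_with_diagnostics` and `average_tip` are hypothetical names introduced for illustration.

```python
import traceback


def run_with_diagnostics(func, *args):
    """Run func(*args) and return (result, None), or (None, traceback text) on error."""
    try:
        return func(*args), None
    except Exception:
        # format_exc() returns the full traceback of the exception being handled.
        return None, traceback.format_exc()


def average_tip(tip_amounts):
    """Hypothetical function under test: mean of a list of tip amounts."""
    return sum(tip_amounts) / len(tip_amounts)


# Normal input returns a result and no error text.
ok_result, ok_error = run_with_diagnostics(average_tip, [1.0, 2.0, 3.0])

# An empty list triggers a division by zero; the traceback names the cause.
bad_result, bad_error = run_with_diagnostics(average_tip, [])
```

Running edge cases such as the empty list locally surfaces the exception type and the failing line, which is the detail that tends to be missing when the same code fails inside a stored procedure.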
The following table lists the estimated time required to complete each step:
| Step | Estimated time (h:mm) |
|---|---|
| Download the sample data | 0:15 |
| Import data to SQL Server using PowerShell | 0:15 |
| Explore and visualize the data | 0:20 |
| Create data features using T-SQL | 0:30 |
| Train and save a model using T-SQL | 0:15 |
| Operationalize the model | 0:40 |