Skip to content

Latest commit

 

History

History
100 lines (65 loc) · 6.33 KB

File metadata and controls

100 lines (65 loc) · 6.33 KB
title Python tutorial: Categorize users
description In this four-part tutorial series, you'll cluster customers, using K-Means, in a SQL database using Python with SQL Server Machine Learning Services.
ms.prod sql
ms.technology machine-learning
ms.devlang python
ms.date 08/30/2019
ms.topic tutorial
author garyericson
ms.author garye
ms.reviewer davidph
ms.custom seo-lt-2019
monikerRange >=sql-server-2017||>=sql-server-linux-ver15||=sqlallproducts-allversions

Tutorial: Categorizing customers using k-means clustering with SQL Server Machine Learning Services

[!INCLUDEappliesto-ss-xxxx-xxxx-xxx-md]

In this four-part tutorial series, you'll use Python to develop and deploy a K-Means clustering model in SQL Server Machine Learning Services to cluster customer data.

In part one of this series, you'll set up the prerequisites for the tutorial and then restore a sample dataset to a SQL database. Later in this series, you'll use this data to train and deploy a clustering model in Python with SQL Server Machine Learning Services.

In parts two and three of this series, you'll develop some Python scripts in an Azure Data Studio notebook to analyze and prepare your data and train a machine learning model. Then, in part four, you'll run those Python scripts inside a SQL database using stored procedures.

Clustering can be explained as organizing data into groups where members of a group are similar in some way. For this tutorial series, imagine you own a retail business. You'll use the K-Means algorithm to perform the clustering of customers in a dataset of product purchases and returns. By clustering customers, you can focus your marketing efforts more effectively by targeting specific groups. K-Means clustering is an unsupervised learning algorithm that looks for patterns in data based on similarities.

In this article, you'll learn how to:

[!div class="checklist"]

  • Restore a sample database into a SQL Server instance

In part two, you'll learn how to prepare the data from a SQL database to perform clustering.

In part three, you'll learn how to create and train a K-Means clustering model in Python.

In part four, you'll learn how to create a stored procedure in a SQL database that can perform clustering in Python based on new data.

Prerequisites

  • SQL Server Machine Learning Services with the Python language option - Follow the installation instructions in the Windows installation guide or the Linux installation guide.

  • Python IDE - This tutorial uses a Python notebook in Azure Data Studio. For more information, see How to use notebooks in Azure Data Studio. You can also use your own Python IDE, such as a Jupyter notebook or Visual Studio Code with the Python extension and the mssql extension.

  • revoscalepy package - The revoscalepy package is included in SQL Server Machine Learning Services. To use the package on a client computer, see Set up a data science client for Python development for options to install this package locally.

    If you're using a Python notebook in Azure Data Studio, follow these additional steps to use revoscalepy:

    1. Open Azure Data Studio
    2. From the File menu, select Preferences and then Settings
    3. Expand Extensions and select Notebook configuration
    4. Under Python Path, enter the path where you installed the libraries (for example, C:\path-to-python-for-mls)
    5. Make sure Use Existing Python is checked
    6. Restart Azure Data Studio

    If you're using a different Python IDE, follow similar steps for your IDE.

  • SQL query tool - This tutorial assumes you're using Azure Data Studio. You can also use SQL Server Management Studio (SSMS).

  • Additional Python packages - The examples in this tutorial series use Python packages that you may or may not have installed. Use the following pip commands to install these packages if necessary.

    pip install matplotlib
    pip install scipy
    pip install sklearn

Restore the sample database

The sample dataset used in this tutorial has been saved to a .bak database backup file for you to download and use. This dataset is derived from the tpcx-bb dataset provided by the Transaction Processing Performance Council (TPC).

  1. Download the file tpcxbb_1gb.bak.

  2. Follow the directions in Restore a database from a backup file in Azure Data Studio, using these details:

    • Import from the tpcxbb_1gb.bak file you downloaded
    • Name the target database "tpcxbb_1gb"
  3. You can verify that the dataset exists after you have restored the database by querying the dbo.customer table:

    USE tpcxbb_1gb;
    SELECT * FROM [dbo].[customer];

Clean up resources

If you're not going to continue with this tutorial, delete the tpcxbb_1gb database from SQL Server.

Next steps

In part one of this tutorial series, you completed these steps:

  • Restore a sample database into a SQL Server instance

To prepare the data for the machine learning model, follow part two of this tutorial series:

[!div class="nextstepaction"] Tutorial: Prepare data to perform clustering in Python with SQL Server Machine Learning Services