Skip to content

Latest commit

 

History

History
152 lines (113 loc) · 6.79 KB

File metadata and controls

152 lines (113 loc) · 6.79 KB
title Tutorial: Build a clustering model in Python
description In part three of this four-part tutorial series, you'll build a K-Means model to perform clustering in Python with SQL Server Machine Learning Services.
ms.prod sql
ms.technology machine-learning
ms.devlang python
ms.date 08/27/2019
ms.topic tutorial
author garyericson
ms.author garye
ms.reviewer davidph
monikerRange >=sql-server-ver15||=sqlallproducts-allversions

Tutorial: Build a clustering model in Python with SQL Server Machine Learning Services

In part three of this four-part tutorial series, you'll build a K-Means model in Python to perform clustering. In the next part of this series, you'll deploy this model in a SQL database with SQL Server Machine Learning Services.

In this article, you'll learn how to:

[!div class="checklist"]

  • Define the number of clusters for a K-Means algorithm
  • Perform clustering
  • Analyze the results

In part one, you installed the prerequisites and imported the sample database.

In part two, you learned how to prepare the data from a SQL database to perform clustering.

In part four, you'll learn how to create a stored procedure in a SQL database that can perform clustering in Python based on new data.

Prerequisites

  • Part three of this tutorial assumes you have fulfilled the prerequisites of part one, and completed the steps in part two.

Define the number of clusters

To cluster your customer data, you'll use the K-Means clustering algorithm, one of the simplest and most well-known ways of grouping data. You can read more about K-Means in A complete guide to K-means clustering algorithm.

The algorithm accepts two inputs: The data itself, and a predefined number "k" representing the number of clusters to generate. The output is k clusters with the input data partitioned among the clusters.

To determine the number of clusters for the algorithm to use, use a plot of the within groups sum of squares, by number of clusters extracted. The appropriate number of clusters to use is at the bend or "elbow" of the plot.

# Determine number of clusters by using a plot of the within groups sum of squares,
# by number of clusters extracted. 
wss <- (nrow(customer_data) - 1) * sum(apply(customer_data, 2, var))
for (i in 2:20)
    wss[i] <- sum(kmeans(customer_data, centers = i)$withinss)
plot(1:20, wss, type = "b", xlab = "Number of Clusters", ylab = "Within groups sum of squares")

Elbow graph

Based on the graph, it looks like k = 4 would be a good value to try. That k value will group the customers into four clusters.

Perform clustering

In the following R script, you'll use the function rxKmeans, which is the K-Means function in the RevoScaleR package.

# Output table to hold the customer group mappings.
# This is a table where the cluster mappings will be saved in the database.
return_cluster = RxSqlServerData(table = "return_cluster", connectionString = connStr);
# Set the seed for the random number generator for predictability
set.seed(10);
# Generate clusters using rxKmeans and output key / cluster to a table in SQL database
# called return_cluster
clust <- rxKmeans( ~ orderRatio + itemsRatio + monetaryRatio + frequency,
                   customer_returns,
                   numClusters=4,
                   outFile=return_cluster,
                   outColName="cluster",
                   extraVarsToWrite=c("customer"),
                   overwrite=TRUE);

# Read the customer returns cluster table from the database
customer_cluster <- rxDataStep(return_cluster);

Analyze the results

Now that you've done the clustering using K-Means, the next step is to analyze the result and see if you can find any actionable information.

The clust object contains the results from the K-Means clustering.

#Look at the clustering details to analyze results
clust
Call:
rxKmeans(formula = ~orderRatio + itemsRatio + monetaryRatio + 
    frequency, data = customer_returns, outFile = return_cluster, 
    outColName = "cluster", extraVarsToWrite = c("customer"), 
    overwrite = TRUE, numClusters = 4)
Data: customer_returns
Number of valid observations: 37336
Number of missing observations: 0 
Clustering algorithm:  

K-means clustering with 4 clusters of sizes 31675, 671, 2851, 2139
Cluster means:
   orderRatio   itemsRatio monetaryRatio frequency
1 0.000000000 0.0000000000    0.00000000  0.000000
2 0.007451565 0.0000000000    0.04449653  4.271237
3 1.008067345 0.2707821817    0.49515232  1.031568
4 0.000000000 0.0004675082    0.10858272  1.186068
Within cluster sum of squares by cluster:
         1          2          3          4
    0.0000  1329.0160 18561.3157   363.2188

The four cluster means are given using the variables defined in part one:

  • orderRatio = return order ratio (total number of orders partially or fully returned versus the total number of orders)
  • itemsRatio = return item ratio (total number of items returned versus the number of items purchased)
  • monetaryRatio = return amount ratio (total monetary amount of items returned versus the amount purchased)
  • frequency = return frequency

Data mining using K-Means often requires further analysis of the results, and further steps to better understand each cluster, but it can provide some good leads. Here are a couple ways you could interpret these results:

  • Cluster 1 (the largest cluster) seems to be a group of customers that are not active (all values are zero).
  • Cluster 3 seems to be a group that stands out in terms of return behavior.

Clean up resources

If you're not going to continue with this tutorial, delete the tpcxbb_1gb database from your Azure SQL Database server.

From the Azure portal, follow these steps:

  1. From the left-hand menu in the Azure portal, select All resources or SQL databases.
  2. In the Filter by name... field, enter tpcxbb_1gb, and select your subscription.
  3. Select your tpcxbb_1gb database.
  4. On the Overview page, select Delete.

Next steps

In part three of this tutorial series, you completed these steps:

  • Define the number of clusters for a K-Means algorithm
  • Perform clustering
  • Analyze the results

To deploy the machine learning model you've created, follow part four of this tutorial series:

[!div class="nextstepaction"] Tutorial: Deploy a clustering model in Python with SQL Server Machine Learning Services