Skip to content

Latest commit

 

History

History
91 lines (68 loc) · 3.62 KB

File metadata and controls

91 lines (68 loc) · 3.62 KB
title Plot a histogram for data exploration with Python
titleSuffix SQL machine learning
description Learn how to create a histogram to visualize data using Python.
author dphansen
ms.author davidph
ms.date 07/14/2020
ms.topic how-to
ms.prod sql
ms.technology machine-learning
monikerRange >=sql-server-2017||>=sql-server-linux-ver15||=azuresqldb-mi-current||=azuresqldb-current||=sqlallproducts-allversions

Plot histograms in Python

[!INCLUDESQL Server SQL DB SQL MI]

This article describes how to plot data using the Python package pandas'.hist(). A SQL database is the source used to visualize the histogram data intervals that have consecutive, non-overlapping values.

Prerequisites:

::: moniker range=">=sql-server-2017||>=sql-server-linux-ver15"

::: moniker range="=azuresqldb-current"

::: moniker range="=azuresqldb-mi-current"

Verify restored Database

You can verify that the restored database exists by querying the Person.CountryRegion table:

USE AdventureWorksDW;
SELECT * FROM Person.CountryRegion;

Install Python packages

Download and Install Azure Data Studio.

Install the following Python packages:

  • pyodbc
  • pandas

To install these packages:

  1. In your Azure Data Studio notebook, select Manage Packages.
  2. In the Manage Packages pane, select the Add new tab.
  3. For each of the following packages, enter the package name, click Search, then click Install.

Plot histogram

The distributed data displayed in the histogram is based on a SQL query from AdventureWorksDW. The histogram visualizes data and the frequency of data values. Edit the connection string variables: 'server', 'database', 'username', and 'password' to connect to SQL database.

To create a new notebook:

  1. In Azure Data Studio, select File, select New Notebook.
  2. In the notebook, select kernel Python3, select the +code.
  3. Paste code in notebook, select Run All.
import pyodbc 
import pandas as plt
# Some other example server values are
# server = 'localhost\sqlexpress' # for a named instance
# server = 'myserver,port' # to specify an alternate port
server = 'servername' 
database = 'AdventureWorksDW' 
username = 'yourusername' 
password = 'databasename'  
cnxn = pyodbc.connect('DRIVER={SQL Server};SERVER='+server+';DATABASE='+database+';UID='+username+';PWD='+ password)
cursor = cnxn.cursor()
sql = "SELECT DATEDIFF(year, c.BirthDate, GETDATE()) AS Age FROM [dbo].[FactInternetSales] s INNER JOIN dbo.DimCustomer c ON s.CustomerKey = c.CustomerKey"
df = pd.read_sql(sql, cnxn)
df.hist(bins=10)

The display shows the age distribution of customers in the FactInternetSales table.

Pandas Histogram