Demonstration of KDE using real-world datasets
Photo by Marco Bianchetti on Unsplash
I would like to extend my previous story about Kernel Density Estimator (KDE) by considering multidimensional data.
I will start by giving you a mathematical overview of the topic, after which you will receive Python code to experiment with bivariate KDE. I will then go through some of the properties of the Gaussian kernel. Last but not least, using heights and weights and exoplanets datasets, I will demonstrate how KDE may be applied to real-world data.
1. Mathematical introduction
KDE is a composite function made up of several similar kernel functions. I chose the Gaussian kernel, since it is simple to analyze. This function is the prototype of a multidimensional Gaussian
which is itself an extension of
to many dimensions.
The vector x has a total of d dimensions (features), with the superscript representing the index of the features:
H is a d by d matrix of coefficients that govern the form of the function. Here is a two-dimensional (d = 2) example:
Perhaps you recall that only a function with unit area under the curve can join the PDF club. Therefore, to obtain the multivariate Gaussian kernel function, we must add a few normalizing terms:
You may verify for yourself that inserting d = 1 yields a standard unidimensional Gaussian function.
The matrix H serves as a covariance matrix. In the bivariate case (d = 2) shown above, h₁₁ and h₂₂ correspond to the variances of x⁽¹⁾ and x⁽²⁾, respectively, and h₁₂ = h₂₁ represent the covariance of x⁽¹⁾ with x⁽²⁾. That is why the matrix H is considered to be symmetric. As a consequence, in the bivariate case, the user has three parameters to alter the kernel function.
The kernel function is a customized template function that is applied to each data point in order to build the PDF of the entire dataset using a simple summation:
where xᵢ is the i-th data point:
Don’t worry if all of this maths makes you uneasy. I will offer you Python code to create visualizations that will demonstrate how it works. The main point to remember is this:
The Kernel Density Estimator is a composite function made up of kernel function instances allocated one-to-one to each data point.
2. Python code to bring bivariate KDE to life
Here you have the code that may be used as an experimental platform for bivariate Gaussian kernel and KDE experiments.
Imports come first:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import pandas as pd
import seaborn as sns
# to generate a visually appealing images with dataframes
import dataframe_image as dfi
# to make functions with some arguments fixed
from functools import partial
# Latex typefaces will be used for math symbols in figures.
plt.rcParams.update({
“text.usetex”: True,
“font.family”: “sans-serif”,
“font.sans-serif”: [“Helvetica”]
})
The bivariate Gaussian kernel function K requires a 2 by 2 numpy array H. The function K accepts a grid-like array as an argument x.
def K(x, H):
# unpack two dimensions
x1, x2 = x
# extract four components from the matrix inv(H)
a, b, c, d = np.linalg.inv(H).flatten()
# calculate scaling coeff to shorten the equation
scale = 2*np.pi*np.sqrt( np.linalg.det(H))
return np.exp(-(a*x1**2 + d*x2**2 + (b+c)*x1*x2)/2) / scale
The KDE calls the kernel function K for every data point and accumulates all its outcomes, as stated in f(x). If you don’t intend to call K directly in your application, you can nest its definition within the KDE.
def KDE(x, H, data):
# unpack two dimensions
x1, x2 = x
# prepare the grid for output values
output = np.zeros_like(x1)
# process every sample
for sample in data:
output += K([x1-sample[0], x2-sample[1]], H)
return output
Notice that for a single data point, f(x) equals just K(x).
The last function show_pdf displays two-dimensional function func and adds data points data to it, but the data does not have to be related to the function func. There are two perspectives on PDF: contour and surface plots.
def show_pdf(func, data,
resolution = 100,
contours_density = 8,
surf_density = 40,
figsize=(10,5), cmap=cm.cividis,
aspect=’equal’, margins=’auto’,
s = 40, edgecolor=’black’,
markeralpha=1, markercolor=’white’
):
# range for x and y axis is determined from the dataset
x1_min, x2_min = data.min(axis=0)
x1_max, x2_max = data.max(axis=0)
# plus some extra margins
if margins == ‘auto’:
x1_pad = max(3, int(0.3*(x1_max-x1_min)))
x2_pad = max(3, int(0.3*(x2_max-x2_min)))
else:
x1_pad = int(margins*(x1_max-x1_min))
x2_pad = int(margins*(x2_max-x2_min))
x1_range = np.linspace(start=x1_min-x1_pad,
stop=x1_max+x1_pad, num=resolution)
x2_range = np.linspace(start=x2_min-x2_pad,
stop=x2_max+x2_pad, num=resolution)
X1_grid, X2_grid = np.meshgrid(x1_range, x2_range)
# the given func is called here
Z_grid = func([X1_grid, X2_grid])
# draw a figure
fig = plt.figure(figsize=figsize)
ax = fig.add_subplot(1, 2, 1)
c = ax.contourf(X1_grid, X2_grid, Z_grid,
contours_density, cmap=cmap)
c2 = ax.contour(X1_grid, X2_grid, Z_grid,
contours_density, colors=’black’)
ax.set_xlabel(r’$x^{(1)}$’, fontsize=16, labelpad=7)
ax.set_ylabel(r’$x^{(2)}$’, fontsize=16, rotation=0, labelpad=8)
ax.scatter(data[:,0], data[:,1], marker=’s’,
color=markercolor, s=s, edgecolor=edgecolor, alpha=markeralpha)
ax.set_aspect(aspect)
plt.clabel(c2, inline=True, fontsize=10)
ax = fig.add_subplot(1, 2, 2, projection=’3d’)
ax.plot_surface(X1_grid, X2_grid, Z_grid,
rcount=surf_density,
ccount=surf_density, cmap=cmap)
ax.set_xlabel(r’$x^{(1)}$’, fontsize=14)
ax.set_ylabel(r’$x^{(2)}$’, fontsize=14)
plt.show()
3. Properties of the Gaussian kernel
Let us begin with the simplest case, represented by the identity matrix H:
The origin of the coordinate axes serves as a single data point.
To make use of the provided code, you first have to define the array H and at least one data point (the data array has to be two-dimensional). Then you can call KDE for these data. Take note of the fact that show_pdf accepts a function func as an input and calls it with a grid-like array x as the first parameter. As a result, we invoke the partial method from the functools library, which generates the KDE_partial function that performs the same job as KDE, except that the second parameter H is fixed. This is how I am going to use the code throughout this story.
# covariance matrix
H = [[1, 0],
[0, 1]]
# a single point to make PDF
data = np.array([[0,0]])
# fix arguments ‘H’ and ‘data’ of KDE function for further calls
KDE_partial = partial(KDE, H=np.array(H), data=data)
# draw contour and surface plots
show_pdf(func=KDE_partial, data=data)
The Gaussian kernel is centred around the datapoint.
The Gaussian kernel with 0 covariance.
Let us introduce some correlation by fixing off-diagonal elements to 0.5:
The KDE’s shape becomes slanted and thinner. The semi-major axis runs parallel to the x⁽¹⁾ = x⁽²⁾ line.
The Gaussian kernel with 0.5 covariance.
The KDE becomes more narrower as the covariance coefficient increases. Because the function’s values change more quickly, I increased the figure’s resolution.
H = [[1, 0.9],
[0.9, 1]]
KDE_partial = partial(KDE, H=np.array(H), data=data)
# when the function gets too sharp, boost the resolution
show_pdf(func=KDE_partial, data=data,
resolution=300, surf_density=50)The Gaussian kernel with 0.9 covariance.
You can easily predict what would happen when covariance becomes negative.
The Gaussian kernel with -0.5 covariance.
These examples clearly demonstrate how the Gaussian PDF may follow the correlation structure of the data.
You may be wondering how to rotate the Gaussian PDF. To do this, take a rotation matrix R (I prefer the clockwise version):
and replace H with RHRᵀ. Below is a convienience function that returns the array representing the matrix that performs rotation by α degrees (in radians).
def angle2rotation_matrix(angle):
return np.array([ [np.cos(angle), np.sin(angle)],
[-np.sin(angle), np.cos(angle)] ])
R = angle2rotation_matrix(angle=(-1/10)*np.pi)
Since rotating a symmetric Gaussian makes no sense, I stretch it by changing the diagonal components of the matrix H.
# the first axis scale is expanded twice
H = np.array([[2, 0],
[0, 0.2]])
# rotation
H = R @ H @ R.T
data = np.array([[0,0]])
KDE_partial = partial(KDE, H=np.array(H), data=data)
show_pdf(func=KDE_partial, data=data)
Notice that the first-axis scale is extended twice, while the second scale is shrunk five times, as a result of applying different values to the diagonal elements of matrix H.
Stretched and rotated Gaussian kernel.
4. Heights and Weights dataset
There are several ready-to-use datasets accessible in machine learning repositories. That is why I was astonished to discover that nearly every commonly available dataset containing height and weight columns was synthetically generated. In order to get some real-world data, I requested my students to submit their heights and weights and this dataset is now available in my Github repository.
Let us take a look at these data.
filename = ‘HeightsWeightsGender_dataset.xlsx’
# my Github repo
URL = “https://raw.githubusercontent.com/jdrapala/datasets/main/” + filename
# download the data form URL as a DataFrame
df = pd.read_excel(URL)
# make it look appealing
df_styled = df.sample(7).style.background_gradient(axis=0, gmap=df[‘Weight’], cmap=’cividis’)
# and export to file
dfi.export(df_styled, ‘filename.png’, dpi=300)
# extract numpy array from DataFrame
data = df[[‘Height’,’Weight’]].to_numpy()A sample of height and weight data collected from my students.
Let x⁽¹⁾ denote height and x⁽²⁾ denote weight.
Take the identity matrix H as a first attempt to visualize the dataset.
H = [[1, 0],
[0, 1]]
KDE_partial = partial(KDE, H=np.array(H), data=data)
show_pdf(func=KDE_partial, data=data, aspect=’auto’, margins=0.15)Too tiny kernel was used.
The outcome is terrible; the PDF is peaked over a narrow region surrounding data points and falls to near zero everywhere else.
We make the kernel bigger by multiplying the entire matrix H by a constant s (for size). Take s = 10.
H = [[1, 0],
[0, 1]]
s = 10
KDE_partial = partial(KDE, H=s*np.array(H), data=data)
show_pdf(func=KDE_partial, data=data, aspect=’auto’, margins=0.15)The kernel size should be raised even further.
Individual peaks combine into a lovely PDF, yet it still appears overly detailed. So, increase s to 64 to obtain the following result.
The kernel seems to be of the right size.
This kernel function size appears to be a suitable match for this dataset.
Compare our handmade PDF output with seaborn KDE.
# draw PDF
ax = sns.kdeplot(x=’Height’, y=’Weight’, data=df,
fill=True, cmap=cm.cividis,
bw_adjust=1.1)
# draw datapoints
ax.plot(df[‘Height’], df[‘Weight’], ‘s’, color=’white’,
markersize=6, markeredgecolor=’black’, markeredgewidth=0.7)
plt.show()
What effect would non-zero covariance elements in the matrix H have on PDF? For covariance entries with a value of 0.8, the following PDF is returned.
This appears to be a windy-day variation of the preceding figure.
PDF quality has clearly degraded.
Similar experiments show that the Gaussian kernel with the identity matrix H is sufficient in real-world situations. As a result, everything may come down to selecting the appropriate value for parameter s, which determines the region spanned by the kernel function, exactly as bandwidth h in the univariate case.
As an exercise, consider creating a PDF representation for a larger set of height and weight data that I gathered during my many medical data analysis research projects. However, bear in mind that these are data from people with heart disease, diabetes, and other conditions, so proceed with caution when drawing conclusions for the general population. I also have some concerns about the quality of the data because it was collected from many hospitals.
Heights and weights data collected from various hospitals.
5. Exoplanets dataset
KDE is especially useful with multimodally scattered data. Therefore, may I present you another dataset. And I am sure that we are all bored with the Iris dataset.
This dataset was downloaded directly from the Exoplanet Orbit Database webpage. Because the file is large and contains mixed data types, I had to set low_memory=False in the read_csv method.
I picked two columns: the exoplanet’s mass (measured in Jupyter masses) and its distance from the parent star (semi-major axis in astronomical units).
URL = “http://exoplanets.org/csv-files/exoplanets.csv”
df = pd.read_csv(URL, low_memory=False)
# drop rows containing missing values
df = df[[‘NAME’, ‘MSINI’, ‘A’]].dropna()
# i don’t like orignal columns names
df = df.rename(columns={‘NAME’: ‘Name’,
‘MSINI’:’Mass’,
‘A’: ‘Distance’}).set_index(‘Name’).astype(float)
# some masses are not missing but equal to zero, let’s get rid of them too
df = df.drop(df[df[‘Mass’] == 0].index)
# # make it look appealing
df_styled = df.sample(7).style.background_gradient(axis=0,
gmap=df[‘Mass’],
cmap=’cividis’)
dfi.export(df_styled, ‘filename.png’, dpi=300)A sample of exoplanets dataset.
Let us take a quick look at how the data are distributed on the logarithmic scale.
with plt.style.context(‘seaborn’):
sns.scatterplot(data=df, x=’Mass’, y=’Distance’)
plt.xscale(‘log’)
plt.yscale(‘log’)
plt.show()Exoplanets data on the logarithmic scale.
KDE produces a visually appealing PDF for s = 0.7, using these data.
# logarithmic scale is more appropriate for this data
data = np.log(df[[‘Mass’,’Distance’]].to_numpy())
H = [[1, 0],
[0, 1]]
# size of the kernel
s = 0.7
KDE_partial = partial(KDE, H=s*np.array(H), data=data)
show_pdf(func=KDE_partial, data=data, aspect=’auto’,
markeralpha=1, s=1.5, margins=0.15)The exoplanets data is divided into three clusters, although the one on the left is noticeably sparser.
I recommend that you try the KernelDensity method in the Scikit-learn library, which allows you to easily generate synthetic data after the KDE has been fitted to the data.
from sklearn.neighbors import KernelDensity
# fit KDE to the dataset
kde = KernelDensity(kernel=’gaussian’, bandwidth=0.4).fit(data)
# Generate random samples from the model
synthetic_data = kde.sample(1000)
df = pd.DataFrame({‘x1’: synthetic_data[:,0],
‘x2’: synthetic_data[:,1]})
with plt.style.context(‘seaborn’):
sns.scatterplot(df, x=’x1′,y=’x2′)
plt.xlabel(‘Mass’, fontsize=16)
plt.ylabel(‘Distance’, fontsize=16)
plt.show()Synthetically generated exoplanets data using Scikit-learn.
Final Words
Nice visualizations help to learn from data and to draw appropriate conclusions. KDE enables the presentation of data distribution in a visually pleasing manner. As a result, most of its applications in data exploration boil down to a bivariate case.
Keep in mind that calculating the value of KDE at a single point requires processing of all data points, which can be time-consuming in massive calculations. In such a case, you should consider using Gaussian mixture models instead. But that’s another story …
Acknowledgement
This research has made use of the Exoplanet Orbit Database and the Exoplanet Data Explorer at exoplanets.org.
Unless otherwise noted, all images are by the author.
References
[1] Kristan, M., Leonardis, A., Skočaj, D. Multivariate online kernel density estimation with Gaussian kernels (2011). Pattern recognition, 44(10–11), pp. 2630–2642.
[2] S. Węglarczyk, Kernel density estimation and its application (2018), ITM web of conferences, vol. 23, EDP Sciences.
[3] Han, E., Wang, S. X., Wright, J. T., Feng, Y. K., Zhao, M., Fakhouri, O., Brown, J. I., Hancock, C. Exoplanet orbit database. II. Updates to exoplanets.org (2014). Publications of the Astronomical Society of the Pacific, 126(943), 827.
Kernel Density Estimator for Multidimensional Data was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.