#### From scratch in python🐍

Image by author.

One of the most simple and genius ML models, in my opinion, is k-Means clustering. It relates to the group of unsupervised learning algorithms and is capable of finding patterns inside an unlabeled dataset. The most pleasant feature is that it lacks complicated math, and basically any high school student can successfully implement and use this method. So in this article I want to share how you can build k-Means algorithm from scratch in python using only *numpy* and *pandas* libraries and apply it to a real world problem — semantic segmentation of satellite imagery.

#### Firstly, let’s talk about the data we have.

In one of my previous articles, I talked about the problem of the Aral Sea shrinkage. As a result, we managed to get remote sensing imagery from MODIS using Google Earth Engine, which strongly indicates that the sea is drying. So I wondered, how can we estimate the change of the water surface between 2000 and 2023 using ML semantic segmentation? The answer is k-Means!

How can a sea disappear? Case study of the Aral Sea using Python and MODIS data.

Before diving into coding, let’s have a look at the data we are going to use in this tutorial. These are two RGB images of the same area with an interval of 23 years, however it’s clear that the land surface properties and atmospheric conditions (clouds, aerosols etc.) are different. That’s why I decided to train two separate k-Means models, one for each image.

Image by author.To follow up the tutorial, you can download and run the notebook **here**.

First of all, let’s import the necessary libraries and upload the data to the notebook:

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

import matplotlib.image as mpimg

img = mpimg.imread(‘MOD_01.jpg’)

img2 = mpimg.imread(‘MOD_24.jpg’)

You can see that the area covered by the images is quite large, so I suggest to zoom in a little:

img = img[140:600,110:500,:]

img2 = img2[140:600,110:500,:]

fig, ax = plt.subplots(ncols=2, figsize=(16,9))

ax[0].imshow(img)

ax[1].imshow(img2)

for i in range(2):

ax[i].set_facecolor(‘black’)

ax[i].set_xticks([])

ax[i].set_yticks([])

ax[0].set_title(‘2000-08-01’, fontsize=26)

ax[1].set_title(‘2023-08-01’, fontsize=26)

plt.show()Image by author.

And the last step before the ML phase, let’s convert our images to *pandas* dataframes (one column for each image channel). I do that for the sake of visibility of my explanations. If you want to get it optimized, it’s better to use *numpy* arrays instead.

df = pd.DataFrame({‘R’: img[:,:, 0].flatten(), ‘G’: img[:,:, 1].flatten(), ‘B’:img[:,:, 2].flatten()})

df2 = pd.DataFrame({‘R’: img2[:,:, 0].flatten(), ‘G’: img2[:,:, 1].flatten(), ‘B’:img2[:,:, 2].flatten()})

### k-Means

So what is the idea behind the algorithm?

Imagine that you judge about the taste of food using two criteria: sweetness and price. Keeping this in mind, I’ll give you a set of possible options to eat:

Image by author.

I bet your brain has already split the options into three clusters: fruits, drinks and bakery. Basically, you unconsciously clustered the 2-dimensional data, which are defined by a pair of values — (sweetness; price).

Image by author.

In the case of** k-Means**, the goal of the algorithm is quite similar — to find a *pre-set* number of cluster, **k**, in n-dimensional space (e.g. besides sweetness and price you want to account for nutrition, health, presence of the food in your fridge, and in this case, n = 5).

### The algorithms includes the following stages:

#### I. Define the number of clusters.

As I mentioned beforehand, **k **in k-Means is the number of clusters you want to get in the end, and you’re supposed to set this value **before** training the model.

#### II. Randomly initialize centroids.

Centroid is an integral part of k-Means. Basically, centroid is a circle with a center, which is defined a set of coordinates, and each centroid represents a cluster. For instance, in our previous example there are 3 centroids.

#### III. Calculate distances and assign clusters.

Now we need to find how far each point is from each centroid. Based on this calculations, we assign each point to the least distant centroid (cluster).

#### IV. Calculate new centroids.

Now each of our clusters contains at least one points, so it’s time to re-calculate the centroids simply by taking mean coordinates across all cluster points.

And that’s it! We repeat steps 2–4 until centroids are not changing.

Image by author.

### Time To Code.

Now let’s wrap this really simple idea behind k-Means into python code.

Reminder: in this task we have **3D** problem, i.e. our **X**, **Y** and **Z** are **Red**, **Green** and **Blue** image channels!def kmeans(data, K, kind):

L = list()

new_centroids = data.sample(K).values

data = distance(data.copy(), new_centroids, kind)

old_centroids = new_centroids.copy()

new_centroids = np.array([data[data.Class == Class][[‘R’, ‘G’, ‘B’]].mean().values for Class in data.loc[:,’C1′:f’C{K}’].columns])

i = 1

print(f’Iteration: {i}tDistance: {abs(new_centroids.mean()-old_centroids.mean())}’)

while abs(new_centroids.mean()-old_centroids.mean())>0.001:

L.append(abs(new_centroids.mean()-old_centroids.mean()))

data = distance(data, new_centroids, kind)

old_centroids = new_centroids.copy()

new_centroids = np.array([data[data.Class == Class][[‘R’, ‘G’, ‘B’]].mean().values for Class in data.loc[:,’C1′:f’C{K}’].columns])

i+=1

print(f’Iteration: {i}tDistance: {abs(new_centroids.mean()-old_centroids.mean())}’)

print(f”k-Means has ended with {i} iteratinons”)

return data, L

On the first stage we create a list **L** to collect all the distances between clusters to visualize them afterwards and randomly sample K points from the dataset to make them our centroids (or alternatively, you can assign random values to the centroids).

L = list()

new_centroids = data.sample(K).values

Now we need to calculate the distances between centroids and data points. There are lots of different distance metrics in Data Science, but let’s focus on the following ones — Euclidean, Manhattan, Chebyshev.

For Euclidean distance:

Image by author.

For Manhattan:

Image by author.

For Chebyshev:

Image by author.

To use this formulas, let’s write a versatile function for **any** number of dimensions:

def distance(data, centroids, kind):

#kind = euclidean, manhattan, chebyshev

#Here we add to the dataframe as many clusters C-ith as needed

cols=list()

for i in range(1,k+1):

if kind==’euclidean’:

data[f’C{i}’] = ((centroids[i-1][0]-data.R)**2+(centroids[i-1][1]-data.G)**2+(centroids[i-1][2]-data.B)**2)**0.5

elif kind==’manhattan’:

data[f’C{i}’] = abs(centroids[i-1][0]-data.R)+abs(centroids[i-1][1]-data.G)+abs(centroids[i-1][2]-data.B)

elif kind==’chebyshev’:

merged=pd.concat([centroids[i-1][0]-data.R, centroids[i-1][1]-data.G, centroids[i-1][2]-data.B], axis=1)

data[f’C{i}’] = merged.max(axis=1)

cols.append(f’C{i}’)

data[‘Class’] = data[cols].abs().idxmin(axis=1) #assigning clusters to points

return data #returning the dataframe with k cluster columns and one Class column with the final cluster

So now we can simply calculate distances and assign a cluster to each data point. Thus, our new centroids became old, so we store them in another variable and recalculate the new ones. To do that we iterate over each cluster and take a mean across all the coordinates (in our case, across RGB channels). Therefore, the variable new_centroids has a shape of **(k,3)**.

data = distance(data.copy(), new_centroids, kind)

old_centroids = new_centroids.copy()

new_centroids = np.array([data[data.Class == Class][[‘R’, ‘G’, ‘B’]].mean().values for Class in data.loc[:,’C1′:f’C{K}’].columns])

Finally, we repeat all these steps until centroids’ coordinates don’t change anymore. I expressed this condition as this: the difference between average cluster coordinates should be less than 0.001. But you can play around with other numbers here.

while abs(new_centroids.mean()-old_centroids.mean())>0.001:

L.append(abs(new_centroids.mean()-old_centroids.mean()))

data = distance(data, new_centroids, kind)

old_centroids = new_centroids.copy()

new_centroids = np.array([data[data.Class == Class][[‘R’, ‘G’, ‘B’]].mean().values for Class in data.loc[:,’C1′:f’C{K}’].columns])

And that’s it. The algorithm is ready to be trained! So let’s set k = 3 and store the results into dictionaries.

k = 3

segmented_1, segmented_2, distances_1, distances_2 = {}, {}, {}, {}

segmented_1[‘euclidean’], distances_1[‘euclidean’] = kmeans(df, k, ‘euclidean’)

segmented_2[‘euclidean’], distances_2[‘euclidean’] = kmeans(df2, k, ‘euclidean’)

segmented_1[‘manhattan’], distances_1[‘manhattan’] = kmeans(df, k, ‘manhattan’)

segmented_2[‘manhattan’], distances_2[‘manhattan’] = kmeans(df2, k, ‘manhattan’)

segmented_1[‘chebyshev’], distances_1[‘chebyshev’] = kmeans(df, k, ‘chebyshev’)

segmented_2[‘chebyshev’], distances_2[‘chebyshev’] = kmeans(df2, k, ‘chebyshev’)

I decided to compare all the distance metrics for this particular task as you can see, and it’s evident that here Manhattan distance was the fastest.

Image by author.

Before visualizing the clusters, let’s convert the clusters names into int type:

d = {‘C1’:0, ‘C2’: 1, ‘C3’:2}

for key in segmented_1.keys():

segmented_1[key].Class = segmented_1[key].Class.apply(lambda x: d[x])

segmented_2[key].Class = segmented_2[key].Class.apply(lambda x: d[x])

Time make the final plots!

for key in segmented_1.keys():

fig, ax = plt.subplots(ncols=2, nrows=2, figsize=(10,10))

ax[0, 0].imshow(img)

ax[0, 1].imshow(segmented_1[key].Class.values.reshape(460,390))

ax[0, 0].set_title(‘MOD09GA RGB’, fontsize=18)

ax[0, 1].set_title(f’kMeansn{key[0].upper()+key[1:]} Distance’, fontsize=18)

ax[1, 0].imshow(img2)

ax[1, 1].imshow(segmented_2[key].Class.values.reshape(460,390))

ax[1, 0].set_title(‘MOD09GA RGB’, fontsize=18)

ax[1, 1].set_title(f’kMeansn{key[0].upper()+key[1:]} Distance’, fontsize=18)

for i in range(2):

for j in range(2):

ax[i, j].set_facecolor(‘black’)

ax[i, j].set_xticks([])

ax[i, j].set_yticks([])

plt.savefig(f'{key}.png’)

plt.tight_layout()

plt.show()Image by author.Image by author.Image by author.

It’s not hard to see that Euclidean and Manhattan distance turned out be the most suitable for this particular task. But to make sure that it’s true, let’s evaluate the k-Means clustering results using the Silhouette Coefficient. This metric is perfect for training results assessment when there are no labeled true values for the clustered points.

To calculate it we will use *sklearn* function [1].

Image by sklearn.**a — **the mean distance between a sample and all other points in the same class.**b — **the mean distance between a sample and all other points in the *next nearest cluster*.

The range of values of the Silhouette Coefficient is [-1,1]. And yep, it’s computationally expensive, as you need to calculate distances between thousands of point several times, so be ready to wait.

scores_1, scores_2 = {}, {}

for key in segmented_1.keys(): #key is a metric for the distance estimation

scores_1[key]=round(silhouette_score(segmented_1[key].loc[:, :’C3′], segmented_1[key].Class, metric=key),2)

scores_2[key]=round(silhouette_score(segmented_2[key].loc[:, :’C3′], segmented_2[key].Class, metric=key),2)

print(f’Distance: {key}t Img 1: {scores_1[key]}t Img 2: {scores_2[key]}’)Image by author.

Now you can see that we proved it: Euclidean and Manhattan distances have similarly good performance, so let’s estimate the water surface area loss, using both of them.

for metric, Class in zip([‘euclidean’, ‘manhattan’], [2,1]):

img1_water = np.count_nonzero(segmented_1[metric].Class.values == Class)*500*500*1e-6 #pixel size is 500, so the area is 500*500 and to convert to km2 * 1e-6

img2_water = np.count_nonzero(segmented_2[metric].Class.values == Class)*500*500*1e-6

print(f’Distance: {metric}tWater Area Before: {round(img1_water)}kmu00b2tWater Area After: {round(img2_water)}kmu00b2tChange: -{100-round(img2_water/img1_water*100)}%’)Image by author.

— — — —

Distance: euclidean

Water Area Before: 17125 km²

Water Area After: 1960 km²

Change: -89%

— — — — —

Distance: manhattan

Water Area Before: 16244 km²

Water Area After: 2003 km²

Change: -88%

As you can see, according to our clustering results, the change in water surface area is almost **90% (!!!) water loss **over last 23 years, which is real proof of the fact that the Aral Sea shrinkage is a planetary tragedy…

===========================================

**Reference:**

===========================================

*All my publications on Medium are free and open-access, that’s why I’d really appreciate if you followed me here!*

P.s. I’m extremely passionate about (Geo)Data Science, ML/AI and Climate Change. So if you want to work together on some project pls contact me in LinkedIn.

🛰️Follow for more🛰️

Semantic Segmentation of Remote Sensing Imagery using k-Means was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.