**Conformal Prediction for Machine Learning Classification —From the Ground Up**

#### Implementing conformal prediction for classification without need of bespoke packages

This blog post is inspired by Chris Molner’s book — Introduction to Conformal Prediction with Python. Chris is brilliant at making new machine learning techniques accessible to others. I’d especially also recommend his books on Explainable Machine Learning.

A GitHub repository with the full code may be found here: Conformal Prediction.

**What is Conformal Prediction?**

Conformal prediction is both a method of uncertainty quantification, and a method of classifying instances (which may be fine-tuned for classes or subgroups). Uncertainty is conveyed classification being in sets of potential classes rather than single predictions.

Conformal prediction specifies a coverage, which specifies the probability that the true outcome is covered by the prediction region. The interpretation of prediction regions in conformal prediction depends on the task. For classification we get prediction sets, while for regression we get prediction intervals.

Below is an example of the difference between ‘traditional’ classification (balance of likelihood) and conformal prediction (sets).

The difference between ‘normal’ classification based on most likely class, and conformal prediction which creates sets of possible classes.

The advantages of this method are:

**Guaranteed coverage**: Prediction sets generated by conformal prediction come with coverage guarantees of the true outcome — that is that they will detect whatever percentage of true values you set as a minimum target coverage. Conformal prediction does not depend on a well calibrated model — the only thing that matters is that, like all machine learning, the new samples being classified must come from similar data distributions to the training and calibration data. Coverage can also be guaranteed across classes or subgroups, though this takes an extra step in the method which we will cover.

**Easy to use**: Conformal prediction approaches can be implemented from scratch with just a few lines of code, as we will do here.**Model-agnostic**: Conformal prediction works with any machine learning model. It uses the normal outputs of whatever you preferred model is.**Distribution-free**: Conformal prediction makes no assumptions about underlying distributions of data; it is a non-parametric method.**No retraining required**: Conformal prediction can be used without retraining your model. It is another way of looking at, and using, model outputs.**Broad application**: conformal prediction works for tabular data classification, image or time-series classification, regression, and many other tasks, though we will demonstrate just classification here.

**Why should we care about uncertainty quantificiation?**

Uncertainty quantification is essential in many situations:

When we use model predictions to make decisions. How sure are we of those predictions? Is using just ‘most likely class’ good enough for the task we have?When we want to communicate the uncertainty associated with our predictions to stakeholders, without talking about probabilities or odds, or even log-odds!

**Alpha in conformal prediction — describes ***coverage*

*Coverage* is key to conformal prediction. In classification it is the normal region of data that a particular class inhabits. Coverage is equivalent to *sensitivity* or *recall*; it is the proportion of observed values that are identified in the classification sets. We can tighten or loosen the area of coverage by adjusting 𝛼 (*coverage = 1 — *𝛼).

### Let’s code!

#### Import packages

import matplotlib.pyplot as plt

import numpy as np

import pandas as pd

from sklearn.datasets import make_blobs

from sklearn.linear_model import LogisticRegression

from sklearn.model_selection import train_test_split

**Create synthetic data for classification**

Example data will be produced using SK-Learn’s `make_blobs` method.

n_classes = 3

# Make train and test data

X, y = make_blobs(n_samples=10000, n_features=2, centers=n_classes, cluster_std=3.75, random_state=42)

# Reduce the size of the first class to create an imbalanced dataset

# Set numpy random seed

np.random.seed(42)

# Get index of when y is class 0

class_0_idx = np.where(y == 0)[0]

# Get 30% of the class 0 indices

class_0_idx = np.random.choice(class_0_idx, int(len(class_0_idx) * 0.3), replace=False)

# Get the index for all other classes

rest_idx = np.where(y != 0)[0]

# Combine the indices

idx = np.concatenate([class_0_idx, rest_idx])

# Shuffle the indices

np.random.shuffle(idx)

# Split the data

X = X[idx]

y = y[idx]

# Split off model training set

X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.5, random_state=42)

# Split rest into calibration and test

X_Cal, X_test, y_cal, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=42)

# Set class labels

class_labels = [‘blue’, ‘orange’, ‘green’]# Plot the data

fig = plt.subplots(figsize=(5, 5))

ax = plt.subplot(111)

for i in range(n_classes):

ax.scatter(X_test[y_test == i, 0], X_test[y_test == i, 1], label=class_labels[i], alpha=0.5, s=10)

legend = ax.legend()

legend.set_title(“Class”)

ax.set_xlabel(“Feature 1”)

ax.set_ylabel(“Feature 2”)

plt.show()Generated data (the data is created to be imbalanced — the blue class has only about 30% of the data points of either the green or orange classes).

**Build a classifier**

We will use a simple logistic regression model here, but the method can work with any model from a simple logistic regression model based on tabular data through to 3D ConvNets for image classification.

# Build and train the classifier

classifier = LogisticRegression(random_state=42)

classifier.fit(X_train, y_train)

# Test the classifier

y_pred = classifier.predict(X_test)

accuracy = np.mean(y_pred == y_test)

print(f”Accuracy: {accuracy:0.3f}”)

# Test recall for each class

for i in range(n_classes):

recall = np.mean(y_pred[y_test == i] == y_test[y_test == i])

print(f”Recall for class {class_labels[i]}: {recall:0.3f}”)Accuracy: 0.930

Recall for class blue: 0.772

Recall for class orange: 0.938

Recall for class green: 0.969

Note how recall for the minority class is lower than the other classes. Recall, otherwise known as sensitivity, is the number in a class that are correctly identified by the classifier.

#### S_i**, or the ***non-conformity score*** score**

In conformal prediction, the non-conformity score, often denoted as *s_i*, is a measure of how much a new instance deviates from the existing instances in the training set. It’s used to determine whether a new instance belongs to a particular class or not.

In the context of classification, the most common non-conformity measure is *1 — predicted class probability* for the given label. So, if the predicted probability of a new instance belonging to a certain class is high, the non-conformity score will be low, and vice versa.

For conformal prediction we obtain *s_i* scores for all classes (note: we only look at the model output for the true class of an instance, even if it is has a higher predicted probability of being another class). We then find a threshold of scores that contains (or *covers*) 95% of the data. The classification will then identify 95% of new instances (so long as our new data is similar to our training data).

**Calculate conformal prediction threshold**

We will now predict classification probabilities of the calibration set. This will be used to set a classification threshold for new data.

# Get predictions for calibration set

y_pred = classifier.predict(X_Cal)

y_pred_proba = classifier.predict_proba(X_Cal)

# Show first 5 instances

y_pred_proba[0:5]array([[4.65677826e-04, 1.29602253e-03, 9.98238300e-01],

[1.73428257e-03, 1.20718182e-02, 9.86193899e-01],

[2.51649788e-01, 7.48331668e-01, 1.85434981e-05],

[5.97545130e-04, 3.51642214e-04, 9.99050813e-01],

[4.54193815e-06, 9.99983628e-01, 1.18300819e-05]])

**Calculate non-conformality scores:**

We will calculate *s_i *scores only based on looking at probabilities associated with the observed class. For each instance we will get the predicted probability for the class of that instance. The *s_i* score (non-conformality) is *1-probability*. The higher the *s_i* score, the less that example conforms to that class in comparison to other classes.

si_scores = []

# Loop through all calibration instances

for i, true_class in enumerate(y_cal):

# Get predicted probability for observed/true class

predicted_prob = y_pred_proba[i][true_class]

si_scores.append(1 – predicted_prob)

# Convert to NumPy array

si_scores = np.array(si_scores)

# Show first 5 instances

si_scores[0:5]array([1.76170035e-03, 1.38061008e-02, 2.51668332e-01, 9.49187344e-04,

1.63720201e-05])

**Get 95th percentile threshold:**

The threshold determines what *coverage *our classification will have. Coverage refers to the proportion of predictions that actually contain the true outcome.

The threshold is the percentile corresponding to *1 — *𝛼. To get 95% coverage, we set an 𝛼 of 0.05.

When used in real life, the quantile level (based on 𝛼) requires a finite sample correction to calculate the corresponding quantile 𝑞. We multiple 0.95 by $(n+1)/n$, which means that 𝑞𝑙𝑒𝑣𝑒𝑙 would be 0.951 for n = 1000.

number_of_samples = len(X_Cal)

alpha = 0.05

qlevel = (1 – alpha) * ((number_of_samples + 1) / number_of_samples)

threshold = np.percentile(si_scores, qlevel*100)

print(f’Threshold: {threshold:0.3f}’)Threshold: 0.598

Show chart of s_i values, with cut-off threshold.

x = np.arange(len(si_scores)) + 1

sorted_si_scores = np.sort(si_scores)

index_of_95th_percentile = int(len(si_scores) * 0.95)

# Color by cut-off

conform = ‘g’ * index_of_95th_percentile

nonconform = ‘r’ * (len(si_scores) – index_of_95th_percentile)

color = list(conform + nonconform)

fig = plt.figure(figsize=((6,4)))

ax = fig.add_subplot()

# Add bars

ax.bar(x, sorted_si_scores, width=1.0, color = color)

# Add lines for 95th percentile

ax.plot([0, index_of_95th_percentile],[threshold, threshold],

c=’k’, linestyle=’–‘)

ax.plot([index_of_95th_percentile, index_of_95th_percentile], [threshold, 0],

c=’k’, linestyle=’–‘)

# Add text

txt = ’95th percentile conformality threshold’

ax.text(5, threshold + 0.04, txt)

# Add axis labels

ax.set_xlabel(‘Sample instance (sorted by $s_i$)’)

ax.set_ylabel(‘$S_i$ (non-conformality)’)

plt.show()s_i scores for all data. The threshold is the s_i level that contains 95% of all data (if 𝛼 is set at 0.05).

#### Get samples/classes from test set classified as positive

We can now find all those model outputs less than the threshold.

It is possible for an individual example to have no value, or more than one value, below the threshold.

prediction_sets = (1 – classifier.predict_proba(X_test) <= threshold)

# Show first ten instances

prediction_sets[0:10]array([[ True, False, False],

[False, False, True],

[ True, False, False],

[False, False, True],

[False, True, False],

[False, True, False],

[False, True, False],

[ True, True, False],

[False, True, False],

[False, True, False]])

**Get prediction set labels, and compare to standard classification.**

# Get standard predictions

y_pred = classifier.predict(X_test)

# Function to get set labels

def get_prediction_set_labels(prediction_set, class_labels):

# Get set of class labels for each instance in prediction sets

prediction_set_labels = [

set([class_labels[i] for i, x in enumerate(prediction_set) if x]) for prediction_set in

prediction_sets]

return prediction_set_labels

# Collate predictions

results_sets = pd.DataFrame()

results_sets[‘observed’] = [class_labels[i] for i in y_test]

results_sets[‘labels’] = get_prediction_set_labels(prediction_sets, class_labels)

results_sets[‘classifications’] = [class_labels[i] for i in y_pred]

results_sets.head(10)

observed labels classifications

0 blue {blue} blue

1 green {green} green

2 blue {blue} blue

3 green {green} green

4 orange {orange} orange

5 orange {orange} orange

6 orange {orange} orange

7 orange {blue, orange} blue

8 orange {orange} orange

9 orange {orange} orange

Note instance 7 is actually orange class, but has been classified by the simple classifier as blue. The conformal prediction classes it as a set of orange and blue.

**Plot data showing instance 7 which is predicted to possibly be in 2 classes:**

# Plot the data

fig = plt.subplots(figsize=(5, 5))

ax = plt.subplot(111)

for i in range(n_classes):

ax.scatter(X_test[y_test == i, 0], X_test[y_test == i, 1],

label=class_labels[i], alpha=0.5, s=10)

# Add instance 7

set_label = results_sets[‘labels’].iloc[7]

ax.scatter(X_test[7, 0], X_test[7, 1], color=’k’, s=100, marker=’*’, label=f’Instance 7′)

legend = ax.legend()

legend.set_title(“Class”)

ax.set_xlabel(“Feature 1”)

ax.set_ylabel(“Feature 2”)

txt = f”Prediction set for instance 7: {set_label}”

ax.text(-20, 18, txt)

plt.show()Scatter plot showing how test instance 7 was classified as belonging to two possible sets: {‘blue’, ‘orange’},

**Show coverage and average set size**

*Coverage* is the proportion of prediction sets that actually contain the true outcome.

*Average set size* is the average number of predicted classes per instance.

We will define some functions to calculate the results.

# Get class counts

def get_class_counts(y_test):

class_counts = []

for i in range(n_classes):

class_counts.append(np.sum(y_test == i))

return class_counts

# Get coverage for each class

def get_coverage_by_class(prediction_sets, y_test):

coverage = []

for i in range(n_classes):

coverage.append(np.mean(prediction_sets[y_test == i, i]))

return coverage

# Get average set size for each class

def get_average_set_size(prediction_sets, y_test):

average_set_size = []

for i in range(n_classes):

average_set_size.append(

np.mean(np.sum(prediction_sets[y_test == i], axis=1)))

return average_set_size

# Get weighted coverage (weighted by class size)

def get_weighted_coverage(coverage, class_counts):

total_counts = np.sum(class_counts)

weighted_coverage = np.sum((coverage * class_counts) / total_counts)

weighted_coverage = round(weighted_coverage, 3)

return weighted_coverage

# Get weighted set_size (weighted by class size)

def get_weighted_set_size(set_size, class_counts):

total_counts = np.sum(class_counts)

weighted_set_size = np.sum((set_size * class_counts) / total_counts)

weighted_set_size = round(weighted_set_size, 3)

return weighted_set_size

Show results for each class.

results = pd.DataFrame(index=class_labels)

results[‘Class counts’] = get_class_counts(y_test)

results[‘Coverage’] = get_coverage_by_class(prediction_sets, y_test)

results[‘Average set size’] = get_average_set_size(prediction_sets, y_test)

results Class counts Coverage Average set size

blue 241 0.817427 1.087137

orange 848 0.954009 1.037736

green 828 0.977053 1.016908

Show overall results.

weighted_coverage = get_weighted_coverage(

results[‘Coverage’], results[‘Class counts’])

weighted_set_size = get_weighted_set_size(

results[‘Average set size’], results[‘Class counts’])

print (f’Overall coverage: {weighted_coverage}’)

print (f’Average set size: {weighted_set_size}’)Overall coverage: 0.947

Average set size: 1.035

NOTE: Though our overall coverage is as desired, being very close to 95%, coverage of the different classes varies, and is lowest (83%) for our smallest class. If coverage of individual classes is important we can set out thresholds for classes independently, which is what we will now do.

### Conformal classification with equal coverage across classes

When we want to be sure of coverage across all classes, we can set thresholds for each class independently.

Note: we could also do this for subgroups of data, such as ensuring equal coverage for a diagnostic across racial groups, if we found coverage using a shared threshold led to problems.

#### Get thresholds for each class independently

# Set alpha (1 – coverage)

alpha = 0.05

thresholds = []

# Get predicted probabilities for calibration set

y_cal_prob = classifier.predict_proba(X_Cal)

# Get 95th percentile score for each class’s s-scores

for class_label in range(n_classes):

mask = y_cal == class_label

y_cal_prob_class = y_cal_prob[mask][:, class_label]

s_scores = 1 – y_cal_prob_class

q = (1 – alpha) * 100

class_size = mask.sum()

correction = (class_size + 1) / class_size

q *= correction

threshold = np.percentile(s_scores, q)

thresholds.append(threshold)

print(thresholds)[0.9030202125697161, 0.6317149025299887, 0.26033562285411]

**Apply class-specific threshold to each class classification**

# Get Si scores for test set

predicted_proba = classifier.predict_proba(X_test)

si_scores = 1 – predicted_proba

# For each class, check whether each instance is below the threshold

prediction_sets = []

for i in range(n_classes):

prediction_sets.append(si_scores[:, i] <= thresholds[i])

prediction_sets = np.array(prediction_sets).T

# Get prediction set labels and show first 10

prediction_set_labels = get_prediction_set_labels(prediction_sets, class_labels)

# Get standard predictions

y_pred = classifier.predict(X_test)

# Collate predictions

results_sets = pd.DataFrame()

results_sets[‘observed’] = [class_labels[i] for i in y_test]

results_sets[‘labels’] = get_prediction_set_labels(prediction_sets, class_labels)

results_sets[‘classifications’] = [class_labels[i] for i in y_pred]

# Show first 10 results

results_sets.head(10) observed labels classifications

0 blue {blue} blue

1 green {green} green

2 blue {blue} blue

3 green {green} green

4 orange {orange} orange

5 orange {orange} orange

6 orange {orange} orange

7 orange {blue, orange} blue

8 orange {orange} orange

9 orange {orange} orange

**Check coverage and set size across classes**

We now have about 95% coverage across all classes. The conformal prediction method gives us better coverage of the minority class than the standard method of classification.

results = pd.DataFrame(index=class_labels)

results[‘Class counts’] = get_class_counts(y_test)

results[‘Coverage’] = get_coverage_by_class(prediction_sets, y_test)

results[‘Average set size’] = get_average_set_size(prediction_sets, y_test)

results Class counts Coverage Average set size

blue 241 0.954357 1.228216

orange 848 0.956368 1.139151

green 828 0.942029 1.006039weighted_coverage = get_weighted_coverage(

results[‘Coverage’], results[‘Class counts’])

weighted_set_size = get_weighted_set_size(

results[‘Average set size’], results[‘Class counts’])

print (f’Overall coverage: {weighted_coverage}’)

print (f’Average set size: {weighted_set_size}’)Overall coverage: 0.95

Average set size: 1.093

### Summary

Conformal prediction was used to classify instances in sets rather than single predictions. Instances on the borders between two classes were labelled with both classes rather than picking the class with highest probability.

When it is important that all classes are detected with the same coverage, the threshold for classifying instances may be set separately (this method could also be used for subgroups of data, for example to guarantee the same coverage across different ethnic groups).

Conformal prediction does not change the model predictions. It just uses them in a different way to traditional classification. It may be used alongside more traditional methods.

(All images are by the author)

Conformal prediction for machine learning classification — from the ground up was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.