In the world of data science, high-dimensional data presents both a challenge and opportunity. Whilst it provides a treasure trove of relationships and patterns that can be moulded and transformed, without careful cleaning and selection it can become overwhelming to analyse and draw conclusions from: “The curse of dimensionality”. Whilst instinctively you may lean to Principal Component Analysis to embed the data to a smaller subspace, you might be making your data problem even more challenging and a nonlinear embedding technique might be the more appropriate option. However, care needs to be made when selecting the right nonlinear technique as a wrong turn might lead to embeddings that are overfit or simply not fit for use. In this article I will take the opportunity to discuss a novel approach to understanding the manifold within high dimensional data, so that we as data scientists can make informed quantitative decisions on the underlining structure of our complex data.

I will start by covering what manifold learning is and outline a high level but informative summary of four popular linear and non-linear embedding techniques. From this we will achieve a deeper understanding of the assumptions made in each case and the implications these have on effective embeddings. I will also cover some Python examples on how to apply my noise injection analytical approach to evaluating manifolds and the type of inferences that can be made. At the end of this article you will have a thorough understanding of different manifold learning techniques, and the steps you can take to truly understand the underlining structure within your data.

### Manifold learning

Before jumping into these manifold learning techniques, it is important to cover exactly what a manifold is? In our context the manifold is an approximate representation of the structure of our high-dimensional space which might have local and/or global relationships to other nearby data points. The caveat is that we really do not know up front the true structure within our N-dimensional space, and often are forced to make implicit assumptions of the relationship between the data points when embedding our data. Unlike manifold learning in mathematics (Riemannian geometry), where it is possible to find an explicit mapping from one space to another.

The success of a machine learning model, in terms of performance and data-driven insight, is fundamentally subject to the data we pass to it. Whilst passing more information can enable these algorithms to find more intricate relationships and patterns, it also leads to compounding problems that are often generalised under the phrase of the *curse of dimensionality*:

**Overfit models**: As the dimensionality of the data increases, subsequent machine learning models may fail to generalise the true relationship within the data as a result overfit to noise and outliers.**Interpoint relationship dilation**: In large complex feature spaces it is not uncommon for some areas to become so sparse they are hard to model or so concentrated that key information is obscured.**Increase computational complexity**: Most machine learning algorithms do not scale well as the number of features increase, leading to increased computational time or memory demand when training the model.

To overcome this we must either reduce the number of features we consider or map the data to a lower dimensional space whilst preserving as much key information as possible. In the following section we summarise and explore different techniques (linear and nonlinear).

### Principal Component Analysis

Principal component analysis (PCA) is arguably the most infamous method to embed or reduce the dimensionality of a dataset, likely grounded in the explainability inferred from its statistical approach. There are plenty of other articles that can be found online that go deeper into this algorithm, but for the purpose of this article I have outlined the primary steps below.

The key takeaway is that PCA attempts to preserve the relationships between all data points by assuming a linear manifold and mapping the data onto N orthogonal principal components that are a linear combination of the original features. It does this by first standardising the data, centring around the mean and scaling accordingly so that the variance is consistent across all variables:

where *Xⱼ *is the original feature space *X* for all features *j*, *μ* and *σ* are the mean and standard deviation of *Xⱼ *respectively. The algorithm then computes the covariance matrix, *S*, of the standardised data

expressing how each variable correlates with every other variable. PCA then performs eigen decomposition of the covariance matrix to determine the eigenvalues, *λᵢ*, and eigenvectors, *vᵢ*

A Matrix, *W*, is defined by these eigenvectors, ordered by decreasing eigenvalues. The final projection of the transformed data, *Y*, is simply the product of *Z* and *W*.

In summary, PCA provides a way to uncover the internal structure of the data in a way that best preserves and explains the variance (i.e. maximising the information across the lowest number of dimensions). Each eigenvalue is proportional to the portion of the variance and so our matrix *W* enforces that the first projected principal component contains the most variance and each subsequent orthogonal component a fraction less.

### Local Linear Embedding

Before jumping into more advanced nonlinear approaches, let’s start with what is probably the simplest to understand, Local Linear Embedding (LLE). In essence, LLE assumes that a given data point and its neighbours can be approximately represented and mapped to a linear tangential plane upon the manifold such that they are linear combinations of one another. Weights to map the neighbourhood clusters to the plane are tuned to minimise the error from the transformation (see below) and this process is repeated for each data point. Therefore, whilst there is local linearity in a neighbourhood of data points, nonlinearity will be captured globally.

As a data scientist you need to define the number of nearest neighbours, *k*, which will require careful tuning. With this value to hand, the first optimisation problem to solve is the weights array *Wᵢⱼ *that maps each data point *Xᵢ *to the tangential plane as a linear combination of its neighbours:

subject, for each *i*, to the constraint:

that ensures local geometry is preserved with weights summing to 1. From this our embedding, *Yᵢ*, is simply the product of these weights and the original space *Xᵢ* whilst ensuring that the following embedding cost function is minimised

The above ensures that the lower dimensional representation best preserves the local weights in the original space. Whilst this is quite an eloquent solution to capturing nonlinear relationships there is a risk of corrupt embeddings if our value for *k* is not tuned appropriately of if there are parts of our N-dimensional space that are sparse. Next we will explore other nonlinear embedding techniques that use a combination of local and global relationships to form the final embedding, which in most cases makese them more robust.

### Spectral Embedding

Spectral Embedding (SE), also referred to as Laplacian Eigenmap embedding, forms a similarity graph that connects all data points together, which are then weighted based upon the spectral similarity between points. By doing so, SE not only preserves the local relationship (as LLE does) but the connected graph insures that the global relationships are also factored in. The reliance on the spectral properties observed in the graph Laplacian allows this algorithm to uncover complex non-linear structures that other techniques might fail to identify.

The first step of this algorithm is to construct the graph:

where *Wᵢⱼ* is the width of the edge between nodes *i* and *j*, and *σ *is a customisable parameter to control the width of the neighbourhoods. From this the graph Laplacian matrix, *L*, is then defined as *L=D-W* where *W* is the adjacency matrix of the graph and *D* the diagonal degree matrix with the below entries:

By enforcing orthogonality and a centring constraint to the final embedding, the algorithm performs eigenvalue decomposition on the Laplacian matrix to identify the eigenvalues and eigenvectors to embed the data accordingly.

### Isometric Feature Mapping

The last nonlinear manifold technique that will be covered is Isometric Feature Mapping (ISOMAP), a powerful nonlinear embedding method that offers slightly more customisation than those mentioned above. The algorithmic approach assumes you can present the high dimensional space by a connected neighbourhood graph, whereby the distance between nodes are geodesic. The method then applies multidimensional scaling (MDS) to find a lower-dimensional representation of the data such that the pairwise distances between the nodes are preserved as much as possible.

When constructing the graph you can choose to impose limits on either/both the number of nearest neighbours to consider and the relative euclidean distance of a data point to its neighbours. These constraints need to be appropriately tuned, for example, not too large that shortcut edges are formed (missing key structural information) but also not too small that we fail to create a connected graph. If these neighbourhood conditions are satisfied then an edge is established between two nodes. With the graph, *G*, the algorithm then computes the geodesic distance, *Dᵢⱼ,* between all pairs of points using a shortest path algorithm, *f*, such as Dijkstra’s or Floyd’s:

The final step is to map the data to the subspace which involves applying MDS to *Dᵢⱼ*. If we recall earlier to our overview of PCA, we would evaluate the covariance matrix prior to eigenvalue decomposition. MDS differs slightly by calculating a Gram matrix, *B*, relative to centring matrix *H*:

where *e* is a vector of ones and *n* is the number of data points. The final embedding is derived from the *d *eigenvectors that correspond to the largest eigenvalues:

### Noise injection

Up to now we have covered a selection of linear and non-linear methods for embedding data into a lower dimensional space by making certain assumptions about the underlining manifold, but how do we know which method is capturing useful information, especially when working with high-dimensional data that we cannot visualise. One approach to judging the performance of any embedding technique from a quantitative angle is to use, what I refer to as, noise injection. With this approach we apply varying amounts of noise to the original space and monitor the impact it has one our embeddings. The underlining principle is that as the amount of noise (distortion) is increased in the original space there will come a point where any manifold learning algorithms will fail to capture the true underlining structure. From observing how embeddings respond to varying amounts of noise it is easy to identify how well each technique is modelling the underlining structure in the dataset. Below you can find a step by step summary on how to do this analysis, with two python examples to bring the idea to life:

Generate surrogate datasets from the original dataset with the addition of gaussian noise, the variance of which we will scale for each subsequent surrogate.Embed these datasets using a selection of manifold learning techniques.For each technique compare the noise injected embeddings to the embedding of the original space (no synthetic additive noise) using procrustes analysis, a popular statistical method for comparing two shapes. This evaluation method will rotate, scale and translate one space upon another with the objective to minimise the sum of squared differences between each data point. This difference is a measure of similarity, which will be analysed to observe how each embedding method performs when subject to noise.The final step is to plot the change in procrustes distance relative to the scale of synthetic additional noise, allowing us to draw conclusions on the performance of each technique.

Let’s apply the above steps to the classic S-curve dataset:

import matplotlib.pyplot as plt

from sklearn import manifold, datasets

import numpy as np

from scipy.spatial import procrustes

from sklearn.decomposition import PCA

def compute_procrustes_distances(data, embedding_technique, max_noise, noise_step=0.05):

“””

Compute Procrustes distances for a range of Gaussian noise levels.

Parameters:

data (np.array): The original dataset.

embedding_technique (object): An instance of a manifold learning technique.

max_noise (float): Maximum level of noise to be added.

noise_step (float): Incremental step for noise addition.

Returns:

list: A list of Procrustes distances for each noise level.

“””

base_embedding = embedding_technique.fit_transform(data)

noise_levels = np.arange(0, max_noise, noise_step)

distances = []

for noise in noise_levels:

noisy_data = data + np.random.normal(-noise, noise, data.shape)

noisy_embedding = embedding_technique.fit_transform(noisy_data)

_, _, disparity = procrustes(base_embedding, noisy_embedding)

distances.append(disparity)

return distances

def plot_data(X, colour):

“””

Plot the dataset in 3D.

Parameters:

X (np.array): Data points.

colour (np.array): Colour mapping for the data points.

“””

fig = plt.figure(figsize=(30, 10))

ax = fig.add_subplot(111, projection=’3d’)

ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=colour, cmap=plt.cm.Spectral)

ax.view_init(4, -72)

plt.show()

def plot_procrustes_distances(noise_range, distances, labels):

“””

Plot Procrustes distances for different embedding techniques.

Parameters:

noise_range (np.array): Range of noise levels.

distances (dict): Dictionary of distances for each embedding technique.

labels (list): List of labels for each embedding technique.

“””

plt.figure(figsize=(10, 6))

for label in labels:

plt.plot(noise_range, distances[label], label=label)

plt.xlabel(‘Range of noise’)

plt.ylabel(‘Procrustes distance’)

plt.title(‘Comparison of Embedding Techniques’)

plt.legend()

plt.show()

# Generate and plot the S-curve dataset

X, colour = datasets.make_s_curve(1000, random_state=0)

plot_data(X, colour)

# Compute Procrustes distances for different embedding techniques

max_noise = 2

noise_range = np.arange(0, max_noise, 0.05)

embedding_techniques = {

“PCA”: PCA(2),

“ISOMAP”: manifold.Isomap(n_components=2),

“LLE”: manifold.LocallyLinearEmbedding(n_neighbors=10, n_components=2),

“SE”: manifold.SpectralEmbedding(n_components=2, n_neighbors=10)

}

distances = {label: compute_procrustes_distances(X, technique, max_noise) for label, technique in embedding_techniques.items()}

# Plot the computed Procrustes distances

plot_procrustes_distances(noise_range, distances, embedding_techniques.keys())

Of course, in real world applications we will not be privy to the underlining true structure as is the case with this dummy example, but for now let’s assume we do not know the true structure. What can we derive from the above graph? Fundamentally a good trend would be one that resembles a sigmoid curve, where initially we see resilience to a small amount of noise with very similar embedding to the original space, however there will come a critical point when this is not the case as the noise fractures the underlining structure. At this point we would expect a steep increase in procrustes distance with subsequent embeddings capturing noise with little to no meaningful information. Considering this, and the graph above, we can summarise the following:

**PCA**: Whilst procrustes distance does increase with noise, the trend is quite linear and therefore likely to not be capturing sufficient information of the true structure. Without considering the other trends, this alone would be a strong indication that a nonlinear embedding is required.

**LLE**: We see very little resilience to even marginal amounts of noise, which is likely due to a violation of the key assumption of local linearity. Increasing k, the number of nearest neighbours, might reduce the fragility of this embedding but it comes at a potential cost of losing detail (information) in the resultant subspace.

**ISOMAP**: Initially the performance of this embedding technique looks okay, but as more noise is introduced it becomes evident that there is a lack of information being captured (with a linear trend post the noise level: 0.25).

**SE**: Out of all of the methods explored SE performs the best, although it requires tuning to obtain an optimum fit.

Overall you might conclude that evidently the structure of our data resides on a nonlinear manifold, and that nonlinearity appears to be captured reasonably well with SE and ISOMAP. Considering this, and the poor performance from LLE, we might infer significant curvature to the manifold in the original space. Let’s explore this with a different example:

# Generate and plot the S-curve dataset

X, color = datasets.make_swiss_roll(1000, random_state=0)

plot_data(X, color)

# Compute Procrustes distances for different embedding techniques

max_noise = 4

noise_range = np.arange(0, max_noise, 0.05)

embedding_techniques = {

“PCA”: PCA(2),

“ISOMAP”: manifold.Isomap(n_components=2),

“LLE”: manifold.LocallyLinearEmbedding(n_neighbors=10, n_components=2),

“SE”: manifold.SpectralEmbedding(n_components=2, n_neighbors=10)

}

distances = {label: compute_procrustes_distances(X, technique, max_noise) for label, technique in embedding_techniques.items()}

# Plot the computed Procrustes distances

plot_procrustes_distances(noise_range, distances, embedding_techniques.keys())

Using the same functions outlined earlier we obtain the above procrustes distance trends for each embedding technique. Clearly the structure within this data is significantly nonlinear as we observe the PCA struggling to capture the underlining structure. The constant nonlinearity across the original space is also further emphasised by how poorly LLE performs, likely due to inconstant mapping of neighbourhoods to tangential planes. Again, SE and ISOMAP perform well, the latter slightly outperforming the former by trending to a procrustes distance of 1 quicker after the structure becomes fractured from noise. It would not be unreasonable to infer that SE is capturing some noise in all of the embeddings, which might be rectified with some tuning of the parameters. Tuning these algorithms can improve the generalisation and fit of the embedded data, here is an example of doing that for the above ISOMAP technique:

import numpy as np

from scipy.spatial import procrustes

import matplotlib.pyplot as plt

from sklearn import manifold, datasets

def return_procrustes_distance(data, embedding_technique, max_noise, noise_step=0.05):

“””

Compute the Procrustes distance for different levels of noise.

Parameters:

data (array_like): The original data to be embedded.

embedding_technique (object): The embedding technique (e.g., PCA, SpectralEmbedding).

max_noise (float): The maximum level of noise to be added.

noise_step (float): The increment step for the noise level.

Returns:

list: A list of Procrustes distances for each noise level.

“””

embeddings = []

distances = []

noise_range = np.arange(0, max_noise, noise_step)

for noise_level in noise_range:

noisy_data = data + np.random.normal(0, noise_level, data.shape)

embedded_data = embedding_technique.fit_transform(noisy_data)

if not embeddings: # if embeddings list is empty

embeddings.append(embedded_data)

_, _, disparity = procrustes(embeddings[0], embedded_data)

distances.append(disparity)

return distances

# Generating the S-curve dataset

X, _ = datasets.make_swiss_roll(1000, random_state=0)

# Parameters

max_noise = 2

k_values = [5, 7, 9] # Different values of k for ISOMAP

# Computing and plotting Procrustes distances for each k value

noise_levels = np.arange(0, max_noise, 0.05)

plt.figure(figsize=(10, 6))

for k in k_values:

embedding = manifold.Isomap(n_components=2, n_neighbors=k)

procrustes_distances = return_procrustes_distance(X, embedding, max_noise)

plt.plot(noise_levels, procrustes_distances, label=f’ISOMAP (k={k})’)

plt.xlabel(‘Noise Level’)

plt.ylabel(‘Procrustes Distance’)

plt.title(‘Procrustes Distance by Noise Level for Various k in ISOMAP’)

plt.legend()

plt.show()

The above is a very generic example of tuning, and you will certainly want to explore other parameters, but the principle is that by simply adjusting k and observing the impact of noise we can start to see the algorithm generalise better.

### Conclusion

In this article we have explored a range of manifold learning techniques and shown how we can use noise injection to better understand the underlining structure within high dimensional data. We achieved this by leveraging our understanding of how each algorithm works, the assumptions that underpin each, and analysing the influence of noise on embeddings. With this insight to hand we can make more informed decisions on how to pre-process or handle the data before passing it onto subsequent ML pipelines. This approach also enriches our understanding of the generalisation of the resultant embedding, and how it may respond to noise or potential future data drift.

Whether you plan to deploy an embedding technique as part of your ML solution or want a way to expand upon your process of performing exploratory data analysis, the above approach will hold you in good stead to deepen your understanding of the hidden obscured structure in any high dimensional dataset.

*Unless otherwise noted, all images are by the author.*

Unravelling Complexity: A Novel Approach to Manifold Learning Using Noise Injection was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.