Why Independent Component Analysis fails its canonical thought experiment and what we can learn from this failure.
A Cocktail Party Metaphor (Image by author)
Independent Component Analysis (ICA) has, since significant development in the 1990s,¹ become a commonly used data decomposition and preprocessing technique. ICA is a method for blind source separation (BSS): some number of independent sources are mixed in an unknown way, and the resulting mixture signals are received by some other number of observers. ICA approaches work to demix observed signals and find independent sources by seeking a change of basis that minimizes the mutual information between demixed components or that maximizes the “non-Gaussianity” of data projected onto those components.
Plenty of tutorials are already out there on ICA and its application: this article is not another introduction to ICA. Instead this is a commentary on the motivating problem that almost always accompanies explanations of ICA.
Seemingly every introduction to ICA utilizes the Cocktail Party Problem as an illustration of the BSS problem ICA is meant to solve.² The Cocktail Party Problem is an evocative and motivating thought experiment for ICA. There’s only one teensy issue: ICA will fail spectacularly at a real-life cocktail party, and the reasons why it will fail should really inform how ICA is applied.
ICA and the Cocktail Party
A crowded room. Clusters of party-goers — cocktails in hand — are speaking over one another in conversation. How can listeners separate the mixed chatter into distinct voices and perhaps zero-in on a single speaker? This is the setup of the Cocktail Party Problem, the canonical example used to introduce ICA. Imagine that several microphones are positioned at different locations within the room. ICA, it is said, reveals to us how to demix recorded signals into independent components representative of distinct speakers at the party.
The BSS problem is formulated as a mixing problem in which some independent sources y are mixed into observed signals x:

$$x_i(t) = A_{ij}\, y_j(t), \qquad t = 1, \dots, N$$

for each of N time samples. A is a mixing matrix, and we index the sources and observations by j and i, respectively. For the few equations included in this article I am using Einstein summation notation, so the repeated index j is summed over.
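As a small illustration of the summation convention, the instantaneous mixing model maps directly onto `np.einsum`. This is only a sketch: the sizes, the Laplace-distributed sources, and the uniform mixing matrix are assumptions chosen for illustration, not anything prescribed by ICA itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# illustrative sizes (assumptions for this sketch)
n_sources, n_observers, n_samples = 3, 3, 1000

y = rng.laplace(size=(n_sources, n_samples))        # sources y_j(t), non-Gaussian
A = rng.uniform(-1, 1, (n_observers, n_sources))    # mixing matrix A_ij

# x_i(t) = A_ij y_j(t): the repeated index j is summed over
x = np.einsum("ij,jt->it", A, y)

# the same instantaneous mixing written as a plain matrix product
assert np.allclose(x, A @ y)
```

Every observed sample at time t depends only on the source values at that same time t, which is exactly the assumption the rest of this article puts under pressure.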
Scikit-learn’s decomposition module includes a good out of the box implementation of FastICA, which we can use to illustrate how this works in a low dimensional example. We’ll set up a few independent sources that are just phase shifted sinusoids oscillating at different frequencies, randomly mix them, and then apply FastICA to attempt to demix them. What we see is that — up to scaling, sign, and permutation — FastICA does a good job of recovering the original source signals (and hence, in a physical problem where we know the position of the microphones, we could recover the direction/position of the speakers).
import numpy as np
from sklearn.decomposition import FastICA
import matplotlib.pyplot as plt

rng = np.random.default_rng(8675309)
t = np.linspace(0, 10, 10000)

# NOTE: in the code, x holds the sources and y holds the mixed observations
x = np.array([
    np.sin(2 * np.pi * t),              # 1 Hz
    np.sin(2 * np.pi / 2 * t + 1),      # 0.5 Hz, phase 1
    np.sin(2 * np.pi / 2 * 2 * t + 2)]) # 2*pi/2*2 == 2*pi: 1 Hz, phase 2

mixing = rng.uniform(-1, 1, [3, 3])
# check that our randomly generated mixing is invertible
assert np.linalg.matrix_rank(mixing) == 3
demixing_true = np.linalg.inv(mixing)

# instantaneous linear mixing of the sources
y = np.matmul(mixing, x)

# FastICA expects samples along rows and channels along columns
fica = FastICA(n_components=3)
z = fica.fit_transform(np.transpose(y))
z = np.transpose(z)
# rescale each component by its maximum so the plots are comparable
z /= np.reshape(np.max(z, 1), (3, 1))

fig, ax = plt.subplots(3, 1, figsize=(8, 5))
for ii in range(3):
    ax[0].plot(t, x[ii])
    ax[1].plot(t, y[ii])
    ax[2].plot(t, z[ii])
ax[0].set_title("Independent Sources")
ax[1].set_title("Randomly, Linearly, Instantaneously Mixed Signals")
ax[2].set_title("ICA Components (Match up to Sign and Linear Scaling)")
plt.tight_layout()

FastICA for Instantaneously Mixed Signals
Let’s itemize a few of the cocktail party model assumptions:
- In the room of the cocktail party, we suppose that there are at least as many observers (e.g. microphones) as sources (e.g. speakers), which is a necessary condition for the problem not to be underdetermined.
- Sources are independent and not normally distributed (at most one source may be Gaussian).
- A is a constant matrix: mixing is instantaneous and unchanging.
The BSS problem is “blind,” so we note that the sources y and mixing A are unknown, and we seek a generalized inverse of A called a demixing matrix W. ICA algorithms are strategies to derive W.
Ready to Mix (Image by author)
A Real-Life Cocktail Party
What would happen if we actually set up an array of microphones at a party and tried to perform ICA on recorded audio? It just so happens that out-of-the-box ICA will almost certainly fail miserably at separating speakers!
Let’s revisit one of our model assumptions: specifically instantaneous mixing. Because of the finite speed of sound, audio originating at speaker locations will arrive at each microphone in the room with different time delays.
The speed of sound is around 343 m/s at the party, so a microphone 10 m away from a speaker will record with a delay of about 0.03 seconds. While this may seem nearly instantaneous to a human at the party, for a recording sampled at around 10 kHz this translates to a delay of hundreds of digital samples.
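The arithmetic is short enough to check directly. The sketch below uses the numbers quoted above; the exact 10 kHz sampling rate is an assumption standing in for "on the order of 10 kHz".

```python
speed_of_sound = 343.0   # m/s, approximate speed of sound in air
distance = 10.0          # m, speaker-to-microphone distance
sample_rate = 10_000     # Hz, assumed sampling rate for this sketch

delay_seconds = distance / speed_of_sound
delay_samples = round(delay_seconds * sample_rate)

print(f"{delay_seconds:.4f} s, {delay_samples} samples")
# prints "0.0292 s, 292 samples"
```

Two microphones whose distances to the same speaker differ by only a meter would already be misaligned by roughly 29 samples at this rate.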
Try to feed this blind mixture of time-delayed speakers into vanilla ICA and the results won’t be pretty. But hold on, aren’t there examples of ICA being used to demix audio?³ Yes, but these toy problems are digitally and instantaneously mixed and hence are consistent with the ICA model assumptions. Real-world recordings are not only time delayed, but are also subject to more complicated time transformations (more on this below).
We can revisit our toy example above and introduce a random delay between sources and recording microphones to see how FastICA starts to break down as the model assumptions are violated.
import numpy as np
from sklearn.decomposition import FastICA
import matplotlib.pyplot as plt

rng = np.random.default_rng(8675309)
# simulate an extra second so the shifted signals still cover 10000 samples
t = np.linspace(0, 11, 11000)
x = np.array([
    np.sin(2 * np.pi * t),              # 1 Hz
    np.sin(2 * np.pi / 2 * t + 1),      # 0.5 Hz, phase 1
    np.sin(2 * np.pi / 2 * 2 * t + 2)]) # 2*pi/2*2 == 2*pi: 1 Hz, phase 2

mixing = rng.uniform(-1, 1, [3, 3])
# check that our randomly generated mixing is invertible
assert np.linalg.matrix_rank(mixing) == 3
demixing_true = np.linalg.inv(mixing)

# a different random shift (in samples) for every source/microphone pair
delays = rng.integers(100, 500, (3, 3))
y = np.zeros(x.shape)
for source_i in range(3):
    for signal_j in range(3):
        x_ = x[source_i, delays[source_i, signal_j]:]
        y[signal_j, :len(x_)] += mixing[source_i, signal_j] * x_

# keep only the first 10000 samples, where every shifted source is defined
t = t[:10000]
x = x[:, :10000]
y = y[:, :10000]

fica = FastICA(n_components=3)
z = fica.fit_transform(np.transpose(y))
z = np.transpose(z)
# rescale each component by its maximum so the plots are comparable
z /= np.reshape(np.max(z, 1), (3, 1))

fig, ax = plt.subplots(3, 1, figsize=(8, 5))
for ii in range(3):
    ax[0].plot(t, x[ii])
    ax[1].plot(t, y[ii])
    ax[2].plot(t, z[ii])
ax[0].set_title("Independent Sources")
ax[1].set_title("Randomly, Linearly, Time-Delayed Mixed Signals")
ax[2].set_title("ICA Components (Match up to Sign and Linear Scaling)")
plt.tight_layout()

FastICA for Delayed Mixed Signals: Notice that the demixed components diverge from the source signal shapes.
Timing Problems
It’s worth examining in a little more detail exactly why ICA can’t handle these time delays. After all, we’re already dealing with an unknown mixing, shouldn’t we be able to handle a little temporal perturbation? Going even further, vanilla ICA on structured data is permutation invariant! You can shuffle sampling or pixel order on time series or image data sets and get the same results out of ICA. So again, why wouldn’t ICA be robust to these time delays?
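The shuffling claim can be checked without running ICA at all, because the statistics ICA works from (covariances for whitening, higher-order moments such as kurtosis for non-Gaussianity) depend only on which samples occur together across channels, not on their order in time. The following is a minimal sketch under stated assumptions: the Laplace sources, the fixed mixing matrix `A`, the projection direction `w`, and the 300-sample circular shift via `np.roll` are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 5000
sources = rng.laplace(size=(3, n_samples))
A = np.array([[1.0, 0.5, 0.2],
              [0.3, 1.0, 0.4],
              [0.6, 0.1, 1.0]])      # illustrative mixing matrix
mixed = A @ sources

# the SAME reordering of time samples applied to every channel
shuffled = mixed[:, rng.permutation(n_samples)]

# a shift applied to ONE channel only (circular, for simplicity)
misaligned = mixed.copy()
misaligned[0] = np.roll(mixed[0], 300)

def excess_kurtosis(v):
    """Fourth standardized moment minus 3; zero for a Gaussian."""
    v = (v - v.mean()) / v.std()
    return np.mean(v ** 4) - 3.0

w = np.array([0.5, -0.3, 0.8])       # arbitrary projection direction

shuffle_keeps_cov = np.allclose(np.cov(mixed), np.cov(shuffled))
shuffle_keeps_kurtosis = np.isclose(
    excess_kurtosis(w @ mixed), excess_kurtosis(w @ shuffled))
delay_keeps_cov = np.allclose(np.cov(mixed), np.cov(misaligned))

print(shuffle_keeps_cov, shuffle_keeps_kurtosis, delay_keeps_cov)
# prints "True True False"
```

A common shuffle leaves every cross-channel statistic intact, while a shift in a single channel changes which samples are paired with which, and the cross-channel covariance changes with it.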
The problem in the real-world cocktail party is that there is a different time delay for each speaker and microphone pair. Think of each digital sample from a speaker as a drawing from a random variable. When there are no delays, each microphone is hearing the same draw/sample at the same time. However, in the real world each microphone records a different delayed draw/sample of the same speaker and just as the mixing matrix A is unknown, so also is the time delay. And, of course, the actual problem is even worse than just a single lag value: reverberations, echoes, and attenuation will further spread out the source signals before they arrive at microphones.
Let’s update our model formulation to represent this complicated time delay. Assuming that the underlying acoustics of the room don’t really change and the microphones and speakers remain in the same place, we can write:
$$x_i(t) = \sum_{k=0}^{T} A_{ij}(k)\, y_j(t - k)$$
where k denotes a discrete time delay index, and the mixing matrix A is now a matrix function of k for k = 0…T.⁴ In other words, the actual observation at the i-th microphone is a linear mixing of sources going back T samples. Also, we can notice that the somewhat simpler problem of a single time delay per source/microphone pair (without more complicated acoustic effects) is a subcase of the above model formulation, where the matrix A is non-zero at exactly one value of k per (i, j) index pair.
The mathematically inclined, or those immersed in the arcane wizardry of signal processing, will notice that the real-world model⁵ of the cocktail party problem is starting to look a lot like convolution. In fact, this is a discrete analogue of a functional convolution, and via the Fourier transform we can arrive at a possibly more tractable frequency space version of our problem:
$$\hat{x}_i(\omega) = \hat{A}_{ij}(\omega)\, \hat{y}_j(\omega)$$

Here hats denote Fourier transforms over time and ω is the frequency.
There’s a lot to unpack here. The convolutional representation of the cocktail party reveals concisely why ICA is doomed to fail on a naive treatment of the BSS problem. A real-world multi-sensor audio recording is almost certainly a deconvolution problem, not an instantaneous linear demixing problem. It may still be tractable to approximate a solution to the problem (we’ll discuss some strategies below), but ICA shouldn’t be assumed to provide a spatially meaningful demixing in the time domain without a lot more work.
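The claim that convolutive mixing in time becomes an ordinary matrix product at each frequency can be verified numerically. The sketch below assumes circular (wrap-around) convolution so that the discrete Fourier transform identity holds exactly, which sidesteps the boundary effects of a real recording; the sizes and the random filters `A[i, j, k]` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_src, n_mic, n_samples, T = 2, 2, 256, 16   # illustrative sizes

y = rng.laplace(size=(n_src, n_samples))     # sources y_j(t)
A = rng.normal(size=(n_mic, n_src, T + 1))   # mixing filters A_ij(k), k = 0..T

# time domain: x_i(t) = sum_j sum_k A_ij(k) y_j(t - k), wrapping at the edges
x = np.zeros((n_mic, n_samples))
for k in range(T + 1):
    # np.roll(y, k)[j, t] equals y[j, (t - k) mod n_samples]
    x += A[:, :, k] @ np.roll(y, k, axis=1)

# frequency domain: one small matrix product per frequency bin f
A_hat = np.fft.fft(A, n=n_samples, axis=2)   # filters zero-padded to n_samples
y_hat = np.fft.fft(y, axis=1)
x_hat = np.einsum("ijf,jf->if", A_hat, y_hat)

assert np.allclose(np.fft.fft(x, axis=1), x_hat)
```

Within a single bin the problem has the same algebraic shape as instantaneous mixing, but the mixing matrix `A_hat[:, :, f]` is complex-valued and differs from bin to bin.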
We can revisit our toy example one more time and emulate a rudimentary convolution by giving each source and microphone pair an impulse response with a random delay, a random length, and a square-root ramp shape. In this case we can really start to see the FastICA component solutions diverging significantly from the original source signals.
import numpy as np
from sklearn.decomposition import FastICA
import matplotlib.pyplot as plt

rng = np.random.default_rng(8675309)
# simulate an extra second so the shifted signals still cover 10000 samples
t = np.linspace(0, 11, 11000)
x = np.array([
    np.sin(2 * np.pi * t),              # 1 Hz
    np.sin(2 * np.pi / 2 * t + 1),      # 0.5 Hz, phase 1
    np.sin(2 * np.pi / 2 * 2 * t + 2)]) # 2*pi/2*2 == 2*pi: 1 Hz, phase 2

mixing = rng.uniform(-1, 1, [3, 3])
# check that our randomly generated mixing is invertible
assert np.linalg.matrix_rank(mixing) == 3
demixing_true = np.linalg.inv(mixing)

# random onset and random length of the impulse response for each pair
delays = rng.integers(100, 500, (3, 3))
impulse_lengths = rng.integers(200, 400, (3, 3))
y = np.zeros(x.shape)
for source_i in range(3):
    for signal_j in range(3):
        impulse_length = impulse_lengths[source_i, signal_j]
        # square-root ramp, normalized so that the weights sum to one
        impulse_shape = np.sqrt(np.arange(impulse_length).astype(float))
        impulse_shape /= np.sum(impulse_shape)
        delay = delays[source_i, signal_j]
        # accumulate weighted, shifted copies of the source
        for impulse_k in range(impulse_length):
            x_ = x[source_i, (delay + impulse_k):]
            y[signal_j, :len(x_)] += (
                mixing[source_i, signal_j]
                * x_ * impulse_shape[impulse_k]
            )

# keep only the first 10000 samples, where every shifted source is defined
t = t[:10000]
x = x[:, :10000]
y = y[:, :10000]

fica = FastICA(n_components=3)
z = fica.fit_transform(np.transpose(y))
z = np.transpose(z)
# rescale each component by its maximum so the plots are comparable
z /= np.reshape(np.max(z, 1), (3, 1))

fig, ax = plt.subplots(3, 1, figsize=(8, 5))
for ii in range(3):
    ax[0].plot(t, x[ii])
    ax[1].plot(t, y[ii])
    ax[2].plot(t, z[ii])
ax[0].set_title("Independent Sources")
ax[1].set_title("Randomly Convolved Signals")
ax[2].set_title("ICA Components (Match up to Sign and Linear Scaling)")
plt.tight_layout()

FastICA for Convolved Signals: Notice that the demixed components diverge significantly from the source signal shapes.
The frequency space version, however, is starting to look more like an ICA model problem, at least as a linear mixing problem. It’s not perfect: the Fourier transformed mixing matrix function is not stationary in frequency space. However, this is probably where we would want to sink our teeth into the problem, and is indeed the starting point of more generic deconvolution strategies.
The “Real World” and ICA
Whatever you do, don’t use ICA for audio source separation at a cocktail party. Are there, however, real-world situations where ICA is useful?
Let’s consider one of the most commonly encountered applications of ICA: electroencephalography (EEG) featurization and decomposition. EEG signals are time series recordings of electrical potential from electrodes at the scalp (and sometimes from electrodes in the brain). There is a cottage industry of applying ICA to preprocessed EEG data in order to identify independent sources of electrical potential signals in the brain and the body.
In the case of EEG recordings, the ICA model assumption of instantaneous mixing is certainly satisfied: electrical signals propagate nearly instantaneously relative to the length scale of a human head and the sampling frequency, which is typically on the order of tens to hundreds of Hz. That’s a good sign for ICA here, and in fact independent components generally do separate out some spatially meaningful features. Eye motions and muscle movements (where skin conductance propagates signal to the scalp) are often obviously distinct components. Other components result in seemingly meaningful patterns of electrode activations on the scalp; these activations are thought to result from collections of neurons acting as radiating dipole sources in the brain. The three-dimensional location and orientation of such sources can be further inferred given an accurate coordinate mapping of electrode positions on the scalp.
We have established that the instantaneous mixing assumption is satisfied here, but what about the other model assumptions? If electrodes don’t move on the scalp, and the subject is otherwise stationary, then constant mixing may also be a reasonable assumption. Are we measuring more channels than sources? ICA won’t generate more independent components than recorded signal channels, but if there are many more true sources than it is feasible to discern, ascribing spatial meaning to components could be problematic.
Finally, are the sources independent? This is where things can get very tricky! Radiating dipole sources are, of course, not single neurons but are rather the collective spiking activity of many, many neurons. To what extent do we believe, at the sampling time scale of EEG, that these clusters of coherent neurons are independent of one another? A decade ago an extensive discussion and investigation on the topic was offered by Makeig and Onton.⁶ The gist is that sources are thought to be locally coherent cortical neuron patches: the strength of nearby connections relative to distant connections both induces “pond-ripple”-like electrical potential (centered at localized sources) and presumably reduces dependence between spatially separated patches. That said, there has been intermittent interest in examining convolutive mixing in EEG via ICA in the complex domain.⁷ ⁸
Deconvolution and ICA
Can ICA still be used somehow to solve the deconvolution problem that the real-world cocktail party illustrates? Let’s return to the frequency space representation of the BSS deconvolution problem. Remember that it is very close to something ICA can handle: the mixing matrix is a linear transformation, and the main issue is that it’s not stationary as a function of frequency. If we make a few assumptions about the (blind) convolution, we might be able to adapt ICA to the deconvolution problem.
Let’s assume that frequency space mixing varies continuously and somewhat “slowly” as a function of frequency. By slowly we mean that a small change in the frequency argument induces a small(er) change in the mixing. We’re being a bit vague with the terminology here, but the general idea is that given enough samples we could divide the BSS problem up over subsets of frequency space and run ICA in each subset, assuming stationary mixing within frequency subsets. That is, we know that globally the mixing varies with frequency, but maybe it varies slowly enough that we could assume it is stationary in spectral windows. So, between, say, 10 and 15 kHz we are going to use a bunch of Fourier transformed samples to estimate a single static mixing in that frequency window.
In theory we can try to interpolate between static ICA solutions across the whole frequency spectrum. So, if we have our 10–15 kHz ICA demixing solution and another solution for 15–20 kHz, we could come up with some interpolation scheme centering our two solutions at 12.5 kHz and 17.5 kHz and then infer some mixing function of frequency between those two points.
However, there are some ambiguities that need to be resolved. First, demixing matrices aren’t just vectors but have some additional group structure we might want to pay attention to. Second, ICA solution components are invariant with respect to permutation and scaling. In other words, thinking again of ICA as a change of basis, any reordering or change in sign/magnitude of the basis directions will be an equally good solution. So strategies to do this sort of frequency space distributed ICA can boil down to how to solve a matching and consistency problem between ICA solutions in adjacent frequency sets.
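To make the matching problem concrete, here is a minimal sketch of one way to align two sets of components that differ only by order, sign, and scale. It is an illustration under simplifying assumptions, not an established algorithm for frequency-domain ICA: the helper `match_components` is hypothetical, the data are real-valued, the "second solution" is constructed by hand, and the brute-force search over permutations is only practical for a handful of components.

```python
import itertools
import numpy as np

def match_components(reference, estimate):
    """Find the row order and signs of `estimate` that best match `reference`.

    Hypothetical helper: scores every permutation by summed absolute correlation.
    """
    n = reference.shape[0]
    # cross-correlation block: corr[i, j] = corr(reference[i], estimate[j])
    corr = np.corrcoef(reference, estimate)[:n, n:]
    best_perm = max(
        itertools.permutations(range(n)),
        key=lambda perm: sum(abs(corr[i, perm[i]]) for i in range(n)),
    )
    signs = np.array([np.sign(corr[i, best_perm[i]]) for i in range(n)])
    return np.array(best_perm), signs

rng = np.random.default_rng(0)
reference = rng.laplace(size=(3, 2000))
# pretend a second ICA run returned the same components, but
# reordered, sign-flipped, and rescaled
estimate = np.array([-2.0 * reference[2], 0.5 * reference[0], -reference[1]])

perm, signs = match_components(reference, estimate)
aligned = signs[:, None] * estimate[perm]   # same order and sign as reference

print(perm, signs)
```

Here the search recovers the order `[1, 2, 0]` and the signs `[1, -1, -1]`, after which each row of `aligned` is positively correlated with the corresponding row of `reference`. The remaining scale ambiguity still has to be fixed by a separate normalization convention.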
Mixed Cocktails (Image by author)
Carefree and Careful Featurization
There is, hopefully, a more broadly applicable lesson in all of this. ICA can be a very powerful decomposition technique, even when there is some ambiguity as to whether its model assumptions are satisfied. In fact, as a researcher I would almost always turn to FastICA for dimension reduction instead of — or at least in comparison to — PCA. I especially love to experiment with FastICA for more abstract data without a formal BSS interpretation.
Why can ICA be used more generally? Because the algorithms themselves are only abstract approximations to BSS solutions. FastICA does what it says: it finds a change of basis against which data components are maximally non-Gaussian — as inferred (more or less) by kurtosis. If this transformation happens to coincide with physically meaningful independent sources, then that’s great! If not, it can still be a useful transform in the same sense in which PCA is used abstractly. PCA and FastICA are even very loosely related, if we think of each as a change of basis optimizing a second and fourth order statistic, respectively.
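A quick way to build intuition for kurtosis as a non-Gaussianity measure is to compute it on a few known distributions. This sketch uses a plain moment-based estimate of excess kurtosis; it is not the contrast function any particular FastICA implementation uses internally, and the sample size and distributions are illustrative choices.

```python
import numpy as np

def excess_kurtosis(v):
    """Fourth standardized moment minus 3; zero for a Gaussian."""
    v = (v - v.mean()) / v.std()
    return np.mean(v ** 4) - 3.0

rng = np.random.default_rng(0)
n = 200_000
samples = {
    "gaussian": rng.normal(size=n),
    "uniform": rng.uniform(-1, 1, size=n),
    "laplace": rng.laplace(size=n),
    # summing independent sources pulls the result toward a Gaussian
    "sum of 8 laplace": rng.laplace(size=(8, n)).sum(axis=0),
}

kurtosis = {name: excess_kurtosis(v) for name, v in samples.items()}
for name, value in kurtosis.items():
    print(f"{name:>18s}: {value:+.2f}")
```

The theoretical values are 0 for the Gaussian, -1.2 for the uniform, +3 for the Laplace, and 3/8 for the sum of eight independent Laplace variables, so the estimates should land close to those. The last case is the central limit theorem at work: mixtures look more Gaussian than their sources, which is why searching for maximally non-Gaussian directions tends to undo a mixing.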
But it is necessary to be careful about reading more into ICA results than might be supported. We want to say that ICA components are maximally independent or non-Gaussian: sure, no problem! But can we say that ICA components are physically meaningfully distinct sources? Only if there is an underlying BSS model problem that satisfies the assumptions we have laid out here. ICA components in the abstract could well be indicative of useful relationships buried under layers of nonlinearities and convolutions. We just need to take care not to overload our interpretation of ICA without validating model assumptions.
References and Footnotes
[1] The two historically most prominent flavors of ICA — FastICA and Infomax ICA — trace back to:
A. Hyvärinen and E. Oja, A Fast Fixed-Point Algorithm for Independent Component Analysis (1997), Neural Computation
A. Bell and T. Sejnowski, An Information-Maximization Approach to Blind Separation and Blind Deconvolution (1995), Neural Computation
[2] C. Maklin, Independent Component Analysis (ICA) In Python (2019), Towards Data Science
[3] J. Dieckmann, Introduction to ICA: Independent Component Analysis (2023), Towards Data Science
[4] We are slightly abusing notation here by ignoring the boundaries of the audio recording, e.g. when t=0. Don’t worry! Everything is, after all, a notational abuse of mathematics.
[5] Is any model ever perfectly true in the “real world?” No, answers Dr. Box. Rob Thomas, tired of being hassled, concurs.
[6] S. Makeig and J. Onton, ERP features and EEG dynamics: An ICA perspective (2012), The Oxford Handbook of Event-Related Potential Components
[7] J. Anemüller, T. J. Sejnowski, and S. Makeig, Complex independent component analysis of frequency-domain electroencephalographic data (2003), Neural Networks
[8] A. Hyvärinen, P. Ramkumar, L. Parkkonen, and R. Hari, Independent component analysis of short-time Fourier transforms for spontaneous EEG/MEG analysis (2009), NeuroImage
ICA and the Real-Life Cocktail Party Problem was originally published in Towards Data Science on Medium.