How information about a target variable is distributed across its multiple features
Photo by Alina Grubnyak, via Unsplash.
When a target variable is influenced by multiple sources of information, it is crucial (and yet not trivial) to understand how each source contributes to the overall information provided.
In this article I’ll start with the basic concept of surprise, then I’ll proceed to explain how entropy consists of the average amount of surprise distributed over a random variable, and this gives us the conditions to define mutual information. After this, I talk about partial information decomposition for cases where we have multiple sources of information.
Surprise and Entropy
Maybe one of the most intuitive ways of defining Entropy from an Information standpoint is to first talk about surprise. The measure of surprise works just as how we expect: less probable events are more surprising, while more probable events are less surprising. The mathematical definition that encompasses these properties is the one shown below:
We can see by the graph in Figure 1 that this definition is pretty related to the properties we talked about. When some event has a high chance of happening (p closer to 1), then the surprise is close to zero. On the other hand, if an event has a very low probability of happening, its surprise gets arbitrarily large.
Figure 1: The graph of surprise. Image by author.
Now, what does entropy have to do with surprise? Well, entropy is the average surprise over all the values of a random variable. Therefore, if we have some random variable X, and the set of all possible outcomes of X is called A_X (we call it the “alphabet of X”), then entropy H is defined as:
Great. Now we tied up entropy with surprise, we can understand one useful interpretation of entropy:
Entropy is a measure of ignorance.
How can this be? I will explain it with a silly example. Imagine that you have to take a final physics exam. Within the language we have developed so far, we can consider the test as a random variable with some alphabet of possible questions. Suppose now two scenarios:
You studied hard for this exam and you know what kind of questions will be in the exam, so on average, you will not be so surprised by your exam.You didn’t really study and you don’t know which type of question will be in the exam, so your level of surprise will be pretty high throughout the exam.
So when your average surprise is bigger coincides perfectly with the scenario where you don’t have as much information.
Speaking from a technical standpoint, more peaked distributions (e.g. distributions where certain values are more likely to happen than others) have a lower entropy than more dispersed ones, where every event has about the same chance of happening. That is why we say that the distribution with the highest entropy is the uniform distribution, where any value can happen with the same chance.
Entropy and (Mutual) Information
Now that we have established a measure of average surprise on a system described by a random variable (this is the entropy), we can create the link of entropy with information.
Since entropy is a measure of ignorance over some system, the lack of it represents… information. In this sense, it is pretty natural to create a measure called mutual information: it measures the information you gain once you know some information about the system:
This definition says: take the average surprise of a random variable X, then take the average surprise of the random variable X, but now consider that we know the outcome of another random variable Y. Subtract the former by the latter, and you know how much ignorance you removed from your system X by knowing about Y.
Let’s come back to our silly example: suppose you don’t know what questions will be asked within your test, and this is X. Now suppose that a friend of yours has made a test from the same teacher, about the same subject, one week before your test. He tells you everything that his test covered (which happens to be another random variable Y). The most plausible to say is that your ignorance from your test has dropped, which means your test X and your friend’s test Y share information.
In Figure 2 there is a nice, comprehensible Venn Diagram showing the relation between the entropies and the information shared between a pair of variables X and Y.
Figure 2: Mutual Information scheme. Image by author, heavily inspired by many others.
But what if we have multiple sources of information?
So far we have only talked about cases where we have one feature X and one target variable Y, but it is quite obvious that this does not generalize well. Hence, now imagine we have a random variable Y (say, a target variable from a classification model) and we want to know the amount of information provided by each one of the n features of the model X_1, X_2, …, X_n. One could say that it suffices to calculate the mutual information shared by X_1 and Y, then by X_2 and Y, and so on. Well, in the real world, our features can interact among them and create non-trivial relations, and if we want to have a consistent framework we have to take these interactions into account.
Let’s take the case where we have two input signals X_1 and X_2, and we want to quantify the mutual information between these two features and a target feature Y. That is, we want to calculate I(Y; {X_1, X_2}). The Partial Information Decomposition framework states that this information can be divided into four non-negative components:
Syn(Y; {X_1, X_2}): the Synergy of the two features. This is an amount of information about Y provided solely by the two features together.Rdn(Y; {X_1, X_2}): the Redundancy of the two features. This quantity accounts for the information about Y that can be explained either by X_1 or X_2 alone.Unq(Y; X_1) and Unq(Y; X_2): the Unique Information, which measures the information about Y that only X_1 can explain for Unq(Y; X_1) or that only X_2 can explain for Unq(Y; X_2).
Notice that only Unq(Y; X_1) and Unq(Y; X_2) correspond to a scenario of no interaction between features. Hence, the mutual information I(Y; {X_1, X_2}) can be decomposed into its four components:
I(Y; {X_1, X_2}) = Syn(Y; {X_1, X_2}) + Rdn(Y; {X_1, X_2}) + Unq(Y; X_1) + Unq(Y; X_2)
Just as before, we can draw a nice Venn diagram that summarizes the dependency of these quantities.
Figure 3: Venn diagram for partial information decomposition. Image by author, heavily inspired by [1].
Each of these terms is called an atom of information. Any non-atomic information can be decomposed into atomic parts, that cannot be decomposed.
It was Williams and Beer [1] who first proposed this framework (and came up with a way of calculating partial information). It turns out that calculating these quantities is not trivial and deserves an article by itself. There is more than one measure of partial information decomposition, and all of them follow the same process: they imagine a measure that satisfies a series of nice-to-have characteristics and that is consistent with what we expect to happen with some quantity called “information”. All these measurements have strong and weak spots, and they are nicely implemented in dit library, which will be briefly introduced and used to give some examples in the following section.
Partial Information Decomposition examples and the dit library
To tie these concepts together, let’s see some examples. The dit library is a great tool for these experiments when it comes to information theory concepts. It is a lib that consists of creating customized probability distributions, and then performing measurements over them. There are several features within this library, and they can be found on their Github or at the official documentation page.
For all examples to come, we can think of two features X_1 and X_2, both of them binary, and the target variable Y is some boolean operation with the features. All forms of measuring partial information will be due to Williams and Beer [1], but other forms proposed by other authors are also implemented in dit .
Unique information
For this example, imagine that the target variable Y is the AND gate. Notice, by Fig. 4, that the output is always equal to the feature X_1, which makes the feature X_2 completely irrelevant.
Figure 4: Diagram of AND gate and a unique source of information.
For this reason, the information that X_1 and X_2 provide about Y is fully concentrated in X_1. In the formalism we have developed so far, we can say that the information about Y is unique to X_1.
In dit library, we can create this as:
import dit # importing dit library
from dit.pid import PID_WB # importing the PID measure we want to use
# creating a probability distribution of AND gate
dist_unique = dit.Distribution([“000”, “010”, “101”, “111”], [1/4, 1/4, 1/4, 1/4])
print(PID_WB(dist_unique))
“””
Out:
+——–+——–+——–+
| I_min | I_r | pi |
+——–+——–+——–+
| {0:1} | 1.0000 | 0.0000 |
| {0} | 1.0000 | 1.0000 |
| {1} | 0.0000 | 0.0000 |
| {0}{1} | 0.0000 | 0.0000 |
+——–+——–+——–+
“””
The dit library encodes the type of information as follows:
{0:1}: the synergistic information between X_1 and X_2{0}: unique information provided by X_1{1}: unique information provided by X_2{0}{1}: redundant information provided by X_1 and X_2
We can see by the output that the only partial information (the “pi” column) provided is from X_1.
Redundant Information
The next example serves to show the redundant information. Here, both X_1, X_2, and Y are equal as shown in Fig. 5, so the redundant information about Y provided by X_1 and X_2 is maximal.
Figure 5: Fully redundant information.
Using dit the code goes as:
import dit # importing dit library
from dit.pid import PID_WB # importing the PID measure we want to use
# creating a redundant probability distribution
dist_redundant = dit.Distribution([“000”, “111”], [1/2, 1/2])
print(PID_WB(d_redundant))
“””
Out:
+——–+——–+——–+
| I_min | I_r | pi |
+——–+——–+——–+
| {0:1} | 1.0000 | 0.0000 |
| {0} | 1.0000 | 0.0000 |
| {1} | 1.0000 | 0.0000 |
| {0}{1} | 1.0000 | 1.0000 |
+——–+——–+——–+
“””
As we can see, the only information about Y provided by X_1 and X_2 is redundant, in other words, provided by both of them.
Synergistic Information
A classic example of synergistic information is the XOR gate. The diagram for the XOR gate is shown in Fig. 6.
Figure 6: The XOR gate with fully synergistic information
Notice by this distribution that we can only know the target variable Y once we know both X_1 and X_2. It is not possible to know Y without X_1 and X_2, simply because for each value of X_1 we have both values for Y; and the same goes for X_2. The code indit goes:
import dit # importing dit library
from dit.pid import PID_WB # importing the PID measure we want to use
# creating a probability distribution of XOR gate
dist_syn = dit.Distribution([“000”, “011”, “101”, “110”], [1/4]*4)
print(dist_syn)
“””
Out:
+——–+——–+——–+
| I_min | I_r | pi |
+——–+——–+——–+
| {0:1} | 1.0000 | 1.0000 |
| {0} | 0.0000 | 0.0000 |
| {1} | 0.0000 | 0.0000 |
| {0}{1} | 0.0000 | 0.0000 |
+——–+——–+——–+
“””
As expected, the only information about Y that X_1 and X_2 convey is {0:1}, which is the synergistic information.
Final comments and takeaways
Finally, we can see that the interaction between variables can pose a difficult challenge when we have at our disposal only mutual information. There needs to be some tool to measure information coming from multiple sources (and possibly the interaction between these sources of information). This is a perfect ground for the Partial Information Decomposition (PID) framework.
Usually, the measurements in this field are convoluted and involve some formal logic: this can be left for another thorough article about this topic, but now it suffices to say that these tools are not only important, but their need arises naturally from the information approach.
References
[1] P. L. Williams and R. D. Beer, Nonnegative decomposition of multivariate information, arXiv preprint arXiv:1004.2515, 2010
[2] Shujian Yu, et al., Understanding Convolutional Neural Networks with Information Theory: An Initial Exploration, arXiv preprint arXiv:1804.06537v5, 2020
[3] A. J. Gutknecht, M. Wibral and A. Makkeh, Bits and pieces: understanding information decomposition from part-whole relationships and formal logic, arXiv preprint arXiv:2008.09535v2, 2022
[4] James, R. G., Ellison, C. J. and Crutchfield, J. P., dit: a Python package for discrete information theory, The Journal of Open Source Software, 2018
What Is Partial Information Decomposition and How Features Interact was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.