A while ago, I was building features from text documents and wanted to check which n-grams correlated with the target classification. A common way to do this is to calculate the mutual information between an n-gram and the classification. In this blog post, I explain how to calculate the mutual information between two variables in Python using SciPy and scikit-learn.
All quoted and copied definitions are taken from this great book on information theory. I highly recommend it and it’s freely available.
Let’s start with entropy, which is “a measure of the uncertainty of a random variable”.
Calculating the entropy of a binary variable manually can be done as follows.
```python
import numpy as np

def entropy(p):
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

entropy(0.95)
```
If we use the log with base 2, the entropy is expressed in bits. It is perfectly reasonable to use another base, such as e. The entropy calculated with a natural log is expressed in nats.
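To make the unit conversion concrete, here is a small sketch (reusing the distribution from above) showing that entropy in nats is simply entropy in bits scaled by ln(2):

```python
import numpy as np
from scipy import stats

p = [0.95, 0.05]
bits = stats.entropy(p, base=2)  # entropy in bits (log base 2)
nats = stats.entropy(p)          # default is the natural log: entropy in nats
print(bits, nats)
# Converting between the two units is a constant factor: nats = bits * ln(2)
print(np.isclose(nats, bits * np.log(2)))  # True
```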
```python
from scipy import stats

stats.entropy([0.95, 0.05], base=2)
```
The relative entropy is a measure of the distance between two distributions: “The relative entropy D(p||q) is a measure of the inefficiency of assuming that the distribution is q when the true distribution is p.”
First, let’s calculate it manually:
```python
def relative_entropy(p, q):
    return sum(p[i] * np.log2(p[i] / q[i]) for i in range(len(p)))

relative_entropy([0.95, 0.05], [0.2, 0.8])
```
Of course, the same can be achieved with SciPy’s entropy function. When you pass a second distribution, the function computes the relative entropy between the two distributions.
```python
stats.entropy(pk=[0.95, 0.05], qk=[0.2, 0.8], base=2)
```
Keep in mind that the relative entropy is not symmetric: flipping p and q will yield different results.
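A quick sketch makes the asymmetry concrete, reusing the two distributions from above:

```python
from scipy import stats

p = [0.95, 0.05]
q = [0.2, 0.8]
d_pq = stats.entropy(p, q, base=2)  # D(p||q)
d_qp = stats.entropy(q, p, base=2)  # D(q||p)
# The two directions give different values, so D is not a true distance metric
print(d_pq, d_qp)
```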
Mutual Information or Information Gain
The information gain, on the other hand, is “a measure of the amount of information that one random variable contains about another random variable. It is the reduction in the uncertainty of one random variable due to the knowledge of the other.”
As the following formula shows, the information gain is the relative entropy between the joint probability mass function p(x, y) and the product of the marginal probability mass functions p(x) p(y). This is also known as mutual information.

I(X; Y) = Σ over x, y of p(x, y) log( p(x, y) / (p(x) p(y)) )
We can rewrite this so it matches the definition of information gain we gave earlier: the reduction in uncertainty about one variable given knowledge of the other.

I(X; Y) = H(X) - H(X | Y)
In other words: mutual information and information gain are the same quantity. “Mutual information” emphasizes the dependency between two variables, while “information gain” emphasizes the reduction of entropy.
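As a sketch of the joint/marginal formulation (using two assumed example arrays), mutual information can be computed by hand from the estimated joint distribution and checked against scikit-learn’s mutual_info_score, which also reports nats:

```python
import numpy as np
from sklearn.metrics import mutual_info_score

a = np.array([1, 1, 1, 0, 0, 1, 0, 0, 0, 1])
b = np.array([1, 1, 0, 0, 0, 1, 0, 0, 0, 1])  # differs from a in one position

# Estimate the joint distribution p(x, y) from counts
joint = np.zeros((2, 2))
for x, y in zip(a, b):
    joint[x, y] += 1
joint /= joint.sum()
px = joint.sum(axis=1)  # marginal p(x)
py = joint.sum(axis=0)  # marginal p(y)

# I(X;Y) = sum over x, y of p(x,y) * log(p(x,y) / (p(x) p(y))), in nats
mi = sum(
    joint[x, y] * np.log(joint[x, y] / (px[x] * py[y]))
    for x in (0, 1) for y in (0, 1)
    if joint[x, y] > 0
)
print(mi, mutual_info_score(a, b))  # the two values match
```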
To demonstrate, I created two arrays. They are balanced (50/50) and exactly the same. First, I compute the entropy, expressed in nats: it is 0.69. If you calculate the information gain or the mutual information, you will see it is the same 0.69, because by knowing b, you also know a, reducing the remaining uncertainty to 0.
```python
from sklearn.feature_selection import mutual_info_classif
from sklearn.metrics import mutual_info_score

a = np.array([1, 1, 1, 0, 0, 1, 0, 0, 0, 1])
b = np.array([1, 1, 1, 0, 0, 1, 0, 0, 0, 1])

print(stats.entropy([0.5, 0.5]))  # entropy of 0.69, expressed in nats
print(mutual_info_classif(a.reshape(-1, 1), b, discrete_features=True))  # mutual information of 0.69, expressed in nats
print(mutual_info_score(a, b))  # information gain of 0.69, expressed in nats
```
That’s why mutual information is a great method for feature selection: it tells you how much you learn about your target variable by observing another variable.
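As a toy sketch of this idea (with hypothetical data, not from the examples above): score two candidate features against a target, where one feature mirrors the target and the other is an unrelated pattern. The informative feature should receive a much higher score.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

y = np.array([1, 1, 1, 0, 0, 1, 0, 0, 0, 1])
X = np.column_stack([
    y,                                          # informative: identical to the target
    np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1]),  # uninformative alternating pattern
])
scores = mutual_info_classif(X, y, discrete_features=True)
print(scores)  # the first feature scores far higher than the second
```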
Great success! You now know how to calculate mutual information in Python.