Mutual Information Calculator
Calculate the mutual information between two discrete random variables using their joint probability distribution
Comprehensive Guide: How to Calculate Mutual Information (With Examples)
Mutual information (MI) is a fundamental concept in information theory that quantifies the amount of information obtained about one random variable through observing another random variable. It measures the dependence between two variables, providing insight into how much knowing one variable reduces uncertainty about the other.
Key Properties of Mutual Information
- Non-negativity: I(X;Y) ≥ 0
- Symmetry: I(X;Y) = I(Y;X)
- Independence: I(X;Y) = 0 if and only if X and Y are independent
- Relationship to entropy: I(X;Y) = H(X) – H(X|Y) = H(Y) – H(Y|X)
Common Applications
- Feature selection in machine learning
- Image registration in computer vision
- Neuroscience (measuring neural coding)
- Bioinformatics (gene expression analysis)
- Natural language processing
Mathematical Definition
The mutual information between two discrete random variables X and Y is defined as:
$$I(X;Y) = \sum_{x \in X} \sum_{y \in Y} p(x,y)\,\log_b \frac{p(x,y)}{p(x)\,p(y)}$$
Where:
- p(x,y) is the joint probability mass function of X and Y
- p(x) and p(y) are the marginal probability mass functions of X and Y respectively
- b is the base of the logarithm (commonly 2, e, or 10)
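As a short derivation sketch, the entropy form listed among the key properties follows from this definition. It assumes only the standard definitions of entropy and conditional entropy, together with p(x|y) = p(x,y) / p(y):

```latex
\begin{aligned}
I(X;Y) &= \sum_{x,y} p(x,y)\,\log_b \frac{p(x,y)}{p(x)\,p(y)} \\
       &= \sum_{x,y} p(x,y)\,\log_b \frac{p(x \mid y)}{p(x)} \\
       &= -\sum_{x,y} p(x,y)\,\log_b p(x) \;+\; \sum_{x,y} p(x,y)\,\log_b p(x \mid y) \\
       &= -\sum_{x} p(x)\,\log_b p(x) \;-\; \Bigl(-\sum_{x,y} p(x,y)\,\log_b p(x \mid y)\Bigr) \\
       &= H(X) - H(X \mid Y)
\end{aligned}
```

Swapping the roles of X and Y in the second line gives H(Y) – H(Y|X) by the same argument, which is why the measure is symmetric.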
Step-by-Step Calculation Process
1. Define the joint probability distribution: Create a matrix representing p(x,y) for all possible combinations of X and Y values. The sum of all joint probabilities must equal 1.
2. Calculate marginal probabilities: Compute p(x) by summing joint probabilities over Y for each X value, and p(y) by summing over X for each Y value.
3. Compute the ratio: For each (x,y) pair, calculate p(x,y) / (p(x)p(y)). This ratio measures how much more (or less) likely the joint event is compared to what it would be if X and Y were independent.
4. Apply the logarithm: Take the logarithm (with your chosen base) of each ratio from step 3.
5. Weight and sum: Multiply each log value by p(x,y) and sum all these products to get the mutual information.
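The five steps can be sketched in plain Python as follows. This is a minimal illustration rather than a reference implementation: the function name `mutual_information`, the list-of-rows input format, and the `independent` table are assumptions made for the example.

```python
import math

def mutual_information(joint, base=2.0):
    """Mutual information of a discrete joint distribution given as rows of p(x, y)."""
    # Step 1: the joint table must be a valid probability distribution.
    total = sum(sum(row) for row in joint)
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"joint probabilities sum to {total}, expected 1")

    # Step 2: marginals are the row sums (p(x)) and the column sums (p(y)).
    p_x = [sum(row) for row in joint]
    p_y = [sum(col) for col in zip(*joint)]

    mi = 0.0
    for i, row in enumerate(joint):
        for j, p_xy in enumerate(row):
            if p_xy == 0.0:
                continue  # zero-probability cells contribute nothing to the sum
            ratio = p_xy / (p_x[i] * p_y[j])   # Step 3: compare with independence
            log_ratio = math.log(ratio, base)  # Step 4: logarithm in the chosen base
            mi += p_xy * log_ratio             # Step 5: weight by p(x,y) and accumulate
    return mi

# Hypothetical table in which every cell equals p(x) * p(y), so X and Y are independent.
independent = [[0.12, 0.28], [0.18, 0.42]]
print(mutual_information(independent))  # within floating-point rounding of 0
```

Passing `base=math.e` returns the result in nats instead of bits.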
Practical Example Calculation
Let’s work through a concrete example to illustrate the calculation process.
Example Scenario
Consider two binary random variables:
- X ∈ {0,1} representing whether a patient has a certain genetic marker (0=no, 1=yes)
- Y ∈ {0,1} representing whether a patient develops a disease (0=no, 1=yes)
The joint probability distribution is given by:
| p(x,y) | Y=0 | Y=1 | p(x) |
|---|---|---|---|
| X=0 | 0.35 | 0.15 | 0.50 |
| X=1 | 0.20 | 0.30 | 0.50 |
| p(y) | 0.55 | 0.45 | 1.00 |
Let’s calculate the mutual information using base 2 (bits):
- For X=0, Y=0:
  p(0,0) = 0.35
  p(X=0) = 0.50, p(Y=0) = 0.55
  Ratio = 0.35 / (0.50 × 0.55) ≈ 1.2727
  log₂(1.2727) ≈ 0.3479
  Contribution = 0.35 × 0.3479 ≈ 0.1218
- For X=0, Y=1:
  p(0,1) = 0.15
  p(X=0) = 0.50, p(Y=1) = 0.45
  Ratio = 0.15 / (0.50 × 0.45) ≈ 0.6667
  log₂(0.6667) ≈ -0.5850
  Contribution = 0.15 × (-0.5850) ≈ -0.0877
- For X=1, Y=0:
  p(1,0) = 0.20
  p(X=1) = 0.50, p(Y=0) = 0.55
  Ratio = 0.20 / (0.50 × 0.55) ≈ 0.7273
  log₂(0.7273) ≈ -0.4594
  Contribution = 0.20 × (-0.4594) ≈ -0.0919
- For X=1, Y=1:
  p(1,1) = 0.30
  p(X=1) = 0.50, p(Y=1) = 0.45
  Ratio = 0.30 / (0.50 × 0.45) ≈ 1.3333
  log₂(1.3333) ≈ 0.4150
  Contribution = 0.30 × 0.4150 ≈ 0.1245
Total Mutual Information:
0.1218 + (-0.0877) + (-0.0919) + 0.1245 ≈ 0.0667 bits
Two of the four contributions are negative, but the total is positive, as it must be.
Normalized Mutual Information
To make mutual information values comparable across datasets, we often normalize by an entropy term. Mutual information can never exceed min(H(X), H(Y)). The variant used here divides by the larger of the two entropies, so it reaches 1 only when each variable fully determines the other:
$$\mathrm{NMI}(X;Y) = \frac{I(X;Y)}{\max\bigl(H(X),\,H(Y)\bigr)}$$
Where H(X) and H(Y) are the marginal entropies:
$$H(X) = -\sum_{x} p(x)\,\log_b p(x)$$
$$H(Y) = -\sum_{y} p(y)\,\log_b p(y)$$
For our example:
- H(X) = -[0.5 log₂(0.5) + 0.5 log₂(0.5)] = 1 bit
- H(Y) = -[0.55 log₂(0.55) + 0.45 log₂(0.45)] ≈ 0.9928 bits
- max(H(X), H(Y)) = 1 bit
- NMI = 0.0667 / 1 ≈ 0.0667
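The hand calculation can be cross-checked with a short script. This is a minimal sketch that works at full floating-point precision, so its last digits need not agree with values obtained from rounded intermediate logarithms. The helper name `entropy_bits` is an assumption made for the example.

```python
import math

# Joint distribution from the example table: rows are X=0, X=1; columns are Y=0, Y=1.
joint = [[0.35, 0.15], [0.20, 0.30]]

p_x = [sum(row) for row in joint]        # marginal of X
p_y = [sum(col) for col in zip(*joint)]  # marginal of Y

def entropy_bits(probs):
    """Shannon entropy in bits, skipping zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Sum of p(x,y) * log2( p(x,y) / (p(x) p(y)) ) over all non-zero cells.
mi_bits = sum(
    p_xy * math.log2(p_xy / (p_x[i] * p_y[j]))
    for i, row in enumerate(joint)
    for j, p_xy in enumerate(row)
    if p_xy > 0
)
h_x, h_y = entropy_bits(p_x), entropy_bits(p_y)
nmi = mi_bits / max(h_x, h_y)  # normalization by the larger marginal entropy

print(f"I(X;Y) = {mi_bits:.4f} bits")
print(f"H(X) = {h_x:.4f} bits, H(Y) = {h_y:.4f} bits, NMI = {nmi:.4f}")
```

The same mutual information also equals H(X) + H(Y) – H(X,Y), which is a useful second check on any hand calculation.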
Interpreting Mutual Information Values
| MI Value Range (max = max(H(X), H(Y))) | Normalized MI | Interpretation (rule of thumb) |
|---|---|---|
| 0 | 0 | Variables are completely independent |
| 0 to ~0.1×max | 0 to ~0.1 | Very weak dependence |
| ~0.1×max to ~0.3×max | ~0.1 to ~0.3 | Weak dependence |
| ~0.3×max to ~0.7×max | ~0.3 to ~0.7 | Moderate dependence |
| ~0.7×max to max | ~0.7 to 1 | Strong dependence |
| max | 1 | Variables are perfectly dependent (each is a deterministic function of the other) |
Common Pitfalls and How to Avoid Them
- Incorrect probability distributions: Ensure your joint probabilities sum to 1 and that marginal probabilities are calculated by summing over the correct dimension (over Y for p(x), over X for p(y)).
- Using the wrong logarithm base: Different bases give different units:
  - Base 2: bits (common in computer science)
  - Base e: nats (common in mathematics)
  - Base 10: dits or hartleys (less common)
- Ignoring zero probabilities: When p(x,y) = 0, the term contributes 0 to the sum. This follows the convention 0 × log 0 = 0, which is justified because t log t → 0 as t → 0. Skip these cells explicitly in code, since evaluating log(0) directly fails or returns -∞.
- Confusing mutual information with correlation: MI captures any statistical dependence, not just the linear relationships that Pearson correlation measures.
- Overinterpreting small values: With finite samples, even independent variables typically show small positive MI estimates due to sampling noise.
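To illustrate the correlation pitfall, the sketch below uses a hypothetical distribution (not taken from this guide) in which Y is completely determined by X, yet the covariance, and therefore the Pearson correlation, is zero. It also shows the zero-probability convention in practice: pairs that never occur are simply absent from the dictionary.

```python
import math

# Hypothetical example: X is uniform on {-1, 0, 1} and Y = X**2,
# so Y is fully determined by X but the relationship is not linear.
joint = {(-1, 1): 1 / 3, (0, 0): 1 / 3, (1, 1): 1 / 3}

# Marginals obtained by summing the joint over the other variable.
p_x, p_y = {}, {}
for (x, y), p in joint.items():
    p_x[x] = p_x.get(x, 0.0) + p
    p_y[y] = p_y.get(y, 0.0) + p

# Covariance E[XY] - E[X]E[Y]; zero covariance means zero Pearson correlation.
e_x = sum(x * p for x, p in p_x.items())
e_y = sum(y * p for y, p in p_y.items())
e_xy = sum(x * y * p for (x, y), p in joint.items())
covariance = e_xy - e_x * e_y

# Mutual information in bits. Pairs missing from `joint` have p(x,y) = 0
# and are never visited, which implements the 0 log 0 = 0 convention.
mi_bits = sum(
    p * math.log2(p / (p_x[x] * p_y[y])) for (x, y), p in joint.items() if p > 0
)

print(f"covariance = {covariance:.4f}, I(X;Y) = {mi_bits:.4f} bits")
```

Here the mutual information equals H(Y), about 0.918 bits, because knowing X removes all uncertainty about Y.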
Advanced Topics
Conditional Mutual Information
Measures the mutual information between two variables given a third variable:
I(X;Y|Z) = H(X|Z) – H(X|Y,Z)
Useful for controlling for confounding variables in complex systems.
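A minimal sketch of this quantity, assuming the joint distribution is supplied as a dictionary keyed by (x, y, z) tuples. It uses the equivalent single-sum form I(X;Y|Z) = ∑ p(x,y,z) log[ p(z) p(x,y,z) / (p(x,z) p(y,z)) ]; the function name and the confounder example are hypothetical.

```python
import math
from collections import defaultdict

def conditional_mutual_information(joint_xyz, base=2.0):
    """I(X;Y|Z) for a dict mapping (x, y, z) -> probability."""
    p_z, p_xz, p_yz = defaultdict(float), defaultdict(float), defaultdict(float)
    for (x, y, z), p in joint_xyz.items():
        p_z[z] += p
        p_xz[(x, z)] += p
        p_yz[(y, z)] += p
    # Sum of p(x,y,z) * log[ p(z) p(x,y,z) / (p(x,z) p(y,z)) ] over non-zero cells.
    return sum(
        p * math.log(p_z[z] * p / (p_xz[(x, z)] * p_yz[(y, z)]), base)
        for (x, y, z), p in joint_xyz.items()
        if p > 0
    )

# Hypothetical confounder: Z is a fair coin, and X and Y are each noisy copies of Z
# (correct with probability 0.9), generated independently of each other given Z.
joint_xyz = {}
for z in (0, 1):
    for x in (0, 1):
        for y in (0, 1):
            p_x_given_z = 0.9 if x == z else 0.1
            p_y_given_z = 0.9 if y == z else 0.1
            joint_xyz[(x, y, z)] = 0.5 * p_x_given_z * p_y_given_z

print(conditional_mutual_information(joint_xyz))  # within rounding of 0
```

In this example X and Y are dependent when Z is ignored, but conditioning on Z removes the dependence entirely, which is the sense in which conditional MI controls for a confounder.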
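Multivariate Mutual Information
One common extension of MI to more than two variables is the total correlation (also called multi-information):
C(X₁,X₂,…,Xₙ) = ∑ₖ H(Xₖ) – H(X₁,X₂,…,Xₙ)
It measures the total redundancy among the variables, is zero if and only if they are mutually independent, and reduces to I(X₁;X₂) when n = 2.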
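The sum-of-marginal-entropies-minus-joint-entropy formula can be sketched directly. The function names, the tuple-keyed dictionary format, and the `copies` example are assumptions made for illustration.

```python
import math
from collections import defaultdict

def entropy(dist, base=2.0):
    """Shannon entropy of a dict mapping outcome -> probability."""
    return -sum(p * math.log(p, base) for p in dist.values() if p > 0)

def total_correlation(joint, base=2.0):
    """Sum of the marginal entropies minus the joint entropy.

    `joint` maps outcome tuples (x1, ..., xn) -> probability.
    """
    n = len(next(iter(joint)))
    marginals = [defaultdict(float) for _ in range(n)]
    for outcome, p in joint.items():
        for k, value in enumerate(outcome):
            marginals[k][value] += p
    return sum(entropy(m, base) for m in marginals) - entropy(joint, base)

# Hypothetical example: three variables that are identical copies of one fair bit.
copies = {(0, 0, 0): 0.5, (1, 1, 1): 0.5}
print(total_correlation(copies))  # 3 marginal bits minus 1 joint bit
```

With two variables the same function returns the ordinary mutual information, since H(X) + H(Y) – H(X,Y) = I(X;Y).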
Differential Entropy for Continuous Variables
For continuous variables, the sums are replaced by integrals over probability density functions:
I(X;Y) = ∫∫ p(x,y) log(p(x,y)/(p(x)p(y))) dx dy
In practice it is usually estimated from samples, often using binning or kernel density estimation.
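As a rough sketch of the binning approach, the code below discretizes two samples into equal-width bins and applies the discrete formula to the bin counts. The bin count of 8, the sample size, and the synthetic Gaussian data are arbitrary choices for illustration; the estimate depends noticeably on all of them.

```python
import math
import random

def binned_mi_bits(xs, ys, bins=8):
    """Plug-in MI estimate (bits) after equal-width binning of two samples."""
    def bin_index(v, lo, hi):
        if hi == lo:
            return 0
        # Clamp so that the maximum value falls in the last bin.
        return min(int((v - lo) / (hi - lo) * bins), bins - 1)

    x_lo, x_hi, y_lo, y_hi = min(xs), max(xs), min(ys), max(ys)
    n = len(xs)
    counts, cx, cy = {}, [0] * bins, [0] * bins
    for x, y in zip(xs, ys):
        i, j = bin_index(x, x_lo, x_hi), bin_index(y, y_lo, y_hi)
        counts[(i, j)] = counts.get((i, j), 0) + 1
        cx[i] += 1
        cy[j] += 1
    # p(x,y) / (p(x) p(y)) = (c/n) / ((cx/n)(cy/n)) = c * n / (cx * cy)
    return sum(
        (c / n) * math.log2(c * n / (cx[i] * cy[j])) for (i, j), c in counts.items()
    )

rng = random.Random(0)  # fixed seed so the sketch is repeatable
xs = [rng.gauss(0, 1) for _ in range(5000)]
noise = [rng.gauss(0, 1) for _ in range(5000)]
ys_dependent = [x + 0.5 * e for x, e in zip(xs, noise)]  # depends on X
ys_independent = noise                                   # drawn independently of X

print(binned_mi_bits(xs, ys_dependent), binned_mi_bits(xs, ys_independent))
```

The estimate for the independent pair is small but not exactly zero, which is the finite-sample effect that bias corrections aim to reduce.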
Real-World Applications with Statistics
| Application Domain | Typical MI Values | Interpretation | Example Study |
|---|---|---|---|
| Gene expression analysis | 0.1-0.5 bits | Moderate regulatory relationships between genes | Butte & Kohane (2000) |
| Neural coding | 0.01-0.2 bits/spike | Information carried by individual neurons | Rieke et al. (1997) |
| Image registration | 0.5-2.0 bits | Alignment quality between medical images | Viola & Wells (1997) |
| Natural language processing | 0.05-0.3 bits | Word co-occurrence relationships | Church & Hanks (1990) |
| Financial markets | 0.001-0.05 bits | Weak dependencies between assets | Marschinski & Kantz (2002) |
Computational Considerations
When implementing mutual information calculations:
- Numerical stability: For very small probabilities, work with log-probabilities and use the log-sum-exp trick to avoid underflow when adding them:
  log(a + b) = max(log(a), log(b)) + log(1 + exp(-|log(a) – log(b)|))
- Efficient computation: For large datasets, use sparse representations of probability distributions to save memory.
- Bias correction: The plug-in estimate computed from a finite sample is biased upward. A first-order correction, in bits, is:
  Î_corrected(X;Y) = Î_plug-in(X;Y) – (|X|-1)(|Y|-1)/(2N ln 2)
  Where N is the sample size and |X|, |Y| are the number of bins. The ln 2 factor converts nats to bits; omit it when working in nats.
- Parallel computation: The double summation in the MI formula is embarrassingly parallel – each term can be computed independently.
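The bias correction can be sketched as follows for a table of raw counts. This is an illustrative sketch under stated assumptions: the function name `mi_from_counts` is hypothetical, |X| and |Y| are taken to be the number of occupied rows and columns, and `sample_counts` is a made-up sample of N = 100 observations.

```python
import math

def mi_from_counts(counts, correct_bias=True):
    """MI estimate in bits from a contingency table of counts (list of rows)."""
    n = sum(sum(row) for row in counts)
    row_tot = [sum(row) for row in counts]
    col_tot = [sum(col) for col in zip(*counts)]
    # Plug-in estimate: the MI formula applied to empirical frequencies c / n.
    plug_in = sum(
        (c / n) * math.log2(c * n / (row_tot[i] * col_tot[j]))
        for i, row in enumerate(counts)
        for j, c in enumerate(row)
        if c > 0
    )
    if not correct_bias:
        return plug_in
    # Assumption: |X| and |Y| count only rows and columns that were observed.
    k_x = sum(1 for t in row_tot if t > 0)
    k_y = sum(1 for t in col_tot if t > 0)
    return plug_in - (k_x - 1) * (k_y - 1) / (2 * n * math.log(2))

# Hypothetical sample of 100 observations in a 2x2 table.
sample_counts = [[35, 15], [20, 30]]
print(mi_from_counts(sample_counts, correct_bias=False))
print(mi_from_counts(sample_counts))
```

The corrected value can dip slightly below zero when the variables are close to independent; this reflects the approximate nature of the correction rather than a negative mutual information.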
Alternative Dependence Measures
While mutual information is powerful, other measures exist for specific scenarios:
| Measure | When to Use | Relationship to MI |
|---|---|---|
| Pearson correlation | Linear relationships between continuous variables | MI captures any dependence; correlation only linear |
| Spearman’s rank correlation | Monotonic relationships in ordinal data | MI is more general but less interpretable for rankings |
| Kendall’s tau | Ordinal associations with many tied ranks | MI can detect non-monotonic relationships |
| Chi-squared test | Testing independence in contingency tables | MI quantifies dependence strength; chi-squared tests significance |
| Kullback-Leibler divergence | Comparing two probability distributions | MI is the KL divergence between the joint distribution p(x,y) and the product of marginals p(x)p(y) |
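The connection to KL divergence in the last row can be checked numerically: computing the divergence from the product of the marginals to the joint distribution yields the mutual information. The sketch below uses a hypothetical 2×2 table and an assumed helper name `kl_divergence_bits`.

```python
import math

def kl_divergence_bits(p, q):
    """D(p || q) in bits for two distributions given as equal-length lists."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical joint distribution with uniform marginals.
joint = [[0.4, 0.1], [0.1, 0.4]]
p_x = [sum(row) for row in joint]
p_y = [sum(col) for col in zip(*joint)]

# Flatten the joint and the product of marginals in the same cell order.
p_joint = [p for row in joint for p in row]
p_product = [px * py for px in p_x for py in p_y]

mi_bits = kl_divergence_bits(p_joint, p_product)
print(f"I(X;Y) = {mi_bits:.4f} bits")
```

Because KL divergence is never negative and is zero only when the two distributions coincide, this view also explains why MI is non-negative and vanishes exactly under independence.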
Learning Resources
For those interested in deeper study of information theory and mutual information:
- Books:
  - “Elements of Information Theory” by Cover & Thomas (the standard textbook)
  - “Information Theory, Inference, and Learning Algorithms” by MacKay (more applied focus)
  - “A Mathematical Theory of Communication” by Shannon (the original foundational work)
- Software Tools:
  - Python: sklearn.metrics.mutual_info_score
  - R: infotheo::mutinformation
  - MATLAB: informationTheory::mutualInfo
Frequently Asked Questions
- Can mutual information be negative?
  No, mutual information is always non-negative. A value of 0 indicates independence, while positive values indicate dependence. Individual terms of the sum can be negative (two of the four terms in the worked example are), but the total never is.
- How does mutual information relate to entropy?
  MI can be expressed as: I(X;Y) = H(X) – H(X|Y) = H(Y) – H(Y|X), where H is entropy and H(X|Y) is conditional entropy.
- What’s the difference between mutual information and joint entropy?
  Joint entropy H(X,Y) measures the total uncertainty of the pair (X,Y), while MI measures how much knowing one variable reduces uncertainty about the other. The two are linked by I(X;Y) = H(X) + H(Y) – H(X,Y).
- Can MI be used for feature selection in machine learning?
  Yes, MI is commonly used to select features that are most informative about the target variable, as it can capture non-linear relationships.
- How do I handle continuous variables?
  For continuous variables, you can:
  - Discretize the variables into bins
  - Use kernel density estimation to approximate the PDFs
  - Use differential entropy formulations
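For the feature-selection question above, a minimal sketch of ranking discrete features by their estimated MI with a target might look like this. The dataset, the feature names, and the helper `mi_bits` are all hypothetical, and a sample this small is only meant to show the mechanics.

```python
import math
from collections import Counter

def mi_bits(xs, ys):
    """Plug-in MI (bits) between two equal-length sequences of discrete labels."""
    n = len(xs)
    c_xy, c_x, c_y = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    # p(x,y) / (p(x) p(y)) expressed with counts: c * n / (c_x * c_y)
    return sum(
        (c / n) * math.log2(c * n / (c_x[x] * c_y[y])) for (x, y), c in c_xy.items()
    )

# Hypothetical toy dataset: one binary target and three candidate features.
target = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "copy_of_target": [0, 0, 0, 0, 1, 1, 1, 1],
    "noisy_copy":     [0, 0, 0, 1, 1, 1, 1, 0],
    "unrelated":      [0, 1, 0, 1, 0, 1, 0, 1],
}

# Rank features from most to least informative about the target.
ranking = sorted(features, key=lambda name: mi_bits(features[name], target), reverse=True)
for name in ranking:
    print(name, round(mi_bits(features[name], target), 4))
```

Scoring each feature separately ignores redundancy between features, so two highly ranked features may carry the same information about the target.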