How To Calculate Mutual Information Example

Comprehensive Guide: How to Calculate Mutual Information (With Examples)

Mutual information (MI) is a fundamental concept in information theory that quantifies the amount of information obtained about one random variable through observing another random variable. It measures the dependence between two variables, providing insight into how much knowing one variable reduces uncertainty about the other.

Key Properties of Mutual Information

  • Non-negativity: I(X;Y) ≥ 0
  • Symmetry: I(X;Y) = I(Y;X)
  • Independence: I(X;Y) = 0 if and only if X and Y are independent
  • Relationship to entropy: I(X;Y) = H(X) – H(X|Y)

Common Applications

  • Feature selection in machine learning
  • Image registration in computer vision
  • Neuroscience (measuring neural coding)
  • Bioinformatics (gene expression analysis)
  • Natural language processing

Mathematical Definition

The mutual information between two discrete random variables X and Y is defined as:

I(X;Y) = ∑_{x∈X} ∑_{y∈Y} p(x,y) log_b( p(x,y) / (p(x) p(y)) )

Where:

  • p(x,y) is the joint probability distribution function of X and Y
  • p(x) and p(y) are the marginal probability distribution functions of X and Y respectively
  • b is the base of the logarithm (commonly 2, e, or 10)

Step-by-Step Calculation Process

  1. Define the joint probability distribution:

    Create a matrix representing p(x,y) for all possible combinations of X and Y values. The sum of all joint probabilities must equal 1.

  2. Calculate marginal probabilities:

    Compute p(x) by summing joint probabilities over Y for each X value, and p(y) by summing over X for each Y value.

  3. Compute the ratio:

    For each (x,y) pair, calculate p(x,y) / (p(x)p(y)). This ratio measures how much more (or less) likely the joint event is compared to what it would be if X and Y were independent.

  4. Apply the logarithm:

    Take the logarithm (with your chosen base) of each ratio from step 3.

  5. Weight and sum:

    Multiply each log value by p(x,y) and sum all these products to get the mutual information.
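
The five steps above can be sketched in plain Python. This is a minimal illustration rather than a reference implementation: the function name `mutual_information` and the list-of-rows input layout are assumptions made for the example, and cells with p(x,y) = 0 are skipped under the usual 0 · log 0 = 0 convention.

```python
import math


def mutual_information(joint, base=2.0):
    """Mutual information of a joint distribution given as a list of rows.

    joint[i][j] is p(X = i, Y = j). The name and input layout are
    illustrative choices, not taken from any particular library.
    """
    # Step 1: check that the joint distribution is valid.
    total = sum(sum(row) for row in joint)
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"joint probabilities sum to {total}, expected 1")

    # Step 2: marginals are row sums (p(x)) and column sums (p(y)).
    p_x = [sum(row) for row in joint]
    p_y = [sum(col) for col in zip(*joint)]

    mi = 0.0
    for i, row in enumerate(joint):
        for j, p_xy in enumerate(row):
            if p_xy == 0.0:
                continue  # 0 * log(0) is treated as 0
            ratio = p_xy / (p_x[i] * p_y[j])      # Step 3
            mi += p_xy * math.log(ratio, base)    # Steps 4 and 5
    return mi


# Independent variables give zero; a perfectly coupled fair bit gives 1 bit.
independent = [[0.25, 0.25], [0.25, 0.25]]
coupled = [[0.5, 0.0], [0.0, 0.5]]
print(mutual_information(independent))
print(mutual_information(coupled))
```

Passing `base=math.e` returns the result in nats instead of bits.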

Practical Example Calculation

Let’s work through a concrete example to illustrate the calculation process.

Example Scenario

Consider two binary random variables:

  • X ∈ {0,1} representing whether a patient has a certain genetic marker (0=no, 1=yes)
  • Y ∈ {0,1} representing whether a patient develops a disease (0=no, 1=yes)

The joint probability distribution is given by:

| p(x,y) | Y=0  | Y=1  | p(x) |
|--------|------|------|------|
| X=0    | 0.35 | 0.15 | 0.50 |
| X=1    | 0.20 | 0.30 | 0.50 |
| p(y)   | 0.55 | 0.45 | 1.00 |

Let’s calculate the mutual information using base 2 (bits):

  1. For X=0, Y=0:
    p(0,0) = 0.35
    p(X=0) = 0.50, p(Y=0) = 0.55
    Ratio = 0.35 / (0.50 × 0.55) ≈ 1.2727
    log₂(1.2727) ≈ 0.3479
    Contribution = 0.35 × 0.3479 ≈ 0.1218

  2. For X=0, Y=1:
    p(0,1) = 0.15
    p(X=0) = 0.50, p(Y=1) = 0.45
    Ratio = 0.15 / (0.50 × 0.45) ≈ 0.6667
    log₂(0.6667) ≈ -0.5850
    Contribution = 0.15 × (-0.5850) ≈ -0.0877

  3. For X=1, Y=0:
    p(1,0) = 0.20
    p(X=1) = 0.50, p(Y=0) = 0.55
    Ratio = 0.20 / (0.50 × 0.55) ≈ 0.7273
    log₂(0.7273) ≈ -0.4594
    Contribution = 0.20 × (-0.4594) ≈ -0.0919

  4. For X=1, Y=1:
    p(1,1) = 0.30
    p(X=1) = 0.50, p(Y=1) = 0.45
    Ratio = 0.30 / (0.50 × 0.45) ≈ 1.3333
    log₂(1.3333) ≈ 0.4150
    Contribution = 0.30 × 0.4150 ≈ 0.1245

Total Mutual Information (contributions computed from unrounded logarithms):
0.1218 + (-0.0877) + (-0.0919) + 0.1245 ≈ 0.0667 bits

Note: The small positive value indicates a slight dependence between the genetic marker and disease development.
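
One way to double-check hand arithmetic like this is to recompute the result by a different route. The sketch below applies the identity I(X;Y) = H(X) + H(Y) – H(X,Y) to the same table; the helper name `entropy_bits` is hypothetical.

```python
import math


def entropy_bits(probs):
    """Shannon entropy in bits of an iterable of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


joint = [[0.35, 0.15],
         [0.20, 0.30]]

p_x = [sum(row) for row in joint]          # marginal of the genetic marker
p_y = [sum(col) for col in zip(*joint)]    # marginal of the disease outcome
cells = [p for row in joint for p in row]  # all four joint probabilities

h_x = entropy_bits(p_x)
h_y = entropy_bits(p_y)
h_xy = entropy_bits(cells)

mi_from_entropies = h_x + h_y - h_xy
print(f"H(X)={h_x:.4f}  H(Y)={h_y:.4f}  H(X,Y)={h_xy:.4f}")
print(f"I(X;Y)={mi_from_entropies:.4f} bits")
```

If the entropy route and the term-by-term sum disagree beyond rounding, one of the logarithms or marginals was probably mistyped.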

Normalized Mutual Information

To make mutual information values more comparable across different datasets, we often normalize by the maximum possible mutual information, which is the smaller of the two marginal entropies, since I(X;Y) ≤ min(H(X), H(Y)):

NMI(X;Y) = I(X;Y) / min(H(X), H(Y))

Where H(X) and H(Y) are the marginal entropies:

H(X) = -∑ p(x) log_b p(x)
H(Y) = -∑ p(y) log_b p(y)

For our example:

  • H(X) = -[0.5 log₂(0.5) + 0.5 log₂(0.5)] = 1 bit
  • H(Y) = -[0.55 log₂(0.55) + 0.45 log₂(0.45)] ≈ 0.9928 bits
  • min(H(X), H(Y)) ≈ 0.9928 bits
  • NMI = 0.0667 / 0.9928 ≈ 0.0671
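
As a rough sketch of the normalization step, the function below computes both marginal entropies and divides the mutual information by a normalizer chosen by the caller. The name `normalized_mi` and its `normalizer` parameter are assumptions for illustration. Both the minimum and the maximum of the two entropies are used as normalizers in practice, and they give slightly different values.

```python
import math


def entropy(probs, base=2.0):
    """Shannon entropy of a list of probabilities."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)


def normalized_mi(joint, normalizer=min, base=2.0):
    """I(X;Y) divided by normalizer(H(X), H(Y)); returns 0 if that is 0."""
    p_x = [sum(row) for row in joint]
    p_y = [sum(col) for col in zip(*joint)]
    h_x, h_y = entropy(p_x, base), entropy(p_y, base)
    h_xy = entropy([p for row in joint for p in row], base)
    mi = h_x + h_y - h_xy
    denom = normalizer(h_x, h_y)
    return mi / denom if denom > 0 else 0.0


joint = [[0.35, 0.15],
         [0.20, 0.30]]
print(normalized_mi(joint, normalizer=min))
print(normalized_mi(joint, normalizer=max))
```

Dividing by the minimum reaches 1 whenever one variable is a function of the other; dividing by the maximum reaches 1 only when each variable determines the other.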

Interpreting Mutual Information Values

| MI Value Range | Normalized MI | Interpretation |
|---|---|---|
| 0 | 0 | Variables are completely independent |
| 0 to ~0.1×max | 0 to ~0.1 | Very weak dependence |
| ~0.1×max to ~0.3×max | ~0.1 to ~0.3 | Weak dependence |
| ~0.3×max to ~0.7×max | ~0.3 to ~0.7 | Moderate dependence |
| ~0.7×max to max | ~0.7 to 1 | Strong dependence |
| max = min(H(X), H(Y)) | 1 | Variables are perfectly dependent (one is a function of the other) |

Common Pitfalls and How to Avoid Them

  1. Incorrect probability distributions:

    Ensure your joint probabilities sum to 1 and that marginal probabilities are correctly calculated by summing over the appropriate dimensions.

  2. Using the wrong logarithm base:

    Remember that different bases give different units:

    • Base 2: bits (common in computer science)
    • Base e: nats (common in mathematics)
    • Base 10: dits or hartleys (less common)

  3. Ignoring zero probabilities:

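    When p(x,y) = 0, the term contributes 0 to the sum. This follows the convention 0 × log 0 = 0, which is justified by the limit t log t → 0 as t → 0⁺. Skip these cells explicitly in code, because evaluating log(0) directly produces an error or -∞.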

  4. Confusing mutual information with correlation:

    MI captures any statistical dependence, not just linear relationships like correlation.

  5. Overinterpreting small values:

    With finite samples, even independent variables may show small MI values due to sampling noise.
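
Pitfalls 3 and 4 can be seen together in a small constructed example (not taken from the data above): X is uniform on {-1, 0, 1} and Y = X². The joint table contains zero cells, the covariance is zero, and yet Y is fully determined by X.

```python
import math

# Joint distribution of X in {-1, 0, 1} (rows) and Y = X**2 in {0, 1} (columns).
x_values = [-1, 0, 1]
y_values = [0, 1]
joint = [[0.0, 1 / 3],   # X = -1 -> Y = 1
         [1 / 3, 0.0],   # X =  0 -> Y = 0
         [0.0, 1 / 3]]   # X =  1 -> Y = 1

p_x = [sum(row) for row in joint]
p_y = [sum(col) for col in zip(*joint)]

# Covariance E[XY] - E[X]E[Y]; zero covariance means zero Pearson correlation.
e_x = sum(x * p for x, p in zip(x_values, p_x))
e_y = sum(y * p for y, p in zip(y_values, p_y))
e_xy = sum(x * y * joint[i][j]
           for i, x in enumerate(x_values)
           for j, y in enumerate(y_values))
covariance = e_xy - e_x * e_y

# Mutual information in bits, skipping zero cells (0 * log 0 := 0).
mi_bits = sum(p * math.log2(p / (p_x[i] * p_y[j]))
              for i, row in enumerate(joint)
              for j, p in enumerate(row)
              if p > 0)

print(f"covariance = {covariance:.4f}")
print(f"I(X;Y)     = {mi_bits:.4f} bits")
```

Here the mutual information equals H(Y), because knowing X removes all uncertainty about Y.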

Advanced Topics

Conditional Mutual Information

Measures the mutual information between two variables given a third variable:

I(X;Y|Z) = H(X|Z) – H(X|Y,Z)

Useful for controlling for confounding variables in complex systems.
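
A minimal sketch of conditional mutual information for discrete variables, assuming the joint distribution is supplied as a dictionary mapping (x, y, z) tuples to probabilities (an input format chosen for this example). It uses the equivalent summation form I(X;Y|Z) = ∑ p(x,y,z) log[ p(z) p(x,y,z) / (p(x,z) p(y,z)) ]. The XOR case is a standard illustration in which X and Y are independent on their own but become fully dependent once Z is known.

```python
import math
from collections import defaultdict


def conditional_mutual_information(joint_xyz, base=2.0):
    """I(X;Y|Z) from a dict {(x, y, z): probability}."""
    p_z = defaultdict(float)
    p_xz = defaultdict(float)
    p_yz = defaultdict(float)
    for (x, y, z), p in joint_xyz.items():
        p_z[z] += p
        p_xz[(x, z)] += p
        p_yz[(y, z)] += p

    cmi = 0.0
    for (x, y, z), p in joint_xyz.items():
        if p > 0:
            ratio = p_z[z] * p / (p_xz[(x, z)] * p_yz[(y, z)])
            cmi += p * math.log(ratio, base)
    return cmi


# X and Y are independent fair bits and Z = X XOR Y.
xor_joint = {(x, y, x ^ y): 0.25 for x in (0, 1) for y in (0, 1)}
print(conditional_mutual_information(xor_joint))
```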

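Multivariate Mutual Information

One common extension of MI to more than two variables is the total correlation (also called multi-information):

C(X₁,X₂,…,Xₙ) = ∑ₖ H(Xₖ) – H(X₁,X₂,…,Xₙ)

It measures the total amount of dependence among all the variables and reduces to I(X₁;X₂) when n = 2. The semicolon notation I(X₁;X₂;…;Xₙ) usually denotes a different quantity, interaction information, which can be negative.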
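
The sum-of-entropies formula above translates directly into code. The sketch below assumes the joint distribution over n discrete variables is given as a dictionary from outcome tuples to probabilities; the function name `total_correlation` is an illustrative choice.

```python
import math
from collections import defaultdict


def entropy(probs, base=2.0):
    """Shannon entropy of an iterable of probabilities."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)


def total_correlation(joint, base=2.0):
    """Sum of marginal entropies minus the joint entropy.

    joint maps outcome tuples (x1, ..., xn) to probabilities.
    """
    n = len(next(iter(joint)))
    marginals = [defaultdict(float) for _ in range(n)]
    for outcome, p in joint.items():
        for k, value in enumerate(outcome):
            marginals[k][value] += p

    sum_marginal = sum(entropy(m.values(), base) for m in marginals)
    return sum_marginal - entropy(joint.values(), base)


# Three bits where the third is the XOR of two independent fair bits.
xor_joint = {(a, b, a ^ b): 0.25 for a in (0, 1) for b in (0, 1)}
print(total_correlation(xor_joint))
```

In the XOR example every pair of variables is independent, so all of the dependence this quantity reports lives in the three-way relationship.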

Mutual Information for Continuous Variables

For continuous variables, we use probability density functions:

I(X;Y) = ∫∫ p(x,y) log(p(x,y)/(p(x)p(y))) dx dy

Often estimated using binning or kernel density estimation.
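
As a rough sketch of the binning approach, the code below draws correlated Gaussian samples, discretizes each variable into equal-width bins, and applies the discrete formula to the resulting frequency table. The bin count, sample size, and correlation are arbitrary values chosen for illustration, and the estimate is sensitive to all of them. For a bivariate Gaussian with correlation ρ the exact value is -½ ln(1 - ρ²) nats, which gives a reference point.

```python
import math
import random
from collections import Counter


def binned_mi(xs, ys, bins=10):
    """Plug-in MI estimate (in nats) after equal-width binning."""
    def to_bins(values):
        lo, hi = min(values), max(values)
        width = (hi - lo) / bins or 1.0  # guard against constant input
        # Clamp so the maximum value falls in the last bin.
        return [min(int((v - lo) / width), bins - 1) for v in values]

    bx, by = to_bins(xs), to_bins(ys)
    n = len(xs)
    c_xy, c_x, c_y = Counter(zip(bx, by)), Counter(bx), Counter(by)
    # p(x,y) / (p(x) p(y)) = c_xy * n / (c_x * c_y)
    return sum((c / n) * math.log(c * n / (c_x[i] * c_y[j]))
               for (i, j), c in c_xy.items())


random.seed(0)
rho = 0.8
xs = [random.gauss(0, 1) for _ in range(20000)]
ys = [rho * x + math.sqrt(1 - rho ** 2) * random.gauss(0, 1) for x in xs]

estimate = binned_mi(xs, ys)
exact = -0.5 * math.log(1 - rho ** 2)
print(f"binned estimate: {estimate:.3f} nats, Gaussian closed form: {exact:.3f} nats")
```

Coarse bins tend to discard information and pull the estimate down, while very fine bins with few samples per cell push it up, so it is worth trying several bin counts.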

Real-World Applications with Statistics

| Application Domain | Typical MI Values | Interpretation | Example Study |
|---|---|---|---|
| Gene expression analysis | 0.1-0.5 bits | Moderate regulatory relationships between genes | Butte & Kohane (2000) |
| Neural coding | 0.01-0.2 bits/spike | Information carried by individual neurons | Rieke et al. (1997) |
| Image registration | 0.5-2.0 bits | Alignment quality between medical images | Viola & Wells (1997) |
| Natural language processing | 0.05-0.3 bits | Word co-occurrence relationships | Church & Hanks (1990) |
| Financial markets | 0.001-0.05 bits | Weak dependencies between assets | Marschinski & Kantz (2002) |

Computational Considerations

When implementing mutual information calculations:

  1. Numerical stability:

    For very small probabilities, use log-sum-exp tricks to avoid underflow:

    log(a + b) = max(log(a), log(b)) + log(1 + exp(-|log(a) – log(b)|))

  2. Efficient computation:

    For large datasets, use sparse representations of probability distributions to save memory.

  3. Bias correction:

    For empirical distributions from finite samples, apply corrections like:

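    I_corrected(X;Y) = Î(X;Y) – (|X|-1)(|Y|-1)/(2N ln 2)

    Where Î(X;Y) is the plug-in estimate in bits computed from the empirical distribution, N is the sample size, and |X|, |Y| are the number of bins. The factor ln 2 converts the correction from nats to bits, so drop it when working in nats.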

  4. Parallel computation:

    The double summation in the MI formula is embarrassingly parallel – each term can be computed independently.
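
Items 1 and 3 can be sketched as follows. `log_add` implements the log-sum-exp identity above for two log-probabilities, and `corrected_mi_bits` applies the first-order bias correction to a plug-in estimate computed from a table of counts. Both function names are hypothetical, and the number of bins is taken here to be the number of non-empty rows and columns, which is one common choice rather than the only one.

```python
import math


def log_add(log_a, log_b):
    """Return log(a + b) given log(a) and log(b), without leaving log space."""
    if log_a == -math.inf:
        return log_b
    if log_b == -math.inf:
        return log_a
    return max(log_a, log_b) + math.log1p(math.exp(-abs(log_a - log_b)))


def corrected_mi_bits(counts):
    """Plug-in MI in bits from a count table, minus (|X|-1)(|Y|-1)/(2N ln 2)."""
    n = sum(sum(row) for row in counts)
    row_tot = [sum(row) for row in counts]
    col_tot = [sum(col) for col in zip(*counts)]
    plug_in = sum((c / n) * math.log2(c * n / (row_tot[i] * col_tot[j]))
                  for i, row in enumerate(counts)
                  for j, c in enumerate(row) if c > 0)
    bins_x = sum(1 for t in row_tot if t > 0)
    bins_y = sum(1 for t in col_tot if t > 0)
    bias = (bins_x - 1) * (bins_y - 1) / (2 * n * math.log(2))
    return plug_in - bias


# exp(-1000) underflows to 0.0 in double precision, but its log is usable.
print(log_add(-1000.0, -1000.0))   # log(2 * exp(-1000)) = -1000 + ln 2

# Counts from 200 hypothetical observations of two binary variables.
counts = [[52, 48],
          [47, 53]]
print(corrected_mi_bits(counts))
```

The corrected value can come out slightly negative when the variables are close to independent, which is best read as no detectable dependence at that sample size.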

Alternative Dependence Measures

While mutual information is powerful, other measures exist for specific scenarios:

| Measure | When to Use | Relationship to MI |
|---|---|---|
| Pearson correlation | Linear relationships between continuous variables | MI captures any dependence; correlation only linear |
| Spearman’s rank correlation | Monotonic relationships in ordinal data | MI is more general but less interpretable for rankings |
| Kendall’s tau | Ordinal associations with many tied ranks | MI can detect non-monotonic relationships |
| Chi-squared test | Testing independence in contingency tables | MI quantifies dependence strength; chi-squared tests significance |
| Kullback-Leibler divergence | Comparing two probability distributions | MI is the KL divergence between the joint distribution p(x,y) and the product of marginals p(x)p(y) |

Learning Resources

For those interested in deeper study of information theory and mutual information:

  • Books:
    • “Elements of Information Theory” by Cover & Thomas (the standard textbook)
    • “Information Theory, Inference, and Learning Algorithms” by MacKay (more applied focus)
    • “A Mathematical Theory of Communication” by Shannon (the original foundational work)
  • Software Tools:
    • Python: sklearn.metrics.mutual_info_score (returns the value in nats)
    • R: infotheo::mutinformation (or entropy::mi.empirical)
    • MATLAB: no mutual information function in the base product; community implementations are available on MATLAB File Exchange

Frequently Asked Questions

  1. Can mutual information be negative?

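    No, mutual information is always non-negative. A value of 0 indicates independence, while positive values indicate dependence. Individual terms of the sum can be negative, as two of the four contributions in the worked example are, but the total never is.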

  2. How does mutual information relate to entropy?

    MI can be expressed as: I(X;Y) = H(X) – H(X|Y) = H(Y) – H(Y|X), where H is entropy and H(X|Y) is conditional entropy.

  3. What’s the difference between mutual information and joint entropy?

    Joint entropy H(X,Y) measures the total uncertainty of the pair (X,Y), while MI measures how much knowing one variable reduces uncertainty about the other.

  4. Can MI be used for feature selection in machine learning?

    Yes, MI is commonly used to select features that are most informative about the target variable, as it can capture non-linear relationships.

  5. How do I handle continuous variables?

    For continuous variables, you can:

    • Discretize the variables into bins
    • Use kernel density estimation to approximate the PDFs
    • Use differential entropy formulations
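
To make the feature-selection answer concrete, here is a small sketch that estimates MI from paired samples and ranks candidate features by it. The toy data, the feature names, and the helper `mi_from_samples` are all invented for illustration. With real data the estimate would also carry the finite-sample bias discussed earlier.

```python
import math
from collections import Counter


def mi_from_samples(xs, ys):
    """Plug-in mutual information (bits) between two equal-length sequences."""
    n = len(xs)
    c_xy, c_x, c_y = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum((c / n) * math.log2(c * n / (c_x[x] * c_y[y]))
               for (x, y), c in c_xy.items())


# Toy dataset: the target copies feature "a", "b" agrees with it on
# 6 of 8 rows, and "c" is unrelated.
target = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "a": [0, 0, 0, 0, 1, 1, 1, 1],
    "b": [0, 0, 0, 1, 1, 1, 1, 0],
    "c": [0, 1, 0, 1, 0, 1, 0, 1],
}

scores = {name: mi_from_samples(col, target) for name, col in features.items()}
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f} bits")
```

Ranking by MI alone ignores redundancy between features, so two near-duplicate features would both score highly.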
