Estimating The Difference Between Probability Distributions Using Probabilistic Classifiers
Quantifying the difference between two probability distributions is a fundamental problem in machine learning, and there are a couple of common ways to quantify that difference. Integral Probability Metrics (IPMs), such as the Wasserstein distance, are based on the idea that if two distributions are identical, any function should have the same expectation under both distributions. f-divergences, such as the KL divergence and the Jensen-Shannon divergence, are based on the idea that if two distributions are identical, they assign the same likelihood to every point. One can then define a discrepancy based on how different the likelihood ratio is from one.
Let’s take a closer look at f-divergences. Given any convex continuous function \( f : \mathbb{R}_+ \rightarrow \mathbb{R}\) such that \( f(1) = 0\), the f-divergence is defined (assuming densities exist) as \( \mathcal{D}_f[p \,||\, q] = \mathbb{E}_q[f(p(X)/q(X))]\). Examples include the KL divergence, where \( f : t \mapsto t \log t \), and the Jensen-Shannon divergence, where \( f : t \mapsto (t+1) \log \left( \frac{2}{t+1} \right) + t \log t \).
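To make the generators concrete, here is a minimal Python sketch (the helper names `f_kl` and `f_js` are my own, not from any library); both satisfy the requirement \( f(1) = 0\):

```python
import numpy as np

def f_kl(t):
    """Generator of the KL divergence: f(t) = t log t."""
    return t * np.log(t)

def f_js(t):
    """Generator of the Jensen-Shannon divergence (up to a constant factor,
    depending on convention): f(t) = (t + 1) log(2 / (t + 1)) + t log t."""
    return (t + 1) * np.log(2.0 / (t + 1)) + t * np.log(t)

# Both generators are convex on (0, inf) and vanish at t = 1.
print(f_kl(1.0), f_js(1.0))  # 0.0 0.0
```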
Let’s use KL divergence as an example. The Kullback-Leibler (KL) divergence between two distributions is defined as \[ \mathcal{D}_{\mathrm{KL}}[p(x) || q(x)] = {\mathbb{E}}_p \left [ \log \left ( \frac{p(x)}{q(x)} \right ) \right ]. \]
If we let \( r^{*}(x) \) be the ratio between the densities \( p(x)\) and \( q(x)\), \( r^{*}(x) = \frac{p(x)}{q(x)}\), then we can write the KL divergence more simply as \( \mathcal{D}_{\mathrm{KL}}[p(x) || q(x)] = \mathbb{E}_p [ \log r^{*}(x) ]\).
This density ratio is crucial not only for computing the KL divergence but for all f-divergences (see above):
\[ \mathcal{D}_f[p(x) || q(x)] := \mathbb{E}_q \left [ f \left ( \frac{p(x)}{q(x)} \right ) \right ].\]
This integral can rarely be calculated analytically, and in most cases it must be approximated (e.g. with a Monte Carlo approximation). Moreover, in the case that \( p(x)\) and/or \( q(x)\) are not available in closed form, we need to resort to estimating them. In fact, we can estimate their density ratio through its connection with probabilistic classification.
In the special case that the true distributions are both Gaussians, it is possible to compute their KL divergence analytically. In general, for \( p(x)\) and \( q(x)\) where:
\( p(x) = \mathcal{N}(x \mid \mu_p, \sigma_p^2), \qquad \text{and} \qquad q(x) = \mathcal{N}(x \mid \mu_q, \sigma_q^2),\)
\( \mathrm{KL}[ p(x) || q(x) ] = \log \sigma_q - \log \sigma_p - \frac{1}{2} \left [ 1 - \left ( \frac{\sigma_p^2 + (\mu_p - \mu_q)^2}{\sigma_q^2} \right ) \right ].\) You can verify this for yourselves.
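As a quick sanity check, here is a minimal sketch of the closed form above (the function name `kl_gaussians` is mine):

```python
import numpy as np

def kl_gaussians(mu_p, sigma_p, mu_q, sigma_q):
    """Closed-form KL[p || q] for two univariate Gaussians, as in the formula above."""
    return (np.log(sigma_q) - np.log(sigma_p)
            - 0.5 * (1.0 - (sigma_p**2 + (mu_p - mu_q)**2) / sigma_q**2))

# Example: p = N(1, 1^2), q = N(0, 2^2)
print(kl_gaussians(1.0, 1.0, 0.0, 2.0))  # ~0.443
```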
However, for cases where we cannot compute the KL divergence analytically, we can resort to estimating it, e.g. with a Monte Carlo approximation:
\[ \mathcal{D}_{\mathrm{KL}}[p(x) || q(x)] = \mathbb{E}_p[ \log r^{*}(x) ] \approx \frac{1}{M} \sum_{i=1}^M \log r^{*}(x_p^{(i)}), \quad x_p^{(i)} \sim p(x). \]
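Here is a minimal sketch of this Monte Carlo estimate for the same pair of Gaussians, using the true log density ratio (the example values are my own):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
mu_p, sigma_p, mu_q, sigma_q = 1.0, 1.0, 0.0, 2.0

# Draw M samples from p and average the true log density ratio log p(x) - log q(x).
M = 100_000
x_p = rng.normal(mu_p, sigma_p, size=M)
log_r = norm.logpdf(x_p, mu_p, sigma_p) - norm.logpdf(x_p, mu_q, sigma_q)

print(log_r.mean())  # close to the closed-form value of ~0.443
```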
We still need a density ratio though! For prescribed distributions such as Gaussians we can calculate it directly. In many real-world cases, however, we first need to model the density ratio of our implicit distributions. \( p(x)\) and \( q(x)\) could be anything: images, documents, spam and normal emails! While it’s possible to model the ratio by first modeling each distribution individually, we often don’t for practical reasons: any error in estimating the denominator is scaled significantly. This brings us to density ratio estimation.
The cool part is that this problem can be reduced to probabilistic classification. Suppose we have \( N_p\) and \( N_q\) samples drawn from \( p(x)\) and \( q(x)\), respectively:
\( x_p^{(1)}, \dotsc, x_p^{(N_p)} \sim p(x), \qquad \text{and} \qquad x_q^{(1)}, \dotsc, x_q^{(N_q)} \sim q(x).\)
We construct our dataset of \( N = N_p + N_q\) samples by labeling each point according to the distribution it was drawn from: \( \{ (x_p^{(i)}, y = 1) \}_{i=1}^{N_p} \cup \{ (x_q^{(j)}, y = 0) \}_{j=1}^{N_q}.\)
Then \( p(x) = \mathcal{P}(x \mid y = 1) \qquad \text{and} \qquad q(x) = \mathcal{P}(x \mid y = 0),\) and by using Bayes’ rule we can write the density ratio as:
\( r^{*}(x) = \frac{\mathcal{P}(y = 0)}{\mathcal{P}(y = 1)} \frac{\mathcal{P}(y = 1 \mid x)} {\mathcal{P}(y = 0 \mid x)}. \)
For the sake of simplicity, let’s assume the ratio of sample sizes is 1, so the prior ratio \( \mathcal{P}(y = 0)/\mathcal{P}(y = 1)\) drops out. Then:
\begin{align} r^{*}(x) = \frac{\mathcal{P}(y = 1 \mid x)} {\mathcal{P}(y = 0 \mid x)} & = \frac{\mathcal{P}(y = 1 \mid x)} {1 - \mathcal{P}(y = 1 \mid x)} \newline & = \exp \left [ \log \frac{\mathcal{P}(y = 1 \mid x)} {1 - \mathcal{P}(y = 1 \mid x)} \right ] \newline & = \exp[ \sigma^{-1}(\mathcal{P}(y = 1 \mid x)) ], \end{align}
where \( \sigma^{-1}\) is the inverse sigmoid function (logit), \( \sigma^{-1}(\rho) = \log \left ( \frac{\rho}{1-\rho} \right )\).
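As a quick numerical check of this identity (a sketch assuming balanced classes, so \( \mathcal{P}(y = 1 \mid x) = \frac{p(x)}{p(x) + q(x)}\)), exponentiating the logit of the exact class posterior recovers the density ratio:

```python
import numpy as np
from scipy.stats import norm

mu_p, sigma_p, mu_q, sigma_q = 1.0, 1.0, 0.0, 2.0
x = np.linspace(-3.0, 3.0, 7)

p_x = norm.pdf(x, mu_p, sigma_p)
q_x = norm.pdf(x, mu_q, sigma_q)

# With balanced classes the exact class posterior is P(y = 1 | x) = p(x) / (p(x) + q(x)).
posterior = p_x / (p_x + q_x)

# exp(logit(posterior)) recovers the density ratio p(x) / q(x) exactly.
ratio_via_logit = np.exp(np.log(posterior) - np.log1p(-posterior))
print(np.allclose(ratio_via_logit, p_x / q_x))  # True
```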
The class-posterior probability can be approximated using a parameterized function \( D_\theta(x)\), e.g. a classifier, that outputs the probability that \( x\) was drawn from \( p(x)\):
\begin{align} r_{\theta}(x) & = \exp[ \sigma^{-1}(D_{\theta}(x)) ] \newline & \approx \exp[ \sigma^{-1}(\mathcal{P}(y = 1 \mid x)) ] = r^{*}(x). \end{align}
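Putting it all together, here is a minimal sketch of the whole recipe using scikit-learn’s LogisticRegression as \( D_\theta\) (the quadratic features and all constants are my own choices, not a prescription):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
mu_p, sigma_p, mu_q, sigma_q = 1.0, 1.0, 0.0, 2.0

# Label samples from p with y = 1 and samples from q with y = 0 (balanced classes).
N = 50_000
x_p = rng.normal(mu_p, sigma_p, size=N)
x_q = rng.normal(mu_q, sigma_q, size=N)
X = np.concatenate([x_p, x_q])
y = np.concatenate([np.ones(N, dtype=int), np.zeros(N, dtype=int)])

# For two Gaussians the true log ratio is quadratic in x, so quadratic features suffice;
# for complicated implicit distributions D_theta would typically be a neural network.
phi = lambda x: np.hstack([x.reshape(-1, 1), x.reshape(-1, 1) ** 2])
clf = LogisticRegression(C=1e4).fit(phi(X), y)  # weak regularization

# log r_theta(x) = logit(D_theta(x)); decision_function returns exactly this log-odds.
log_r = clf.decision_function(phi(x_p))

# Monte Carlo KL estimate using samples from p; compare with the closed form (~0.443).
print(log_r.mean())
```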
Having obtained an estimate of the log density ratio, one can now compute an estimate of the KL divergence, as the sketch above does. This in itself can be useful for a lot of things, including bounding your loss functions: for example, the binary cross-entropy loss is lower-bounded by the negative Jensen-Shannon divergence (up to some constants)! In fact, a lower bound of this kind can be derived for any f-divergence, since f is convex.
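To make the Jensen-Shannon claim concrete, one common way to state it (a sketch, assuming balanced classes and writing the binary cross-entropy as a sum of per-class expectations) is:

\[ \mathrm{BCE}(D) = -\mathbb{E}_p[\log D(x)] - \mathbb{E}_q[\log(1 - D(x))] \;\geq\; \log 4 - 2\,\mathcal{D}_{\mathrm{JS}}[p(x) || q(x)], \]

with equality when \( D(x) = \mathcal{P}(y = 1 \mid x)\); this is the classical link between the optimal classifier’s loss and the Jensen-Shannon divergence.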
There are tons of interesting papers and books to read on the topic, such as Density Ratio Estimation in Machine Learning; the Gaussian example above is originally from this book. A full blog post on the topic can be found here, and if you’re a visual learner, check out this Jupyter Notebook.