Shapes of conformal sets in high dimensions

I became curious about the shapes of multivariate conformal regions when working on Semiparametric conformal prediction (AISTATS 2025) [1]. Included as a baseline in that paper is a simple way to extend conformal prediction to multiple response variables: define a scalar non-conformity score on the vector-valued error.

If you are not familiar with conformal prediction, let’s briefly review the conformal calibration procedure for a single response variable (\(d=1\)). Suppose we have a point (uncalibrated) predictor \(\hat f: \mathcal{X} \to \mathbb{R}\) and an exchangeable calibration set \(\{(x^{(i)}, y^{(i)})\}_{i=1}^n\). We define a non-conformity score \(S(x, y, \hat f)\), which measures how “strange” the prediction \(\hat f(x)\) is relative to \(y\). One simple example is the absolute residual \(S(x, y, \hat f) = |y - \hat f(x)|\). We evaluate this score on the calibration set to obtain a set of \(n\) scores. Then, given a user-specified miscoverage rate \(\alpha\), we compute the empirical \(1-\alpha\) quantile of the scores. This quantile, which we denote \(q\), sets the width of the conformal prediction interval for a test point \(x^*\), defined by

\[ C_{1-\alpha}(x^*) \equiv \{y\in \mathbb{R}: S(x^*, y, \hat f) \leq q \}. \tag{1} \]

For the absolute residual, this is simply the interval \([\hat f(x^*) - q, \hat f(x^*) + q]\). Provided that the test instances are exchangeable with the calibration ones, this set is guaranteed to cover the truth \(y^*\) with probability at least \(1-\alpha\).
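As a quick illustration, here is a minimal sketch of this one-dimensional calibration step (the synthetic data and the identity “predictor” below are stand-ins for illustration, not part of the paper):

import numpy as np

def conformal_interval_1d(cal_scores, test_pred, alpha=0.1):
    """Split-conformal interval from absolute-residual scores (minimal sketch)."""
    # Empirical (1 - alpha) quantile of the calibration scores; in practice the
    # slightly inflated level ceil((n + 1) * (1 - alpha)) / n is often used to
    # get exact finite-sample coverage.
    q = np.quantile(cal_scores, 1 - alpha)
    return test_pred - q, test_pred + q

# Toy usage: the "predictor" here is simply f_hat(x) = x
rng = np.random.default_rng(0)
x_cal = rng.normal(size=500)
y_cal = x_cal + rng.normal(scale=0.5, size=500)
scores = np.abs(y_cal - x_cal)  # absolute residuals on the calibration set
print(conformal_interval_1d(scores, test_pred=0.3))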

Now, when the response variable takes values in \(\mathbb{R}^d\) for \(d>1\), we can define a scalar non-conformity score by taking the norm of the signed error vector \(y - \hat f(x) \in \mathbb{R}^d\). That is, we generalize the score to

\[ S(x, y, \hat f) = ||y - \hat f(x)||_p, \tag{2} \]

where \(||\cdot||_p\) indicates the \(p\)-norm. The rest of the procedure proceeds the same way: we compute the empirical \(1-\alpha\) quantile \(q_p\) of the scalarized scores and construct the prediction region as in Equation \(\text{(1)}\) with \(q=q_p\). There are “smarter” ways to do the scalarization, for instance by first transforming the score space [2], but we focus on simple \(L_p\)-norm scalarizations here.

The \(\geq 1-\alpha\) coverage requirement is trivially satisfied by just returning the entire space \(\mathbb{R}^d\) as the prediction set, which would not be useful. The smaller the size (or the hypervolume in \(d\) dimensions) of the prediction set, the better in the sense that it’s more precise. This post is about how to choose \(p\) such that precision is maximized (i.e., the hypervolume of the prediction set is minimized).

As Equation \(\text{(1)}\) suggests, shapes of prediction regions in \(d\) dimensions using the \(L_p\) norm of \(y - \hat f(x)\) as the non-conformity score are \(p\)-norm balls,

\[ B_p(a) \equiv \{y\in \mathbb{R}^d: ||y||_p \leq a \}, \tag{3} \]

with radius \(a=q_p\) and centered at the original prediction \(\hat f(x)\). For \(d=2\), the \(L_1\) norm for the scalar score in Equation \(\text{(2)}\) yields a diamond-shaped prediction region with distance \(q_1\) from the center to a corner, with area \(2 q_1^2\). The \(L_2\) norm yields a circle with radius \(q_2\), which has area \(\pi q_2^2\). The \(L_\infty\) norm yields a square with side length \(2 q_\infty\), which has area \(4 q_\infty^2\). The shapes for \(d=2\) and \(p=1, 2, 4, \infty\) look like the following, assuming \(y - \hat f(x) \sim \mathcal{N}(0, I_2)\).

Shapes in 2D

Similarly, for \(d=3\), the \(L_1\) norm for the scalar score yields a cross-polytope (octahedron) with distance \(q_1\) from the center to a corner, which has volume \(\frac{2^d}{d!} q_1^d = \frac{4}{3} q_1^3\). The \(L_2\) norm yields a ball with radius \(q_2\), which has volume \(\frac{4}{3} \pi q_2^3\). The \(L_\infty\) norm yields a cube with side length \(2 q_\infty\), which has volume \(8 q_\infty^3\). The shapes for \(d=3\) and \(p=1, 2, 4, \infty\) look like the following, assuming \(y - \hat f(x) \sim \mathcal{N}(0, I_3)\).

Shapes in 3D

The general formula for the (hyper)volume of a \(d\)-dimensional \(p\)-ball of radius \(a\) in Equation \(\text{(3)}\) is

\[ {\rm Vol} \left(B_p(a) \right) = \frac{ \left( 2 \Gamma\left(1 + \frac{1}{p} \right) \right)^d}{\Gamma\left(1 + \frac{d}{p} \right)} a^d. \tag{4} \]
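Equation (4) is straightforward to code up; below is a small sketch (using `scipy.special.gammaln` for numerical stability) together with sanity checks against the \(d=3\) shapes above:

import numpy as np
from scipy.special import gammaln

def pball_volume(p, d, a):
    """Volume of the d-dimensional L_p ball of radius a, Equation (4)."""
    log_vol = d * (np.log(2.0) + gammaln(1.0 + 1.0 / p)) - gammaln(1.0 + d / p) + d * np.log(a)
    return np.exp(log_vol)

a = 1.7  # arbitrary radius
print(np.isclose(pball_volume(1, 3, a), (2**3 / 6) * a**3))         # cross-polytope: (2^d / d!) q_1^d
print(np.isclose(pball_volume(2, 3, a), 4.0 / 3.0 * np.pi * a**3))  # ball: 4/3 pi q_2^3
print(np.isclose(pball_volume(np.inf, 3, a), (2 * a)**3))           # cube: (2 q_inf)^3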

The radius \(a\) to use is the quantile \(q_p\) of the \(L_p\)-scalarized scores. This leads one to wonder if, given a distribution of signed error vectors \(y - \hat f(x)\) in \(\mathbb{R}^d\), it’s possible to “win” some precision by choosing the \(p\) that gives the smallest \({\rm Vol} \left(B_p(q_p) \right)\). One observation is that, while \({\rm Vol} \left(B_p(a) \right)\) increases with \(p\) for a fixed radius \(a\), the quantile value \(q_p\) decreases monotonically with \(p\), since \(||y||_{p_1} \geq ||y||_{p_2}\) for \(p_1 \leq p_2\).

Recalling that \(q_p\) is the \(1-\alpha\) quantile of the \(p\)-norm of the signed error vectors \(y - \hat f(x)\), we know that its value must depend on the distribution of the errors. For simplicity, let us assume the errors are independent across the \(d\) response variables with a common scale, and consider ones that are distributed according to the density

\[ f(z) \propto e^{-||z||_{p^*}^{p^*}}, \]

whose level sets are \(p^*\)-balls, and which is Gaussian when \(p^*=2\) and Laplace when \(p^*=1\).

Given this error distribution, what is the optimal \(p\) to choose? In other words, what is the \(p\) that would yield the minimum \({\rm Vol} \left(B_p(q_p) \right)\), where \(q_p\) is the \(1-\alpha\) quantile of the \(p\)-norm of the errors?

The answer is \(p^*\); the proof follows below. This means that you can inspect the distribution of errors, specifically how its log-density decays, and match the order \(p\) of the scalarizing norm to it. If the errors are approximately Gaussian, use their \(L_2\) norm as the non-conformity score. If they are Laplace, use the \(L_1\) norm. The density above assumes that the scales of the errors are the same across the \(d\) response variables, but the logic of the proof extends to settings with differing scales, such as ones where one response variable is particularly difficult to predict. In fact, up to monotone transformations, it applies to any convex \(r(z)\) associated with a log-concave density \(f(z) \propto e^{-r(z)}\). The result can be understood in terms of the isoperimetric inequality applied to log-concave densities [3]: for any \(p \neq p^*\), the geometry of the \(p\)-ball is misaligned with the level sets of the density \(f\), so we pay a penalty in volume. Basically, the best scalarizing scheme is one that matches the shape of the error distribution.
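Before turning to the proof, here is a quick Monte Carlo sanity check of the claim, a sketch under the assumption of i.i.d. standard Gaussian errors (\(p^*=2\)): for each \(p\), estimate \(q_p\) as the empirical \(1-\alpha\) quantile of the \(L_p\) norms of simulated errors and plug it into Equation (4).

import numpy as np
from scipy.special import gammaln

def pball_volume(p, d, a):
    """Volume of the d-dimensional L_p ball of radius a, Equation (4)."""
    log_vol = d * (np.log(2.0) + gammaln(1.0 + 1.0 / p)) - gammaln(1.0 + d / p) + d * np.log(a)
    return np.exp(log_vol)

rng = np.random.default_rng(0)
d, alpha, n = 10, 0.1, 200_000
errors = rng.standard_normal((n, d))   # errors with p* = 2 (Gaussian)

for p in [1.0, 1.5, 2.0, 3.0, np.inf]:
    q_p = np.quantile(np.linalg.norm(errors, ord=p, axis=1), 1 - alpha)
    print(f"p = {p:>4}: q_p = {q_p:6.2f}, Vol(B_p(q_p)) = {pball_volume(p, d, q_p):.3e}")
# The estimated volume should be smallest at p = 2, matching p* of the error distribution.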

We want to show that, for a random variable \(Z\) distributed according to the density \(f(z) = C e^{-||z||_{p^*}^{p^*}}\) for some constant \(C > 0\), the volume \({\rm Vol} \left(B_p(q_p) \right)\) in Equation \(\text{(4)}\) with radius \(q_p\) chosen to satisfy

\[ \int_{B_p(q_p)} f(z) dz = \mathbb{P}(|| Z ||_p \leq q_p) = 1-\alpha, \]

is minimized if and only if \(p = p^*\).

The proof essentially depends on a well-known result that states that, for any density \(f(z)\), among all measurable sets \(A\) satisfying \(\int_A f(z) dz = 1-\alpha,\) the set that minimizes the hypervolume (the Lebesgue measure) is the highest-density region, or the superlevel set \(A^* = \{z \in \mathbb{R}^d: f(z) \geq k \}\) where \(k\) is chosen so that \(A^*\) contains mass \(1-\alpha\).

First, consider the set,

\[ B_{p^*}(r_{p^*}) \equiv \{z \in \mathbb{R}^d: ||z||_{p^*} \leq r_{p^*} \}, \]

where \(r_{p^*}\) is chosen to satisfy \(\int_{B_{p^*}(r_{p^*})} f(z) dz = 1-\alpha\). This is a highest-density region, because \(||z||_{p^*} \leq r_{p^*} \iff f(z) \geq C e^{-r_{p^*}^{p^*}}\). This means that, for any other measurable set \(A\) satisfying \(\int_A f(z) dz = 1-\alpha\), we have \({\rm Vol} \left( B_{p^*}(r_{p^*}) \right) \leq {\rm Vol}(A)\). This includes sets of the form \(B_p(q_p)\) with \(q_p\) chosen to satisfy \(\int_{B_p(q_p)} f(z) dz = 1-\alpha\). Since \(r_{p^*} = q_{p^*}\) by definition, the minimum volume among the sets \(B_p(q_p)\) is attained at \(p = p^*\).

References

[1] Park, Ji Won, Robert Tibshirani, and Kyunghyun Cho. “Semiparametric conformal prediction.” AISTATS (2025).

[2] Feldman, Shai, Stephen Bates, and Yaniv Romano. “Calibrated multiple-output quantile regression with representation learning.” JMLR (2023).

[3] Bobkov, S.G. “Isoperimetric and Analytic Inequalities for Log-Concave Probability Measures.” Annals of Probability (1999).

What we talk about when we talk about influence functions: Part 2

In Part 1 of this series, we showed that the efficient influence function (EIF) coincides with the leave-one-out empirical influence function (EmpIF) in finite samples when the underlying distribution is the empirical distribution and the estimand is the mean.

The mean \(\Psi(P) = \mathbb{E}_P[X]\) is a “nice” estimand: it is a linear functional of the distribution. That is, for any pair of distributions \(P, Q\) and any \(\alpha \in [0, 1]\), a linear functional \(\Psi\) satisfies \[\Psi\left(\alpha P + (1-\alpha) Q \right) = \alpha \Psi(P) + (1-\alpha)\Psi(Q).\]

Quantile functionals: the EIF

Now let’s consider a non-linear functional, the quantile. For this post, we will identify measures \(P\) with their distribution functions \(F\), for notational simplicity. The functional \(\Psi\) will thus take as its argument the distribution function \(F\) associated with \(P\), rather than \(P\) itself. The \(\tau\)-quantile estimand (quantile at level \(\tau\)) can be implicitly defined as \(\Psi_\tau(F)\) satisfying \[F\left( \Psi_\tau(F) \right) = \tau.\] Equivalently, we may write it in terms of the quantile function \(F^{-1}\): \[\Psi_\tau(F) = F^{-1}(\tau).\] As an example, let’s take the Cauchy distribution. The true distribution function \(F\) is the solid black curve, and the true quantile \(\Psi_\tau(F)\), marked in dashed black, is where the distribution function equals \(\tau=0.8\).

Cauchy CDF

We will derive the EIF as we did in Part 1. First, note that the quantile can be implicitly written in terms of the probability density function (PDF) \(f\): \[\int_{-\infty}^{\Psi_\tau(F)} f(s) ds = \tau.\] The \(\delta\)-contaminated density that we associate with the \(\delta\)-contaminated distribution \(F_\epsilon\) is then \(f_\epsilon(s) = (1-\epsilon)f(s) + \epsilon \delta_x(s)\), where we have contaminated the original density \(f(s)\) with a Dirac measure located at \(x\), a measure assigning probability 1 to \(\{x\}\) and 0 to sets not containing \(x\). For this density, the contaminated quantile \(\Psi_\tau(F_\epsilon)\) satisfies \[\int_{-\infty}^{\Psi_\tau(F_\epsilon)} f_\epsilon(s) ds = \tau. \tag{1}\] Recall that the EIF can then be derived as the following generalized derivative: \[{\rm EIF} \equiv \psi(x; F, \Psi_\tau) = \lim_{\epsilon \to 0} \frac{\Psi_\tau(F_\epsilon) - \Psi_\tau(F)}{\epsilon}.\] Now differentiating both sides of \((1)\) with respect to \(\epsilon\) using the Leibniz integral rule,

\[f_\epsilon \left( \Psi_\tau(F_\epsilon) \right) \frac{d \Psi_\tau(F_\epsilon)}{d \epsilon} + \int^{\Psi_\tau(F_\epsilon)}_{-\infty} \frac{d f_\epsilon(s)}{d\epsilon} ds = 0.\]

Rearranging,

\[\frac{d \Psi_\tau(F_\epsilon)}{d \epsilon} =\frac{-1}{f_\epsilon \left( \Psi_\tau(F_\epsilon) \right) } \int^{\Psi_\tau(F_\epsilon)}_{-\infty} \frac{d f_\epsilon(s)}{d\epsilon} ds.\]

The integrand is \[\frac{d f_\epsilon(s)}{d \epsilon} = -f(s) + \delta_x(s),\] and note that \(\int_{-\infty}^{c} f(s) ds = F(c)\) and \(\int_{-\infty}^{c'} \delta_x(s) ds = 1[x \leq c']\), where \(1[\cdot]\) is the indicator function, for \(c, c' \in \mathbb{R}\). Finally, then, we have

\[\frac{d \Psi_\tau(F_\epsilon)}{d \epsilon} = \frac{F\left(\Psi_\tau (F_\epsilon)\right) - 1\left[x \leq \Psi_\tau (F_\epsilon)\right]}{f_\epsilon \left( \Psi_\tau(F_\epsilon) \right) } = \frac{F\left(F^{-1}_\epsilon(\tau) \right) - 1\left[x \leq \Psi_\tau (F_\epsilon)\right]}{f_\epsilon \left( \Psi_\tau(F_\epsilon) \right) }.\]

Evaluating at \(\epsilon=0\), we have derived the EIF:

\[\psi(x; F, \Psi_\tau) = \frac{\tau - 1[x \leq \Psi_\tau (F)]}{f(\Psi_\tau(F))} = \frac{1[\Psi_\tau (F) \leq x] - (1-\tau)}{f(\Psi_\tau(F))}.\]

In the numerator, you may recognize the derivative of the pinball loss used for quantile regression. The pinball loss is \(l_\tau(x) = x\left(1[x \geq 0] - (1-\tau)\right)\), and its derivative \(l'_\tau(x) = 1[x \geq 0] - (1-\tau).\)
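As a small aside, a sketch of the pinball loss and its derivative as written above:

import numpy as np

def pinball_loss(x, tau):
    """l_tau(x) = x * (1[x >= 0] - (1 - tau))."""
    return x * ((x >= 0).astype(float) - (1 - tau))

def pinball_grad(x, tau):
    """Derivative of the pinball loss (away from the kink at x = 0)."""
    return (x >= 0).astype(float) - (1 - tau)

x = np.array([-1.0, 0.5, 2.0])
print(pinball_loss(x, tau=0.8))  # 0.2, 0.4, 1.6
print(pinball_grad(x, tau=0.8))  # -0.2, 0.8, 0.8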

Quantile functionals: the EmpIF

Similarly as for the mean estimand, the leave-one-out EmpIF can be written as the (scaled) difference: \[{\rm EmpIF} = \psi_{\rm emp}(x_j; \Psi_\tau) = (n-1)\left(\Psi_\tau(F_n) - \Psi_\tau\left(F_n^{-j}\right)\right), \tag{2}\]where \(F_n\) is the distribution associated with the empirical measure \(P_n\) and \(F_n^{-j}\) is the distribution with the \(j\)-th data point \(x_j\) removed.

Going back to the Cauchy example, suppose we draw 20 samples from this distribution, yielding the empirical distribution \(F_n\), shown in orange, with its empirical quantile \(\Psi_\tau(F_n) = 0.89\). We remove one data point, \(x_j\), and obtain the distribution \(F_n^{-j}\), shown in green, with its empirical quantile \(\Psi_\tau(F_n^{-j}) = 2.24\). Because \(x_j\) is relatively small, the quantile estimate goes up when we remove it.

Cauchy ECDF

Recall that, in Part 1, where we discussed the mean functional \(\Psi\), the leave-one-out EmpIF \(\psi_\textrm{emp}(x_j; \Psi)\) was equal to the EIF evaluated at the empirical measure, \(\psi(x_j; F_n, \Psi)\). We will now show that this equality does not hold for the quantile functional \(\Psi_\tau\). In fact, \(\psi_\textrm{emp}(x_j; \Psi_\tau)\) and \(\psi(x_j; F_n, \Psi_\tau)\) only coincide in the large-\(n\) limit.

Let’s first relate the two terms that enter the difference in Equation \((2)\). The full-sample distribution is \[F_n(t) = \frac{1}{n} \sum_{i=1}^n 1\left[ x_i \leq t \right],\] whereas the distribution with \(x_j\) left out is \[F_n^{-j}(t) = \frac{1}{n-1} \sum_{i \neq j} 1\left[ x_i \leq t \right].\] These are related by \[F_n^{-j}(t) = \frac{1}{n-1} \left( n F_n(t) - 1 \left[x_j \leq t \right] \right).\] When \(t=\Psi_\tau (F_n)\), \[F_n^{-j}\left( \Psi_\tau (F_n) \right) = \frac{n}{n-1} \tau - \frac{1}{n-1} 1 \left[x_j \leq \Psi_\tau (F_n) \right], \tag{3}\] using the fact that \(F_n(\Psi_\tau(F_n)) = \tau\).
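This identity relating \(F_n^{-j}\) to \(F_n\) is easy to check numerically; a quick sketch:

import numpy as np

rng = np.random.default_rng(0)
n, j, t = 50, 3, 0.7
x = rng.standard_cauchy(n)

F_n = np.mean(x <= t)                         # full-sample ECDF at t
F_n_minus_j = np.mean(np.delete(x, j) <= t)   # leave-one-out ECDF at t
rhs = (n * F_n - float(x[j] <= t)) / (n - 1)  # right-hand side of the identity
print(np.isclose(F_n_minus_j, rhs))           # True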

Next, let us consider the true distribution \(F\) and for notational simplicity define \(q^{-j} \equiv \Psi_\tau(F_n^{-j})\) and \(q \equiv \Psi_\tau(F_n)\). Since \(q^{-j}\) and \(q\) are close together, we can Taylor-expand \(F(q^{-j})\) around \(q\). Setting \(\Delta \equiv q^{-j} - q\), we have \[F(q^{-j}) = F(q) + f(q) \Delta + \frac{1}{2} f'(q) \Delta^2 + \mathcal{O}\left(\Delta^3 \right),\] where \(f\) is the density function associated with \(F\) and \(f'\) is its derivative.

The leave-one-out EmpIF in Equation \((2)\) is \(-(n-1)\Delta\), so let’s solve for \(\Delta\). Rearranging the above, \[\Delta + \frac{1}{2} \frac{f'(q)}{f(q)} \Delta^2 + \mathcal{O}\left(\Delta^3 \right) = \frac{F(q^{-j}) - F(q)}{f(q)}.\] Here, we will wave our hands a little and substitute \(F^{-j}_n\) for \(F\) on the RHS. We are justified in doing so for asymptotic analysis, because \(F_n^{-j}\) approaches \(F\) in the large-\(n\) limit. We now have

\[\Delta + \frac{1}{2} \frac{f'(q)}{f(q)} \Delta^2 + \mathcal{O}\left(\Delta^3 \right) = \frac{F^{-j}_n(q^{-j}) - F^{-j}_n(q)}{f(q)} \equiv \Delta_1. \tag{4}\]

The leading-order term is \(\Delta_1\), so let us express \(\Delta\) as this plus a second-order correction; that is, \(\Delta \equiv \Delta_1 + \Delta_2\). We already know what this leading term \(\Delta_1\) looks like. In the numerator, \(F^{-j}_n(q^{-j})\) is simply \(\tau\) and we already worked out the second term in Equation \((3)\). \[\Delta_1 = \frac{1}{f(q)} \left( \tau - \frac{n}{n-1} \tau + \frac{1}{n-1} 1 \left[x_j \leq \Psi_\tau (F_n) \right] \right) = \frac{1}{n-1} \times \frac{1\left[x_j \leq \Psi_\tau(F_n) \right] - \tau}{f(q)}.\] Almost there. Canceling terms in Equation \((4)\), the second-order term is then \[\Delta_2 = -\frac{1}{2} \frac{f'(q)}{f(q)} \Delta_1^2 + \mathcal{O}\left(n^{-3} \right).\] Finally, the leave-one-out EmpIF is

\[ \psi_{\rm emp}(x; \Psi_\tau) = -(n-1)\Delta = -(n-1)(\Delta_1 + \Delta_2) = \underbrace{\frac{\tau - 1[x \leq \Psi_\tau (F_n)]}{f(\Psi_\tau(F_n))}}_{\psi(x; F_n, \Psi_\tau)} -(n-1)\Delta_2.\]

So we have recovered the EIF, evaluated at the empirical measure, as the first term. There are some remainder terms that arise because the quantile functional is not linear in the underlying measure.
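To get a feel for the size of that remainder, here is a rough sketch that plugs the true Cauchy density into \(f\) and \(f'\) and evaluates the leading correction \(-(n-1)\Delta_2 \approx \frac{1}{2}\frac{f'(q)}{f(q)}(n-1)\Delta_1^2\); it shrinks roughly like \(1/n\). The helper names below are hypothetical.

import numpy as np
from scipy import stats

def cauchy_score(x):
    """f'(x) / f(x) for the standard Cauchy density f(x) = 1 / (pi (1 + x^2))."""
    return -2.0 * x / (1.0 + x**2)

def second_order_correction(x_j, samples, tau=0.8):
    """-(n-1) * Delta_2 from the Taylor expansion, with the true density plugged in for f."""
    n = len(samples)
    q = np.quantile(samples, tau, method="inverted_cdf")  # Psi_tau(F_n)
    delta_1 = (float(x_j <= q) - tau) / stats.cauchy.pdf(q) / (n - 1)
    return 0.5 * cauchy_score(q) * (n - 1) * delta_1**2

rng = np.random.default_rng(0)
for n in [20, 200, 2000]:
    samples = stats.cauchy.rvs(size=n, random_state=rng)
    print(n, second_order_correction(samples[0], samples))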

Numerical verification

Let’s evaluate the EIF with the true population quantile, EIF with the empirical quantile, and the leave-one-out EmpIF and see how they compare with increasing \(n\).

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

np.random.seed(42)

# We will use the Cauchy distribution as an example,
# since it has an analytical quantile function

dist_obj = stats.cauchy(loc=0, scale=1)
tau = 0.8  # target quantile level
sample_sizes = np.logspace(1, 3, 10).astype(int)
num_trials = 100

# Init arrays to store results
eif = np.empty((num_trials, len(sample_sizes)))
eif_true = np.empty((num_trials, len(sample_sizes)))
empif = np.empty((num_trials, len(sample_sizes)))


def get_eif_quantile_cauchy(x, tau, dist_obj, plug_in_quantile):
    """Evaluate the efficient influence function with a plug-in (empirical) quantile"""
    return (tau - (x <= plug_in_quantile).astype(float))/dist_obj.pdf(plug_in_quantile)


def get_true_eif_quantile_cauchy(x, tau, dist_obj):
    """Evaluate the efficient influence function with the true quantile"""
    true_quantile = dist_obj.ppf(tau)
    return (tau - (x <= true_quantile).astype(float))/dist_obj.pdf(true_quantile)


for i in range(num_trials):
    for j, num_samples in enumerate(sample_sizes):
        samples = dist_obj.rvs(size=num_samples)
        # Compute Psi(F_n), Psi(F_n^{-j})
        query_j = np.random.choice(num_samples)  # choose leave-one-out index
        samples_minus_j = np.delete(samples, query_j)
        empirical_tau_quantile = np.quantile(samples, tau, method="inverted_cdf")
        empirical_tau_quantile_minus_j = np.quantile(samples_minus_j, tau, method="inverted_cdf")
        # Compute EIF with empirical quantile and true quantile
        eif[i, j] = get_eif_quantile_cauchy(samples[query_j], tau, dist_obj, empirical_tau_quantile)
        eif_true[i, j] = get_true_eif_quantile_cauchy(samples[query_j], tau, dist_obj)
        # Compute leave-one-out EmpIF
        empif_val = (empirical_tau_quantile - empirical_tau_quantile_minus_j)*(num_samples-1)
        empif[i, j] = empif_val

As expected, the population EIF and EIF evaluated using the empirical quantile coincide as \(n\) increases, as the empirical quantile approaches the true quantile.

diff = eif - eif_true
plt.errorbar(sample_sizes, diff.mean(0), yerr=diff.std(0), marker="o", color="tab:orange")
plt.axhline(0, linestyle="--", color="k")
plt.xlabel("Sample sizes")
plt.ylabel("Difference: EIF with empirical dist - population EIF")
plt.title("Efficient Influence Function for the Quantile")
plt.grid(True)
plt.xscale("log")

EIF comparison

The leave-one-out EmpIF, on the other hand, carries more finite-sample variance.

diff = empif - eif
plt.errorbar(sample_sizes, diff.mean(0), yerr=diff.std(axis=0), fmt='o-')
plt.axhline(0, linestyle="--", color="k")
plt.xlabel("Sample sizes")
plt.ylabel("Difference: EmpIF - EIF")
plt.title("Efficient vs. Empirical Influence Function for the Quantile")
plt.grid(True)
plt.xscale("log")

EmpIF comparison

Today, we focused on comparing the EIF (evaluated at the empirical measure) with the leave-one-out EmpIF. The quantile was a simple example of a non-linear functional that didn’t involve fitting a model. One model-specific example of key interest to robust statistics and ML interpretability is the M-estimator: \[\Psi(P) \equiv \arg \min_\theta \ \mathbb{E}_{X \sim P} [ G(X, \theta)],\] where \(G(X, \theta)\) is some per-instance loss function. It can be, for instance, the loss function of a model parameterized by \(\theta\) with respect to a training instance \(X\). Because leave-one-out retraining for every single training instance is infeasible, the EIF provides a cheap approximation under some assumptions, including a strongly convex loss landscape. When these assumptions are violated, as they are with nonlinear neural networks, the approximation can be very fragile. See Basu et al. (2020) [1] and Bae et al. (2022) [2].

Part 3 will be dedicated to the EIF and its relevance to semiparametric efficiency theory.

References

[1] Basu, S., Pope, P., & Feizi, S. (2020). Influence functions in deep learning are fragile. ICLR.

[2] Bae, J., Ng, N., Lo, A., Ghassemi, M., & Grosse, R. B. (2022). If influence functions are the answer, then what is the question?. NeurIPS.

Estimation as betting

I’ve been really enjoying the lecture series “A Martingale Theory of Evidence” by Aaditya Ramdas. Part 2 of the series was particularly eye-opening to me. He estimates the mean of a bounded random variable under a betting framework and demonstrates that estimation is essentially testing, which can be viewed as betting.

In this blog post, I will walk through his example of estimating the mean of a bounded random variable with betting. We can construct anytime-valid confidence sets, or confidence sequences, for the mean by testing the null hypothesis associated with every candidate mean. We will then derive the betting strategy for a different parameter, the M-estimator. I hope you can take away some intuition for designing betting strategies given a target parameter of interest.

Expressing probabilities in the language of betting and gambling has a long history. Ville’s 1939 thesis first connected measure-theoretic probability with betting. In fact, he put martingales on the map of probability theory in terms of betting strategies. See Appendix Section F of Waudby-Smith and Ramdas 2024 for a historical overview of betting and its applications.

Estimating the mean of a bounded random variable with betting

Fix a candidate mean \(m\). The initial capital is \(K_0^{(m)} = 1\). At each time \(t\), we place a bet

\[ \lambda_t^{(m)} \in [-1/(1-m), 1/m], \]

and then observe the incoming data point \(x_t\).

The capital evolves as \(K_t^{(m)} = K_{t-1}^{(m)} \left( 1 + \lambda_t^{(m)} (x_t - m) \right)\).

Rescale (allowed because the random variable is bounded) so that the observations, and hence \(m\), lie in \([0, 1]\). The confidence sequence for the true mean \(\mu\) is then

\[ C_t \equiv \{m \in [0, 1]: \prod_{i=1}^t (1 + \lambda_i^{(m)} (X_i - m)) < 1/\alpha \} = \{m \in [0, 1]: K_t^{(m)} < 1/\alpha \}. \tag{1} \]

This can be evaluated on a grid of \(m\).
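For concreteness, here is a sketch of such a grid evaluation on simulated Bernoulli(1/2) data. The bet used here is a deliberately simple predictable rule (a fixed fraction in the direction suggested by the running mean), just to make the example self-contained; a principled choice of \(\lambda_t^{(m)}\) is discussed below.

import numpy as np

def confidence_set(x, m_grid, alpha=0.05, c=0.5):
    """Candidate means m still in C_t after seeing all of x, per Equation (1)."""
    t = np.arange(1, len(x) + 1)
    # Running mean of x_1, ..., x_{t-1} (predictable: does not peek at x_t); start at 1/2
    past_mean = np.concatenate([[0.5], np.cumsum(x)[:-1] / t[:-1]])
    keep = []
    for m in m_grid:
        # Simple bet: fixed fraction c in the direction of (past mean - m),
        # clipped to [-1/(1-m), 1/m] so the capital stays non-negative
        lam = np.clip(c * np.sign(past_mean - m), -1.0 / (1.0 - m), 1.0 / m)
        capital = np.cumprod(1.0 + lam * (x - m))  # K_t^{(m)} over time
        if capital[-1] < 1.0 / alpha:              # Equation (1) at the final t
            keep.append(m)
    return np.array(keep)

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.5, size=500).astype(float)   # Bernoulli(1/2) data, already in [0, 1]
C = confidence_set(x, np.linspace(0.01, 0.99, 99))
print(C.min(), C.max())                            # roughly an interval around the true mean 0.5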

Where did this confidence sequence come from?

Intuitively, for each candidate mean \(m \in [0, 1]\), we are testing the null \(H_0^{(m)}: \mathbb{E}_P[X_i\mid X_{1:i-1}] = m\).

We claim that \(C_t\) is a confidence sequence for \(\mu\). That is,

\[ \sup_{P \in \mathcal{P}^\mu} P\left(\exists t \in \mathbb{N}: \mu \notin C_t \right) \leq \alpha, \]

which states, equivalently, that \(C_t\) contains \(\mu\) for all \(t\) simultaneously with probability at least \(1-\alpha\).

To prove this, let’s think about a special game that’s indexed by the true mean \(\mu\), which is \(K_t^{(\mu)}\). We can see that \(K_t^{(\mu)}\) is a non-negative martingale with initial value 1, the so-called “test martingale.”

  • Non-negative because the bets were constrained to leave it non-negative.
  • Martingale because \(\mathbb{E}_P [K_t^{(\mu)}\mid K_{1:t-1}^{(\mu)}] = K_{t-1}^{(\mu)} \mathbb{E}_P [1 + \lambda_t^{(\mu)}(X_t - \mu)] = K_{t-1}^{(\mu)} \left(1 + \lambda_t^{(\mu)} (\underbrace{\mathbb{E}_P(X_t)}_{=\mu} - \mu) \right) = K_{t-1}^{(\mu)}\).
  • Initial value \(1\) because we started with an initial capital of \(1\), or \(K_0^{(\mu)} = 1\).

Since \(K_t^{(\mu)}\) is a non-negative martingale, we can apply Ville’s inequality:

\[ \sup_{P \in \mathcal{P}^\mu} P\left(\exists t \in \mathbb{N}: K_t^{(\mu)} \geq 1/\alpha \right) \leq \alpha. \tag{2} \]

Note that \(C_t\) defined in \((1)\) is incorrect only if it excludes \(\mu\), which happens when \(K_t^{(\mu)} \geq 1/\alpha\). But this happens with probability \(\leq \alpha\) according to \((2)\). In other words, with high probability, \(C_t\) contains \(\mu\).

Payout schematic

Figure 1: The payout rarely goes above \(1/\alpha\).

So how should we bet?

So far we haven’t discussed how we should actually bet (i.e., choose \(\lambda_t^{(m)}\)). One method is called growth rate adaptive to the particular alternative (GRAPA; Waudby-Smith and Ramdas 2024). We choose

\[ \lambda_t^{(m)}(P) \equiv \arg \max_{\lambda \in [-1, 1]} \mathbb{E}_P \left[ \log \left(1 + \lambda(X_t - m) \right) \mid \mathcal{F}_{t-1} \right]. \tag{3} \]

The main issue with evaluating the above \(\lambda_t^{(m)}\) is that we don’t know \(P\). But let’s just blindly differentiate through the expectation and set the derivative to zero:

\[ \mathbb{E}_P[(X_t - m)/(1 + \lambda^*(X_t - m))] = 0. \]

The denominator begs for a Taylor expansion:

\[ \mathbb{E}_P [(X_t - m)\left(1 - \lambda^*(X_t - m) \right)] = 0. \]

Rearranging to solve for \(\lambda^*\), we have

\[ \lambda_t^{(m)} = \frac{\mathbb{E}_P[X_t - m]}{\mathbb{E}_P[(X_t-m)^2]} \approx \frac{\hat{\mu}_t - m}{\hat{\sigma}_t^2 + (\hat{\mu}_t - m)^2}, \]

where we use the plug-in empirical estimates for \(\hat{\mu}_t\) and \(\hat{\sigma}_t^2\) computed from the first \(t-1\) samples.
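A sketch of this plug-in GRAPA-style bet (the clipping range below, chosen so the capital stays non-negative, and the small variance floor are assumptions of this sketch):

import numpy as np

def grapa_bets(x, m):
    """Plug-in GRAPA-style bets lambda_t^{(m)} for a candidate mean m (minimal sketch)."""
    x = np.asarray(x, dtype=float)
    lam = np.zeros(len(x))            # lambda_1 defaults to 0 (no past data yet)
    for t in range(1, len(x)):
        past = x[:t]
        mu_hat = past.mean()
        var_hat = past.var() + 1e-12  # small floor to avoid dividing by zero
        lam_t = (mu_hat - m) / (var_hat + (mu_hat - m) ** 2)
        # Clip so that 1 + lambda * (x - m) stays non-negative for x in [0, 1]
        lam[t] = np.clip(lam_t, -1.0 / (1.0 - m), 1.0 / m)
    return lam

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.5, size=1000).astype(float)
for m in [0.3, 0.5, 0.7]:
    print(m, grapa_bets(x, m)[-1])  # approaches (mu - m) / (sigma^2 + (mu - m)^2)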

Figure 2 plots the GRAPA-chosen \(\lambda_t^{(m)}\) for five different values of \(m\), for the Bernoulli(1/2) and the Beta(1, 1) distributions. The dotted lines are “oracle” bets, where the plug-in empirical estimates \(\hat{\mu}, \hat{\sigma}^2\) have been replaced with their true values. Over time, bets converge to their oracle values.

Let’s focus on the left panel. The truth is \(\mu=0.5\), so when \(m=0.5\) (green), we see that the bet quickly converges to zero, because there is no advantage to betting. When \(m>0.5\) (\(m<0.5\)), on the other hand, the bet converges to a negative (positive) value. The bets become more aggressive in absolute value as you bet against candidate values that are farther from the truth.

Beta(1, 1) on the right panel has the same mean of 0.5 but lower variance than Bernoulli(1/2). Accordingly, the bets are more aggressive on the whole.

Betting strategy

Figure 2: Figure 10 from Waudby-Smith and Ramdas 2024

What about other estimators?

The setup presented above may seem tailored to the mean. Let’s consider another estimator, the univariate M-estimator,

\[ \theta^* \equiv \arg \min_{\theta' \in \Theta} \mathbb{E}_P [ L(X, \theta') ], \]

for some “nice” loss function \(L\).* Our payout will look like

\[ K_t^{(\theta)} = K_{t-1}^{(\theta)} \left(1 + \lambda_t^{(\theta)} \nabla_\theta L(X_t, \theta) \right). \]

Now \(K_t^{(\theta^*)}\) is a non-negative martingale with an initial value of 1. The arguments for non-negative and initial value 1 are the same as above, and it’s a martingale, because

\[ \mathbb{E}_P [K_t^{(\theta^*)} \mid K_{1:t-1}^{(\theta^*)}] = K_{t-1}^{(\theta^*)} \left(1 + \lambda_t^{(\theta^*)} \underbrace{\mathbb{E}_P [\nabla_\theta L(X_i, \theta^*)]}_{=0} \right) = K_{t-1}^{(\theta^*)}, \]

where the underbrace equality holds because \(\theta^*\) is the minimizer of the expected loss, so \(\nabla_\theta \mathbb{E}_P [L(X_i, \theta^*)] = 0 \implies \mathbb{E}_P [\nabla_\theta L(X_i, \theta^*)] = 0.\) We can then write down the betting strategy similarly as \((3)\):

\[ \lambda_t^{(\theta)}(P) \equiv \arg \max_{\lambda \in [-1, 1]} \mathbb{E}_P [\log \left(1 + \lambda \nabla_\theta L(X_t, \theta) \right) \mid \mathcal{F}_{t-1}]. \tag{4} \]

Following the GRAPA derivation of differentiating through the expectation and taking the Taylor approximation, we obtain

\[ \lambda_t^{(\theta)} = \frac{\mathbb{E}_P[\nabla_\theta L(X, \theta)]}{\mathbb{E}_P\left[\left(\nabla_\theta L(X, \theta)\right)^2\right]} \approx \frac{\frac{1}{t-1}\sum_{i=1}^{t-1} \nabla_\theta L(X_i, \theta)}{\frac{1}{t-1}\sum_{i=1}^{t-1} \left(\nabla_\theta L(X_i, \theta)\right)^2}, \]

where we again use plug-in empirical estimates computed from the first \(t-1\) samples, this time for the expected gradient and expected squared gradient of the loss. Based on the intuition we took away from the GRAPA figure, this result makes sense. If \(P\) is lower-variance, like the Beta(1, 1) distribution, the denominator will be smaller and the bets \(\lambda_t^{(\theta)}\) will become more aggressive. And when \(\theta = \theta^*\), the numerator vanishes and the bet shrinks to zero, just as the bet against the true mean did.

Acknowledgments

A huge thank you to Aaditya Ramdas for the great lecture series and textbook on e-values. I also thank Clara Wong-Fannjiang for foraying into e-values with me in our aptly named (in our opinion) reading group, “Great Expectations.”

\(*\) A sufficient condition for “niceness” is that \(L\) is differentiable in \(\theta\) and \(\mathbb{E}_P\left[\, \left| \nabla_\theta L(X, \theta) \right| \,\right]\) is finite, so that the gradient and the expectation can be interchanged (by Fubini / dominated convergence).

References

[1] Waudby-Smith, I. & Ramdas, A. (2024). Estimating means of bounded random variables by betting. JRSS Series B: Statistical Methodology, 86(1), 1-27.

What we talk about when we talk about influence functions: Part 1

Efficient influence function vs. empirical influence function: a visual disambiguation

I spent a few weeks last year confused about influence functions. The main source of the confusion was that influence functions manifest as distinct objects in multiple fields: the efficient influence function (EIF) in semiparametric statistics and the empirical influence function (EmpIF) in robust statistics and, more recently, ML interpretability.

While these objects are conceptually related, they are used for different purposes and often require specialized implementations.

Both objects measure the sensitivity of an estimand to changes in some underlying distribution. They are concepts defined in relation to an estimand, a finite-dimensional functional \(\Psi(P)\) of a distribution \(P\).

In this Part 1 of a three-part series on influence functions, we will work with the mean estimand, i.e., \(\Psi(P) \equiv \mathbb{E}_P[X]\). We will derive the EIF of this estimand and show that, when \(P\) equals the empirical distribution \(P_n\), the EIF and EmpIF coincide.

Deriving the efficient influence function for the mean

Let us start by deriving the EIF for the mean estimand, \(\Psi(P) = \mathbb{E}_P[X]\). We will use a simple method that works in most cases, formalized by Ichimura and Newey (2022). See Hines et al. (2022) for a very friendly tutorial. In brief, the method is based on the result that the EIF is what’s called a Gateaux derivative (a type of generalized derivative for probability distributions) of the functional, taken in the direction of a point-mass perturbation. It works by first defining what’s called a \(\delta\)-contaminated distribution, which is the result of perturbing a distribution with a point mass at \(x\) with weight \(\epsilon\): \[P_\epsilon = (1-\epsilon)P + \epsilon \delta_x.\] The EIF can then be derived as the following generalized derivative: \[{\rm EIF} \equiv \psi(x; P, \Psi) = \lim_{\epsilon \to 0} \frac{\Psi(P_\epsilon) - \Psi(P)}{\epsilon}. \tag{1}\]

Applying this to \(\Psi(P) = \mathbb{E}_P[X]\), we have \[\Psi(P_\epsilon) = \Psi(P) - \epsilon \Psi(P) + \epsilon x,\] so the expression in the limit of \((1)\) yields \[\frac{\Psi(P) - \epsilon \Psi(P) + \epsilon x - \Psi(P)}{\epsilon} = -\Psi(P) + x.\]

So we simply have \[\psi(x; P, \Psi) = -\mathbb{E}_P[X] + x.\]

The empirical counterpart

The EmpIF, on the other hand, is commonly defined via a leave-one-out (LOO) formulation with respect to the empirical measure, \[P_n \equiv \frac{1}{n} \sum_{i=1}^n \delta_{x_i}.\] The estimand of interest, then, is simply the sample mean: \[\Psi(P_n) = \frac{1}{n} \sum_{i=1}^n x_i.\] The EmpIF asks what the effect on \(\Psi\) would be if a sample \(x_j\) were to be removed. Denoting by \(P_n^{-j}\) the distribution without \(x_j\), the sample mean under perturbation would be: \[\Psi(P_n^{-j}) = \frac{1}{n-1} \sum_{i\neq j} x_i.\] Then the EmpIF with appropriate scaling is \[{\rm EmpIF} = \psi_{\rm emp}(x_j; \Psi) = (n-1)\left(\Psi(P_n) - \Psi(P_n^{-j})\right),\] where the notation no longer explicitly includes the dependence on the distribution, as the distribution is always assumed to be \(P_n\).

Notice that, putting \(\bar X \equiv \frac{1}{n}\sum_{i=1}^n x_i\), this can be written \[\psi_{\rm emp}(x_j; \Psi) = (n-1) \left( {\bar X} - \frac{1}{n-1} \sum_{i \neq j} x_i \right) = (n-1){\bar X} - \left( n{\bar X} - x_j \right) = -{\bar X} + x_j,\] or, equivalently, \(\psi_{\rm emp}(x; \Psi) = -\mathbb{E}_{P_n}[X] + x.\) We have shown that the EIF with \(P=P_n\) equals the EmpIF when the estimand \(\Psi\) is the mean, i.e., \(\psi(x; P_n, \Psi) = -\mathbb{E}_{P_n}[X] + x = \psi_{\rm emp}(x; \Psi).\)

Numerical verification

The EmpIF approaches the EIF as \(n\) increases. Let’s verify this visually, though it should be obvious as we are basically just checking that the sample mean of a standard normal approaches 0.

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)

# Define true distribution as a standard normal
def get_eif(x):
    return x - 0.0  # psi(x; P, Psi) = x - E_P[X]; the true mean is 0

def get_empif(x, all_x):
    return x - all_x.mean()  # psi_emp(x; Psi) = x - sample mean

diff = []
diff_sigma = []
sample_sizes = np.logspace(1, 4, 20).astype(int)
for n_emp in sample_sizes:
    diff_n = []
    for _ in range(100):  # repeat 100 times
        # Define some empirical distribution
        X = np.random.normal(size=n_emp)

        # Choose x_j
        x_j = np.random.choice(X)

        # Evaluate EIF, EmpIF
        eif = get_eif(x_j)
        empif = get_empif(x_j, X)

        # Get difference
        diff_n.append(empif - eif)
    # Get mean, std in diff across runs
    diff.append(np.mean(diff_n))
    diff_sigma.append(np.std(diff_n))

# Verify that the two influence functions coincide
plt.figure(figsize=(10, 6))
plt.errorbar(sample_sizes, diff, yerr=diff_sigma, marker="o", label="mean +/- std across 100 runs")
plt.axhline(0, linestyle="--", color="k")
plt.xlabel("Sample sizes")
plt.ylabel("Difference: EmpIF - EIF")
plt.title("Efficient vs. Empirical Influence Function for the Mean")
plt.grid(True)
plt.legend()
plt.xscale("log")

This snippet produces the figure below: Efficient vs. Empirical Influence Function for the Mean.

Summary

  • We learned how to derive the EIF using the example of the mean estimand.
  • We showed that, for the mean estimand, the EIF with \(P=P_n\) reduces to the EmpIF.
  • We numerically confirmed this for varying \(n\).

While the mean example may appear trivial, the story is not as neat for other estimators, for which the use cases of EIF and EmpIF diverge. A sneak peek into the rest of the series:

Part 2. For more complex estimands, the EIF and EmpIF do not coincide in finite samples as they do for the mean estimand. We take a close look at the quantile estimand as an example of a non-linear functional. Because evaluating the EmpIF for every leave-one-out training sample can be infeasible for estimators that involve model fitting, like the weights of a neural network, some approximations are commonly employed for ML interpretability.

Part 3. Fundamentally, the EmpIF is defined explicitly for \(P_n\) and estimates the finite-sample sensitivity of the estimand to particular data points. The estimand considered for the EmpIF is usually some (finite-dimensional) parameter of a parametric model. The EIF, on the other hand, is an idealized object, defined for an arbitrary distribution \(P\) without explicit reference to a parametric model, and representing the theoretical sensitivity of any estimand to infinitesimal perturbations on \(P\). Techniques like one-step estimation and targeted learning use the EIF to equip a potentially biased estimator with asymptotic efficiency.

References

[1] Ichimura, H., & Newey, W. K. (2022). The influence function of semiparametric estimators. Quantitative Economics, 13(1), 29-61.

[2] Hines, O., Dukes, O., Diaz-Ordaz, K., & Vansteelandt, S. (2022). Demystifying statistical learning based on efficient influence functions. The American Statistician, 76(3), 292-304.
