Time: Tuesday 01 July 09:00–10:00
Session: 1A - Sampling (Tuesday 01 July 09:00–10:00)
Authors: Wibisono, Andre
Abstract:
We study the mixing time guarantee for sampling in relative Fisher information via the Proximal Sampler algorithm, which is an approximate proximal discretization of the Langevin dynamics. We show that when the target probability distribution is strongly log-concave, the relative Fisher information converges exponentially fast along the Proximal Sampler; this matches the exponential convergence rate of the relative Fisher information along the continuous-time Langevin dynamics for strongly log-concave targets. When combined with a standard implementation of the Proximal Sampler via rejection sampling, this exponential convergence rate provides a high-accuracy iteration complexity guarantee for the Proximal Sampler in relative Fisher information when the target distribution is strongly log-concave and log-smooth. Our proof proceeds by establishing a strong data processing inequality for relative Fisher information along the Gaussian channel under strong log-concavity, and a data processing inequality along the reverse Gaussian channel for a special distribution. The forward and reverse Gaussian channels compose to form the Proximal Sampler, and these data processing inequalities imply the exponential convergence of the relative Fisher information along the Proximal Sampler.
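As an illustration of the algorithm being analyzed (not code from the paper), the following Python sketch implements one Proximal Sampler step for a target $\pi \propto e^{-f}$: a forward Gaussian channel followed by a naive rejection-sampling implementation of the reverse channel. The step size eta, and the assumption that the minimum value f_min of f is known (which makes the rejection step provably correct), are ours.

    import numpy as np

    def proximal_sampler_step(x, f, f_min, eta, rng):
        """One Proximal Sampler iteration for pi(x) proportional to exp(-f(x)).

        Forward Gaussian channel: y ~ N(x, eta * I).
        Reverse channel: sample x' with density proportional to
        exp(-f(x') - ||x' - y||^2 / (2 eta)), by rejection with proposal
        N(y, eta * I); the density ratio is proportional to
        exp(-f(x')) <= exp(-f_min), so accepting with probability
        exp(-(f(x') - f_min)) is a valid rejection sampler.
        """
        d = x.shape[0]
        y = x + np.sqrt(eta) * rng.standard_normal(d)   # forward channel
        while True:                                     # reverse channel
            x_new = y + np.sqrt(eta) * rng.standard_normal(d)
            if rng.random() < np.exp(-(f(x_new) - f_min)):
                return x_new

    # Example: standard Gaussian target, f(x) = ||x||^2 / 2, with minimum 0.
    rng = np.random.default_rng(0)
    x = np.ones(3)
    for _ in range(100):
        x = proximal_sampler_step(x, lambda z: 0.5 * z @ z, 0.0, 0.1, rng)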
Session: 1A - Sampling (Tuesday 01 July 09:00–10:00)
Authors: Cai, Yang; Mitra, Siddharth; Wang, Xiuyuan; Wibisono, Andre
Abstract:
We study zero-sum games in the space of probability distributions over the Euclidean space $\mathbb{R}^d$ with entropy regularization, in the setting when the interaction function between the players is smooth and strongly convex-strongly concave. We prove an exponential convergence guarantee for the mean-field min-max Langevin dynamics to compute the equilibrium distribution of the zero-sum game. We also study the finite-particle approximation of the mean-field min-max Langevin dynamics, both in continuous and discrete times. We prove biased convergence guarantees for the continuous-time finite-particle min-max Langevin dynamics to the stationary mean-field equilibrium distribution with an explicit bias term which does not scale with the number of particles. We also prove biased convergence guarantees for the discrete-time finite-particle min-max Langevin algorithm to the stationary mean-field equilibrium distribution with an additional bias term which scales with the step size and the number of particles. This provides an explicit iteration complexity for the average particle along the finite-particle algorithm to approximately compute the equilibrium distribution of the zero-sum game.
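A minimal simulation sketch of the discrete-time finite-particle min-max Langevin algorithm, for a toy strongly convex-strongly concave interaction of our choosing; the interaction, step size, and temperature below are illustrative assumptions, not the paper's.

    import numpy as np

    # Toy interaction f(x, y) = ||x||^2/2 - ||y||^2/2 + <x, y>.
    rng = np.random.default_rng(0)
    N, d, step, beta = 256, 2, 0.05, 10.0
    X = rng.standard_normal((N, d))   # minimizing population
    Y = rng.standard_normal((N, d))   # maximizing population

    for _ in range(2000):
        # Each particle feels the gradient averaged over the opposing population.
        x_bar, y_bar = X.mean(axis=0), Y.mean(axis=0)
        grad_X = X + y_bar            # average over Y of grad_x f(x_i, y)
        grad_Y = x_bar - Y            # average over X of grad_y f(x, y_j)
        noise = np.sqrt(2 * step / beta)
        X = X - step * grad_X + noise * rng.standard_normal((N, d))
        Y = Y + step * grad_Y + noise * rng.standard_normal((N, d))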
Session: 1A - Sampling (Tuesday 01 July 09:00–10:00)
Authors: He, Yuchen; Zhang, Chihao
Abstract:
We study the problem of sampling from a $d$-dimensional distribution with density $p(x)\propto e^{-f(x)}$, which does not necessarily satisfy good isoperimetric conditions. Specifically, we show that for any $L,M$ satisfying $LM\ge d\ge 5$, $\varepsilon\in \left(0,\frac{1}{32}\right)$, and any algorithm with query accesses to the value of $f(x)$ and $\nabla f(x)$, there exists an $L$-log-smooth distribution with second moment at most $M$ such that the algorithm requires $\left(\frac{LM}{d\varepsilon}\right)^{\Omega(d)}$ queries to compute a sample whose distribution is within $\varepsilon$ in total variation distance to the target distribution. We complement the lower bound with an algorithm requiring $\left(\frac{LM}{d\varepsilon}\right)^{\mathcal O(d)}$ queries, thereby characterizing the tight (up to the constant in the exponent) query complexity for sampling from non-log-concave distributions. Our results are in sharp contrast with the recent work of Huang et al. (COLT'24), where an algorithm with quasi-polynomial query complexity was proposed for sampling from a non-log-concave distribution when $M=\mathtt{poly}(d)$. Their algorithm works under the stronger condition that all distributions along the trajectory of the Ornstein-Uhlenbeck process, starting from the target distribution, are $\mathcal O(1)$-log-smooth. We investigate this condition and prove that it is strictly stronger than requiring the target distribution to be $\mathcal O(1)$-log-smooth. Additionally, we study this condition in the context of mixtures of Gaussians. Finally, we place our results within the broader theme of ``sampling versus optimization'', as studied in Ma et al. (PNAS'19). We show that for a wide range of parameters, sampling is strictly easier than optimization by a super-exponential factor in the dimension $d$.
Session: 1A - Sampling (Tuesday 01 July 09:00–10:00)
Authors: Koehler, Frederic; Lee, Holden; Vuong, Thuy-Duong
Abstract:
We consider the problem of sampling a multimodal distribution with a Markov chain given a small number of samples from the stationary measure. Although mixing can be arbitrarily slow, we show that if the Markov chain has a $k$th order spectral gap, initialization from a set of $\tilde O(k/\varepsilon^2)$ samples from the stationary distribution will, with high probability over the samples, efficiently generate a sample whose conditional law is $\varepsilon$-close in TV distance to the stationary measure. In particular, this applies to mixtures of $k$ distributions satisfying a Poincar\'e inequality, with faster convergence when they satisfy a log-Sobolev inequality. Our bounds are stable to perturbations to the Markov chain, and in particular work for Langevin diffusion over $\mathbb R^d$ with score estimation error, as well as Glauber dynamics combined with approximation error from pseudolikelihood estimation. This justifies the success of data-based initialization for score matching methods despite slow mixing for the data distribution, and improves and generalizes the results of Koehler and Vuong '23 to have linear, rather than exponential, dependence on $k$ and apply to arbitrary semigroups. As a consequence of our results, we show for the first time that a natural class of low-complexity Ising measures can be efficiently learned from samples.
Session: 1A - Sampling (Tuesday 01 July 09:00–10:00)
Authors: Mitra, Siddharth; Wibisono, Andre; Liang, Jiaming
Abstract:
The mixing time of a Markov chain determines how fast the iterates of the Markov chain converge to the stationary distribution; however, it does not control the dependencies between samples along the Markov chain. In this paper, we study the question of how fast the samples become approximately independent along popular Markov chains for continuous-space sampling: the Langevin dynamics in continuous time, and the Unadjusted Langevin Algorithm and the Proximal Sampler in discrete time. We measure the dependence between samples via Φ-mutual information, which is a broad generalization of the standard mutual information, and which is equal to 0 if and only if the samples are independent. We show that along these Markov chains, the Φ-mutual information between the first and the k-th iterate decreases to 0 exponentially fast in k when the target distribution is strongly log-concave. Our proof technique is based on showing the Strong Data Processing Inequalities (SDPIs) hold along the Markov chains. To prove fast mixing of the Markov chains, we only need to show the SDPIs hold for the stationary distribution. In contrast, to prove the contraction of Φ-mutual information, we need to show the SDPIs hold along the entire trajectories of the Markov chains; we prove this when the iterates along the Markov chains satisfy the corresponding Φ-Sobolev inequality, which is implied by the strong log-concavity of the target distribution.
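For reference, the quantity being contracted can be written as follows (the display is ours, using the standard definition): for a convex $\Phi$ with $\Phi(1) = 0$,
\[
I_\Phi(X;Y) \;=\; \mathbb{E}_{y \sim P_Y}\!\left[ D_\Phi\!\left( P_{X \mid Y = y} \,\big\|\, P_X \right) \right],
\qquad
D_\Phi(P \,\|\, Q) \;=\; \mathbb{E}_{Q}\!\left[ \Phi\!\left( \frac{dP}{dQ} \right) \right],
\]
so that $\Phi(u) = u \log u$ recovers the usual mutual information, and $I_\Phi(X;Y) = 0$ exactly when $X$ and $Y$ are independent.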
Time: Tuesday 01 July 09:00–10:00
Session: 1B - Bandits (Tuesday 01 July 09:00–10:00)
Authors: Nguyen, Quan; Ito, Shinji; Komiyama, Junpei; Mehta, Nishant
Abstract:
Existing data-dependent and best-of-both-worlds regret bounds for multi-armed bandit problems have limited adaptivity, as they are either data-dependent but not best-of-both-worlds (BOBW), BOBW but not data-dependent, or have a sub-optimal $O(\sqrt{T\ln{T}})$ worst-case guarantee in the adversarial regime. To overcome these limitations, we propose real-time stability-penalty matching (SPM), a new method for obtaining regret bounds that are simultaneously data-dependent, best-of-both-worlds and $T$-optimal for multi-armed bandit problems. In particular, we show that real-time SPM obtains bounds with worst-case guarantees of order $O(\sqrt{T})$ in the adversarial regime and $O(\ln{T})$ in the stochastic regime while simultaneously being adaptive to data-dependent quantities such as sparsity, variations, and small losses. Our results are obtained by extending the SPM technique for tuning the learning rates in the follow-the-regularized-leader (FTRL) framework, which further indicates that the combination of SPM and FTRL is a promising approach for proving new adaptive bounds in online learning problems.
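A skeleton of the underlying FTRL machinery, in its simplest full-information form (the SPM learning-rate rule itself is more refined; the anytime tuning below is a standard placeholder of ours, not the paper's rule):

    import numpy as np

    def ftrl_weights(cum_loss, eta):
        # FTRL with entropic regularizer = softmax of scaled cumulative losses.
        w = np.exp(-eta * (cum_loss - cum_loss.min()))
        return w / w.sum()

    K, T = 10, 1000
    rng = np.random.default_rng(0)
    cum_loss = np.zeros(K)
    for t in range(1, T + 1):
        eta_t = np.sqrt(np.log(K) / t)    # placeholder learning-rate schedule
        p = ftrl_weights(cum_loss, eta_t)
        losses = rng.random(K)            # adversary's losses for round t
        cum_loss += losses                # full-information feedback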
Session: 1B - Bandits (Tuesday 01 July 09:00–10:00)
Authors: Narasimha, Dheeraj; Gast, Nicolas
Abstract:
We consider the discrete-time infinite-horizon average-reward Restless Markovian bandit (RMAB) problem. We propose a non-stationary policy based on model predictive control, with a rolling computational horizon $\tau$. At each time slot, this policy solves a $\tau$-horizon linear program whose first control value is kept as a control for the RMAB. Our solution requires minimal assumptions and quantifies the loss in optimality in terms of $\tau$ and the number of arms, $N$. We show that its suboptimality gap is $O(1/\sqrt{N})$ in general, and $\exp(-\Omega(N))$ under a local-stability condition. Our proof is based on a framework from dynamic control known as dissipativity. Our solution is easy to implement and performs very well in practice when compared to the state of the art. Further, both our solution and our proof methodology can easily be generalized to more general constrained MDP settings and should thus be of great interest to the burgeoning RMAB community.
Session: 1B - Bandits (Tuesday 01 July 09:00–10:00)
Authors: Ryu, Jongha; Kwon, Jeongyeol; Koppe, Benjamin; Jun, Kwang-Sung
Abstract:
We consider off-policy selection and learning in contextual bandits, where the learner aims to select or train a reward-maximizing policy using data collected by a fixed behavior policy. Our contribution is two-fold. First, we propose a novel off-policy selection method that leverages a new betting-based confidence bound applied to an inverse propensity weight sequence. Our theoretical analysis reveals that our method achieves a significantly better, variance-adaptive guarantee than prior art. Second, we propose a novel and generic condition on the optimization objective for off-policy learning that strikes a different balance between bias and variance. One special case, which we call freezing, tends to induce small variance, which is preferred in small-data regimes. Our analysis shows that the resulting methods match the best existing guarantees. In our empirical study, our selection method outperforms existing methods, and freezing exhibits improved performance in small-sample regimes.
Session: 1B - Bandits (Tuesday 01 July 09:00–10:00)
Authors: Zhang, Raymond; Hadiji, Hedi; Combes, Richard
Abstract:
We consider linear stochastic bandits where the set of actions is an ellipsoid. We provide the first known minimax optimal algorithm for this problem. We first derive a novel information-theoretic lower bound on the regret of any algorithm, which must be at least $\Omega(\min(d \sigma \sqrt{T} + d \|\theta\|_{A}, \|\theta\|_{A} T))$ where $d$ is the dimension, $T$ the time horizon, $\sigma^2$ the noise variance, $A$ a matrix defining the set of actions and $\theta$ the vector of unknown parameters. We then provide an algorithm whose regret matches this bound to a multiplicative universal constant. The algorithm is non-classical in the sense that it is not optimistic, and it is not a sampling algorithm. The main idea is to combine a novel sequential procedure to estimate $\|\theta\|$, followed by an explore-and-commit strategy informed by this estimate. The algorithm is highly computationally efficient, and a run requires only time $O(dT + d^2 \log(T/d) + d^3)$ and memory $O(d^2)$, in contrast with known optimistic algorithms, which are not implementable in polynomial time. We go beyond minimax optimality and show that our algorithm is locally asymptotically minimax optimal, a much stronger notion of optimality. We further provide numerical experiments to illustrate our theoretical findings.
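A simplified explore-then-commit sketch for this action set (an illustration only: the paper's algorithm additionally estimates $\|\theta\|$ by a sequential procedure before committing, and the exploration length below is an arbitrary choice of ours):

    import numpy as np

    # Linear bandit on the ellipsoid {x : x^T A x <= 1}; reward = <theta, x> + noise.
    rng = np.random.default_rng(0)
    d, sigma, n_explore = 5, 1.0, 200
    A = np.diag(np.arange(1.0, d + 1))
    theta = rng.standard_normal(d)          # unknown parameter

    X, r = [], []
    for t in range(n_explore):
        i = t % d
        x = np.zeros(d)
        x[i] = 1.0 / np.sqrt(A[i, i])       # boundary point along axis i
        X.append(x)
        r.append(theta @ x + sigma * rng.standard_normal())
    X, r = np.array(X), np.array(r)

    theta_hat = np.linalg.lstsq(X, r, rcond=None)[0]   # least-squares estimate
    # Commit to argmax <theta_hat, x> subject to x^T A x <= 1, which is
    # A^{-1} theta_hat rescaled to the ellipsoid boundary.
    x_commit = np.linalg.solve(A, theta_hat)
    x_commit /= np.sqrt(theta_hat @ x_commit)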
Session: 1B - Bandits (Tuesday 01 July 09:00–10:00)
Authors: Bakhtiari, Alireza; Lattimore, Tor; Szepesvari, Csaba
Abstract:
We show that Thompson sampling has a Bayesian regret of at most $\tilde{O}(\sqrt{n})$ for 1-dimensional bandit convex optimisation, where $n$ is the time horizon and no assumptions are made on the loss function beyond convexity, boundedness and a mild Lipschitz assumption. For general high-dimensional problems we show that Thompson sampling can fail catastrophically. More positively, we show that Thompson sampling has Bayesian regret of $\tilde{O}(d^{2.5}\sqrt{n})$ for generalised linear bandits with an unknown convex monotone link function. Lastly, we prove that the standard information-theoretic machinery can never give a bound on the regret in the general case that improves on the best known bound of $\tilde{O}(d^{1.5}\sqrt{n})$.
Time: Tuesday 01 July 10:30–11:20
Session: 2A - Concentration Inequalities (Tuesday 01 July 10:30–11:20)
Authors: Bressan, Marco; Brukhim, Nataly; Cesa-Bianchi, Nicolo; Esposito, Emmanuel; Mansour, Yishay; Moran, Shay; Thiessen, Maximilian
Abstract:
Cost-sensitive loss functions are crucial in many real-world prediction problems, where different types of errors are penalized differently; for example, in medical diagnosis, a false negative prediction can lead to worse consequences than a false positive prediction. However, traditional PAC learning theory has mostly focused on the symmetric 0-1 loss, leaving cost-sensitive losses largely unaddressed. In this work, we extend the celebrated theory of boosting to incorporate both cost-sensitive and multi-objective losses. Cost-sensitive losses assign costs to the entries of a confusion matrix, and are used to control the sum of prediction errors accounting for the cost of each error type. Multi-objective losses, on the other hand, simultaneously track multiple cost-sensitive losses, and are useful when the goal is to satisfy several criteria at once (e.g., minimizing false positives while keeping false negatives below a critical threshold). We develop a comprehensive theory of cost-sensitive and multi-objective boosting, providing a taxonomy of weak learning guarantees that distinguishes which guarantees are trivial (i.e., can always be achieved), which ones are boostable (i.e., imply strong learning), and which ones are intermediate, implying non-trivial yet not arbitrarily accurate learning. For binary classification, we establish a dichotomy: a weak learning guarantee is either trivial or boostable. In the multiclass setting, we describe a more intricate landscape of intermediate weak learning guarantees. Our characterization relies on a geometric interpretation of boosting, revealing a surprising equivalence between cost-sensitive and multi-objective losses.
Session: 2A - Concentration Inequalities (Tuesday 01 July 10:30–11:20)
Authors: Hogsgaard, Mikael Moller; Larsen, Kasper Green
Abstract:
In this paper we establish a new margin-based generalization bound for voting classifiers, refining existing results and yielding tighter generalization guarantees for widely used boosting algorithms such as AdaBoost (Freund and Schapire, 1997). Furthermore, the new margin-based generalization bound enables the derivation of an optimal weak-to-strong learner: a Majority-of-3 of large-margin classifiers with an expected error matching the theoretical lower bound. This result provides a more natural alternative to the Majority-of-5 algorithm by Høgsgaard et al. (2024), and matches the Majority-of-3 result by Aden-Ali et al. (2024) for the realizable prediction model.
Session: 2A - Concentration Inequalities (Tuesday 01 July 10:30–11:20)
Authors: Mueller, Manuel; Luo, Yuetian; Foygel Barber, Rina
Abstract:
In statistics and machine learning, when we train a fitted model on available data, we typically want to ensure that we are searching within a model class that contains at least one accurate model---that is, we would like to ensure an upper bound on the model class risk (the lowest possible risk that can be attained by any model in the class). However, it is also of interest to establish lower bounds on the model class risk, for instance so that we can determine whether our fitted model is at least approximately optimal within the class, or so that we can decide whether the model class is unsuitable for the particular task at hand. Particularly in the setting of interpolation learning, where machine learning models are trained to reach zero error on the training data, we might ask if, at the very least, a positive lower bound on the model class risk is possible---or are we unable to detect that "all models are wrong"? In this work, we answer these questions in a distribution-free setting by establishing a model-agnostic, fundamental hardness result for the problem of constructing a lower bound on the best test error achievable over a model class, and examine its implications on specific model classes such as tree-based methods and linear regression.
Session: 2A - Concentration Inequalities (Tuesday 01 July 10:30–11:20)
Authors: Zhu, Xiaohan; Srebro, Nathan
Abstract:
We provide a complete characterization of the entire regularization curve of a modified two-part-code Minimum Description Length (MDL) learning rule for binary classification, based on an arbitrary prior or description language. Grünwald and Langford (2004) previously established the lack of asymptotic consistency, from an agnostic PAC (frequentist worst case) perspective, of the MDL rule with a penalty parameter of $\lambda=1$, suggesting that it under-regularizes. Driven by interest in understanding how benign or catastrophic under-regularization and overfitting might be, we obtain a precise quantitative description of the worst-case limiting error as a function of the regularization parameter $\lambda$ and noise level (or approximation error), significantly tightening the analysis of Grünwald and Langford for $\lambda=1$ and extending it to all other choices of $\lambda$.
Time: Tuesday 01 July 10:30–11:20
Session: 2B - Learning Theory I (Tuesday 01 July 10:30–11:20)
Authors: Blanc, Guy; Lange, Jane; Strassle, Carmen; Tan, Li-Yang
Abstract:
The apparent difficulty of efficient distribution-free PAC learning has led to a large body of work on distribution-specific learning. Distributional assumptions greatly facilitate the design of efficient algorithms but also limit their reach and relevance. Towards addressing this, we prove a {\sl distributional-lifting theorem} that shows how a learner that succeeds with respect to a distribution family $\mathcal{D}$ can be lifted to one that succeeds with respect to {\sl any} distribution $D^\star$, with an efficiency overhead that scales with the complexity of expressing $D^\star$ as a mixture of distributions in $\mathcal{D}$. Recent work of Blanc, Lange, Malik, and Tan considered the special case of lifting {\sl uniform-distribution} learners and designed a lifter that uses a {\sl conditional sample oracle} for $D^\star$, a strong form of access not afforded by the standard PAC model. Their approach, which draws on ideas from semi-supervised learning, first learns $D^\star$ and then exploits this information to lift. We show that their approach, while natural, is information-theoretically intractable with access only to random examples, thereby giving formal justification for their use of the conditional sample oracle. We then give a different approach that sidesteps the need to learn $D^\star$, yielding a lifter that works in the standard PAC model and enjoys additional advantages: it works for all base distribution families, preserves the noise tolerance of learners, has better sample complexity, and is simpler.
Session: 2B - Learning Theory I (Tuesday 01 July 10:30–11:20)
Authors: Francois, Alexandre; Bach, Francis; Orvieto, Antonio
Abstract:
We consider linear recurrent neural networks, which have become a key building block of sequence modeling due to their capacity for stable and effective long-range modeling. In this paper, we aim to characterize this ability on the simple but core copy task, whose goal is to build a linear filter of order $S$ that approximates the filter that looks $K$ time steps in the past (which we refer to as the shift-$K$ filter), where $K$ is larger than $S$. Using classical signal models and quadratic cost, we fully characterize the problem by providing lower bounds of approximation, as well as explicit filters that achieve this lower bound up to constants. The optimal performance highlights an uncertainty principle for this task: the optimal filter has to average values around the $K$-th time step in the past with a range (width) that is proportional to $K/S$.
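A small numerical illustration of the task (our construction, not the paper's): fit an order-$S$ filter whose impulse response is a sum of $S$ exponentials to the shift-$K$ impulse response by least squares, and observe that the optimum smears its mass over a window of width roughly $K/S$ around lag $K$.

    import numpy as np

    K, S, T = 50, 5, 200
    lam = np.linspace(0.80, 0.99, S)       # fixed decay rates (our choice)
    t = np.arange(T)
    Phi = lam[None, :] ** t[:, None]       # (T, S): column j is lam_j ** t
    target = (t == K).astype(float)        # shift-K impulse response

    c, *_ = np.linalg.lstsq(Phi, target, rcond=None)
    h = Phi @ c   # best order-S approximation: a bump of width ~ K/S near t = K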
Session: 2B - Learning Theory I (Tuesday 01 July 10:30–11:20)
Authors: Diakonikolas, Ilias; Ma, Mingchen; Ren, Lisheng; Tzamos, Christos
Abstract:
Learning intersections of halfspaces is a central problem in Computational Learning Theory. Even for just two halfspaces, it remains a major open question whether learning is possible in polynomial time with respect to the margin $\gamma$ of the data points and their dimensionality $d$. The best-known algorithms run in quasi-polynomial time $d^{O(\log(1/\gamma))}$, and it has been shown that this complexity is unavoidable for any algorithm relying solely on correlational statistical queries (CSQ). In this work, we introduce a novel algorithm that provably circumvents the CSQ hardness barrier. Our approach applies to a broad class of distributions satisfying a natural, previously studied, factorizability assumption. Under these distributions, we show that CSQ-based methods still require quasipolynomial time, whereas our algorithm achieves $\mathrm{poly}(d,1/\gamma)$ time by leveraging more general statistical queries (SQ). Factorizable distributions lie between distribution-specific and distribution-free settings, and significantly extend previously known tractable cases. Our result is grounded in a rigorous analysis utilizing a novel duality framework that characterizes the moment tensor structure induced by the marginal distributions. Building on these structural insights, we propose new, efficient learning algorithms. These algorithms combine a refined variant of Jennrich’s Algorithm with PCA over random projections of the moment tensor, along with a gradient-descent-based non-convex optimization framework.
Session: 2B - Learning Theory I (Tuesday 01 July 10:30–11:20)
Authors: Diakonikolas, Ilias; Diakonikolas, Jelena; Wang, Puqian; Zarifis, Nikos
Abstract:
We study the task of learning Generalized Linear models (GLMs) in the agnostic model under the Gaussian distribution. We give the first polynomial-time algorithm that achieves a constant-factor approximation for {\em any} monotone Lipschitz activation. Prior constant-factor GLM learners succeed for a substantially smaller class of activations. Our work resolves a well-known open problem, by developing a robust counterpart to the classical GLMtron algorithm~\citep{kakade2011efficient}. Our robust learner applies more generally, encompassing all monotone activations with bounded $(2+\zeta)$-moments, for any fixed $\zeta>0$---a condition that is essentially necessary. To obtain our results, we leverage a novel data augmentation technique with decreasing Gaussian noise injection and prove a number of structural results that may be useful in other settings.
Time: Tuesday 01 July 14:00–15:36
Session: 3A - Reinforcement Learning (Tuesday 01 July 14:00–15:36)
Authors: Pfrommer, Daniel; Simchowitz, Max; Jadbabaie, Ali
Abstract:
We study the problem of imitating an expert demonstrator in a discrete-time, continuous state-and-action space control system. We show that there exist stable dynamics (i.e. contracting exponentially quickly) and smooth, deterministic experts such that any smooth, deterministic imitator policy necessarily suffers error on execution that is exponentially larger, as a function of problem horizon, than the error under the distribution of expert training data. Our negative result applies to both behavior cloning and offline-RL algorithms, unless they produce highly improper imitator policies --- those which are non-smooth, non-Markovian, or which exhibit highly state-dependent stochasticity --- or unless the expert trajectory distribution is sufficiently spread. We provide preliminary evidence of the benefits of these more complex policy parameterizations, explicating the benefits of today's popular policy parameterizations in robot learning (e.g. action-chunking and Diffusion-policies). We also establish a host of complementary negative and positive results for imitation in control systems.
Session: 3A - Reinforcement Learning (Tuesday 01 July 14:00–15:36)
Authors: Zurek, Matthew; Chen, Yudong
Abstract:
We study the sample complexity of finding an $\varepsilon$-optimal policy in average-reward Markov Decision Processes (MDPs) with a generative model. The minimax optimal span-based complexity of $\widetilde{O}(SAH/\varepsilon^2)$, where $H$ is the span of the optimal bias function, has only been achievable with prior knowledge of the value of $H$. Prior-knowledge-free algorithms have been the objective of intensive research, but several natural approaches provably fail to achieve this goal. We resolve this problem, developing the first algorithms matching the optimal span-based complexity without $H$ knowledge, both when the dataset size is fixed and when the suboptimality level $\varepsilon$ is fixed. Our main technique combines the discounted reduction approach with a method for automatically tuning the effective horizon based on empirical confidence intervals or lower bounds on performance, which we term \textit{horizon calibration}. We also develop an \textit{empirical span regularization} approach, inspired by sample variance penalization, which can outperform the minimax complexity in benign settings such as when there exist near-optimal policies with span much smaller than $H$.
Session: 3A - Reinforcement Learning (Tuesday 01 July 14:00–15:36)
Authors: Krishnamurthy, Akshay; Li, Gene; Sekhari, Ayush
Abstract:
We study Reinforcement Learning (RL) in environments with large state spaces, where function approximation is required for sample-efficient learning. Departing from a long history of prior work, we consider the weakest possible form of function approximation, called agnostic policy learning, where the learner seeks to find the best policy in a given class $\Pi$, with no guarantee that $\Pi$ contains an optimal policy for the underlying task. Although it is known that sample-efficient agnostic policy learning is not possible in the standard online RL setting without further assumptions, we investigate the extent to which this can be overcome with stronger forms of access to the environment. Specifically, we show that:
- Agnostic policy learning remains statistically intractable when given access to a local simulator, from which one can reset to any previously seen state. This result holds even when the policy class is realizable, and stands in contrast to a positive result of [MFR24] showing that value-based learning under realizability is tractable with local simulator access.
- Agnostic policy learning remains statistically intractable when given online access to a reset distribution with good coverage properties over the state space (the so-called $\mu$-reset setting). We also study stronger forms of function approximation for policy learning, showing that PSDP [BKSN03] and CPI [KL02] provably fail in the absence of policy completeness.
- On a positive note, agnostic policy learning is statistically tractable for Block MDPs with access to both of the above reset models. We establish this via a new algorithm that carefully constructs a policy emulator: a tabular MDP with a small state space that approximates the value functions of all $\pi \in \Pi$. These values are approximated without any explicit value function class.
Taken together, our results contribute to a deeper understanding of the interplay between function approximation and environment access in RL.
Session: 3A - Reinforcement Learning (Tuesday 01 July 14:00–15:36)
Authors: Rohatgi, Dhruv; Foster, Dylan
Abstract:
Algorithms for reinforcement learning (RL) in large state spaces crucially rely on supervised learning subroutines to estimate objects such as value functions or transition probabilities. Since only the simplest supervised learning problems can be solved provably and efficiently, practical performance of an RL algorithm depends on which of these supervised learning ``oracles'' it assumes access to (and how they are implemented). But which oracles are better or worse? Is there a minimal oracle? In this work, we clarify the impact of the choice of supervised learning oracle on the computational complexity of RL, as quantified by the oracle strength. First, for the task of reward-free exploration in Block MDPs in the standard episodic access model---a ubiquitous setting for RL with function approximation---we identify two-context regression as a minimal oracle, i.e. an oracle that is both necessary and sufficient (under a mild regularity assumption). Second, we identify one-context regression as a near-minimal oracle in the stronger reset access model, establishing a provable computational benefit of resets in the process. Third, we broaden our focus to Low-Rank MDPs, where we give cryptographic evidence that the analogous oracle from the Block MDP setting is insufficient.
Session: 3A - Reinforcement Learning (Tuesday 01 July 14:00–15:36)
Authors: Moulin, Antoine; Neu, Gergely; Viano, Luca
Abstract:
We study the problem of reinforcement learning in infinite-horizon discounted linear Markov decision processes (MDPs), and propose the first computationally efficient algorithm achieving near-optimal regret guarantees in this setting. Our main idea is to combine two classic techniques for optimistic exploration: additive exploration bonuses applied to the reward function, and artificial transitions made to an absorbing state with maximal return. We show that, combined with a regularized approximate dynamic-programming scheme, the resulting algorithm achieves a regret of order $\tilde{\mathcal{O}} (\sqrt{d^3 (1 - \gamma)^{- 7 / 2} T})$, where $T$ is the total number of sample transitions, $\gamma \in (0,1)$ is the discount factor, and $d$ is the feature dimensionality. The results continue to hold against adversarial reward sequences, which also enables us to apply our method to the problem of imitation learning in linear MDPs, and achieve state-of-the-art results in this setting.
Session: 3A - Reinforcement Learning (Tuesday 01 July 14:00–15:36)
Authors: Agrawal, Priyank; Agrawal, Shipra
Abstract:
We present an optimistic Q-learning algorithm for regret minimization in average reward reinforcement learning under an additional assumption on the underlying MDP that for all policies, the time to visit some frequent state $s_0$ is finite and upper bounded by $H$, either in expectation or with constant probability. Our setting strictly generalizes the episodic setting and is significantly less restrictive than the assumption of bounded hitting time \textit{for all states} made by most previous literature on model-free algorithms in average reward settings. We demonstrate a regret bound of $\tilde{O}(H^5 S\sqrt{AT})$, where $S$ and $A$ are the numbers of states and actions, and $T$ is the horizon. A key technical novelty of our work is the introduction of an $\overline{L}$ operator defined as $\overline{L} v = \frac{1}{H} \sum_{h=1}^H L^h v$ where $L$ denotes the Bellman operator. Under the given assumption, we show that the $\overline{L}$ operator has a strict contraction (in span) even in the average-reward setting where the discount factor is $1$. Our algorithm design uses ideas from episodic Q-learning to estimate and apply this operator iteratively. Thus, we provide a unified view of regret minimization in episodic and non-episodic settings, which may be of independent interest.
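In tabular form, the averaged operator is straightforward to write down; the sketch below uses our own notation, with rewards r[s, a] and transition kernel P[s, a, s'].

    import numpy as np

    def bellman(v, r, P):
        # (L v)(s) = max_a [ r(s, a) + sum_{s'} P(s' | s, a) v(s') ]
        return np.max(r + P @ v, axis=1)

    def averaged_bellman(v, r, P, H):
        # Lbar v = (1/H) * (L v + L^2 v + ... + L^H v)
        out, w = np.zeros_like(v), v.copy()
        for _ in range(H):
            w = bellman(w, r, P)
            out += w
        return out / H

    def span(v):
        # the semi-norm in which Lbar is shown to be a strict contraction
        return v.max() - v.min()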
Session: 3A - Reinforcement Learning (Tuesday 01 July 14:00–15:36)
Authors: Boone, Victor; Gaujal, Bruno
Abstract:
In average reward Markov decision processes, state-of-the-art algorithms for regret minimization follow a well-established framework: They are model-based, optimistic and episodic. First, they maintain a confidence region from which optimistic policies are computed using a well-known subroutine called Extended Value Iteration (EVI). Second, these policies are used over time windows called episodes, each ended by the Doubling Trick (DT) rule or a variant thereof. In this work, without modifying EVI, we show that there is a significant advantage in replacing (DT) by another simple rule, that we call the Vanishing Multiplicative (VM) rule. When managing episodes with (VM), the algorithm’s regret is, both in theory and in practice, as good if not better than with (DT), while the one-shot behavior is greatly improved. More specifically, the management of bad episodes (when sub-optimal policies are being used) is much better under (VM) than (DT), by making the regret of exploration logarithmic rather than linear. These results are made possible by a new in-depth understanding of the contrasting behaviors of confidence regions during good and bad episodes.
Session: 3A - Reinforcement Learning (Tuesday 01 July 14:00–15:36)
Authors: Mhammedi, Zakaria
Abstract:
Designing sample-efficient and computationally feasible reinforcement learning (RL) algorithms is particularly challenging in environments with large or infinite state and action spaces. In this paper, we advance this effort by presenting an efficient algorithm for Markov Decision Processes (MDPs) where the state-action value function of any policy is linear in a given feature map. This challenging setting can model environments with infinite states and actions, strictly generalizes classic linear MDPs, and currently lacks a computationally efficient algorithm under online access to the MDP. Specifically, we introduce a new RL algorithm that efficiently finds a near-optimal policy in this setting, using a number of episodes and calls to a cost-sensitive classification (CSC) oracle that are both polynomial in the problem parameters. Notably, our CSC oracle can be efficiently implemented when the feature dimension is constant, representing a clear improvement over state-of-the-art methods, which require solving non-convex problems with horizon-many variables and can incur computational costs that are exponential in the horizon.
Time: Tuesday 01 July 14:00–15:36
Session: 3B - Robust Learning (Tuesday 01 July 14:00–15:36)
Authors: Egosi, Amitsour; Yehudai, Gilad; Shamir, Ohad
Abstract:
The memorization capacity of neural networks with a given architecture has been thoroughly studied in many works. Specifically, it is well-known that memorizing $N$ samples can be done using a network of constant width, independent of $N$. However, the required constructions are often quite delicate. In this paper, we consider the natural question of how well feedforward ReLU neural networks can memorize \emph{robustly}, namely while being able to withstand adversarial perturbations of a given radius. We establish both upper and lower bounds on the possible radius for general $l_p$ norms, implying (among other things) that width \emph{logarithmic} in the number of input samples is necessary and sufficient to achieve robust memorization (with robustness radius independent of $N$).
Session: 3B - Robust Learning (Tuesday 01 July 14:00–15:36)
Authors: Guo, Anxin; Vijayaraghavan, Aravindan
Abstract:
We consider the problem of learning an arbitrarily-biased ReLU activation (or neuron) over Gaussian marginals with the squared loss objective. Despite the ReLU neuron being the basic building block of modern neural networks, we still do not understand the basic algorithmic question of whether one arbitrary ReLU neuron is learnable in the non-realizable setting. In particular, all existing polynomial time algorithms only provide approximation guarantees for the better-behaved unbiased setting or restricted bias setting. Our main result is a polynomial time statistical query (SQ) algorithm that gives the first constant factor approximation for arbitrary bias. It outputs a ReLU activation that achieves a loss of $O(\mathrm{OPT}) + \varepsilon$ in time $\mathrm{poly}(d,1/\varepsilon)$, where $\mathrm{OPT}$ is the loss obtained by the optimal ReLU activation. Our algorithm presents an interesting departure from existing algorithms, which are all based on gradient descent and thus fall within the class of correlational statistical query (CSQ) algorithms. We complement our algorithmic result by showing that no polynomial time CSQ algorithm can achieve a constant factor approximation. Together, these results shed light on the intrinsic limitation of gradient descent, while identifying arguably the simplest setting (a single neuron) where there is a separation between SQ and CSQ algorithms.
Session: 3B - Robust Learning (Tuesday 01 July 14:00–15:36)
Authors: Ashtiani, Hassan; Pathak, Vinayak; Urner, Ruth
Abstract:
Adversarially robust PAC learning has proved to be challenging, with the currently best known learners (Montasser et al., 2021a) relying on improper methods based on intricate compression schemes, resulting in sample complexity exponential in the VC-dimension. A series of follow up work considered a slightly relaxed version of the problem called adversarially robust learning with tolerance (Ashtiani et al., 2023; Bhattacharjee et al., 2023; Raman et al., 2024) and achieved better sample complexity in terms of the VC-dimension. However, those algorithms were either improper and complex, or required additional assumptions on the hypothesis class H. We prove, for the first time, the existence of a simpler learner that achieves a sample complexity linear in the VC-dimension without requiring additional assumptions on H. Even though our learner is improper, it is “almost proper” in the sense that it outputs a hypothesis that is “similar” to a hypothesis in H. We also use the ideas from our algorithm to construct a semi-supervised learner in the tolerant setting. This simple algorithm achieves comparable bounds to the previous (non-tolerant) semi-supervised algorithm of Attias et al. (2022a), but avoids the use of intricate subroutines from previous works, and is “almost proper.”
Session: 3B - Robust Learning (Tuesday 01 July 14:00–15:36)
Authors: Daniely, Amit
Abstract:
We show that adversarial examples exist for various random convolutional networks, and furthermore, that this is a relatively simple consequence of the isoperimetric inequality on the special orthogonal group $\mathrm{SO}(d)$. This extends and simplifies a recent line of work which shows similar results for random fully connected networks.
Session: 3B - Robust Learning (Tuesday 01 July 14:00–15:36)
Authors: Cherapanamjeri, Yeshwanth; Lee, Daniel
Abstract:
A large body of work in the statistics and computer science communities dating back to Huber (1960) has led to statistically and computationally efficient outlier-robust estimators. Two particular outlier models have received significant attention: the adversarial and heavy-tailed models. While the former models outliers as the result of a malicious adversary manipulating the data, the latter relaxes distributional assumptions on the data, allowing outliers to naturally occur as part of the data generating process. In the first setting, the goal is to develop estimators robust to the largest fraction of outliers, while in the second, one seeks estimators to combat the loss of statistical efficiency, where the dependence on the failure probability is paramount. Despite these distinct motivations, the algorithmic approaches to both these settings have converged, prompting questions on the relationship between the models. In this paper, we investigate and provide a principled explanation for this phenomenon. First, we prove that any adversarially robust estimator is also resilient to heavy-tailed outliers for any statistical estimation problem with i.i.d. data. As a corollary, optimal adversarially robust estimators for mean estimation, linear regression, and covariance estimation are also optimal heavy-tailed estimators. Conversely, for arguably the simplest high-dimensional estimation task of mean estimation, we construct heavy-tailed estimators whose application to the adversarial setting requires any black-box reduction to remove almost all the outliers in the data. Taken together, our results imply that heavy-tailed estimation is likely easier than adversarially robust estimation, opening the door to novel algorithmic approaches for the heavy-tailed setting. Additionally, confidence intervals obtained for adversarially robust estimation also hold with high probability.
Session: 3B - Robust Learning (Tuesday 01 July 14:00–15:36)
Authors: Diakonikolas, Ilias; M. Kane, Daniel; Ren, Lisheng
Abstract:
We study the algorithmic task of learning Boolean disjunctions in the distribution-free agnostic PAC model. The best known agnostic learner for the class of disjunctions over $\{0, 1\}^n$ is the $L_1$-polynomial regression algorithm, achieving complexity $2^{\tilde{O}(n^{1/2})}$. This complexity bound is known to be nearly best possible within the class of Correlational Statistical Query (CSQ) algorithms. In this work, we develop an agnostic learner for this concept class with complexity $2^{\tilde{O}(n^{1/3})}$. Our algorithm can be implemented in the Statistical Query (SQ) model, providing the first separation between the SQ and CSQ models in distribution-free agnostic learning.
Session: 3B - Robust Learning (Tuesday 01 July 14:00–15:36)
Authors: Liu, Haolin; Wei, Chen-Yu; Zimmert, Julian
Abstract:
Recent work by Foster et al. (2021), Foster et al. (2022), Foster et al. (2023), and Xu and Zeevi (2023) developed the framework of the decision-estimation coefficient (DEC), which characterizes the complexity of general online decision making problems and provides a general algorithm design principle. These works, however, either focus on the pure stochastic regime where the world remains fixed over time, or the pure adversarial regime where the world arbitrarily changes over time. For the hybrid regime, where the dynamics of the world are fixed while the reward arbitrarily changes, they only give pessimistic bounds on the decision complexity. In this work, we propose a general extension of DEC that more precisely characterizes this case. Besides applications in special cases, our framework leads to a flexible algorithm design where the learner learns over partitions of the hypothesis set, trading estimation complexity with decision complexity, which could be of independent interest. Our work covers model-based learning and model-free learning in the hybrid regime, with a newly proposed extension of the bilinear classes (Du et al., 2021) to the adversarial-reward case. We also recover some existing model-free learning results in the pure stochastic regime.
Time: Tuesday 01 July 16:12–18:00
Session: 4A - Diffusion and Sampling (Tuesday 01 July 16:12–18:00)
Authors: Jiang, Minhui; Chen, Yuansi
Abstract:
We study sampling from logconcave distributions truncated on polytopes, motivated by Bayesian models with indicator variables. Built on interior point methods and the Dikin walk, we analyze the mixing time of regularized Dikin walks. Our contributions include: (1) proving that the soft-threshold Dikin walk mixes in $O(mn+\kappa n)$ iterations for logconcave distributions with condition number $\kappa$, dimension $n$ and $m$ linear constraints, without requiring bounded polytopes. Moreover, we introduce the regularized Dikin walk using Lewis weights and show it mixes in $O(n^{2.5}+\kappa n)$; (2) extending the above mixing time guarantees to weakly log-concave truncated distributions with finite covariance matrices; and (3) going beyond worst-case mixing time analysis, showing that the soft-threshold Dikin walk mixes significantly faster when only an $O(1)$ number of constraints intersect the high-probability mass of the distribution, improving the $O(mn+\kappa n)$ upper bound to $O(m + \kappa n)$. Additionally, we provide a practical implementation to generate a warm initialization.
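For concreteness, here is a sketch of one step of a vanilla Metropolis-adjusted Dikin walk on $\{x : Ax < b\}$ (the paper's soft-threshold and Lewis-weight regularized variants modify the local metric; the radius parameter and helper names below are our own assumptions):

    import numpy as np

    def dikin_step(x, a, b, neg_log_density, radius, rng):
        """One Metropolis-adjusted Dikin-walk step inside {z : a @ z < b}."""
        def hess(z):                       # log-barrier Hessian at z
            s = b - a @ z                  # slacks, shape (m,)
            return (a / s[:, None] ** 2).T @ a

        d = x.shape[0]
        Hx = hess(x)
        # Propose z ~ N(x, (radius^2 / d) * Hx^{-1}).
        L = np.linalg.cholesky(np.linalg.inv(Hx))
        z = x + (radius / np.sqrt(d)) * L @ rng.standard_normal(d)
        if np.any(a @ z >= b):
            return x                       # proposal left the polytope
        Hz = hess(z)

        def log_q(h, u, v):                # log density of v ~ N(u, (r^2/d) h^{-1})
            diff = v - u                   # up to constants shared by both terms
            _, logdet = np.linalg.slogdet(h)
            return 0.5 * logdet - 0.5 * (d / radius ** 2) * diff @ h @ diff

        # Metropolis correction for the state-dependent proposal covariance.
        log_ratio = (-neg_log_density(z) + neg_log_density(x)
                     + log_q(Hz, z, x) - log_q(Hx, x, z))
        return z if np.log(rng.random()) < log_ratio else x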
Session: 4A - Diffusion and Sampling (Tuesday 01 July 16:12–18:00)
Authors: Wu, Yu-Han; Marion, Pierre; Biau, Gerard; Boyer, Claire
Abstract:
Denoising score matching plays a pivotal role in the performance of diffusion-based generative models. However, the empirical optimal score (the exact solution to the denoising score matching problem) leads to memorization, where generated samples replicate the training data. Yet, in practice, only a moderate degree of memorization is observed, even without explicit regularization. In this paper, we investigate this phenomenon by uncovering an implicit regularization mechanism driven by large learning rates. Specifically, we show that in the small-noise regime, the empirical optimal score exhibits high irregularity. We then prove that, when trained by stochastic gradient descent with a large enough learning rate, neural networks cannot stably converge to a local minimum with arbitrarily small excess risk. Consequently, the learned score cannot be arbitrarily close to the empirical optimal score, thereby mitigating memorization. To make the analysis tractable, we consider one-dimensional data and two-layer neural networks. Experiments validate the crucial role of the learning rate in preventing memorization, even beyond the one-dimensional setting.
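The irregularity in question can be read off the closed form of the empirical optimal score, which is standard in this literature (the notation here is ours): for training points $x_1, \dots, x_n$ and noising kernel $\mathcal{N}(\alpha_t x_i, \sigma_t^2 I)$,
\[
s_t^\star(x) \;=\; \nabla_x \log \frac{1}{n} \sum_{i=1}^n \mathcal{N}(x;\, \alpha_t x_i,\, \sigma_t^2 I)
\;=\; \sum_{i=1}^n w_i(x)\, \frac{\alpha_t x_i - x}{\sigma_t^2},
\qquad
w_i(x) \propto \exp\!\left( - \frac{\|x - \alpha_t x_i\|^2}{2 \sigma_t^2} \right),
\]
so as $\sigma_t \to 0$ the softmax weights collapse onto the nearest training point, producing the high irregularity that drives memorization.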
Session: 4A - Diffusion and Sampling (Tuesday 01 July 16:12–18:00)
Authors: Potaptchik, Peter; Azangulov, Iskander; Deligiannidis, George
Abstract:
Score-matching generative models have proven successful at sampling from complex high-dimensional data distributions. In many applications, this distribution is believed to concentrate on a much lower $d$-dimensional manifold embedded into $D$-dimensional space; this is known as the manifold hypothesis. The current best-known convergence guarantees are either linear in $D$ or polynomial (superlinear) in $d$. The latter exploits a novel integration scheme for the backward SDE. We take the best of both worlds and show that the number of steps diffusion models require in order to converge in Kullback-Leibler~(KL) divergence is linear (up to logarithmic terms) in the intrinsic dimension $d$. Moreover, we show that this linear dependency is sharp.
Session: 4A - Diffusion and Sampling (Tuesday 01 July 16:12–18:00)
Authors: Chen, Sitan; Kontonis, Vasilis; Shah, Kulin
Abstract:
We study the problem of learning mixtures of $k$ Gaussians in $d$ dimensions. We make no separation assumptions on the underlying mixture components: we only require that the covariance matrices have bounded condition number and that the means and covariances lie in a ball of bounded radius. We give an algorithm that draws $d^{\mathrm{poly}(k/\varepsilon)}$ samples from the target mixture, runs in sample-polynomial time, and constructs a sampler whose output distribution is $\varepsilon$-close to the unknown mixture in total variation. Prior works for this problem either (i) required exponential runtime in the dimension $d$, (ii) placed strong assumptions on the instance (e.g., spherical covariances or clusterability), or (iii) had doubly exponential dependence on the number of components $k$. Our approach departs from commonly used techniques for this problem like the method of moments. Instead, we leverage a recently developed reduction, based on diffusion models, from distribution learning to a supervised learning task called score matching. We give an algorithm for the latter by proving a structural result showing that the score function of a Gaussian mixture can be approximated by a piecewise-polynomial function, and there is an efficient algorithm for finding it. To our knowledge, this is the first example of diffusion models achieving a state-of-the-art theoretical guarantee for an unsupervised learning task.
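The structural result concerns the score of a Gaussian mixture, which has the following well-known form (display ours): for $p(x) = \sum_k w_k \mathcal{N}(x; \mu_k, \Sigma_k)$,
\[
\nabla \log p(x) \;=\; \sum_k \pi_k(x)\, \Sigma_k^{-1} (\mu_k - x),
\qquad
\pi_k(x) \;=\; \frac{w_k\, \mathcal{N}(x; \mu_k, \Sigma_k)}{\sum_j w_j\, \mathcal{N}(x; \mu_j, \Sigma_j)},
\]
a softmax-weighted combination of per-component linear scores, which is what makes a piecewise-polynomial approximation plausible.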
Session: 4A - Diffusion and Sampling (Tuesday 01 July 16:12–18:00)
Authors: Zhu, Yusong; Kumar, Syamantak; Sarkar, Purnamrita; Tian, Kevin
Abstract:
Posterior sampling with the spike-and-slab prior (Mitchell and Beauchamp, 1988), a popular multi-modal distribution used to model uncertainty in variable selection, is considered the theoretical gold standard method for Bayesian sparse linear regression (Carvalho et al., 2009; Rockova et al., 2018). However, designing provable algorithms for performing this sampling task is notoriously challenging. Existing posterior samplers for Bayesian sparse variable selection tasks either require strong assumptions about the signal-to-noise ratio (SNR) (Yang et al., 2016), only work in moderate-dimensional regimes with a full-rank measurement matrix (Montanari and Wu, 2024), or rely on heuristic approximations to the posterior. We give the first provable algorithms for spike-and-slab posterior sampling that apply for any SNR, and use a measurement count sublinear in the problem dimension. Concretely, assume we are given a measurement matrix $X \in \mathbb{R}^{n \times d}$ and noisy observations $y = X\theta^* + \xi$ of a signal $\theta^*$ drawn from a spike-and-slab prior $\pi$ with a Gaussian diffuse density and expected sparsity $k$, where the noise satisfies $\xi \sim \mathcal{N}(0, \sigma^2 I_n)$. We give a polynomial-time, high-accuracy sampler for the posterior $\pi(\cdot \mid X, y)$, for any SNR satisfying $\sigma^{-1} > 0$, as long as $n \ge k^3 \cdot \mathrm{polylog}(d)$ and $X$ is drawn from a matrix ensemble satisfying the restricted isometry property. We further give a sampler that runs in near-linear time $\widetilde{O}(nd)$ in the same setting, as long as $n \ge k^5 \cdot \mathrm{polylog}(d)$. To demonstrate the flexibility of our framework, we extend our result to spike-and-slab posterior sampling with Laplace diffuse densities, achieving similar guarantees when $\sigma = O(1/k)$.
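For reference, a Gaussian spike-and-slab prior with expected sparsity $k$ takes the standard Mitchell-Beauchamp form (display ours; the slab variance $\varsigma^2$ and the mixing weight $k/d$ are our reading of the stated setup):
\[
\theta_j \;\overset{\mathrm{i.i.d.}}{\sim}\; \left(1 - \tfrac{k}{d}\right) \delta_0 \;+\; \tfrac{k}{d}\, \mathcal{N}(0, \varsigma^2), \qquad j = 1, \dots, d,
\]
whose $2^d$ support patterns are the source of the multi-modality that makes sampling hard.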
Session: 4A - Diffusion and Sampling (Tuesday 01 July 16:12–18:00)
Authors: Liang, Jiadong; Huang, Zhihan; Chen, Yuxin
Abstract:
This paper investigates how diffusion generative models leverage (unknown) low-dimensional structure to accelerate sampling. Focusing on two mainstream samplers -- the denoising diffusion implicit model (DDIM) and the denoising diffusion probabilistic model (DDPM) -- and assuming accurate score estimates, we prove that their iteration complexities are no greater than the order of $k/\varepsilon$ (up to some log factor), where $\varepsilon$ is the precision in total variation distance and $k$ is some intrinsic dimension of the target distribution. Our results are applicable to a broad family of target distributions without requiring smoothness or log-concavity assumptions. Further, we develop a lower bound that suggests the (near) necessity of the coefficients introduced by Ho et al. (2020) and Song et al. (2020) in facilitating low-dimensional adaptation. Our findings provide the first rigorous evidence for the adaptivity of the DDIM-type samplers to unknown low-dimensional structure, and improve over the state-of-the-art DDPM theory regarding total variation convergence.
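The coefficients referred to are those of the standard sampler updates of Ho et al. (2020) and Song et al. (2020), which in the usual notation ($\bar\alpha_t = \prod_{s \le t} \alpha_s$, noise prediction $\hat\varepsilon$) read
\[
\text{DDPM:}\quad x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar\alpha_t}}\, \hat\varepsilon(x_t, t) \right) + \sigma_t z, \qquad z \sim \mathcal{N}(0, I),
\]
\[
\text{DDIM:}\quad x_{t-1} = \sqrt{\bar\alpha_{t-1}}\; \hat{x}_0 + \sqrt{1 - \bar\alpha_{t-1}}\; \hat\varepsilon(x_t, t), \qquad \hat{x}_0 = \frac{x_t - \sqrt{1 - \bar\alpha_t}\, \hat\varepsilon(x_t, t)}{\sqrt{\bar\alpha_t}}.
\]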
Session: 4A - Diffusion and Sampling (Tuesday 01 July 16:12–18:00)
Authors: Gatmiry, Khashayar; Lee, Holden; Kelner, Jonathan A.
Abstract:
We give a new algorithm for learning mixtures of $k$ Gaussians (with identity covariance in $\mathbb{R}^n$) to TV error $\varepsilon$, with quasi-polynomial ($O(n^{\text{poly\,log}\left(\frac{n+k}{\varepsilon}\right)})$) time and sample complexity, under a minimum weight assumption. Our results extend to continuous mixtures of Gaussians where the mixing distribution is supported on a union of $k$ balls of constant radius. In particular, this applies to the case of Gaussian convolutions of distributions on low-dimensional manifolds, or more generally sets with small covering number, for which no sub-exponential algorithm was previously known. Unlike previous approaches, most of which are algebraic in nature, our approach is analytic and relies on the framework of diffusion models. Diffusion models are a modern paradigm for generative modeling, which typically rely on learning the score function (gradient log-pdf) along a process transforming a pure noise distribution, in our case a Gaussian, to the data distribution. Despite their dazzling performance in tasks such as image generation, there are few end-to-end theoretical guarantees that they can efficiently learn nontrivial families of distributions; we give some of the first such guarantees. We proceed by deriving higher-order Gaussian noise sensitivity bounds for the score functions for a Gaussian mixture to show that they can be inductively learned using piecewise polynomial regression (up to poly-logarithmic degree), and combine this with known convergence results for diffusion models.
Session: 4A - Diffusion and Sampling (Tuesday 01 July 16:12–18:00)
Authors: Chen, August; Sridharan, Karthik
Abstract:
In this paper, we prove that optimizability of any $F$ using Gradient Flow from all initializations implies a Poincaré Inequality for Gibbs measures $\mu_\beta = e^{-\beta F}/Z$ at low temperature. In particular, under mild regularity assumptions on the convergence rate of Gradient Flow, we establish that $\mu_\beta$ satisfies a Poincaré Inequality with constant $O(C')$ for $\beta \ge \Omega(d)$, where $C'$ is the Poincaré constant of $\mu_\beta$ restricted to a neighborhood of the global minimizers of $F$. Under an additional mild condition on $F$, we show that $\mu_\beta$ satisfies a Log-Sobolev Inequality with constant $O(S \beta C')$, where $S$ denotes the second moment of $\mu_\beta$. Here the asymptotic notation hides $F$-dependent parameters. At a high level, this establishes that optimizability via Gradient Flow from every initialization implies a Poincaré and Log-Sobolev Inequality for the low-temperature Gibbs measure, which in turn imply sampling from all initializations. Analogously, we establish that, under the same assumptions, if $F$ is optimizable via Gradient Flow from every initialization outside some set $S$, then $\mu_\beta$ satisfies a Weak Poincaré Inequality with parameters $(O(C'), O(\mu_\beta(S)))$ for $\beta = \Omega(d)$. At a high level, this shows that optimizability from `most' initializations implies a Weak Poincaré Inequality, which in turn implies sampling from suitable warm starts. Our regularity assumptions are mild, and as a consequence we show that we can efficiently sample from several new natural and interesting classes of non-log-concave densities, an important setting with relatively few examples. As another corollary, we obtain efficient discrete-time sampling results for log-concave measures satisfying milder regularity conditions than smoothness, similar to Lehec (2023).
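For reference, the functional inequalities in question are (in our notation, for test functions $g$):
\[
\mathrm{Var}_{\mu_\beta}(g) \;\le\; C_{\mathrm{P}} \int \|\nabla g\|^2 \, d\mu_\beta
\quad \text{(Poincaré)},
\qquad
\mathrm{Ent}_{\mu_\beta}(g^2) \;\le\; 2\, C_{\mathrm{LS}} \int \|\nabla g\|^2 \, d\mu_\beta
\quad \text{(log-Sobolev)},
\]
which respectively govern exponential convergence of the Langevin dynamics for $\mu_\beta$ in chi-squared divergence and in KL divergence.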
Session: 4A - Diffusion and Sampling (Tuesday 01 July 16:12–18:00)
Authors: Yakovlev, Konstantin; Puchkin, Nikita
Abstract:
We examine theoretical properties of the denoising score matching estimate. We model the density of observations with a nonparametric Gaussian mixture. We significantly relax the standard manifold assumption, allowing the samples to step away from the manifold. At the same time, we are still able to leverage a nice distribution structure. We derive non-asymptotic bounds on the approximation and generalization errors of the denoising score matching estimate. The rates of convergence are determined by the intrinsic dimension. Furthermore, our bounds remain valid even if we allow the ambient dimension to grow polynomially with the sample size.
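The estimate under study minimizes the standard denoising score matching objective (Vincent, 2011); in our notation, at noise level $\sigma$,
\[
\widehat{s} \;\in\; \arg\min_{s} \; \mathbb{E}_{x \sim p,\; z \sim \mathcal{N}(0, I)} \left\| s(x + \sigma z) + \frac{z}{\sigma} \right\|^2,
\]
whose population minimizer is the score of the convolution $p * \mathcal{N}(0, \sigma^2 I)$.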
Time: Tuesday 01 July 16:12–18:00
Session: 4B - Privacy and Fairness (Tuesday 01 July 16:12–18:00)
Authors: S. Portella, Victor; Harvey, Nick
Abstract:
One of the most basic problems in statistics is estimating the covariance matrix of a Gaussian distribution. Over the past decade, researchers have studied the efficiency of covariance estimation in the setting of differential privacy. The goal is to minimize the number of samples needed to achieve particular accuracy and privacy guarantees. We prove lower bounds on the number of samples needed to privately estimate the covariance matrix of a Gaussian distribution. Our bounds match existing upper bounds in the widest known setting of parameters. Our analysis can be seen as a fingerprinting argument, one of the main techniques used to prove lower bounds in differential privacy. Most fingerprinting arguments rely on results analogous to the celebrated Stein's identity from probability theory. Our argument uses a matrix extension of this identity known as the Stein-Haff identity.
Session: 4B - Privacy and Fairness (Tuesday 01 July 16:12–18:00)
Authors: Ghazi, Badih; Guzman, Cristóbal; Kamath, Pritish; Knop, Alexander; Kumar, Ravi; Manurangsi, Pasin; Sachdeva, Sushant
Abstract:
We introduce $\mathsf{PREM}$ (Private Relative Error Multiplicative weight update), a new framework for generating synthetic data that achieves a {\em relative} error guarantee for statistical queries under $(\varepsilon, \delta)$ differential privacy (DP). Namely, for a domain ${\cal X}$, a family ${\cal F}$ of queries $f : {\cal X} \to \{0, 1\}$, and $\zeta > 0$, our framework yields a mechanism that on input dataset $D \in {\cal X}^n$ outputs a synthetic dataset $\widehat{D} \in {\cal X}^n$ such that all statistical queries in ${\cal F}$ on $D$, namely $\sum_{x \in D} f(x)$ for $f \in {\cal F}$, are within a $1 \pm \zeta$ {\em multiplicative} factor of the corresponding value on $\widehat{D}$ up to an {\em additive error} that is polynomial in $\log |{\cal F}|$, $\log |{\cal X}|$, $\log n$, $\log(1/\delta)$, $1/\varepsilon$, and $1/\zeta$. In contrast, any $(\varepsilon, \delta)$-DP mechanism is known to require worst-case additive error that is polynomial in at least one of $n, |{\cal F}|$, or $|{\cal X}|$. We complement our algorithm with nearly matching lower bounds.
Session: 4B - Privacy and Fairness (Tuesday 01 July 16:12–18:00)
Authors: Aliakbarpour, Maryam; Burudgunte, Arnav; Canonne, Clement; Rubinfeld, Ronitt
Abstract:
We extend the framework of augmented distribution testing (Aliakbarpour, Indyk, Rubinfeld, and Silwal, NeurIPS 2024) to the differentially private setting. This captures scenarios where a data analyst must perform hypothesis testing tasks on sensitive data, but is able to leverage prior knowledge (public, but possibly erroneous or untrusted) about the data distribution. We design private algorithms in this augmented setting for three flagship distribution testing tasks, uniformity, identity, and closeness testing, whose sample complexity smoothly scales with the claimed quality of the auxiliary information. We complement our algorithms with information-theoretic lower bounds, showing that their sample complexity is optimal (up to logarithmic factors).
Session: 4B - Privacy and Fairness (Tuesday 01 July 16:12–18:00)
Authors: Iverson, Valentio; Kamath, Gautam; Mouzakis, Argyris
Abstract:
We provide the first $\tilde O(d)$-sample algorithm for sampling from unbounded Gaussian distributions under the constraint of $(\varepsilon, \delta)$-differential privacy. This is a quadratic improvement over previous results for the same problem.
Session: 4B - Privacy and Fairness (Tuesday 01 July 16:12–18:00)
Authors: Li, Bo; Wang, Wei; Ye, Peng
Abstract:
The realizable-to-agnostic transformation (Beimel et al., 2015; Alon et al., 2020) provides a general mechanism to convert a private learner in the realizable setting (where the examples are labeled by some function in the concept class) to a private learner in the agnostic setting (where no assumptions are imposed on the data). Specifically, for any concept class $\mathcal{C}$ and error parameter $\alpha$, a private realizable learner for $\mathcal{C}$ can be transformed into a private agnostic learner while only increasing the sample complexity by $\widetilde{O}(\mathrm{VC}(\mathcal{C})/\alpha^2)$, which is essentially tight assuming a constant privacy parameter $\varepsilon = \Theta(1)$. However, when $\varepsilon$ can be arbitrary, one has to apply the standard privacy-amplification-by-subsampling technique (Kasiviswanathan et al., 2011), resulting in a suboptimal extra sample complexity of $\widetilde{O}(\mathrm{VC}(\mathcal{C})/\alpha^2\varepsilon)$ that involves a $1/\varepsilon$ factor. In this work, we give an improved construction that eliminates the $1/\varepsilon$ dependency, thereby achieving a near-optimal extra sample complexity of $\widetilde{O}(\mathrm{VC}(\mathcal{C})/\alpha^2)$ for any $\varepsilon\le 1$. Moreover, our result reveals that in private agnostic learning, the privacy cost is only significant for the realizable part. We also leverage our technique to obtain a nearly tight sample complexity bound for the private prediction problem, resolving an open question posed by Dwork and Feldman (2018) and Dagan and Feldman (2020).
Session: 4B - Privacy and Fairness (Tuesday 01 July 16:12–18:00)
Authors: Dwork, Cynthia; Hays, Chris; Immorlica, Nicole; Perdomo, Juan C.; Tankala, Pranay
Abstract:
Professional networks provide invaluable entrée to opportunity through referrals and introductions. A rich literature shows they also serve to entrench and even exacerbate a status quo of privilege and disadvantage. Hiring platforms, equipped with the ability to nudge link formation, provide a tantalizing opening for beneficial structural change. We anticipate that key to this prospect will be the ability to estimate the likelihood of edge formation in an evolving graph. Outcome-indistinguishable prediction algorithms ensure that the modeled world is indistinguishable from the real world by a family of statistical tests. Omnipredictors ensure that predictions can be post-processed to yield loss minimization competitive with respect to a benchmark class of predictors for many losses simultaneously. We begin by observing that, by combining a slightly modified form of the online K29* algorithm of Vovk (2007) with basic facts from the theory of reproducing kernel Hilbert spaces, one can derive simple and efficient online algorithms satisfying outcome indistinguishability and omniprediction, with guarantees that improve upon, or are complementary to, those currently known. This is of independent interest; for example, we obtain efficient outcome indistinguishability for some interesting infinite collections of tests, as well as for any bounded function --- including those computable by deep (graph) neural networks. We apply these techniques to evolving graphs by designing efficient kernel functions that capture socially meaningful features of nodes and their neighborhoods. We obtain online outcome-indistinguishable omnipredictors for rich --- possibly infinite --- sets of distinguishers yielding, inter alia, multicalibrated predictions of edge formation with respect to pairs of demographic groups, and the ability to simultaneously optimize loss as measured by a variety of social welfare functions.
Session: 4B - Privacy and Fairness (Tuesday 01 July 16:12–18:00)
Authors: Peng, Pan; Xu, Hangyu
Abstract:
We study the problem of releasing a differentially private (DP) synthetic graph $G'$ that well approximates the triangle-motif sizes of all cuts of any given graph $G$, where a motif in general refers to a frequently occurring subgraph within complex networks. Non-private versions of such graphs have found applications in diverse fields such as graph clustering, graph sparsification, and social network analysis. Specifically, we present the first $(\varepsilon,\delta)$-DP mechanism that, given an input graph $G$ with $n$ vertices, $m$ edges, and local sensitivity of triangles $LS_{K_3}$, generates a synthetic graph $G'$ in polynomial time, approximating the triangle-motif sizes of all cuts $(S,V\setminus S)$ of the input graph $G$ up to an additive error of $\tilde{O}(\sqrt{m\,LS_{K_3}}\,n/\varepsilon^{3/2})$. Additionally, we provide a lower bound of $\Omega(\sqrt{mn}\,LS_{K_3}/\varepsilon)$ on the additive error for any DP algorithm that answers the triangle-motif size queries of all $(S,T)$-cuts of $G$. Finally, our algorithm generalizes to weighted graphs, and our lower bound extends to any $K_h$-motif cut for any constant $h\geq 2$.
Session: 4B - Privacy and Fairness (Tuesday 01 July 16:12–18:00)
Authors: Cherapanamjeri, Yeshwanth; Garg, Sumegha; Sekhari, Ayush; Shetty, Abhishek; Rajaraman, Nived
Abstract:
We study the memory complexity of machine unlearning algorithms that provide strong data deletion guarantees to the users. Formally, consider an algorithm for a particular learning task that initially receives a training dataset. Then, after learning, it receives data deletion requests from a subset of users (of arbitrary size), and the goal of unlearning is to perform the task as if the learner never received the data of deleted users. In this paper, we ask how many bits of storage are needed to be able to delete certain training samples at a later time. We focus on the task of realizability testing, where the goal is to check whether the remaining training samples are realizable within a given hypothesis class $\mathcal{H}$. Toward that end, we first provide a negative result showing that the VC dimension---a well-known combinatorial property of $\mathcal{H}$ that characterizes the amount of information needed for learning and representing the ERM hypothesis in the standard PAC learning task---does not characterize the space complexity of unlearning. In particular, we provide a hypothesis class with constant VC dimension (and Littlestone dimension), but for which any unlearning algorithm for realizability testing needs to store $\Omega(n)$ bits, where $n$ denotes the size of the initial training dataset. In fact, we provide a stronger separation by showing that for any hypothesis class $\mathcal{H}$, the amount of information that the learner needs to store, so as to perform unlearning later, is lower bounded by the eluder dimension of $\mathcal{H}$, a combinatorial notion always larger than the VC dimension. We complement the lower bound with an upper bound, albeit in a stronger ticketed-memory model proposed by [Ghazi et al., 2023]. We show that for any class $\mathcal{H}$ with bounded eluder dimension, there exists a ticketed scheme that uses only $\widetilde{O}(\mathrm{Eluder}(\mathcal{H}))$ bits of storage, with tickets of comparable size.
Session: 4B - Privacy and Fairness (Tuesday 01 July 16:12–18:00)
Authors: Hanneke, Steve; Moran, Shay; Schefler, Hilla; Tsubari, Iska
Abstract:
This work explores the connection between differential privacy (DP) and online learning in the context of PAC list learning. In this setting, a $k$-list learner outputs a list of $k$ potential predictions for an instance $x$ and incurs a loss if the true label of $x$ is not included in the list. A basic result in the multiclass PAC framework with a finite number of labels states that private learnability is equivalent to online learnability [Alon, Livni, Malliaris, and Moran (2019); Bun, Livni, and Moran (2020); Jung, Kim, and Tewari (2020)]. Perhaps surprisingly, we show that this equivalence does not hold in the context of list learning. Specifically, we prove that, unlike in the multiclass setting, a finite $k$-Littlestone dimension—a variant of the classical Littlestone dimension that characterizes online $k$-list learnability—is not a sufficient condition for DP $k$-list learnability. However, similar to the multiclass case, we prove that it remains a necessary condition. To demonstrate where the equivalence breaks down, we provide an example showing that the class of monotone functions with $k+1$ labels over $\mathbb{N}$ is online $k$-list learnable, but not DP $k$-list learnable. This leads us to introduce a new combinatorial dimension, the \emph{$k$-monotone dimension}, which serves as a generalization of the threshold dimension. Unlike the multiclass setting, where the Littlestone and threshold dimensions are finite together, for $k>1$, the $k$-Littlestone and $k$-monotone dimensions do not exhibit this relationship. We prove that a finite $k$-monotone dimension is another necessary condition for DP $k$-list learnability, alongside finite $k$-Littlestone dimension. Whether the finiteness of both dimensions implies private $k$-list learnability remains an open question.
Time: Wednesday 02 July 09:00–10:36
Session: 5A - Online Learning I (Wednesday 02 July 09:00–10:36)
Authors: Blondal, Ari; Gao, Shan; Hatami, Hamed; Hatami, Pooya
Abstract:
Two seminal papers--Alon, Livni, Malliaris, Moran~(STOC 2019) and Bun, Livni, and Moran~(FOCS 2020)--established the equivalence between online learnability and globally stable PAC learnability in binary classification. However, Chase, Chornomaz, Moran, and Yehudayoff (STOC 2024) recently showed that this equivalence does not hold in the agnostic setting. Specifically, they proved that in the agnostic setting, only finite hypothesis classes are globally stable learnable. Therefore, agnostic global stability is too restrictive to capture interesting hypothesis classes. To address this limitation, Chase et al. introduced two relaxations of agnostic global stability. In this paper, we characterize the classes that are learnable under their proposed relaxed conditions, resolving the two open problems raised in their work. First, we prove that in the setting where the stability parameter can depend on the excess error (the gap between the learner's error and the best achievable error by the hypothesis class), agnostic stability is fully characterized by the Littlestone dimension. Consequently, as in the realizable case, this form of learnability is equivalent to online learnability. As part of the proof of this theorem, we strengthen the celebrated result of Bun et al. by showing that classes with infinite Littlestone dimension are not stably PAC learnable, even if we allow the stability parameter to depend on the excess error. For the second relaxation proposed by Chase et al., we prove that only finite hypothesis classes are globally stable learnable even if we restrict the agnostic setting to distributions with small population loss.
Session: 5A - Online Learning I (Wednesday 02 July 09:00–10:36)
Authors: Montasser, Omar; Shetty, Abhishek; Zhivotovskiy, Nikita
Abstract:
We revisit online binary classification by shifting the focus from competing with the best-in-class binary loss to competing against relaxed benchmarks that capture smoothed notions of optimality. Instead of measuring regret relative to the exact minimal binary error—a standard approach that leads to worst-case bounds tied to the Littlestone dimension—we consider comparing with predictors that are robust to small input perturbations, perform well under Gaussian smoothing, or maintain a prescribed output margin. Previous examples of this were primarily limited to the hinge loss. Our algorithms achieve regret guarantees that depend only on the VC dimension and the complexity of the instance space (e.g., metric entropy), and notably, they incur only an $O(\log(1/\gamma))$ dependence on the generalized margin $\gamma$. This stands in contrast to most existing regret bounds, which typically exhibit a polynomial dependence on $1/\gamma$. We complement this with matching lower bounds. Our analysis connects recent ideas from adversarial robustness and smoothed online learning.
Session: 5A - Online Learning I (Wednesday 02 July 09:00–10:36)
Authors: Qiao, Mingda; Zhao, Eric
Abstract:
Calibration measures quantify how much a forecaster's predictions violate calibration, which requires that forecasts be unbiased conditional on the forecasted probabilities. Two important desiderata for a calibration measure are its decision-theoretic implications [KPLST23] (i.e., downstream decision-makers that best-respond to the forecasts are always no-regret) and its truthfulness [HQYZ24] (i.e., a forecaster approximately minimizes the error by always reporting the true probabilities). Existing measures satisfy at most one of the two properties, but not both. We introduce a new calibration measure termed step calibration that strengthens the U-Calibration error (UCal) of [KPLST23]. Our main result shows that a subsampled variant of step calibration, stepCE^sub, is both decision-theoretic and truthful. In particular, on any product distribution, stepCE^sub is truthful up to an $O(1)$ factor whereas UCal suffers from an $e^{-\Omega(T)}$-$\Omega(\sqrt{T})$ truthfulness gap. Moreover, in any smoothed setting where the conditional probability of each event is perturbed by a noise of magnitude $c > 0$, stepCE^sub is truthful up to an $O(\sqrt{\log(1/c)})$ factor, while UCal has an $e^{-\Omega(T)}$-$\Omega(T^{1/3})$ truthfulness gap. We also prove more generally an impossibility result for truthful decision-theoretic forecasting: any complete and decision-theoretic calibration measure must be discontinuous and non-truthful in the non-smoothed setting.
Session: 5A - Online Learning I (Wednesday 02 July 09:00–10:36)
Authors: Lu, Jiuyao; Roth, Aaron; Shi, Mirah
Abstract:
We define ``decision swap regret'' which generalizes both prediction for downstream swap regret and omniprediction, and give algorithms for obtaining it for arbitrary multi-dimensional Lipschitz loss functions in online adversarial settings. We also give sample complexity bounds in the batch setting via an online-to-batch reduction. When applied to omniprediction, our algorithm gives the first polynomial sample-complexity bounds for Lipschitz loss functions---prior bounds either applied only to linear loss (or binary outcomes) or scaled exponentially with the error parameter even under the assumption that the loss functions were convex. When applied to prediction for downstream regret, we give the first algorithm capable of guaranteeing swap regret bounds for all downstream agents with non-linear loss functions over a multi-dimensional outcome space: prior work applied only to linear loss functions, modeling risk neutral agents. Our general bounds scale exponentially with the dimension of the outcome space, but we give improved regret and sample complexity bounds for specific families of multidimensional functions of economic interest: constant elasticity of substitution (CES), Cobb-Douglas, and Leontief utility functions.
Session: 5A - Online Learning I (Wednesday 02 July 09:00–10:36)
Authors: Gatmiry, Khashayar; Schneider, Jon; Jegelka, Stefanie
Abstract:
Follow-the-Regularized-Leader (FTRL) algorithms are a popular class of learning algorithms for online linear optimization (OLO) that guarantee sub-linear regret. However, the choice of regularizer can significantly impact dimension-dependent factors in the regret bound. We present an algorithm that takes as input convex and symmetric action sets and loss sets for a specific OLO instance, and outputs a regularizer such that running FTRL with this regularizer guarantees regret within a universal constant factor of the best possible regret bound. In particular, for any choice of (convex, symmetric) action set and loss set we prove that there exists an instantiation of FTRL that achieves regret within a constant factor of the best possible learning algorithm, strengthening the universality result of Srebro et al., 2011. Our algorithm requires preprocessing time and space exponential in the dimension $d$ of the OLO instance, but can be run efficiently online assuming a membership and linear optimization oracle for the action and loss sets, respectively (and is fully polynomial time for the case of constant dimension $d$). We complement this with a lower bound showing that even deciding whether a given regularizer is $\alpha$-strongly-convex with respect to a given norm is NP-hard.
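As a point of reference for the FTRL template the abstract optimizes over, here is a minimal sketch of FTRL for online linear optimization on the Euclidean ball with a quadratic regularizer; the regularizer and step size are standard illustrative choices, not the instance-optimal regularizer constructed in the paper.

```python
import numpy as np

def ftrl_euclidean(losses, radius=1.0, eta=0.1):
    """FTRL on the l2 ball: x_t = argmin_{||x|| <= radius} <sum of past losses, x> + ||x||^2 / (2*eta).

    For the Euclidean ball, the minimizer is -eta * (cumulative loss), projected onto the ball.
    Returns the regret against the best fixed action in hindsight.
    """
    d = losses.shape[1]
    cum = np.zeros(d)
    total = 0.0
    for g in losses:
        x = -eta * cum
        norm = np.linalg.norm(x)
        if norm > radius:                  # projection onto the l2 ball
            x *= radius / norm
        total += g @ x
        cum += g
    best = -radius * np.linalg.norm(cum)   # best fixed action: -radius * cum / ||cum||
    return total - best

rng = np.random.default_rng(1)
G = rng.normal(size=(1000, 5))             # adversary's loss vectors
print("regret:", ftrl_euclidean(G, eta=1 / np.sqrt(1000)))
```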
Session: 5A - Online Learning I (Wednesday 02 July 09:00–10:36)
Authors: Mhammedi, Zakaria
Abstract:
In this paper, we introduce a new projection-free algorithm for Online Convex Optimization (OCO) with a state-of-the-art regret guarantee among separation-based algorithms. Existing projection-free methods based on the classical Frank-Wolfe algorithm achieve a suboptimal regret bound of $O(T^{3/4})$, while more recent separation-based approaches guarantee a regret bound of $O(\kappa \sqrt{T})$, where $\kappa$ denotes the asphericity of the feasible set, defined as the ratio of the radii of the containing and contained balls. However, for ill-conditioned sets, $\kappa$ can be arbitrarily large, potentially leading to poor performance. Our algorithm achieves a regret bound of $\widetilde{O}(\sqrt{dT} + \kappa d)$, while requiring only $\widetilde{O}(1)$ calls to a separation oracle per round. Crucially, the main term in the bound, $\widetilde{O}(\sqrt{d T})$, is independent of $\kappa$, addressing the limitations of previous methods. Additionally, as a by-product of our analysis, we recover the $O(\kappa \sqrt{T})$ regret bound of existing OCO algorithms with a more straightforward analysis and improve the regret bound for projection-free online exp-concave optimization. Finally, for constrained stochastic convex optimization, we achieve a state-of-the-art convergence rate of $\widetilde{O}(\sigma/\sqrt{T} + \kappa d/T)$, where $\sigma$ represents the noise in the stochastic gradients, while requiring only $\widetilde{O}(1)$ calls to a separation oracle per iteration.
Session: 5A - Online Learning I (Wednesday 02 July 09:00–10:36)
Authors: Hait, Soumita; Li, Ping; Luo, Haipeng; Zhang, Mengxiao
Abstract:
Motivated by alternating learning dynamics in two-player games, a recent work by \citet{cevher2024alternation} shows that $o(\sqrt{T})$ alternating regret is possible for any $T$-round adversarial Online Linear Optimization (OLO) problem, and leaves as an open question whether the same is true for general Online Convex Optimization (OCO). We answer this question in the affirmative by showing that the continuous Hedge algorithm achieves $\tilde{\mathcal{O}}(d^{\frac{2}{3}}T^{\frac{1}{3}})$ alternating regret for any adversarial $d$-dimensional OCO problem. We show that this implies an alternating learning dynamic that finds a Nash equilibrium for any convex-concave zero-sum game or a coarse correlated equilibrium for any convex two-player general-sum game at a rate of $\tilde{\mathcal{O}}(d^{\frac{2}{3}}/T^{\frac{2}{3}})$. To further improve the time complexity and/or the dimension dependence, we propose another simple algorithm, Follow-the-Regularized-Leader with a regularizer whose convex conjugate is 3rd-order smooth, for OCO with smooth and self-concordant loss functions (such as linear or quadratic losses). We instantiate our algorithm with different regularizers and show that, for example, when the decision set is the $\ell_2$ ball, our algorithm achieves $\tilde{\mathcal{O}}(T^{\frac{2}{5}})$ alternating regret with no dimension dependence (and a better $\tilde{\mathcal{O}}(T^{\frac{1}{3}})$ bound for quadratic losses). We complement our results by showing some algorithm-specific alternating regret lower bounds, including a somewhat surprising $\Omega(\sqrt{T})$ lower bound for a Regret Matching variant that is widely used in alternating learning dynamics.
Session: 5A - Online Learning I (Wednesday 02 July 09:00–10:36)
Authors: Li, Junfan; Liao, Shizhong; Xu, Zenglin; Nie, Liqiang
Abstract:
In this paper, we study the problem of online sparse linear regression (OSLR), where the algorithms are restricted to accessing only $k$ out of $d$ attributes per instance for prediction, a problem proved to be NP-hard. Previous work gave polynomial-time algorithms assuming the data matrix satisfies linear independence of features, the compatibility condition, or the restricted isometry property. We introduce a new polynomial-time algorithm, which significantly improves previous regret bounds \citep{Ito2017Efficient} under the compatibility condition, which is weaker than the other two assumptions. The improvements benefit from a tighter convergence rate of the $\ell_1$-norm error of our estimators. Our algorithm leverages the well-studied Dantzig Selector, but importantly with several novel techniques, including an algorithm-dependent sampling scheme for estimating the covariance matrix, an adaptive parameter tuning scheme, and a batched online Newton step with careful initializations. We also give novel and non-trivial analyses, including an induction method for analyzing the $\ell_1$-norm error, careful analyses of the covariance of non-independent random variables, and a decomposition of the regret. We further extend our algorithm to OSLR with additional observations, where the algorithms can observe additional $k_0$ attributes after each prediction, and improve previous regret bounds \citep{Kale2017Adaptive,Ito2017Efficient}.
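Since the algorithm builds on the Dantzig Selector, here is a minimal sketch of that estimator as a linear program (minimize the $\ell_1$ norm subject to an $\ell_\infty$ bound on the correlated residual); the synthetic data and the choice of $\lambda$ are illustrative assumptions, and the paper's sampling and parameter-tuning schemes are not reproduced.

```python
import numpy as np
from scipy.optimize import linprog

def dantzig_selector(X, y, lam):
    """min ||beta||_1  s.t.  ||X^T (y - X beta)||_inf <= lam, as an LP in beta = u - v."""
    n, d = X.shape
    G, c = X.T @ X, X.T @ y
    # Stack [u; v] >= 0; the objective sum(u) + sum(v) equals ||beta||_1 at the optimum.
    A_ub = np.block([[G, -G], [-G, G]])
    b_ub = np.concatenate([c + lam, lam - c])
    res = linprog(np.ones(2 * d), A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * (2 * d))
    u, v = res.x[:d], res.x[d:]
    return u - v

rng = np.random.default_rng(0)
n, d, k, noise = 100, 20, 3, 0.1
beta_true = np.zeros(d); beta_true[:k] = 1.0
X = rng.normal(size=(n, d))
y = X @ beta_true + noise * rng.normal(size=n)
beta_hat = dantzig_selector(X, y, lam=2 * noise * np.sqrt(n * np.log(d)))
print("l1 error:", np.abs(beta_hat - beta_true).sum())
```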
Time: Wednesday 02 July 09:00–10:36
Session: 5B - Deep Learning (Wednesday 02 July 09:00–10:36)
Authors: Safran, Itay; Reichman, Daniel; Valiant, Paul
Abstract:
We prove an exponential separation between depth 2 and depth 3 neural networks, when approximating a $\mathcal{O}(1)$-Lipschitz target function to constant accuracy, with respect to a distribution with support in the unit ball, under the mild assumption that the weights of the depth 2 network are exponentially bounded. This resolves an open problem posed in \citet{safran2019depth}, and proves that the curse of dimensionality manifests itself in depth 2 approximation, even in cases where the target function can be represented efficiently using a depth 3 network. Previously, lower bounds that were used to separate depth 2 from depth 3 networks required that at least one of the Lipschitz constant, target accuracy or (some measure of) the size of the domain of approximation scale \emph{polynomially} with the input dimension, whereas in our result these parameters are fixed to be \emph{constants} independent of the input dimension: our parameters are simultaneously optimal. Our lower bound holds for a wide variety of activation functions, and is based on a novel application of a worst- to average-case random self-reducibility argument, allowing us to leverage depth 2 threshold circuit lower bounds in a new domain.
Session: 5B - Deep Learning (Wednesday 02 July 09:00–10:36)
Authors: Grillo, Moritz; Froese, Vincent; Skutella, Martin
Abstract:
Neural networks with ReLU activation play a key role in modern machine learning. Understanding the functions represented by ReLU networks is a major topic in current research as this enables a better interpretability of learning processes. Injectivity plays a crucial role whenever invertibility of a neural network is necessary, such as, e.g., for inverse problems or generative models. The exact computational complexity of deciding injectivity was recently posed as an open problem (Puthawala et al. [JMLR 2022]). We answer this question by proving coNP-completeness. On the positive side, we show that the problem for a single ReLU-layer is still tractable for small input dimension; more precisely, we present a parameterized algorithm which yields fixed-parameter tractability with respect to the input dimension. In addition, we study the network verification problem which is of great importance since neural networks are increasingly used in safety-critical systems. We prove that network verification is coNP-hard for a general class of input domains. Our results also exclude constant-factor polynomial-time approximations for the maximum of a function computed by a ReLU network. In this context, we also characterize surjectivity for ReLU networks with one-dimensional output which turns out to be the complement of a basic network verification task. We reveal interesting connections to computational convexity by formulating the surjectivity problem as a zonotope containment problem.
Session: 5B - Deep Learning (Wednesday 02 July 09:00–10:36)
Authors: Chee, Jerry; Backurs, Arturs; Heck, Rainie; Zhang, Li; Kulkarni, Janardhan; Rothvoss, Thomas; Gopi, Sivakanth
Abstract:
Quantizing the weights of a neural network has two steps: (1) Finding a good low bit-complexity representation for weights (which we call the quantization grid) and (2) Rounding the original weights to values in the quantization grid. In this paper, we study the problem of rounding optimally given any quantization grid. The simplest and most commonly used way to round is Round-to-Nearest (RTN). By rounding in a data-dependent way instead, one can improve the quality of the quantized model significantly. We study the rounding problem from the lens of \emph{discrepancy theory}, which studies how well we can round a continuous solution to a discrete solution without affecting solution quality too much. We prove that given $m=\mathrm{poly}\left(\frac{\log n}{\epsilon}\right)$ samples from the data distribution, we can round nearly all $n$ model parameters such that the expected approximation error of the quantized model on the true data distribution is $\le \epsilon$ as long as the space of gradients of the original model is approximately low rank (which we empirically validate). Our algorithm is based on the famous Lovett-Meka algorithm from discrepancy theory and uses sticky Brownian motion to find a good rounding. We also give a simple and practical rounding algorithm called \emph{DiscQuant}, which is inspired by our theoretical insights. In our experiments, we demonstrate that DiscQuant significantly improves over the prior state-of-the-art rounding method called GPTQ and the baseline RTN over a range of benchmarks on Phi3mini-3.8B and Llama3.1-8B. For example, rounding Phi3mini-3.8B to a fixed quantization grid with 3.25 bits per parameter using DiscQuant gets 64\% accuracy on the GSM8k dataset, whereas GPTQ achieves 54\% and RTN achieves 31\% (the original model achieves 84\%).
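For contrast with the data-dependent rounding studied here, a minimal sketch of the Round-to-Nearest baseline on a fixed quantization grid follows; the grid and the toy weights are illustrative assumptions, and DiscQuant itself (sticky Brownian motion in the spirit of Lovett-Meka) is not reproduced.

```python
import numpy as np

def round_to_nearest(weights, grid):
    """RTN baseline: map each weight independently to its nearest grid value."""
    grid = np.sort(np.asarray(grid))
    idx = np.searchsorted(grid, weights)       # insertion position of each weight
    idx = np.clip(idx, 1, len(grid) - 1)       # compare with left and right neighbors
    left, right = grid[idx - 1], grid[idx]
    return np.where(weights - left <= right - weights, left, right)

rng = np.random.default_rng(0)
w = rng.normal(scale=0.05, size=8)             # toy "model weights"
grid = np.linspace(-0.15, 0.15, 2**3)          # a 3-bit uniform grid
print(np.round(w, 3))
print(np.round(round_to_nearest(w, grid), 3))
```

Data-dependent rounding instead chooses, per weight, between the two neighboring grid points so that errors cancel against the data distribution rather than independently per coordinate.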
Session: 5B - Deep Learning (Wednesday 02 July 09:00–10:36)
Authors: Camilli, Francesco; Tieplova, Daria; Barbier, Jean; Bergamin, Eleonora
Abstract:
We rigorously analyse fully-trained neural networks of arbitrary depth in the Bayesian optimal setting in the so-called \emph{proportional scaling regime} where the number of training samples and width of the input and all inner layers diverge proportionally. We prove an information-theoretic equivalence between the Bayesian deep neural network model trained from data generated by a teacher with matching architecture, and a simpler model of optimal inference in a generalized linear model. This equivalence enables us to compute the optimal generalization error for deep neural networks in this regime. We thus prove the ``deep Gaussian equivalence principle'' conjectured in \cite{cui2023optimal}. Our result highlights that in order to escape this ``trivialisation'' of deep neural networks (in the sense of reduction to a linear model) happening in the strongly overparametrized proportional regime, models trained from much more data have to be considered.
Session: 5B - Deep Learning (Wednesday 02 July 09:00–10:36)
Authors: Schechtman, Sholom; Schreuder, Nicolas
Abstract:
We analyze the implicit bias of constant step stochastic subgradient descent (SGD). We consider the setting of binary classification with homogeneous neural networks -- a large class of deep neural networks with $\mathrm{ReLU}$-type activation functions such as MLPs and CNNs without biases. We interpret the dynamics of normalized SGD iterates as an Euler-like discretization of a conservative field flow that is naturally associated to the normalized classification margin. Owing to this interpretation, we show that normalized SGD iterates converge to the set of critical points of the normalized margin at late-stage training (i.e., assuming that the data is correctly classified with positive normalized margin). To our knowledge, this is the first extension of the analysis of Lyu and Li (2020) on the discrete dynamics of gradient descent to the nonsmooth and stochastic setting. Our main result applies to binary classification with exponential or logistic losses. We additionally discuss extensions to more general settings.
Session: 5B - Deep Learning (Wednesday 02 July 09:00–10:36)
Authors: Wang, Zixuan; Nichani, Eshaan; Bietti, Alberto; Damian, Alex; Hsu, Daniel; Lee, Jason; Wu, Denny
Abstract:
Transformer-based language models have demonstrated impressive capabilities across a range of complex reasoning tasks. Prior theoretical work exploring the expressive power of transformers has shown that they can efficiently perform multi-step reasoning tasks involving parallelizable computations. However, the learnability of such constructions, particularly the conditions on the data distribution that enable efficient learning via SGD, remains an open question. Towards answering this question, we study the learnability of a task called the \emph{$k$-fold composition}, which requires computing an interleaved composition of $k$ input permutations and $k$ hidden permutations, and can be expressed by a transformer with $O(\log k)$ layers. On the negative front, we provide a Statistical Query lower bound showing that any learner which is trained on samples from the $k$-fold composition task and makes polynomially many queries must have sample size exponential in $k$, thus establishing a statistical-computational gap. On the other hand, we show that this function class can be efficiently learned, with runtime and sample complexity polynomial in $k$, by gradient descent on an $O(\log k)$-depth transformer via two different curriculum learning strategies: one in which data consists of $k'$-fold composition functions with $k' \le k$ presented in increasing order of difficulty, and another in which all data is presented simultaneously. Our work sheds light on the necessity and sufficiency of having both easy and hard examples in the data distribution for transformers to learn complex compositional tasks.
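As a rough illustration of the task's structure, the sketch below generates samples of an interleaved composition of input and hidden permutations; the exact interleaving convention is an assumption for illustration and may differ from the paper's definition.

```python
import numpy as np

rng = np.random.default_rng(0)
N, k = 10, 3                                        # alphabet size, composition depth
hidden = [rng.permutation(N) for _ in range(k)]     # fixed hidden permutations

def k_fold_composition(inputs, hidden, N):
    """Apply sigma_k . pi_k . ... . sigma_1 . pi_1 (one assumed interleaving order)."""
    out = np.arange(N)
    for pi, sigma in zip(inputs, hidden):
        out = sigma[pi[out]]                        # interleave input and hidden maps
    return out

inputs = [rng.permutation(N) for _ in range(k)]     # the learner observes these ...
label = k_fold_composition(inputs, hidden, N)       # ... and must predict this output
print(label)
```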
Session: 5B - Deep Learning (Wednesday 02 July 09:00–10:36)
Authors: Dayi, Arif Kerem; Chen, Sitan
Abstract:
LoRA has emerged as one of the \emph{de facto} methods for fine-tuning foundation models with low computational cost and memory footprint. The idea is to only train a low-rank perturbation to the weights of a pre-trained model, given supervised data for a downstream task. Despite its empirical success, mathematically it remains poorly understood what learning mechanisms ensure that gradient descent converges to useful low-rank perturbations. In this work we study low-rank fine-tuning in a student-teacher setting. We are given the weights of a two-layer base model $f$, as well as i.i.d. samples $(x,f^*(x))$ where $x$ is Gaussian and $f^*$ is the teacher model given by perturbing the weights of $f$ by a rank-1 matrix. This generalizes the setting of generalized linear model (GLM) regression where the weights of $f$ are zero. When the rank-1 perturbation is comparable in norm to the weight matrix of $f$, we show that the training dynamics are genuinely distinct from both the lazy linearized dynamics of the kernel regime, and the rich feature learning dynamics captured by GLM regression. We prove under mild assumptions that a student model which is initialized at the base model and trained with online SGD will converge to the teacher in $dk^{O(1)}$ iterations, where $k$ is the number of neurons in $f$. Importantly, unlike in the GLM setting, the complexity does not depend on fine-grained properties of the activation's Hermite expansion. We also prove that in our setting, learning the teacher model ``from scratch'' can require significantly more iterations.
Session: 5B - Deep Learning (Wednesday 02 July 09:00–10:36)
Authors: Glasgow, Margalit; Bruna, Joan; Wu, Denny
Abstract:
We study the approximation gap between the dynamics of a polynomial-width neural network and its infinite-width counterpart, both trained using projected gradient descent in the mean-field scaling regime. We demonstrate how to tightly bound this approximation gap through a differential equation governed by the mean-field dynamics. A key factor influencing the growth of this ODE is the \textit{local Hessian} of each particle, defined as the derivative of the particle's velocity in the mean-field dynamics with respect to its position. We apply our results to the canonical feature learning problem of estimating a well-specified single-index model; we permit the information exponent to be arbitrarily large, leading to convergence times that grow polynomially in the ambient dimension $d$. We show that, due to a certain ``self-concordance'' property in these problems — where the local Hessian of a particle is bounded by a constant times the particle's velocity — polynomially many neurons are sufficient to closely approximate the mean-field dynamics throughout training.
Time: Wednesday 02 July 11:10–12:46
Session: 6A - Learning in Games (Wednesday 02 July 11:10–12:46)
Authors: Lazarsfeld, John; Piliouras, Georgios; Sim, Ryann; Wibisono, Andre
Abstract:
This paper investigates the sublinear regret guarantees of two non-no-regret algorithms in zero-sum games: Fictitious Play, and Online Gradient Descent with constant stepsizes. In general adversarial online learning settings, both algorithms may exhibit instability and linear regret due to no regularization (Fictitious Play) or small amounts of regularization (Gradient Descent). However, their ability to obtain tighter regret bounds in two-player zero-sum games is less understood. In this work, we obtain strong new regret guarantees for both algorithms on a class of symmetric zero-sum games that generalize the classic three-strategy Rock-Paper-Scissors to a weighted, $n$-dimensional regime. Under symmetric initializations of the players' strategies, we prove that Fictitious Play with any tiebreaking rule has $O(\sqrt{T})$ regret, establishing a new class of games for which Karlin's Fictitious Play conjecture holds. Moreover, by leveraging a connection between the geometry of the iterates of Fictitious Play and Gradient Descent in the dual space of payoff vectors, we prove that Gradient Descent, for almost all symmetric initializations, obtains a similar $O(\sqrt{T})$ regret bound when its stepsize is a sufficiently large constant. For Gradient Descent, this establishes the first ``fast and furious'' behavior (i.e., sublinear regret without time-vanishing stepsizes but instead using large, constant ones) for zero-sum games larger than $2\times 2$.
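A minimal simulation of Fictitious Play on standard (unweighted) Rock-Paper-Scissors follows, tracking the row player's regret; the payoff matrix and first-index argmax tiebreaking are illustrative assumptions.

```python
import numpy as np

# Rock-Paper-Scissors payoff for the row player: A[i, j] = payoff of action i vs j.
A = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]])
T = 10_000
cx, cy = np.ones(3), np.ones(3)      # empirical counts (symmetric initialization)
hist_y = np.zeros(3)                 # column player's realized action counts
payoff_x = 0.0

for t in range(T):
    x = int(np.argmax(A @ (cy / cy.sum())))      # best response to empirical opponent
    y = int(np.argmax(-(A.T) @ (cx / cx.sum())))
    payoff_x += A[x, y]
    cx[x] += 1
    cy[y] += 1
    hist_y[y] += 1

# Regret of the row player: best fixed action in hindsight minus realized payoff.
regret_x = np.max(A @ hist_y) - payoff_x
print("regret / sqrt(T):", regret_x / np.sqrt(T))
```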
Session: 6A - Learning in Games (Wednesday 02 July 11:10–12:46)
Authors: Chen, Fan; Rakhlin, Alexander
Abstract:
We study the problem of interactive decision making in which the underlying environment changes over time subject to given constraints. We propose a framework, which we call hybrid Decision Making with Structured Observations (hybrid DMSO), that provides an interpolation between the stochastic and adversarial settings of decision making. Within this framework, we can analyze local differentially private (LDP) decision making, query-based learning (in particular, SQ learning), and robust and smooth decision making under the same umbrella, deriving upper and lower bounds based on variants of the Decision-Estimation Coefficient (DEC). We further establish strong connections between the DEC's behavior, the SQ dimension, local minimax complexity, learnability, and joint differential privacy.
Session: 6A - Learning in Games (Wednesday 02 July 11:10–12:46)
Authors: Fujii, Kaito
Abstract:
This paper investigates equilibrium computation and the price of anarchy for Bayesian games, which are the fundamental models of games with incomplete information. In normal-form games with complete information, it is known that efficiently computable no-regret dynamics converge to correlated equilibria, and the price of anarchy for correlated equilibria can be bounded for a broad class of games called smooth games. However, in Bayesian games, as surveyed by Forges (1993), several non-equivalent extensions of correlated equilibria exist, and it remains unclear whether they can be efficiently computed or whether their price of anarchy can be bounded. In this paper, we identify a natural extension of correlated equilibria that can be computed efficiently and is guaranteed to have bounds on the price of anarchy in various games. First, we propose a variant of regret called untruthful swap regret. If each player minimizes it in repeated play of Bayesian games, the empirical distribution of these dynamics is guaranteed to converge to communication equilibria, which is one of the extensions of correlated equilibria proposed by Myerson (1982). We present an efficient algorithm for minimizing untruthful swap regret with a sublinear upper bound, which we prove to be tight in terms of the number of types. As a result, by simulating the dynamics with our algorithm, we can approximately compute a communication equilibrium in polynomial time. Furthermore, we extend existing lower bounds on the price of anarchy based on the smoothness arguments from Bayes--Nash equilibria to equilibria obtained by the proposed dynamics.
Session: 6A - Learning in Games (Wednesday 02 July 11:10–12:46)
Authors: Ito, Shinji; Luo, Haipeng; Tsuchiya, Taira; Wu, Yue
Abstract:
No-regret self-play learning dynamics have become one of the premier ways to solve large-scale games in practice. Accelerating their convergence via improving the regret of the players over the naive $O(\sqrt{T})$ bound after $T$ rounds has been extensively studied in recent years, but almost all studies assume access to exact gradient feedback. We address the question of whether acceleration is possible under bandit feedback only and provide an affirmative answer for two-player zero-sum normal-form games. Specifically, we show that if both players apply the Tsallis-INF algorithm of Zimmert and Seldin (2021), then their regret is at most $O(c_1 \log T + c_2 \sqrt{T})$, where $c_1$ and $c_2$ are game-dependent constants that characterize the difficulty of learning --- $c_1$ resembles the complexity of learning a stochastic multi-armed bandit instance and depends inversely on some gap measures, while $c_2$ can be much smaller than the number of actions when the Nash equilibria have a small support or are close to the boundary. In particular, for the case when a pure strategy Nash equilibrium exists, $c_2$ becomes zero, leading to an optimal instance-dependent regret bound as we show. We additionally prove that in this case our algorithm also enjoys last-iterate convergence and can identify the pure strategy Nash equilibrium with near-optimal sample complexity.
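For readers unfamiliar with Tsallis-INF, a minimal single-player sketch follows: the action distribution solves the 1/2-Tsallis-regularized leader via a bisection over the normalization multiplier, and losses are estimated by importance weighting. The learning-rate schedule and the plain (rather than reduced-variance) loss estimator are simplifying assumptions; the paper plays two such learners against each other.

```python
import numpy as np

def tsallis_inf_probs(L_hat, eta):
    """Solve p_i = 1 / (eta * (L_hat_i - x))^2 with x < min(L_hat) such that sum(p) = 1."""
    n = len(L_hat)
    lo = L_hat.min() - np.sqrt(n) / eta   # here sum(p) <= 1
    hi = L_hat.min() - 1.0 / eta          # here the min arm alone contributes 1
    for _ in range(60):                   # bisection: sum(p) is increasing in x
        x = (lo + hi) / 2
        s = np.sum(1.0 / (eta * (L_hat - x)) ** 2)
        lo, hi = (x, hi) if s < 1 else (lo, x)
    p = 1.0 / (eta * (L_hat - x)) ** 2
    return p / p.sum()

rng = np.random.default_rng(0)
n, T = 3, 5000
mean_loss = np.array([0.5, 0.6, 0.9])     # toy stochastic environment
L_hat = np.zeros(n)
for t in range(1, T + 1):
    eta = 1.0 / np.sqrt(t)                # assumed learning-rate schedule
    p = tsallis_inf_probs(L_hat, eta)
    i = rng.choice(n, p=p)
    loss = float(rng.random() < mean_loss[i])
    L_hat[i] += loss / p[i]               # importance-weighted loss estimate

print("final play distribution:", np.round(p, 3))
```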
Session: 6A - Learning in Games (Wednesday 02 July 11:10–12:46)
Authors: Dagan, Yuval; Assos, Angelos; Rajaraman, Nived
Abstract:
Online learning algorithms are widely used in strategic multi-agent settings, including repeated auctions, contract design, and pricing competitions, where agents adapt their strategies over time. A key question in such environments is how an optimizing agent can best respond to a learning agent to improve its own long-term outcomes. While prior work has developed efficient algorithms for the optimizer in special cases—such as structured auction settings or contract design—no general efficient algorithm is known. In this paper, we establish a strong computational hardness result: unless $\mathsf{P} = \mathsf{NP}$, no polynomial-time optimizer can compute a near-optimal strategy against a learner using a standard no-regret algorithm, specifically Multiplicative Weights Update (MWU). Our result proves an $\Omega(T)$ hardness bound, significantly strengthening previous work that only showed an additive $\Theta(1)$ impossibility result. Furthermore, while prior hardness results focused on learners using fictitious play—an algorithm that is not no-regret—we prove intractability for a widely used no-regret learning algorithm. This establishes a fundamental computational barrier to finding optimal strategies in general game-theoretic settings.
Session: 6A - Learning in Games (Wednesday 02 July 11:10–12:46)
Authors: Kornowski, Guy; Shamir, Ohad
Abstract:
We study the problem of solving matrix games of the form $\max_{\mathbf{w}\in\mathcal{W}}\min_{\mathbf{p}\in\Delta}\mathbf{p}^{\top}A\mathbf{w}$, where $A$ is some matrix and $\Delta$ is the probability simplex. This problem encapsulates canonical tasks such as finding a linear separator and computing Nash equilibria in zero-sum games. However, perhaps surprisingly, its inherent complexity (as formalized in the standard framework of oracle complexity (Nemirovski and Yudin, 1983)) is not well-understood. In this work, we first identify different oracle models which are implicitly used by prior algorithms, amounting to multiplying the matrix $A$ by a vector from either one or both sides. We then prove complexity lower bounds for algorithms under both access models, which in particular imply a separation between them. Specifically, we start by proving that algorithms for linear separability based on one-sided multiplications must require $\Omega(\gamma_A^{-2})$ iterations, where $\gamma_A$ is the margin, as matched by the Perceptron algorithm. We then prove that accelerated algorithms for this task, which utilize multiplications from both sides, must require $\tilde{\Omega}(\gamma_{A}^{-2/3})$ iterations, establishing the first oracle complexity barrier for such algorithms. Finally, by adapting our lower bound to $\ell_1$ geometry, we prove that computing an $\epsilon$-approximate Nash equilibrium requires $\tilde{\Omega}(\epsilon^{-2/5})$ iterations, which is an exponential improvement over the previously best-known lower bound due to Hadiji et al. (2024).
Session: 6A - Learning in Games (Wednesday 02 July 11:10–12:46)
Authors: Rossellini, Raphael; Soloff, Jake; Foygel Barber, Rina; Ren, Zhimei; Willett, Rebecca
Abstract:
Forecast probabilities often serve as critical inputs for binary decision making. In such settings, calibration---ensuring forecasted probabilities match empirical frequencies---is essential. Although the common notion of Expected Calibration Error (ECE) provides actionable insights for decision making, it is not testable: it cannot be empirically estimated in many practical cases. Conversely, the recently proposed Distance to Calibration (dCE) is testable but is not actionable since it lacks decision-theoretic guarantees needed for high-stakes applications. We introduce Cutoff Calibration Error, a new calibration measure that bridges this gap by measuring expected deviations over arbitrary forecast intervals. We show that Cutoff Calibration Error can be efficiently estimated and examine its implications for popular post-hoc calibration methods, such as isotonic regression and Platt scaling.
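As a rough illustration of measuring deviations over forecast intervals, the sketch below computes, over all contiguous blocks of sorted forecasts, the largest absolute average bias; this is a plausible reading of an interval-based calibration measure for illustration only, not the paper's exact definition of Cutoff Calibration Error.

```python
import numpy as np

def max_interval_bias(p, y):
    """Largest |average of (y - p)| over contiguous blocks of forecasts sorted by p.

    A brute-force O(n^2) illustration of interval-based calibration deviations.
    """
    order = np.argsort(p)
    r = (y - p)[order]                     # residuals sorted by forecast value
    n = len(r)
    prefix = np.concatenate([[0.0], np.cumsum(r)])
    worst = 0.0
    for i in range(n):
        for j in range(i + 1, n + 1):      # block of forecasts p_(i) .. p_(j-1)
            worst = max(worst, abs(prefix[j] - prefix[i]) / n)
    return worst

rng = np.random.default_rng(0)
p = rng.random(300)
y = (rng.random(300) < p).astype(float)    # outcomes drawn to be well calibrated
print("interval deviation:", max_interval_bias(p, y))
```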
Session: 6A - Learning in Games (Wednesday 02 July 11:10–12:46)
Authors: Tsuchiya, Taira; Ito, Shinji; Luo, Haipeng
Abstract:
Learning in games refers to scenarios where multiple players interact in a shared environment, each aiming to minimize their regret. It is well known that an equilibrium can be computed at a fast rate of $O(1/T)$ when all players follow the optimistic follow-the-regularized-leader (OFTRL). However, this acceleration is limited to the honest regime, in which all players fully adhere to a prescribed algorithm---a situation that may not be realistic in practice. To address this issue, we present corrupted learning dynamics that adaptively find an equilibrium at a rate that depends on the extent to which each player deviates from the strategy suggested by the prescribed algorithm. First, in two-player zero-sum corrupted games, we provide learning dynamics for which the external regret of $x$-player (and similarly for $y$-player) is roughly bounded by $O(\log (m_x m_y) + \sqrt{\hat{C}_y} + \hat{C}_x)$, where $m_x$ and $m_y$ denote the number of actions of $x$- and $y$-players, respectively, and $\hat{C}_x$ and $\hat{C}_y$ represent their cumulative deviations. We then extend our approach to multi-player general-sum corrupted games, providing learning dynamics for which the swap regret of player $i$ is bounded by $O(\log T + \sqrt{\sum_{k} \hat{C}_k \log T} + \hat{C}_i)$ ignoring dependence on the number of players and actions, where $\hat{C}_i$ is the cumulative deviation of player $i$ from the prescribed algorithm. Our learning dynamics are agnostic to the levels of corruption. A key technical contribution is a new analysis that ensures the stability of a Markov chain under a new adaptive learning rate, thereby allowing us to achieve the desired bound in the corrupted regime while matching the best existing bound in the honest regime. Notably, our framework can be extended to address not only corruption in strategies but also corruption in the observed expected utilities, and we provide several matching lower bounds.
Time: Wednesday 02 July 11:10–13:00
Session: 6B - Convex Optimization (Wednesday 02 July 11:10–13:00)
Authors: Yang, Pengkun; Zhang, Jingzhao
Abstract:
We study the scaling of error rates with respect to the size of the training dataset. In contrast to classical results where rates are minimax optimal for a problem class, this work starts with an empirical observation that, even for a fixed data distribution, the error scaling can have \emph{diverse} rates across different ranges of sample size. To understand when and why the error rate is non-uniform, we theoretically analyze nearest neighbor classifiers. We show that an error scaling law can have fine-grained rates: in the early phase, the test error depends polynomially on the data dimension and decreases fast; whereas in the later phase, the error depends exponentially on the data dimension and decreases slowly. Our analysis highlights the complexity of the data distribution in determining the test error. When the data is benignly distributed, we show that the generalization error of the nearest neighbor classifier can depend polynomially, instead of exponentially, on the data dimension.
Session: 6B - Convex Optimization (Wednesday 02 July 11:10–13:00)
Authors: Liang, Daniel; Chia, Nai-Hui; Song, Fang
Abstract:
We establish connections between state tomography, pseudorandomness, quantum state synthesis, and circuit lower bounds. In particular, let $\mathfrak C$ be a family of non-uniform quantum circuits of polynomial size and suppose that there exists an algorithm that, given copies of $\ket \psi$, distinguishes whether $\ket \psi$ is produced by $\mathfrak C$ or is Haar random, promised one of these is the case. For arbitrary fixed constant $c$, we show that if the algorithm uses at most $O\left(2^{n^c}\right)$ time and $2^{n^{0.99}}$ samples then $\mathsf{stateBQE} \not\subset \mathsf{state}\mathfrak{C}$. Here $\mathsf{stateBQE} \coloneqq \mathsf{stateBQTIME}\left[2^{O(n)}\right]$ and $\mathsf{state}\mathfrak{C}$ are state synthesis complexity classes as introduced by Rosenthal and Yuen (2022), which capture problems with classical inputs but quantum output. Note that efficient tomography implies a similarly efficient distinguishing algorithm against Haar random states, even for nearly exponential-time algorithms. Because every state produced by a polynomial-size circuit can be learned with $2^{O(n)}$ samples and time, or $O\left(n^{\omega(1)}\right)$ samples and $2^{O(n^{\omega(1)})}$ time, we show that even slightly non-trivial quantum state tomography algorithms would lead to new statements about quantum state synthesis. Finally, a slight modification of our proof shows that distinguishing algorithms for quantum states can imply circuit lower bounds for decision problems as well. We then take these results and port them over to the setting of unitary learning and unitary synthesis. All combined, this helps shed light on why time-efficient tomography algorithms for non-uniform quantum circuit classes have seen only limited and partial progress.
Session: 6B - Convex Optimization (Wednesday 02 July 11:10–13:00)
Authors: Chen, Lesi; Liu, Chengchang; Luo, Luo; Zhang, Jingzhao
Abstract:
Previous algorithms can solve the convex-concave minimax problems $\min_{x \in \mathcal{X}} \max_{y \in \mathcal{Y}} f(x,y)$ with $\mathcal{O}(\epsilon^{-2/3})$ second-order oracle calls using Newton-like methods. This result has long been speculated to be optimal because the upper bound is achieved by a natural generalization of the optimal first-order method. In this work, we show an improved upper bound of $\tilde{\mathcal{O}}(\epsilon^{-4/7})$ by generalizing the optimal second-order method for convex optimization. We further apply a similar technique to lazy Hessian algorithms and show that our proposed algorithm can also be seen as a second-order ``Catalyst'' framework that could accelerate any globally convergent algorithms for solving minimax problems.
Session: 6B - Convex Optimization (Wednesday 02 July 11:10–13:00)
Authors: Bok, Jinho; Altschuler, Jason
Abstract:
Surprisingly, recent work has shown that gradient descent can be accelerated without using momentum---just by judiciously choosing stepsizes. An open question raised by several papers is whether this phenomenon of stepsize-based acceleration holds more generally for constrained and/or composite convex optimization via projected and/or proximal versions of gradient descent. We answer this in the affirmative by proving that the silver stepsize schedule yields analogously accelerated rates in these settings. These rates are conjectured to be asymptotically optimal among all stepsize schedules, and match the silver convergence rate of vanilla gradient descent (Altschuler and Parrilo, 2024), namely $O(\varepsilon^{-\log_\rho 2})$ for smooth convex optimization and $O(\kappa^{\log_\rho 2} \log \frac{1}{\varepsilon})$ under strong convexity, where $\varepsilon$ is the precision, $\kappa$ is the condition number, and $\rho = 1 + \sqrt{2}$ is the silver ratio. The key technical insight is the combination of recursive gluing---the technique underlying all analyses of gradient descent accelerated with time-varying stepsizes---with a certain Laplacian-structured sum-of-squares certificate for the analysis of proximal point updates.
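For concreteness, here is a small sketch of the (unconstrained) silver stepsize schedule applied to gradient descent on a smooth convex quadratic; the recursive construction shown is one common presentation of the Altschuler-Parrilo schedule, and the test problem is an illustrative assumption.

```python
import numpy as np

RHO = 1 + np.sqrt(2)  # the silver ratio

def silver_schedule(k):
    """Stepsizes of length 2^k - 1, built recursively: s_{j+1} = [s_j, 1 + RHO**(j-1), s_j]."""
    s = [np.sqrt(2)]
    for j in range(1, k):
        s = s + [1 + RHO ** (j - 1)] + s
    return np.array(s)

# Gradient descent with silver stepsizes on f(x) = 0.5 * x^T A x (L-smooth with L = 1),
# so the stepsizes h are already in units of 1/L.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(50, 50)))
A = Q @ np.diag(np.linspace(1e-3, 1.0, 50)) @ Q.T
x = rng.normal(size=50)
for h in silver_schedule(10):
    x = x - h * (A @ x)     # the occasional long steps are what buy acceleration
print("f(x_T) =", 0.5 * x @ A @ x)
```

Note that individual long steps can transiently increase the objective; the acceleration guarantee holds for the schedule as a whole, which is why the analysis glues blocks of steps together rather than arguing per-iteration descent.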
Session: 6B - Convex Optimization (Wednesday 02 July 11:10–13:00)
Authors: Contreras, Juan Pablo; Guzmán, Cristobal; Martínez-Rubio, David
Abstract:
We develop algorithms for the optimization of convex objectives that have Hölder continuous $q$-th derivatives by using a $q$-th order oracle, for any $q \geq 1$. Our algorithms work for general norms under mild conditions, including the $\ell_p$-settings for $1\leq p\leq \infty$. We can also optimize structured functions that allow for inexactly implementing a non-Euclidean ball optimization oracle. We do this by developing a non-Euclidean inexact accelerated proximal point method that makes use of an \textit{inexact uniformly convex regularizer}. We show a lower bound for general norms that demonstrates our algorithms are nearly optimal in high-dimensions in the black-box oracle model for $\ell_p$-settings and all $q \geq 1$, even in randomized and parallel settings. This new lower bound, when applied to the first-order smooth case, resolves an open question in parallel convex optimization.
Session: 6B - Convex Optimization (Wednesday 02 July 11:10–13:00)
Authors: Zhang, Zihan; Lee, Jason; Du, Simon; Chen, Yuxin
Abstract:
This work investigates stepsize-based acceleration of gradient descent with {\em anytime} convergence guarantees. For smooth (non-strongly) convex optimization, we propose a stepsize schedule that allows gradient descent to achieve convergence guarantees of $O\big(T^{-\frac{2\log_2\rho}{1+\log_2\rho}}\big) \approx O(T^{-1.119})$ for any stopping time $T$, where $\rho=\sqrt{2}+1$ is the silver ratio and the stepsize schedule is predetermined without prior knowledge of the stopping time. This result provides an affirmative answer to a COLT open problem regarding whether stepsize-based acceleration can yield anytime convergence rates of $o(T^{-1})$. We further extend our theory to yield anytime convergence guarantees of $\exp(-\Omega(T/\kappa^{0.893}))$ for smooth and strongly convex optimization, with $\kappa$ being the condition number.
Session: 6B - Convex Optimization (Wednesday 02 July 11:10–13:00)
Authors: Bai, Site; Bullins, Brian
Abstract:
Recent advances (Sherman, 2017; Sidford and Tian, 2018; Cohen et al., 2021) have overcome the fundamental barrier of dimension dependence in the iteration complexity of solving $\ell_\infty$ regression with first-order methods. Yet it remains unclear to what extent such acceleration can be achieved for general $\ell_p$ smooth functions. In this paper, we propose a new accelerated first-order method for convex optimization under non-Euclidean smoothness assumptions. In contrast to standard acceleration techniques, our approach uses primal-dual iterate sequences taken with respect to \emph{differing} norms, which are then coupled using an \emph{implicitly} determined interpolation parameter. For $\ell_p$ norm smooth problems in $d$ dimensions, our method provides an iteration complexity improvement of up to $O(d^{1-\frac{2}{p}})$ in terms of calls to a first-order oracle, thereby allowing us to circumvent long-standing barriers in accelerated non-Euclidean steepest descent.
Session: 6B - Convex Optimization (Wednesday 02 July 11:10–13:00)
Authors: Jiang, Ruichen; Maladkar, Devyani; Mokhtari, Aryan
Abstract:
Adaptive gradient methods, such as AdaGrad, are among the most successful optimization algorithms for neural network training. While these methods are known to achieve better dimensional dependence than stochastic gradient descent (SGD) for stochastic convex optimization under favorable geometry, the theoretical justification for their success in stochastic non-convex optimization remains elusive. In fact, under standard assumptions of Lipschitz gradients and bounded noise variance, it is known that SGD is worst-case optimal (up to absolute constants) in terms of finding a near-stationary point with respect to the $\ell_2$-norm, making further improvements impossible. Motivated by this limitation, we introduce refined assumptions on the smoothness structure of the objective and the gradient noise variance, which better suit the coordinate-wise nature of adaptive gradient methods. Moreover, we adopt the $\ell_1$-norm of the gradient as the stationarity measure, as opposed to the standard $\ell_2$-norm, to align with the coordinate-wise analysis and obtain tighter convergence guarantees for AdaGrad. Under these new assumptions and the $\ell_1$-norm stationarity measure, we establish an upper bound on the convergence rate of AdaGrad and a corresponding lower bound for SGD. In particular, we identify non-convex settings in which the iteration complexity of AdaGrad is favorable over SGD and show that, for certain configurations of problem parameters, it outperforms SGD by a factor of $d$, where $d$ is the problem dimension. To the best of our knowledge, this is the first result to demonstrate a provable gain of adaptive gradient methods over SGD in a non-convex setting. We also present supporting lower bounds, including one specific to AdaGrad and one applicable to general deterministic first-order methods, showing that our upper bound for AdaGrad is tight and unimprovable up to a logarithmic factor under certain conditions.
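As a reference point for the method being analyzed, a minimal coordinate-wise AdaGrad update is sketched below on a toy anisotropic stochastic objective; the base learning rate and noise model are illustrative assumptions, not the paper's refined smoothness and variance conditions.

```python
import numpy as np

# Coordinate-wise AdaGrad: each coordinate's step is scaled by the inverse root of
# its accumulated squared gradients, adapting to anisotropic (coordinate-wise) geometry.
rng = np.random.default_rng(0)
d = 100
scales = np.logspace(-2, 0, d)         # ill-conditioned quadratic f(x) = 0.5 * sum(scales * x^2)
x = np.ones(d)
G = np.zeros(d)                        # running sum of squared gradients, per coordinate
eta, eps = 0.5, 1e-8

for t in range(2000):
    g = scales * x + 0.01 * rng.normal(size=d)   # stochastic gradient
    G += g ** 2
    x -= eta * g / (np.sqrt(G) + eps)            # per-coordinate adaptive step

print("f(x) =", 0.5 * np.sum(scales * x ** 2))
```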
Session: 6B - Convex Optimization (Wednesday 02 July 11:10–13:00)
Authors: Saad, El Mehdi; Lee, Wei-Cheng; Orabona, Francesco
Abstract:
We study fundamental limits of first-order stochastic optimization in a range of non-convex settings, including L-smooth functions satisfying Quasar-Convexity (QC), Quadratic Growth (QG), and Restricted Secant Inequalities (RSI). While the convergence properties of standard algorithms are well-understood in deterministic regimes, significantly fewer results address the stochastic case, where only unbiased and noisy gradients are available. We establish new lower bounds on the number of noisy gradient queries to minimize these classes of functions, also showing that they are tight (up to a logarithmic factor) in all the relevant quantities characterizing each class. Our approach reformulates the optimization task as a function identification problem, leveraging \emph{divergence decomposition} arguments to construct a challenging subclass that leads to sharp lower bounds. Furthermore, we present a specialized algorithm in the one-dimensional setting that achieves faster rates, suggesting that certain dimensional thresholds are intrinsic to the complexity of non-convex stochastic optimization.
Time: Thursday 03 July 09:00–10:00
Session: 7A - Clustering and Graphs (Thursday 03 July 09:00–10:00)
Authors: Black, Hadley; Mazumdar, Arya; Saha, Barna
Abstract:
We consider the basic problem of learning an unknown partition of $n$ elements into at most $k$ sets using simple queries that reveal information about a small subset of elements. Our starting point is the popular and well-studied pairwise same-set queries which ask if a pair of elements belong to the same class. It is well-known that non-adaptive (fully parallel) algorithms require $\Theta(n^2)$ queries, while adaptive (fully sequential) algorithms require $\Theta(nk)$ queries, and the best known algorithm uses $k-1$ rounds of adaptivity. Many variations of this problem have been studied due to its connections to clustering and active learning. In these applications, it is of interest to reduce adaptivity while minimizing the query complexity. In this paper, we give a complete characterization of the query complexity of this problem as a function of the number of rounds, $r$, which interpolates smoothly between the non-adaptive and adaptive settings: for any constant $r \geq 1$, the query complexity is $\smash{\Theta(n^{1+\frac{1}{2^r-1}}k^{1-\frac{1}{2^r-1}})}$. Additionally, our algorithm only needs $O(\log \log n)$ rounds to attain the optimal $O(nk)$ query complexity, which is a double-exponential improvement over prior works when $k$ is a polynomial in $n$. Next, we consider two natural generalizations of pairwise queries to general subsets $S$ of size at most $s$: (1) weak subset queries which return the number of classes intersected by $S$, and (2) strong subset queries which return the entire partition restricted on $S$. For non-adaptive algorithms, we show $\Omega(n^2/s^2)$ strong queries are needed. In contrast, perhaps surprisingly, we show that there is a non-adaptive algorithm using weak queries that matches this bound up to log-factors for all $s \leq \sqrt{n}$. More generally, we obtain nearly matching upper and lower bounds for algorithms using weak and strong queries in terms of both the number of rounds, $r$, and the query size bound, $s$.
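The round/query trade-off stated above can be tabulated directly; a small Python check (ours) showing how the exponents interpolate from the non-adaptive $\Theta(n^2)$ bound at $r=1$ toward the adaptive $\Theta(nk)$ bound:

    for r in range(1, 6):
        e = 1 / (2 ** r - 1)
        print(r, f"n^{1 + e:.3f} k^{1 - e:.3f}")   # r=1 gives n^2.000 k^0.000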
Session: 7A - Clustering and Graphs (Thursday 03 July 09:00–10:00)
Authors: Chakraborty, Diptarka; Das, Debarati; Chatterjee, Kushagra; Nguyen, Tien Long; Nobahari, Romina
Abstract:
Consensus clustering, a fundamental task in machine learning and data analysis, aims to aggregate multiple input clusterings of a dataset, potentially based on different non-sensitive attributes, into a single clustering that best represents the collective structure of the data. In this work, we study this fundamental problem through the lens of fair clustering, as introduced by Chierichetti et al. [NeurIPS'17], which incorporates the disparate impact doctrine to ensure proportional representation of each protected group in the dataset within every cluster. Our objective is to find a consensus clustering that is not only representative but also fair with respect to specific protected attributes. To the best of our knowledge, we are the first to address this problem and provide a constant-factor approximation. As part of our investigation, we examine how to minimally modify an existing clustering to enforce fairness -- an essential postprocessing step in many clustering applications that require fair representation. We develop an optimal algorithm for datasets with equal group representation and near-linear time constant factor approximation algorithms for more general scenarios with different proportions of two group sizes. Given the fundamental nature of this problem, we believe our results on Closest Fair Clustering could have broader implications for other clustering problems, particularly those for which no prior approximation guarantees exist for their fair variants.
Session: 7A - Clustering and Graphs (Thursday 03 July 09:00–10:00)
Authors: Shin, Kijun; Fan, Chenglin
Abstract:
Clustering is a fundamental task in unsupervised learning. Previous research has focused on learning-augmented $k$-means in Euclidean metrics, limiting its applicability to complex data representations. In this paper, we generalize learning-augmented $k$-clustering to operate on general metrics, enabling its application to graph-structured and non-Euclidean domains. Our framework also relaxes restrictive cluster size constraints, providing greater flexibility for datasets with imbalanced or unknown cluster distributions. Furthermore, we extend hardness results on query complexity to general metrics: under the Exponential Time Hypothesis (ETH), we show that any polynomial-time algorithm must perform $\Omega(k / \alpha)$ queries to achieve a $(1 + \alpha)$-approximation. These contributions enhance both the theoretical foundations and the practical scope of learning-augmented clustering, addressing critical gaps between traditional methods and real-world challenges.
Session: 7A - Clustering and Graphs (Thursday 03 July 09:00–10:00)
Authors: Nakul, Milind; Muthukumar, Vidya; Pananjady, Ashwin
Abstract:
Suppose we observe a trajectory of length n from an α-mixing stochastic process over a finite but potentially large state space. We consider the problem of estimating the probability mass placed by the stationary distribution of any such process on elements that occur with a certain frequency in the observed sequence. We estimate this vector of probabilities in total variation distance, showing consistency in n and recovering known results for i.i.d. sequences as special cases. Our proposed methodology carefully combines the plug-in estimator with a recently-proposed modification of the Good–Turing estimator called WINGIT, which was originally developed for Markovian sequences. En route to controlling the error of our estimator, we develop new performance bounds on WINGIT and the plug-in estimator for α-mixing stochastic processes. Importantly, the extensively used method of Poissonization can no longer be applied in our non i.i.d. setting, and so we develop complementary tools—including an empirical Bernstein bound for mixing sequences—that may prove independently useful in the design and analysis of estimators for related problems.
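For context, the classical Good–Turing idea that WINGIT modifies estimates the total mass on symbols seen exactly r times by $(r+1)N_{r+1}/n$; a minimal sketch (the classical estimator only, not WINGIT itself):

    from collections import Counter

    def good_turing_mass(sample, r):
        # (r + 1) * N_{r+1} / n, where N_j counts symbols appearing j times
        n = len(sample)
        freq_of_freq = Counter(Counter(sample).values())
        return (r + 1) * freq_of_freq.get(r + 1, 0) / n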
Session: 7A - Clustering and Graphs (Thursday 03 July 09:00–10:00)
Authors: Das, Syamantak; Galhotra, Sainyam; Li, Wen-Zhi; Raychaudhury, Rahul; Sintos, Stavros
Abstract:
Traditional clustering methods assume precise pairwise distances, but this is often impractical when dealing with images, videos, or natural language. This paper studies clustering and graph problems where direct access to distances is infeasible. We use oracle-based methods, as defined by~\cite{galhotra2024k}: the quadruplet oracle (weak, compares two pairs) and the distance oracle (strong, returns exact distances), under adversarial and probabilistic noise. These oracles can be implemented via crowdsourcing or predictive models. We consider a finite metric space $\Sigma=(\V,d)$ of size $|\V|=n$ with both oracles. When the dataset has low intrinsic (doubling) dimension, for $k$-center, $k$-median, and $k$-means clustering, we design constant approximation algorithms with $O((n+k^2)\cdot \polylog(n))$ quadruplet queries and $O(\polylog(n))$ distance queries. In general metric spaces, we achieve constant approximation with $O(k \cdot n \cdot \polylog(n))$ quadruplet queries and $O(\polylog(n))$ distance queries, improving the quadruplet query complexity by a factor of $k$ and distance query complexity by $k^2$ over~\cite{galhotra2024k}. If the dataset spread is polynomially bounded, we build a data structure using $O(n\cdot \polylog(n))$ quadruplet and $O(\polylog(n))$ distance queries to approximate any pairwise distance $\dist(u,v)$. This enables constant approximation algorithms for $k$-clustering with $O(n\cdot \polylog(n))$ quadruplet and $O(\polylog(n))$ distance queries. Even without bounded spread, our approach extends to graph problems like Minimum Spanning Tree, demonstrating its generality in oracle-based clustering and graph algorithms.
Time: Thursday 03 July 09:00–10:00
Session: 7B - Online Learning II (Thursday 03 July 09:00–10:00)
Authors: Gao, Wenzhi; Chu, Ya-Chi; Ye, Yinyu; Udell, Madeleine
Abstract:
We introduce a framework to accelerate the convergence of gradient-based methods with online learning. The framework learns to scale the gradient at each iteration through an online learning algorithm and provably accelerates gradient-based methods asymptotically. In contrast with previous literature, where convergence is established based on worst-case analysis, our framework provides a strong convergence guarantee with respect to the optimal stepsize for the iteration trajectory. For smooth strongly convex optimization, our framework provides an $O(\kappa^\star \log(1/\varepsilon))$ asymptotic complexity result, where $\kappa^\star$ is the condition number achievable by the optimal preconditioner, improving on the previous $O(\sqrt{n}\kappa^\star \log(1/\varepsilon))$ result. For smooth convex optimization, we obtain the first convergence guarantee for the widely-used hypergradient descent heuristic.
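The hypergradient descent heuristic mentioned at the end updates the stepsize itself by a gradient step, using the inner product of consecutive gradients; a minimal sketch of the classic heuristic (our illustration; the paper's framework is more general):

    import numpy as np

    def hypergradient_descent(grad, x0, lr0=0.01, beta=1e-4, steps=100):
        x, lr, g_prev = np.asarray(x0, float), lr0, None
        for _ in range(steps):
            g = grad(x)
            if g_prev is not None:
                lr += beta * float(g @ g_prev)  # stepsize's own gradient step
            x = x - lr * g
            g_prev = g
        return x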
Session: 7B - Online Learning II (Thursday 03 July 09:00–10:00)
Authors: Appel, Alexander; Kosoy, Vanessa
Abstract:
We propose a framework which generalizes "decision making with structured observations" from Foster et al. (2023) by allowing robust (i.e. multivalued) models. In this framework, each model associates each decision with a convex set of probability distributions over outcomes. Nature can choose distributions out of this set in an arbitrary (adversarial) manner, that can be non-oblivious and depend on past history. The resulting framework offers much greater generality than classical bandits and reinforcement learning, since the realizability assumption becomes much weaker and more realistic. We then derive a theory of regret bounds for this framework, which extends the "decision-estimation coefficients" of Foster et al. (2023). Although our lower and upper bounds are not tight, they are sufficient to fully characterize power-law learnability. We demonstrate this theory in two special cases: robust linear bandits (previously studied in Kosoy (2024)) and tabular robust online reinforcement learning (previously studied in Tian et al. (2021)). In both cases, we derive regret bounds that improve the state-of-the-art (except that we do not address computational efficiency).
Session: 7B - Online Learning II (Thursday 03 July 09:00–10:00)
Authors: Wan, Yuanyu
Abstract:
We investigate decentralized online convex optimization (D-OCO) in changing environments, and choose adaptive regret and dynamic regret as the performance metrics. Specifically, these two metrics compare each local learner against the optimal comparator over every interval, and any sequence of comparators over all rounds, respectively. It is well-known that in the centralized setting, plenty of algorithms with (nearly) optimal bounds on these two metrics have been proposed. However, none of them has been extended to D-OCO, possibly due to the difficulty in handling their commonly used two-level structure. To fill the gap, in this paper, we propose black-box reductions from minimizing these two metrics of D-OCO to minimizing them in the centralized setting. Let $n$, $\rho$, and $T$ denote the number of local learners, the spectral gap of the communication matrix, and the time horizon, respectively. For adaptive regret, our reduction can achieve an $\tilde{O}(n\rho^{-1/4}\sqrt{\tau}\log T)$ bound over any interval of length $\tau$ in general, and an improved one of $\tilde{O}(n\rho^{-1/2}(\log T)^3)$ when facing strongly convex functions. These two bounds match existing lower bounds up to polylogarithmic factors. For dynamic regret, our reduction can achieve an $\tilde{O}(n\rho^{-1/4}\sqrt{T(1+P_T)\log T})$ bound in general, where $P_T$ is the path-length of comparators. We also provide the first lower bound for dynamic regret of D-OCO to demonstrate that our dynamic regret is nearly optimal.
Session: 7B - Online Learning II (Thursday 03 July 09:00–10:00)
Authors: Devulapalli, Pramith; Hanneke, Steve; Shaeiri, Amirreza
Abstract:
We study self-directed online learning, a variant of the online learning framework in which the learner adaptively selects the instances on which it makes predictions. Our paper is centrally focused on characterizing learning rates by quantifying the growth of the self-directed mistake bound of a concept class $C$ as a function of the number of rounds $T$. We present our results across three different learning environments: deterministic learning, randomized learning, and agnostic learning. In the deterministic case, we introduce the perfect self-directed dimension, $\mathrm{PSD}(C)$, built from the self-directed dimension $\mathrm{SD}(C)$, and show that if $\mathrm{PSD}(C) = \infty$, then the mistake bound grows as $\Theta(T)$. In the scenario when $\mathrm{PSD}(C) < \infty$ but $\mathrm{SD}(C) = \infty$, we prove an $O(\mathrm{PSD}(C)\log T)$ upper bound using a combinatorial technique based on shattered sets. When deriving lower bounds under this scenario, our analysis reveals a surprising phenomenon: the self-directed mistake bound can realize an entire spectrum of rates between $\omega(1)$ and $O(\mathrm{PSD}(C)\log T)$. Specifically, we show that for any unbounded rate function $R(T) = O(\log T)$ obeying certain mild properties, there exists a concept class $C_R$ whose mistake bound grows as $\Theta(R(T))$. In randomized self-directed learning, we present a dichotomy of rates under an oblivious adversary using the DS dimension and characterize rates for an adaptive adversary using similar conditions on $\mathrm{PSD}$ and $\mathrm{SD}$ from deterministic learning. Finally, we study agnostic self-directed learning rates under an oblivious adversary and present sublinear regret bounds when the DS dimension is finite.
Session: 7B - Online Learning II (Thursday 03 July 09:00–10:00)
Authors: Jia, Zeyu; Polyanskiy, Yury; Rakhlin, Alexander
Abstract:
We study the problem of sequential probability assignment under logarithmic loss, both with and without side information. Our objective is to analyze the minimax regret—a notion extensively studied in the literature—in terms of geometric quantities, such as covering numbers and scale-sensitive dimensions. We show that the minimax regret for the case of no side information (equivalently, the Shtarkov sum) can be upper bounded in terms of sequential square-root entropy, a notion closely related to Hellinger distance. For the problem of sequential probability assignment with side information, we develop both upper and lower bounds based on the aforementioned entropy. The lower bound matches the upper bound, up to log factors, for classes in the Donsker regime (according to our definition of entropy).
Time: Thursday 03 July 10:30–11:20
Session: 8A - Non-Convex Optimization (Thursday 03 July 10:30–11:20)
Authors: Stoger, Dominik; Zhu, Yizhe
Abstract:
For the problem of reconstructing a low-rank matrix from a few linear measurements, two classes of algorithms have been widely studied in the literature: convex approaches based on nuclear norm minimization, and non-convex approaches that use factorized gradient descent. Under certain statistical model assumptions, it is known that nuclear norm minimization recovers the ground truth as soon as the number of samples scales linearly with the number of degrees of freedom of the ground truth. In contrast, while non-convex approaches are computationally less expensive, existing recovery guarantees assume that the number of samples scales at least quadratically with the rank $r$ of the ground-truth matrix. In this paper, we close this gap by showing that the non-convex approaches can be as efficient as nuclear norm minimization in terms of sample complexity. Namely, we consider the problem of reconstructing a positive semidefinite matrix from a few Gaussian measurements. We show that factorized gradient descent with spectral initialization converges to the ground truth with a linear rate as soon as the number of samples scales with $\Omega(rd\kappa^2)$, where $d$ is the dimension, and $\kappa$ is the condition number of the ground-truth matrix. This improves the previous rank-dependence in the sample complexity of non-convex matrix factorization from quadratic to linear. Our proof relies on a probabilistic decoupling argument, where we show that the gradient descent iterates are only weakly dependent on the individual entries of the measurement matrices. We expect that our proof technique is of independent interest for other non-convex problems.
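A minimal end-to-end sketch of the algorithm analyzed here, factorized gradient descent with spectral initialization on Gaussian measurements (our toy instantiation; dimensions, stepsize, and iteration count are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    d, r, m = 30, 2, 1200                       # dimension, rank, samples
    U_star = rng.normal(size=(d, r))
    X_star = U_star @ U_star.T                  # PSD rank-r ground truth
    A = rng.normal(size=(m, d, d))              # Gaussian measurement matrices
    y = np.einsum('mij,ij->m', A, X_star)       # y_i = <A_i, X*>

    # Spectral initialization: top-r eigenpairs of the symmetrized backprojection
    M = np.einsum('m,mij->ij', y, A) / m
    M = (M + M.T) / 2
    vals, vecs = np.linalg.eigh(M)
    U = vecs[:, -r:] * np.sqrt(np.maximum(vals[-r:], 0))

    # Gradient descent on f(U) = (1/4m) sum_i (<A_i, U U^T> - y_i)^2
    eta = 0.2 / np.linalg.norm(M, 2)
    for _ in range(300):
        resid = np.einsum('mij,ij->m', A, U @ U.T) - y
        G = np.einsum('m,mij->ij', resid, A) / m
        U -= eta * (G + G.T) @ U / 2
    print(np.linalg.norm(U @ U.T - X_star) / np.linalg.norm(X_star))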
Session: 8A - Non-Convex Optimization (Thursday 03 July 10:30–11:20)
Authors: Zhou, Huanjian; Han, Andi; Takeda, Akiko; Sugiyama, Masashi
Abstract:
In large-scale applications, such as machine learning, it is desirable to design non-convex optimization algorithms with a high degree of parallelization. In this work, we first study the adaptive complexity of finding a stationary point, which is the minimal number of sequential rounds required to achieve stationarity given polynomially many queries executed in parallel at each round. We examine two fundamental cases of non-convex optimization: the high-dimensional case and the constant-dimensional case. For the high-dimensional case, \emph{i.e.}, $d = \widetilde{\Omega}(\varepsilon^{-(2 + 2p)/p})$, we show that for any (potentially randomized) algorithm, there exists a function with Lipschitz $p$-th order derivatives such that the algorithm requires at least $\varepsilon^{-(p+1)/p}$ iterations to find an $\varepsilon$-stationary point. Our lower bounds are tight and show that even with $\mathrm{poly}(d)$ queries per iteration, no algorithm has a better convergence rate than that achievable with one-query-per-round algorithms. In other words, gradient descent, the cubic-regularized Newton's method, and the $p$th-order adaptive regularization method are adaptively optimal. Our proof relies upon a novel analysis that characterizes the algorithm's output on hardness potentials built from a chain-like structure with random partition. For the constant-dimensional case, \emph{i.e.}, $d = \Theta(1)$, we propose an algorithm that bridges grid search and gradient flow trapping, finding an approximate stationary point in constant iterations. Its asymptotic tightness is verified by a new lower bound on the required queries per iteration. We show there exists a smooth function such that any algorithm running with $\Theta(\log (1/\varepsilon))$ rounds requires at least $\widetilde{\Omega}((1/\varepsilon)^{(d-1)/2})$ queries per round. This lower bound is tight up to a logarithmic factor, and implies that gradient flow trapping is adaptively optimal.
Session: 8A - Non-Convex Optimization (Thursday 03 July 10:30–11:20)
Authors: Hanneke, Steve; Moran, Shay; Shlimovich, Alexander; Yehudayoff, Amir
Abstract:
Learning theory has traditionally followed a model-centric approach, focusing on designing optimal algorithms for a fixed natural learning task (e.g., linear classification or regression). In this paper, we adopt a complementary data-centric perspective, whereby we fix a natural learning rule and focus on optimizing the training data. Specifically, we study the following question: given a learning rule \(\mathcal{A}\) and a data selection budget \(n\), how well can \(\mathcal{A}\) perform when trained on at most \(n\) data points selected from a population of \(N\) points? We investigate when it is possible to select \(n \ll N\) points and achieve performance comparable to training on the entire population. We address this question across a variety of empirical risk minimizers. Our results include optimal data-selection bounds for mean estimation, linear classification, and linear regression. Additionally, we establish two general results: a taxonomy of error rates in binary classification and in stochastic convex optimization. Finally, we propose several open questions and directions for future research.
Session: 8A - Non-Convex Optimization (Thursday 03 July 10:30–11:20)
Authors: Cornacchia, Elisabetta; Mikulincer, Dan; Mossel, Elchanan
Abstract:
The problem of learning single-index and multi-index models has gained significant interest as a fundamental task in high-dimensional statistics. Many recent works have analyzed gradient-based methods, particularly in the setting of isotropic data distributions, often in the context of neural network training. Such studies have uncovered precise characterizations of algorithmic sample complexity in terms of certain analytic properties of the target function, such as the leap, information, and generative exponents. These properties establish a quantitative separation between low- and high-complexity learning tasks. In this work, we show that high-complexity cases are rare. Specifically, we prove that introducing a small random perturbation to the data distribution—via a random shift in the first moment—renders any Gaussian single-index model as easy to learn as a linear function. We further extend this result to a class of multi-index models, namely sparse Boolean functions, also known as Juntas.
Time: Thursday 03 July 10:30–11:20
Session: 8B - Statistical Physics (Thursday 03 July 10:30–11:20)
Authors: Xu, Yizhou; Maillard, Antoine; Krzakala, Florent; Zdeborova, Lenka
Abstract:
In the matrix sensing problem, one wishes to reconstruct a matrix from (possibly noisy) observations of its linear projections along given directions. We consider this model in the high-dimensional limit: while previous works on this model primarily focused on the recovery of low-rank matrices, we consider in this work more general classes of structured signal matrices with potentially large rank, e.g.\ a product of two matrices of sizes proportional to the dimension. We provide rigorous asymptotic equations characterizing the Bayes-optimal learning performance from a number of samples which is proportional to the number of entries in the matrix. Our proof is composed of three key ingredients: $(i)$ we prove universality properties to handle structured sensing matrices, related to the ``Gaussian equivalence'' phenomenon in statistical learning, $(ii)$ we provide a sharp characterization of Bayes-optimal learning in generalized linear models with Gaussian data and structured matrix priors, generalizing previously studied settings, and $(iii)$ we leverage previous works on the problem of matrix denoising. The generality of our results allows for a variety of applications: notably, we mathematically establish predictions obtained via non-rigorous methods from statistical physics in~\cite{erba2024bilinear} regarding Bilinear Sequence Regression, a benchmark model for learning from sequences of tokens, and in~\cite{maillard2024bayes} on Bayes-optimal learning in neural networks with quadratic activation function, and width proportional to the dimension.
Session: 8B - Statistical Physics (Thursday 03 July 10:30–11:20)
Authors: Louis, Anand; Paul, Rameesh; Raghavendra, Prasad
Abstract:
The planted clique problem is a fundamental problem in the study of algorithms and has been studied in various random and semirandom models. For the planted clique problem, we can recover the clique if the size of the clique is above the conjectured computational threshold of $\Omega_p(\sqrt{n})$. A natural question that arises then is: what other planted structures can be recovered? In this work, we consider random planted and semirandom models for the $r$-coloring problem. We study the following model of instances: choose a set $S \subseteq V$ (of size $k$) and plant an arbitrary $r$-colorable graph in the subgraph induced on $S$. For each pair of vertices in $(V\setminus S) \times (V\setminus S)$, edges are added independently with probability $p$. An adversary is then allowed to add an arbitrary subgraph between $S$ and $V\setminus S$. Our main result is an efficient algorithm that recovers most of the vertices of the planted $r$-colorable graph for $k\geq cr\sqrt{n/p}$, for some constant $c$. Our key technical innovation is a novel SDP relaxation and a rounding algorithm for this problem. Our algorithm is also robust to the presence of a monotone adversary that can insert edges in the graph induced on $V\setminus S$.
Session: 8B - Statistical Physics (Thursday 03 July 10:30–11:20)
Authors: Elimelech, Dor; Huleihel, Wasim
Abstract:
The problems of detecting and recovering planted structures/subgraphs in Erd\H{o}s-R\'{e}nyi random graphs have received significant attention over the past three decades, leading to many exciting results and mathematical techniques. However, prior work has largely focused on specific ad hoc planted structures and inferential settings, while a general theory has remained elusive. In this paper, we bridge this gap by investigating the detection of an \emph{arbitrary} planted subgraph $\Gamma = \Gamma_n$ in an Erd\H{o}s-R\'{e}nyi random graph $\mathcal{G}(n, q_n)$, where the edge probability within $\Gamma$ is $p_n$. We examine both the statistical and computational aspects of this problem and establish the following results. In the dense regime, where the edge probabilities $p_n$ and $q_n$ are fixed, we tightly characterize the information-theoretic and computational thresholds for detecting $\Gamma$, and provide conditions under which a computational-statistical gap arises. Most notably, these thresholds depend on $\Gamma$ only through its number of edges, maximum degree, and maximum subgraph density. Our lower and upper bounds are general and apply to any value of $p_n$ and $q_n$ as functions of $n$. Accordingly, we also analyze the sparse regime where $q_n = \Theta(n^{-\alpha})$ and $p_n-q_n =\Theta(q_n)$, with $\alpha\in[0,2]$, as well as the critical regime where $p_n=1-o(1)$ and $q_n = \Theta(n^{-\alpha})$, both of which have been widely studied, for specific choices of $\Gamma$. For these regimes, we show that our bounds are tight for all planted subgraphs investigated in the literature thus far--and many more. Finally, we identify conditions under which detection undergoes a sharp phase transition, where the boundaries at which algorithms succeed or fail shift abruptly as a function of $q_n$.
Session: 8B - Statistical Physics (Thursday 03 July 10:30–11:20)
Authors: Lee, Daniel; Pernice, Francisco; Rajaraman, Amit; Zadik, Ilias
Abstract:
Given an arbitrary subgraph $H=H_n$ and $p=p_n\in(0,1)$, the planted subgraph model is defined as follows. A statistician observes the union of the "signal," which is a random "planted" copy $H^*$ of $H$, together with random "noise" in the form of an instance of an Erdős–Rényi graph $G(n,p)$. The goal then of the statistician is to recover the planted $H^*$ from the observed graph. Our focus in this work is to understand the minimum mean-squared error (MMSE) in terms of recovering the edges of $H^*$, as a function of $p$ and $H$. A recent paper [MNSSZ23] characterizes the graphs for which this MMSE curve undergoes a sharp phase transition from $0$ to $1$ as $p$ increases, a behavior known as the All-or-Nothing phenomenon, up to a mild density assumption on $H$. However, their techniques fail to describe the MMSE curves for graphs that do not display such a sharp phase transition. In this paper, we provide a formula for the limiting MMSE curve for any graph $H=H_n$, up to the same mild density assumption. This curve is expressed in terms of a variational formula over pairs of subgraphs of $H$, and is inspired by the celebrated subgraph expectation thresholds from probabilistic combinatorics [KK07]. Furthermore, we give a polynomial-time description of the optimizers of this variational problem. This allows one to efficiently compute the MMSE curve for any given dense graph $H$. The proof relies on a novel graph decomposition as well as a min-max duality theorem which may be of independent interest. Our results generalize to the setting of planting arbitrary monotone boolean properties, where the statistician observes the union of a planted minimal element $A\subseteq[N]$ of a monotone property and a random $\mathrm{Ber}(p)^{\otimes N}$ vector. In this setting, we provide a variational formula inspired by the so-called "fractional" expectation threshold [Tal10], again describing the MMSE curve (in this case up to a multiplicative constant).
Time: Thursday 03 July 14:00–15:36
Session: 9A - Random Graphs (Thursday 03 July 14:00–15:36)
Authors: Stephan, Ludovic; Zhu, Yizhe
Abstract:
The Bethe-Hessian matrix, introduced by Saade, Krzakala, and Zdeborová (2014), is a Hermitian matrix designed for applying spectral clustering algorithms to sparse networks. Rather than employing a non-symmetric and high-dimensional non-backtracking operator, a spectral method based on the Bethe-Hessian matrix is conjectured to also reach the Kesten-Stigum detection threshold in the sparse stochastic block model (SBM). We provide the first rigorous analysis of the Bethe-Hessian spectral method in the SBM under both the bounded expected degree and the growing degree regimes. Specifically, we demonstrate that: (i) When the expected degree $d\geq 2$, the number of negative outliers of the Bethe-Hessian matrix can consistently estimate the number of blocks above the Kesten-Stigum threshold, thus confirming a conjecture from Saade et al. (2014) for $d\geq 2$. (ii) For sufficiently large $d$, its eigenvectors can be used to achieve weak recovery. (iii) As $d\to\infty$, we establish the concentration of the locations of its negative outlier eigenvalues, and weak consistency can be achieved via a spectral method based on the Bethe-Hessian matrix.
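A minimal illustration of the spectral method under study: build the Bethe-Hessian $H(r) = (r^2-1)I - rA + D$ at $r=\sqrt{\bar c}$ (the choice suggested in Saade et al. (2014)) and count its negative eigenvalues; the parameters below are our toy choices, set above the KS threshold:

    import numpy as np

    rng = np.random.default_rng(1)
    n, q, cin, cout = 600, 3, 14.0, 2.0
    labels = rng.integers(q, size=n)
    P = np.where(labels[:, None] == labels[None, :], cin / n, cout / n)
    A = np.triu(rng.random((n, n)) < P, 1).astype(float)
    A = A + A.T                                  # sparse SBM adjacency matrix

    deg = A.sum(1)
    r = np.sqrt(deg.mean())                      # r = sqrt(average degree)
    H = (r**2 - 1) * np.eye(n) - r * A + np.diag(deg)
    print((np.linalg.eigvalsh(H) < 0).sum())     # ~q negative outliers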
Session: 9A - Random Graphs (Thursday 03 July 14:00–15:36)
Authors: Moharrami, Mehrdad; Moore, Cris; Xu, Jiaming
Abstract:
We study the problem of recovering a planted spanning tree $M_n^*$ hidden within a complete, randomly weighted graph $G_n$. Specifically, each edge $e$ has a non-negative weight drawn independently from $P_n$ if $e \in M_n^*$ and from $Q_n$ otherwise, where $P_n \equiv P$ is fixed and $Q_n$ scales with $n$ such that its density at the origin satisfies $\lim_{n\to\infty} n Q'_n(0)=1.$ We consider two representative cases: when $M_n^*$ is either a uniform spanning tree or a uniform Hamiltonian path. We analyze the recovery performance of the minimum spanning tree (MST) algorithm and derive a fixed-point equation that characterizes the asymptotic fraction of edges in $M_n^*$ successfully recovered by the MST as $n \to \infty.$ Furthermore, we establish the asymptotic mean weight of the MST, extending Frieze's $\zeta(3)$ result to the planted model. Our analysis relies on an asymptotic characterization of the local structure of the planted model, employing the framework of local weak convergence.
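The $\zeta(3)$ baseline being extended here is easy to reproduce empirically; a quick check (ours) of Frieze's classical unplanted limit with iid uniform weights:

    import numpy as np
    from scipy.sparse.csgraph import minimum_spanning_tree
    from scipy.special import zeta

    rng = np.random.default_rng(0)
    n = 400
    W = np.triu(rng.random((n, n)), 1)   # iid U[0,1] weights, upper triangle
    print(minimum_spanning_tree(W).sum(), "vs zeta(3) =", zeta(3))  # both ~1.202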
Session: 9A - Random Graphs (Thursday 03 July 14:00–15:36)
Authors: Gaudio, Julia; Sandon, Colin; Xu, Jiaming; Yang, Dana
Abstract:
This paper studies the problem of inferring a $k$-factor, specifically a spanning $k$-regular graph, planted within an Erd\H{o}s-R\'{e}nyi random graph $G(n,\lambda/n)$. We uncover an interesting "all-something-nothing" phase transition. Specifically, we show that as the average degree $\lambda$ surpasses the critical threshold of $1/k$, the inference problem undergoes a transition from almost exact recovery ("all" phase) to partial recovery ("something" phase). Moreover, as $\lambda$ tends to infinity, the accuracy of recovery diminishes to zero, leading to the onset of the "nothing" phase. This finding complements the recent result by Mossel, Niles-Weed, Sohn, Sun, and Zadik, who established that for certain sufficiently dense graphs, the problem undergoes an "all-or-nothing" phase transition, jumping from near-perfect to near-zero recovery. In addition, we characterize the recovery accuracy of a linear-time iterative pruning algorithm and show that it achieves almost exact recovery when $\lambda < 1/k$. A key component of our analysis is a two-step cycle construction: we first build trees through local neighborhood exploration and then connect them by sprinkling using reserved edges. Interestingly, for proving impossibility of almost exact recovery, we construct $\Theta(n)$ many small trees of size $\Theta(1)$, whereas for establishing the algorithmic lower bound, a single large tree of size $\Theta(\sqrt{n\log n})$ suffices.
Session: 9A - Random Graphs (Thursday 03 July 14:00–15:36)
Authors: Li, Zhangsong
Abstract:
In this paper, we focus on the matching recovery problem between a pair of correlated Gaussian Wigner matrices with a latent vertex correspondence. We are particularly interested in a robust version of this problem such that our observation is a perturbed input $(A+E,B+F)$ where $(A,B)$ is a pair of correlated Gaussian Wigner matrices and $E,F$ are adversarially chosen matrices supported on an unknown $\epsilon n \times \epsilon n$ principal minor of $A,B$, respectively. We propose a vector approximate message passing (vector AMP) algorithm that succeeds in polynomial time as long as the correlation $\rho$ between $(A,B)$ is a non-vanishing constant and $\epsilon = o\big( \tfrac{1}{(\log n)^{20}} \big)$. The main methodological inputs for our result are the iterative random graph matching algorithm proposed in Ding and Li (2025+, 2023) and the spectral cleaning procedure proposed in Ivkov and Schramm (2024). To the best of our knowledge, our algorithm is the first efficient random graph matching type algorithm that is robust under any adversarial perturbations of $n^{1-o(1)}$ size.
Session: 9A - Random Graphs (Thursday 03 July 14:00–15:36)
Authors: Chin, Byron; Mossel, Elchanan; Sohn, Youngtak; Wein, Alexander
Abstract:
We study the inference of communities in stochastic block models with a growing number of communities. For block models with $n$ vertices and a fixed number of communities $q$, it was predicted in Decelle et al.\ that there are computationally efficient algorithms for recovering the communities above the Kesten--Stigum (KS) bound and that efficient recovery is impossible below the KS bound. This conjecture has since stimulated a lot of interest, with the achievability side proven in a line of research culminating in work of Abbe and Sandon. Conversely, the hardness side of the conjecture has been supported by recent progress based on the low-degree paradigm. In this paper we investigate community recovery in the regime $q \to \infty$ where no such predictions exist. We show that efficient inference of communities remains possible above the KS bound. Furthermore, we show that recovery of block models is low-degree-hard below the KS bound when the number of communities $q\ll \sqrt{n}$. Perhaps surprisingly, we find that when $q \gg \sqrt{n}$, there is an efficient algorithm based on non-backtracking walks for recovery even below the KS bound. We identify a new threshold which we conjecture is the threshold for weak recovery in this regime. Finally, we show that detection is easy and identify (up to a constant) the information-theoretic threshold for community recovery as the number of communities $q$ diverges. Our low-degree hardness results also naturally have consequences for graphon estimation, improving results of Luo and Gao.
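For reference, the Kesten--Stigum bound referred to throughout has a standard closed form: for the symmetric $q$-block SBM with within- and between-community edge probabilities $a/n$ and $b/n$ (our notation, not the abstract's), the condition is $(a-b)^2 > q(a+(q-1)b)$; a one-line check (ours):

    def above_KS(a, b, q):
        # Kesten-Stigum condition for the symmetric q-block SBM
        return (a - b) ** 2 > q * (a + (q - 1) * b)

    print(above_KS(14, 2, 3))   # True: this parameter choice is above KS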
Session: 9A - Random Graphs (Thursday 03 July 14:00–15:36)
Authors: Bresler, Guy; Guo, Chenghao; Polyanskiy, Yury; Yao, Andrew
Abstract:
Consider a $d$-uniform random hypergraph on $n$ vertices in which hyperedges are included iid, each with probability $n^{-d+1+\delta}$, so that the average degree is $n^\delta$. The projection of a hypergraph is a graph on the same $n$ vertices where an edge connects two vertices if and only if they belong to some hyperedge. The goal is to reconstruct the hypergraph given its projection. An earlier work (Bresler et al., 2024) showed that exact recovery for $d=3$ is possible if and only if $\delta < 2/5$. This work completely resolves the question for all values of $d$ for both exact and partial recovery and for both cases of whether multiplicity information about each edge is available or not. In addition, it is shown that the reconstruction fidelity undergoes an all-or-nothing transition at a threshold. In particular, this resolves all conjectures from (Bresler et al., 2024).
Session: 9A - Random Graphs (Thursday 03 July 14:00–15:36)
Authors: Tankala, Chandan; Nagaraj, Dheeraj; Raj, Anant
Abstract:
Gradient flow in the 2-Wasserstein space is widely used to optimize functionals over probability distributions and is typically implemented using an interacting particle system with n particles. Analyzing these algorithms requires showing (a) that the finite-particle system converges and/or (b) that the resultant empirical distribution of the particles closely approximates the optimal distribution (i.e., propagation of chaos). However, establishing efficient sufficient conditions can be challenging, as the finite particle system may produce heavily dependent random variables. In this work, we study the virtual particle stochastic approximation, originally introduced for Stein Variational Gradient Descent (Das and Nagaraj, 2023). This method can be viewed as a form of stochastic gradient descent in the Wasserstein space and can be implemented efficiently. In popular settings, we demonstrate that our algorithm's output converges to the optimal distribution under conditions similar to those for the infinite particle limit, and it produces i.i.d. samples without the need to explicitly establish propagation of chaos bounds.
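For orientation, the deterministic interacting-particle baseline that such stochastic approximations modify is plain SVGD; a minimal sketch with an RBF kernel and a standard Gaussian target (vanilla SVGD only, not the virtual-particle variant analyzed in the paper):

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, h, eps = 100, 2, 0.5, 0.1
    x = rng.normal(scale=3.0, size=(n, d))       # particles, mis-initialized

    grad_log_p = lambda z: -z                    # target: standard Gaussian

    for _ in range(500):
        diff = x[:, None, :] - x[None, :, :]     # diff[j, i] = x_j - x_i
        K = np.exp(-(diff ** 2).sum(-1) / (2 * h))
        # phi(x_i) = (1/n) sum_j [k(x_j,x_i) grad log p(x_j) + grad_{x_j} k(x_j,x_i)]
        phi = (K @ grad_log_p(x) + (K[..., None] * (-diff / h)).sum(0)) / n
        x += eps * phi
    print(x.mean(0), x.std(0))                   # ~0 mean, ~unit std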
Session: 9A - Random Graphs (Thursday 03 July 14:00–15:36)
Authors: Du, Hang; Gong, Shuyang; Xu, Jiaming
Abstract:
We investigate the problem of detecting and estimating a changepoint in the attachment function of a network evolving according to a preferential attachment model on $n$ vertices, using only a single final snapshot of the network. \cite{bet2023detecting} show that a simple test based on thresholding the number of vertices with minimum degrees can detect the changepoint when the change occurs at time $n-\Omega(\sqrt{n})$. They further make the striking conjecture that detection becomes impossible for any test if the change occurs at time $n-o(\sqrt{n}).$ \cite{kaddouri2024impossibility} take a step forward by proving that detection is impossible if the change occurs at time $n-o(n^{1/3}).$ In this paper, we resolve the conjecture affirmatively, proving that detection is indeed impossible if the change occurs at time $n-o(\sqrt{n}).$ Furthermore, we establish that estimating the changepoint with an error smaller than $o(\sqrt{n})$ is also impossible, thereby confirming that the estimator proposed in \cite{bhamidi2018change} is order-optimal.
Time: Thursday 03 July 14:00–15:36
Session: 9B - Learning Theory II (Thursday 03 July 14:00–15:36)
Authors: Klivans, Adam; Stavropoulos, Konstantinos; Vasilyan, Arsen
Abstract:
The seminal work of Linial, Mansour, and Nisan gave a quasipolynomial-time algorithm for learning constant-depth circuits (AC0) with respect to the uniform distribution on the hypercube. Extending their algorithm to the setting of malicious noise, where both covariates and labels can be adversarially corrupted, has remained open. Here we achieve such a result, inspired by recent work on learning with distribution shift. Our running time essentially matches their algorithm, which is known to be optimal assuming various cryptographic primitives. Our proof uses a simple outlier-removal method combined with Braverman's theorem for fooling constant-depth circuits. We attain the best possible dependence on the noise rate and succeed in the harshest possible noise model (i.e., contamination or so-called ``nasty noise'').
Session: 9B - Learning Theory II (Thursday 03 July 14:00–15:36)
Authors: Kovačević, Filip; Zhang, Yihan; Mondelli, Marco
Abstract:
Multi-index models provide a popular framework to investigate the learnability of functions with low-dimensional structure and, also due to their connections with neural networks, they have been the object of intensive recent study. In this paper, we focus on recovering the subspace spanned by the signals via spectral estimators -- a family of methods routinely used in practice, often as a warm-start for iterative algorithms. Our main technical contribution is a precise asymptotic characterization of the performance of spectral methods, when sample size and input dimension grow proportionally and the dimension $p$ of the space to recover is fixed. Specifically, we locate the top-$p$ eigenvalues of the spectral matrix and establish the overlaps between the corresponding eigenvectors (which give the spectral estimators) and a basis of the signal subspace. Our analysis unveils a phase transition phenomenon in which, as the sample complexity grows, eigenvalues escape from the bulk of the spectrum and, when that happens, eigenvectors recover directions of the desired subspace. The precise characterization we put forward enables the optimization of the data preprocessing, thus allowing us to identify the spectral estimator that requires the minimal sample size for weak recovery.
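A minimal single-index illustration of the spectral-estimator family analyzed here (our toy example; the preprocessing $T(y)=y-1$ is one simple choice, and the paper's point is precisely that this choice can be optimized):

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 4000, 100
    w = rng.normal(size=d); w /= np.linalg.norm(w)   # hidden direction
    X = rng.normal(size=(n, d))
    y = (X @ w) ** 2                                 # single-index observations

    M = (X.T * (y - 1)) @ X / n       # spectral matrix (1/n) sum T(y_i) x_i x_i^T
    v = np.linalg.eigh(M)[1][:, -1]   # top eigenvector = spectral estimator
    print(abs(v @ w))                 # overlap with the signal, close to 1 here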
Session: 9B - Learning Theory II (Thursday 03 July 14:00–15:36)
Authors: Matsumoto, Namiko; Mazumdar, Arya
Abstract:
In statistics, generalized linear models (GLMs) are widely used for modeling data and can expressively capture potential nonlinear dependence of the model's outcomes on its covariates. Within the broad family of GLMs, those with binary outcomes, which include logistic and probit regressions, are motivated by common tasks such as binary classification with (possibly) non-separable data. In addition, in modern machine learning and statistics, data is often high-dimensional yet has a low intrinsic dimension, making sparsity constraints in models another reasonable consideration. In this work, we propose to use and analyze an iterative hard thresholding (projected gradient descent on the ReLU loss) algorithm, called \emph{binary iterative hard thresholding (BIHT)}, for parameter estimation in sparse GLMs with binary outcomes. We establish that BIHT is statistically efficient and converges to the correct solution for parameter estimation in a general class of sparse binary GLMs. Unlike many other methods for learning GLMs, including maximum likelihood estimation, generalized approximate message passing, and GLM-tron (Kakade et al., 2011; Bahmani et al., 2016), BIHT does not require knowledge of the GLM's link function, offering flexibility and generality in allowing the algorithm to learn arbitrary binary GLMs. As two applications, logistic and probit regression are additionally studied. In this regard, it is shown that in logistic regression, the algorithm is in fact statistically optimal in the sense that the order-wise sample complexity matches (up to logarithmic factors) the lower bound obtained previously. To the best of our knowledge, this is the first work achieving statistical optimality for logistic regression in all noise regimes with a computationally efficient algorithm. Moreover, for probit regression, our sample complexity is on the same order as that obtained for logistic regression.
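A minimal sketch of BIHT in the noiseless binary setting (our simplification; the paper handles general sparse binary GLMs with noise): alternate a (sub)gradient step on the ReLU-type loss with hard thresholding to the sparsity level:

    import numpy as np

    rng = np.random.default_rng(0)
    d, s, m = 200, 5, 400
    x_star = np.zeros(d)
    idx = rng.choice(d, s, replace=False)
    x_star[idx] = rng.normal(size=s)
    x_star /= np.linalg.norm(x_star)
    A = rng.normal(size=(m, d))
    y = np.sign(A @ x_star)                     # binary outcomes

    def hard_threshold(v, s):
        out = np.zeros_like(v)
        keep = np.argsort(np.abs(v))[-s:]       # s largest-magnitude coords
        out[keep] = v[keep]
        return out

    x = np.zeros(d)
    for _ in range(100):
        g = A.T @ (y - np.sign(A @ x))          # (sub)gradient of the ReLU loss
        x = hard_threshold(x + g / m, s)
        x /= max(np.linalg.norm(x), 1e-12)      # binary labels fix only direction
    print(np.linalg.norm(x - x_star))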
Session: 9B - Learning Theory II (Thursday 03 July 14:00–15:36)
Authors: Musco, Cameron; Musco, Christopher; Rosenblatt, Lucas; Singh, Apoorv Vikram
Abstract:
We study the problem of approximately recovering a probability distribution given noisy measurements of its Chebyshev polynomial moments. This problem arises broadly across algorithms, statistics, and machine learning. By leveraging a global decay bound on the coefficients in the Chebyshev expansion of any Lipschitz function, we sharpen prior work, proving that accurate recovery in the Wasserstein distance is possible with more noise than previously known. Our result immediately yields a number of applications: 1. We give a simple ``linear query'' algorithm for constructing a differentially private synthetic data distribution with Wasserstein-1 error $\tilde{O}(1/n)$ based on a dataset of $n$ points in $[-1,1]$. This bound is optimal up to log factors and matches a recent breakthrough of Boedihardjo, Strohmer, and Vershynin [Probab. Theory. Rel., 2024], which uses a more complex ``superregular random walk'' method to beat an $O(1/\sqrt{n})$ accuracy barrier inherent to earlier approaches. 2. We give an $\tilde{O}(n^2/\epsilon)$ time algorithm for the linear algebraic problem of estimating the spectral density of an $n\times n$ symmetric matrix up to $\epsilon$ error in the Wasserstein distance. Our result accelerates prior methods from Chen et al. [ICML 2021] and Braverman et al. [STOC 2022]. 3. We tighten an analysis of Vinayak, Kong, Valiant, and Kakade [ICML 2019] on the maximum likelihood estimator for the statistical problem of ``Learning Populations of Parameters'', extending the parameter regime in which sample optimal results can be obtained. Beyond these main results, we provide an extension of our bound to estimating distributions in $d > 1$ dimensions. We hope that these bounds will find applications more broadly to problems involving distribution recovery from noisy moment information.
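A minimal version of the moment-based recovery underlying these applications (our sketch of the standard Chebyshev reconstruction; damping of the truncated series, e.g. Jackson smoothing, is omitted): estimate Chebyshev moments from points in $[-1,1]$ and rebuild a density against the Chebyshev weight:

    import numpy as np
    from numpy.polynomial.chebyshev import chebval

    rng = np.random.default_rng(0)
    lam = 2 * rng.beta(2, 5, size=2000) - 1            # points in [-1, 1]

    K = 20
    moments = np.array([np.cos(k * np.arccos(lam)).mean() for k in range(K + 1)])
    # T_k(x) = cos(k arccos x); noisy moments would be handled identically

    grid = np.linspace(-0.99, 0.99, 400)
    coef = np.concatenate(([1.0], 2 * moments[1:]))    # p ~ w(x) * sum b_k T_k
    dens = chebval(grid, coef) / (np.pi * np.sqrt(1 - grid ** 2))
    print((dens * (grid[1] - grid[0])).sum())          # total mass ~1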
Session: 9B - Learning Theory II (Thursday 03 July 14:00–15:36)
Authors: Heidari, Mohsen; Khardon, Roni
Abstract:
The Fourier representation for the uniform distribution over the Boolean cube has found numerous applications in algorithms and complexity analysis. Notably, in learning theory, the learnability of Disjunctive Normal Form (DNF) under the uniform and product distributions has been established through such representations. This paper makes three main contributions. First, it introduces a generalized Fourier expansion that can be used with any distribution $D$ through the representation of the distribution as a Bayesian network (BN). Second, it shows that the main algorithmic tools for learning with the Fourier representation that use membership queries to approximate functions by recovering their heavy Fourier coefficients, can be used with slight modifications with the generalized expansion. These results hold for any distribution. Third, it analyzes the $L_1$ spectral norm of conjunctions under the new expansion, showing that it is bounded for a class of distributions which can be represented by a difference-bounded tree BN, where a parent node in the BN representation can change the conditional expectation of a child node by at most $\alpha<0.5$. Lower bounds are presented to show that such constraints are necessary. Combining these contributions, the paper shows learnability of DNF with membership queries under difference-bounded tree BN.
Session: 9B - Learning Theory II (Thursday 03 July 14:00–15:36)
Authors: Bhattacharjee, Robi; Frohnapel, Karolin; Luxburg, Ulrike
Abstract:
SHAP is one of the most popular local feature-attribution methods. Given a function $f$ and an input $x \in \mathbb{R}^d$, it quantifies each feature's contribution to $f(x)$. Recently, SHAP has been increasingly used for global insights: practitioners average the absolute SHAP values over many data points to compute global feature importance scores, which are then used to discard ``unimportant'' features. In this work, we investigate the soundness of this practice by asking whether small aggregate SHAP values necessarily imply that the corresponding feature does not affect the function. Unfortunately, the answer is no: even if the $i$-th SHAP value equals $0$ on the entire data support, there exist functions that clearly depend on feature $i$. The issue is that computing SHAP values involves evaluating $f$ on points outside of the data support, where $f$ can be strategically designed to mask its dependence on feature $i$. To address this, we propose to aggregate SHAP values over the extended support, which is the product of the marginals of the underlying distribution. With this modification, we show that a small aggregate SHAP value implies that we can safely discard the corresponding feature. We then extend our results to KernelSHAP, the most popular method to approximate SHAP values in practice. We show that if KernelSHAP is computed over the extended distribution, a small aggregate KernelSHAP value justifies feature removal. This result holds independently of whether KernelSHAP accurately approximates true SHAP values, making it one of the first theoretical results to characterize the KernelSHAP algorithm itself. Our findings have both theoretical and practical implications. We introduce the ``Shapley Lie algebra,'' which offers algebraic insights that may enable a deeper investigation of SHAP, and we show that a simple preprocessing step -- randomly permuting each column of the data matrix -- enables safely discarding features based on aggregate SHAP and KernelSHAP values.
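The preprocessing step at the end is simple to state in code; a minimal sketch (ours) of the column permutation that turns the empirical distribution into a sample from the product of marginals, i.e. the extended support:

    import numpy as np

    def extend_support(X, rng=None):
        # Independently permute each column; rows of the result are draws
        # from the product of the per-feature marginals (extended support),
        # over which aggregate |SHAP| values can justify feature removal.
        if rng is None:
            rng = np.random.default_rng()
        Xp = X.copy()
        for j in range(Xp.shape[1]):
            Xp[:, j] = rng.permutation(Xp[:, j])
        return Xp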
Session: 9B - Learning Theory II (Thursday 03 July 14:00–15:36)
Authors: Esmaili Mallory, Matthew; Huang, Kevin Han; Austern, Morgane
Abstract:
Over the last decade, a wave of research has characterized the exact asymptotic risk of many high-dimensional models in the proportional regime. Two foundational results have driven this progress: Gaussian universality, which shows that the asymptotic risk of estimators trained on non-Gaussian and Gaussian data is equivalent, and the convex Gaussian min-max theorem (CGMT), which characterizes the risk under Gaussian settings. However, these results rely on the assumption that the data consists of independent random vectors—an assumption that significantly limits their applicability to many practical setups. In this paper, we address this limitation by generalizing both results to the dependent setting. More precisely, we prove that Gaussian universality still holds for high-dimensional logistic regression under block dependence, and establish a novel CGMT framework that accommodates correlation across both the covariates and observations. Using these results, we establish the impact of data augmentation, a widespread practice in deep learning, on the asymptotic risk.
Time: Thursday 03 July 16:12–18:00
Session: 10A - Language Models (Thursday 03 July 16:12–18:00)
Authors: Joshi, Nirmit; Srebro, Nathan; Vardi, Gal; Block, Adam; Goel, Surbhi; Li, Zhiyuan; Misiakiewicz, Theodor
Abstract:
For a given base class of sequence-to-next-token generators, we consider learning prompt-to-answer mappings obtained by iterating a fixed (time-invariant) generator for multiple steps, thus generating a chain-of-thought, and then taking the final token as the answer. We formalize the learning problems both when the chain-of-thought is observed and when training only on prompt-answer pairs, with the chain-of-thought latent. We analyze the sample and computational complexity both in terms of general properties of the base class (e.g. its VC dimension) and for specific base classes. We present a simple base class that allows for universal representability and computationally tractable chain-of-thought learning. Central to our development is that time invariance allows for sample complexity that is independent of the length of the chain-of-thought. Attention arises naturally in our construction.
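The prompt-to-answer mapping being learned has a very short procedural description; a minimal sketch (ours) of iterating a fixed next-token generator and returning the final token:

    def chain_of_thought_answer(generator, prompt, T):
        # Apply the same (time-invariant) next-token generator T times,
        # appending each produced token; the answer is the final token.
        seq = list(prompt)
        for _ in range(T):
            seq.append(generator(seq))
        return seq[-1]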
Session: 10A - Language Models (Thursday 03 July 16:12–18:00)
Authors: Rohatgi, Dhruv; Block, Adam; Huang, Audrey; Krishnamurthy, Akshay; Foster, Dylan
Abstract:
Next-token prediction with the log loss is a cornerstone of autoregressive sequence modeling, but, in practice, suffers from error amplification, where errors in the model compound and generation quality degrades as sequence length $H$ increases. From a theoretical perspective, this phenomenon should not appear in well-specified settings, and, indeed, a growing body of empirical work hypothesizes that misspecification, where the learner is not sufficiently expressive to represent the target distribution, may be the root cause. Under misspecification---where the goal is to learn as well as the best-in-class model up to a multiplicative approximation factor $C\geq 1$---we confirm that $C$ indeed grows with $H$ for next-token prediction, lending theoretical support to this empirical hypothesis. We then ask whether this mode of error amplification is avoidable algorithmically, computationally, or information-theoretically, and uncover inherent computational-statistical tradeoffs. We show: (1) Information-theoretically, one can avoid error amplification and achieve $C=O(1)$. (2) Next-token prediction can be made robust to achieve $C=O(H)$, representing moderate error amplification, but this is an inherent barrier: any next-token prediction-style objective must suffer $C=\Omega(H)$. (3) For the natural testbed of autoregressive linear models, no computationally efficient algorithm can achieve sub-polynomial approximation factor $C=e^{(\log H)^{1-\Omega(1)}}$; however, at least for binary token spaces, one can smoothly trade compute for statistical power and improve on $C=\Omega(H)$ in sub-exponential time. Our results have consequences in the more general setting of imitation learning, where the widely-used behavior cloning generalizes next-token prediction.
Session: 10A - Language Models (Thursday 03 July 16:12–18:00)
Authors: Raman, Vinod; Tewari, Ambuj; Li, Jiaxun
Abstract:
We study generation through the lens of statistical learning theory. First, we abstract and formalize the results of Gold (1967), Angluin (1979), Angluin (1980) and Kleinberg and Mullainathan (2024) in terms of a binary hypothesis class defined over an abstract countable example space. Then, we extend the notion of “generation” from Kleinberg and Mullainathan (2024) to two new settings, which we call “uniform” and “non-uniform” generation, and provide a characterization of which hypothesis classes are uniformly and non-uniformly generatable. As is standard in learning theory, our characterizations are in terms of the finiteness of a new combinatorial dimension termed the Closure dimension. By doing so, we are able to compare generatability with predictability (captured via PAC and on-line learnability) and show that these two properties of hypothesis classes are incomparable – there are classes that are generatable but not predictable and vice versa. Finally, we extend our results to capture prompted generation and give a complete characterization of which classes are prompt generatable, generalizing some of the work by Kleinberg and Mullainathan (2024).
Session: 10A - Language Models (Thursday 03 July 16:12–18:00)
Authors: Papazov, Hristo; Flammarion, Nicolas
Abstract:
We study the problem of learning computable functions in the limit by extending Gold’s inductive inference framework to incorporate \textit{computational observations}. Traditional black-box learning is inherently limited, but we show that additional structured observations — such as runtime estimates (clock observations) or external traces of computation (behavior observations) — enable learnability. We establish that any time-bounded class of computable functions is learnable in the limit under black-box observations, while all computable functions can be learned with clock observations under a relaxed version of the Extended Church-Turing Thesis. Further, we build a formal framework around observations of \textit{computational agents} and show that learning computable functions from behavior reduces to learning rational functions from input and output, yielding a polynomial-time state-merging algorithm under specific constraints. On the negative side, we show that polynomial characteristic sets cannot exist for the class of all computable functions. Our results provide a unified perspective on the role of structured observations of computational models in inductive inference.
Session: 10A - Language Models (Thursday 03 July 16:12–18:00)
Authors: Foster, Dylan; Mhammedi, Zakaria; Rohatgi, Dhruv
Abstract:
Language model alignment (or reinforcement learning) techniques that leverage active exploration---deliberately encouraging the model to produce diverse, informative responses---offer the promise of super-human capabilities. However, current understanding of algorithm design primitives for computationally efficient exploration with language models is limited. To better understand how to leverage access to powerful pre-trained generative models to improve the efficiency of exploration, we introduce a new computational framework for RL with language models, in which the learner interacts with the model through a sampling oracle. Focusing on the linear softmax model parameterization, we provide new results that reveal the computational-statistical tradeoffs of efficient exploration: 1. [Necessity of coverage.] Coverage refers to the extent to which the pre-trained model covers near-optimal responses---a form of hidden knowledge. We show that coverage, while not necessary for data efficiency, lower bounds the runtime of any algorithm in our framework. 2. [Inference-time exploration.] We introduce a new algorithm, SpannerSampling, which obtains optimal data efficiency and is computationally efficient whenever the pre-trained model enjoys sufficient coverage, matching our lower bound. SpannerSampling leverages inference-time computation with the pre-trained model to reduce the effective search space for exploration. 3. [Insufficiency of training-time interventions.] We contrast (2) by showing that training-time interventions that produce proper policies cannot achieve similar guarantees in polynomial time. Finally, we show that under additional representational assumptions, one can achieve improved runtime (replacing sequence-level coverage with token-level coverage) through multi-turn exploration.
Session: 10A - Language Models (Thursday 03 July 16:12–18:00)
Authors: Blum, Avrim; Hanneke, Steve; Pabbaraju, Chirag; Saless, Donya
Abstract:
We consider a model for explainable AI in which an explanation for a prediction $h(x)=y$ consists of a subset $S'$ of the training data (if it exists) such that all classifiers $h'\in\mathcal{H}$ that make at most $b$ mistakes on $S'$ predict $h'(x)=y$. Such a set $S'$ serves as a {\em proof} that $x$ indeed has label $y$ under the assumption that (1) the true target function $h^\star$ belongs to $\mathcal{H}$, and (2) the set $S$ contains at most $b$ noisy or corrupted points. For example, if $b=0$ and $\mathcal{H}$ is the family of linear classifiers in $\mathbb{R}^d$, and if $x$ lies inside the convex hull of the positive data points in $S$ (and therefore every consistent linear classifier labels $x$ as positive), then Carath\'eodory's theorem states that $x$ in fact lies inside the convex hull of $d+1$ of those points. So, a set $S'$ of size $d+1$ could be released as an explanation for a positive prediction, and would serve as a short proof of correctness of the prediction under the assumption of perfect realizability. In this work, we consider this problem more generally, for general hypothesis classes $\mathcal{H}$ and general values $b\geq 0$. We define the notion of the {\em robust hollow star number} of $\mathcal{H}$ (which generalizes the standard hollow star number), and show that it precisely characterizes the worst-case size of the smallest certificate achievable, and analyze its size for natural classes. We also consider worst-case distributional bounds on certificate size, as well as {\em distribution-dependent} bounds that we show tightly control the sample size needed to get a certificate for any given test example. In particular, we define a notion of the {\em certificate coefficient} $\varepsilon_x$ of an example $x$ with respect to a data distribution $\mathcal{D}$ and target function $h^\star$, and prove matching upper and lower bounds on sample size as a function of $\varepsilon_x$, $b$, and the VC dimension $d$ of $\mathcal{H}$.
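The Carathéodory example in the abstract is concrete enough to code up. The sketch below (a minimal illustration under our own choices of solver and data, not the authors' procedure) finds a certificate for the $b=0$ linear-classifier case: the feasibility LP expressing $x$ as a convex combination of positive points has $d+1$ equality constraints, so a basic feasible solution has at most $d+1$ nonzero weights, and the support of those weights is exactly the short proof $S'$.

```python
# Minimal sketch (assuming scipy is available) of the b = 0 linear-classifier
# example from the abstract: if a test point x lies in the convex hull of the
# positive training points in R^d, Caratheodory's theorem promises a
# certificate of d + 1 points. A basic feasible solution of the LP below has
# at most d + 1 nonzero weights, one per equality constraint.
import numpy as np
from scipy.optimize import linprog

def caratheodory_certificate(pos_points: np.ndarray, x: np.ndarray):
    """Return indices of <= d + 1 positive points whose hull contains x,
    or None if x is outside the hull. pos_points has shape (m, d)."""
    m, d = pos_points.shape
    # Feasibility LP: find weights w >= 0 with  pos_points^T w = x, sum w = 1.
    A_eq = np.vstack([pos_points.T, np.ones((1, m))])   # (d + 1) x m
    b_eq = np.concatenate([x, [1.0]])
    res = linprog(c=np.zeros(m), A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * m,
                  method="highs-ds")       # simplex -> basic (vertex) solution
    if not res.success:
        return None
    return np.flatnonzero(res.x > 1e-9)

rng = np.random.default_rng(0)
S_pos = rng.normal(size=(50, 3))           # hypothetical positive examples in R^3
x = S_pos[:10].mean(axis=0)                # a point inside their convex hull
cert = caratheodory_certificate(S_pos, x)
print("certificate size:", len(cert), "(at most d + 1 = 4)")
```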
Session: 10A - Language Models (Thursday 03 July 16:12–18:00)
Authors: Charikar, Moses; Pabbaraju, Chirag
Abstract:
The recent work of Kleinberg and Mullainathan (2024) provides a concrete model for language generation in the limit: given a sequence of examples from an unknown target language, the goal is to generate new examples from the target language such that no incorrect examples are generated beyond some point. In sharp contrast to strong negative results for the closely related problem of language identification, they establish positive results for language generation in the limit for all countable collections of languages. Follow-up work by Li, Raman, and Tewari (2024) studies bounds on the number of distinct inputs required by an algorithm before correct language generation is achieved — namely, whether this is a constant for all languages in the collection (uniform generation) or a language-dependent constant (non-uniform generation). We show that every countable collection has a generator with the stronger property of non-uniform generation in the limit. However, while the generation algorithm of Kleinberg and Mullainathan (2024) can be implemented using membership queries, we show that no algorithm can non-uniformly generate, even for collections of just two languages, using only membership queries. We also formalize the tension between validity and breadth in the generation algorithm of Kleinberg and Mullainathan (2024) by introducing a definition of exhaustive generation, and show a strong negative result for exhaustive generation. Our result shows that a tradeoff between validity and breadth is inherent for generation in the limit. We also provide a precise characterization of the language collections for which exhaustive generation is possible. Finally, inspired by algorithms that can choose to obtain feedback, we consider a model of uniform generation with feedback, completely characterizing language collections for which such uniform generation with feedback is possible in terms of an abstract complexity measure of the collection.
Session: 10A - Language Models (Thursday 03 July 16:12–18:00)
Authors: Feldman, Vitaly; Kornowski, Guy; Lyu, Xin
Abstract:
Several recent works demonstrate that training of large language models leads to memorization of a significant fraction of training data. Such memorization can lead to privacy violations when training on sensitive user data and thus motivates the study of data memorization's role in learning. In this work we demonstrate that several simple and well-studied binary classification problems exhibit a trade-off between the number of samples available to a learning algorithm and the amount of information about the training data that a learning algorithm needs to memorize to be accurate. In particular, $\Omega(d)$ bits of information about the training data need to be memorized when a single $d$-dimensional example is available, which then decays as $\Theta(d/n)$ as the number of examples grows (for $n\leq \sqrt{d}$). Further, this rate is achieved (up to logarithmic factors) by simple learning algorithms. Our results build on the work of Brown et al. (2021), and establish a new framework for proving memorization lower bounds that is based on an approximate version of strong data processing inequalities.
Session: 10A - Language Models (Thursday 03 July 16:12–18:00)
Authors: Haris, Themistoklis; Onak, Krzysztof
Abstract:
A key limitation of autoregressive Transformers is the large memory needed at inference-time to cache all previous key-value (KV) embeddings. Prior works address this by compressing the KV cache but often assume specific structural properties of the embeddings. This raises the following natural question: Can truly sublinear space utilization be achieved without such assumptions? In this work, we answer this question in the negative. Any algorithm for attention-based token generation must use $\Theta(nd)$ space, where $n$ is the number of tokens generated so far and $d \geq \Omega(\log n)$ is the dimension of the KV embeddings. Our proof involves a reduction from a classic communication complexity problem and uses a randomized construction that leverages properties of projections in the spirit of the Johnson-Lindenstrauss lemma. For the low-dimensional regime $d = o(\log n)$, we show that any algorithm requires $\Omega(d e^d)$ space and prove, using tight bounds on covering numbers, that \textsc{SubGen}, proposed by Zandieh et al. (2024b), matches this bound. Further, we investigate how sparsity assumptions enable token generation in truly sublinear space, presenting impossibility results and proposing a new KV cache compression algorithm for sliding window attention when the value cache outside the window is unmasked. Finally, we analyze token generation’s time complexity, using an indistinguishability argument to prove that no non-adaptive algorithm can compute attention online in sublinear time for all tokens.
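For readers less familiar with the KV cache, the following minimal numpy sketch (ours, for intuition; it implements none of the paper's results) shows where the $\Theta(nd)$ footprint comes from: each generated token appends one key and one value vector, and attention at step $n$ reads all $n$ of them.

```python
# Minimal numpy sketch of naive autoregressive attention with a KV cache:
# the cache stores Theta(n * d) floats after n generation steps.
import numpy as np

d = 64                       # dimension of key/value embeddings
K_cache, V_cache = [], []    # each grows by one d-vector per step

def attend(q):
    """Softmax attention of query q over all cached key-value pairs."""
    K = np.stack(K_cache)                  # shape (n, d)
    V = np.stack(V_cache)                  # shape (n, d)
    scores = K @ q / np.sqrt(d)            # (n,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                     # (d,)

rng = np.random.default_rng(1)
for step in range(1, 6):
    # hypothetical per-token projections; a real model derives these
    # from the current hidden state
    K_cache.append(rng.normal(size=d))
    V_cache.append(rng.normal(size=d))
    _ = attend(rng.normal(size=d))
    print(f"step {step}: cache holds {2 * len(K_cache) * d} floats")
```

The lower bound in the abstract says this $nd$ footprint is unavoidable in general: without structural assumptions, no compression scheme can do asymptotically better.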
Time: Thursday 03 July 16:12–18:00
Session: 10B - Bandits and Causality (Thursday 03 July 16:12–18:00)
Authors: Cai, Yang; Kalavasis, Alkis; Mamali, Katerina; Mehrotra, Anay; Zampetakis, Manolis
Abstract:
Most of the widely used estimators of average treatment effect (ATE) in causal inference rely on the assumptions of unconfoundedness and overlap. Unconfoundedness requires that the observed covariates account for all correlations between the outcome and treatment. Overlap requires the existence of randomness in treatment decisions for all individuals. Nevertheless, there are many types of studies that frequently violate unconfoundedness or overlap, e.g., observational studies or studies with deterministic treatment decisions. In this paper, we initiate the study of general conditions that enable the identification of the average treatment effect, extending well beyond unconfoundedness and overlap. In particular, following the paradigm of learning theory, we provide an interpretable condition that is sufficient and nearly necessary for the identification of ATE. Moreover, this condition characterizes the identification of the average treatment effect on the treated (ATT) and can be used to characterize other treatment effects as well. To illustrate the utility of our condition, we present several well-studied scenarios where our condition is satisfied and, hence, we prove that ATE can be identified in regimes that prior works could not capture. For example, this holds for the models proposed by Tan (2006) and Rosenbaum (2002), and the Regression Discontinuity design model introduced by Thistlethwaite and Campbell (1960). For each of these scenarios, we also show that, under natural additional assumptions, ATE can be estimated from finite samples. We believe these findings open new avenues for bridging learning-theoretic insights and causal inference methodologies, particularly in observational studies with complex treatment mechanisms.
Session: 10B - Bandits and Causality (Thursday 03 July 16:12–18:00)
Authors: Black, Hadley; Mazumdar, Arya; Saha, Barna; Xu, Yinzhan
Abstract:
The graph reconstruction problem has been extensively studied under various query models. In this paper, we propose a new query model regarding the number of connected components, which is one of the most basic and fundamental graph parameters. Formally, we consider the problem of reconstructing an $n$-node $m$-edge graph with oracle queries of the following form: provided with a subset of vertices, the oracle returns the number of connected components in the induced subgraph. We show $\Theta(\frac{m \log n}{\log m})$ queries in expectation are both sufficient and necessary to adaptively reconstruct the graph. In contrast, we show that $\Omega(n^2)$ non-adaptive queries are required, even when $m = O(n)$. We also provide an $O(m\log n + n\log^2 n)$ query algorithm using only two rounds of adaptivity.
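To make the query model concrete, here is a toy implementation of the oracle itself (for intuition only; the paper's contribution is the reconstruction algorithm, not the oracle): given a vertex subset, count the connected components of the induced subgraph with union-find.

```python
# Toy implementation of the connected-components query oracle:
# given a vertex subset S, count connected components of the induced
# subgraph via union-find.
def components_oracle(edges, S):
    S = set(S)
    parent = {v: v for v in S}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]   # path halving
            v = parent[v]
        return v

    for u, v in edges:
        if u in S and v in S:
            ru, rv = find(u), find(v)
            if ru != rv:
                parent[ru] = rv             # union
    return sum(1 for v in S if find(v) == v)

# Example: a 5-cycle.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
print(components_oracle(edges, range(5)))      # 1: the whole cycle
print(components_oracle(edges, [1, 2, 3, 4]))  # 1: a path
print(components_oracle(edges, [1, 3]))        # 2: two isolated vertices
```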
Session: 10B - Bandits and Causality (Thursday 03 July 16:12–18:00)
Authors: Kim, Seok-Jin; Kim, Gi-Soo; Oh, Min-hwan
Abstract:
We study the semiparametric bandit problem, where the reward for each action depends on both a linear component and an unmodeled, potentially adversarial shift. This setting strictly generalizes classical linear bandits while capturing additional complexities frequently encountered in real-world systems. To the best of our knowledge, experimental design for semiparametric reward models had not been investigated in prior work. We develop a novel experimental design approach that, for the first time, enables both sharp regret bounds and exploration guarantees simultaneously in semiparametric bandits. Our proposed algorithm achieves a \(\tilde{\mathcal{O}}(\sqrt{dT})\) regret bound, matching known lower bounds for linear bandits with finite arms for the first time. Moreover, our approach achieves the first logarithmic regret under a positive suboptimality gap. We further establish the first exploration-based guarantees for semiparametric bandits, such as PAC and best arm identification guarantees. These theoretical advances are enabled by a refined, non-asymptotic analysis of orthogonalized regression that achieves the optimal \(\sqrt{d}\)-rate, paving the way for robust and efficient bandit algorithms in a broader class of problems.
Session: 10B - Bandits and Causality (Thursday 03 July 16:12–18:00)
Authors: Chase, Zachary; Mehalel, Idan
Abstract:
In binary ($0/1$) online classification with apple tasting feedback, the learner receives feedback only when predicting $1$. Besides some degenerate learning tasks, all previously known learning algorithms for this model are randomized. Consequently, prior to this work it was unknown whether deterministic apple tasting is generally feasible. In this work, we provide the first widely-applicable deterministic apple tasting learner, and show that in the realizable case, a hypothesis class is learnable if and only if it is deterministically learnable, confirming a conjecture of Raman, Subedi, Raman, and Tewari (2024). Quantitatively, we show that every class $H$ is learnable with mistake bound $O(\sqrt{L(H) T \log T})$ (where $L(H)$ is the Littlestone dimension of $H$), and that this is tight for some classes. This demonstrates a separation between a deterministic and a randomized learner, where the latter can learn every class with mistake bound $O(\sqrt{L(H)T})$, as shown in Raman et al. (2024). We further study the agnostic case, in which the best hypothesis makes at most $k$ mistakes, and prove a trichotomy stating that every class $H$ must be either easy, hard, or unlearnable. Easy classes have (both randomized and deterministic) mistake bound $\Theta_{H}(k)$. Hard classes have randomized mistake bound $\tilde{\Theta}_{H}(k + \sqrt{T})$, and deterministic mistake bound $\tilde{\Theta}_{H}(\sqrt{k \cdot T})$, where $T$ is the time horizon. Unlearnable classes have (both randomized and deterministic) mistake bound $\Theta(T)$. Our upper bound is based on a deterministic algorithm for learning from expert advice with apple tasting feedback, a problem interesting in its own right. For this problem, we show that the optimal deterministic mistake bound is $\Theta(\sqrt{T (k + \log n)})$ for all $k$ and $T \leq n \leq 2^T$, where $n$ is the number of experts. Our algorithm is a natural variation of the well-known exponential weights forecaster.
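For reference, the classical forecaster that the deterministic algorithm is described as varying is the exponential weights method for prediction with expert advice. The sketch below shows only the standard randomized, full-feedback version; it does not implement the paper's deterministic apple-tasting variant, where updates are possible only on rounds with predicted label $1$.

```python
# Classical randomized exponential weights forecaster (full feedback);
# shown as the baseline the abstract's algorithm builds on, not as the
# paper's method.
import numpy as np

rng = np.random.default_rng(2)

def exponential_weights(expert_preds, labels, eta):
    """expert_preds: (T, n) array of 0/1 expert advice; labels: (T,) truths."""
    T, n = expert_preds.shape
    log_w = np.zeros(n)                    # log-weights for numerical stability
    mistakes = 0
    for t in range(T):
        w = np.exp(log_w - log_w.max())
        p1 = w[expert_preds[t] == 1].sum() / w.sum()   # prob. of predicting 1
        pred = int(rng.random() < p1)
        mistakes += int(pred != labels[t])
        log_w -= eta * (expert_preds[t] != labels[t])  # full-information update
    return mistakes

T, n = 1000, 16
labels = rng.integers(0, 2, size=T)
experts = rng.integers(0, 2, size=(T, n))
experts[:, 0] = labels                     # plant one perfect expert
print("forecaster mistakes:", exponential_weights(experts, labels, eta=0.5))
```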
Session: 10B - Bandits and Causality (Thursday 03 July 16:12–18:00)
Authors: Whitehouse, Justin; Syrgkanis, Vasilis; Wilder, Bryan; Jung, Chris; Wu, Steven
Abstract:
Estimates of heterogeneous treatment effects such as conditional average treatment effects (CATEs) and conditional quantile treatment effects (CQTEs) play an important role in real-world decision making. Given this importance, one should ensure these estimators are calibrated. While there is a rich literature on calibrating estimators of non-causal parameters, very few methods have been derived for calibrating estimators of causal parameters, or more generally estimators of quantities involving nuisance parameters. In this work, we develop general algorithms for reducing the task of causal calibration to that of calibrating a standard (non-causal) predictive model. Throughout, we study a notion of calibration defined with respect to an arbitrary, nuisance-dependent loss $\ell$, under which we say an estimator $\theta$ is calibrated if its predictions cannot be changed on any level set to decrease loss. For losses $\ell$ satisfying a condition called \textit{universal orthogonality}, we present a simple algorithm that transforms partially-observed data into generalized pseudo-outcomes and applies any off-the-shelf calibration procedure. For losses $\ell$ satisfying a weaker assumption called conditional orthogonality, we provide a similar sample splitting algorithm that performs empirical risk minimization over an appropriately defined class of functions. Convergence of both algorithms follows from a generic, two-term upper bound on the calibration error of any model that decouples the error in estimating unknown nuisance parameters from the calibration error in a hypothetical world where the learned nuisances are true. We demonstrate the practical applicability of our results in experiments on both observational and synthetic data. Our results are exceedingly general, showing that essentially any existing calibration algorithm can be used in causal settings, with additional loss only arising from errors in nuisance estimation.
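One standard instantiation of the pseudo-outcome reduction is worth spelling out (a sketch under our own choices; the paper's generalized pseudo-outcomes cover a broader family of losses): transform the partially observed data with the doubly robust (AIPW) pseudo-outcome, whose conditional mean is the CATE whenever either nuisance is correct, and then calibrate the CATE estimator against these pseudo-outcomes with any off-the-shelf one-dimensional method, e.g. isotonic regression.

```python
# A minimal sketch of the pseudo-outcome reduction for CATE calibration,
# using the standard doubly robust (AIPW) transform; this is one
# illustrative instance, not the authors' general construction.
import numpy as np
from sklearn.isotonic import IsotonicRegression

def aipw_pseudo_outcomes(A, Y, mu0, mu1, e):
    """Doubly robust pseudo-outcome: its conditional mean given X equals
    the CATE whenever the outcome models or the propensity are correct."""
    return (mu1 - mu0
            + A * (Y - mu1) / e
            - (1 - A) * (Y - mu0) / (1 - e))

def calibrate_cate(cate_hat, A, Y, mu0, mu1, e):
    """Calibrate a CATE estimator's predictions against pseudo-outcomes
    with an off-the-shelf one-dimensional method (isotonic regression)."""
    phi = aipw_pseudo_outcomes(A, Y, mu0, mu1, e)
    iso = IsotonicRegression(out_of_bounds="clip")
    iso.fit(cate_hat, phi)
    return iso.predict(cate_hat)

# Toy usage with synthetic data and idealized (oracle) nuisances.
rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=n)
e = np.full(n, 0.5)
A = (rng.random(n) < e).astype(float)
tau = 1.0 + X                              # true CATE
Y = X + A * tau + rng.normal(size=n)
mu0, mu1 = X, X + tau                      # pretend-perfect outcome models
cate_hat = 0.5 * tau + 0.3                 # a deliberately miscalibrated estimator
print(calibrate_cate(cate_hat, A, Y, mu0, mu1, e)[:5])
```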
Session: 10B - Bandits and Causality (Thursday 03 July 16:12–18:00)
Authors: Hanneke, Steve; Shaeiri, Amirreza; Zhang, Qian
Abstract:
The seminal work of (Daniely et al., COLT 2011) introduced the problem of multiclass learning under bandit feedback and provided a combinatorial characterization of its learnability within the framework of PAC learning. In multiclass learning under bandit feedback, there is an unknown data distribution over an instance space $\mathcal{X}$ and a label space $\mathcal{Y}$ similar to classical multiclass learning, but the learner does not directly observe the correct labels of the i.i.d. training examples. Instead, during each round, the learner receives an example, makes a prediction for its label, and receives bandit feedback only indicating whether the prediction is correct. Despite this restriction, the goal remains the same as in classical multiclass learning. In the present work, we study the problem of multiclass learning under bandit feedback within the framework of \emph{universal learning} (Bousquet et al., STOC 2021). This makes it possible to study the behavior of learning curves. In the \emph{uniform learning} framework, no concept class $\mathcal{C}$ is learnable when the effective label space is unbounded. In contrast, and perhaps surprisingly, we demonstrate that the universal learnability of concept classes $\mathcal{C}$ gives rise to a rich theory even when the effective label space is unbounded. More concretely, our primary contribution is a theory that reveals an inherent trichotomy governing instance optimal learning curves in the realizable setting: the best achievable universal learning rate for any given concept class can only decay either at an \emph{exponential}, a \emph{linear}, or an \emph{arbitrarily slow} rate. In particular, the trichotomy is combinatorially characterized by the absence of an infinite multiclass Littlestone tree and the combination of an infinite Natarajan Littlestone tree and an infinite progressive Littlestone tree. Furthermore, we introduce novel learning algorithms for achieving instance optimal universal rates.
Session: 10B - Bandits and Causality (Thursday 03 July 16:12–18:00)
Authors: Maiti, Arnab; Fan, Zhiyuan; Farina, Gabriele; Jamieson, Kevin; Ratliff, Lillian
Abstract:
In this paper, we study the online shortest path problem in directed acyclic graphs (DAGs) under bandit feedback against an adaptive adversary. Given a DAG $G = (V, E)$ with a source node $v_{\mathsf{s}}$ and a sink node $v_{\mathsf{t}}$, let $\mathcal{X} \subseteq \{0,1\}^{|E|}$ denote the set of all paths from $v_{\mathsf{s}}$ to $v_{\mathsf{t}}$. At each round $t$, we select a path $\mathbf{x}_t \in \mathcal{X}$ and receive bandit feedback on our loss $\langle \mathbf{x}_t, \mathbf{y}_t \rangle \in [-1,1]$, where $\mathbf{y}_t$ is an adversarially chosen loss vector. Our goal is to minimize regret with respect to the best path in hindsight over $T$ rounds. We propose the first computationally efficient algorithm to achieve a near-minimax optimal regret bound of $\tilde{\mathcal{O}}(\sqrt{|E|T\log |\mathcal{X}|})$ with high probability against any adaptive adversary, where $\tilde{\mathcal{O}}(\cdot)$ hides logarithmic factors in the number of edges $|E|$. Our algorithm leverages a novel loss estimator and a centroid-based decomposition in a nontrivial manner to attain this regret bound. As an application, we show that our algorithm for DAGs provides state-of-the-art efficient algorithms for $m$-sets, extensive-form games, the Colonel Blotto game, shortest walks in directed graphs, hypercubes, and multi-task multi-armed bandits, achieving improved high-probability regret guarantees in all these settings.
Session: 10B - Bandits and Causality (Thursday 03 July 16:12–18:00)
Authors: Dai, Yan; Blanchard, Moise; Jaillet, Patrick
Abstract:
We study a repeated resource allocation problem with strategic agents where monetary transfers are disallowed and the central planner has no prior information on agents' utility distributions. In light of Arrow's impossibility theorem, acquiring information about agent preferences through some form of feedback is necessary. We assume that the central planner can request powerful but expensive audits on the winner in any round, revealing the true utility of the winner in that round. We design a mechanism achieving $T$-independent $\mathcal O(K^2)$ regret in social welfare while requesting $\mathcal O(K^3 \log T)$ audits in expectation, where $K$ is the number of agents and $T$ is the number of rounds. We also show an $\Omega(K)$ lower bound on the regret and an $\Omega(1)$ lower bound on the number of audits when having low regret. Algorithmically, we show that incentive-compatibility can be mostly enforced with an accurate estimation of the winning probability of each agent under truthful reporting. To do so, we impose future punishments and introduce a \emph{flagging} component, allowing agents to flag any biased estimate (we show that doing so aligns with individual incentives). On the technical side, without monetary transfers and distributional information, the central planner cannot ensure that truthful reporting is exactly an equilibrium. Instead, we characterize the equilibrium via a reduction to a simpler \emph{auxiliary game}, in which agents cannot strategize until late in the $T$ rounds of the allocation problem. The tools developed therein may be of independent interest for other mechanism design problems in which the revelation principle cannot be readily applied.
Session: 10B - Bandits and Causality (Thursday 03 July 16:12–18:00)
Authors: Cesa-Bianchi, Nicolo; Cesari, Tom; Colomboni, Roberto; Foscari, Luigi; Pathak, Vinayak
Abstract:
We consider a sequential decision-making setting where, at every round $t$, the learner (a market maker) posts a bid price $B_t$ and an ask price $A_t$ to an incoming trader (the taker) with a private valuation for some asset. If the trader's valuation is lower than the bid price, or higher than the ask price, then a trade (sell or buy) occurs. Letting $M_t$ be the market price (observed only at the end of round $t$), the maker's utility is $M_t-B_t$ if the maker bought the asset, it is $A_t-M_t$ if they sold it, and it is $0$ if no trade occurred. We characterize the maker's regret with respect to the best fixed choice of bid and ask pairs under a variety of assumptions (adversarial, i.i.d., and their variants) on the sequence of market prices and valuations. Our upper bound analysis unveils an intriguing connection relating market making to first-price auctions and dynamic pricing. Our main technical contribution is a lower bound for the i.i.d. case with Lipschitz distributions and independence between market prices and takers' valuations. The difficulty in the analysis stems from a unique relationship between the reward and feedback functions that allows learning algorithms to trade off reward for information in a continuous way.
Time: Friday 04 July 10:00–11:24
Session: 11A - Computational Complexity (Friday 04 July 10:00–11:24)
Authors: Li, Qian; Wang, Shuo; Zhang, Jiapeng
Abstract:
Space complexity in learning problems has received a lot of attention in recent years. In this direction, Brown, Bun, and Smith (COLT 2022) studied space complexity lower bounds for several natural learning problems under the \textit{one-pass streaming} setting. Assuming that the examples are sampled from $\{0,1\}^d$ and the optimal hypothesis can be encoded using $\kappa$ bits, they showed learning algorithms with constant error using a near-minimal number of examples, $\Tilde{O}(\kappa)$, require $\Tilde{\Omega}(d\kappa)$ bits of memory. Moreover, for a general number $N$ of examples, their memory lower bound takes the form $\Tilde{\Omega}(d\kappa\cdot \frac{\kappa}{N})$. However, as mentioned by Brown, Bun, and Smith (COLT 2022), the learning process often involves multiple passes over the data. Hence, it is equally important to study the space complexity in the \textit{multi-pass streaming} setting. The authors conjectured that similar lower bounds should apply but left it as an open problem. In this paper, we resolve this open problem by proving that any $L$-pass streaming algorithm using $N$ samples requires $\Tilde{\Omega}(d\kappa\cdot \frac{\kappa}{NL})$ bits of memory. Intuitively, our lower bound shows that a stream of $L\cdot N$ fresh examples is at least as useful as $L$ passes over $N$ examples. A key component of our approach is a lower bound on the information complexity of the \textsf{Bit-Bias$(p,q)$} problem in the multi-pass streaming setting, a basic problem that may have independent significance. In the \textsf{Bit-Bias$(p,q)$} problem, one sees a stream of $N$ i.i.d. random bits drawn from either \textsf{Bernoulli$(p)$} or \textsf{Bernoulli$(q)$}, and would like to distinguish the two cases. Our results not only extend the previous lower bound on \textsf{Bit-Bias$(0,1/2)$} by Brown, Bun, and Smith from the one-pass streaming setting to the more general multi-pass setting, but also cover more general values of $p$ and $q$.
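The \textsf{Bit-Bias$(p,q)$} task itself is easy to state and to solve in one pass with a single counter; the sketch below (ours, for intuition) does exactly that. The paper's contribution concerns how much memory such streaming strategies fundamentally need, which this sketch does not address.

```python
# The Bit-Bias(p, q) task from the abstract, solved in the obvious one-pass
# way: threshold the empirical mean at (p + q) / 2. By a Chernoff bound,
# N = O(1 / (p - q)^2) samples suffice for constant error.
import numpy as np

def bit_bias_decide(stream, p, q):
    """Return p or q, whichever Bernoulli parameter better explains the
    stream; uses only a single counter of O(log N) bits. Assumes p < q."""
    ones = sum(stream)                     # single pass, one counter
    return p if ones / len(stream) < (p + q) / 2 else q

rng = np.random.default_rng(3)
p, q, N = 0.45, 0.55, 2000
stream = (rng.random(N) < q).astype(int)   # truth: Bernoulli(q)
print("decided:", bit_bias_decide(stream, p, q))
```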
Session: 11A - Computational Complexity (Friday 04 July 10:00–11:24)
Authors: Kunisky, Dmitriy
Abstract:
We study when low coordinate degree functions (LCDF)---linear combinations of functions depending on small subsets of entries of a vector---can test for the presence of categorical structure, including community structure and generalizations thereof, in high-dimensional data. This complements recent results studying the power of LCDF in testing for continuous structure like real-valued signals corrupted by additive noise. We study a general form of stochastic block model (SBM), where a population is assigned random labels and every $p$-tuple generates an observation according to an arbitrary probability measure associated to the $p$ labels of its members. We show that the performance of LCDF admits a unified analysis for this class of models. As applications, we prove tight lower bounds against LCDF (and therefore also against low degree polynomials) for nearly arbitrary graph and regular hypergraph SBMs, always matching suitable generalizations of the Kesten-Stigum threshold. We also prove tight lower bounds for group synchronization and abelian group sumset problems under the "truth-or-Haar" noise model, and give an improved analysis of Gaussian multi-frequency group synchronization. In most of these models, for some parameter settings our lower bounds give new evidence for conjectural statistical-to-computational gaps. Finally, interpreting some of our findings, we propose a new analogy between categorical and continuous signals: a general SBM as above behaves qualitatively like a spiked $p^*$-tensor model of a certain order $p^*$ depending on the parameters of the SBM.
Session: 11A - Computational Complexity (Friday 04 July 10:00–11:24)
Authors: Huang, Han; Mossel, Elchanan
Abstract:
Broadcasting on trees is a fundamental model from statistical physics that plays an important role in information theory, noisy computation and phylogenetic reconstruction within computational biology and linguistics. While this model permits efficient linear-time algorithms for the inference of the root from the leaves, recent work suggests that non-trivial computational complexity may be required for inference. The inference of the root state can be performed using the celebrated Belief Propagation (BP) algorithm, which achieves Bayes-optimal performance. Although BP runs in linear time using real arithmetic operations, recent research indicates that it requires non-trivial computational complexity using more refined complexity measures. Moitra, Mossel, and Sandon demonstrated such complexity by constructing a Markov chain for which estimating the root better than random guessing (for typical inputs) is $NC^1$-complete. Kohler and Mossel constructed chains where, for trees with $N$ leaves, achieving better-than-random root recovery requires polynomials of degree $N^{\Omega(1)}$. The papers above raised the question of whether such complexity bounds hold generally below the celebrated Kesten-Stigum bound. In a recent work, Huang and Mossel established a general degree lower bound of $\Omega(\log N)$ below the Kesten-Stigum bound. Specifically, they proved that any function expressed as a linear combination of functions of at most $O(\log N)$ leaves has vanishing correlation with the root. In this work, we get an exponential improvement of this lower bound by establishing an $N^{\Omega(1)}$ degree lower bound, for any broadcast process in the whole regime below the Kesten-Stigum bound.
Session: 11A - Computational Complexity (Friday 04 July 10:00–11:24)
Authors: Chen, Xue; Zhou, Zhaienhe; Shu, Wenxuan
Abstract:
We consider sparse variants of the classical Learning Parities with random Noise (LPN) problem. Our main contribution is a new algorithmic framework that provides low-noise learning algorithms for both the Learning Sparse Parities (LSPN) problem and the sparse LPN problem. Unlike previous approaches to LSPN and sparse LPN, this framework has a simple structure that avoids fast matrix multiplication and tensor methods, so its algorithms are easy to implement and run in polynomial space. Let $n$ be the dimension, $k$ denote the sparsity, and $\eta$ be the noise rate. As a fundamental problem in computational learning theory, Learning Sparse Parities with Noise (LSPN) assumes the hidden parity is $k$-sparse. While a simple enumeration algorithm takes ${n \choose k}=O(n/k)^k$ time, previously known results still need ${n \choose k/2} = \Omega(n/k)^{k/2}$ time for any noise rate $\eta$. Our framework provides an LSPN algorithm that runs in time $O(\eta \cdot n/k)^k$ for any noise rate $\eta$, improving the state of the art for LSPN whenever $\eta \in (\sqrt{k/n},k/n)$. The sparse LPN problem is closely related to the classical problem of refuting random $k$-CSPs and has been widely used in cryptography as a hardness assumption. Unlike standard LPN, it samples random $k$-sparse vectors. Because the number of $k$-sparse vectors is ${n \choose k}
Session: 11A - Computational Complexity (Friday 04 July 10:00–11:24)
Authors: Bresler, Guy; Harbuzova, Alina
Abstract:
In this work, we show the first average-case reduction transforming the sparse Spiked Covariance Model into the sparse Spiked Wigner Model and as a consequence obtain the first computational equivalence result between two well-studied high-dimensional statistics models. Our approach leverages a new perturbation equivariance property for Gram-Schmidt orthogonalization, enabling removal of dependence in the noise while preserving the signal.
Session: 11A - Computational Complexity (Friday 04 July 10:00–11:24)
Authors: Li, Shuangping; Schramm, Tselil
Abstract:
We show that the shortest $s$-$t$ path problem has the overlap-gap property in (i) sparse $\mathbb{G}(n,p)$ graphs and (ii) complete graphs with i.i.d. Exponential edge weights. Furthermore, we demonstrate that in sparse $\mathbb{G}(n,p)$ graphs, shortest path is solved by $O(\log n)$-degree polynomial estimators, and a uniform approximate shortest path can be sampled in polynomial time. This constitutes the first example in which the overlap-gap property is not predictive of algorithmic intractability for a (non-algebraic) average-case optimization problem.
Time: Friday 04 July 10:00–11:24
Session: 11B - Concentration Inequalities (Friday 04 July 10:00–11:24)
Authors: Hanneke, Steve; Xu, Mingyue
Abstract:
The universal learning framework has been developed to obtain guarantees on learning rates that hold for any fixed distribution, which can be much faster than the rates that hold uniformly over all distributions. Given that the Empirical Risk Minimization (ERM) principle is fundamental in PAC theory and ubiquitous in practical machine learning, the recent work of Hanneke and Xu (2024) studied the universal rates of ERM for binary classification under the realizable setting. However, the assumption of realizability is too restrictive to hold in practice. Indeed, the majority of the literature on universal learning has focused on the realizable case, leaving the non-realizable case barely explored. In this paper, we consider the problem of universal learning by ERM for binary classification under the agnostic setting, where the ``learning curve'' reflects the decay of the excess risk as the sample size increases. We explore the possibilities of agnostic universal rates and reveal a compact trichotomy: there are three possible agnostic universal rates of ERM, being either $e^{-n}$, $o(n^{-1/2})$, or arbitrarily slow. We provide a complete characterization of which concept classes fall into each of these categories. Moreover, we also establish complete characterizations for the target-dependent universal rates as well as the Bayes-dependent universal rates.
Session: 11B - Concentration Inequalities (Friday 04 July 10:00–11:24)
Authors: Whitehouse, Justin; Ramdas, Aaditya; Wu, Steven
Abstract:
Self-normalized processes arise naturally in many learning-theoretic tasks. While self-normalized concentration has been extensively studied for scalar-valued processes, there is less work on multidimensional processes outside of the sub-Gaussian setting. In this work, we construct a general, self-normalized inequality for $\mathbb{R}^d$-valued processes that satisfy a simple yet broad ``sub-$\psi$'' tail condition, which generalizes assumptions based on cumulant generating functions. From this general inequality, we derive an upper law of the iterated logarithm for sub-$\psi$ vector-valued processes, which is tight up to small constants. We demonstrate applications in prototypical statistical tasks, such as parameter estimation in online linear regression and bounded mean estimation via a new (multivariate) empirical Bernstein concentration inequality.
Session: 11B - Concentration Inequalities (Friday 04 July 10:00–11:24)
Authors: Chornomaz, Bogdan; Moran, Shay; Waknine, Tom
Abstract:
We introduce and study the \emph{spherical dimension}, a natural topological relaxation of the VC dimension that unifies several results in learning theory where topology plays a key role in the proofs. The spherical dimension is defined by extending the set of realizable datasets (used to define the VC dimension) to the continuous space of realizable distributions. In this space, a shattered set of size $d$ (in the VC sense) is completed into a continuous object, specifically a $d$-dimensional sphere of realizable distributions. The spherical dimension is then defined as the dimension of the largest sphere in this space. Thus, the spherical dimension is at least the VC dimension. The spherical dimension serves as a common foundation for leveraging the Borsuk-Ulam theorem and related topological tools. We demonstrate the utility of the spherical dimension in diverse applications, including disambiguations of partial concept classes, reductions from classification to stochastic convex optimization, stability and replicability, and sample compression schemes. Perhaps surprisingly, we show that the open question posed by Alon, Hanneke, Holzman, and Moran (FOCS 2021) of whether there exist non-trivial disambiguations for halfspaces with margin is equivalent to the basic open question of whether the VC and spherical dimensions are finite together.
Session: 11B - Concentration Inequalities (Friday 04 July 10:00–11:24)
Authors: Akbari, Syed; Harrison-Trainor, Matthew
Abstract:
This paper is about the recent notion of computably probably approximately correct (PAC) learning, which lies between statistical learning theory, where there is no computational requirement on the learner, and efficient PAC learning, where the learner must be polynomially bounded. Examples have recently been given of hypothesis classes which are PAC learnable but not computably PAC learnable, but these hypothesis classes are unnatural or non-canonical in the sense that they depend on a numbering of proofs, formulas, or programs. We use the on-a-cone machinery from computability theory to prove that, under mild assumptions such as that the hypothesis class can be computably listed, any natural hypothesis class which is learnable must be computably learnable. Thus the counterexamples given previously are necessarily unnatural.
Session: 11B - Concentration Inequalities (Friday 04 July 10:00–11:24)
Authors: Bressan, Marco; Brukhim, Nataly; Cesa-Bianchi, Nicolo; Esposito, Emmanuel; Mansour, Yishay; Moran, Shay; Thiessen, Maximilian
Abstract:
In the multiclass PAC setting, even when full learnability is unattainable, meaningful information can often be extracted to guide predictions. However, classical learning theory has mainly focused on the dichotomy “learnable vs. non-learnable”, leaving notions of partial learnability largely unexplored. Indeed, even for a non-learnable class, a learner may still achieve partial success—for example, by making reliable predictions whenever the true label belongs to a fixed subset of the label space, even if it fails otherwise. Similarly, the rigid nature of PAC learnability makes it impossible to distinguish between classes where one can achieve favorable trade-offs between, say, false-positive and false-negative rates, and classes where such trade-offs are fundamentally unattainable. In a nutshell, standard PAC learnability precludes a fine-grained exploration of learnability. To overcome this limitation, we develop a fine-grained theory of PAC learnability. For any hypothesis class H, given a loss function (which quantifies the penalty for predicting ŷ instead of the true label y) and a target loss threshold z, our theory determines whether it is possible to achieve a loss of at most z. In contrast, classical PAC learning considers only the special case of the zero-one loss and z = 0, corresponding to a near perfect classification guarantee. We give a complete characterization of all attainable guarantees, captured by a finite family of combinatorial dimensions, which we term the J-cube dimensions of H. These dimensions are defined for every subset J of at least two labels. This extends the fundamental theorem of realizable PAC learning based on the VC dimension. In fact, our results hold in a more general multi-objective setting where we fully characterize the Pareto frontier of guarantees attainable for the class H.
Session: 11B - Concentration Inequalities (Friday 04 July 10:00–11:24)
Authors: Jafar, Sky; Asilis, Julian; Dughmi, Shaddin
Abstract:
We partly resolve an open question raised by Asilis et al. (2024): whether the algorithmic template of local regularization --- an intriguing generalization of explicit regularization, a.k.a. structural risk minimization --- suffices to learn all learnable multiclass problems. Specifically, we provide a negative answer to this question in the transductive model of learning. We exhibit a multiclass classification problem which is learnable in both the transductive and PAC models, yet cannot be learned transductively by any local regularizer. The corresponding hypothesis class, and our proof, are based on principles from cryptographic secret sharing. We outline challenges in extending our negative result to the PAC model, leaving open the tantalizing possibility of a PAC/transductive separation with respect to local regularization.
Time: Friday 04 July 14:30–15:30
Session: 12A - Online Algorithms (Friday 04 July 14:30–15:30)
Authors: Garber, Dan; Massalha, Mhna
Abstract:
We revisit Blackwell's celebrated approachability problem which considers a repeated vector-valued game between a player and an adversary. Motivated by settings in which the action set of the player or adversary (or both) is difficult to optimize over, for instance when it corresponds to the set of all possible solutions to some NP-Hard optimization problem, we ask what the player can guarantee \textit{efficiently} when having access to these sets only via approximation algorithms with ratios $\alpha_{\mathcal{X}} \geq 1$ and $1 \geq \alpha_{\mathcal{Y}} > 0$, respectively. Assuming the player has monotone preferences, i.e., that they do not prefer a vector-valued loss $\ell_1$ over $\ell_2$ if $\ell_2 \leq \ell_1$, we establish that given a Blackwell instance with an approachable target set $S$, the downward closure of the appropriately-scaled set $\alpha_{\mathcal{X}}\alpha_{\mathcal{Y}}^{-1}S$ is \textit{efficiently} approachable with optimal rate. In case only the player's or adversary's set is equipped with an approximation algorithm, we give simpler and more efficient algorithms.
Session: 12A - Online Algorithms (Friday 04 July 14:30–15:30)
Authors: Dann, Christoph; Mansour, Yishay; Mohri, Mehryar; Schneider, Jon; Sivan, Balasubramanian
Abstract:
Abernethy et al.\ (2011) showed that Blackwell approachability and no-regret learning are equivalent, in the sense that any algorithm that solves a specific Blackwell approachability instance can be converted to a sublinear regret algorithm for a specific no-regret learning instance, and vice versa. In this paper, we study a more fine-grained form of such reductions, and ask when this translation between problems preserves not only a sublinear rate of convergence, but also the optimal rate of convergence. That is, in which cases does it suffice to find the optimal regret bound for a no-regret learning instance in order to find the optimal rate of convergence for a corresponding approachability instance? We show that the reduction of Abernethy et al.\ (2011) does not preserve rates: their reduction may reduce a $d$-dimensional approachability instance $\mathcal{I}_1$ with optimal convergence rate $R_1$ to a no-regret learning instance $\mathcal{I}_2$ with optimal regret-per-round of $R_2$, with $R_{2}/R_{1}$ arbitrarily large (in particular, it is possible that $R_1 = 0$ and $R_{2} > 0$). On the other hand, we show that it is possible to tightly reduce any approachability instance to an instance of a generalized form of regret minimization we call \emph{improper $\phi$-regret minimization} (a variant of the $\phi$-regret minimization of Gordon et al.\ (2008)). Finally, we characterize when linear transformations suffice to reduce improper $\phi$-regret minimization problems to standard classes of regret minimization problems (such as external regret minimization and proper $\phi$-regret minimization) in a rate-preserving manner. We prove that some improper $\phi$-regret minimization instances cannot be reduced to either subclass of instance in this way, suggesting that approachability can capture some problems that cannot be easily phrased in the standard language of online learning.
Session: 12A - Online Algorithms (Friday 04 July 14:30–15:30)
Authors: Tal, Hadar; Sabag, Oron
Abstract:
We study the Online Bookmaking problem, where a bookmaker dynamically updates betting odds on the possible outcomes of an event. In each betting round, the bookmaker has the opportunity to adjust odds based on the cumulative betting behavior, with the aim of mitigating risk. In a worst-case setting, defined by an arbitrary sequence of bets and an arbitrary event outcome, we show that the bookmaker’s optimal loss is characterized as the largest root of a simple polynomial for any number of betting rounds and any number of possible outcomes. Our solution shows that bookmakers can be as fair as desired while still guaranteeing a gain, and the explicit characterization reveals an intriguing relation between the bookmaker’s regret and Hermite polynomials. We also develop an efficient algorithm that computes the optimal bookmaking strategy: when facing an optimal gambler, the algorithm achieves the optimal loss, and in rounds where the gambler is suboptimal, it reduces the achieved loss to the opportunistic optimal loss, a notion that is related to subgame perfect Nash equilibrium. The key technical contribution to achieve these results is an explicit characterization of the Bellman-Pareto Frontier, which unifies the dynamic programming updates for Bellman’s value function with the multi-criteria optimization framework of the Pareto frontier in the context of vector repeated games.
Session: 12A - Online Algorithms (Friday 04 July 14:30–15:30)
Authors: Ryabchenko, Alexander; Attias, Idan; Roy, Daniel
Abstract:
We study online learning with oblivious losses and delays under a novel ``capacity constraint'' that limits how many past rounds can be tracked simultaneously for delayed feedback. Under ``clairvoyance'' (i.e., delays are revealed immediately) and/or ``preemptibility'' (i.e., we may stop tracking a loss we initially chose to track), we establish matching upper and lower bounds (up to logarithmic terms) on achievable regret, characterizing the ``optimal capacity'' needed to match the minimax rates of standard delayed learning, which implicitly assume unlimited capacity. Our algorithms achieve minimax-optimal regret across all capacity levels, with performance gracefully degrading under suboptimal capacity. For $K$ actions and total delay $D$ over $T$ rounds, under the clairvoyant setting and assuming capacity $C = \Omega(\log(T))$, we achieve regret $\widetilde{\Theta}(\sqrt{TK + D\log(K) + DK/C})$ for bandits and $\widetilde{\Theta}(\sqrt{(D+T)\log(K)})$ for full-information feedback. Replacing clairvoyance with preemptibility requires a known delay bound $d_{\max}$, adding $\smash{\widetilde{O}(d_{\max})}$ to the regret. For fixed delays $d$, i.e., where $D = Td$, minimax regret is $\Theta(\sqrt{TK(1+d/C) + Td\log(K)})$ for bandits and $\Theta(\sqrt{T(d+1)\log(K)})$ for full-information. The optimal capacity is precisely $\Theta(\min\{K/\log(K), d\})$ for bandits and $\Theta(1)$ for full-information. Our upper bounds are achieved using novel preemptive and non-preemptive schedulers based on Pareto-distributed proxy delays for round-dependent delays, and a batching technique for fixed delays. Crucially, our work unifies delayed bandits, label-efficient sampling, and scheduling theory, demonstrating that robust online learning under delays is possible with surprisingly modest tracking capacity.
Session: 12A - Online Algorithms (Friday 04 July 14:30–15:30)
Authors: JIN, JIKAI; Syrgkanis, Vasilis
Abstract:
Average treatment effect estimation is the most central problem in causal inference, with applications across numerous disciplines. While many estimation strategies have been proposed in the literature, the statistical optimality of these methods has remained an open area of investigation, especially in regimes where these methods do not achieve parametric rates. In this paper, we adopt the recently introduced structure-agnostic framework of statistical lower bounds, which poses no structural properties on the nuisance functions other than access to black-box estimators that achieve some statistical estimation rate. This framework is particularly appealing when one is only willing to consider estimation strategies that use non-parametric regression and classification oracles as black-box sub-processes. Within this framework, we prove the statistical optimality of the celebrated and widely used doubly robust estimators for both the Average Treatment Effect (ATE) and the Average Treatment Effect on the Treated (ATT), as well as weighted variants of the former, which arise in policy evaluation.
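For concreteness, the doubly robust (AIPW) estimator of the ATE whose optimality is established has the following standard form (notation ours): with black-box, typically cross-fitted, outcome regressions $\widehat{\mu}_0, \widehat{\mu}_1$ and propensity estimate $\widehat{e}$,

```latex
\[
\widehat{\tau}_{\mathrm{DR}}
  = \frac{1}{n} \sum_{i=1}^{n}
    \left[ \widehat{\mu}_1(X_i) - \widehat{\mu}_0(X_i)
    + \frac{A_i \left( Y_i - \widehat{\mu}_1(X_i) \right)}{\widehat{e}(X_i)}
    - \frac{(1 - A_i) \left( Y_i - \widehat{\mu}_0(X_i) \right)}{1 - \widehat{e}(X_i)}
    \right].
\]
```

It remains consistent if either set of nuisance estimates is consistent, hence ``doubly robust''; the abstract's result says that, in the structure-agnostic sense, no estimator built on the same black-box oracles can do better.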
Time: Friday 04 July 14:30–15:30
Session: 12B - Quantum (Friday 04 July 14:30–15:30)
Authors: Chen, Sitan; de dios Pont, Jaume; Hsieh, Jun-Ting; Huang, Hsin-Yuan; Lange, Jane; Li, Jerry
Abstract:
We investigate the problem of predicting the output behavior of unknown quantum channels. Given query access to an $n$-qubit channel $\mathcal{E}$ and an observable $\mathcal{O}$, we aim to learn the mapping \begin{equation*} \rho \mapsto \operatorname{tr}(\mathcal{O} \mathcal{E}[\rho]) \end{equation*} to within a small error for most $\rho$ sampled from a distribution $\mathcal{D}$. Previously, Huang, Chen, and Preskill proved a surprising result that even if $\mathcal{E}$ is arbitrary, this task can be solved in time roughly $n^{O(\log(1/\epsilon))}$, where $\epsilon$ is the target prediction error. However, their guarantee applied only to input distributions $\mathcal{D}$ invariant under all single-qubit Clifford gates, and their algorithm fails for important cases such as general product distributions over product states $\rho$. In this work, we propose a new approach that achieves accurate prediction over essentially any product distribution $\mathcal{D}$, provided it is not "classical" in which case there is a trivial exponential lower bound. Our method employs a "biased Pauli analysis", analogous to classical biased Fourier analysis. Implementing this approach requires overcoming several challenges unique to the quantum setting, including the lack of a basis with appropriate orthogonality properties. The techniques we develop to address these issues may have broader applications in quantum information.
Session: 12B - Quantum (Friday 04 July 14:30–15:30)
Authors: Chen, Kean; Wang, Qisheng
Abstract:
As often emerges in various basic quantum properties such as entropy, the trace of quantum state powers $\operatorname{tr}(\rho^q)$ has attracted a lot of attention. The recent work of Liu and Wang (SODA 2025) showed that $\operatorname{tr}(\rho^q)$ can be estimated to within additive error $\varepsilon$ with a dimension-independent sample complexity of $\widetilde O(1/\varepsilon^{3+\frac{2}{q-1}})$ for any constant $q > 1$, where only an $\Omega(1/\varepsilon)$ lower bound was given. In this paper, we significantly improve the sample complexity of estimating $\operatorname{tr}(\rho^q)$ in both the upper and lower bounds. In particular: - For $q > 2$, we settle the sample complexity with matching upper and lower bounds $\widetilde \Theta(1/\varepsilon^2)$. - For $1 < q < 2$, we provide an upper bound $\widetilde O(1/\varepsilon^{\frac{2}{q-1}})$, with a lower bound $\Omega(1/\varepsilon^{\max\{\frac{1}{q-1}, 2\}})$ for dimension-independent estimators, implying there is only room for a quadratic improvement. Our upper bounds are obtained by (non-plug-in) quantum estimators based on weak Schur sampling, in sharp contrast to the prior approach based on quantum singular value transformation and samplizer.
Session: 12B - Quantum (Friday 04 July 14:30–15:30)
Authors: Narayanan, Shyam
Abstract:
We give an improved algorithm for learning a quantum Hamiltonian given copies of its Gibbs state, that can succeed at any temperature. Specifically, we improve over the work of Bakshi, Liu, Moitra, and Tang (2024), by reducing the sample complexity and runtime dependence to singly exponential in the inverse-temperature parameter, as opposed to doubly exponential. Our main technical contribution is a new flat polynomial approximation to the exponential function, with significantly lower degree than the flat polynomial approximation used in Bakshi et al.
Session: 12B - Quantum (Friday 04 July 14:30–15:30)
Authors: Vasconcelos, Francisca; Huang, Hsin-Yuan
Abstract:
The seminal work of [LMN93] established a cornerstone result for classical complexity, with profound implications for learning theory. By proving low-degree Fourier concentration of AC^0, the work demonstrated that Boolean functions computed by constant-depth circuits can be efficiently PAC-learned via low-degree Fourier sampling. This breakthrough provided the first sample- and time-efficient (quasi-polynomial) algorithm for learning AC^0. Proposed by [Moore99] as a natural quantum analog of AC^0, QAC^0 is the class of constant-depth quantum circuits composed of arbitrary single-qubit gates and polynomially many CZ gates of unbounded width. In this work, we present the first algorithm for efficient average-case learning of QAC^0 circuits with logarithmic ancilla. Namely, our algorithm achieves quasi-polynomial sample- and time-complexity for learning unknown QAC^0 unitaries to inverse-polynomially small error. We further show that these learned unitaries can be efficiently synthesized via poly-logarithmic depth circuits, making progress towards proper learning of QAC^0.
Time: Friday 04 July 16:00–17:36
Session: 13A - Algorithms (Friday 04 July 16:00–17:36)
Authors: Yang, Yichun; Li, Rong-Hua; Liao, Meihao; Wang, Guoren
Abstract:
Effective Resistance (ER) is a fundamental tool in various graph learning tasks. In this paper, we address the problem of efficiently approximating ER on a graph $\mathcal{G}=(\mathcal{V},\mathcal{E})$ with $n$ vertices and $m$ edges. First, we focus on local online-computation algorithms for ER approximation, aiming to improve the dependency on the approximation error parameter $\epsilon$. Specifically, for a given vertex pair $(s,t)$, we propose a local algorithm with a time complexity of $\tilde{O}(\sqrt{d}/\epsilon)$ to compute an $\epsilon$-approximation of the $s,t$-ER value for expander graphs, where $d=\min \{d_s,d_t\}$. This improves upon the previous state-of-the-art, including an $\tilde{O}(1/\epsilon^2)$ time algorithm based on random walk sampling by Andoni et al. (ITCS'19) and Peng et al. (KDD'21). Our method achieves this improvement by combining deterministic search with random walk sampling to reduce variance. Second, we establish a lower bound for ER approximation on expander graphs. We prove that for any $\epsilon\in (0,1)$, there exist an expander graph and a vertex pair $(s,t)$ such that any local algorithm requires at least $\Omega(1/\epsilon)$ time to compute the $\epsilon$-approximation of the $s,t$-ER value. Finally, we extend our techniques to index-based algorithms for ER computation. We propose an algorithm with $\tilde{O}(\min \{m+n/\epsilon^{1.5},\sqrt{nm}/\epsilon\})$ processing time, $\tilde{O}(n/\epsilon)$ space complexity and $O(1)$ query complexity, which returns an $\epsilon$-approximation of the $s,t$-ER value for any $s,t\in \mathcal{V}$ for expander graphs. Our approach improves upon the state-of-the-art $\tilde{O}(m/\epsilon)$ processing time by Dwaraknath et al. (NeurIPS'24) and the $\tilde{O}(m+n/\epsilon^2)$ processing time by Li and Sachdeva (SODA'23).
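As a reference point for the quantity being approximated (ours, for checking; the paper's algorithms avoid exactly this cubic-time computation), the $s,t$-ER value has the classical closed form $(e_s - e_t)^\top L^{+} (e_s - e_t)$ in terms of the Laplacian pseudoinverse:

```python
# Brute-force reference (not the paper's local algorithm): the s,t-ER value
# equals (e_s - e_t)^T L^+ (e_s - e_t), where L is the graph Laplacian and
# L^+ its Moore-Penrose pseudoinverse. Cubic time, but handy for checking
# fast approximations on small graphs.
import numpy as np

def effective_resistance(n, edges, s, t):
    L = np.zeros((n, n))
    for u, v in edges:
        L[u, u] += 1.0
        L[v, v] += 1.0
        L[u, v] -= 1.0
        L[v, u] -= 1.0
    Lpinv = np.linalg.pinv(L)              # O(n^3); reference only
    chi = np.zeros(n)
    chi[s], chi[t] = 1.0, -1.0
    return float(chi @ Lpinv @ chi)

# Sanity check: two unit-resistance edges in series give ER(0, 2) = 2.
print(effective_resistance(3, [(0, 1), (1, 2)], 0, 2))  # ~2.0
```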
Session: 13A - Algorithms (Friday 04 July 16:00–17:36)
Authors: Bakshi, Ainesh; Cohen-Addad, Vincent; Hopkins, Sam; Jayaram, Rajesh; Lattanzi, Silvio
Abstract:
Metric embeddings are a widely used method in algorithm design, where generally a ``complex'' metric is embedded into a simpler, lower-dimensional one. Historically, the theoretical computer science community has focused on bi-Lipschitz embeddings, which guarantee that every pairwise distance is approximately preserved. In contrast, alternative embedding objectives that avoid bi-Lipschitz distortion are commonly used in practice to map points to lower dimensions, yet these approaches have received comparatively less study in theory. In this paper, we focus on one such objective, Multi-dimensional Scaling (MDS), which embeds an $n$-point metric into low-dimensional Euclidean space. MDS is widely used as a data visualization tool in the social and biological sciences, statistics, and machine learning. Given a set of non-negative dissimilarities $\{d_{i,j}\}_{i,j \in [n]}$ over $n$ points (which may or may not form a metric), the goal is to find an embedding $\{x_1,\dots,x_n\} \subset \mathbb{R}^k$ that minimizes \[ \mathrm{OPT} = \min_{x} \mathop{\mathbb{E}}_{i,j \in [n]} \left[ \left(1-\frac{\|x_i - x_j\|}{d_{i,j}}\right)^2 \right]. \] Despite its popularity, our theoretical understanding of MDS is extremely limited. Recently, Demaine, Hesterberg, Koehler, Lynch, and Urschel gave the first approximation algorithm with provable guarantees for this objective, which achieves an embedding in constant-dimensional Euclidean space with cost $\mathrm{OPT} + \epsilon$ in $n^2 \cdot 2^{\mathrm{poly}(\Delta/\epsilon)}$ time, where $\Delta$ is the aspect ratio of the input dissimilarities. For metrics that admit low-cost embeddings, the aspect ratio $\Delta$ scales polynomially in $n$. In this work, we give the first approximation algorithm for MDS with quasi-polynomial dependency on $\Delta$: for constant-dimensional Euclidean space, we achieve a solution with cost $\tilde{\mathcal{O}}(\log \Delta) \cdot \mathrm{OPT}^{\Omega(1)} + \epsilon$ in time $n^{\mathcal{O}(1)} \cdot 2^{\mathrm{poly}(\log(\Delta)/\epsilon)}$.
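The MDS objective above can at least be locally minimized by plain gradient descent, which is a useful way to build intuition for it. The toy sketch below is ours and carries none of the paper's approximation guarantees; their algorithm is entirely different.

```python
# Toy local optimizer for the MDS objective: gradient descent on the mean
# of (1 - ||x_i - x_j|| / d_ij)^2 over pairs i < j. Illustration only.
import numpy as np

def mds_gradient_descent(D, k=2, steps=800, lr=0.3, seed=0):
    n = D.shape[0]
    X = np.random.default_rng(seed).normal(size=(n, k))
    n_pairs = n * (n - 1) / 2
    for _ in range(steps):
        grad = np.zeros_like(X)
        for i in range(n):
            for j in range(i + 1, n):
                diff = X[i] - X[j]
                r = np.linalg.norm(diff) + 1e-12
                # gradient of (1 - r / d_ij)^2 with respect to x_i
                g = -2 * (1 - r / D[i, j]) / D[i, j] * diff / r
                grad[i] += g
                grad[j] -= g
        X -= lr * grad / n_pairs
    return X

def stress(X, D):
    n = X.shape[0]
    return float(np.mean([(1 - np.linalg.norm(X[i] - X[j]) / D[i, j]) ** 2
                          for i in range(n) for j in range(i + 1, n)]))

# Dissimilarities of 8 points on a circle embed (near-)perfectly in the plane.
n = 8
angles = 2 * np.pi * np.arange(n) / n
P = np.c_[np.cos(angles), np.sin(angles)]
D = np.linalg.norm(P[:, None] - P[None, :], axis=-1) + np.eye(n)  # diagonal unused
X = mds_gradient_descent(D, k=2)
print("final stress:", stress(X, D))
```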
Session: 13A - Algorithms (Friday 04 July 16:00–17:36)
Authors: Gu, Yuzhou; Li, Xin; Xu, Yinzhan
Abstract:
In the noisy query model, the (binary) return value of every query (possibly repeated) is independently flipped with some fixed probability $p \in (0, 1/2)$. In this paper, we obtain tight bounds on the noisy query complexity of several fundamental problems. Our first contribution is to show that any Boolean function with total influence $\Omega(n)$ has noisy query complexity $\Theta(n\log n)$. Previous works often focus on specific problems, and it is of great interest to have a characterization of noisy query complexity for general functions. Our result is the first noisy query complexity lower bound of this generality, beyond what was known for random Boolean functions (Reischuk and Schmeltz, FOCS 1991). Our second contribution is to prove that Graph Connectivity has noisy query complexity $\Theta(n^2 \log n)$. In this problem, the goal is to determine whether an undirected graph is connected, where each query asks for the existence of an edge in the graph. A simple algorithm can solve the problem with error probability $o(1)$ using $O(n^2 \log n)$ noisy queries, but no non-trivial lower bounds were known prior to this work. Last but not least, we determine the exact number of noisy queries (up to lower order terms) needed to solve the $k$-Threshold problem and the Counting problem. The $k$-Threshold problem asks to decide whether there are at least $k$ ones among $n$ bits, given noisy query access to the bits. We prove that $(1\pm o(1)) \frac{n\log (\min\{k,n-k+1\}/\delta)}{(1-2p)\log \frac{1-p}p}$ queries are both sufficient and necessary to achieve error probability $\delta = o(1)$. Previously, such a result was only known when $\min\{k,n-k+1\}=o(n)$ (Wang, Ghaddar, Zhu and Wang, arXiv 2024). We also show a similar $(1\pm o(1)) \frac{n\log (\min\{k+1,n-k+1\}/\delta)}{(1-2p)\log \frac{1-p}p}$ bound for the Counting problem, where one needs to count the number of ones among $n$ bits given noisy query access and $k$ denotes the answer.
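As a point of reference for the $\Theta(n\log n)$-type bounds, the standard repetition baseline reads each bit by a majority vote over $O(\log(n/\delta)/(1-2p)^2)$ noisy queries; the paper's contribution is pinning down the exact constants, which this sketch does not attempt. Names and the crude Chernoff-style repetition count are illustrative.

```python
import random
from math import ceil, log

def noisy_query(bit, p, rng):
    # Each query's answer is flipped independently with probability p.
    return bit ^ (rng.random() < p)

def denoised_read(bit, p, delta, rng):
    # Majority vote drives the per-bit error below delta (conservative count).
    reps = ceil(8 * log(1 / delta) / (1 - 2 * p) ** 2) | 1   # force odd
    votes = sum(noisy_query(bit, p, rng) for _ in range(reps))
    return int(2 * votes > reps)

def k_threshold(bits, k, p, delta=0.01, seed=0):
    # Decide whether at least k of the n bits are one; a union bound over
    # the n reads gives overall error at most delta, using O(n log(n/delta))
    # noisy queries in total.
    rng = random.Random(seed)
    reads = (denoised_read(b, p, delta / len(bits), rng) for b in bits)
    return sum(reads) >= k
```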
Session: 13A - Algorithms (Friday 04 July 16:00–17:36)
Authors: Pilliat, Emmanuel
Abstract:
Crowdsourcing involves aggregating meaningful information from partial and noisy data provided by a pool of $n$ workers across $d$ tasks. Traditional models, such as the Dawid-Skene model, assume that workers' abilities are independent of tasks, limiting their applicability in real-world scenarios where worker ability often varies significantly across tasks. Recent advances have proposed permutation-based models, which relax these assumptions by imposing only isotonicity constraints on worker abilities. In this work, we study a permutation-based model where each worker $i$ has an ability $M_{ik}$ to recover a binary label $x_k^*\in\{-1,1\}$ for task $k$. The ability matrix $M$ is assumed to be isotonic up to a permutation of its rows, and only a fraction $\lambda$ of the worker-task pairs is observed. We focus on three primary objectives: recovering the true labels, ranking the workers, and estimating the ability matrix $M$. We introduce a polynomial-time and minimax optimal procedure to recover the labels, contradicting a conjecture in the literature regarding the existence of a statistical-computational gap for this problem. Additionally, building on the literature on ranking, we further introduce a polynomial-time procedure to rank the workers and to estimate their abilities. Notably, we show that ranking the workers or estimating their abilities is no harder when the true labels are unknown than when they are known, within the main regimes of interest in the isotonic model.
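A useful baseline to keep in mind is plain majority voting per task, which ignores both the permutation structure and the differing worker abilities that the paper's procedure exploits; the array names below are illustrative.

```python
import numpy as np

def majority_vote_labels(Y, observed):
    # Y[i, k] in {-1, +1} is worker i's answer on task k; observed[i, k]
    # marks the lambda-fraction of worker-task pairs actually seen.
    votes = np.where(observed, Y, 0).sum(axis=0)
    return np.where(votes >= 0, 1, -1)   # ties broken toward +1
```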
Session: 13A - Algorithms (Friday 04 July 16:00–17:36)
Authors: Brukhim, Nataly; Pacchiano, Aldo; Dudik, Miroslav; Schapire, Robert
Abstract:
We study the task of bandit learning, also known as best-arm identification, under the assumption that the true reward function $f$ belongs to a known, but arbitrary, function class $F$. While many instances of this problem are well understood, we seek a general theory of bandit learnability, akin to the PAC model for classification. Our investigation is guided by the following two fundamental questions: (1) which classes $F$ are learnable, and (2) how they are learnable. For example, in the case of binary PAC classification, learnability is fully determined by a combinatorial dimension, i.e., the VC dimension, and can be attained via a simple algorithmic principle, i.e., Empirical Risk Minimization (ERM). In contrast to classic learning-theoretic results, our findings reveal fundamental limitations to learning in structured bandits, offering new insights into the boundaries of bandit learnability. First, for the question of ``which'', we show that the paradigm of identifying the learnable via a dimension-like quantity fails for bandit learning. We give a simple proof demonstrating that no combinatorial dimension can characterize bandit learnability, even in finite classes, following a standard definition of dimension introduced by Ben-David et al. (2019). For the question of ``how'', we prove a computational hardness result: we construct a reward function class for which at most two queries are needed to find the optimal action, yet no algorithm can do so in polynomial time, unless RP=NP. Perhaps surprisingly, we also prove that this class admits efficient algorithms for standard (yet possibly hard) algorithmic operations often considered in learning theory, such as ERM. This implies that the computational hardness is in this case inherent to the task of bandit learning. Beyond these results, we investigate additional themes such as learning under noise, trade-offs between noise models, and the relationship between query complexity and regret minimization.
Session: 13A - Algorithms (Friday 04 July 16:00–17:36)
Authors: Hintze, Lukas; Krieg, Lena; Scheftelowitsch, Olga; Zhu, Haodong
Abstract:
In group testing, the task is to identify defective items by testing groups of them together using as few tests as possible. We consider the setting where each item is defective with a constant probability $\alpha$, independent of all other items. In the (over-)idealized noiseless setting, tests are positive exactly if any of the tested items are defective. We study a more realistic model in which observed test results are subject to noise, i.e., tests can display false positive or false negative results with constant positive probabilities. We determine precise constants $c$ such that $cn\log n$ tests are required to recover the infection status of every individual for both adaptive and non-adaptive group testing: in the former, the selection of groups to test can depend on previously observed test results, whereas it cannot in the latter. Additionally, for both settings, we provide efficient algorithms that identify all defective items with the optimal number of tests with high probability. Thus, we completely solve the problem of binary noisy group testing in the studied setting.
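The noise model itself is easy to state in code. Below is a toy simulator together with a simple separate-decoding heuristic (flag an item when the tests containing it come back positive unusually often); the heuristic and all names are illustrative and are not the optimal adaptive/non-adaptive schemes of the paper.

```python
import numpy as np

def run_noisy_tests(pools, defective, fp, fn, rng):
    # pools: (T, n) 0/1 design matrix; a test is truly positive iff its pool
    # contains a defective item, and the observed outcome is then flipped
    # with probability fp (false positive) or fn (false negative).
    truth = (pools @ defective) > 0
    flip_prob = np.where(truth, fn, fp)
    return truth ^ (rng.random(len(truth)) < flip_prob)

def separate_decode(pools, results, threshold):
    # Flag item i when the positive rate among tests containing i exceeds
    # the threshold.
    contains = pools.astype(bool)
    rates = np.array([results[contains[:, i]].mean() if contains[:, i].any()
                      else 0.0 for i in range(pools.shape[1])])
    return rates > threshold
```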
Session: 13A - Algorithms (Friday 04 July 16:00–17:36)
Authors: Garg, Sachin; Derezinski, Michal
Abstract:
The Nyström method is a popular low-rank approximation technique for large matrices that arise in kernel methods and convex optimization. Yet, when the data exhibits heavy-tailed spectral decay, the effective dimension of the problem often becomes so large that even the Nyström method may be outside of our computational budget. To address this, we propose Block-Nyström, an algorithm that injects a block-diagonal structure into the Nyström method, thereby significantly reducing its computational cost while recovering strong approximation guarantees. We show that Block-Nyström improves the computational complexity of kernel ridge regression for statistical learning over Hilbert spaces, and it can be used to construct more efficient preconditioners for second-order optimization. Our key technical insight is that, within the same computational budget, combining several smaller Nyström approximations leads to stronger tail estimates of the input spectrum than using one larger approximation. Along the way, we provide a novel recursive preconditioning scheme for efficiently inverting the Block-Nyström matrix, and provide new statistical learning bounds for a broad class of approximate kernel ridge regression solvers.
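For context, the classical single-block Nyström factorization that Block-Nyström builds on takes only a few lines; the block-diagonal combination and the recursive preconditioner are the paper's contributions and are not reproduced here. Function names are illustrative.

```python
import numpy as np

def nystrom_factors(K, landmarks, reg=1e-10):
    # Classical Nystrom: K ~= C @ pinv(W) @ C.T, with C = K[:, landmarks]
    # and W the landmark-landmark block; reg stabilizes the pseudoinverse.
    C = K[:, landmarks]
    W = K[np.ix_(landmarks, landmarks)]
    W_pinv = np.linalg.pinv(W + reg * np.eye(len(landmarks)))
    return C, W_pinv

# Usage sketch for a PSD kernel matrix K and landmark indices idx:
#   C, W_pinv = nystrom_factors(K, idx)
#   K_hat = C @ W_pinv @ C.T
```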
Session: 13A - Algorithms (Friday 04 July 16:00–17:36)
Authors: Tsimpos, Panagiotis; Ren, Zhi; Zech, Jakob; Marzouk, Youssef
Abstract:
Flow-based methods for sampling and generative modeling use continuous-time dynamical systems to represent a \emph{transport map} that pushes forward a source measure to a target measure. The introduction of a time axis provides considerable design freedom, and a central question is how to exploit this freedom. Though many popular methods seek straight-line (i.e., zero acceleration) trajectories, we show here that a specific class of ``curved'' trajectories can significantly improve approximation and learning. In particular, we consider the unit-time interpolation of any given transport map $T$ and seek the schedule $\tau: [0,1] \to [0,1]$ that minimizes the spatial Lipschitz constant of the corresponding velocity field over all times $t \in [0,1]$. This quantity is crucial as it allows for control of the approximation error when the velocity field is learned from data. We show that, for a broad class of source/target measures and transport maps $T$, the \emph{optimal schedule} can be computed in closed form, and that the resulting optimal Lipschitz constant is \emph{exponentially smaller} than that induced by an identity schedule (corresponding to, for instance, the Wasserstein geodesic). Our proof technique relies on the calculus of variations and the notion of $\Gamma$-convergence, allowing us to approximate the aforementioned degenerate objective by a family of smooth, tractable problems.
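To see where the schedule enters, consider the reparametrized linear interpolation of a transport map $T$ (a standard computation, unpacked here for intuition rather than quoted from the paper):

\[ F_t(x) = \bigl(1-\tau(t)\bigr)\,x + \tau(t)\,T(x), \qquad v_t(y) = \partial_t F_t\bigl(F_t^{-1}(y)\bigr) = \tau'(t)\,(T-\mathrm{Id})\bigl(F_t^{-1}(y)\bigr), \]

so the spatial Lipschitz constant of $v_t$ is at most $|\tau'(t)|\cdot\mathrm{Lip}(T-\mathrm{Id})\cdot\mathrm{Lip}(F_t^{-1})$: the schedule trades the temporal factor $\tau'(t)$ against the spatial factor $\mathrm{Lip}(F_t^{-1})$, and the identity schedule $\tau(t)=t$ need not balance the two well.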
Time: Friday 04 July 16:00–17:36
Session: 13B - Testing (Friday 04 July 16:00–17:36)
Authors: Seyfried, Jan; Sen, Sayantan; Tomamichel, Marco
Abstract:
We investigate the sample complexity of mutual information and conditional mutual information testing. For conditional mutual information testing, given access to independent samples of a triple of random variables $(A, B, C)$ with unknown distribution, we want to distinguish between two cases (i) $A$ and $C$ are conditionally independent, i.e., $I(A\!:\!C|B) = 0$, and (ii) $A$ and $C$ are conditionally dependent, i.e., $I(A\!:\!C|B) \geq \eps$ for some threshold $\eps$. We establish an upper bound on the number of samples required to distinguish between the two cases with high confidence, as a function of $\eps$ and the three alphabet sizes. We conjecture that our bound is tight and show that this is indeed the case in several parameter regimes. For the special case of mutual information testing (when $B$ is trivial), we establish the necessary and sufficient number of samples required up to polylogarithmic terms. Our technical contributions include a novel method to efficiently simulate weakly correlated samples from the conditionally independent distribution $P_{A|B} P_{C|B} P_B$ given access to samples from an unknown distribution $P_{ABC}$, and a new estimator for equivalence testing that can handle such correlated samples, which might be of independent interest.
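For the unconditional special case, the natural plug-in test is easy to state; the sketch below is a naive baseline (its sample complexity is not the paper's optimal bound, and the paper's estimator must additionally handle weakly correlated simulated samples). Names and the $\eps/2$ threshold are illustrative.

```python
import numpy as np

def plug_in_mi(a, c):
    # Plug-in estimate of I(A:C) in nats from paired integer samples.
    a, c = np.asarray(a), np.asarray(c)
    joint = np.zeros((a.max() + 1, c.max() + 1))
    np.add.at(joint, (a, c), 1.0)
    joint /= joint.sum()
    pa = joint.sum(axis=1, keepdims=True)
    pc = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float((joint[mask] * np.log(joint[mask] / (pa @ pc)[mask])).sum())

def mi_test(a, c, eps):
    # Declare "dependent" when the estimate clears half the threshold.
    return plug_in_mi(a, c) >= eps / 2
```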
Session: 13B - Testing (Friday 04 July 16:00–17:36)
Authors: Servedio, Rocco; Chen, Xi; Pires, William; Pitassi, Toniann
Abstract:
This paper considers the junta testing problem in a recently introduced ``relative error'' variant of the standard Boolean function property testing model. In relative-error testing we measure the distance from $f$ to $g$, where $f,g: \zo^n \to \zo$, by the ratio of $|f^{-1}(1) \triangle g^{-1}(1)|$ (the number of inputs on which $f$ and $g$ disagree) to $|f^{-1}(1)|$ (the number of satisfying assignments of $f$), and we give the testing algorithm both black-box access to $f$ and also access to independent uniform samples from $f^{-1}(1)$. \cite{CDHLNSY2024} observed that the class of $k$-juntas is $\poly(2^k,1/\eps)$-query testable in the relative-error model, and asked whether $\poly(k,1/\eps)$ queries is achievable. We answer this question affirmatively by giving a $\tilde{O}(k/\eps)$-query algorithm, matching the optimal complexity achieved in the less challenging standard model. Moreover, as our main result, we show that any \emph{subclass} of $k$-juntas that is closed under permuting variables is relative-error testable with a similar complexity. This gives highly efficient relative-error testing algorithms for a number of well-studied function classes, including size-$k$ decision trees, size-$k$ branching programs, and size-$k$ Boolean formulas.
Session: 13B - Testing (Friday 04 July 16:00–17:36)
Authors: Compton, Spencer; Pabbaraju, Chirag; Zhivotovskiy, Nikita
Abstract:
A fundamental open problem in learning theory is to characterize the best-case teaching dimension $\operatorname{TS}_{\min}$ of a concept class $\mathcal{C}$ with finite VC dimension $d$. Resolving this problem will, in particular, settle the conjectured upper bound on Recursive Teaching Dimension posed by [Simon and Zilles; COLT 2015]. Prior work used a natural greedy algorithm to construct teaching sets recursively, thereby proving upper bounds on $\operatorname{TS}_{\min}$, with the best known bound being $O(d^2)$ [Hu, Wu, Li, and Wang; COLT 2017]. In each iteration, this greedy algorithm chooses to add to the teaching set the $k$ labeled points that restrict the concept class the most. In this work, we prove lower bounds on the performance of this greedy approach. Specifically, we show that for $k = 1$, the algorithm does not improve upon the halving-based bound of $O(\log(|\mathcal{C}|))$. Furthermore, for $k = 2$, we complement the upper bound of $O\left(\log(\log(|\mathcal{C}|))\right)$ from [Moran, Shpilka, Wigderson, and Yehudayoff; FOCS 2015] with a matching lower bound. Most consequentially, our lower bound extends up to $k \le \lceil c d \rceil$ for a small constant $c>0$, suggesting that alternative, non-greedy methods may be necessary to resolve the conjecture that $\operatorname{TS}_{\min}= O(d)$.
Session: 13B - Testing (Friday 04 July 16:00–17:36)
Authors: Gao, Chao; Shan, Liren; Srinivas, Vaidehi; Vijayaraghavan, Aravindan
Abstract:
We study the problem of learning a high-density region of an arbitrary distribution over $\mathbb{R}^d$. Given a target coverage parameter $\delta$, and sample access to an arbitrary distribution $\mathcal{D}$, we want to output a confidence set $S \subset \mathbb{R}^d$ such that $S$ achieves $\delta$ coverage of $\mathcal{D}$, i.e., $\mathbb{P}_{y \sim \mathcal{D}} \left[ y \in S \right] \ge \delta$, and the volume of $S$ is as small as possible. This is a central problem in high-dimensional statistics with applications in high-dimensional analogues of finding confidence intervals, uncertainty quantification, and support estimation. In the most general setting, this problem is statistically intractable, so we restrict our attention to competing with sets from a concept class $\mathcal{C}$ with bounded VC-dimension. An algorithm for learning confidence sets is competitive with class $\mathcal{C}$ if, given samples from an arbitrary distribution $\mathcal{D}$, it outputs in polynomial time a set that achieves $\delta$ coverage of $\mathcal{D}$ and whose volume is competitive with the smallest set in $\mathcal{C}$ with the required coverage $\delta$. This problem is computationally challenging even in the basic setting when $\mathcal{C}$ is the set of all Euclidean balls. Existing algorithms based on coresets find in polynomial time a ball whose volume is $\exp(\tilde{O}( d/ \log d))$-factor competitive with the volume of the best ball. Our main result is an algorithm that finds a confidence set whose volume is $\exp(\tilde{O}(d^{2/3}))$-factor competitive with the optimal ball having the desired coverage. The algorithm is improper and simple, and it extends to finding confidence sets competitive against unions of $k$ balls, and other variants. Combined with an NP-hardness result for properly learning balls with a volume approximation factor of $\exp(\tilde{O}(d^{1-o(1)}))$, this provides a striking separation between proper and (improper) learning of confidence sets.
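The simplest proper baseline makes the trade-off tangible: center a ball at the sample mean and grow its radius to the empirical $\delta$-quantile of distances. This achieves $\delta$ coverage on the sample (population coverage needs a separate VC-type generalization step) but can be badly volume-suboptimal, which is exactly what the paper's improper algorithm addresses. Names are illustrative.

```python
import numpy as np

def ball_confidence_set(samples, delta):
    # samples: (N, d) array; returns a ball covering a delta fraction of it.
    center = samples.mean(axis=0)
    dists = np.linalg.norm(samples - center, axis=1)
    return center, np.quantile(dists, delta)
```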
Session: 13B - Testing (Friday 04 July 16:00–17:36)
Authors: Baguley, Samuel; Gobel, Andreas; Pappik, Marcus; Schiller, Leon
Abstract:
We study high-dimensional random geometric graphs (RGGs) of edge density $p$ with vertices uniformly distributed on the $d$-dimensional torus and edges inserted between sufficiently close vertices with respect to an $L_q$-norm. In this setting, we focus on distinguishing an RGG from an Erdős–Rényi graph if both models have the same marginal edge probability $p$. So far, most results in the literature considered either spherical RGGs with $L_2$-distance or toroidal RGGs under $L_\infty$-distance. However, for general $L_q$-distances, many questions remained open, especially if $p$ is allowed to depend on $n$. The main reason for this is that RGGs under $L_q$-distances cannot easily be represented as the logical ``AND'' of their one-dimensional counterparts, as is the case for $L_\infty$ geometries. To overcome this difficulty, we devise a novel technique for quantifying the dependence between edges based on modified Edgeworth expansions. Our technique yields the first tight algorithmic upper bounds for distinguishing toroidal RGGs under general $L_q$ norms from Erdős–Rényi graphs for any fixed $p$ and $q$. We achieve this by showing that the signed triangle statistic can distinguish the two models when $d \ll n^3p^3$ for the whole regime of edge probabilities $c/n$
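The signed triangle statistic mentioned above has a compact matrix form, since $\sum_{i<j<k}(A_{ij}-p)(A_{jk}-p)(A_{ik}-p) = \operatorname{tr}(B^3)/6$ for the centered adjacency matrix $B$ with zero diagonal; the sketch below computes it directly (the distinguishing thresholds are the paper's analysis and are not shown).

```python
import numpy as np

def signed_triangle_statistic(A, p):
    # A: symmetric 0/1 adjacency matrix; p: marginal edge probability.
    B = A - p                    # center the off-diagonal entries
    np.fill_diagonal(B, 0.0)
    return float(np.trace(B @ B @ B) / 6.0)
```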
Session: 13B - Testing (Friday 04 July 16:00–17:36)
Authors: Feng, Weiming; Liu, Hongyang; Yang, Minji
Abstract:
Spin systems form an important class of undirected graphical models. For two Gibbs distributions $\mu$ and $\nu$ induced by two spin systems on the same graph $G = (V, E)$, we study the problem of approximating the total variation distance $d_{\mathrm{TV}}(\mu, \nu)$ with an $\epsilon$-relative error. We propose a new reduction that connects the problem of approximating the TV-distance to sampling and approximate counting. Our applications include the hardcore model and the antiferromagnetic Ising model in the uniqueness regime, the ferromagnetic Ising model, and the general Ising model satisfying the spectral condition. Additionally, we explore the computational complexity of approximating the total variation distance $d_{\mathrm{TV}}(\mu_S, \nu_S)$ between two marginal distributions on an arbitrary subset $S \subseteq V$. We prove that this problem remains hard even when both $\mu$ and $\nu$ admit polynomial-time sampling and approximate counting algorithms.
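As a sanity check on the definitions, both a Gibbs distribution and the TV-distance can be computed exactly by brute-force enumeration on tiny instances (exponential in the number of vertices, so for intuition only; the paper's point is to achieve relative error with just sampling and approximate-counting oracles). The hardcore example and names are illustrative.

```python
import itertools

def hardcore_gibbs(n, edges, lam):
    # Hardcore model: weight lam^{|S|} on each independent set S of the graph.
    weights = {}
    for x in itertools.product([0, 1], repeat=n):
        if any(x[u] and x[v] for u, v in edges):
            continue                       # occupied endpoints: not independent
        weights[x] = lam ** sum(x)
    Z = sum(weights.values())
    return {x: w / Z for x, w in weights.items()}

def tv_distance(mu, nu):
    # d_TV(mu, nu) = (1/2) * sum_x |mu(x) - nu(x)|.
    support = set(mu) | set(nu)
    return 0.5 * sum(abs(mu.get(x, 0.0) - nu.get(x, 0.0)) for x in support)
```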
Session: 13B - Testing (Friday 04 July 16:00–17:36)
Authors: Lyu, Yichen; Yang, Pengkun
Abstract:
This paper studies the problems of identifiability and estimation in high-dimensional nonparametric latent structure models. We introduce an identifiability theorem that generalizes existing conditions, establishing a unified framework applicable to diverse statistical settings. Our results rigorously demonstrate how increased dimensionality, coupled with diversity in variables, inherently facilitates identifiability. For the estimation problem, we establish near-optimal minimax rate bounds for the high-dimensional nonparametric density estimation under latent structures with smooth marginals. Contrary to the conventional curse of dimensionality, our sample complexity scales only polynomially with the dimension. Additionally, we develop a perturbation theory for component recovery and propose a recovery procedure based on simultaneous diagonalization.
Session: 13B - Testing (Friday 04 July 16:00–17:36)
Authors: Kazemi, Hadi; Pensia, Ankit; Jog, Varun
Abstract:
This paper resolves two open problems from a recent paper~\citep{PenEtal24b} concerning the sample complexity of distributed simple binary hypothesis testing under information constraints. The first open problem asks whether interaction reduces the sample complexity of distributed simple binary hypothesis testing. In this paper, we show that sequential interaction does not help. The second problem suggests tightening existing sample complexity bounds for communication-constrained simple binary hypothesis testing. We derive optimally tight bounds for this setting and resolve this problem. Our main technical contributions are: (i) a one-shot lower bound on the Bayes error in simple binary hypothesis testing that tensorises; (ii) a streamlined proof of the formula for the sample complexity of simple binary hypothesis testing without constraints, first established in~\cite{PenEtal24b}; and (iii) a reverse data-processing inequality for Hellinger-$\lambda$ divergences, generalising the results from \cite{BhaEtal21} and \cite{PenEtal23}.
Time: (no scheduled time)
Session: Virtual
Authors: Jiang, Liwei; Roy, Abhishek; Balasubramanian, Krishna; Davis, Damek; Drusvyatskiy, Dmitriy; Na, Sen
Abstract:
We consider applying stochastic approximation (SA) methods to solve nonsmooth variational inclusion problems. Existing studies have shown that the averaged iterates of SA methods exhibit asymptotic normality, with an optimal limiting covariance matrix in the local minimax sense of Hájek and Le Cam. However, no methods have been proposed to estimate this covariance matrix in a nonsmooth and potentially non-monotone (nonconvex) setting. In this paper, we study an online batch-means covariance matrix estimator introduced in Zhu et al. (2023). The estimator groups the SA iterates appropriately and computes the sample covariance among batches as an estimate of the limiting covariance. Its construction does not require prior knowledge of the total sample size, and updates can be performed recursively as new data arrives. We establish that, as long as the batch size sequence is properly specified (depending on the stepsize sequence), the estimator achieves a convergence rate of order $O(\sqrt{d}n^{-1/8+\varepsilon})$ for any $\varepsilon>0$, where $d$ and $n$ denote the problem dimensionality and the number of iterations (or samples) used. Although the problem is nonsmooth and potentially non-monotone (nonconvex), our convergence rate matches the best-known rate for covariance estimation methods using only first-order information in smooth and strongly-convex settings. The consistency of this covariance estimator enables asymptotically valid statistical inference, including constructing confidence intervals and performing hypothesis testing.
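The batch-means construction is simple to sketch in its offline, fixed-batch form: partition the iterates into batches and rescale the sample covariance of the batch means. The estimator analyzed in the paper is online, with batch sizes growing in tandem with the stepsize schedule; the version below is for intuition only, and its names are illustrative.

```python
import numpy as np

def batch_means_covariance(iterates, batch_size):
    # iterates: (n, d) array of SA iterates; batch means have covariance
    # roughly Sigma / batch_size, so rescaling recovers Sigma.
    iterates = np.asarray(iterates)
    n, d = iterates.shape
    m = n // batch_size
    means = iterates[: m * batch_size].reshape(m, batch_size, d).mean(axis=1)
    centered = means - means.mean(axis=0)
    return batch_size * (centered.T @ centered) / (m - 1)
```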
Session: Virtual
Authors: Pittas, Thanasis; Pensia, Ankit
Abstract:
Algorithmic robust statistics has traditionally focused on the contamination model where a small fraction of the samples are arbitrarily corrupted. We consider a recent contamination model that combines two kinds of corruptions: (i) small fraction of arbitrary outliers, as in classical robust statistics, and (ii) local perturbations, where samples may undergo bounded shifts on average. While each noise model is well understood individually, the combined contamination model poses new algorithmic challenges, with only partial results known. Existing efficient algorithms are limited in two ways: (i) they work only for a weak notion of local perturbations, and (ii) they obtain suboptimal error for isotropic subgaussian distributions (among others). The latter limitation led \cite{NieGS24} to hypothesize that improving the error might, in fact, be computationally hard. Perhaps surprisingly, we show that information theoretically optimal error can indeed be achieved in polynomial time, under an even \emph{stronger} local perturbation model (the sliced-Wasserstein metric as opposed to the Wasserstein metric). Notably, our analysis reveals that the entire family of \emph{stability-based} robust mean estimators continues to work optimally in a black-box manner for the combined contamination model. This generalization is particularly useful in real-world scenarios where the specific form of data corruption is not known in advance. We also present efficient algorithms for distribution learning and principal component analysis in the combined contamination model.
Session: Virtual
Authors: Kumar, Akash; Parhi, Rahul; Belkin, Misha
Abstract:
Recent work has characterized the space of two-layered infinite-width neural networks as a bounded variation space $\rbv{\Omega}$ over domains $\Omega \subset \reals^d$. These spaces encompass several classical multivariate function spaces, including the $L_1$- and $L_2$-Sobolev spaces of order $d+1$, where $d$ represents the ambient dimension of the domain. This Sobolev regularity provides sufficient structure to overcome the curse of dimensionality in approximation theory. Notably, $\rbv{\Omega}$ also contains functions with less classical regularity, particularly those exhibiting significant variations in only a few directions. For bounded domains, it is well-established that Gaussian reproducing kernel Hilbert spaces (RKHS) strictly continuously embed within $\rbv{\Omega}$, demonstrating a clear gap between the Gaussian RKHS and $\rbv{\Omega}$. However, this relationship becomes more nuanced in unbounded domains. In this work, we investigate the setting where $\Omega = \reals^d$ and establish a fundamental result: certain Gaussian kernel functions cannot be represented within $\rbv{\reals^d}$, providing a contrasting non-trivial gap in the complement of the intersection of these two spaces.
Session: Virtual
Authors: Zhou, Yihan; Zhang, Chicheng
Abstract:
Multi-distribution learning extends agnostic Probably Approximately Correct (PAC) learning to the setting in which a family of $k$ distributions, $\cbr{D_i}_{i\in[k]}$, is considered and a classifier's performance is measured by its error under the worst distribution. This problem has attracted extensive study due to its broad applications in collaborative learning, fairness, and robustness. Despite a rather complete picture of the sample complexity of passive multi-distribution learning, research on active multi-distribution learning remains scarce, with algorithms whose optimality remains unknown. In this paper, we develop new algorithms for active multi-distribution learning and establish improved label complexity upper and lower bounds. Specifically, we prove label complexity upper bounds of $\widetilde{O}\Bigl(\theta_{\max}(d+k)\ln\frac{1}{\varepsilon}\Bigr)$ in the realizable setting and $\widetilde{O}\Bigl(\theta_{\max}(d+k)\Bigl(\log\frac{1}{\varepsilon}+\frac{\nu^2}{\varepsilon^2}\Bigr)+\frac{k\nu}{\varepsilon^2}\Bigr)$ in the (near-realizable) agnostic setting, where \(\theta_{\max}\) is the maximum disagreement coefficient among the \(k\) distributions, \(d\) is the VC dimension of the hypothesis class, \(\nu\) is the error of the best hypothesis, and \(\varepsilon\) is the target excess error. Moreover, we show that the bound in the realizable setting is information-theoretically optimal and that the \(k\nu/\varepsilon^2\) term in the agnostic setting is fundamental for all proper learners. We also derive novel distribution-free label complexity upper bounds.
Session: Virtual
Authors: Lu, Zhou; Sun, Y. Jennifer; Zhang, Zhiyu
Abstract:
Focusing on the expert problem in online learning, this paper studies the interpolation of several performance metrics via $\phi$-regret minimization, which measures the performance of an algorithm by its regret with respect to an arbitrary action modification rule $\phi$. With $d$ experts and $T\gg d$ rounds in total, we present a single algorithm achieving the instance-adaptive $\phi$-regret bound \begin{equation*} \tilde O(\min\{\sqrt{d-d^\unif_\phi+1},\sqrt{d-d^\self_\phi}\}\cdot\sqrt{T}), \end{equation*} where $d^\unif_\phi$ is the maximum number of experts modified identically by $\phi$, and $d^\self_\phi$ is the number of experts that $\phi$ trivially modifies to themselves. By recovering the optimal $O(\sqrt{T\log d})$ external regret bound when $d^\unif_\phi=d$, the standard $\tilde O(\sqrt{T})$ internal regret bound when $d^\self_\phi=d-1$, and the optimal $\tilde O(\sqrt{dT})$ swap regret bound in the worst case, we improve existing results in the intermediate regimes. In addition, the same algorithm achieves the optimal quantile regret bound, which corresponds to even easier settings of $\phi$ than the external regret. Building on the classical reduction from $\phi$-regret minimization to external regret minimization on stochastic matrices, our main idea is to further convert the latter to online linear regression using Haar-wavelet-inspired matrix features. Then, we apply a particular $L_1$-version of comparator-adaptive online learning algorithms to exploit the sparsity in this regression subroutine.