Time: Tuesday 01 July 09:00–10:00
Session: 1A - Sampling (Tuesday 01 July 09:00–10:00)
Authors: Wibisono, Andre
Abstract:
We study the mixing time guarantee for sampling in relative Fisher information via the Proximal Sampler algorithm, which is an approximate proximal discretization of the Langevin dynamics. We show that when the target probability distribution is strongly log-concave, the relative Fisher information converges exponentially fast along the Proximal Sampler; this matches the exponential convergence rate of the relative Fisher information along the continuous-time Langevin dynamics for strongly log-concave targets. When combined with a standard implementation of the Proximal Sampler via rejection sampling, this exponential convergence rate provides a high-accuracy iteration complexity guarantee for the Proximal Sampler in relative Fisher information when the target distribution is strongly log-concave and log-smooth. Our proof proceeds by establishing a strong data processing inequality for relative Fisher information along the Gaussian channel under strong log-concavity, and a data processing inequality along the reverse Gaussian channel for a special distribution. The forward and reverse Gaussian channels compose to form the Proximal Sampler, and these data processing inequalities imply the exponential convergence of the relative Fisher information along the Proximal Sampler.
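As an illustration of the algorithm being analyzed (not code from the paper), the following Python sketch implements one Proximal Sampler step for a target $\pi \propto e^{-f}$: a forward Gaussian channel followed by a naive rejection-sampling implementation of the reverse channel. The step size eta, and the assumption that the minimum value f_min of f is known (which makes the rejection step provably correct), are ours.

    import numpy as np

    def proximal_sampler_step(x, f, f_min, eta, rng):
        """One Proximal Sampler iteration for pi(x) proportional to exp(-f(x)).

        Forward Gaussian channel: y ~ N(x, eta * I).
        Reverse channel: sample x' with density proportional to
        exp(-f(x') - ||x' - y||^2 / (2 eta)), by rejection with proposal
        N(y, eta * I); the density ratio is proportional to
        exp(-f(x')) <= exp(-f_min), so accepting with probability
        exp(-(f(x') - f_min)) is a valid rejection sampler.
        """
        d = x.shape[0]
        y = x + np.sqrt(eta) * rng.standard_normal(d)   # forward channel
        while True:                                     # reverse channel
            x_new = y + np.sqrt(eta) * rng.standard_normal(d)
            if rng.random() < np.exp(-(f(x_new) - f_min)):
                return x_new

    # Example: standard Gaussian target, f(x) = ||x||^2 / 2, with minimum 0.
    rng = np.random.default_rng(0)
    x = np.ones(3)
    for _ in range(100):
        x = proximal_sampler_step(x, lambda z: 0.5 * z @ z, 0.0, 0.1, rng)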
Session: 1A - Sampling (Tuesday 01 July 09:00–10:00)
Authors: Cai, Yang; Mitra, Siddharth; Wang, Xiuyuan; Wibisono, Andre
Abstract:
We study zero-sum games in the space of probability distributions over the Euclidean space $\mathbb{R}^d$ with entropy regularization, in the setting when the interaction function between the players is smooth and strongly convex-strongly concave. We prove an exponential convergence guarantee for the mean-field min-max Langevin dynamics to compute the equilibrium distribution of the zero-sum game. We also study the finite-particle approximation of the mean-field min-max Langevin dynamics, both in continuous and discrete times. We prove biased convergence guarantees for the continuous-time finite-particle min-max Langevin dynamics to the stationary mean-field equilibrium distribution with an explicit bias term which does not scale with the number of particles. We also prove biased convergence guarantees for the discrete-time finite-particle min-max Langevin algorithm to the stationary mean-field equilibrium distribution with an additional bias term which scales with the step size and the number of particles. This provides an explicit iteration complexity for the average particle along the finite-particle algorithm to approximately compute the equilibrium distribution of the zero-sum game.
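A minimal simulation sketch of the discrete-time finite-particle min-max Langevin algorithm, for a toy strongly convex-strongly concave interaction of our choosing; the interaction, step size, and temperature below are illustrative assumptions, not the paper's.

    import numpy as np

    # Toy interaction f(x, y) = ||x||^2/2 - ||y||^2/2 + <x, y>.
    rng = np.random.default_rng(0)
    N, d, step, beta = 256, 2, 0.05, 10.0
    X = rng.standard_normal((N, d))   # minimizing population
    Y = rng.standard_normal((N, d))   # maximizing population

    for _ in range(2000):
        # Each particle feels the gradient averaged over the opposing population.
        x_bar, y_bar = X.mean(axis=0), Y.mean(axis=0)
        grad_X = X + y_bar            # average over Y of grad_x f(x_i, y)
        grad_Y = x_bar - Y            # average over X of grad_y f(x, y_j)
        noise = np.sqrt(2 * step / beta)
        X = X - step * grad_X + noise * rng.standard_normal((N, d))
        Y = Y + step * grad_Y + noise * rng.standard_normal((N, d))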
Session: 1A - Sampling (Tuesday 01 July 09:00–10:00)
Authors: He, Yuchen; Zhang, Chihao
Abstract:
We study the problem of sampling from a $d$-dimensional distribution with density $p(x)\propto e^{-f(x)}$, which does not necessarily satisfy good isoperimetric conditions. Specifically, we show that for any $L,M$ satisfying $LM\ge d\ge 5$, $\varepsilon\in \left(0,\frac{1}{32}\right)$, and any algorithm with query accesses to the value of $f(x)$ and $\nabla f(x)$, there exists an $L$-log-smooth distribution with second moment at most $M$ such that the algorithm requires $\left(\frac{LM}{d\varepsilon}\right)^{\Omega(d)}$ queries to compute a sample whose distribution is within $\varepsilon$ in total variation distance to the target distribution. We complement the lower bound with an algorithm requiring $\left(\frac{LM}{d\varepsilon}\right)^{\mathcal O(d)}$ queries, thereby characterizing the tight (up to the constant in the exponent) query complexity for sampling from non-log-concave distributions. Our results are in sharp contrast with the recent work of Huang et al. (COLT'24), where an algorithm with quasi-polynomial query complexity was proposed for sampling from a non-log-concave distribution when $M=\mathtt{poly}(d)$. Their algorithm works under the stronger condition that all distributions along the trajectory of the Ornstein-Uhlenbeck process, starting from the target distribution, are $\mathcal O(1)$-log-smooth. We investigate this condition and prove that it is strictly stronger than requiring the target distribution to be $\mathcal O(1)$-log-smooth. Additionally, we study this condition in the context of mixtures of Gaussians. Finally, we place our results within the broader theme of ``sampling versus optimization'', as studied in Ma et al. (PNAS'19). We show that for a wide range of parameters, sampling is strictly easier than optimization by a super-exponential factor in the dimension $d$.
Session: 1A - Sampling (Tuesday 01 July 09:00–10:00)
Authors: Koehler, Frederic; Lee, Holden; Vuong, Thuy-Duong
Abstract:
We consider the problem of sampling a multimodal distribution with a Markov chain given a small number of samples from the stationary measure. Although mixing can be arbitrarily slow, we show that if the Markov chain has a $k$th order spectral gap, initialization from a set of $\tilde O(k/\varepsilon^2)$ samples from the stationary distribution will, with high probability over the samples, efficiently generate a sample whose conditional law is $\varepsilon$-close in TV distance to the stationary measure. In particular, this applies to mixtures of $k$ distributions satisfying a Poincar\'e inequality, with faster convergence when they satisfy a log-Sobolev inequality. Our bounds are stable to perturbations to the Markov chain, and in particular work for Langevin diffusion over $\mathbb R^d$ with score estimation error, as well as Glauber dynamics combined with approximation error from pseudolikelihood estimation. This justifies the success of data-based initialization for score matching methods despite slow mixing for the data distribution, and improves and generalizes the results of Koehler and Vuong '23 to have linear, rather than exponential, dependence on $k$ and apply to arbitrary semigroups. As a consequence of our results, we show for the first time that a natural class of low-complexity Ising measures can be efficiently learned from samples.
Session: 1A - Sampling (Tuesday 01 July 09:00–10:00)
Authors: Mitra, Siddharth; Wibisono, Andre; Liang, Jiaming
Abstract:
The mixing time of a Markov chain determines how fast the iterates of the Markov chain converge to the stationary distribution; however, it does not control the dependencies between samples along the Markov chain. In this paper, we study the question of how fast the samples become approximately independent along popular Markov chains for continuous-space sampling: the Langevin dynamics in continuous time, and the Unadjusted Langevin Algorithm and the Proximal Sampler in discrete time. We measure the dependence between samples via Φ-mutual information, which is a broad generalization of the standard mutual information, and which is equal to 0 if and only if the samples are independent. We show that along these Markov chains, the Φ-mutual information between the first and the k-th iterate decreases to 0 exponentially fast in k when the target distribution is strongly log-concave. Our proof technique is based on showing the Strong Data Processing Inequalities (SDPIs) hold along the Markov chains. To prove fast mixing of the Markov chains, we only need to show the SDPIs hold for the stationary distribution. In contrast, to prove the contraction of Φ-mutual information, we need to show the SDPIs hold along the entire trajectories of the Markov chains; we prove this when the iterates along the Markov chains satisfy the corresponding Φ-Sobolev inequality, which is implied by the strong log-concavity of the target distribution.
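For reference, the quantity being contracted can be written as follows (the display is ours, using the standard definition): for a convex $\Phi$ with $\Phi(1) = 0$,
\[
I_\Phi(X;Y) \;=\; \mathbb{E}_{y \sim P_Y}\!\left[ D_\Phi\!\left( P_{X \mid Y = y} \,\big\|\, P_X \right) \right],
\qquad
D_\Phi(P \,\|\, Q) \;=\; \mathbb{E}_{Q}\!\left[ \Phi\!\left( \frac{dP}{dQ} \right) \right],
\]
so that $\Phi(u) = u \log u$ recovers the usual mutual information, and $I_\Phi(X;Y) = 0$ exactly when $X$ and $Y$ are independent.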
Time: Tuesday 01 July 09:00–10:00
Session: 1B - Bandits (Tuesday 01 July 09:00–10:00)
Authors: Nguyen, Quan; Ito, Shinji; Komiyama, Junpei; Mehta, Nishant
Abstract:
Existing data-dependent and best-of-both-worlds regret bounds for multi-armed bandit problems have limited adaptivity, as they are either data-dependent but not best-of-both-worlds (BOBW), BOBW but not data-dependent, or have a sub-optimal $O(\sqrt{T\ln{T}})$ worst-case guarantee in the adversarial regime. To overcome these limitations, we propose real-time stability-penalty matching (SPM), a new method for obtaining regret bounds that are simultaneously data-dependent, best-of-both-worlds and $T$-optimal for multi-armed bandit problems. In particular, we show that real-time SPM obtains bounds with worst-case guarantees of order $O(\sqrt{T})$ in the adversarial regime and $O(\ln{T})$ in the stochastic regime while simultaneously being adaptive to data-dependent quantities such as sparsity, variations, and small losses. Our results are obtained by extending the SPM technique for tuning the learning rates in the follow-the-regularized-leader (FTRL) framework, which further indicates that the combination of SPM and FTRL is a promising approach for proving new adaptive bounds in online learning problems.
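A skeleton of the underlying FTRL machinery, in its simplest full-information form (the SPM learning-rate rule itself is more refined; the anytime tuning below is a standard placeholder of ours, not the paper's rule):

    import numpy as np

    def ftrl_weights(cum_loss, eta):
        # FTRL with entropic regularizer = softmax of scaled cumulative losses.
        w = np.exp(-eta * (cum_loss - cum_loss.min()))
        return w / w.sum()

    K, T = 10, 1000
    rng = np.random.default_rng(0)
    cum_loss = np.zeros(K)
    for t in range(1, T + 1):
        eta_t = np.sqrt(np.log(K) / t)    # placeholder learning-rate schedule
        p = ftrl_weights(cum_loss, eta_t)
        losses = rng.random(K)            # adversary's losses for round t
        cum_loss += losses                # full-information feedback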
Session: 1B - Bandits (Tuesday 01 July 09:00–10:00)
Authors: Narasimha, Dheeraj; Gast, Nicolas
Abstract:
We consider the discrete-time infinite-horizon average-reward Restless Markovian bandit (RMAB) problem. We propose a non-stationary policy based on model predictive control, with a rolling computational horizon $\tau$. At each time slot, this policy solves a $\tau$-horizon linear program whose first control value is kept as a control for the RMAB. Our solution requires minimal assumptions and quantifies the loss in optimality in terms of $\tau$ and the number of arms, $N$. We show that its suboptimality gap is $O(1/\sqrt{N})$ in general, and $\exp(-\Omega(N))$ under a local-stability condition. Our proof is based on a framework from dynamic control known as dissipativity. Our solution is easy to implement and performs very well in practice when compared to the state of the art. Further, both our solution and our proof methodology can easily be generalized to more general constrained MDP settings and should thus be of great interest to the burgeoning RMAB community.
Session: 1B - Bandits (Tuesday 01 July 09:00–10:00)
Authors: Ryu, Jongha; Kwon, Jeongyeol; Koppe, Benjamin; Jun, Kwang-Sung
Abstract:
We consider off-policy selection and learning in contextual bandits, where the learner aims to select or train a reward-maximizing policy using data collected by a fixed behavior policy. Our contribution is two-fold. First, we propose a novel off-policy selection method that leverages a new betting-based confidence bound applied to an inverse propensity weight sequence. Our theoretical analysis reveals that our method achieves a significantly better, variance-adaptive guarantee than prior art. Second, we propose a novel and generic condition on the optimization objective for off-policy learning that strikes a different balance between bias and variance. One special case, which we call freezing, tends to induce small variance, which is preferred in small-data regimes. Our analysis shows that the resulting methods match the best existing guarantees. In our empirical study, our selection method outperforms existing methods, and freezing exhibits improved performance in small-sample regimes.
Session: 1B - Bandits (Tuesday 01 July 09:00–10:00)
Authors: Zhang, Raymond; Hadiji, Hedi; Combes, Richard
Abstract:
We consider linear stochastic bandits where the set of actions is an ellipsoid. We provide the first known minimax optimal algorithm for this problem. We first derive a novel information-theoretic lower bound on the regret of any algorithm, which must be at least $\Omega(\min(d \sigma \sqrt{T} + d \|\theta\|_{A}, \|\theta\|_{A} T))$ where $d$ is the dimension, $T$ the time horizon, $\sigma^2$ the noise variance, $A$ a matrix defining the set of actions and $\theta$ the vector of unknown parameters. We then provide an algorithm whose regret matches this bound to a multiplicative universal constant. The algorithm is non-classical in the sense that it is not optimistic, and it is not a sampling algorithm. The main idea is to combine a novel sequential procedure to estimate $\|\theta\|$, followed by an explore-and-commit strategy informed by this estimate. The algorithm is highly computationally efficient, and a run requires only time $O(dT + d^2 \log(T/d) + d^3)$ and memory $O(d^2)$, in contrast with known optimistic algorithms, which are not implementable in polynomial time. We go beyond minimax optimality and show that our algorithm is locally asymptotically minimax optimal, a much stronger notion of optimality. We further provide numerical experiments to illustrate our theoretical findings.
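A simplified explore-then-commit sketch for this action set (an illustration only: the paper's algorithm additionally estimates $\|\theta\|$ by a sequential procedure before committing, and the exploration length below is an arbitrary choice of ours):

    import numpy as np

    # Linear bandit on the ellipsoid {x : x^T A x <= 1}; reward = <theta, x> + noise.
    rng = np.random.default_rng(0)
    d, sigma, n_explore = 5, 1.0, 200
    A = np.diag(np.arange(1.0, d + 1))
    theta = rng.standard_normal(d)          # unknown parameter

    X, r = [], []
    for t in range(n_explore):
        i = t % d
        x = np.zeros(d)
        x[i] = 1.0 / np.sqrt(A[i, i])       # boundary point along axis i
        X.append(x)
        r.append(theta @ x + sigma * rng.standard_normal())
    X, r = np.array(X), np.array(r)

    theta_hat = np.linalg.lstsq(X, r, rcond=None)[0]   # least-squares estimate
    # Commit to argmax <theta_hat, x> subject to x^T A x <= 1, which is
    # A^{-1} theta_hat rescaled to the ellipsoid boundary.
    x_commit = np.linalg.solve(A, theta_hat)
    x_commit /= np.sqrt(theta_hat @ x_commit)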
Session: 1B - Bandits (Tuesday 01 July 09:00–10:00)
Authors: Bakhtiari, Alireza; Lattimore, Tor; Szepesvari, Csaba
Abstract:
We show that Thompson sampling has a Bayesian regret of at most $\tilde{O}(\sqrt{n})$ for 1-dimensional bandit convex optimisation, where $n$ is the time horizon and no assumptions are made on the loss function beyond convexity, boundedness and a mild Lipschitz assumption. For general high-dimensional problems we show that Thompson sampling can fail catastrophically. More positively, we show that Thompson sampling has Bayesian regret of $\tilde{O}(d^{2.5}\sqrt{n})$ for generalised linear bandits with an unknown convex monotone link function. Lastly, we prove that the standard information-theoretic machinery can never give a bound on the regret in the general case that improves on the best known bound of $\tilde{O}(d^{1.5}\sqrt{n})$.
Time: Tuesday 01 July 10:30–11:20
Session: 2A - Concentration Inequalities (Tuesday 01 July 10:30–11:20)
Authors: Bressan, Marco; Brukhim, Nataly; Cesa-Bianchi, Nicolo; Esposito, Emmanuel; Mansour, Yishay; Moran, Shay; Thiessen, Maximilian
Abstract:
Cost-sensitive loss functions are crucial in many real-world prediction problems, where different types of errors are penalized differently; for example, in medical diagnosis, a false negative prediction can lead to worse consequences than a false positive prediction. However, traditional PAC learning theory has mostly focused on the symmetric 0-1 loss, leaving cost-sensitive losses largely unaddressed. In this work, we extend the celebrated theory of boosting to incorporate both cost-sensitive and multi-objective losses. Cost-sensitive losses assign costs to the entries of a confusion matrix, and are used to control the sum of prediction errors accounting for the cost of each error type. Multi-objective losses, on the other hand, simultaneously track multiple cost-sensitive losses, and are useful when the goal is to satisfy several criteria at once (e.g., minimizing false positives while keeping false negatives below a critical threshold). We develop a comprehensive theory of cost-sensitive and multi-objective boosting, providing a taxonomy of weak learning guarantees that distinguishes which guarantees are trivial (i.e., can always be achieved), which ones are boostable (i.e., imply strong learning), and which ones are intermediate, implying non-trivial yet not arbitrarily accurate learning. For binary classification, we establish a dichotomy: a weak learning guarantee is either trivial or boostable. In the multiclass setting, we describe a more intricate landscape of intermediate weak learning guarantees. Our characterization relies on a geometric interpretation of boosting, revealing a surprising equivalence between cost-sensitive and multi-objective losses.
Session: 2A - Concentration Inequalities (Tuesday 01 July 10:30–11:20)
Authors: Hogsgaard, Mikael Moller; Larsen, Kasper Green
Abstract:
In this paper we establish a new margin-based generalization bound for voting classifiers, refining existing results and yielding tighter generalization guarantees for widely used boosting algorithms such as AdaBoost (Freund and Schapire, 1997). Furthermore, the new margin-based generalization bound enables the derivation of an optimal weak-to-strong learner: a Majority-of-3 of large-margin classifiers with an expected error matching the theoretical lower bound. This result provides a more natural alternative to the Majority-of-5 algorithm by Høgsgaard et al. (2024), and matches the Majority-of-3 result by Aden-Ali et al. (2024) for the realizable prediction model.
Session: 2A - Concentration Inequalities (Tuesday 01 July 10:30–11:20)
Authors: Mueller, Manuel; Luo, Yuetian; Foygel Barber, Rina
Abstract:
In statistics and machine learning, when we train a fitted model on available data, we typically want to ensure that we are searching within a model class that contains at least one accurate model---that is, we would like to ensure an upper bound on the model class risk (the lowest possible risk that can be attained by any model in the class). However, it is also of interest to establish lower bounds on the model class risk, for instance so that we can determine whether our fitted model is at least approximately optimal within the class, or so that we can decide whether the model class is unsuitable for the particular task at hand. Particularly in the setting of interpolation learning, where machine learning models are trained to reach zero error on the training data, we might ask if, at the very least, a positive lower bound on the model class risk is possible---or are we unable to detect that "all models are wrong"? In this work, we answer these questions in a distribution-free setting by establishing a model-agnostic, fundamental hardness result for the problem of constructing a lower bound on the best test error achievable over a model class, and examine its implications on specific model classes such as tree-based methods and linear regression.
Session: 2A - Concentration Inequalities (Tuesday 01 July 10:30–11:20)
Authors: Zhu, Xiaohan; Srebro, Nathan
Abstract:
We provide a complete characterization of the entire regularization curve of a modified two-part-code Minimum Description Length (MDL) learning rule for binary classification, based on an arbitrary prior or description language. Grünwald and Langford (2004) previously established the lack of asymptotic consistency, from an agnostic PAC (frequentist worst case) perspective, of the MDL rule with a penalty parameter of $\lambda=1$, suggesting that it under-regularizes. Driven by interest in understanding how benign or catastrophic under-regularization and overfitting might be, we obtain a precise quantitative description of the worst-case limiting error as a function of the regularization parameter $\lambda$ and noise level (or approximation error), significantly tightening the analysis of Grünwald and Langford for $\lambda=1$ and extending it to all other choices of $\lambda$.
Time: Tuesday 01 July 10:30–11:20
Session: 2B - Learning Theory I (Tuesday 01 July 10:30–11:20)
Authors: Blanc, Guy; Lange, Jane; Strassle, Carmen; Tan, Li-Yang
Abstract:
The apparent difficulty of efficient distribution-free PAC learning has led to a large body of work on distribution-specific learning. Distributional assumptions greatly facilitate the design of efficient algorithms but also limit their reach and relevance. Towards addressing this, we prove a {\sl distributional-lifting theorem} that shows how a learner that succeeds with respect to a distribution family $\mathcal{D}$ can be lifted to one that succeeds with respect to {\sl any} distribution $D^\star$, with an efficiency overhead that scales with the complexity of expressing $D^\star$ as a mixture of distributions in $\mathcal{D}$. Recent work of Blanc, Lange, Malik, and Tan considered the special case of lifting {\sl uniform-distribution} learners and designed a lifter that uses a {\sl conditional sample oracle} for $D^\star$, a strong form of access not afforded by the standard PAC model. Their approach, which draws on ideas from semi-supervised learning, first learns $D^\star$ and then exploits this information to lift. We show that their approach, while natural, is information-theoretically intractable with access only to random examples, thereby giving formal justification for their use of the conditional sample oracle. We then give a different approach that sidesteps the need to learn $D^\star$, yielding a lifter that works in the standard PAC model and enjoys additional advantages: it works for all base distribution families, preserves the noise tolerance of learners, has better sample complexity, and is simpler.
Session: 2B - Learning Theory I (Tuesday 01 July 10:30–11:20)
Authors: Francois, Alexandre; Bach, Francis; Orvieto, Antonio
Abstract:
We consider linear recurrent neural networks, which have become a key building block of sequence modeling due to their capacity for stable and effective long-range modeling. In this paper, we aim to characterize this ability on the simple but core copy task, whose goal is to build a linear filter of order $S$ that approximates the filter that looks $K$ time steps in the past (which we refer to as the shift-$K$ filter), where $K$ is larger than $S$. Using classical signal models and quadratic cost, we fully characterize the problem by providing lower bounds of approximation, as well as explicit filters that achieve this lower bound up to constants. The optimal performance highlights an uncertainty principle for this task: the optimal filter has to average values around the $K$-th time step in the past with a range (width) that is proportional to $K/S$.
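A small numerical illustration of the task (our construction, not the paper's): fit an order-$S$ filter whose impulse response is a sum of $S$ exponentials to the shift-$K$ impulse response by least squares, and observe that the optimum smears its mass over a window of width roughly $K/S$ around lag $K$.

    import numpy as np

    K, S, T = 50, 5, 200
    lam = np.linspace(0.80, 0.99, S)       # fixed decay rates (our choice)
    t = np.arange(T)
    Phi = lam[None, :] ** t[:, None]       # (T, S): column j is lam_j ** t
    target = (t == K).astype(float)        # shift-K impulse response

    c, *_ = np.linalg.lstsq(Phi, target, rcond=None)
    h = Phi @ c   # best order-S approximation: a bump of width ~ K/S near t = K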
Session: 2B - Learning Theory I (Tuesday 01 July 10:30–11:20)
Authors: Diakonikolas, Ilias; Ma, Mingchen; Ren, Lisheng; Tzamos, Christos
Abstract:
Learning intersections of halfspaces is a central problem in Computational Learning Theory. Even for just two halfspaces, it remains a major open question whether learning is possible in polynomial time with respect to the margin $\gamma$ of the data points and their dimensionality $d$. The best-known algorithms run in quasi-polynomial time $d^{O(\log(1/\gamma))}$, and it has been shown that this complexity is unavoidable for any algorithm relying solely on correlational statistical queries (CSQ). In this work, we introduce a novel algorithm that provably circumvents the CSQ hardness barrier. Our approach applies to a broad class of distributions satisfying a natural, previously studied, factorizability assumption. Under these distributions, we show that CSQ-based methods still require quasipolynomial time, whereas our algorithm achieves $\mathrm{poly}(d,1/\gamma)$ time by leveraging more general statistical queries (SQ). Factorizable distributions lie between distribution-specific and distribution-free settings, and significantly extend previously known tractable cases. Our result is grounded in a rigorous analysis utilizing a novel duality framework that characterizes the moment tensor structure induced by the marginal distributions. Building on these structural insights, we propose new, efficient learning algorithms. These algorithms combine a refined variant of Jennrich’s Algorithm with PCA over random projections of the moment tensor, along with a gradient-descent-based non-convex optimization framework.
Session: 2B - Learning Theory I (Tuesday 01 July 10:30–11:20)
Authors: Diakonikolas, Ilias; Diakonikolas, Jelena; Wang, Puqian; Zarifis, Nikos
Abstract:
We study the task of learning Generalized Linear models (GLMs) in the agnostic model under the Gaussian distribution. We give the first polynomial-time algorithm that achieves a constant-factor approximation for {\em any} monotone Lipschitz activation. Prior constant-factor GLM learners succeed for a substantially smaller class of activations. Our work resolves a well-known open problem, by developing a robust counterpart to the classical GLMtron algorithm~\citep{kakade2011efficient}. Our robust learner applies more generally, encompassing all monotone activations with bounded $(2+\zeta)$-moments, for any fixed $\zeta>0$---a condition that is essentially necessary. To obtain our results, we leverage a novel data augmentation technique with decreasing Gaussian noise injection and prove a number of structural results that may be useful in other settings.
Time: Tuesday 01 July 14:00–15:36
Session: 3A - Reinforcement Learning (Tuesday 01 July 14:00–15:36)
Authors: Pfrommer, Daniel; Simchowitz, Max; Jadbabaie, Ali
Abstract:
We study the problem of imitating an expert demonstrator in a discrete-time, continuous state-and-action space control system. We show that there exist stable dynamics (i.e. contracting exponentially quickly) and smooth, deterministic experts such that any smooth, deterministic imitator policy necessarily suffers error on execution that is exponentially larger, as a function of problem horizon, than the error under the distribution of expert training data. Our negative result applies to both behavior cloning and offline-RL algorithms, unless they produce highly improper imitator policies --- those which are non-smooth, non-Markovian, or which exhibit highly state-dependent stochasticity --- or unless the expert trajectory distribution is sufficiently spread. We provide preliminary evidence of the benefits of these more complex policy parameterizations, explicating the benefits of today's popular policy parameterizations in robot learning (e.g. action-chunking and Diffusion-policies). We also establish a host of complementary negative and positive results for imitation in control systems.
Session: 3A - Reinforcement Learning (Tuesday 01 July 14:00–15:36)
Authors: Zurek, Matthew; Chen, Yudong
Abstract:
We study the sample complexity of finding an $\varepsilon$-optimal policy in average-reward Markov Decision Processes (MDPs) with a generative model. The minimax optimal span-based complexity of $\widetilde{O}(SAH/\varepsilon^2)$, where $H$ is the span of the optimal bias function, has only been achievable with prior knowledge of the value of $H$. Prior-knowledge-free algorithms have been the objective of intensive research, but several natural approaches provably fail to achieve this goal. We resolve this problem, developing the first algorithms matching the optimal span-based complexity without $H$ knowledge, both when the dataset size is fixed and when the suboptimality level $\varepsilon$ is fixed. Our main technique combines the discounted reduction approach with a method for automatically tuning the effective horizon based on empirical confidence intervals or lower bounds on performance, which we term \textit{horizon calibration}. We also develop an \textit{empirical span regularization} approach, inspired by sample variance penalization, which can outperform the minimax complexity in benign settings such as when there exist near-optimal policies with span much smaller than $H$.
Session: 3A - Reinforcement Learning (Tuesday 01 July 14:00–15:36)
Authors: Krishnamurthy, Akshay; Li, Gene; Sekhari, Ayush
Abstract:
We study Reinforcement Learning (RL) in environments with large state spaces, where function approximation is required for sample-efficient learning. Departing from a long history of prior work, we consider the weakest possible form of function approximation, called agnostic policy learning, where the learner seeks to find the best policy in a given class $\Pi$, with no guarantee that $\Pi$ contains an optimal policy for the underlying task. Although it is known that sample-efficient agnostic policy learning is not possible in the standard online RL setting without further assumptions, we investigate the extent to which this can be overcome with stronger forms of access to the environment. Specifically, we show that:
- Agnostic policy learning remains statistically intractable when given access to a local simulator, from which one can reset to any previously seen state. This result holds even when the policy class is realizable, and stands in contrast to a positive result of [MFR24] showing that value-based learning under realizability is tractable with local simulator access.
- Agnostic policy learning remains statistically intractable when given online access to a reset distribution with good coverage properties over the state space (the so-called $\mu$-reset setting). We also study stronger forms of function approximation for policy learning, showing that PSDP [BKSN03] and CPI [KL02] provably fail in the absence of policy completeness.
- On a positive note, agnostic policy learning is statistically tractable for Block MDPs with access to both of the above reset models. We establish this via a new algorithm that carefully constructs a policy emulator: a tabular MDP with a small state space that approximates the value functions of all $\pi \in \Pi$. These values are approximated without any explicit value function class.
Taken together, our results contribute to a deeper understanding of the interplay between function approximation and environment access in RL.
Session: 3A - Reinforcement Learning (Tuesday 01 July 14:00–15:36)
Authors: Rohatgi, Dhruv; Foster, Dylan
Abstract:
Algorithms for reinforcement learning (RL) in large state spaces crucially rely on supervised learning subroutines to estimate objects such as value functions or transition probabilities. Since only the simplest supervised learning problems can be solved provably and efficiently, practical performance of an RL algorithm depends on which of these supervised learning ``oracles'' it assumes access to (and how they are implemented). But which oracles are better or worse? Is there a minimal oracle? In this work, we clarify the impact of the choice of supervised learning oracle on the computational complexity of RL, as quantified by the oracle strength. First, for the task of reward-free exploration in Block MDPs in the standard episodic access model---a ubiquitous setting for RL with function approximation---we identify two-context regression as a minimal oracle, i.e. an oracle that is both necessary and sufficient (under a mild regularity assumption). Second, we identify one-context regression as a near-minimal oracle in the stronger reset access model, establishing a provable computational benefit of resets in the process. Third, we broaden our focus to Low-Rank MDPs, where we give cryptographic evidence that the analogous oracle from the Block MDP setting is insufficient.
Session: 3A - Reinforcement Learning (Tuesday 01 July 14:00–15:36)
Authors: Moulin, Antoine; Neu, Gergely; Viano, Luca
Abstract:
We study the problem of reinforcement learning in infinite-horizon discounted linear Markov decision processes (MDPs), and propose the first computationally efficient algorithm achieving near-optimal regret guarantees in this setting. Our main idea is to combine two classic techniques for optimistic exploration: additive exploration bonuses applied to the reward function, and artificial transitions made to an absorbing state with maximal return. We show that, combined with a regularized approximate dynamic-programming scheme, the resulting algorithm achieves a regret of order $\tilde{\mathcal{O}} (\sqrt{d^3 (1 - \gamma)^{- 7 / 2} T})$, where $T$ is the total number of sample transitions, $\gamma \in (0,1)$ is the discount factor, and $d$ is the feature dimensionality. The results continue to hold against adversarial reward sequences, which also enables us to apply our method to the problem of imitation learning in linear MDPs, and achieve state-of-the-art results in this setting.
Session: 3A - Reinforcement Learning (Tuesday 01 July 14:00–15:36)
Authors: Agrawal, Priyank; Agrawal, Shipra
Abstract:
We present an optimistic Q-learning algorithm for regret minimization in average reward reinforcement learning under an additional assumption on the underlying MDP that for all policies, the time to visit some frequent state $s_0$ is finite and upper bounded by $H$, either in expectation or with constant probability. Our setting strictly generalizes the episodic setting and is significantly less restrictive than the assumption of bounded hitting time \textit{for all states} made by most previous literature on model-free algorithms in average reward settings. We demonstrate a regret bound of $\tilde{O}(H^5 S\sqrt{AT})$, where $S$ and $A$ are the numbers of states and actions, and $T$ is the horizon. A key technical novelty of our work is the introduction of an $\overline{L}$ operator defined as $\overline{L} v = \frac{1}{H} \sum_{h=1}^H L^h v$ where $L$ denotes the Bellman operator. Under the given assumption, we show that the $\overline{L}$ operator has a strict contraction (in span) even in the average-reward setting where the discount factor is $1$. Our algorithm design uses ideas from episodic Q-learning to estimate and apply this operator iteratively. Thus, we provide a unified view of regret minimization in episodic and non-episodic settings, which may be of independent interest.
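In tabular form, the averaged operator is straightforward to write down; the sketch below uses our own notation, with rewards r[s, a] and transition kernel P[s, a, s'].

    import numpy as np

    def bellman(v, r, P):
        # (L v)(s) = max_a [ r(s, a) + sum_{s'} P(s' | s, a) v(s') ]
        return np.max(r + P @ v, axis=1)

    def averaged_bellman(v, r, P, H):
        # Lbar v = (1/H) * (L v + L^2 v + ... + L^H v)
        out, w = np.zeros_like(v), v.copy()
        for _ in range(H):
            w = bellman(w, r, P)
            out += w
        return out / H

    def span(v):
        # the semi-norm in which Lbar is shown to be a strict contraction
        return v.max() - v.min()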
Session: 3A - Reinforcement Learning (Tuesday 01 July 14:00–15:36)
Authors: Boone, Victor; Gaujal, Bruno
Abstract:
In average reward Markov decision processes, state-of-the-art algorithms for regret minimization follow a well-established framework: They are model-based, optimistic and episodic. First, they maintain a confidence region from which optimistic policies are computed using a well-known subroutine called Extended Value Iteration (EVI). Second, these policies are used over time windows called episodes, each ended by the Doubling Trick (DT) rule or a variant thereof. In this work, without modifying EVI, we show that there is a significant advantage in replacing (DT) by another simple rule, that we call the Vanishing Multiplicative (VM) rule. When managing episodes with (VM), the algorithm’s regret is, both in theory and in practice, as good if not better than with (DT), while the one-shot behavior is greatly improved. More specifically, the management of bad episodes (when sub-optimal policies are being used) is much better under (VM) than (DT), by making the regret of exploration logarithmic rather than linear. These results are made possible by a new in-depth understanding of the contrasting behaviors of confidence regions during good and bad episodes.
Session: 3A - Reinforcement Learning (Tuesday 01 July 14:00–15:36)
Authors: Mhammedi, Zakaria
Abstract:
Designing sample-efficient and computationally feasible reinforcement learning (RL) algorithms is particularly challenging in environments with large or infinite state and action spaces. In this paper, we advance this effort by presenting an efficient algorithm for Markov Decision Processes (MDPs) where the state-action value function of any policy is linear in a given feature map. This challenging setting can model environments with infinite states and actions, strictly generalizes classic linear MDPs, and currently lacks a computationally efficient algorithm under online access to the MDP. Specifically, we introduce a new RL algorithm that efficiently finds a near-optimal policy in this setting, using a number of episodes and calls to a cost-sensitive classification (CSC) oracle that are both polynomial in the problem parameters. Notably, our CSC oracle can be efficiently implemented when the feature dimension is constant, representing a clear improvement over state-of-the-art methods, which require solving non-convex problems with horizon-many variables and can incur computational costs that are exponential in the horizon.
Time: Tuesday 01 July 14:00–15:36
Session: 3B - Robust Learning (Tuesday 01 July 14:00–15:36)
Authors: Egosi, Amitsour; Yehudai, Gilad; Shamir, Ohad
Abstract:
The memorization capacity of neural networks with a given architecture has been thoroughly studied in many works. Specifically, it is well-known that memorizing $N$ samples can be done using a network of constant width, independent of $N$. However, the required constructions are often quite delicate. In this paper, we consider the natural question of how well feedforward ReLU neural networks can memorize \emph{robustly}, namely while being able to withstand adversarial perturbations of a given radius. We establish both upper and lower bounds on the possible radius for general $l_p$ norms, implying (among other things) that width \emph{logarithmic} in the number of input samples is necessary and sufficient to achieve robust memorization (with robustness radius independent of $N$).
Session: 3B - Robust Learning (Tuesday 01 July 14:00–15:36)
Authors: Guo, Anxin; Vijayaraghavan, Aravindan
Abstract:
We consider the problem of learning an arbitrarily-biased ReLU activation (or neuron) over Gaussian marginals with the squared loss objective. Despite the ReLU neuron being the basic building block of modern neural networks, we still do not understand the basic algorithmic question of whether one arbitrary ReLU neuron is learnable in the non-realizable setting. In particular, all existing polynomial time algorithms only provide approximation guarantees for the better-behaved unbiased setting or restricted bias setting. Our main result is a polynomial time statistical query (SQ) algorithm that gives the first constant factor approximation for arbitrary bias. It outputs a ReLU activation that achieves a loss of $O(\mathrm{OPT}) + \varepsilon$ in time $\mathrm{poly}(d,1/\varepsilon)$, where $\mathrm{OPT}$ is the loss obtained by the optimal ReLU activation. Our algorithm presents an interesting departure from existing algorithms, which are all based on gradient descent and thus fall within the class of correlational statistical query (CSQ) algorithms. We complement our algorithmic result by showing that no polynomial time CSQ algorithm can achieve a constant factor approximation. Together, these results shed light on the intrinsic limitation of gradient descent, while identifying arguably the simplest setting (a single neuron) where there is a separation between SQ and CSQ algorithms.
Session: 3B - Robust Learning (Tuesday 01 July 14:00–15:36)
Authors: Ashtiani, Hassan; Pathak, Vinayak; Urner, Ruth
Abstract:
Adversarially robust PAC learning has proved to be challenging, with the currently best known learners (Montasser et al., 2021a) relying on improper methods based on intricate compression schemes, resulting in sample complexity exponential in the VC-dimension. A series of follow up work considered a slightly relaxed version of the problem called adversarially robust learning with tolerance (Ashtiani et al., 2023; Bhattacharjee et al., 2023; Raman et al., 2024) and achieved better sample complexity in terms of the VC-dimension. However, those algorithms were either improper and complex, or required additional assumptions on the hypothesis class H. We prove, for the first time, the existence of a simpler learner that achieves a sample complexity linear in the VC-dimension without requiring additional assumptions on H. Even though our learner is improper, it is “almost proper” in the sense that it outputs a hypothesis that is “similar” to a hypothesis in H. We also use the ideas from our algorithm to construct a semi-supervised learner in the tolerant setting. This simple algorithm achieves comparable bounds to the previous (non-tolerant) semi-supervised algorithm of Attias et al. (2022a), but avoids the use of intricate subroutines from previous works, and is “almost proper.”
Session: 3B - Robust Learning (Tuesday 01 July 14:00–15:36)
Authors: Daniely, Amit
Abstract:
We show that adversarial examples exist for various random convolutional networks, and furthermore, that this is a relatively simple consequence of the isoperimetric inequality on the special orthogonal group $\mathrm{SO}(d)$. This extends and simplifies a recent line of work which shows similar results for random fully connected networks.
Session: 3B - Robust Learning (Tuesday 01 July 14:00–15:36)
Authors: Cherapanamjeri, Yeshwanth; Lee, Daniel
Abstract:
A large body of work in the statistics and computer science communities dating back to Huber (1960) has led to statistically and computationally efficient outlier-robust estimators. Two particular outlier models have received significant attention: the adversarial and heavy-tailed models. While the former models outliers as the result of a malicious adversary manipulating the data, the latter relaxes distributional assumptions on the data, allowing outliers to naturally occur as part of the data generating process. In the first setting, the goal is to develop estimators robust to the largest fraction of outliers, while in the second, one seeks estimators to combat the loss of statistical efficiency, where the dependence on the failure probability is paramount. Despite these distinct motivations, the algorithmic approaches to both these settings have converged, prompting questions on the relationship between the models. In this paper, we investigate and provide a principled explanation for this phenomenon. First, we prove that any adversarially robust estimator is also resilient to heavy-tailed outliers for any statistical estimation problem with i.i.d. data. As a corollary, optimal adversarially robust estimators for mean estimation, linear regression, and covariance estimation are also optimal heavy-tailed estimators. Conversely, for arguably the simplest high-dimensional estimation task of mean estimation, we construct heavy-tailed estimators whose application to the adversarial setting requires any black-box reduction to remove almost all the outliers in the data. Taken together, our results imply that heavy-tailed estimation is likely easier than adversarially robust estimation, opening the door to novel algorithmic approaches for the heavy-tailed setting. Additionally, confidence intervals obtained for adversarially robust estimation also hold with high probability.
Session: 3B - Robust Learning (Tuesday 01 July 14:00–15:36)
Authors: Diakonikolas, Ilias; M. Kane, Daniel; Ren, Lisheng
Abstract:
We study the algorithmic task of learning Boolean disjunctions in the distribution-free agnostic PAC model. The best known agnostic learner for the class of disjunctions over $\{0, 1\}^n$ is the $L_1$-polynomial regression algorithm, achieving complexity $2^{\tilde{O}(n^{1/2})}$. This complexity bound is known to be nearly best possible within the class of Correlational Statistical Query (CSQ) algorithms. In this work, we develop an agnostic learner for this concept class with complexity $2^{\tilde{O}(n^{1/3})}$. Our algorithm can be implemented in the Statistical Query (SQ) model, providing the first separation between the SQ and CSQ models in distribution-free agnostic learning.
Session: 3B - Robust Learning (Tuesday 01 July 14:00–15:36)
Authors: Liu, Haolin; Wei, Chen-Yu; Zimmert, Julian
Abstract:
Recent work by Foster et al. (2021), Foster et al. (2022), Foster et al. (2023), and Xu and Zeevi (2023) developed the framework of the decision-estimation coefficient (DEC), which characterizes the complexity of general online decision making problems and provides a general algorithm design principle. These works, however, either focus on the pure stochastic regime where the world remains fixed over time, or the pure adversarial regime where the world arbitrarily changes over time. For the hybrid regime, where the dynamics of the world are fixed while the reward arbitrarily changes, they only give pessimistic bounds on the decision complexity. In this work, we propose a general extension of DEC that more precisely characterizes this case. Besides applications in special cases, our framework leads to a flexible algorithm design where the learner learns over partitions of the hypothesis set, trading estimation complexity with decision complexity, which could be of independent interest. Our work covers model-based learning and model-free learning in the hybrid regime, with a newly proposed extension of the bilinear classes (Du et al., 2021) to the adversarial-reward case. We also recover some existing model-free learning results in the pure stochastic regime.
Time: Tuesday 01 July 16:12–18:00
Session: 4A - Diffusion and Sampling (Tuesday 01 July 16:12–18:00)
Authors: Jiang, Minhui; Chen, Yuansi
Abstract:
We study sampling from logconcave distributions truncated on polytopes, motivated by Bayesian models with indicator variables. Built on interior point methods and the Dikin walk, we analyze the mixing time of regularized Dikin walks. Our contributions include: (1) proving that the soft-threshold Dikin walk mixes in $O(mn+\kappa n)$ iterations for logconcave distributions with condition number $\kappa$, dimension $n$ and $m$ linear constraints, without requiring bounded polytopes. Moreover, we introduce the regularized Dikin walk using Lewis weights and show it mixes in $O(n^{2.5}+\kappa n)$; (2) extending the above mixing time guarantees to weakly log-concave truncated distributions with finite covariance matrices; and (3) going beyond worst-case mixing time analysis, showing that the soft-threshold Dikin walk mixes significantly faster when only an $O(1)$ number of constraints intersect the high-probability mass of the distribution, improving the $O(mn+\kappa n)$ upper bound to $O(m + \kappa n)$. Additionally, we provide a practical implementation to generate a warm initialization.
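For concreteness, here is a sketch of one step of a vanilla Metropolis-adjusted Dikin walk on $\{x : Ax < b\}$ (the paper's soft-threshold and Lewis-weight regularized variants modify the local metric; the radius parameter and helper names below are our own assumptions):

    import numpy as np

    def dikin_step(x, a, b, neg_log_density, radius, rng):
        """One Metropolis-adjusted Dikin-walk step inside {z : a @ z < b}."""
        def hess(z):                       # log-barrier Hessian at z
            s = b - a @ z                  # slacks, shape (m,)
            return (a / s[:, None] ** 2).T @ a

        d = x.shape[0]
        Hx = hess(x)
        # Propose z ~ N(x, (radius^2 / d) * Hx^{-1}).
        L = np.linalg.cholesky(np.linalg.inv(Hx))
        z = x + (radius / np.sqrt(d)) * L @ rng.standard_normal(d)
        if np.any(a @ z >= b):
            return x                       # proposal left the polytope
        Hz = hess(z)

        def log_q(h, u, v):                # log density of v ~ N(u, (r^2/d) h^{-1})
            diff = v - u                   # up to constants shared by both terms
            _, logdet = np.linalg.slogdet(h)
            return 0.5 * logdet - 0.5 * (d / radius ** 2) * diff @ h @ diff

        # Metropolis correction for the state-dependent proposal covariance.
        log_ratio = (-neg_log_density(z) + neg_log_density(x)
                     + log_q(Hz, z, x) - log_q(Hx, x, z))
        return z if np.log(rng.random()) < log_ratio else x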
Session: 4A - Diffusion and Sampling (Tuesday 01 July 16:12–18:00)
Authors: Wu, Yu-Han; Marion, Pierre; Biau, Gerard; Boyer, Claire
Abstract:
Denoising score matching plays a pivotal role in the performance of diffusion-based generative models. However, the empirical optimal score (the exact solution to the denoising score matching problem) leads to memorization, where generated samples replicate the training data. Yet, in practice, only a moderate degree of memorization is observed, even without explicit regularization. In this paper, we investigate this phenomenon by uncovering an implicit regularization mechanism driven by large learning rates. Specifically, we show that in the small-noise regime, the empirical optimal score exhibits high irregularity. We then prove that, when trained by stochastic gradient descent with a large enough learning rate, neural networks cannot stably converge to a local minimum with arbitrarily small excess risk. Consequently, the learned score cannot be arbitrarily close to the empirical optimal score, thereby mitigating memorization. To make the analysis tractable, we consider one-dimensional data and two-layer neural networks. Experiments validate the crucial role of the learning rate in preventing memorization, even beyond the one-dimensional setting.
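The irregularity in question can be read off the closed form of the empirical optimal score, which is standard in this literature (the notation here is ours): for training points $x_1, \dots, x_n$ and noising kernel $\mathcal{N}(\alpha_t x_i, \sigma_t^2 I)$,
\[
s_t^\star(x) \;=\; \nabla_x \log \frac{1}{n} \sum_{i=1}^n \mathcal{N}(x;\, \alpha_t x_i,\, \sigma_t^2 I)
\;=\; \sum_{i=1}^n w_i(x)\, \frac{\alpha_t x_i - x}{\sigma_t^2},
\qquad
w_i(x) \propto \exp\!\left( - \frac{\|x - \alpha_t x_i\|^2}{2 \sigma_t^2} \right),
\]
so as $\sigma_t \to 0$ the softmax weights collapse onto the nearest training point, producing the high irregularity that drives memorization.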
Session: 4A - Diffusion and Sampling (Tuesday 01 July 16:12–18:00)
Authors: Potaptchik, Peter; Azangulov, Iskander; Deligiannidis, George
Abstract:
Score-matching generative models have proven successful at sampling from complex high-dimensional data distributions. In many applications, this distribution is believed to concentrate on a much lower $d$-dimensional manifold embedded into $D$-dimensional space; this is known as the manifold hypothesis. The current best-known convergence guarantees are either linear in $D$ or polynomial (superlinear) in $d$. The latter exploits a novel integration scheme for the backward SDE. We take the best of both worlds and show that the number of steps diffusion models require in order to converge in Kullback-Leibler~(KL) divergence is linear (up to logarithmic terms) in the intrinsic dimension $d$. Moreover, we show that this linear dependency is sharp.
Session: 4A - Diffusion and Sampling (Tuesday 01 July 16:12–18:00)
Authors: Chen, Sitan; Kontonis, Vasilis; Shah, Kulin
Abstract:
We study the problem of learning mixtures of $k$ Gaussians in $d$ dimensions. We make no separation assumptions on the underlying mixture components: we only require that the covariance matrices have bounded condition number and that the means and covariances lie in a ball of bounded radius. We give an algorithm that draws $d^{\mathrm{poly}(k/\varepsilon)}$ samples from the target mixture, runs in sample-polynomial time, and constructs a sampler whose output distribution is $\varepsilon$-close to the unknown mixture in total variation. Prior works for this problem either (i) required exponential runtime in the dimension $d$, (ii) placed strong assumptions on the instance (e.g., spherical covariances or clusterability), or (iii) had doubly exponential dependence on the number of components $k$. Our approach departs from commonly used techniques for this problem like the method of moments. Instead, we leverage a recently developed reduction, based on diffusion models, from distribution learning to a supervised learning task called score matching. We give an algorithm for the latter by proving a structural result showing that the score function of a Gaussian mixture can be approximated by a piecewise-polynomial function, and there is an efficient algorithm for finding it. To our knowledge, this is the first example of diffusion models achieving a state-of-the-art theoretical guarantee for an unsupervised learning task.
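The structural result concerns the score of a Gaussian mixture, which has the following well-known form (display ours): for $p(x) = \sum_k w_k \mathcal{N}(x; \mu_k, \Sigma_k)$,
\[
\nabla \log p(x) \;=\; \sum_k \pi_k(x)\, \Sigma_k^{-1} (\mu_k - x),
\qquad
\pi_k(x) \;=\; \frac{w_k\, \mathcal{N}(x; \mu_k, \Sigma_k)}{\sum_j w_j\, \mathcal{N}(x; \mu_j, \Sigma_j)},
\]
a softmax-weighted combination of per-component linear scores, which is what makes a piecewise-polynomial approximation plausible.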
Session: 4A - Diffusion and Sampling (Tuesday 01 July 16:12–18:00)
Authors: Zhu, Yusong; Kumar, Syamantak; Sarkar, Purnamrita; Tian, Kevin
Abstract:
Posterior sampling with the spike-and-slab prior (Mitchell and Beauchamp, 1988), a popular multi-modal distribution used to model uncertainty in variable selection, is considered the theoretical gold standard method for Bayesian sparse linear regression (Carvalho et al., 2009; Rockova et al., 2018). However, designing provable algorithms for performing this sampling task is notoriously challenging. Existing posterior samplers for Bayesian sparse variable selection tasks either require strong assumptions about the signal-to-noise ratio (SNR) (Yang et al., 2016), only work in moderate-dimensional regimes with a full-rank measurement matrix (Montanari and Wu, 2024), or rely on heuristic approximations to the posterior. We give the first provable algorithms for spike-and-slab posterior sampling that apply for any SNR, and use a measurement count sublinear in the problem dimension. Concretely, assume we are given a measurement matrix $X \in \mathbb{R}^{n \times d}$ and noisy observations $y = X\theta^* + \xi$ of a signal $\theta^*$ drawn from a spike-and-slab prior $\pi$ with a Gaussian diffuse density and expected sparsity $k$, where the noise satisfies $\xi \sim \mathcal{N}(0, \sigma^2 I_n)$. We give a polynomial-time, high-accuracy sampler for the posterior $\pi(\cdot \mid X, y)$, for any SNR satisfying $\sigma^{-1} > 0$, as long as $n \ge k^3 \cdot \mathrm{polylog}(d)$ and $X$ is drawn from a matrix ensemble satisfying the restricted isometry property. We further give a sampler that runs in near-linear time $\widetilde{O}(nd)$ in the same setting, as long as $n \ge k^5 \cdot \mathrm{polylog}(d)$. To demonstrate the flexibility of our framework, we extend our result to spike-and-slab posterior sampling with Laplace diffuse densities, achieving similar guarantees when $\sigma = O(1/k)$.
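For reference, a Gaussian spike-and-slab prior with expected sparsity $k$ takes the standard Mitchell-Beauchamp form (display ours; the slab variance $\varsigma^2$ and the mixing weight $k/d$ are our reading of the stated setup):
\[
\theta_j \;\overset{\mathrm{i.i.d.}}{\sim}\; \left(1 - \tfrac{k}{d}\right) \delta_0 \;+\; \tfrac{k}{d}\, \mathcal{N}(0, \varsigma^2), \qquad j = 1, \dots, d,
\]
whose $2^d$ support patterns are the source of the multi-modality that makes sampling hard.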
Session: 4A - Diffusion and Sampling (Tuesday 01 July 16:12–18:00)
Authors: Liang, Jiadong; Huang, Zhihan; Chen, Yuxin
Abstract:
This paper investigates how diffusion generative models leverage (unknown) low-dimensional structure to accelerate sampling. Focusing on two mainstream samplers -- the denoising diffusion implicit model (DDIM) and the denoising diffusion probabilistic model (DDPM) -- and assuming accurate score estimates, we prove that their iteration complexities are no greater than the order of $k/\varepsilon$ (up to some log factor), where $\varepsilon$ is the precision in total variation distance and $k$ is some intrinsic dimension of the target distribution. Our results are applicable to a broad family of target distributions without requiring smoothness or log-concavity assumptions. Further, we develop a lower bound that suggests the (near) necessity of the coefficients introduced by Ho et al. (2020) and Song et al. (2020) in facilitating low-dimensional adaptation. Our findings provide the first rigorous evidence for the adaptivity of the DDIM-type samplers to unknown low-dimensional structure, and improve over the state-of-the-art DDPM theory regarding total variation convergence.
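The coefficients referred to are those of the standard sampler updates of Ho et al. (2020) and Song et al. (2020), which in the usual notation ($\bar\alpha_t = \prod_{s \le t} \alpha_s$, noise prediction $\hat\varepsilon$) read
\[
\text{DDPM:}\quad x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar\alpha_t}}\, \hat\varepsilon(x_t, t) \right) + \sigma_t z, \qquad z \sim \mathcal{N}(0, I),
\]
\[
\text{DDIM:}\quad x_{t-1} = \sqrt{\bar\alpha_{t-1}}\; \hat{x}_0 + \sqrt{1 - \bar\alpha_{t-1}}\; \hat\varepsilon(x_t, t), \qquad \hat{x}_0 = \frac{x_t - \sqrt{1 - \bar\alpha_t}\, \hat\varepsilon(x_t, t)}{\sqrt{\bar\alpha_t}}.
\]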
Session: 4A - Diffusion and Sampling (Tuesday 01 July 16:12–18:00)
Authors: Gatmiry, Khashayar; Lee, Holden; Kelner, Jonathan A.
Abstract:
We give a new algorithm for learning mixtures of $k$ Gaussians (with identity covariance in $\mathbb{R}^n$) to TV error $\varepsilon$, with quasi-polynomial ($O(n^{\text{poly\,log}\left(\frac{n+k}{\varepsilon}\right)})$) time and sample complexity, under a minimum weight assumption. Our results extend to continuous mixtures of Gaussians where the mixing distribution is supported on a union of $k$ balls of constant radius. In particular, this applies to the case of Gaussian convolutions of distributions on low-dimensional manifolds, or more generally sets with small covering number, for which no sub-exponential algorithm was previously known. Unlike previous approaches, most of which are algebraic in nature, our approach is analytic and relies on the framework of diffusion models. Diffusion models are a modern paradigm for generative modeling, which typically rely on learning the score function (gradient log-pdf) along a process transforming a pure noise distribution, in our case a Gaussian, to the data distribution. Despite their dazzling performance in tasks such as image generation, there are few end-to-end theoretical guarantees that they can efficiently learn nontrivial families of distributions; we give some of the first such guarantees. We proceed by deriving higher-order Gaussian noise sensitivity bounds for the score functions for a Gaussian mixture to show that they can be inductively learned using piecewise polynomial regression (up to poly-logarithmic degree), and combine this with known convergence results for diffusion models.
Session: 4A - Diffusion and Sampling (Tuesday 01 July 16:12–18:00)
Authors: Chen, August; Sridharan, Karthik
Abstract:
In this paper, we prove that optimizability of any $F$ using Gradient Flow from all initializations implies a Poincaré Inequality for Gibbs measures $\mu_\beta = e^{-\beta F}/Z$ at low temperature. In particular, under mild regularity assumptions on the convergence rate of Gradient Flow, we establish that $\mu_\beta$ satisfies a Poincaré Inequality with constant $O(C')$ for $\beta \ge \Omega(d)$, where $C'$ is the Poincaré constant of $\mu_\beta$ restricted to a neighborhood of the global minimizers of $F$. Under an additional mild condition on $F$, we show that $\mu_\beta$ satisfies a Log-Sobolev Inequality with constant $O(S \beta C')$, where $S$ denotes the second moment of $\mu_\beta$. Here the asymptotic notation hides $F$-dependent parameters. At a high level, this establishes that optimizability via Gradient Flow from every initialization implies a Poincaré and Log-Sobolev Inequality for the low-temperature Gibbs measure, which in turn imply sampling from all initializations. Analogously, we establish that, under the same assumptions, if $F$ is optimizable via Gradient Flow from every initialization outside some set $S$, then $\mu_\beta$ satisfies a Weak Poincaré Inequality with parameters $(O(C'), O(\mu_\beta(S)))$ for $\beta = \Omega(d)$. At a high level, this shows that optimizability from `most' initializations implies a Weak Poincaré Inequality, which in turn implies sampling from suitable warm starts. Our regularity assumptions are mild, and as a consequence we show that we can efficiently sample from several new natural and interesting classes of non-log-concave densities, an important setting with relatively few examples. As another corollary, we obtain efficient discrete-time sampling results for log-concave measures satisfying milder regularity conditions than smoothness, similar to Lehec (2023).
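For reference, the functional inequalities in question are (in our notation, for test functions $g$):
\[
\mathrm{Var}_{\mu_\beta}(g) \;\le\; C_{\mathrm{P}} \int \|\nabla g\|^2 \, d\mu_\beta
\quad \text{(Poincaré)},
\qquad
\mathrm{Ent}_{\mu_\beta}(g^2) \;\le\; 2\, C_{\mathrm{LS}} \int \|\nabla g\|^2 \, d\mu_\beta
\quad \text{(log-Sobolev)},
\]
which respectively govern exponential convergence of the Langevin dynamics for $\mu_\beta$ in chi-squared divergence and in KL divergence.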
Session: 4A - Diffusion and Sampling (Tuesday 01 July 16:12–18:00)
Authors: Yakovlev, Konstantin; Puchkin, Nikita
Abstract:
We examine theoretical properties of the denoising score matching estimate. We model the density of observations with a nonparametric Gaussian mixture. We significantly relax the standard manifold assumption, allowing the samples to step away from the manifold. At the same time, we are still able to leverage a nice distribution structure. We derive non-asymptotic bounds on the approximation and generalization errors of the denoising score matching estimate. The rates of convergence are determined by the intrinsic dimension. Furthermore, our bounds remain valid even if we allow the ambient dimension to grow polynomially with the sample size.
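The estimate under study minimizes the standard denoising score matching objective (Vincent, 2011); in our notation, at noise level $\sigma$,
\[
\widehat{s} \;\in\; \arg\min_{s} \; \mathbb{E}_{x \sim p,\; z \sim \mathcal{N}(0, I)} \left\| s(x + \sigma z) + \frac{z}{\sigma} \right\|^2,
\]
whose population minimizer is the score of the convolution $p * \mathcal{N}(0, \sigma^2 I)$.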
Time: Tuesday 01 July 16:12–18:00
Session: 4B - Privacy and Fairness (Tuesday 01 July 16:12–18:00)
Authors: S. Portella, Victor; Harvey, Nick
Abstract:
One of the most basic problems in statistics is estimating the covariance matrix of a Gaussian distribution. Over the past decade, researchers have studied the efficiency of covariance estimation in the setting of differential privacy. The goal is to minimize the number of samples needed to achieve particular accuracy and privacy guarantees. We prove lower bounds on the number of samples needed to privately estimate the covariance matrix of a Gaussian distribution. Our bounds match existing upper bounds in the widest known setting of parameters. Our analysis can be seen as a fingerprinting argument, one of the main techniques used to prove lower bounds in differential privacy. Most fingerprinting arguments rely on results analogous to the celebrated Stein's identity from probability theory. Our argument uses a matrix extension of this identity known as the Stein-Haff identity.
Session: 4B - Privacy and Fairness (Tuesday 01 July 16:12–18:00)
Authors: Ghazi, Badih; Guzman, Cristóbal; Kamath, Pritish; Knop, Alexander; Kumar, Ravi; Manurangsi, Pasin; Sachdeva, Sushant
Abstract:
We introduce $\mathsf{PREM}$ (Private Relative Error Multiplicative weight update), a new framework for generating synthetic data that achieves a {\em relative} error guarantee for statistical queries under $(\varepsilon, \delta)$ differential privacy (DP). Namely, for a domain ${\cal X}$, a family ${\cal F}$ of queries $f : {\cal X} \to \{0, 1\}$, and $\zeta > 0$, our framework yields a mechanism that on input dataset $D \in {\cal X}^n$ outputs a synthetic dataset $\widehat{D} \in {\cal X}^n$ such that all statistical queries in ${\cal F}$ on $D$, namely $\sum_{x \in D} f(x)$ for $f \in {\cal F}$, are within a $1 \pm \zeta$ {\em multiplicative} factor of the corresponding value on $\widehat{D}$ up to an {\em additive error} that is polynomial in $\log |{\cal F}|$, $\log |{\cal X}|$, $\log n$, $\log(1/\delta)$, $1/\varepsilon$, and $1/\zeta$. In contrast, any $(\varepsilon, \delta)$-DP mechanism is known to require worst-case additive error that is polynomial in at least one of $n, |{\cal F}|$, or $|{\cal X}|$. We complement our algorithm with nearly matching lower bounds.
Session: 4B - Privacy and Fairness (Tuesday 01 July 16:12–18:00)
Authors: Aliakbarpour, Maryam; Burudgunte, Arnav; Canonne, Clement; Rubinfeld, Ronitt
Abstract:
We extend the framework of augmented distribution testing (Aliakbarpour, Indyk, Rubinfeld, and Silwal, NeurIPS 2024) to the differentially private setting. This captures scenarios where a data analyst must perform hypothesis testing tasks on sensitive data, but is able to leverage prior knowledge (public, but possibly erroneous or untrusted) about the data distribution. We design private algorithms in this augmented setting for three flagship distribution testing tasks, uniformity, identity, and closeness testing, whose sample complexity smoothly scales with the claimed quality of the auxiliary information. We complement our algorithms with information-theoretic lower bounds, showing that their sample complexity is optimal (up to logarithmic factors).
Session: 4B - Privacy and Fairness (Tuesday 01 July 16:12–18:00)
Authors: Iverson, Valentio; Kamath, Gautam; Mouzakis, Argyris
Abstract:
We provide the first $\tilde O(d)$-sample algorithm for sampling from unbounded Gaussian distributions under the constraint of $(\varepsilon, \delta)$-differential privacy. This is a quadratic improvement over previous results for the same problem.
Session: 4B - Privacy and Fairness (Tuesday 01 July 16:12–18:00)
Authors: Li, Bo; Wang, Wei; Ye, Peng
Abstract:
The realizable-to-agnostic transformation (Beimel et al., 2015; Alon et al., 2020) provides a general mechanism to convert a private learner in the realizable setting (where the examples are labeled by some function in the concept class) to a private learner in the agnostic setting (where no assumptions are imposed on the data). Specifically, for any concept class $\mathcal{C}$ and error parameter $\alpha$, a private realizable learner for $\mathcal{C}$ can be transformed into a private agnostic learner while only increasing the sample complexity by $\widetilde{O}(\mathrm{VC}(\mathcal{C})/\alpha^2)$, which is essentially tight assuming a constant privacy parameter $\varepsilon = \Theta(1)$. However, when $\varepsilon$ can be arbitrary, one has to apply the standard privacy-amplification-by-subsampling technique (Kasiviswanathan et al., 2011), resulting in a suboptimal extra sample complexity of $\widetilde{O}(\mathrm{VC}(\mathcal{C})/\alpha^2\varepsilon)$ that involves a $1/\varepsilon$ factor. In this work, we give an improved construction that eliminates the $1/\varepsilon$ dependency, thereby achieving a near-optimal extra sample complexity of $\widetilde{O}(\mathrm{VC}(\mathcal{C})/\alpha^2)$ for any $\varepsilon\le 1$. Moreover, our result reveals that in private agnostic learning, the privacy cost is only significant for the realizable part. We also leverage our technique to obtain a nearly tight sample complexity bound for the private prediction problem, resolving an open question posed by Dwork and Feldman (2018) and Dagan and Feldman (2020).
Session: 4B - Privacy and Fairness (Tuesday 01 July 16:12–18:00)
Authors: Dwork, Cynthia; Hays, Chris; Immorlica, Nicole; Perdomo, Juan C.; Tankala, Pranay
Abstract:
Professional networks provide invaluable entrée to opportunity through referrals and introductions. A rich literature shows they also serve to entrench and even exacerbate a status quo of privilege and disadvantage. Hiring platforms, equipped with the ability to nudge link formation, provide a tantalizing opening for beneficial structural change. We anticipate that key to this prospect will be the ability to estimate the likelihood of edge formation in an evolving graph. Outcome-indistinguishable prediction algorithms ensure that the modeled world is indistinguishable from the real world by a family of statistical tests. Omnipredictors ensure that predictions can be post-processed to yield loss minimization competitive with respect to a benchmark class of predictors for many losses simultaneously. We begin by observing that, by combining a slightly modified form of the online K29* algorithm of Vovk (2007) with basic facts from the theory of reproducing kernel Hilbert spaces, one can derive simple and efficient online algorithms satisfying outcome indistinguishability and omniprediction, with guarantees that improve upon, or are complementary to, those currently known. This is of independent interest; for example, we obtain efficient outcome indistinguishability for some interesting infinite collections of tests, as well as for any bounded function --- including those computable by deep (graph) neural networks. We apply these techniques to evolving graphs by designing efficient kernel functions that capture socially meaningful features of nodes and their neighborhoods. We obtain online outcome-indistinguishable omnipredictors for rich --- possibly infinite --- sets of distinguishers yielding, inter alia, multicalibrated predictions of edge formation with respect to pairs of demographic groups, and the ability to simultaneously optimize loss as measured by a variety of social welfare functions.
Session: 4B - Privacy and Fairness (Tuesday 01 July 16:12–18:00)
Authors: Peng, Pan; Xu, Hangyu
Abstract:
We study the problem of releasing a differentially private (DP) synthetic graph $G'$ that well approximates the triangle-motif sizes of all cuts of any given graph $G$, where a motif in general refers to a frequently occurring subgraph within complex networks. Non-private versions of such graphs have found applications in diverse fields such as graph clustering, graph sparsification, and social network analysis. Specifically, we present the first $(\varepsilon,\delta)$-DP mechanism that, given an input graph $G$ with $n$ vertices, $m$ edges, and local sensitivity of triangles $LS_{K_3}$, generates a synthetic graph $G'$ in polynomial time, approximating the triangle-motif sizes of all cuts $(S,V\setminus S)$ of the input graph $G$ up to an additive error of $\tilde{O}(\sqrt{m\,LS_{K_3}}\,n/\varepsilon^{3/2})$. Additionally, we provide a lower bound of $\Omega(\sqrt{mn}\,LS_{K_3}/\varepsilon)$ on the additive error for any DP algorithm that answers the triangle-motif size queries of all $(S,T)$-cuts of $G$. Finally, our algorithm generalizes to weighted graphs, and our lower bound extends to any $K_h$-motif cut for any constant $h\geq 2$.
Session: 4B - Privacy and Fairness (Tuesday 01 July 16:12–18:00)
Authors: Cherapanamjeri, Yeshwanth; Garg, Sumegha; Sekhari, Ayush; Shetty, Abhishek; Rajaraman, Nived
Abstract:
We study the memory complexity of machine unlearning algorithms that provide strong data deletion guarantees to the users. Formally, consider an algorithm for a particular learning task that initially receives a training dataset. Then, after learning, it receives data deletion requests from a subset of users (of arbitrary size), and the goal of unlearning is to perform the task as if the learner never received the data of deleted users. In this paper, we ask how many bits of storage are needed to be able to delete certain training samples at a later time. We focus on the task of realizability testing, where the goal is to check whether the remaining training samples are realizable within a given hypothesis class $\mathcal{H}$. Toward that end, we first provide a negative result showing that the VC dimension---a well-known combinatorial property of $\mathcal{H}$ that characterizes the amount of information needed for learning and representing the ERM hypothesis in the standard PAC learning task---does not characterize the space complexity of unlearning. In particular, we provide a hypothesis class with constant VC dimension (and Littlestone dimension), but for which any unlearning algorithm for realizability testing needs to store $\Omega(n)$ bits, where $n$ denotes the size of the initial training dataset. In fact, we provide a stronger separation by showing that for any hypothesis class $\mathcal{H}$, the amount of information that the learner needs to store, so as to perform unlearning later, is lower bounded by the eluder dimension of $\mathcal{H}$, a combinatorial notion always larger than the VC dimension. We complement the lower bound with an upper bound, albeit in a stronger ticketed-memory model proposed by [Ghazi et al., 2023]. We show that for any class $\mathcal{H}$ with bounded eluder dimension, there exists a ticketed scheme that uses only $\widetilde{O}(\mathrm{Eluder}(\mathcal{H}))$ bits of storage, with tickets of comparable size.
Session: 4B - Privacy and Fairness (Tuesday 01 July 16:12–18:00)
Authors: Hanneke, Steve; Moran, Shay; Schefler, Hilla; Tsubari, Iska
Abstract:
This work explores the connection between differential privacy (DP) and online learning in the context of PAC list learning. In this setting, a $k$-list learner outputs a list of $k$ potential predictions for an instance $x$ and incurs a loss if the true label of $x$ is not included in the list. A basic result in the multiclass PAC framework with a finite number of labels states that private learnability is equivalent to online learnability [Alon, Livni, Malliaris, and Moran (2019); Bun, Livni, and Moran (2020); Jung, Kim, and Tewari (2020)]. Perhaps surprisingly, we show that this equivalence does not hold in the context of list learning. Specifically, we prove that, unlike in the multiclass setting, a finite $k$-Littlestone dimension—a variant of the classical Littlestone dimension that characterizes online $k$-list learnability—is not a sufficient condition for DP $k$-list learnability. However, similar to the multiclass case, we prove that it remains a necessary condition. To demonstrate where the equivalence breaks down, we provide an example showing that the class of monotone functions with $k+1$ labels over $\mathbb{N}$ is online $k$-list learnable, but not DP $k$-list learnable. This leads us to introduce a new combinatorial dimension, the \emph{$k$-monotone dimension}, which serves as a generalization of the threshold dimension. Unlike the multiclass setting, where the Littlestone and threshold dimensions are finite together, for $k>1$, the $k$-Littlestone and $k$-monotone dimensions do not exhibit this relationship. We prove that a finite $k$-monotone dimension is another necessary condition for DP $k$-list learnability, alongside finite $k$-Littlestone dimension. Whether the finiteness of both dimensions implies private $k$-list learnability remains an open question.
Time: Wednesday 02 July 09:00–10:36
Session: 5A - Online Learning I (Wednesday 02 July 09:00–10:36)
Authors: Blondal, Ari; Gao, Shan; Hatami, Hamed; Hatami, Pooya
Abstract:
Two seminal papers--Alon, Livni, Malliaris, Moran~(STOC 2019) and Bun, Livni, and Moran~(FOCS 2020)--established the equivalence between online learnability and globally stable PAC learnability in binary classification. However, Chase, Chornomaz, Moran, and Yehudayoff (STOC 2024) recently showed that this equivalence does not hold in the agnostic setting. Specifically, they proved that in the agnostic setting, only finite hypothesis classes are globally stable learnable. Therefore, agnostic global stability is too restrictive to capture interesting hypothesis classes. To address this limitation, Chase et al. introduced two relaxations of agnostic global stability. In this paper, we characterize the classes that are learnable under their proposed relaxed conditions, resolving the two open problems raised in their work. First, we prove that in the setting where the stability parameter can depend on the excess error (the gap between the learner's error and the best achievable error by the hypothesis class), agnostic stability is fully characterized by the Littlestone dimension. Consequently, as in the realizable case, this form of learnability is equivalent to online learnability. As part of the proof of this theorem, we strengthen the celebrated result of Bun et al. by showing that classes with infinite Littlestone dimension are not stably PAC learnable, even if we allow the stability parameter to depend on the excess error. For the second relaxation proposed by Chase et al., we prove that only finite hypothesis classes are globally stable learnable even if we restrict the agnostic setting to distributions with small population loss.
Session: 5A - Online Learning I (Wednesday 02 July 09:00–10:36)
Authors: Montasser, Omar; Shetty, Abhishek; Zhivotovskiy, Nikita
Abstract:
We revisit online binary classification by shifting the focus from competing with the best-in-class binary loss to competing against relaxed benchmarks that capture smoothed notions of optimality. Instead of measuring regret relative to the exact minimal binary error—a standard approach that leads to worst-case bounds tied to the Littlestone dimension—we consider comparing with predictors that are robust to small input perturbations, perform well under Gaussian smoothing, or maintain a prescribed output margin. Previous examples of this were primarily limited to the hinge loss. Our algorithms achieve regret guarantees that depend only on the VC dimension and the complexity of the instance space (e.g., metric entropy), and notably, they incur only an $O(\log(1/\gamma))$ dependence on the generalized margin $\gamma$. This stands in contrast to most existing regret bounds, which typically exhibit a polynomial dependence on $1/\gamma$. We complement this with matching lower bounds. Our analysis connects recent ideas from adversarial robustness and smoothed online learning.
Session: 5A - Online Learning I (Wednesday 02 July 09:00–10:36)
Authors: Qiao, Mingda; Zhao, Eric
Abstract:
Calibration measures quantify how much a forecaster's predictions violate calibration, which requires that forecasts be unbiased conditional on the forecasted probabilities. Two important desiderata for a calibration measure are its decision-theoretic implications [KPLST23] (i.e., downstream decision-makers that best-respond to the forecasts are always no-regret) and its truthfulness [HQYZ24] (i.e., a forecaster approximately minimizes the error by always reporting the true probabilities). Existing measures satisfy at most one of the two properties, but not both. We introduce a new calibration measure termed step calibration that strengthens the U-Calibration error (UCal) of [KPLST23]. Our main result shows that a subsampled variant of step calibration, stepCE^sub, is both decision-theoretic and truthful. In particular, on any product distribution, stepCE^sub is truthful up to an $O(1)$ factor whereas UCal suffers from an $e^{-\Omega(T)}$-$\Omega(\sqrt{T})$ truthfulness gap. Moreover, in any smoothed setting where the conditional probability of each event is perturbed by a noise of magnitude $c > 0$, stepCE^sub is truthful up to an $O(\sqrt{\log(1/c)})$ factor, while UCal has an $e^{-\Omega(T)}$-$\Omega(T^{1/3})$ truthfulness gap. We also prove more generally an impossibility result for truthful decision-theoretic forecasting: any complete and decision-theoretic calibration measure must be discontinuous and non-truthful in the non-smoothed setting.
Session: 5A - Online Learning I (Wednesday 02 July 09:00–10:36)
Authors: Lu, Jiuyao; Roth, Aaron; Shi, Mirah
Abstract:
We define ``decision swap regret'' which generalizes both prediction for downstream swap regret and omniprediction, and give algorithms for obtaining it for arbitrary multi-dimensional Lipschitz loss functions in online adversarial settings. We also give sample complexity bounds in the batch setting via an online-to-batch reduction. When applied to omniprediction, our algorithm gives the first polynomial sample-complexity bounds for Lipschitz loss functions---prior bounds either applied only to linear loss (or binary outcomes) or scaled exponentially with the error parameter even under the assumption that the loss functions were convex. When applied to prediction for downstream regret, we give the first algorithm capable of guaranteeing swap regret bounds for all downstream agents with non-linear loss functions over a multi-dimensional outcome space: prior work applied only to linear loss functions, modeling risk neutral agents. Our general bounds scale exponentially with the dimension of the outcome space, but we give improved regret and sample complexity bounds for specific families of multidimensional functions of economic interest: constant elasticity of substitution (CES), Cobb-Douglas, and Leontief utility functions.
Session: 5A - Online Learning I (Wednesday 02 July 09:00–10:36)
Authors: Gatmiry, Khashayar; Schneider, Jon; Jegelka, Stefanie
Abstract:
Follow-the-Regularized-Leader (FTRL) algorithms are a popular class of learning algorithms for online linear optimization (OLO) that guarantee sub-linear regret. However, the choice of regularizer can significantly impact dimension-dependent factors in the regret bound. We present an algorithm that takes as input convex and symmetric action sets and loss sets for a specific OLO instance, and outputs a regularizer such that running FTRL with this regularizer guarantees regret within a universal constant factor of the best possible regret bound. In particular, for any choice of (convex, symmetric) action set and loss set we prove that there exists an instantiation of FTRL that achieves regret within a constant factor of the best possible learning algorithm, strengthening the universality result of Srebro et al., 2011. Our algorithm requires preprocessing time and space exponential in the dimension $d$ of the OLO instance, but can be run efficiently online assuming a membership and linear optimization oracle for the action and loss sets, respectively (and is fully polynomial time for the case of constant dimension $d$). We complement this with a lower bound showing that even deciding whether a given regularizer is $\alpha$-strongly-convex with respect to a given norm is NP-hard.
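As a point of reference for the FTRL template the abstract optimizes over, here is a minimal sketch of FTRL for online linear optimization on the Euclidean ball with a quadratic regularizer; the regularizer and step size are standard illustrative choices, not the instance-optimal regularizer constructed in the paper.

```python
import numpy as np

def ftrl_euclidean(losses, radius=1.0, eta=0.1):
    """FTRL on the l2 ball: x_t = argmin_{||x|| <= radius} <sum of past losses, x> + ||x||^2 / (2*eta).

    For the Euclidean ball, the minimizer is -eta * (cumulative loss), projected onto the ball.
    Returns the regret against the best fixed action in hindsight.
    """
    d = losses.shape[1]
    cum = np.zeros(d)
    total = 0.0
    for g in losses:
        x = -eta * cum
        norm = np.linalg.norm(x)
        if norm > radius:                  # projection onto the l2 ball
            x *= radius / norm
        total += g @ x
        cum += g
    best = -radius * np.linalg.norm(cum)   # best fixed action: -radius * cum / ||cum||
    return total - best

rng = np.random.default_rng(1)
G = rng.normal(size=(1000, 5))             # adversary's loss vectors
print("regret:", ftrl_euclidean(G, eta=1 / np.sqrt(1000)))
```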
Session: 5A - Online Learning I (Wednesday 02 July 09:00–10:36)
Authors: Mhammedi, Zakaria
Abstract:
In this paper, we introduce a new projection-free algorithm for Online Convex Optimization (OCO) with a state-of-the-art regret guarantee among separation-based algorithms. Existing projection-free methods based on the classical Frank-Wolfe algorithm achieve a suboptimal regret bound of $O(T^{3/4})$, while more recent separation-based approaches guarantee a regret bound of $O(\kappa \sqrt{T})$, where $\kappa$ denotes the asphericity of the feasible set, defined as the ratio of the radii of the containing and contained balls. However, for ill-conditioned sets, $\kappa$ can be arbitrarily large, potentially leading to poor performance. Our algorithm achieves a regret bound of $\widetilde{O}(\sqrt{dT} + \kappa d)$, while requiring only $\widetilde{O}(1)$ calls to a separation oracle per round. Crucially, the main term in the bound, $\widetilde{O}(\sqrt{d T})$, is independent of $\kappa$, addressing the limitations of previous methods. Additionally, as a by-product of our analysis, we recover the $O(\kappa \sqrt{T})$ regret bound of existing OCO algorithms with a more straightforward analysis and improve the regret bound for projection-free online exp-concave optimization. Finally, for constrained stochastic convex optimization, we achieve a state-of-the-art convergence rate of $\widetilde{O}(\sigma/\sqrt{T} + \kappa d/T)$, where $\sigma$ represents the noise in the stochastic gradients, while requiring only $\widetilde{O}(1)$ calls to a separation oracle per iteration.
Session: 5A - Online Learning I (Wednesday 02 July 09:00–10:36)
Authors: Hait, Soumita; Li, Ping; Luo, Haipeng; Zhang, Mengxiao
Abstract:
Motivated by alternating learning dynamics in two-player games, a recent work by \citet{cevher2024alternation} shows that $o(\sqrt{T})$ alternating regret is possible for any $T$-round adversarial Online Linear Optimization (OLO) problem, and leaves as an open question whether the same is true for general Online Convex Optimization (OCO). We answer this question in the affirmative by showing that the continuous Hedge algorithm achieves $\tilde{\mathcal{O}}(d^{\frac{2}{3}}T^{\frac{1}{3}})$ alternating regret for any adversarial $d$-dimensional OCO problem. We show that this implies an alternating learning dynamic that finds a Nash equilibrium for any convex-concave zero-sum game or a coarse correlated equilibrium for any convex two-player general-sum game at a rate of $\tilde{\mathcal{O}}(d^{\frac{2}{3}}/T^{\frac{2}{3}})$. To further improve the time complexity and/or the dimension dependence, we propose another simple algorithm, Follow-the-Regularized-Leader with a regularizer whose convex conjugate is 3rd-order smooth, for OCO with smooth and self-concordant loss functions (such as linear or quadratic losses). We instantiate our algorithm with different regularizers and show that, for example, when the decision set is the $\ell_2$ ball, our algorithm achieves $\tilde{\mathcal{O}}(T^{\frac{2}{5}})$ alternating regret with no dimension dependence (and a better $\tilde{\mathcal{O}}(T^{\frac{1}{3}})$ bound for quadratic losses). We complement our results by showing some algorithm-specific alternating regret lower bounds, including a somewhat surprising $\Omega(\sqrt{T})$ lower bound for a Regret Matching variant that is widely used in alternating learning dynamics.
Session: 5A - Online Learning I (Wednesday 02 July 09:00–10:36)
Authors: Li, Junfan; Liao, Shizhong; Xu, Zenglin; Nie, Liqiang
Abstract:
In this paper, we study the problem of online sparse linear regression (OSLR), where the algorithms are restricted to accessing only $k$ out of $d$ attributes per instance for prediction, a problem proved to be NP-hard. Previous work gave polynomial-time algorithms assuming the data matrix satisfies linear independence of features, the compatibility condition, or the restricted isometry property. We introduce a new polynomial-time algorithm, which significantly improves previous regret bounds \citep{Ito2017Efficient} under the compatibility condition, which is weaker than the other two assumptions. The improvements benefit from a tighter convergence rate of the $\ell_1$-norm error of our estimators. Our algorithm leverages the well-studied Dantzig Selector, but importantly with several novel techniques, including an algorithm-dependent sampling scheme for estimating the covariance matrix, an adaptive parameter tuning scheme, and a batched online Newton step with careful initializations. We also give novel and non-trivial analyses, including an induction method for analyzing the $\ell_1$-norm error, careful analyses of the covariance of non-independent random variables, and a decomposition of the regret. We further extend our algorithm to OSLR with additional observations, where the algorithms can observe additional $k_0$ attributes after each prediction, and improve previous regret bounds \citep{Kale2017Adaptive,Ito2017Efficient}.
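Since the algorithm builds on the Dantzig Selector, here is a minimal sketch of that estimator as a linear program (minimize the $\ell_1$ norm subject to an $\ell_\infty$ bound on the correlated residual); the synthetic data and the choice of $\lambda$ are illustrative assumptions, and the paper's sampling and parameter-tuning schemes are not reproduced.

```python
import numpy as np
from scipy.optimize import linprog

def dantzig_selector(X, y, lam):
    """min ||beta||_1  s.t.  ||X^T (y - X beta)||_inf <= lam, as an LP in beta = u - v."""
    n, d = X.shape
    G, c = X.T @ X, X.T @ y
    # Stack [u; v] >= 0; the objective sum(u) + sum(v) equals ||beta||_1 at the optimum.
    A_ub = np.block([[G, -G], [-G, G]])
    b_ub = np.concatenate([c + lam, lam - c])
    res = linprog(np.ones(2 * d), A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * (2 * d))
    u, v = res.x[:d], res.x[d:]
    return u - v

rng = np.random.default_rng(0)
n, d, k, noise = 100, 20, 3, 0.1
beta_true = np.zeros(d); beta_true[:k] = 1.0
X = rng.normal(size=(n, d))
y = X @ beta_true + noise * rng.normal(size=n)
beta_hat = dantzig_selector(X, y, lam=2 * noise * np.sqrt(n * np.log(d)))
print("l1 error:", np.abs(beta_hat - beta_true).sum())
```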
Time: Wednesday 02 July 09:00–10:36
Session: 5B - Deep Learning (Wednesday 02 July 09:00–10:36)
Authors: Safran, Itay; Reichman, Daniel; Valiant, Paul
Abstract:
We prove an exponential separation between depth 2 and depth 3 neural networks, when approximating a $\mathcal{O}(1)$-Lipschitz target function to constant accuracy, with respect to a distribution with support in the unit ball, under the mild assumption that the weights of the depth 2 network are exponentially bounded. This resolves an open problem posed in \citet{safran2019depth}, and proves that the curse of dimensionality manifests itself in depth 2 approximation, even in cases where the target function can be represented efficiently using a depth 3 network. Previously, lower bounds that were used to separate depth 2 from depth 3 networks required that at least one of the Lipschitz constant, target accuracy or (some measure of) the size of the domain of approximation scale \emph{polynomially} with the input dimension, whereas in our result these parameters are fixed to be \emph{constants} independent of the input dimension: our parameters are simultaneously optimal. Our lower bound holds for a wide variety of activation functions, and is based on a novel application of a worst- to average-case random self-reducibility argument, allowing us to leverage depth 2 threshold circuit lower bounds in a new domain.
Session: 5B - Deep Learning (Wednesday 02 July 09:00–10:36)
Authors: Grillo, Moritz; Froese, Vincent; Skutella, Martin
Abstract:
Neural networks with ReLU activation play a key role in modern machine learning. Understanding the functions represented by ReLU networks is a major topic in current research as this enables a better interpretability of learning processes. Injectivity plays a crucial role whenever invertibility of a neural network is necessary, such as, e.g., for inverse problems or generative models. The exact computational complexity of deciding injectivity was recently posed as an open problem (Puthawala et al. [JMLR 2022]). We answer this question by proving coNP-completeness. On the positive side, we show that the problem for a single ReLU-layer is still tractable for small input dimension; more precisely, we present a parameterized algorithm which yields fixed-parameter tractability with respect to the input dimension. In addition, we study the network verification problem which is of great importance since neural networks are increasingly used in safety-critical systems. We prove that network verification is coNP-hard for a general class of input domains. Our results also exclude constant-factor polynomial-time approximations for the maximum of a function computed by a ReLU network. In this context, we also characterize surjectivity for ReLU networks with one-dimensional output which turns out to be the complement of a basic network verification task. We reveal interesting connections to computational convexity by formulating the surjectivity problem as a zonotope containment problem.
Session: 5B - Deep Learning (Wednesday 02 July 09:00–10:36)
Authors: Chee, Jerry; Backurs, Arturs; Heck, Rainie; Zhang, Li; Kulkarni, Janardhan; Rothvoss, Thomas; Gopi, Sivakanth
Abstract:
Quantizing the weights of a neural network has two steps: (1) Finding a good low bit-complexity representation for weights (which we call the quantization grid) and (2) Rounding the original weights to values in the quantization grid. In this paper, we study the problem of rounding optimally given any quantization grid. The simplest and most commonly used way to round is Round-to-Nearest (RTN). By rounding in a data-dependent way instead, one can improve the quality of the quantized model significantly. We study the rounding problem from the lens of \emph{discrepancy theory}, which studies how well we can round a continuous solution to a discrete solution without affecting solution quality too much. We prove that given $m=\mathrm{poly}\left(\frac{\log n}{\epsilon}\right)$ samples from the data distribution, we can round nearly all $n$ model parameters such that the expected approximation error of the quantized model on the true data distribution is $\le \epsilon$ as long as the space of gradients of the original model is approximately low rank (which we empirically validate). Our algorithm is based on the famous Lovett-Meka algorithm from discrepancy theory and uses sticky Brownian motion to find a good rounding. We also give a simple and practical rounding algorithm called \emph{DiscQuant}, which is inspired by our theoretical insights. In our experiments, we demonstrate that DiscQuant significantly improves over the prior state-of-the-art rounding method called GPTQ and the baseline RTN over a range of benchmarks on Phi3mini-3.8B and Llama3.1-8B. For example, rounding Phi3mini-3.8B to a fixed quantization grid with 3.25 bits per parameter using DiscQuant gets 64\% accuracy on the GSM8k dataset, whereas GPTQ achieves 54\% and RTN achieves 31\% (the original model achieves 84\%).
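For contrast with the data-dependent rounding studied here, a minimal sketch of the Round-to-Nearest baseline on a fixed quantization grid follows; the grid and the toy weights are illustrative assumptions, and DiscQuant itself (sticky Brownian motion in the spirit of Lovett-Meka) is not reproduced.

```python
import numpy as np

def round_to_nearest(weights, grid):
    """RTN baseline: map each weight independently to its nearest grid value."""
    grid = np.sort(np.asarray(grid))
    idx = np.searchsorted(grid, weights)       # insertion position of each weight
    idx = np.clip(idx, 1, len(grid) - 1)       # compare with left and right neighbors
    left, right = grid[idx - 1], grid[idx]
    return np.where(weights - left <= right - weights, left, right)

rng = np.random.default_rng(0)
w = rng.normal(scale=0.05, size=8)             # toy "model weights"
grid = np.linspace(-0.15, 0.15, 2**3)          # a 3-bit uniform grid
print(np.round(w, 3))
print(np.round(round_to_nearest(w, grid), 3))
```

Data-dependent rounding instead chooses, per weight, between the two neighboring grid points so that errors cancel against the data distribution rather than independently per coordinate.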
Session: 5B - Deep Learning (Wednesday 02 July 09:00–10:36)
Authors: Camilli, Francesco; Tieplova, Daria; Barbier, Jean; Bergamin, Eleonora
Abstract:
We rigorously analyse fully-trained neural networks of arbitrary depth in the Bayesian optimal setting in the so-called \emph{proportional scaling regime} where the number of training samples and width of the input and all inner layers diverge proportionally. We prove an information-theoretic equivalence between the Bayesian deep neural network model trained from data generated by a teacher with matching architecture, and a simpler model of optimal inference in a generalized linear model. This equivalence enables us to compute the optimal generalization error for deep neural networks in this regime. We thus prove the ``deep Gaussian equivalence principle'' conjectured in \cite{cui2023optimal}. Our result highlights that in order to escape this ``trivialisation'' of deep neural networks (in the sense of reduction to a linear model) happening in the strongly overparametrized proportional regime, models trained from much more data have to be considered.
Session: 5B - Deep Learning (Wednesday 02 July 09:00–10:36)
Authors: Schechtman, Sholom; Schreuder, Nicolas
Abstract:
We analyze the implicit bias of constant step stochastic subgradient descent (SGD). We consider the setting of binary classification with homogeneous neural networks -- a large class of deep neural networks with $\mathrm{ReLU}$-type activation functions such as MLPs and CNNs without biases. We interpret the dynamics of normalized SGD iterates as an Euler-like discretization of a conservative field flow that is naturally associated to the normalized classification margin. Owing to this interpretation, we show that normalized SGD iterates converge to the set of critical points of the normalized margin at late-stage training (i.e., assuming that the data is correctly classified with positive normalized margin). To our knowledge, this is the first extension of the analysis of Lyu and Li (2020) on the discrete dynamics of gradient descent to the nonsmooth and stochastic setting. Our main result applies to binary classification with exponential or logistic losses. We additionally discuss extensions to more general settings.
Session: 5B - Deep Learning (Wednesday 02 July 09:00–10:36)
Authors: Wang, Zixuan; Nichani, Eshaan; Bietti, Alberto; Damian, Alex; Hsu, Daniel; Lee, Jason; Wu, Denny
Abstract:
Transformer-based language models have demonstrated impressive capabilities across a range of complex reasoning tasks. Prior theoretical work exploring the expressive power of transformers has shown that they can efficiently perform multi-step reasoning tasks involving parallelizable computations. However, the learnability of such constructions, particularly the conditions on the data distribution that enable efficient learning via SGD, remains an open question. Towards answering this question, we study the learnability of a task called the \emph{$k$-fold composition}, which requires computing an interleaved composition of $k$ input permutations and $k$ hidden permutations, and can be expressed by a transformer with $O(\log k)$ layers. On the negative front, we provide a Statistical Query lower bound showing that any learner which is trained on samples from the $k$-fold composition task and makes polynomially many queries must have sample size exponential in $k$, thus establishing a statistical-computational gap. On the other hand, we show that this function class can be efficiently learned, with runtime and sample complexity polynomial in $k$, by gradient descent on an $O(\log k)$-depth transformer via two different curriculum learning strategies: one in which data consists of $k'$-fold composition functions with $k' \le k$ presented in increasing order of difficulty, and another in which all data is presented simultaneously. Our work sheds light on the necessity and sufficiency of having both easy and hard examples in the data distribution for transformers to learn complex compositional tasks.
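As a rough illustration of the task's structure, the sketch below generates samples of an interleaved composition of input and hidden permutations; the exact interleaving convention is an assumption for illustration and may differ from the paper's definition.

```python
import numpy as np

rng = np.random.default_rng(0)
N, k = 10, 3                                        # alphabet size, composition depth
hidden = [rng.permutation(N) for _ in range(k)]     # fixed hidden permutations

def k_fold_composition(inputs, hidden, N):
    """Apply sigma_k . pi_k . ... . sigma_1 . pi_1 (one assumed interleaving order)."""
    out = np.arange(N)
    for pi, sigma in zip(inputs, hidden):
        out = sigma[pi[out]]                        # interleave input and hidden maps
    return out

inputs = [rng.permutation(N) for _ in range(k)]     # the learner observes these ...
label = k_fold_composition(inputs, hidden, N)       # ... and must predict this output
print(label)
```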
Session: 5B - Deep Learning (Wednesday 02 July 09:00–10:36)
Authors: Dayi, Arif Kerem; Chen, Sitan
Abstract:
LoRA has emerged as one of the \emph{de facto} methods for fine-tuning foundation models with low computational cost and memory footprint. The idea is to only train a low-rank perturbation to the weights of a pre-trained model, given supervised data for a downstream task. Despite its empirical success, mathematically it remains poorly understood what learning mechanisms ensure that gradient descent converges to useful low-rank perturbations. In this work we study low-rank fine-tuning in a student-teacher setting. We are given the weights of a two-layer base model $f$, as well as i.i.d. samples $(x,f^*(x))$ where $x$ is Gaussian and $f^*$ is the teacher model given by perturbing the weights of $f$ by a rank-1 matrix. This generalizes the setting of generalized linear model (GLM) regression where the weights of $f$ are zero. When the rank-1 perturbation is comparable in norm to the weight matrix of $f$, we show that the training dynamics are genuinely distinct from both the lazy linearized dynamics of the kernel regime, and the rich feature learning dynamics captured by GLM regression. We prove under mild assumptions that a student model which is initialized at the base model and trained with online SGD will converge to the teacher in $dk^{O(1)}$ iterations, where $k$ is the number of neurons in $f$. Importantly, unlike in the GLM setting, the complexity does not depend on fine-grained properties of the activation's Hermite expansion. We also prove that in our setting, learning the teacher model ``from scratch'' can require significantly more iterations.
Session: 5B - Deep Learning (Wednesday 02 July 09:00–10:36)
Authors: Glasgow, Margalit; Bruna, Joan; Wu, Denny
Abstract:
We study the approximation gap between the dynamics of a polynomial-width neural network and its infinite-width counterpart, both trained using projected gradient descent in the mean-field scaling regime. We demonstrate how to tightly bound this approximation gap through a differential equation governed by the mean-field dynamics. A key factor influencing the growth of this ODE is the \textit{local Hessian} of each particle, defined as the derivative of the particle's velocity in the mean-field dynamics with respect to its position. We apply our results to the canonical feature learning problem of estimating a well-specified single-index model; we permit the information exponent to be arbitrarily large, leading to convergence times that grow polynomially in the ambient dimension $d$. We show that, due to a certain ``self-concordance'' property in these problems — where the local Hessian of a particle is bounded by a constant times the particle's velocity — polynomially many neurons are sufficient to closely approximate the mean-field dynamics throughout training.
Time: Wednesday 02 July 11:10–12:46
Session: 6A - Learning in Games (Wednesday 02 July 11:10–12:46)
Authors: Lazarsfeld, John; Piliouras, Georgios; Sim, Ryann; Wibisono, Andre
Abstract:
This paper investigates the sublinear regret guarantees of two non-no-regret algorithms in zero-sum games: Fictitious Play, and Online Gradient Descent with constant stepsizes. In general adversarial online learning settings, both algorithms may exhibit instability and linear regret due to no regularization (Fictitious Play) or small amounts of regularization (Gradient Descent). However, their ability to obtain tighter regret bounds in two-player zero-sum games is less understood. In this work, we obtain strong new regret guarantees for both algorithms on a class of symmetric zero-sum games that generalize the classic three-strategy Rock-Paper-Scissors to a weighted, $n$-dimensional regime. Under symmetric initializations of the players' strategies, we prove that Fictitious Play with any tiebreaking rule has $O(\sqrt{T})$ regret, establishing a new class of games for which Karlin's Fictitious Play conjecture holds. Moreover, by leveraging a connection between the geometry of the iterates of Fictitious Play and Gradient Descent in the dual space of payoff vectors, we prove that Gradient Descent, for almost all symmetric initializations, obtains a similar $O(\sqrt{T})$ regret bound when its stepsize is a sufficiently large constant. For Gradient Descent, this establishes the first ``fast and furious'' behavior (i.e., sublinear regret without time-vanishing stepsizes but instead using large, constant ones) for zero-sum games larger than $2\times 2$.
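A minimal simulation of Fictitious Play on standard (unweighted) Rock-Paper-Scissors follows, tracking the row player's regret; the payoff matrix and first-index argmax tiebreaking are illustrative assumptions.

```python
import numpy as np

# Rock-Paper-Scissors payoff for the row player: A[i, j] = payoff of action i vs j.
A = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]])
T = 10_000
cx, cy = np.ones(3), np.ones(3)      # empirical counts (symmetric initialization)
hist_y = np.zeros(3)                 # column player's realized action counts
payoff_x = 0.0

for t in range(T):
    x = int(np.argmax(A @ (cy / cy.sum())))      # best response to empirical opponent
    y = int(np.argmax(-(A.T) @ (cx / cx.sum())))
    payoff_x += A[x, y]
    cx[x] += 1
    cy[y] += 1
    hist_y[y] += 1

# Regret of the row player: best fixed action in hindsight minus realized payoff.
regret_x = np.max(A @ hist_y) - payoff_x
print("regret / sqrt(T):", regret_x / np.sqrt(T))
```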
Session: 6A - Learning in Games (Wednesday 02 July 11:10–12:46)
Authors: Chen, Fan; Rakhlin, Alexander
Abstract:
We study the problem of interactive decision making in which the underlying environment changes over time subject to given constraints. We propose a framework, which we call hybrid Decision Making with Structured Observations (hybrid DMSO), that provides an interpolation between the stochastic and adversarial settings of decision making. Within this framework, we can analyze local differentially private (LDP) decision making, query-based learning (in particular, SQ learning), and robust and smooth decision making under the same umbrella, deriving upper and lower bounds based on variants of the Decision-Estimation Coefficient (DEC). We further establish strong connections between the DEC's behavior, the SQ dimension, local minimax complexity, learnability, and joint differential privacy.
Session: 6A - Learning in Games (Wednesday 02 July 11:10–12:46)
Authors: Fujii, Kaito
Abstract:
This paper investigates equilibrium computation and the price of anarchy for Bayesian games, which are the fundamental models of games with incomplete information. In normal-form games with complete information, it is known that efficiently computable no-regret dynamics converge to correlated equilibria, and the price of anarchy for correlated equilibria can be bounded for a broad class of games called smooth games. However, in Bayesian games, as surveyed by Forges (1993), several non-equivalent extensions of correlated equilibria exist, and it remains unclear whether they can be efficiently computed or whether their price of anarchy can be bounded. In this paper, we identify a natural extension of correlated equilibria that can be computed efficiently and is guaranteed to have bounds on the price of anarchy in various games. First, we propose a variant of regret called untruthful swap regret. If each player minimizes it in repeated play of Bayesian games, the empirical distribution of these dynamics is guaranteed to converge to communication equilibria, which is one of the extensions of correlated equilibria proposed by Myerson (1982). We present an efficient algorithm for minimizing untruthful swap regret with a sublinear upper bound, which we prove to be tight in terms of the number of types. As a result, by simulating the dynamics with our algorithm, we can approximately compute a communication equilibrium in polynomial time. Furthermore, we extend existing lower bounds on the price of anarchy based on the smoothness arguments from Bayes--Nash equilibria to equilibria obtained by the proposed dynamics.
Session: 6A - Learning in Games (Wednesday 02 July 11:10–12:46)
Authors: Ito, Shinji; Luo, Haipeng; Tsuchiya, Taira; Wu, Yue
Abstract:
No-regret self-play learning dynamics have become one of the premier ways to solve large-scale games in practice. Accelerating their convergence via improving the regret of the players over the naive $O(\sqrt{T})$ bound after $T$ rounds has been extensively studied in recent years, but almost all studies assume access to exact gradient feedback. We address the question of whether acceleration is possible under bandit feedback only and provide an affirmative answer for two-player zero-sum normal-form games. Specifically, we show that if both players apply the Tsallis-INF algorithm of Zimmert and Seldin (2021), then their regret is at most $O(c_1 \log T + c_2 \sqrt{T})$, where $c_1$ and $c_2$ are game-dependent constants that characterize the difficulty of learning --- $c_1$ resembles the complexity of learning a stochastic multi-armed bandit instance and depends inversely on some gap measures, while $c_2$ can be much smaller than the number of actions when the Nash equilibria have a small support or are close to the boundary. In particular, for the case when a pure strategy Nash equilibrium exists, $c_2$ becomes zero, leading to an optimal instance-dependent regret bound as we show. We additionally prove that in this case our algorithm also enjoys last-iterate convergence and can identify the pure strategy Nash equilibrium with near-optimal sample complexity.
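For readers unfamiliar with Tsallis-INF, a minimal single-player sketch follows: the action distribution solves the 1/2-Tsallis-regularized leader via a bisection over the normalization multiplier, and losses are estimated by importance weighting. The learning-rate schedule and the plain (rather than reduced-variance) loss estimator are simplifying assumptions; the paper plays two such learners against each other.

```python
import numpy as np

def tsallis_inf_probs(L_hat, eta):
    """Solve p_i = 1 / (eta * (L_hat_i - x))^2 with x < min(L_hat) such that sum(p) = 1."""
    n = len(L_hat)
    lo = L_hat.min() - np.sqrt(n) / eta   # here sum(p) <= 1
    hi = L_hat.min() - 1.0 / eta          # here the min arm alone contributes 1
    for _ in range(60):                   # bisection: sum(p) is increasing in x
        x = (lo + hi) / 2
        s = np.sum(1.0 / (eta * (L_hat - x)) ** 2)
        lo, hi = (x, hi) if s < 1 else (lo, x)
    p = 1.0 / (eta * (L_hat - x)) ** 2
    return p / p.sum()

rng = np.random.default_rng(0)
n, T = 3, 5000
mean_loss = np.array([0.5, 0.6, 0.9])     # toy stochastic environment
L_hat = np.zeros(n)
for t in range(1, T + 1):
    eta = 1.0 / np.sqrt(t)                # assumed learning-rate schedule
    p = tsallis_inf_probs(L_hat, eta)
    i = rng.choice(n, p=p)
    loss = float(rng.random() < mean_loss[i])
    L_hat[i] += loss / p[i]               # importance-weighted loss estimate

print("final play distribution:", np.round(p, 3))
```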
Session: 6A - Learning in Games (Wednesday 02 July 11:10–12:46)
Authors: Dagan, Yuval; Assos, Angelos; Rajaraman, Nived
Abstract:
Online learning algorithms are widely used in strategic multi-agent settings, including repeated auctions, contract design, and pricing competitions, where agents adapt their strategies over time. A key question in such environments is how an optimizing agent can best respond to a learning agent to improve its own long-term outcomes. While prior work has developed efficient algorithms for the optimizer in special cases—such as structured auction settings or contract design—no general efficient algorithm is known. In this paper, we establish a strong computational hardness result: unless $\mathsf{P} = \mathsf{NP}$, no polynomial-time optimizer can compute a near-optimal strategy against a learner using a standard no-regret algorithm, specifically Multiplicative Weights Update (MWU). Our result proves an $\Omega(T)$ hardness bound, significantly strengthening previous work that only showed an additive $\Theta(1)$ impossibility result. Furthermore, while prior hardness results focused on learners using fictitious play—an algorithm that is not no-regret—we prove intractability for a widely used no-regret learning algorithm. This establishes a fundamental computational barrier to finding optimal strategies in general game-theoretic settings.
Session: 6A - Learning in Games (Wednesday 02 July 11:10–12:46)
Authors: Kornowski, Guy; Shamir, Ohad
Abstract:
We study the problem of solving matrix games of the form $\max_{\mathbf{w}\in\mathcal{W}}\min_{\mathbf{p}\in\Delta}\mathbf{p}^{\top}A\mathbf{w}$, where $A$ is some matrix and $\Delta$ is the probability simplex. This problem encapsulates canonical tasks such as finding a linear separator and computing Nash equilibria in zero-sum games. However, perhaps surprisingly, its inherent complexity (as formalized in the standard framework of oracle complexity (Nemirovski and Yudin, 1983)) is not well-understood. In this work, we first identify different oracle models which are implicitly used by prior algorithms, amounting to multiplying the matrix $A$ by a vector from either one or both sides. We then prove complexity lower bounds for algorithms under both access models, which in particular imply a separation between them. Specifically, we start by proving that algorithms for linear separability based on one-sided multiplications must require $\Omega(\gamma_A^{-2})$ iterations, where $\gamma_A$ is the margin, as matched by the Perceptron algorithm. We then prove that accelerated algorithms for this task, which utilize multiplications from both sides, must require $\tilde{\Omega}(\gamma_{A}^{-2/3})$ iterations, establishing the first oracle complexity barrier for such algorithms. Finally, by adapting our lower bound to $\ell_1$ geometry, we prove that computing an $\epsilon$-approximate Nash equilibrium requires $\tilde{\Omega}(\epsilon^{-2/5})$ iterations, which is an exponential improvement over the previously best-known lower bound due to Hadiji et al. (2024).
Session: 6A - Learning in Games (Wednesday 02 July 11:10–12:46)
Authors: Rossellini, Raphael; Soloff, Jake; Foygel Barber, Rina; Ren, Zhimei; Willett, Rebecca
Abstract:
Forecast probabilities often serve as critical inputs for binary decision making. In such settings, calibration---ensuring forecasted probabilities match empirical frequencies---is essential. Although the common notion of Expected Calibration Error (ECE) provides actionable insights for decision making, it is not testable: it cannot be empirically estimated in many practical cases. Conversely, the recently proposed Distance to Calibration (dCE) is testable but is not actionable since it lacks decision-theoretic guarantees needed for high-stakes applications. We introduce Cutoff Calibration Error, a new calibration measure that bridges this gap by measuring expected deviations over arbitrary forecast intervals. We show that Cutoff Calibration Error can be efficiently estimated and examine its implications for popular post-hoc calibration methods, such as isotonic regression and Platt scaling.
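As a rough illustration of measuring deviations over forecast intervals, the sketch below computes, over all contiguous blocks of sorted forecasts, the largest absolute average bias; this is a plausible reading of an interval-based calibration measure for illustration only, not the paper's exact definition of Cutoff Calibration Error.

```python
import numpy as np

def max_interval_bias(p, y):
    """Largest |average of (y - p)| over contiguous blocks of forecasts sorted by p.

    A brute-force O(n^2) illustration of interval-based calibration deviations.
    """
    order = np.argsort(p)
    r = (y - p)[order]                     # residuals sorted by forecast value
    n = len(r)
    prefix = np.concatenate([[0.0], np.cumsum(r)])
    worst = 0.0
    for i in range(n):
        for j in range(i + 1, n + 1):      # block of forecasts p_(i) .. p_(j-1)
            worst = max(worst, abs(prefix[j] - prefix[i]) / n)
    return worst

rng = np.random.default_rng(0)
p = rng.random(300)
y = (rng.random(300) < p).astype(float)    # outcomes drawn to be well calibrated
print("interval deviation:", max_interval_bias(p, y))
```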
Session: 6A - Learning in Games (Wednesday 02 July 11:10–12:46)
Authors: Tsuchiya, Taira; Ito, Shinji; Luo, Haipeng
Abstract:
Learning in games refers to scenarios where multiple players interact in a shared environment, each aiming to minimize their regret. It is well known that an equilibrium can be computed at a fast rate of $O(1/T)$ when all players follow the optimistic follow-the-regularized-leader (OFTRL). However, this acceleration is limited to the honest regime, in which all players fully adhere to a prescribed algorithm---a situation that may not be realistic in practice. To address this issue, we present corrupted learning dynamics that adaptively find an equilibrium at a rate that depends on the extent to which each player deviates from the strategy suggested by the prescribed algorithm. First, in two-player zero-sum corrupted games, we provide learning dynamics for which the external regret of $x$-player (and similarly for $y$-player) is roughly bounded by $O(\log (m_x m_y) + \sqrt{\hat{C}_y} + \hat{C}_x)$, where $m_x$ and $m_y$ denote the number of actions of $x$- and $y$-players, respectively, and $\hat{C}_x$ and $\hat{C}_y$ represent their cumulative deviations. We then extend our approach to multi-player general-sum corrupted games, providing learning dynamics for which the swap regret of player $i$ is bounded by $O(\log T + \sqrt{\sum_{k} \hat{C}_k \log T} + \hat{C}_i)$ ignoring dependence on the number of players and actions, where $\hat{C}_i$ is the cumulative deviation of player $i$ from the prescribed algorithm. Our learning dynamics are agnostic to the levels of corruption. A key technical contribution is a new analysis that ensures the stability of a Markov chain under a new adaptive learning rate, thereby allowing us to achieve the desired bound in the corrupted regime while matching the best existing bound in the honest regime. Notably, our framework can be extended to address not only corruption in strategies but also corruption in the observed expected utilities, and we provide several matching lower bounds.
Time: Wednesday 02 July 11:10–13:00
Session: 6B - Convex Optimization (Wednesday 02 July 11:10–13:00)
Authors: Yang, Pengkun; Zhang, Jingzhao
Abstract:
We study the scaling of error rates with respect to the size of the training dataset. In contrast to classical results where rates are minimax optimal for a problem class, this work starts with an empirical observation that, even for a fixed data distribution, the error scaling can have \emph{diverse} rates across different ranges of sample size. To understand when and why the error rate is non-uniform, we theoretically analyze nearest neighbor classifiers. We show that an error scaling law can have fine-grained rates: in the early phase, the test error depends polynomially on the data dimension and decreases fast; whereas in the later phase, the error depends exponentially on the data dimension and decreases slowly. Our analysis highlights the complexity of the data distribution in determining the test error. When the data is benignly distributed, we show that the generalization error of the nearest neighbor classifier can depend polynomially, instead of exponentially, on the data dimension.
Session: 6B - Convex Optimization (Wednesday 02 July 11:10–13:00)
Authors: Liang, Daniel; Chia, Nai-Hui; Song, Fang
Abstract:
We establish connections between state tomography, pseudorandomness, quantum state synthesis, and circuit lower bounds. In particular, let $\mathfrak C$ be a family of non-uniform quantum circuits of polynomial size and suppose that there exists an algorithm that, given copies of $\ket \psi$, distinguishes whether $\ket \psi$ is produced by $\mathfrak C$ or is Haar random, promised one of these is the case. For arbitrary fixed constant $c$, we show that if the algorithm uses at most $O\left(2^{n^c}\right)$ time and $2^{n^{0.99}}$ samples then $\mathsf{stateBQE} \not\subset \mathsf{state}\mathfrak{C}$. Here $\mathsf{stateBQE} \coloneqq \mathsf{stateBQTIME}\left[2^{O(n)}\right]$ and $\mathsf{state}\mathfrak{C}$ are state synthesis complexity classes as introduced by Rosenthal and Yuen (2022), which capture problems with classical inputs but quantum output. Note that efficient tomography implies a similarly efficient distinguishing algorithm against Haar random states, even for nearly exponential-time algorithms. Because every state produced by a polynomial-size circuit can be learned with $2^{O(n)}$ samples and time, or $O\left(n^{\omega(1)}\right)$ samples and $2^{O(n^{\omega(1)})}$ time, we show that even slightly non-trivial quantum state tomography algorithms would lead to new statements about quantum state synthesis. Finally, a slight modification of our proof shows that distinguishing algorithms for quantum states can imply circuit lower bounds for decision problems as well. We then take these results and port them over to the setting of unitary learning and unitary synthesis. All combined, this helps shed light on why time-efficient tomography algorithms for non-uniform quantum circuit classes have seen only limited and partial progress.
Session: 6B - Convex Optimization (Wednesday 02 July 11:10–13:00)
Authors: Chen, Lesi; Liu, Chengchang; Luo, Luo; Zhang, Jingzhao
Abstract:
Previous algorithms can solve the convex-concave minimax problems $\min_{x \in \mathcal{X}} \max_{y \in \mathcal{Y}} f(x,y)$ with $\mathcal{O}(\epsilon^{-2/3})$ second-order oracle calls using Newton-like methods. This result has long been speculated to be optimal because the upper bound is achieved by a natural generalization of the optimal first-order method. In this work, we show an improved upper bound of $\tilde{\mathcal{O}}(\epsilon^{-4/7})$ by generalizing the optimal second-order method for convex optimization. We further apply a similar technique to lazy Hessian algorithms and show that our proposed algorithm can also be seen as a second-order ``Catalyst'' framework that could accelerate any globally convergent algorithms for solving minimax problems.
Session: 6B - Convex Optimization (Wednesday 02 July 11:10–13:00)
Authors: Bok, Jinho; Altschuler, Jason
Abstract:
Surprisingly, recent work has shown that gradient descent can be accelerated without using momentum---just by judiciously choosing stepsizes. An open question raised by several papers is whether this phenomenon of stepsize-based acceleration holds more generally for constrained and/or composite convex optimization via projected and/or proximal versions of gradient descent. We answer this in the affirmative by proving that the silver stepsize schedule yields analogously accelerated rates in these settings. These rates are conjectured to be asymptotically optimal among all stepsize schedules, and match the silver convergence rate of vanilla gradient descent (Altschuler and Parrilo, 2024), namely $O(\varepsilon^{-\log_\rho 2})$ for smooth convex optimization and $O(\kappa^{\log_\rho 2} \log \frac{1}{\varepsilon})$ under strong convexity, where $\varepsilon$ is the precision, $\kappa$ is the condition number, and $\rho = 1 + \sqrt{2}$ is the silver ratio. The key technical insight is the combination of recursive gluing---the technique underlying all analyses of gradient descent accelerated with time-varying stepsizes---with a certain Laplacian-structured sum-of-squares certificate for the analysis of proximal point updates.
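For concreteness, here is a small sketch of the (unconstrained) silver stepsize schedule applied to gradient descent on a smooth convex quadratic; the recursive construction shown is one common presentation of the Altschuler-Parrilo schedule, and the test problem is an illustrative assumption.

```python
import numpy as np

RHO = 1 + np.sqrt(2)  # the silver ratio

def silver_schedule(k):
    """Stepsizes of length 2^k - 1, built recursively: s_{j+1} = [s_j, 1 + RHO**(j-1), s_j]."""
    s = [np.sqrt(2)]
    for j in range(1, k):
        s = s + [1 + RHO ** (j - 1)] + s
    return np.array(s)

# Gradient descent with silver stepsizes on f(x) = 0.5 * x^T A x (L-smooth with L = 1),
# so the stepsizes h are already in units of 1/L.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(50, 50)))
A = Q @ np.diag(np.linspace(1e-3, 1.0, 50)) @ Q.T
x = rng.normal(size=50)
for h in silver_schedule(10):
    x = x - h * (A @ x)     # the occasional long steps are what buy acceleration
print("f(x_T) =", 0.5 * x @ A @ x)
```

Note that individual long steps can transiently increase the objective; the acceleration guarantee holds for the schedule as a whole, which is why the analysis glues blocks of steps together rather than arguing per-iteration descent.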
Session: 6B - Convex Optimization (Wednesday 02 July 11:10–13:00)
Authors: Contreras, Juan Pablo; Guzmán, Cristobal; Martínez-Rubio, David
Abstract:
We develop algorithms for the optimization of convex objectives that have Hölder continuous $q$-th derivatives by using a $q$-th order oracle, for any $q \geq 1$. Our algorithms work for general norms under mild conditions, including the $\ell_p$-settings for $1\leq p\leq \infty$. We can also optimize structured functions that allow for inexactly implementing a non-Euclidean ball optimization oracle. We do this by developing a non-Euclidean inexact accelerated proximal point method that makes use of an \textit{inexact uniformly convex regularizer}. We show a lower bound for general norms that demonstrates our algorithms are nearly optimal in high-dimensions in the black-box oracle model for $\ell_p$-settings and all $q \geq 1$, even in randomized and parallel settings. This new lower bound, when applied to the first-order smooth case, resolves an open question in parallel convex optimization.
Session: 6B - Convex Optimization (Wednesday 02 July 11:10–13:00)
Authors: Zhang, Zihan; Lee, Jason; Du, Simon; Chen, Yuxin
Abstract:
This work investigates stepsize-based acceleration of gradient descent with {\em anytime} convergence guarantees. For smooth (non-strongly) convex optimization, we propose a stepsize schedule that allows gradient descent to achieve convergence guarantees of $O\big(T^{-\frac{2\log_2\rho}{1+\log_2\rho}}\big) \approx O(T^{-1.119})$ for any stopping time $T$, where $\rho=\sqrt{2}+1$ is the silver ratio and the stepsize schedule is predetermined without prior knowledge of the stopping time. This result provides an affirmative answer to a COLT open problem regarding whether stepsize-based acceleration can yield anytime convergence rates of $o(T^{-1})$. We further extend our theory to yield anytime convergence guarantees of $\exp(-\Omega(T/\kappa^{0.893}))$ for smooth and strongly convex optimization, with $\kappa$ being the condition number.
Session: 6B - Convex Optimization (Wednesday 02 July 11:10–13:00)
Authors: Bai, Site; Bullins, Brian
Abstract:
Recent advances (Sherman, 2017; Sidford and Tian, 2018; Cohen et al., 2021) have overcome the fundamental barrier of dimension dependence in the iteration complexity of solving $\ell_\infty$ regression with first-order methods. Yet it remains unclear to what extent such acceleration can be achieved for general $\ell_p$ smooth functions. In this paper, we propose a new accelerated first-order method for convex optimization under non-Euclidean smoothness assumptions. In contrast to standard acceleration techniques, our approach uses primal-dual iterate sequences taken with respect to \emph{differing} norms, which are then coupled using an \emph{implicitly} determined interpolation parameter. For $\ell_p$ norm smooth problems in $d$ dimensions, our method provides an iteration complexity improvement of up to $O(d^{1-\frac{2}{p}})$ in terms of calls to a first-order oracle, thereby allowing us to circumvent long-standing barriers in accelerated non-Euclidean steepest descent.
Session: 6B - Convex Optimization (Wednesday 02 July 11:10–13:00)
Authors: Jiang, Ruichen; Maladkar, Devyani; Mokhtari, Aryan
Abstract:
Adaptive gradient methods, such as AdaGrad, are among the most successful optimization algorithms for neural network training. While these methods are known to achieve better dimensional dependence than stochastic gradient descent (SGD) for stochastic convex optimization under favorable geometry, the theoretical justification for their success in stochastic non-convex optimization remains elusive. In fact, under standard assumptions of Lipschitz gradients and bounded noise variance, it is known that SGD is worst-case optimal (up to absolute constants) in terms of finding a near-stationary point with respect to the $\ell_2$-norm, making further improvements impossible. Motivated by this limitation, we introduce refined assumptions on the smoothness structure of the objective and the gradient noise variance, which better suit the coordinate-wise nature of adaptive gradient methods. Moreover, we adopt the $\ell_1$-norm of the gradient as the stationarity measure, as opposed to the standard $\ell_2$-norm, to align with the coordinate-wise analysis and obtain tighter convergence guarantees for AdaGrad. Under these new assumptions and the $\ell_1$-norm stationarity measure, we establish an upper bound on the convergence rate of AdaGrad and a corresponding lower bound for SGD. In particular, we identify non-convex settings in which the iteration complexity of AdaGrad is favorable over SGD and show that, for certain configurations of problem parameters, it outperforms SGD by a factor of $d$, where $d$ is the problem dimension. To the best of our knowledge, this is the first result to demonstrate a provable gain of adaptive gradient methods over SGD in a non-convex setting. We also present supporting lower bounds, including one specific to AdaGrad and one applicable to general deterministic first-order methods, showing that our upper bound for AdaGrad is tight and unimprovable up to a logarithmic factor under certain conditions.
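As a reference point for the method being analyzed, a minimal coordinate-wise AdaGrad update is sketched below on a toy anisotropic stochastic objective; the base learning rate and noise model are illustrative assumptions, not the paper's refined smoothness and variance conditions.

```python
import numpy as np

# Coordinate-wise AdaGrad: each coordinate's step is scaled by the inverse root of
# its accumulated squared gradients, adapting to anisotropic (coordinate-wise) geometry.
rng = np.random.default_rng(0)
d = 100
scales = np.logspace(-2, 0, d)         # ill-conditioned quadratic f(x) = 0.5 * sum(scales * x^2)
x = np.ones(d)
G = np.zeros(d)                        # running sum of squared gradients, per coordinate
eta, eps = 0.5, 1e-8

for t in range(2000):
    g = scales * x + 0.01 * rng.normal(size=d)   # stochastic gradient
    G += g ** 2
    x -= eta * g / (np.sqrt(G) + eps)            # per-coordinate adaptive step

print("f(x) =", 0.5 * np.sum(scales * x ** 2))
```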
Session: 6B - Convex Optimization (Wednesday 02 July 11:10–13:00)
Authors: Saad, El Mehdi; Lee, Wei-Cheng; Orabona, Francesco
Abstract:
We study fundamental limits of first-order stochastic optimization in a range of non-convex settings, including L-smooth functions satisfying Quasar-Convexity (QC), Quadratic Growth (QG), and Restricted Secant Inequalities (RSI). While the convergence properties of standard algorithms are well-understood in deterministic regimes, significantly fewer results address the stochastic case, where only unbiased and noisy gradients are available. We establish new lower bounds on the number of noisy gradient queries to minimize these classes of functions, also showing that they are tight (up to a logarithmic factor) in all the relevant quantities characterizing each class. Our approach reformulates the optimization task as a function identification problem, leveraging \emph{divergence decomposition} arguments to construct a challenging subclass that leads to sharp lower bounds. Furthermore, we present a specialized algorithm in the one-dimensional setting that achieves faster rates, suggesting that certain dimensional thresholds are intrinsic to the complexity of non-convex stochastic optimization.
Time: Thursday 03 July 09:00–10:00
Session: 7A - Clustering and Graphs (Thursday 03 July 09:00–10:00)
Authors: Black, Hadley; Mazumdar, Arya; Saha, Barna
Abstract:
We consider the basic problem of learning an unknown partition of $n$ elements into at most $k$ sets using simple queries that reveal information about a small subset of elements. Our starting point is the popular and well-studied pairwise same-set queries which ask if a pair of elements belong to the same class. It is well-known that non-adaptive (fully parallel) algorithms require $\Theta(n^2)$ queries, while adaptive (fully sequential) algorithms require $\Theta(nk)$ queries, and the best known algorithm uses $k-1$ rounds of adaptivity. Many variations of this problem have been studied due to its connections to clustering and active learning. In these applications, it is of interest to reduce adaptivity while minimizing the query complexity. In this paper, we give a complete characterization of the query complexity of this problem as a function of the number of rounds, $r$, which interpolates smoothly between the non-adaptive and adaptive settings: for any constant $r \geq 1$, the query complexity is $\smash{\Theta(n^{1+\frac{1}{2^r-1}}k^{1-\frac{1}{2^r-1}})}$. Additionally, our algorithm only needs $O(\log \log n)$ rounds to attain the optimal $O(nk)$ query complexity, which is a double-exponential improvement over prior works when $k$ is a polynomial in $n$. Next, we consider two natural generalizations of pairwise queries to general subsets $S$ of size at most $s$: (1) weak subset queries which return the number of classes intersected by $S$, and (2) strong subset queries which return the entire partition restricted on $S$. For non-adaptive algorithms, we show $\Omega(n^2/s^2)$ strong queries are needed. In contrast, perhaps surprisingly, we show that there is a non-adaptive algorithm using weak queries that matches this bound up to log-factors for all $s \leq \sqrt{n}$. More generally, we obtain nearly matching upper and lower bounds for algorithms using weak and strong queries in terms of both the number of rounds, $r$, and the query size bound, $s$.
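The round/query trade-off stated above can be tabulated directly; a small Python check (ours) showing how the exponents interpolate from the non-adaptive $\Theta(n^2)$ bound at $r=1$ toward the adaptive $\Theta(nk)$ bound:

    for r in range(1, 6):
        e = 1 / (2 ** r - 1)
        print(r, f"n^{1 + e:.3f} k^{1 - e:.3f}")   # r=1 gives n^2.000 k^0.000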
Session: 7A - Clustering and Graphs (Thursday 03 July 09:00–10:00)
Authors: Chakraborty, Diptarka; Das, Debarati; Chatterjee, Kushagra; Nguyen, Tien Long; Nobahari, Romina
Abstract:
Consensus clustering, a fundamental task in machine learning and data analysis, aims to aggregate multiple input clusterings of a dataset, potentially based on different non-sensitive attributes, into a single clustering that best represents the collective structure of the data. In this work, we study this fundamental problem through the lens of fair clustering, as introduced by Chierichetti et al. [NeurIPS'17], which incorporates the disparate impact doctrine to ensure proportional representation of each protected group in the dataset within every cluster. Our objective is to find a consensus clustering that is not only representative but also fair with respect to specific protected attributes. To the best of our knowledge, we are the first to address this problem and provide a constant-factor approximation. As part of our investigation, we examine how to minimally modify an existing clustering to enforce fairness -- an essential postprocessing step in many clustering applications that require fair representation. We develop an optimal algorithm for datasets with equal group representation and near-linear time constant factor approximation algorithms for more general scenarios with different proportions of two group sizes. Given the fundamental nature of this problem, we believe our results on Closest Fair Clustering could have broader implications for other clustering problems, particularly those for which no prior approximation guarantees exist for their fair variants.
Session: 7A - Clustering and Graphs (Thursday 03 July 09:00–10:00)
Authors: Shin, Kijun; Fan, Chenglin
Abstract:
Clustering is a fundamental task in unsupervised learning. Previous research has focused on learning-augmented $k$-means in Euclidean metrics, limiting its applicability to complex data representations. In this paper, we generalize learning-augmented $k$-clustering to operate on general metrics, enabling its application to graph-structured and non-Euclidean domains. Our framework also relaxes restrictive cluster size constraints, providing greater flexibility for datasets with imbalanced or unknown cluster distributions. Furthermore, we extend hardness results on query complexity to general metrics: under the Exponential Time Hypothesis (ETH), we show that any polynomial-time algorithm must perform $\Omega(k / \alpha)$ queries to achieve a $(1 + \alpha)$-approximation. These contributions enhance both the theoretical foundations and the practical scope of learning-augmented clustering, addressing critical gaps between traditional methods and real-world challenges.
Session: 7A - Clustering and Graphs (Thursday 03 July 09:00–10:00)
Authors: Nakul, Milind; Muthukumar, Vidya; Pananjady, Ashwin
Abstract:
Suppose we observe a trajectory of length n from an α-mixing stochastic process over a finite but potentially large state space. We consider the problem of estimating the probability mass placed by the stationary distribution of any such process on elements that occur with a certain frequency in the observed sequence. We estimate this vector of probabilities in total variation distance, showing consistency in n and recovering known results for i.i.d. sequences as special cases. Our proposed methodology carefully combines the plug-in estimator with a recently-proposed modification of the Good–Turing estimator called WINGIT, which was originally developed for Markovian sequences. En route to controlling the error of our estimator, we develop new performance bounds on WINGIT and the plug-in estimator for α-mixing stochastic processes. Importantly, the extensively used method of Poissonization can no longer be applied in our non i.i.d. setting, and so we develop complementary tools—including an empirical Bernstein bound for mixing sequences—that may prove independently useful in the design and analysis of estimators for related problems.
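For context, the classical Good–Turing idea that WINGIT modifies estimates the total mass on symbols seen exactly r times by $(r+1)N_{r+1}/n$; a minimal sketch (the classical estimator only, not WINGIT itself):

    from collections import Counter

    def good_turing_mass(sample, r):
        # (r + 1) * N_{r+1} / n, where N_j counts symbols appearing j times
        n = len(sample)
        freq_of_freq = Counter(Counter(sample).values())
        return (r + 1) * freq_of_freq.get(r + 1, 0) / n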
Session: 7A - Clustering and Graphs (Thursday 03 July 09:00–10:00)
Authors: Das, Syamantak; Galhotra, Sainyam; Li, Wen-Zhi; Raychaudhury, Rahul; Sintos, Stavros
Abstract:
Traditional clustering methods assume precise pairwise distances, but this is often impractical when dealing with images, videos, or natural language. This paper studies clustering and graph problems where direct access to distances is infeasible. We use oracle-based methods, as defined by~\cite{galhotra2024k}: the quadruplet oracle (weak, compares two pairs) and the distance oracle (strong, returns exact distances), under adversarial and probabilistic noise. These oracles can be implemented via crowdsourcing or predictive models. We consider a finite metric space $\Sigma=(\V,d)$ of size $|\V|=n$ with both oracles. When the dataset has low intrinsic (doubling) dimension, for $k$-center, $k$-median, and $k$-means clustering, we design constant approximation algorithms with $O((n+k^2)\cdot \polylog(n))$ quadruplet queries and $O(\polylog(n))$ distance queries. In general metric spaces, we achieve constant approximation with $O(k \cdot n \cdot \polylog(n))$ quadruplet queries and $O(\polylog(n))$ distance queries, improving the quadruplet query complexity by a factor of $k$ and distance query complexity by $k^2$ over~\cite{galhotra2024k}. If the dataset spread is polynomially bounded, we build a data structure using $O(n\cdot \polylog(n))$ quadruplet and $O(\polylog(n))$ distance queries to approximate any pairwise distance $\dist(u,v)$. This enables constant approximation algorithms for $k$-clustering with $O(n\cdot \polylog(n))$ quadruplet and $O(\polylog(n))$ distance queries. Even without bounded spread, our approach extends to graph problems like Minimum Spanning Tree, demonstrating its generality in oracle-based clustering and graph algorithms.
Time: Thursday 03 July 09:00–10:00
Session: 7B - Online Learning II (Thursday 03 July 09:00–10:00)
Authors: Gao, Wenzhi; Chu, Ya-Chi; Ye, Yinyu; Udell, Madeleine
Abstract:
We introduce a framework to accelerate the convergence of gradient-based methods with online learning. The framework learns to scale the gradient at each iteration through an online learning algorithm and provably accelerates gradient-based methods asymptotically. In contrast with previous literature, where convergence is established based on worst-case analysis, our framework provides a strong convergence guarantee with respect to the optimal stepsize for the iteration trajectory. For smooth strongly convex optimization, our framework provides an $O(\kappa^\star \log(1/\varepsilon))$ asymptotic complexity result, where $\kappa^\star$ is the condition number achievable by the optimal preconditioner, improving on the previous $O(\sqrt{n}\kappa^\star \log(1/\varepsilon))$ result. For smooth convex optimization, we obtain the first convergence guarantee for the widely-used hypergradient descent heuristic.
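The hypergradient descent heuristic mentioned at the end updates the stepsize itself by a gradient step, using the inner product of consecutive gradients; a minimal sketch of the classic heuristic (our illustration; the paper's framework is more general):

    import numpy as np

    def hypergradient_descent(grad, x0, lr0=0.01, beta=1e-4, steps=100):
        x, lr, g_prev = np.asarray(x0, float), lr0, None
        for _ in range(steps):
            g = grad(x)
            if g_prev is not None:
                lr += beta * float(g @ g_prev)  # stepsize's own gradient step
            x = x - lr * g
            g_prev = g
        return x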
Session: 7B - Online Learning II (Thursday 03 July 09:00–10:00)
Authors: Appel, Alexander; Kosoy, Vanessa
Abstract:
We propose a framework which generalizes "decision making with structured observations" from Foster et al. (2023) by allowing robust (i.e. multivalued) models. In this framework, each model associates each decision with a convex set of probability distributions over outcomes. Nature can choose distributions out of this set in an arbitrary (adversarial) manner, that can be non-oblivious and depend on past history. The resulting framework offers much greater generality than classical bandits and reinforcement learning, since the realizability assumption becomes much weaker and more realistic. We then derive a theory of regret bounds for this framework, which extends the "decision-estimation coefficients" of Foster et al. (2023). Although our lower and upper bounds are not tight, they are sufficient to fully characterize power-law learnability. We demonstrate this theory in two special cases: robust linear bandits (previously studied in Kosoy (2024)) and tabular robust online reinforcement learning (previously studied in Tian et al. (2021)). In both cases, we derive regret bounds that improve the state-of-the-art (except that we do not address computational efficiency).
Session: 7B - Online Learning II (Thursday 03 July 09:00–10:00)
Authors: Wan, Yuanyu
Abstract:
We investigate decentralized online convex optimization (D-OCO) in changing environments, and choose adaptive regret and dynamic regret as the performance metrics. Specifically, these two metrics compare each local learner against the optimal comparator over every interval, and any sequence of comparators over all rounds, respectively. It is well-known that in the centralized setting, plenty of algorithms with (nearly) optimal bounds on these two metrics have been proposed. However, none of them has been extended to D-OCO, possibly due to the difficulty in handling their commonly used two-level structure. To fill the gap, in this paper, we propose black-box reductions from minimizing these two metrics of D-OCO to minimizing them in the centralized setting. Let $n$, $\rho$, and $T$ denote the number of local learners, the spectral gap of the communication matrix, and the time horizon, respectively. For adaptive regret, our reduction can achieve an $\tilde{O}(n\rho^{-1/4}\sqrt{\tau}\log T)$ bound over any interval of length $\tau$ in general, and an improved one of $\tilde{O}(n\rho^{-1/2}(\log T)^3)$ when facing strongly convex functions. These two bounds match existing lower bounds up to polylogarithmic factors. For dynamic regret, our reduction can achieve an $\tilde{O}(n\rho^{-1/4}\sqrt{T(1+P_T)\log T})$ bound in general, where $P_T$ is the path-length of comparators. We also provide the first lower bound for dynamic regret of D-OCO to demonstrate that our dynamic regret is nearly optimal.
Session: 7B - Online Learning II (Thursday 03 July 09:00–10:00)
Authors: Devulapalli, Pramith; Hanneke, Steve; Shaeiri, Amirreza
Abstract:
We study self-directed online learning, a variant of the online learning framework in which the learner adaptively selects the instances on which it makes predictions. Our paper is centrally focused on characterizing learning rates by quantifying the growth of the self-directed mistake bound of a concept class $C$ as a function of the number of rounds $T$. We present our results across three different learning environments: deterministic learning, randomized learning, and agnostic learning. In the deterministic case, we introduce the perfect self-directed dimension, $\mathrm{PSD}(C)$, built from the self-directed dimension $\mathrm{SD}(C)$, and show that if $\mathrm{PSD}(C) = \infty$, then the mistake bound grows as $\Theta(T)$. In the scenario when $\mathrm{PSD}(C) < \infty$ but $\mathrm{SD}(C) = \infty$, we prove an $O(\mathrm{PSD}(C)\log T)$ upper bound using a combinatorial technique based on shattered sets. When deriving lower bounds under this scenario, our analysis reveals a surprising phenomenon: the self-directed mistake bound can realize an entire spectrum of rates between $\omega(1)$ and $O(\mathrm{PSD}(C)\log T)$. Specifically, we show that for any unbounded rate function $R(T) = O(\log T)$ obeying certain mild properties, there exists a concept class $C_R$ whose mistake bound grows as $\Theta(R(T))$. In randomized self-directed learning, we present a dichotomy of rates under an oblivious adversary using the DS dimension and characterize rates for an adaptive adversary using similar conditions on $\mathrm{PSD}$ and $\mathrm{SD}$ from deterministic learning. Finally, we study agnostic self-directed learning rates under an oblivious adversary and present sublinear regret bounds when the DS dimension is finite.
Session: 7B - Online Learning II (Thursday 03 July 09:00–10:00)
Authors: Jia, Zeyu; Polyanskiy, Yury; Rakhlin, Alexander
Abstract:
We study the problem of sequential probability assignment under logarithmic loss, both with and without side information. Our objective is to analyze the minimax regret—a notion extensively studied in the literature—in terms of geometric quantities, such as covering numbers and scale-sensitive dimensions. We show that the minimax regret for the case of no side information (equivalently, the Shtarkov sum) can be upper bounded in terms of sequential square-root entropy, a notion closely related to Hellinger distance. For the problem of sequential probability assignment with side information, we develop both upper and lower bounds based on the aforementioned entropy. The lower bound matches the upper bound, up to log factors, for classes in the Donsker regime (according to our definition of entropy).
Time: Thursday 03 July 10:30–11:20
Session: 8A - Non-Convex Optimization (Thursday 03 July 10:30–11:20)
Authors: Stoger, Dominik; Zhu, Yizhe
Abstract:
For the problem of reconstructing a low-rank matrix from a few linear measurements, two classes of algorithms have been widely studied in the literature: convex approaches based on nuclear norm minimization, and non-convex approaches that use factorized gradient descent. Under certain statistical model assumptions, it is known that nuclear norm minimization recovers the ground truth as soon as the number of samples scales linearly with the number of degrees of freedom of the ground truth. In contrast, while non-convex approaches are computationally less expensive, existing recovery guarantees assume that the number of samples scales at least quadratically with the rank $r$ of the ground-truth matrix. In this paper, we close this gap by showing that the non-convex approaches can be as efficient as nuclear norm minimization in terms of sample complexity. Namely, we consider the problem of reconstructing a positive semidefinite matrix from a few Gaussian measurements. We show that factorized gradient descent with spectral initialization converges to the ground truth with a linear rate as soon as the number of samples scales with $\Omega(rd\kappa^2)$, where $d$ is the dimension, and $\kappa$ is the condition number of the ground-truth matrix. This improves the previous rank-dependence in the sample complexity of non-convex matrix factorization from quadratic to linear. Our proof relies on a probabilistic decoupling argument, where we show that the gradient descent iterates are only weakly dependent on the individual entries of the measurement matrices. We expect that our proof technique is of independent interest for other non-convex problems.
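A minimal end-to-end sketch of the algorithm analyzed here, factorized gradient descent with spectral initialization on Gaussian measurements (our toy instantiation; dimensions, stepsize, and iteration count are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    d, r, m = 30, 2, 1200                       # dimension, rank, samples
    U_star = rng.normal(size=(d, r))
    X_star = U_star @ U_star.T                  # PSD rank-r ground truth
    A = rng.normal(size=(m, d, d))              # Gaussian measurement matrices
    y = np.einsum('mij,ij->m', A, X_star)       # y_i = <A_i, X*>

    # Spectral initialization: top-r eigenpairs of the symmetrized backprojection
    M = np.einsum('m,mij->ij', y, A) / m
    M = (M + M.T) / 2
    vals, vecs = np.linalg.eigh(M)
    U = vecs[:, -r:] * np.sqrt(np.maximum(vals[-r:], 0))

    # Gradient descent on f(U) = (1/4m) sum_i (<A_i, U U^T> - y_i)^2
    eta = 0.2 / np.linalg.norm(M, 2)
    for _ in range(300):
        resid = np.einsum('mij,ij->m', A, U @ U.T) - y
        G = np.einsum('m,mij->ij', resid, A) / m
        U -= eta * (G + G.T) @ U / 2
    print(np.linalg.norm(U @ U.T - X_star) / np.linalg.norm(X_star))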
Session: 8A - Non-Convex Optimization (Thursday 03 July 10:30–11:20)
Authors: Zhou, Huanjian; Han, Andi; Takeda, Akiko; Sugiyama, Masashi
Abstract:
In large-scale applications, such as machine learning, it is desirable to design non-convex optimization algorithms with a high degree of parallelization. In this work, we first study the adaptive complexity of finding a stationary point, which is the minimal number of sequential rounds required to achieve stationarity given polynomially many queries executed in parallel at each round. We examine two fundamental cases of non-convex optimization: the high-dimensional case and the constant-dimensional case. For the high-dimensional case, \emph{i.e.}, $d = \widetilde{\Omega}(\varepsilon^{-(2 + 2p)/p})$, we show that for any (potentially randomized) algorithm, there exists a function with Lipschitz $p$-th order derivatives such that the algorithm requires at least $\varepsilon^{-(p+1)/p}$ iterations to find an $\varepsilon$-stationary point. Our lower bounds are tight and show that even with $\mathrm{poly}(d)$ queries per iteration, no algorithm has a better convergence rate than that achievable with one-query-per-round algorithms. In other words, gradient descent, the cubic-regularized Newton's method, and the $p$th-order adaptive regularization method are adaptively optimal. Our proof relies upon a novel analysis that characterizes the algorithm's output on hardness potentials built from a chain-like structure with random partition. For the constant-dimensional case, \emph{i.e.}, $d = \Theta(1)$, we propose an algorithm that bridges grid search and gradient flow trapping, finding an approximate stationary point in constant iterations. Its asymptotic tightness is verified by a new lower bound on the required queries per iteration. We show there exists a smooth function such that any algorithm running with $\Theta(\log (1/\varepsilon))$ rounds requires at least $\widetilde{\Omega}((1/\varepsilon)^{(d-1)/2})$ queries per round. This lower bound is tight up to a logarithmic factor, and implies that gradient flow trapping is adaptively optimal.
Session: 8A - Non-Convex Optimization (Thursday 03 July 10:30–11:20)
Authors: Hanneke, Steve; Moran, Shay; Shlimovich, Alexander; Yehudayoff, Amir
Abstract:
Learning theory has traditionally followed a model-centric approach, focusing on designing optimal algorithms for a fixed natural learning task (e.g., linear classification or regression). In this paper, we adopt a complementary data-centric perspective, whereby we fix a natural learning rule and focus on optimizing the training data. Specifically, we study the following question: given a learning rule \(\mathcal{A}\) and a data selection budget \(n\), how well can \(\mathcal{A}\) perform when trained on at most \(n\) data points selected from a population of \(N\) points? We investigate when it is possible to select \(n \ll N\) points and achieve performance comparable to training on the entire population. We address this question across a variety of empirical risk minimizers. Our results include optimal data-selection bounds for mean estimation, linear classification, and linear regression. Additionally, we establish two general results: a taxonomy of error rates in binary classification and in stochastic convex optimization. Finally, we propose several open questions and directions for future research.
Session: 8A - Non-Convex Optimization (Thursday 03 July 10:30–11:20)
Authors: Cornacchia, Elisabetta; Mikulincer, Dan; Mossel, Elchanan
Abstract:
The problem of learning single-index and multi-index models has gained significant interest as a fundamental task in high-dimensional statistics. Many recent works have analyzed gradient-based methods, particularly in the setting of isotropic data distributions, often in the context of neural network training. Such studies have uncovered precise characterizations of algorithmic sample complexity in terms of certain analytic properties of the target function, such as the leap, information, and generative exponents. These properties establish a quantitative separation between low- and high-complexity learning tasks. In this work, we show that high-complexity cases are rare. Specifically, we prove that introducing a small random perturbation to the data distribution—via a random shift in the first moment—renders any Gaussian single-index model as easy to learn as a linear function. We further extend this result to a class of multi-index models, namely sparse Boolean functions, also known as Juntas.
Time: Thursday 03 July 10:30–11:20
Session: 8B - Statistical Physics (Thursday 03 July 10:30–11:20)
Authors: Xu, Yizhou; Maillard, Antoine; Krzakala, Florent; Zdeborova, Lenka
Abstract:
In the matrix sensing problem, one wishes to reconstruct a matrix from (possibly noisy) observations of its linear projections along given directions. We consider this model in the high-dimensional limit: while previous works on this model primarily focused on the recovery of low-rank matrices, we consider in this work more general classes of structured signal matrices with potentially large rank, e.g.\ a product of two matrices of sizes proportional to the dimension. We provide rigorous asymptotic equations characterizing the Bayes-optimal learning performance from a number of samples which is proportional to the number of entries in the matrix. Our proof is composed of three key ingredients: $(i)$ we prove universality properties to handle structured sensing matrices, related to the ``Gaussian equivalence'' phenomenon in statistical learning, $(ii)$ we provide a sharp characterization of Bayes-optimal learning in generalized linear models with Gaussian data and structured matrix priors, generalizing previously studied settings, and $(iii)$ we leverage previous works on the problem of matrix denoising. The generality of our results allows for a variety of applications: notably, we mathematically establish predictions obtained via non-rigorous methods from statistical physics in~\cite{erba2024bilinear} regarding Bilinear Sequence Regression, a benchmark model for learning from sequences of tokens, and in~\cite{maillard2024bayes} on Bayes-optimal learning in neural networks with quadratic activation function, and width proportional to the dimension.
Session: 8B - Statistical Physics (Thursday 03 July 10:30–11:20)
Authors: Louis, Anand; Paul, Rameesh; Raghavendra, Prasad
Abstract:
The planted clique problem is a fundamental problem in the study of algorithms and has been studied in various random and semirandom models. For the planted clique problem, we can recover the clique if the size of the clique is above the conjectured computational threshold of $\Omega_p(\sqrt{n})$. A natural question that arises then is: what other planted structures can be recovered? In this work, we consider random planted and semirandom models for the $r$-coloring problem. We study the following model of instances: choose a set $S \subseteq V$ (of size $k$) and plant an arbitrary $r$-colorable graph in the subgraph induced on $S$. For each pair of vertices in $(V\setminus S) \times (V\setminus S)$, edges are added independently with probability $p$. An adversary is then allowed to add an arbitrary subgraph between $S$ and $V\setminus S$. Our main result is an efficient algorithm that recovers most of the vertices of the planted $r$-colorable graph for $k\geq cr\sqrt{n/p}$, for some constant $c$. Our key technical innovation is a novel SDP relaxation and a rounding algorithm for this problem. Our algorithm is also robust to the presence of a monotone adversary that can insert edges in the graph induced on $V\setminus S$.
Session: 8B - Statistical Physics (Thursday 03 July 10:30–11:20)
Authors: Elimelech, Dor; Huleihel, Wasim
Abstract:
The problems of detecting and recovering planted structures/subgraphs in Erd\H{o}s-R\'{e}nyi random graphs have received significant attention over the past three decades, leading to many exciting results and mathematical techniques. However, prior work has largely focused on specific ad hoc planted structures and inferential settings, while a general theory has remained elusive. In this paper, we bridge this gap by investigating the detection of an \emph{arbitrary} planted subgraph $\Gamma = \Gamma_n$ in an Erd\H{o}s-R\'{e}nyi random graph $\mathcal{G}(n, q_n)$, where the edge probability within $\Gamma$ is $p_n$. We examine both the statistical and computational aspects of this problem and establish the following results. In the dense regime, where the edge probabilities $p_n$ and $q_n$ are fixed, we tightly characterize the information-theoretic and computational thresholds for detecting $\Gamma$, and provide conditions under which a computational-statistical gap arises. Most notably, these thresholds depend on $\Gamma$ only through its number of edges, maximum degree, and maximum subgraph density. Our lower and upper bounds are general and apply to any value of $p_n$ and $q_n$ as functions of $n$. Accordingly, we also analyze the sparse regime where $q_n = \Theta(n^{-\alpha})$ and $p_n-q_n =\Theta(q_n)$, with $\alpha\in[0,2]$, as well as the critical regime where $p_n=1-o(1)$ and $q_n = \Theta(n^{-\alpha})$, both of which have been widely studied, for specific choices of $\Gamma$. For these regimes, we show that our bounds are tight for all planted subgraphs investigated in the literature thus far--and many more. Finally, we identify conditions under which detection undergoes a sharp phase transition, where the boundaries at which algorithms succeed or fail shift abruptly as a function of $q_n$.
Session: 8B - Statistical Physics (Thursday 03 July 10:30–11:20)
Authors: Lee, Daniel; Pernice, Francisco; Rajaraman, Amit; Zadik, Ilias
Abstract:
Given an arbitrary subgraph $H=H_n$ and $p=p_n\in(0,1)$, the planted subgraph model is defined as follows. A statistician observes the union of the "signal," which is a random "planted" copy $H^*$ of $H$, together with random "noise" in the form of an instance of an Erdős–Rényi graph $G(n,p)$. The goal then of the statistician is to recover the planted $H^*$ from the observed graph. Our focus in this work is to understand the minimum mean-squared error (MMSE) in terms of recovering the edges of $H^*$, as a function of $p$ and $H$. A recent paper [MNSSZ23] characterizes the graphs for which this MMSE curve undergoes a sharp phase transition from $0$ to $1$ as $p$ increases, a behavior known as the All-or-Nothing phenomenon, up to a mild density assumption on $H$. However, their techniques fail to describe the MMSE curves for graphs that do not display such a sharp phase transition. In this paper, we provide a formula for the limiting MMSE curve for any graph $H=H_n$, up to the same mild density assumption. This curve is expressed in terms of a variational formula over pairs of subgraphs of $H$, and is inspired by the celebrated subgraph expectation thresholds from probabilistic combinatorics [KK07]. Furthermore, we give a polynomial-time description of the optimizers of this variational problem. This allows one to efficiently compute the MMSE curve for any given dense graph $H$. The proof relies on a novel graph decomposition as well as a min-max duality theorem which may be of independent interest. Our results generalize to the setting of planting arbitrary monotone boolean properties, where the statistician observes the union of a planted minimal element $A\subseteq[N]$ of a monotone property and a random $\mathrm{Ber}(p)^{\otimes N}$ vector. In this setting, we provide a variational formula inspired by the so-called "fractional" expectation threshold [Tal10], again describing the MMSE curve (in this case up to a multiplicative constant).
Time: Thursday 03 July 14:00–15:36
Session: 9A - Random Graphs (Thursday 03 July 14:00–15:36)
Authors: Stephan, Ludovic; Zhu, Yizhe
Abstract:
The Bethe-Hessian matrix, introduced by Saade, Krzakala, and Zdeborová (2014), is a Hermitian matrix designed for applying spectral clustering algorithms to sparse networks. Rather than employing a non-symmetric and high-dimensional non-backtracking operator, a spectral method based on the Bethe-Hessian matrix is conjectured to also reach the Kesten-Stigum detection threshold in the sparse stochastic block model (SBM). We provide the first rigorous analysis of the Bethe-Hessian spectral method in the SBM under both the bounded expected degree and the growing degree regimes. Specifically, we demonstrate that: (i) When the expected degree $d\geq 2$, the number of negative outliers of the Bethe-Hessian matrix can consistently estimate the number of blocks above the Kesten-Stigum threshold, thus confirming a conjecture from Saade et al. (2014) for $d\geq 2$. (ii) For sufficiently large $d$, its eigenvectors can be used to achieve weak recovery. (iii) As $d\to\infty$, we establish the concentration of the locations of its negative outlier eigenvalues, and weak consistency can be achieved via a spectral method based on the Bethe-Hessian matrix.
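A minimal illustration of the spectral method under study: build the Bethe-Hessian $H(r) = (r^2-1)I - rA + D$ at $r=\sqrt{\bar c}$ (the choice suggested in Saade et al. (2014)) and count its negative eigenvalues; the parameters below are our toy choices, set above the KS threshold:

    import numpy as np

    rng = np.random.default_rng(1)
    n, q, cin, cout = 600, 3, 14.0, 2.0
    labels = rng.integers(q, size=n)
    P = np.where(labels[:, None] == labels[None, :], cin / n, cout / n)
    A = np.triu(rng.random((n, n)) < P, 1).astype(float)
    A = A + A.T                                  # sparse SBM adjacency matrix

    deg = A.sum(1)
    r = np.sqrt(deg.mean())                      # r = sqrt(average degree)
    H = (r**2 - 1) * np.eye(n) - r * A + np.diag(deg)
    print((np.linalg.eigvalsh(H) < 0).sum())     # ~q negative outliers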
Session: 9A - Random Graphs (Thursday 03 July 14:00–15:36)
Authors: Moharrami, Mehrdad; Moore, Cris; Xu, Jiaming
Abstract:
We study the problem of recovering a planted spanning tree $M_n^*$ hidden within a complete, randomly weighted graph $G_n$. Specifically, each edge $e$ has a non-negative weight drawn independently from $P_n$ if $e \in M_n^*$ and from $Q_n$ otherwise, where $P_n \equiv P$ is fixed and $Q_n$ scales with $n$ such that its density at the origin satisfies $\lim_{n\to\infty} n Q'_n(0)=1.$ We consider two representative cases: when $M_n^*$ is either a uniform spanning tree or a uniform Hamiltonian path. We analyze the recovery performance of the minimum spanning tree (MST) algorithm and derive a fixed-point equation that characterizes the asymptotic fraction of edges in $M_n^*$ successfully recovered by the MST as $n \to \infty.$ Furthermore, we establish the asymptotic mean weight of the MST, extending Frieze's $\zeta(3)$ result to the planted model. Our analysis relies on an asymptotic characterization of the local structure of the planted model, employing the framework of local weak convergence.
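The $\zeta(3)$ baseline being extended here is easy to reproduce empirically; a quick check (ours) of Frieze's classical unplanted limit with iid uniform weights:

    import numpy as np
    from scipy.sparse.csgraph import minimum_spanning_tree
    from scipy.special import zeta

    rng = np.random.default_rng(0)
    n = 400
    W = np.triu(rng.random((n, n)), 1)   # iid U[0,1] weights, upper triangle
    print(minimum_spanning_tree(W).sum(), "vs zeta(3) =", zeta(3))  # both ~1.202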
Session: 9A - Random Graphs (Thursday 03 July 14:00–15:36)
Authors: Gaudio, Julia; Sandon, Colin; Xu, Jiaming; Yang, Dana
Abstract:
This paper studies the problem of inferring a $k$-factor, specifically a spanning $k$-regular graph, planted within an Erd\H{o}s-R\'{e}nyi random graph $G(n,\lambda/n)$. We uncover an interesting "all-something-nothing" phase transition. Specifically, we show that as the average degree $\lambda$ surpasses the critical threshold of $1/k$, the inference problem undergoes a transition from almost exact recovery ("all" phase) to partial recovery ("something" phase). Moreover, as $\lambda$ tends to infinity, the accuracy of recovery diminishes to zero, leading to the onset of the "nothing" phase. This finding complements the recent result by Mossel, Niles-Weed, Sohn, Sun, and Zadik, who established that for certain sufficiently dense graphs, the problem undergoes an "all-or-nothing" phase transition, jumping from near-perfect to near-zero recovery. In addition, we characterize the recovery accuracy of a linear-time iterative pruning algorithm and show that it achieves almost exact recovery when $\lambda < 1/k$. A key component of our analysis is a two-step cycle construction: we first build trees through local neighborhood exploration and then connect them by sprinkling using reserved edges. Interestingly, for proving impossibility of almost exact recovery, we construct $\Theta(n)$ many small trees of size $\Theta(1)$, whereas for establishing the algorithmic lower bound, a single large tree of size $\Theta(\sqrt{n\log n})$ suffices.
Session: 9A - Random Graphs (Thursday 03 July 14:00–15:36)
Authors: Li, Zhangsong
Abstract:
In this paper, we focus on the matching recovery problem between a pair of correlated Gaussian Wigner matrices with a latent vertex correspondence. We are particularly interested in a robust version of this problem such that our observation is a perturbed input $(A+E,B+F)$ where $(A,B)$ is a pair of correlated Gaussian Wigner matrices and $E,F$ are adversarially chosen matrices supported on an unknown $\epsilon n \times \epsilon n$ principal minor of $A,B$, respectively. We propose a vector approximate message passing (vector AMP) algorithm that succeeds in polynomial time as long as the correlation $\rho$ between $(A,B)$ is a non-vanishing constant and $\epsilon = o\big( \tfrac{1}{(\log n)^{20}} \big)$. The main methodological inputs for our result are the iterative random graph matching algorithm proposed in Ding and Li (2025+, 2023) and the spectral cleaning procedure proposed in Ivkov and Schramm (2024). To the best of our knowledge, our algorithm is the first efficient random graph matching type algorithm that is robust under any adversarial perturbations of $n^{1-o(1)}$ size.
Session: 9A - Random Graphs (Thursday 03 July 14:00–15:36)
Authors: Chin, Byron; Mossel, Elchanan; Sohn, Youngtak; Wein, Alexander
Abstract:
We study the inference of communities in stochastic block models with a growing number of communities. For block models with $n$ vertices and a fixed number of communities $q$, it was predicted in Decelle et al.\ that there are computationally efficient algorithms for recovering the communities above the Kesten--Stigum (KS) bound and that efficient recovery is impossible below the KS bound. This conjecture has since stimulated a lot of interest, with the achievability side proven in a line of research culminating in work of Abbe and Sandon. Conversely, the hardness side of the conjecture has been supported by recent progress based on the low-degree paradigm. In this paper we investigate community recovery in the regime $q \to \infty$ where no such predictions exist. We show that efficient inference of communities remains possible above the KS bound. Furthermore, we show that recovery of block models is low-degree-hard below the KS bound when the number of communities $q\ll \sqrt{n}$. Perhaps surprisingly, we find that when $q \gg \sqrt{n}$, there is an efficient algorithm based on non-backtracking walks for recovery even below the KS bound. We identify a new threshold which we conjecture is the threshold for weak recovery in this regime. Finally, we show that detection is easy and identify (up to a constant) the information-theoretic threshold for community recovery as the number of communities $q$ diverges. Our low-degree hardness results also naturally have consequences for graphon estimation, improving results of Luo and Gao.
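For reference, the Kesten--Stigum bound referred to throughout has a standard closed form: for the symmetric $q$-block SBM with within- and between-community edge probabilities $a/n$ and $b/n$ (our notation, not the abstract's), the condition is $(a-b)^2 > q(a+(q-1)b)$; a one-line check (ours):

    def above_KS(a, b, q):
        # Kesten-Stigum condition for the symmetric q-block SBM
        return (a - b) ** 2 > q * (a + (q - 1) * b)

    print(above_KS(14, 2, 3))   # True: this parameter choice is above KS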
Session: 9A - Random Graphs (Thursday 03 July 14:00–15:36)
Authors: Bresler, Guy; Guo, Chenghao; Polyanskiy, Yury; Yao, Andrew
Abstract:
Consider a $d$-uniform random hypergraph on $n$ vertices in which hyperedges are included iid, each with probability $n^{-d+1+\delta}$, so that the average degree is $n^\delta$. The projection of a hypergraph is a graph on the same $n$ vertices where an edge connects two vertices if and only if they belong to some hyperedge. The goal is to reconstruct the hypergraph given its projection. An earlier work (Bresler et al., 2024) showed that exact recovery for $d=3$ is possible if and only if $\delta < 2/5$. This work completely resolves the question for all values of $d$ for both exact and partial recovery and for both cases of whether multiplicity information about each edge is available or not. In addition, it is shown that the reconstruction fidelity undergoes an all-or-nothing transition at a threshold. In particular, this resolves all conjectures from (Bresler et al., 2024).
Session: 9A - Random Graphs (Thursday 03 July 14:00–15:36)
Authors: Tankala, Chandan; Nagaraj, Dheeraj; Raj, Anant
Abstract:
Gradient flow in the 2-Wasserstein space is widely used to optimize functionals over probability distributions and is typically implemented using an interacting particle system with n particles. Analyzing these algorithms requires showing (a) that the finite-particle system converges and/or (b) that the resultant empirical distribution of the particles closely approximates the optimal distribution (i.e., propagation of chaos). However, establishing efficient sufficient conditions can be challenging, as the finite particle system may produce heavily dependent random variables. In this work, we study the virtual particle stochastic approximation, originally introduced for Stein Variational Gradient Descent (Das and Nagaraj, 2023). This method can be viewed as a form of stochastic gradient descent in the Wasserstein space and can be implemented efficiently. In popular settings, we demonstrate that our algorithm's output converges to the optimal distribution under conditions similar to those for the infinite particle limit, and it produces i.i.d. samples without the need to explicitly establish propagation of chaos bounds.
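For orientation, the deterministic interacting-particle baseline that such stochastic approximations modify is plain SVGD; a minimal sketch with an RBF kernel and a standard Gaussian target (vanilla SVGD only, not the virtual-particle variant analyzed in the paper):

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, h, eps = 100, 2, 0.5, 0.1
    x = rng.normal(scale=3.0, size=(n, d))       # particles, mis-initialized

    grad_log_p = lambda z: -z                    # target: standard Gaussian

    for _ in range(500):
        diff = x[:, None, :] - x[None, :, :]     # diff[j, i] = x_j - x_i
        K = np.exp(-(diff ** 2).sum(-1) / (2 * h))
        # phi(x_i) = (1/n) sum_j [k(x_j,x_i) grad log p(x_j) + grad_{x_j} k(x_j,x_i)]
        phi = (K @ grad_log_p(x) + (K[..., None] * (-diff / h)).sum(0)) / n
        x += eps * phi
    print(x.mean(0), x.std(0))                   # ~0 mean, ~unit std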
Session: 9A - Random Graphs (Thursday 03 July 14:00–15:36)
Authors: Du, Hang; Gong, Shuyang; Xu, Jiaming
Abstract:
We investigate the problem of detecting and estimating a changepoint in the attachment function of a network evolving according to a preferential attachment model on $n$ vertices, using only a single final snapshot of the network. \cite{bet2023detecting} show that a simple test based on thresholding the number of vertices with minimum degrees can detect the changepoint when the change occurs at time $n-\Omega(\sqrt{n})$. They further make the striking conjecture that detection becomes impossible for any test if the change occurs at time $n-o(\sqrt{n}).$ \cite{kaddouri2024impossibility} take a step forward by proving that detection is impossible if the change occurs at time $n-o(n^{1/3}).$ In this paper, we resolve the conjecture affirmatively, proving that detection is indeed impossible if the change occurs at time $n-o(\sqrt{n}).$ Furthermore, we establish that estimating the changepoint with an error smaller than $o(\sqrt{n})$ is also impossible, thereby confirming that the estimator proposed in \cite{bhamidi2018change} is order-optimal.
Time: Thursday 03 July 14:00–15:36
Session: 9B - Learning Theory II (Thursday 03 July 14:00–15:36)
Authors: Klivans, Adam; Stavropoulos, Konstantinos; Vasilyan, Arsen
Abstract:
The seminal work of Linial, Mansour, and Nisan gave a quasipolynomial-time algorithm for learning constant-depth circuits (AC0) with respect to the uniform distribution on the hypercube. Extending their algorithm to the setting of malicious noise, where both covariates and labels can be adversarially corrupted, has remained open. Here we achieve such a result, inspired by recent work on learning with distribution shift. Our running time essentially matches their algorithm, which is known to be optimal assuming various cryptographic primitives. Our proof uses a simple outlier-removal method combined with Braverman's theorem for fooling constant-depth circuits. We attain the best possible dependence on the noise rate and succeed in the harshest possible noise model (i.e., contamination or so-called ``nasty noise'').
Session: 9B - Learning Theory II (Thursday 03 July 14:00–15:36)
Authors: Kovačević, Filip; Zhang, Yihan; Mondelli, Marco
Abstract:
Multi-index models provide a popular framework to investigate the learnability of functions with low-dimensional structure and, also due to their connections with neural networks, they have been the object of intensive recent study. In this paper, we focus on recovering the subspace spanned by the signals via spectral estimators -- a family of methods routinely used in practice, often as a warm-start for iterative algorithms. Our main technical contribution is a precise asymptotic characterization of the performance of spectral methods, when sample size and input dimension grow proportionally and the dimension $p$ of the space to recover is fixed. Specifically, we locate the top-$p$ eigenvalues of the spectral matrix and establish the overlaps between the corresponding eigenvectors (which give the spectral estimators) and a basis of the signal subspace. Our analysis unveils a phase transition phenomenon in which, as the sample complexity grows, eigenvalues escape from the bulk of the spectrum and, when that happens, eigenvectors recover directions of the desired subspace. The precise characterization we put forward enables the optimization of the data preprocessing, thus allowing us to identify the spectral estimator that requires the minimal sample size for weak recovery.
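A minimal single-index illustration of the spectral-estimator family analyzed here (our toy example; the preprocessing $T(y)=y-1$ is one simple choice, and the paper's point is precisely that this choice can be optimized):

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 4000, 100
    w = rng.normal(size=d); w /= np.linalg.norm(w)   # hidden direction
    X = rng.normal(size=(n, d))
    y = (X @ w) ** 2                                 # single-index observations

    M = (X.T * (y - 1)) @ X / n       # spectral matrix (1/n) sum T(y_i) x_i x_i^T
    v = np.linalg.eigh(M)[1][:, -1]   # top eigenvector = spectral estimator
    print(abs(v @ w))                 # overlap with the signal, close to 1 here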
Session: 9B - Learning Theory II (Thursday 03 July 14:00–15:36)
Authors: Matsumoto, Namiko; Mazumdar, Arya
Abstract:
In statistics, generalized linear models (GLMs) are widely used for modeling data and can expressively capture potential nonlinear dependence of the model's outcomes on its covariates. Within the broad family of GLMs, those with binary outcomes, which include logistic and probit regressions, are motivated by common tasks such as binary classification with (possibly) non-separable data. In addition, in modern machine learning and statistics, data is often high-dimensional yet has a low intrinsic dimension, making sparsity constraints in models another reasonable consideration. In this work, we propose to use and analyze an iterative hard thresholding (projected gradient descent on the ReLU loss) algorithm, called \emph{binary iterative hard thresholding (BIHT)}, for parameter estimation in sparse GLMs with binary outcomes. We establish that BIHT is statistically efficient and converges to the correct solution for parameter estimation in a general class of sparse binary GLMs. Unlike many other methods for learning GLMs, including maximum likelihood estimation, generalized approximate message passing, and GLM-tron (Kakade et al., 2011; Bahmani et al., 2016), BIHT does not require knowledge of the GLM's link function, offering flexibility and generality in allowing the algorithm to learn arbitrary binary GLMs. As two applications, logistic and probit regression are additionally studied. In this regard, it is shown that in logistic regression, the algorithm is in fact statistically optimal in the sense that the order-wise sample complexity matches (up to logarithmic factors) the lower bound obtained previously. To the best of our knowledge, this is the first work achieving statistical optimality for logistic regression in all noise regimes with a computationally efficient algorithm. Moreover, for probit regression, our sample complexity is on the same order as that obtained for logistic regression.
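A minimal sketch of BIHT in the noiseless binary setting (our simplification; the paper handles general sparse binary GLMs with noise): alternate a (sub)gradient step on the ReLU-type loss with hard thresholding to the sparsity level:

    import numpy as np

    rng = np.random.default_rng(0)
    d, s, m = 200, 5, 400
    x_star = np.zeros(d)
    idx = rng.choice(d, s, replace=False)
    x_star[idx] = rng.normal(size=s)
    x_star /= np.linalg.norm(x_star)
    A = rng.normal(size=(m, d))
    y = np.sign(A @ x_star)                     # binary outcomes

    def hard_threshold(v, s):
        out = np.zeros_like(v)
        keep = np.argsort(np.abs(v))[-s:]       # s largest-magnitude coords
        out[keep] = v[keep]
        return out

    x = np.zeros(d)
    for _ in range(100):
        g = A.T @ (y - np.sign(A @ x))          # (sub)gradient of the ReLU loss
        x = hard_threshold(x + g / m, s)
        x /= max(np.linalg.norm(x), 1e-12)      # binary labels fix only direction
    print(np.linalg.norm(x - x_star))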
Session: 9B - Learning Theory II (Thursday 03 July 14:00–15:36)
Authors: Musco, Cameron; Musco, Christopher; Rosenblatt, Lucas; Singh, Apoorv Vikram
Abstract:
We study the problem of approximately recovering a probability distribution given noisy measurements of its Chebyshev polynomial moments. This problem arises broadly across algorithms, statistics, and machine learning. By leveraging a global decay bound on the coefficients in the Chebyshev expansion of any Lipschitz function, we sharpen prior work, proving that accurate recovery in the Wasserstein distance is possible with more noise than previously known. Our result immediately yields a number of applications: 1. We give a simple ``linear query'' algorithm for constructing a differentially private synthetic data distribution with Wasserstein-1 error $\tilde{O}(1/n)$ based on a dataset of $n$ points in $[-1,1]$. This bound is optimal up to log factors and matches a recent breakthrough of Boedihardjo, Strohmer, and Vershynin [Probab. Theory. Rel., 2024], which uses a more complex ``superregular random walk'' method to beat an $O(1/\sqrt{n})$ accuracy barrier inherent to earlier approaches. 2. We give an $\tilde{O}(n^2/\epsilon)$ time algorithm for the linear algebraic problem of estimating the spectral density of an $n\times n$ symmetric matrix up to $\epsilon$ error in the Wasserstein distance. Our result accelerates prior methods from Chen et al. [ICML 2021] and Braverman et al. [STOC 2022]. 3. We tighten an analysis of Vinayak, Kong, Valiant, and Kakade [ICML 2019] on the maximum likelihood estimator for the statistical problem of ``Learning Populations of Parameters'', extending the parameter regime in which sample optimal results can be obtained. Beyond these main results, we provide an extension of our bound to estimating distributions in $d > 1$ dimensions. We hope that these bounds will find applications more broadly to problems involving distribution recovery from noisy moment information.
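A minimal version of the moment-based recovery underlying these applications (our sketch of the standard Chebyshev reconstruction; damping of the truncated series, e.g. Jackson smoothing, is omitted): estimate Chebyshev moments from points in $[-1,1]$ and rebuild a density against the Chebyshev weight:

    import numpy as np
    from numpy.polynomial.chebyshev import chebval

    rng = np.random.default_rng(0)
    lam = 2 * rng.beta(2, 5, size=2000) - 1            # points in [-1, 1]

    K = 20
    moments = np.array([np.cos(k * np.arccos(lam)).mean() for k in range(K + 1)])
    # T_k(x) = cos(k arccos x); noisy moments would be handled identically

    grid = np.linspace(-0.99, 0.99, 400)
    coef = np.concatenate(([1.0], 2 * moments[1:]))    # p ~ w(x) * sum b_k T_k
    dens = chebval(grid, coef) / (np.pi * np.sqrt(1 - grid ** 2))
    print((dens * (grid[1] - grid[0])).sum())          # total mass ~1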
Session: 9B - Learning Theory II (Thursday 03 July 14:00–15:36)
Authors: Heidari, Mohsen; Khardon, Roni
Abstract:
The Fourier representation for the uniform distribution over the Boolean cube has found numerous applications in algorithms and complexity analysis. Notably, in learning theory, the learnability of Disjunctive Normal Form (DNF) under the uniform and product distributions has been established through such representations. This paper makes three main contributions. First, it introduces a generalized Fourier expansion that can be used with any distribution $D$ through the representation of the distribution as a Bayesian network (BN). Second, it shows that the main algorithmic tools for learning with the Fourier representation that use membership queries to approximate functions by recovering their heavy Fourier coefficients, can be used with slight modifications with the generalized expansion. These results hold for any distribution. Third, it analyzes the $L_1$ spectral norm of conjunctions under the new expansion, showing that it is bounded for a class of distributions which can be represented by a difference-bounded tree BN, where a parent node in the BN representation can change the conditional expectation of a child node by at most $\alpha<0.5$. Lower bounds are presented to show that such constraints are necessary. Combining these contributions, the paper shows learnability of DNF with membership queries under difference-bounded tree BN.
Session: 9B - Learning Theory II (Thursday 03 July 14:00–15:36)
Authors: Bhattacharjee, Robi; Frohnapel, Karolin; Luxburg, Ulrike
Abstract:
SHAP is one of the most popular local feature-attribution methods. Given a function $f$ and an input $x \in \mathbb{R}^d$, it quantifies each feature's contribution to $f(x)$. Recently, SHAP has been increasingly used for global insights: practitioners average the absolute SHAP values over many data points to compute global feature importance scores, which are then used to discard ``unimportant'' features. In this work, we investigate the soundness of this practice by asking whether small aggregate SHAP values necessarily imply that the corresponding feature does not affect the function. Unfortunately, the answer is no: even if the $i$-th SHAP value equals $0$ on the entire data support, there exist functions that clearly depend on feature $i$. The issue is that computing SHAP values involves evaluating $f$ on points outside of the data support, where $f$ can be strategically designed to mask its dependence on feature $i$. To address this, we propose to aggregate SHAP values over the extended support, which is the product of the marginals of the underlying distribution. With this modification, we show that a small aggregate SHAP value implies that we can safely discard the corresponding feature. We then extend our results to KernelSHAP, the most popular method to approximate SHAP values in practice. We show that if KernelSHAP is computed over the extended distribution, a small aggregate KernelSHAP value justifies feature removal. This result holds independently of whether KernelSHAP accurately approximates true SHAP values, making it one of the first theoretical results to characterize the KernelSHAP algorithm itself. Our findings have both theoretical and practical implications. We introduce the ``Shapley Lie algebra,'' which offers algebraic insights that may enable a deeper investigation of SHAP, and we show that a simple preprocessing step -- randomly permuting each column of the data matrix -- enables safely discarding features based on aggregate SHAP and KernelSHAP values.
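The preprocessing step at the end is simple to state in code; a minimal sketch (ours) of the column permutation that turns the empirical distribution into a sample from the product of marginals, i.e. the extended support:

    import numpy as np

    def extend_support(X, rng=None):
        # Independently permute each column; rows of the result are draws
        # from the product of the per-feature marginals (extended support),
        # over which aggregate |SHAP| values can justify feature removal.
        if rng is None:
            rng = np.random.default_rng()
        Xp = X.copy()
        for j in range(Xp.shape[1]):
            Xp[:, j] = rng.permutation(Xp[:, j])
        return Xp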
Session: 9B - Learning Theory II (Thursday 03 July 14:00–15:36)
Authors: Esmaili Mallory, Matthew; Huang, Kevin Han; Austern, Morgane
Abstract:
Over the last decade, a wave of research has characterized the exact asymptotic risk of many high-dimensional models in the proportional regime. Two foundational results have driven this progress: Gaussian universality, which shows that the asymptotic risk of estimators trained on non-Gaussian and Gaussian data is equivalent, and the convex Gaussian min-max theorem (CGMT), which characterizes the risk under Gaussian settings. However, these results rely on the assumption that the data consists of independent random vectors—an assumption that significantly limits their applicability to many practical setups. In this paper, we address this limitation by generalizing both results to the dependent setting. More precisely, we prove that Gaussian universality still holds for high-dimensional logistic regression under block dependence, and establish a novel CGMT framework that accommodates correlation across both the covariates and observations. Using these results, we establish the impact of data augmentation, a widespread practice in deep learning, on the asymptotic risk.
Time: Thursday 03 July 16:12–18:00
Session: 10A - Language Models (Thursday 03 July 16:12–18:00)
Authors: Joshi, Nirmit; Srebro, Nathan; Vardi, Gal; Block, Adam; Goel, Surbhi; Li, Zhiyuan; Misiakiewicz, Theodor
Abstract:
For a given base class of sequence-to-next-token generators, we consider learning prompt-to-answer mappings obtained by iterating a fixed (time-invariant) generator for multiple steps, thus generating a chain-of-thought, and then taking the final token as the answer. We formalize the learning problems both when the chain-of-thought is observed and when training only on prompt-answer pairs, with the chain-of-thought latent. We analyze the sample and computational complexity both in terms of general properties of the base class (e.g. its VC dimension) and for specific base classes. We present a simple base class that allows for universal representability and computationally tractable chain-of-thought learning. Central to our development is that time invariance allows for sample complexity that is independent of the length of the chain-of-thought. Attention arises naturally in our construction.
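The prompt-to-answer mapping being learned has a very short procedural description; a minimal sketch (ours) of iterating a fixed next-token generator and returning the final token:

    def chain_of_thought_answer(generator, prompt, T):
        # Apply the same (time-invariant) next-token generator T times,
        # appending each produced token; the answer is the final token.
        seq = list(prompt)
        for _ in range(T):
            seq.append(generator(seq))
        return seq[-1]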
Session: 10A - Language Models (Thursday 03 July 16:12–18:00)
Authors: Rohatgi, Dhruv; Block, Adam; Huang, Audrey; Krishnamurthy, Akshay; Foster, Dylan
Abstract:
Next-token prediction with the log loss is a cornerstone of autoregressive sequence modeling, but, in practice, suffers from error amplification, where errors in the model compound and generation quality degrades as sequence length $H$ increases. From a theoretical perspective, this phenomenon should not appear in well-specified settings, and, indeed, a growing body of empirical work hypothesizes that misspecification, where the learner is not sufficiently expressive to represent the target distribution, may be the root cause. Under misspecification---where the goal is to learn as well as the best-in-class model up to a multiplicative approximation factor $C\geq 1$---we confirm that $C$ indeed grows with $H$ for next-token prediction, lending theoretical support to this empirical hypothesis. We then ask whether this mode of error amplification is avoidable algorithmically, computationally, or information-theoretically, and uncover inherent computational-statistical tradeoffs. We show: (1) Information-theoretically, one can avoid error amplification and achieve $C=O(1)$. (2) Next-token prediction can be made robust to achieve $C=O(H)$, representing moderate error amplification, but this is an inherent barrier: any next-token prediction-style objective must suffer $C=\Omega(H)$. (3) For the natural testbed of autoregressive linear models, no computationally efficient algorithm can achieve sub-polynomial approximation factor $C=e^{(\log H)^{1-\Omega(1)}}$; however, at least for binary token spaces, one can smoothly trade compute for statistical power and improve on $C=\Omega(H)$ in sub-exponential time. Our results have consequences in the more general setting of imitation learning, where the widely-used behavior cloning generalizes next-token prediction.
Session: 10A - Language Models (Thursday 03 July 16:12–18:00)
Authors: Raman, Vinod; Tewari, Ambuj; Li, Jiaxun
Abstract:
We study generation through the lens of statistical learning theory. First, we abstract and formalize the results of Gold (1967), Angluin (1979), Angluin (1980) and Kleinberg and Mullainathan (2024) in terms of a binary hypothesis class defined over an abstract countable example space. Then, we extend the notion of “generation” from Kleinberg and Mullainathan (2024) to two new settings, which we call “uniform” and “non-uniform” generation, and provide a characterization of which hypothesis classes are uniformly and non-uniformly generatable. As is standard in learning theory, our characterizations are in terms of the finiteness of a new combinatorial dimension termed the Closure dimension. By doing so, we are able to compare generatability with predictability (captured via PAC and on-line learnability) and show that these two properties of hypothesis classes are incomparable – there are classes that are generatable but not predictable and vice versa. Finally, we extend our results to capture prompted generation and give a complete characterization of which classes are prompt generatable, generalizing some of the work by Kleinberg and Mullainathan (2024).
Session: 10A - Language Models (Thursday 03 July 16:12–18:00)
Authors: Papazov, Hristo; Flammarion, Nicolas
Abstract:
We study the problem of learning computable functions in the limit by extending Gold’s inductive inference framework to incorporate \textit{computational observations}. Traditional black-box learning is inherently limited, but we show that additional structured observations — such as runtime estimates (clock observations) or external traces of computation (behavior observations) — enable learnability. We establish that any time-bounded class of computable functions is learnable in the limit under black-box observations, while all computable functions can be learned with clock observations under a relaxed version of the Extended Church-Turing Thesis. Further, we build a formal framework around observations of \textit{computational agents} and show that learning computable functions from behavior reduces to learning rational functions from input and output, yielding a polynomial-time state-merging algorithm under specific constraints. On the negative side, we show that polynomial characteristic sets cannot exist for the class of all computable functions. Our results provide a unified perspective on the role of structured observations of computational models in inductive inference.
Session: 10A - Language Models (Thursday 03 July 16:12–18:00)
Authors: Foster, Dylan; Mhammedi, Zakaria; Rohatgi, Dhruv
Abstract:
Language model alignment (or reinforcement learning) techniques that leverage active exploration---deliberately encouraging the model to produce diverse, informative responses---offer the promise of super-human capabilities. However, current understanding of algorithm design primitives for computationally efficient exploration with language models is limited. To better understand how to leverage access to powerful pre-trained generative models to improve the efficiency of exploration, we introduce a new computational framework for RL with language models, in which the learner interacts with the model through a sampling oracle. Focusing on the linear softmax model parameterization, we provide new results that reveal the computational-statistical tradeoffs of efficient exploration: 1. [Necessity of coverage.] Coverage refers to the extent to which the pre-trained model covers near-optimal responses---a form of hidden knowledge. We show that coverage, while not necessary for data efficiency, lower bounds the runtime of any algorithm in our framework. 2. [Inference-time exploration.] We introduce a new algorithm, SpannerSampling, which obtains optimal data efficiency and is computationally efficient whenever the pre-trained model enjoys sufficient coverage, matching our lower bound. SpannerSampling leverages inference-time computation with the pre-trained model to reduce the effective search space for exploration. 3. [Insufficiency of training-time interventions.] We contrast (2) by showing that training-time interventions that produce proper policies cannot achieve similar guarantees in polynomial time. Finally, we show that under additional representational assumptions, one can achieve improved runtime (replacing sequence-level coverage with token-level coverage) through multi-turn exploration.
Session: 10A - Language Models (Thursday 03 July 16:12–18:00)
Authors: Blum, Avrim; Hanneke, Steve; Pabbaraju, Chirag; Saless, Donya
Abstract:
We consider a model for explainable AI in which an explanation for a prediction $h(x)=y$ consists of a subset $S'$ of the training data (if it exists) such that all classifiers $h'\in\mathcal{H}$ that make at most $b$ mistakes on $S'$ predict $h'(x)=y$. Such a set $S'$ serves as a {\em proof} that $x$ indeed has label $y$ under the assumption that (1) the true target function $h^\star$ belongs to $\mathcal{H}$, and (2) the set $S$ contains at most $b$ noisy or corrupted points. For example, if $b=0$ and $\mathcal{H}$ is the family of linear classifiers in $\mathbb{R}^d$, and if $x$ lies inside the convex hull of the positive data points in $S$ (and therefore every consistent linear classifier labels $x$ as positive), then Carath\'eodory's theorem states that $x$ in fact lies inside the convex hull of $d+1$ of those points. So, a set $S'$ of size $d+1$ could be released as an explanation for a positive prediction, and would serve as a short proof of correctness of the prediction under the assumption of perfect realizability. In this work, we consider this problem more generally, for general hypothesis classes $\mathcal{H}$ and general values $b\geq 0$. We define the notion of the {\em robust hollow star number} of $\mathcal{H}$ (which generalizes the standard hollow star number), and show that it precisely characterizes the worst-case size of the smallest certificate achievable, and analyze its size for natural classes. We also consider worst-case distributional bounds on certificate size, as well as {\em distribution-dependent} bounds that we show tightly control the sample size needed to get a certificate for any given test example. In particular, we define a notion of the {\em certificate coefficient} $\varepsilon_x$ of an example $x$ with respect to a data distribution $\mathcal{D}$ and target function $h^\star$, and prove matching upper and lower bounds on sample size as a function of $\varepsilon_x$, $b$, and the VC dimension $d$ of $\mathcal{H}$.
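The Carathéodory example in the abstract is concrete enough to code up. The sketch below (a minimal illustration under our own choices of solver and data, not the authors' procedure) finds a certificate for the $b=0$ linear-classifier case: the feasibility LP expressing $x$ as a convex combination of positive points has $d+1$ equality constraints, so a basic feasible solution has at most $d+1$ nonzero weights, and the support of those weights is exactly the short proof $S'$.

```python
# Minimal sketch (assuming scipy is available) of the b = 0 linear-classifier
# example from the abstract: if a test point x lies in the convex hull of the
# positive training points in R^d, Caratheodory's theorem promises a
# certificate of d + 1 points. A basic feasible solution of the LP below has
# at most d + 1 nonzero weights, one per equality constraint.
import numpy as np
from scipy.optimize import linprog

def caratheodory_certificate(pos_points: np.ndarray, x: np.ndarray):
    """Return indices of <= d + 1 positive points whose hull contains x,
    or None if x is outside the hull. pos_points has shape (m, d)."""
    m, d = pos_points.shape
    # Feasibility LP: find weights w >= 0 with  pos_points^T w = x, sum w = 1.
    A_eq = np.vstack([pos_points.T, np.ones((1, m))])   # (d + 1) x m
    b_eq = np.concatenate([x, [1.0]])
    res = linprog(c=np.zeros(m), A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * m,
                  method="highs-ds")       # simplex -> basic (vertex) solution
    if not res.success:
        return None
    return np.flatnonzero(res.x > 1e-9)

rng = np.random.default_rng(0)
S_pos = rng.normal(size=(50, 3))           # hypothetical positive examples in R^3
x = S_pos[:10].mean(axis=0)                # a point inside their convex hull
cert = caratheodory_certificate(S_pos, x)
print("certificate size:", len(cert), "(at most d + 1 = 4)")
```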
Session: 10A - Language Models (Thursday 03 July 16:12–18:00)
Authors: Charikar, Moses; Pabbaraju, Chirag
Abstract:
The recent work of Kleinberg and Mullainathan (2024) provides a concrete model for language generation in the limit: given a sequence of examples from an unknown target language, the goal is to generate new examples from the target language such that no incorrect examples are generated beyond some point. In sharp contrast to strong negative results for the closely related problem of language identification, they establish positive results for language generation in the limit for all countable collections of languages. Follow-up work by Li, Raman, and Tewari (2024) studies bounds on the number of distinct inputs required by an algorithm before correct language generation is achieved — namely, whether this is a constant for all languages in the collection (uniform generation) or a language-dependent constant (non-uniform generation). We show that every countable collection has a generator with the stronger property of non-uniform generation in the limit. However, while the generation algorithm of Kleinberg and Mullainathan (2024) can be implemented using membership queries, we show that no algorithm can non-uniformly generate, even for collections of just two languages, using only membership queries. We also formalize the tension between validity and breadth in the generation algorithm of Kleinberg and Mullainathan (2024) by introducing a definition of exhaustive generation, and show a strong negative result for exhaustive generation. Our result shows that a tradeoff between validity and breadth is inherent for generation in the limit. We also provide a precise characterization of the language collections for which exhaustive generation is possible. Finally, inspired by algorithms that can choose to obtain feedback, we consider a model of uniform generation with feedback, completely characterizing language collections for which such uniform generation with feedback is possible in terms of an abstract complexity measure of the collection.
Session: 10A - Language Models (Thursday 03 July 16:12–18:00)
Authors: Feldman, Vitaly; Kornowski, Guy; Lyu, Xin
Abstract:
Several recent works demonstrate that training of large language models leads to memorization of a significant fraction of training data. Such memorization can lead to privacy violations when training on sensitive user data and thus motivates the study of data memorization's role in learning. In this work we demonstrate that several simple and well-studied binary classification problems exhibit a trade-off between the number of samples available to a learning algorithm and the amount of information about the training data that a learning algorithm needs to memorize to be accurate. In particular, $\Omega(d)$ bits of information about the training data need to be memorized when a single $d$-dimensional example is available, which then decays as $\Theta(d/n)$ as the number of examples grows (for $n\leq \sqrt{d}$). Further, this rate is achieved (up to logarithmic factors) by simple learning algorithms. Our results build on the work of Brown et al. (2021), and establish a new framework for proving memorization lower bounds that is based on an approximate version of strong data processing inequalities.
Session: 10A - Language Models (Thursday 03 July 16:12–18:00)
Authors: Haris, Themistoklis; Onak, Krzysztof
Abstract:
A key limitation of autoregressive Transformers is the large memory needed at inference-time to cache all previous key-value (KV) embeddings. Prior works address this by compressing the KV cache but often assume specific structural properties of the embeddings. This raises the following natural question: Can truly sublinear space utilization be achieved without such assumptions? In this work, we answer this question in the negative. Any algorithm for attention-based token generation must use $\Theta(nd)$ space, where $n$ is the number of tokens generated so far and $d \geq \Omega(\log n)$ is the dimension of the KV embeddings. Our proof involves a reduction from a classic communication complexity problem and uses a randomized construction that leverages properties of projections in the spirit of the Johnson-Lindenstrauss lemma. For the low-dimensional regime $d = o(\log n)$, we show that any algorithm requires $\Omega(d e^d)$ space and prove, using tight bounds on covering numbers, that \textsc{SubGen}, proposed by Zandieh et al. (2024b), matches this bound. Further, we investigate how sparsity assumptions enable token generation in truly sublinear space, presenting impossibility results and proposing a new KV cache compression algorithm for sliding window attention when the value cache outside the window is unmasked. Finally, we analyze token generation’s time complexity, using an indistinguishability argument to prove that no non-adaptive algorithm can compute attention online in sublinear time for all tokens.
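For readers less familiar with the KV cache, the following minimal numpy sketch (ours, for intuition; it implements none of the paper's results) shows where the $\Theta(nd)$ footprint comes from: each generated token appends one key and one value vector, and attention at step $n$ reads all $n$ of them.

```python
# Minimal numpy sketch of naive autoregressive attention with a KV cache:
# the cache stores Theta(n * d) floats after n generation steps.
import numpy as np

d = 64                       # dimension of key/value embeddings
K_cache, V_cache = [], []    # each grows by one d-vector per step

def attend(q):
    """Softmax attention of query q over all cached key-value pairs."""
    K = np.stack(K_cache)                  # shape (n, d)
    V = np.stack(V_cache)                  # shape (n, d)
    scores = K @ q / np.sqrt(d)            # (n,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                     # (d,)

rng = np.random.default_rng(1)
for step in range(1, 6):
    # hypothetical per-token projections; a real model derives these
    # from the current hidden state
    K_cache.append(rng.normal(size=d))
    V_cache.append(rng.normal(size=d))
    _ = attend(rng.normal(size=d))
    print(f"step {step}: cache holds {2 * len(K_cache) * d} floats")
```

The lower bound in the abstract says this $nd$ footprint is unavoidable in general: without structural assumptions, no compression scheme can do asymptotically better.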
Time: Thursday 03 July 16:12–18:00
Session: 10B - Bandits and Causality (Thursday 03 July 16:12–18:00)
Authors: Cai, Yang; Kalavasis, Alkis; Mamali, Katerina; Mehrotra, Anay; Zampetakis, Manolis
Abstract:
Most of the widely used estimators of average treatment effect (ATE) in causal inference rely on the assumptions of unconfoundedness and overlap. Unconfoundedness requires that the observed covariates account for all correlations between the outcome and treatment. Overlap requires the existence of randomness in treatment decisions for all individuals. Nevertheless, there are many types of studies that frequently violate unconfoundedness or overlap, e.g., observational studies or studies with deterministic treatment decisions. In this paper, we initiate the study of general conditions that enable the identification of the average treatment effect, extending well beyond unconfoundedness and overlap. In particular, following the paradigm of learning theory, we provide an interpretable condition that is sufficient and nearly necessary for the identification of ATE. Moreover, this condition characterizes the identification of the average treatment effect on the treated (ATT) and can be used to characterize other treatment effects as well. To illustrate the utility of our condition, we present several well-studied scenarios where our condition is satisfied and, hence, we prove that ATE can be identified in regimes that prior works could not capture. For example, this holds for the models proposed by Tan (2006) and Rosenbaum (2002), and the Regression Discontinuity design model introduced by Thistlethwaite and Campbell (1960). For each of these scenarios, we also show that, under natural additional assumptions, ATE can be estimated from finite samples. We believe these findings open new avenues for bridging learning-theoretic insights and causal inference methodologies, particularly in observational studies with complex treatment mechanisms.
Session: 10B - Bandits and Causality (Thursday 03 July 16:12–18:00)
Authors: Black, Hadley; Mazumdar, Arya; Saha, Barna; Xu, Yinzhan
Abstract:
The graph reconstruction problem has been extensively studied under various query models. In this paper, we propose a new query model regarding the number of connected components, which is one of the most basic and fundamental graph parameters. Formally, we consider the problem of reconstructing an $n$-node $m$-edge graph with oracle queries of the following form: provided with a subset of vertices, the oracle returns the number of connected components in the induced subgraph. We show $\Theta(\frac{m \log n}{\log m})$ queries in expectation are both sufficient and necessary to adaptively reconstruct the graph. In contrast, we show that $\Omega(n^2)$ non-adaptive queries are required, even when $m = O(n)$. We also provide an $O(m\log n + n\log^2 n)$ query algorithm using only two rounds of adaptivity.
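To make the query model concrete, here is a toy implementation of the oracle itself (for intuition only; the paper's contribution is the reconstruction algorithm, not the oracle): given a vertex subset, count the connected components of the induced subgraph with union-find.

```python
# Toy implementation of the connected-components query oracle:
# given a vertex subset S, count connected components of the induced
# subgraph via union-find.
def components_oracle(edges, S):
    S = set(S)
    parent = {v: v for v in S}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]   # path halving
            v = parent[v]
        return v

    for u, v in edges:
        if u in S and v in S:
            ru, rv = find(u), find(v)
            if ru != rv:
                parent[ru] = rv             # union
    return sum(1 for v in S if find(v) == v)

# Example: a 5-cycle.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
print(components_oracle(edges, range(5)))      # 1: the whole cycle
print(components_oracle(edges, [1, 2, 3, 4]))  # 1: a path
print(components_oracle(edges, [1, 3]))        # 2: two isolated vertices
```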
Session: 10B - Bandits and Causality (Thursday 03 July 16:12–18:00)
Authors: Kim, Seok-Jin; Kim, Gi-Soo; Oh, Min-hwan
Abstract:
We study the semiparametric bandit problem, where the reward for each action depends on both a linear component and an unmodeled, potentially adversarial shift. This setting strictly generalizes classical linear bandits while capturing additional complexities frequently encountered in real-world systems. To the best of our knowledge, experimental design for semiparametric reward models had not been investigated in prior work. We develop a novel experimental design approach that, for the first time, enables both sharp regret bounds and exploration guarantees simultaneously in semiparametric bandits. Our proposed algorithm achieves a \(\tilde{\mathcal{O}}(\sqrt{dT})\) regret bound, matching known lower bounds for linear bandits with finite arms for the first time. Moreover, our approach achieves the first logarithmic regret under a positive suboptimality gap. We further establish the first exploration-based guarantees for semiparametric bandits, such as PAC and best arm identification guarantees. These theoretical advances are enabled by a refined, non-asymptotic analysis of orthogonalized regression that achieves the optimal \(\sqrt{d}\)-rate, paving the way for robust and efficient bandit algorithms in a broader class of problems.
Session: 10B - Bandits and Causality (Thursday 03 July 16:12–18:00)
Authors: Chase, Zachary; Mehalel, Idan
Abstract:
In binary ($0/1$) online classification with apple tasting feedback, the learner receives feedback only when predicting $1$. Besides some degenerate learning tasks, all previously known learning algorithms for this model are randomized. Consequently, prior to this work it was unknown whether deterministic apple tasting is generally feasible. In this work, we provide the first widely-applicable deterministic apple tasting learner, and show that in the realizable case, a hypothesis class is learnable if and only if it is deterministically learnable, confirming a conjecture of Raman, Subedi, Raman, and Tewari (2024). Quantitatively, we show that every class $H$ is learnable with mistake bound $O(\sqrt{L(H) T \log T})$ (where $L(H)$ is the Littlestone dimension of $H$), and that this is tight for some classes. This demonstrates a separation between a deterministic and a randomized learner, where the latter can learn every class with mistake bound $O(\sqrt{L(H)T})$, as shown in Raman et al. (2024). We further study the agnostic case, in which the best hypothesis makes at most $k$ mistakes, and prove a trichotomy stating that every class $H$ must be either easy, hard, or unlearnable. Easy classes have (both randomized and deterministic) mistake bound $\Theta_{H}(k)$. Hard classes have randomized mistake bound $\tilde{\Theta}_{H}(k + \sqrt{T})$, and deterministic mistake bound $\tilde{\Theta}_{H}(\sqrt{k \cdot T})$, where $T$ is the time horizon. Unlearnable classes have (both randomized and deterministic) mistake bound $\Theta(T)$. Our upper bound is based on a deterministic algorithm for learning from expert advice with apple tasting feedback, a problem interesting in its own right. For this problem, we show that the optimal deterministic mistake bound is $\Theta(\sqrt{T (k + \log n)})$ for all $k$ and $T \leq n \leq 2^T$, where $n$ is the number of experts. Our algorithm is a natural variation of the well-known exponential weights forecaster.
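For reference, the classical forecaster that the deterministic algorithm is described as varying is the exponential weights method for prediction with expert advice. The sketch below shows only the standard randomized, full-feedback version; it does not implement the paper's deterministic apple-tasting variant, where updates are possible only on rounds with predicted label $1$.

```python
# Classical randomized exponential weights forecaster (full feedback);
# shown as the baseline the abstract's algorithm builds on, not as the
# paper's method.
import numpy as np

rng = np.random.default_rng(2)

def exponential_weights(expert_preds, labels, eta):
    """expert_preds: (T, n) array of 0/1 expert advice; labels: (T,) truths."""
    T, n = expert_preds.shape
    log_w = np.zeros(n)                    # log-weights for numerical stability
    mistakes = 0
    for t in range(T):
        w = np.exp(log_w - log_w.max())
        p1 = w[expert_preds[t] == 1].sum() / w.sum()   # prob. of predicting 1
        pred = int(rng.random() < p1)
        mistakes += int(pred != labels[t])
        log_w -= eta * (expert_preds[t] != labels[t])  # full-information update
    return mistakes

T, n = 1000, 16
labels = rng.integers(0, 2, size=T)
experts = rng.integers(0, 2, size=(T, n))
experts[:, 0] = labels                     # plant one perfect expert
print("forecaster mistakes:", exponential_weights(experts, labels, eta=0.5))
```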
Session: 10B - Bandits and Causality (Thursday 03 July 16:12–18:00)
Authors: Whitehouse, Justin; Syrgkanis, Vasilis; Wilder, Bryan; Jung, Chris; Wu, Steven
Abstract:
Estimates of heterogeneous treatment effects such as conditional average treatment effects (CATEs) and conditional quantile treatment effects (CQTEs) play an important role in real-world decision making. Given this importance, one should ensure these estimators are calibrated. While there is a rich literature on calibrating estimators of non-causal parameters, very few methods have been derived for calibrating estimators of causal parameters, or more generally estimators of quantities involving nuisance parameters. In this work, we develop general algorithms for reducing the task of causal calibration to that of calibrating a standard (non-causal) predictive model. Throughout, we study a notion of calibration defined with respect to an arbitrary, nuisance-dependent loss $\ell$, under which we say an estimator $\theta$ is calibrated if its predictions cannot be changed on any level set to decrease loss. For losses $\ell$ satisfying a condition called \textit{universal orthogonality}, we present a simple algorithm that transforms partially-observed data into generalized pseudo-outcomes and applies any off-the-shelf calibration procedure. For losses $\ell$ satisfying a weaker assumption called conditional orthogonality, we provide a similar sample splitting algorithm that performs empirical risk minimization over an appropriately defined class of functions. Convergence of both algorithms follows from a generic, two-term upper bound on the calibration error of any model that decouples the error in estimating unknown nuisance parameters from the calibration error in a hypothetical world where the learned nuisances are true. We demonstrate the practical applicability of our results in experiments on both observational and synthetic data. Our results are exceedingly general, showing that essentially any existing calibration algorithm can be used in causal settings, with additional loss only arising from errors in nuisance estimation.
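One standard instantiation of the pseudo-outcome reduction is worth spelling out (a sketch under our own choices; the paper's generalized pseudo-outcomes cover a broader family of losses): transform the partially observed data with the doubly robust (AIPW) pseudo-outcome, whose conditional mean is the CATE whenever either nuisance is correct, and then calibrate the CATE estimator against these pseudo-outcomes with any off-the-shelf one-dimensional method, e.g. isotonic regression.

```python
# A minimal sketch of the pseudo-outcome reduction for CATE calibration,
# using the standard doubly robust (AIPW) transform; this is one
# illustrative instance, not the authors' general construction.
import numpy as np
from sklearn.isotonic import IsotonicRegression

def aipw_pseudo_outcomes(A, Y, mu0, mu1, e):
    """Doubly robust pseudo-outcome: its conditional mean given X equals
    the CATE whenever the outcome models or the propensity are correct."""
    return (mu1 - mu0
            + A * (Y - mu1) / e
            - (1 - A) * (Y - mu0) / (1 - e))

def calibrate_cate(cate_hat, A, Y, mu0, mu1, e):
    """Calibrate a CATE estimator's predictions against pseudo-outcomes
    with an off-the-shelf one-dimensional method (isotonic regression)."""
    phi = aipw_pseudo_outcomes(A, Y, mu0, mu1, e)
    iso = IsotonicRegression(out_of_bounds="clip")
    iso.fit(cate_hat, phi)
    return iso.predict(cate_hat)

# Toy usage with synthetic data and idealized (oracle) nuisances.
rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=n)
e = np.full(n, 0.5)
A = (rng.random(n) < e).astype(float)
tau = 1.0 + X                              # true CATE
Y = X + A * tau + rng.normal(size=n)
mu0, mu1 = X, X + tau                      # pretend-perfect outcome models
cate_hat = 0.5 * tau + 0.3                 # a deliberately miscalibrated estimator
print(calibrate_cate(cate_hat, A, Y, mu0, mu1, e)[:5])
```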
Session: 10B - Bandits and Causality (Thursday 03 July 16:12–18:00)
Authors: Hanneke, Steve; Shaeiri, Amirreza; Zhang, Qian
Abstract:
The seminal work of (Daniely et al., COLT 2011) introduced the problem of multiclass learning under bandit feedback and provided a combinatorial characterization of its learnability within the framework of PAC learning. In multiclass learning under bandit feedback, there is an unknown data distribution over an instance space $\mathcal{X}$ and a label space $\mathcal{Y}$ similar to classical multiclass learning, but the learner does not directly observe the correct labels of the i.i.d. training examples. Instead, during each round, the learner receives an example, makes a prediction for its label, and receives bandit feedback only indicating whether the prediction is correct. Despite this restriction, the goal remains the same as in classical multiclass learning. In the present work, we study the problem of multiclass learning under bandit feedback within the framework of \emph{universal learning} (Bousquet et al., STOC 2021). This makes it possible to study the behavior of learning curves. In the \emph{uniform learning} framework, no concept class $\mathcal{C}$ is learnable when the effective label space is unbounded. In contrast, and perhaps surprisingly, we demonstrate that the universal learnability of concept classes $\mathcal{C}$ gives rise to a rich theory even when the effective label space is unbounded. More concretely, our primary contribution is a theory that reveals an inherent trichotomy governing instance optimal learning curves in the realizable setting: the best achievable universal learning rate for any given concept class can only decay either at an \emph{exponential}, a \emph{linear}, or an \emph{arbitrarily slow} rate. In particular, the trichotomy is combinatorially characterized by the absence of an infinite multiclass Littlestone tree and the combination of an infinite Natarajan Littlestone tree and an infinite progressive Littlestone tree. Furthermore, we introduce novel learning algorithms for achieving instance optimal universal rates.
Session: 10B - Bandits and Causality (Thursday 03 July 16:12–18:00)
Authors: Maiti, Arnab; Fan, Zhiyuan; Farina, Gabriele; Jamieson, Kevin; Ratliff, Lillian
Abstract:
In this paper, we study the online shortest path problem in directed acyclic graphs (DAGs) under bandit feedback against an adaptive adversary. Given a DAG $G = (V, E)$ with a source node $v_{\mathsf{s}}$ and a sink node $v_{\mathsf{t}}$, let $\mathcal{X} \subseteq \{0,1\}^{|E|}$ denote the set of all paths from $v_{\mathsf{s}}$ to $v_{\mathsf{t}}$. At each round $t$, we select a path $\mathbf{x}_t \in \mathcal{X}$ and receive bandit feedback on our loss $\langle \mathbf{x}_t, \mathbf{y}_t \rangle \in [-1,1]$, where $\mathbf{y}_t$ is an adversarially chosen loss vector. Our goal is to minimize regret with respect to the best path in hindsight over $T$ rounds. We propose the first computationally efficient algorithm to achieve a near-minimax optimal regret bound of $\tilde{\mathcal{O}}(\sqrt{|E|T\log |\mathcal{X}|})$ with high probability against any adaptive adversary, where $\tilde{\mathcal{O}}(\cdot)$ hides logarithmic factors in the number of edges $|E|$. Our algorithm leverages a novel loss estimator and a centroid-based decomposition in a nontrivial manner to attain this regret bound. As an application, we show that our algorithm for DAGs provides state-of-the-art efficient algorithms for $m$-sets, extensive-form games, the Colonel Blotto game, shortest walks in directed graphs, hypercubes, and multi-task multi-armed bandits, achieving improved high-probability regret guarantees in all these settings.
Session: 10B - Bandits and Causality (Thursday 03 July 16:12–18:00)
Authors: Dai, Yan; Blanchard, Moise; Jaillet, Patrick
Abstract:
We study a repeated resource allocation problem with strategic agents where monetary transfers are disallowed and the central planner has no prior information on agents' utility distributions. In light of Arrow's impossibility theorem, acquiring information about agent preferences through some form of feedback is necessary. We assume that the central planner can request powerful but expensive audits on the winner in any round, revealing the true utility of the winner in that round. We design a mechanism achieving $T$-independent $\mathcal O(K^2)$ regret in social welfare while requesting $\mathcal O(K^3 \log T)$ audits in expectation, where $K$ is the number of agents and $T$ is the number of rounds. We also show an $\Omega(K)$ lower bound on the regret and an $\Omega(1)$ lower bound on the number of audits when having low regret. Algorithmically, we show that incentive-compatibility can be mostly enforced with an accurate estimation of the winning probability of each agent under truthful reporting. To do so, we impose future punishments and introduce a \emph{flagging} component, allowing agents to flag any biased estimate (we show that doing so aligns with individual incentives). On the technical side, without monetary transfers and distributional information, the central planner cannot ensure that truthful reporting is exactly an equilibrium. Instead, we characterize the equilibrium via a reduction to a simpler \emph{auxiliary game}, in which agents cannot strategize until late in the $T$ rounds of the allocation problem. The tools developed therein may be of independent interest for other mechanism design problems in which the revelation principle cannot be readily applied.
Session: 10B - Bandits and Causality (Thursday 03 July 16:12–18:00)
Authors: Cesa-Bianchi, Nicolo; Cesari, Tom; Colomboni, Roberto; Foscari, Luigi; Pathak, Vinayak
Abstract:
We consider a sequential decision-making setting where, at every round $t$, the learner (a market maker) posts a bid price $B_t$ and an ask price $A_t$ to an incoming trader (the taker) with a private valuation for some asset. If the trader's valuation is lower than the bid price, or higher than the ask price, then a trade (sell or buy) occurs. Letting $M_t$ be the market price (observed only at the end of round $t$), the maker's utility is $M_t-B_t$ if the maker bought the asset, it is $A_t-M_t$ if they sold it, and it is $0$ if no trade occurred. We characterize the maker's regret with respect to the best fixed choice of bid and ask pairs under a variety of assumptions (adversarial, i.i.d., and their variants) on the sequence of market prices and valuations. Our upper bound analysis unveils an intriguing connection relating market making to first-price auctions and dynamic pricing. Our main technical contribution is a lower bound for the i.i.d. case with Lipschitz distributions and independence between market prices and takers' valuations. The difficulty in the analysis stems from a unique relationship between the reward and feedback functions that allows learning algorithms to trade off reward for information in a continuous way.
Time: Friday 04 July 10:00–11:24
Session: 11A - Computational Complexity (Friday 04 July 10:00–11:24)
Authors: Li, Qian; Wang, Shuo; Zhang, Jiapeng
Abstract:
Space complexity in learning problems has received a lot of attention in recent years. In this direction, Brown, Bun, and Smith (COLT 2022) studied space complexity lower bounds for several natural learning problems under the \textit{one-pass streaming} setting. Assuming that the examples are sampled from $\{0,1\}^d$ and the optimal hypothesis can be encoded using $\kappa$ bits, they showed learning algorithms with constant error using a near-minimal number of examples, $\Tilde{O}(\kappa)$, require $\Tilde{\Omega}(d\kappa)$ bits of memory. Moreover, for a general number $N$ of examples, their memory lower bound takes the form $\Tilde{\Omega}(d\kappa\cdot \frac{\kappa}{N})$. However, as mentioned by Brown, Bun, and Smith (COLT 2022), the learning process often involves multiple passes over the data. Hence, it is equally important to study the space complexity in the \textit{multi-pass streaming} setting. The authors conjectured that similar lower bounds should apply but left it as an open problem. In this paper, we resolve this open problem by proving that any $L$-pass streaming algorithm using $N$ samples requires $\Tilde{\Omega}(d\kappa\cdot \frac{\kappa}{NL})$ bits of memory. Intuitively, our lower bound shows that a stream of $L\cdot N$ fresh examples is at least as useful as $L$ passes over $N$ examples. A key component of our approach is a lower bound on the information complexity of the \textsf{Bit-Bias$(p,q)$} problem in the multi-pass streaming setting, a basic problem that may have independent significance. In the \textsf{Bit-Bias$(p,q)$} problem, one sees a stream of $N$ i.i.d. random bits drawn from either \textsf{Bernoulli$(p)$} or \textsf{Bernoulli$(q)$}, and would like to distinguish the two cases. Our results not only extend the previous lower bound on \textsf{Bit-Bias$(0,1/2)$} by Brown, Bun, and Smith from the one-pass streaming setting to the more general multi-pass setting, but also cover more general values of $p$ and $q$.
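The \textsf{Bit-Bias$(p,q)$} task itself is easy to state and to solve in one pass with a single counter; the sketch below (ours, for intuition) does exactly that. The paper's contribution concerns how much memory such streaming strategies fundamentally need, which this sketch does not address.

```python
# The Bit-Bias(p, q) task from the abstract, solved in the obvious one-pass
# way: threshold the empirical mean at (p + q) / 2. By a Chernoff bound,
# N = O(1 / (p - q)^2) samples suffice for constant error.
import numpy as np

def bit_bias_decide(stream, p, q):
    """Return p or q, whichever Bernoulli parameter better explains the
    stream; uses only a single counter of O(log N) bits. Assumes p < q."""
    ones = sum(stream)                     # single pass, one counter
    return p if ones / len(stream) < (p + q) / 2 else q

rng = np.random.default_rng(3)
p, q, N = 0.45, 0.55, 2000
stream = (rng.random(N) < q).astype(int)   # truth: Bernoulli(q)
print("decided:", bit_bias_decide(stream, p, q))
```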
Session: 11A - Computational Complexity (Friday 04 July 10:00–11:24)
Authors: Kunisky, Dmitriy
Abstract:
We study when low coordinate degree functions (LCDF)---linear combinations of functions depending on small subsets of entries of a vector---can test for the presence of categorical structure, including community structure and generalizations thereof, in high-dimensional data. This complements recent results studying the power of LCDF in testing for continuous structure like real-valued signals corrupted by additive noise. We study a general form of stochastic block model (SBM), where a population is assigned random labels and every $p$-tuple generates an observation according to an arbitrary probability measure associated to the $p$ labels of its members. We show that the performance of LCDF admits a unified analysis for this class of models. As applications, we prove tight lower bounds against LCDF (and therefore also against low degree polynomials) for nearly arbitrary graph and regular hypergraph SBMs, always matching suitable generalizations of the Kesten-Stigum threshold. We also prove tight lower bounds for group synchronization and abelian group sumset problems under the "truth-or-Haar" noise model, and give an improved analysis of Gaussian multi-frequency group synchronization. In most of these models, for some parameter settings our lower bounds give new evidence for conjectural statistical-to-computational gaps. Finally, interpreting some of our findings, we propose a new analogy between categorical and continuous signals: a general SBM as above behaves qualitatively like a spiked $p^*$-tensor model of a certain order $p^*$ depending on the parameters of the SBM.
Session: 11A - Computational Complexity (Friday 04 July 10:00–11:24)
Authors: Huang, Han; Mossel, Elchanan
Abstract:
Broadcasting on trees is a fundamental model from statistical physics that plays an important role in information theory, noisy computation and phylogenetic reconstruction within computational biology and linguistics. While this model permits efficient linear-time algorithms for the inference of the root from the leaves, recent work suggests that non-trivial computational complexity may be required for inference. The inference of the root state can be performed using the celebrated Belief Propagation (BP) algorithm, which achieves Bayes-optimal performance. Although BP runs in linear time using real arithmetic operations, recent research indicates that it requires non-trivial computational complexity using more refined complexity measures. Moitra, Mossel, and Sandon demonstrated such complexity by constructing a Markov chain for which estimating the root better than random guessing (for typical inputs) is $NC^1$-complete. Kohler and Mossel constructed chains where, for trees with $N$ leaves, achieving better-than-random root recovery requires polynomials of degree $N^{\Omega(1)}$. The papers above raised the question of whether such complexity bounds hold generally below the celebrated Kesten-Stigum bound. In a recent work, Huang and Mossel established a general degree lower bound of $\Omega(\log N)$ below the Kesten-Stigum bound. Specifically, they proved that any function expressed as a linear combination of functions of at most $O(\log N)$ leaves has vanishing correlation with the root. In this work, we get an exponential improvement of this lower bound by establishing an $N^{\Omega(1)}$ degree lower bound, for any broadcast process in the whole regime below the Kesten-Stigum bound.
Session: 11A - Computational Complexity (Friday 04 July 10:00–11:24)
Authors: Chen, Xue; Zhou, Zhaienhe; Shu, Wenxuan
Abstract:
We consider sparse variants of the classical Learning Parities with random Noise (LPN) problem. Our main contribution is a new algorithmic framework that provides low-noise learning algorithms for both the Learning Sparse Parities (LSPN) problem and the sparse LPN problem. Unlike previous approaches to LSPN and sparse LPN, this framework has a simple structure that avoids fast matrix multiplication and tensor methods, so its algorithms are easy to implement and run in polynomial space. Let $n$ be the dimension, $k$ denote the sparsity, and $\eta$ be the noise rate. As a fundamental problem in computational learning theory, Learning Sparse Parities with Noise (LSPN) assumes the hidden parity is $k$-sparse. While a simple enumeration algorithm takes ${n \choose k}=O(n/k)^k$ time, previously known results still need ${n \choose k/2} = \Omega(n/k)^{k/2}$ time for any noise rate $\eta$. Our framework provides an LSPN algorithm that runs in time $O(\eta \cdot n/k)^k$ for any noise rate $\eta$, improving the state of the art for LSPN whenever $\eta \in (\sqrt{k/n},k/n)$. The sparse LPN problem is closely related to the classical problem of refuting random $k$-CSPs and has been widely used in cryptography as a hardness assumption. Unlike standard LPN, it samples random $k$-sparse vectors. Because the number of $k$-sparse vectors is ${n \choose k}
Session: 11A - Computational Complexity (Friday 04 July 10:00–11:24)
Authors: Bresler, Guy; Harbuzova, Alina
Abstract:
In this work, we show the first average-case reduction transforming the sparse Spiked Covariance Model into the sparse Spiked Wigner Model and as a consequence obtain the first computational equivalence result between two well-studied high-dimensional statistics models. Our approach leverages a new perturbation equivariance property for Gram-Schmidt orthogonalization, enabling removal of dependence in the noise while preserving the signal.
Session: 11A - Computational Complexity (Friday 04 July 10:00–11:24)
Authors: Li, Shuangping; Schramm, Tselil
Abstract:
We show that the shortest $s$-$t$ path problem has the overlap-gap property in (i) sparse $\mathbb{G}(n,p)$ graphs and (ii) complete graphs with i.i.d. Exponential edge weights. Furthermore, we demonstrate that in sparse $\mathbb{G}(n,p)$ graphs, shortest path is solved by $O(\log n)$-degree polynomial estimators, and a uniform approximate shortest path can be sampled in polynomial time. This constitutes the first example in which the overlap-gap property is not predictive of algorithmic intractability for a (non-algebraic) average-case optimization problem.
Time: Friday 04 July 10:00–11:24
Session: 11B - Concentration Inequalities (Friday 04 July 10:00–11:24)
Authors: Hanneke, Steve; Xu, Mingyue
Abstract:
The universal learning framework has been developed to obtain guarantees on learning rates that hold for any fixed distribution, which can be much faster than the rates that hold uniformly over all distributions. Given that the Empirical Risk Minimization (ERM) principle is fundamental in PAC theory and ubiquitous in practical machine learning, the recent work of Hanneke and Xu (2024) studied the universal rates of ERM for binary classification under the realizable setting. However, the assumption of realizability is too restrictive to hold in practice. Indeed, the majority of the literature on universal learning has focused on the realizable case, leaving the non-realizable case barely explored. In this paper, we consider the problem of universal learning by ERM for binary classification under the agnostic setting, where the ``learning curve'' reflects the decay of the excess risk as the sample size increases. We explore the possibilities of agnostic universal rates and reveal a compact trichotomy: there are three possible agnostic universal rates of ERM, being either $e^{-n}$, $o(n^{-1/2})$, or arbitrarily slow. We provide a complete characterization of which concept classes fall into each of these categories. Moreover, we also establish complete characterizations for the target-dependent universal rates as well as the Bayes-dependent universal rates.
Session: 11B - Concentration Inequalities (Friday 04 July 10:00–11:24)
Authors: Whitehouse, Justin; Ramdas, Aaditya; Wu, Steven
Abstract:
Self-normalized processes arise naturally in many learning-theoretic tasks. While self-normalized concentration has been extensively studied for scalar-valued processes, there is less work on multidimensional processes outside of the sub-Gaussian setting. In this work, we construct a general, self-normalized inequality for $\mathbb{R}^d$-valued processes that satisfy a simple yet broad ``sub-$\psi$'' tail condition, which generalizes assumptions based on cumulant generating functions. From this general inequality, we derive an upper law of the iterated logarithm for sub-$\psi$ vector-valued processes, which is tight up to small constants. We demonstrate applications in prototypical statistical tasks, such as parameter estimation in online linear regression and bounded mean estimation via a new (multivariate) empirical Bernstein concentration inequality.
Session: 11B - Concentration Inequalities (Friday 04 July 10:00–11:24)
Authors: Chornomaz, Bogdan; Moran, Shay; Waknine, Tom
Abstract:
We introduce and study the \emph{spherical dimension}, a natural topological relaxation of the VC dimension that unifies several results in learning theory where topology plays a key role in the proofs. The spherical dimension is defined by extending the set of realizable datasets (used to define the VC dimension) to the continuous space of realizable distributions. In this space, a shattered set of size $d$ (in the VC sense) is completed into a continuous object, specifically a $d$-dimensional sphere of realizable distributions. The spherical dimension is then defined as the dimension of the largest sphere in this space. Thus, the spherical dimension is at least the VC dimension. The spherical dimension serves as a common foundation for leveraging the Borsuk-Ulam theorem and related topological tools. We demonstrate the utility of the spherical dimension in diverse applications, including disambiguations of partial concept classes, reductions from classification to stochastic convex optimization, stability and replicability, and sample compression schemes. Perhaps surprisingly, we show that the open question posed by Alon, Hanneke, Holzman, and Moran (FOCS 2021) of whether there exist non-trivial disambiguations for halfspaces with margin is equivalent to the basic open question of whether the VC and spherical dimensions are finite together.
Session: 11B - Concentration Inequalities (Friday 04 July 10:00–11:24)
Authors: Akbari, Syed; Harrison-Trainor, Matthew
Abstract:
This paper is about the recent notion of computably probably approximately correct (PAC) learning, which lies between statistical learning theory, where there is no computational requirement on the learner, and efficient PAC learning, where the learner must be polynomially bounded. Examples have recently been given of hypothesis classes which are PAC learnable but not computably PAC learnable, but these hypothesis classes are unnatural or non-canonical in the sense that they depend on a numbering of proofs, formulas, or programs. We use the on-a-cone machinery from computability theory to prove that, under mild assumptions such as that the hypothesis class can be computably listed, any natural hypothesis class which is learnable must be computably learnable. Thus the counterexamples given previously are necessarily unnatural.
Session: 11B - Concentration Inequalities (Friday 04 July 10:00–11:24)
Authors: Bressan, Marco; Brukhim, Nataly; Cesa-Bianchi, Nicolo; Esposito, Emmanuel; Mansour, Yishay; Moran, Shay; Thiessen, Maximilian
Abstract:
In the multiclass PAC setting, even when full learnability is unattainable, meaningful information can often be extracted to guide predictions. However, classical learning theory has mainly focused on the dichotomy “learnable vs. non-learnable”, leaving notions of partial learnability largely unexplored. Indeed, even for a non-learnable class, a learner may still achieve partial success—for example, by making reliable predictions whenever the true label belongs to a fixed subset of the label space, even if it fails otherwise. Similarly, the rigid nature of PAC learnability makes it impossible to distinguish between classes where one can achieve favorable trade-offs between, say, false-positive and false-negative rates, and classes where such trade-offs are fundamentally unattainable. In a nutshell, standard PAC learnability precludes a fine-grained exploration of learnability. To overcome this limitation, we develop a fine-grained theory of PAC learnability. For any hypothesis class H, given a loss function (which quantifies the penalty for predicting ŷ instead of the true label y) and a target loss threshold z, our theory determines whether it is possible to achieve a loss of at most z. In contrast, classical PAC learning considers only the special case of the zero-one loss and z = 0, corresponding to a near perfect classification guarantee. We give a complete characterization of all attainable guarantees, captured by a finite family of combinatorial dimensions, which we term the J-cube dimensions of H. These dimensions are defined for every subset J of at least two labels. This extends the fundamental theorem of realizable PAC learning based on the VC dimension. In fact, our results hold in a more general multi-objective setting where we fully characterize the Pareto frontier of guarantees attainable for the class H.
Session: 11B - Concentration Inequalities (Friday 04 July 10:00–11:24)
Authors: Jafar, Sky; Asilis, Julian; Dughmi, Shaddin
Abstract:
We partly resolve an open question raised by Asilis et al. (2024): whether the algorithmic template of local regularization --- an intriguing generalization of explicit regularization, a.k.a. structural risk minimization --- suffices to learn all learnable multiclass problems. Specifically, we provide a negative answer to this question in the transductive model of learning. We exhibit a multiclass classification problem which is learnable in both the transductive and PAC models, yet cannot be learned transductively by any local regularizer. The corresponding hypothesis class, and our proof, are based on principles from cryptographic secret sharing. We outline challenges in extending our negative result to the PAC model, leaving open the tantalizing possibility of a PAC/transductive separation with respect to local regularization.
Time: Friday 04 July 14:30–15:30
Session: 12A - Online Algorithms (Friday 04 July 14:30–15:30)
Authors: Garber, Dan; Massalha, Mhna
Abstract:
We revisit Blackwell's celebrated approachability problem which considers a repeated vector-valued game between a player and an adversary. Motivated by settings in which the action set of the player or adversary (or both) is difficult to optimize over, for instance when it corresponds to the set of all possible solutions to some NP-Hard optimization problem, we ask what the player can guarantee \textit{efficiently} when having access to these sets only via approximation algorithms with ratios $\alpha_{\mathcal{X}} \geq 1$ and $1 \geq \alpha_{\mathcal{Y}} > 0$, respectively. Assuming the player has monotone preferences, i.e., that they do not prefer a vector-valued loss $\ell_1$ over $\ell_2$ if $\ell_2 \leq \ell_1$, we establish that given a Blackwell instance with an approachable target set $S$, the downward closure of the appropriately-scaled set $\alpha_{\mathcal{X}}\alpha_{\mathcal{Y}}^{-1}S$ is \textit{efficiently} approachable with optimal rate. In case only the player's or adversary's set is equipped with an approximation algorithm, we give simpler and more efficient algorithms.
Session: 12A - Online Algorithms (Friday 04 July 14:30–15:30)
Authors: Dann, Christoph; Mansour, Yishay; Mohri, Mehryar; Schneider, Jon; Sivan, Balasubramanian
Abstract:
Abernethy et al.\ (2011) showed that Blackwell approachability and no-regret learning are equivalent, in the sense that any algorithm that solves a specific Blackwell approachability instance can be converted to a sublinear regret algorithm for a specific no-regret learning instance, and vice versa. In this paper, we study a more fine-grained form of such reductions, and ask when this translation between problems preserves not only a sublinear rate of convergence, but also the optimal rate of convergence. That is, in which cases does it suffice to find the optimal regret bound for a no-regret learning instance in order to find the optimal rate of convergence for a corresponding approachability instance? We show that the reduction of Abernethy et al.\ (2011) does not preserve rates: their reduction may reduce a $d$-dimensional approachability instance $\mathcal{I}_1$ with optimal convergence rate $R_1$ to a no-regret learning instance $\mathcal{I}_2$ with optimal regret-per-round of $R_2$, with $R_{2}/R_{1}$ arbitrarily large (in particular, it is possible that $R_1 = 0$ and $R_{2} > 0$). On the other hand, we show that it is possible to tightly reduce any approachability instance to an instance of a generalized form of regret minimization we call \emph{improper $\phi$-regret minimization} (a variant of the $\phi$-regret minimization of Gordon et al.\ (2008)). Finally, we characterize when linear transformations suffice to reduce improper $\phi$-regret minimization problems to standard classes of regret minimization problems (such as external regret minimization and proper $\phi$-regret minimization) in a rate-preserving manner. We prove that some improper $\phi$-regret minimization instances cannot be reduced to either subclass of instance in this way, suggesting that approachability can capture some problems that cannot be easily phrased in the standard language of online learning.
Session: 12A - Online Algorithms (Friday 04 July 14:30–15:30)
Authors: Tal, Hadar; Sabag, Oron
Abstract:
We study the Online Bookmaking problem, where a bookmaker dynamically updates betting odds on the possible outcomes of an event. In each betting round, the bookmaker has the opportunity to adjust odds based on the cumulative betting behavior, with the aim of mitigating risk. In a worst-case setting, defined by an arbitrary sequence of bets and an arbitrary event outcome, we show that the bookmaker’s optimal loss is characterized as the largest root of a simple polynomial for any number of betting rounds and any number of possible outcomes. Our solution shows that bookmakers can be as fair as desired while still guaranteeing a gain, and the explicit characterization reveals an intriguing relation between the bookmaker’s regret and Hermite polynomials. We also develop an efficient algorithm that computes the optimal bookmaking strategy: when facing an optimal gambler, the algorithm achieves the optimal loss, and in rounds where the gambler is suboptimal, it reduces the achieved loss to the opportunistic optimal loss, a notion that is related to subgame perfect Nash equilibrium. The key technical contribution to achieve these results is an explicit characterization of the Bellman-Pareto Frontier, which unifies the dynamic programming updates for Bellman’s value function with the multi-criteria optimization framework of the Pareto frontier in the context of vector repeated games.
Session: 12A - Online Algorithms (Friday 04 July 14:30–15:30)
Authors: Ryabchenko, Alexander; Attias, Idan; Roy, Daniel
Abstract:
We study online learning with oblivious losses and delays under a novel ``capacity constraint'' that limits how many past rounds can be tracked simultaneously for delayed feedback. Under ``clairvoyance'' (i.e., delays are revealed immediately) and/or ``preemptibility'' (i.e., we may stop tracking a loss we initially chose to track), we establish matching upper and lower bounds (up to logarithmic terms) on achievable regret, characterizing the ``optimal capacity'' needed to match the minimax rates of standard delayed learning, which implicitly assume unlimited capacity. Our algorithms achieve minimax-optimal regret across all capacity levels, with performance gracefully degrading under suboptimal capacity. For $K$ actions and total delay $D$ over $T$ rounds, under the clairvoyant setting and assuming capacity $C = \Omega(\log(T))$, we achieve regret $\widetilde{\Theta}(\sqrt{TK + D\log(K) + DK/C})$ for bandits and $\widetilde{\Theta}(\sqrt{(D+T)\log(K)})$ for full-information feedback. Replacing clairvoyance with preemptibility requires a known delay bound $d_{\max}$, adding $\smash{\widetilde{O}(d_{\max})}$ to the regret. For fixed delays $d$, i.e., where $D = Td$, minimax regret is $\Theta(\sqrt{TK(1+d/C) + Td\log(K)})$ for bandits and $\Theta(\sqrt{T(d+1)\log(K)})$ for full-information. The optimal capacity is precisely $\Theta(\min\{K/\log(K), d\})$ for bandits and $\Theta(1)$ for full-information. Our upper bounds are achieved using novel preemptive and non-preemptive schedulers based on Pareto-distributed proxy delays for round-dependent delays, and a batching technique for fixed delays. Crucially, our work unifies delayed bandits, label-efficient sampling, and scheduling theory, demonstrating that robust online learning under delays is possible with surprisingly modest tracking capacity.
Session: 12A - Online Algorithms (Friday 04 July 14:30–15:30)
Authors: JIN, JIKAI; Syrgkanis, Vasilis
Abstract:
Average treatment effect estimation is the most central problem in causal inference, with applications across numerous disciplines. While many estimation strategies have been proposed in the literature, the statistical optimality of these methods has remained an open area of investigation, especially in regimes where these methods do not achieve parametric rates. In this paper, we adopt the recently introduced structure-agnostic framework of statistical lower bounds, which poses no structural properties on the nuisance functions other than access to black-box estimators that achieve some statistical estimation rate. This framework is particularly appealing when one is only willing to consider estimation strategies that use non-parametric regression and classification oracles as black-box sub-processes. Within this framework, we prove the statistical optimality of the celebrated and widely used doubly robust estimators for both the Average Treatment Effect (ATE) and the Average Treatment Effect on the Treated (ATT), as well as weighted variants of the former, which arise in policy evaluation.
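For concreteness, the doubly robust (AIPW) estimator of the ATE whose optimality is established has the following standard form (notation ours): with black-box, typically cross-fitted, outcome regressions $\widehat{\mu}_0, \widehat{\mu}_1$ and propensity estimate $\widehat{e}$,

```latex
\[
\widehat{\tau}_{\mathrm{DR}}
  = \frac{1}{n} \sum_{i=1}^{n}
    \left[ \widehat{\mu}_1(X_i) - \widehat{\mu}_0(X_i)
    + \frac{A_i \left( Y_i - \widehat{\mu}_1(X_i) \right)}{\widehat{e}(X_i)}
    - \frac{(1 - A_i) \left( Y_i - \widehat{\mu}_0(X_i) \right)}{1 - \widehat{e}(X_i)}
    \right].
\]
```

It remains consistent if either set of nuisance estimates is consistent, hence ``doubly robust''; the abstract's result says that, in the structure-agnostic sense, no estimator built on the same black-box oracles can do better.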
Time: Friday 04 July 14:30–15:30
Session: 12B - Quantum (Friday 04 July 14:30–15:30)
Authors: Chen, Sitan; de dios Pont, Jaume; Hsieh, Jun-Ting; Huang, Hsin-Yuan; Lange, Jane; Li, Jerry
Abstract:
We investigate the problem of predicting the output behavior of unknown quantum channels. Given query access to an $n$-qubit channel $\mathcal{E}$ and an observable $\mathcal{O}$, we aim to learn the mapping \begin{equation*} \rho \mapsto \operatorname{tr}(\mathcal{O} \mathcal{E}[\rho]) \end{equation*} to within a small error for most $\rho$ sampled from a distribution $\mathcal{D}$. Previously, Huang, Chen, and Preskill proved a surprising result that even if $\mathcal{E}$ is arbitrary, this task can be solved in time roughly $n^{O(\log(1/\epsilon))}$, where $\epsilon$ is the target prediction error. However, their guarantee applied only to input distributions $\mathcal{D}$ invariant under all single-qubit Clifford gates, and their algorithm fails for important cases such as general product distributions over product states $\rho$. In this work, we propose a new approach that achieves accurate prediction over essentially any product distribution $\mathcal{D}$, provided it is not "classical" in which case there is a trivial exponential lower bound. Our method employs a "biased Pauli analysis", analogous to classical biased Fourier analysis. Implementing this approach requires overcoming several challenges unique to the quantum setting, including the lack of a basis with appropriate orthogonality properties. The techniques we develop to address these issues may have broader applications in quantum information.
Session: 12B - Quantum (Friday 04 July 14:30–15:30)
Authors: Chen, Kean; Wang, Qisheng
Abstract:
As often emerges in various basic quantum properties such as entropy, the trace of quantum state powers $\operatorname{tr}(\rho^q)$ has attracted a lot of attention. The recent work of Liu and Wang (SODA 2025) showed that $\operatorname{tr}(\rho^q)$ can be estimated to within additive error $\varepsilon$ with a dimension-independent sample complexity of $\widetilde O(1/\varepsilon^{3+\frac{2}{q-1}})$ for any constant $q > 1$, where only an $\Omega(1/\varepsilon)$ lower bound was given. In this paper, we significantly improve the sample complexity of estimating $\operatorname{tr}(\rho^q)$ in both the upper and lower bounds. In particular: - For $q > 2$, we settle the sample complexity with matching upper and lower bounds $\widetilde \Theta(1/\varepsilon^2)$. - For $1 < q < 2$, we provide an upper bound $\widetilde O(1/\varepsilon^{\frac{2}{q-1}})$, with a lower bound $\Omega(1/\varepsilon^{\max\{\frac{1}{q-1}, 2\}})$ for dimension-independent estimators, implying there is only room for a quadratic improvement. Our upper bounds are obtained by (non-plug-in) quantum estimators based on weak Schur sampling, in sharp contrast to the prior approach based on quantum singular value transformation and samplizer.
Session: 12B - Quantum (Friday 04 July 14:30–15:30)
Authors: Narayanan, Shyam
Abstract:
We give an improved algorithm for learning a quantum Hamiltonian given copies of its Gibbs state, that can succeed at any temperature. Specifically, we improve over the work of Bakshi, Liu, Moitra, and Tang (2024), by reducing the sample complexity and runtime dependence to singly exponential in the inverse-temperature parameter, as opposed to doubly exponential. Our main technical contribution is a new flat polynomial approximation to the exponential function, with significantly lower degree than the flat polynomial approximation used in Bakshi et al.
Session: 12B - Quantum (Friday 04 July 14:30–15:30)
Authors: Vasconcelos, Francisca; Huang, Hsin-Yuan
Abstract:
The seminal work of [LMN93] established a cornerstone result for classical complexity, with profound implications for learning theory. By proving low-degree Fourier concentration of AC^0, the work demonstrated that Boolean functions computed by constant-depth circuits can be efficiently PAC-learned via low-degree Fourier sampling. This breakthrough provided the first sample- and time-efficient (quasi-polynomial) algorithm for learning AC^0. Proposed by [Moore99] as a natural quantum analog of AC^0, QAC^0 is the class of constant-depth quantum circuits composed of arbitrary single-qubit gates and polynomially many CZ gates of unbounded width. In this work, we present the first algorithm for efficient average-case learning of QAC^0 circuits with logarithmic ancilla. Namely, our algorithm achieves quasi-polynomial sample- and time-complexity for learning unknown QAC^0 unitaries to inverse-polynomially small error. We further show that these learned unitaries can be efficiently synthesized via poly-logarithmic depth circuits, making progress towards proper learning of QAC^0.
Time: Friday 04 July 16:00–17:36
Session: 13A - Algorithms (Friday 04 July 16:00–17:36)
Authors: Yang, Yichun; Li, Rong-Hua; Liao, Meihao; Wang, Guoren
Abstract:
Effective Resistance (ER) is a fundamental tool in various graph learning tasks. In this paper, we address the problem of efficiently approximating ER on a graph $\mathcal{G}=(\mathcal{V},\mathcal{E})$ with $n$ vertices and $m$ edges. First, we focus on local online-computation algorithms for ER approximation, aiming to improve the dependency on the approximation error parameter $\epsilon$. Specifically, for a given vertex pair $(s,t)$, we propose a local algorithm with a time complexity of $\tilde{O}(\sqrt{d}/\epsilon)$ to compute an $\epsilon$-approximation of the $s,t$-ER value for expander graphs, where $d=\min \{d_s,d_t\}$. This improves upon the previous state-of-the-art, including an $\tilde{O}(1/\epsilon^2)$ time algorithm based on random walk sampling by Andoni et al. (ITCS'19) and Peng et al. (KDD'21). Our method achieves this improvement by combining deterministic search with random walk sampling to reduce variance. Second, we establish a lower bound for ER approximation on expander graphs. We prove that for any $\epsilon\in (0,1)$, there exist an expander graph and a vertex pair $(s,t)$ such that any local algorithm requires at least $\Omega(1/\epsilon)$ time to compute the $\epsilon$-approximation of the $s,t$-ER value. Finally, we extend our techniques to index-based algorithms for ER computation. We propose an algorithm with $\tilde{O}(\min \{m+n/\epsilon^{1.5},\sqrt{nm}/\epsilon\})$ processing time, $\tilde{O}(n/\epsilon)$ space complexity and $O(1)$ query complexity, which returns an $\epsilon$-approximation of the $s,t$-ER value for any $s,t\in \mathcal{V}$ for expander graphs. Our approach improves upon the state-of-the-art $\tilde{O}(m/\epsilon)$ processing time by Dwaraknath et al. (NeurIPS'24) and the $\tilde{O}(m+n/\epsilon^2)$ processing time by Li and Sachdeva (SODA'23).
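As a reference point for the quantity being approximated (ours, for checking; the paper's algorithms avoid exactly this cubic-time computation), the $s,t$-ER value has the classical closed form $(e_s - e_t)^\top L^{+} (e_s - e_t)$ in terms of the Laplacian pseudoinverse:

```python
# Brute-force reference (not the paper's local algorithm): the s,t-ER value
# equals (e_s - e_t)^T L^+ (e_s - e_t), where L is the graph Laplacian and
# L^+ its Moore-Penrose pseudoinverse. Cubic time, but handy for checking
# fast approximations on small graphs.
import numpy as np

def effective_resistance(n, edges, s, t):
    L = np.zeros((n, n))
    for u, v in edges:
        L[u, u] += 1.0
        L[v, v] += 1.0
        L[u, v] -= 1.0
        L[v, u] -= 1.0
    Lpinv = np.linalg.pinv(L)              # O(n^3); reference only
    chi = np.zeros(n)
    chi[s], chi[t] = 1.0, -1.0
    return float(chi @ Lpinv @ chi)

# Sanity check: two unit-resistance edges in series give ER(0, 2) = 2.
print(effective_resistance(3, [(0, 1), (1, 2)], 0, 2))  # ~2.0
```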
Session: 13A - Algorithms (Friday 04 July 16:00–17:36)
Authors: Bakshi, Ainesh; Cohen-Addad, Vincent; Hopkins, Sam; Jayaram, Rajesh; Lattanzi, Silvio
Abstract:
Metric embeddings are a widely used method in algorithm design, where generally a ``complex'' metric is embedded into a simpler, lower-dimensional one. Historically, the theoretical computer science community has focused on bi-Lipschitz embeddings, which guarantee that every pairwise distance is approximately preserved. In contrast, alternative embedding objectives that avoid bi-Lipschitz distortion are commonly used in practice to map points to lower dimensions, yet these approaches have received comparatively less study in theory. In this paper, we focus on one such objective, Multi-dimensional Scaling (MDS), which embeds an $n$-point metric into low-dimensional Euclidean space. MDS is widely used as a data visualization tool in the social and biological sciences, statistics, and machine learning. Given a set of non-negative dissimilarities $\{d_{i,j}\}_{i,j \in [n]}$ over $n$ points (which may or may not form a metric), the goal is to find an embedding $\{x_1,\dots,x_n\} \subset \mathbb{R}^k$ that minimizes \[ \mathrm{OPT} = \min_{x} \mathop{\mathbb{E}}_{i,j \in [n]} \left[ \left(1-\frac{\|x_i - x_j\|}{d_{i,j}}\right)^2 \right]. \] Despite its popularity, our theoretical understanding of MDS is extremely limited. Recently, Demaine, Hesterberg, Koehler, Lynch, and Urschel gave the first approximation algorithm with provable guarantees for this objective, which achieves an embedding in constant-dimensional Euclidean space with cost $\mathrm{OPT} + \epsilon$ in $n^2 \cdot 2^{\mathrm{poly}(\Delta/\epsilon)}$ time, where $\Delta$ is the aspect ratio of the input dissimilarities. For metrics that admit low-cost embeddings, the aspect ratio $\Delta$ scales polynomially in $n$. In this work, we give the first approximation algorithm for MDS with quasi-polynomial dependency on $\Delta$: for constant-dimensional Euclidean space, we achieve a solution with cost $\tilde{\mathcal{O}}(\log \Delta) \cdot \mathrm{OPT}^{\Omega(1)} + \epsilon$ in time $n^{\mathcal{O}(1)} \cdot 2^{\mathrm{poly}(\log(\Delta)/\epsilon)}$.
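The MDS objective above can at least be locally minimized by plain gradient descent, which is a useful way to build intuition for it. The toy sketch below is ours and carries none of the paper's approximation guarantees; their algorithm is entirely different.

```python
# Toy local optimizer for the MDS objective: gradient descent on the mean
# of (1 - ||x_i - x_j|| / d_ij)^2 over pairs i < j. Illustration only.
import numpy as np

def mds_gradient_descent(D, k=2, steps=800, lr=0.3, seed=0):
    n = D.shape[0]
    X = np.random.default_rng(seed).normal(size=(n, k))
    n_pairs = n * (n - 1) / 2
    for _ in range(steps):
        grad = np.zeros_like(X)
        for i in range(n):
            for j in range(i + 1, n):
                diff = X[i] - X[j]
                r = np.linalg.norm(diff) + 1e-12
                # gradient of (1 - r / d_ij)^2 with respect to x_i
                g = -2 * (1 - r / D[i, j]) / D[i, j] * diff / r
                grad[i] += g
                grad[j] -= g
        X -= lr * grad / n_pairs
    return X

def stress(X, D):
    n = X.shape[0]
    return float(np.mean([(1 - np.linalg.norm(X[i] - X[j]) / D[i, j]) ** 2
                          for i in range(n) for j in range(i + 1, n)]))

# Dissimilarities of 8 points on a circle embed (near-)perfectly in the plane.
n = 8
angles = 2 * np.pi * np.arange(n) / n
P = np.c_[np.cos(angles), np.sin(angles)]
D = np.linalg.norm(P[:, None] - P[None, :], axis=-1) + np.eye(n)  # diagonal unused
X = mds_gradient_descent(D, k=2)
print("final stress:", stress(X, D))
```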
Session: 13A - Algorithms (Friday 04 July 16:00–17:36)
Authors: Gu, Yuzhou; Li, Xin; Xu, Yinzhan
Abstract:
In the noisy query model, the (binary) return value of every query (possibly repeated) is independently flipped with some fixed probability $p \in (0, 1/2)$. In this paper, we obtain tight bounds on the noisy query complexity of several fundamental problems. Our first contribution is to show that any Boolean function with total influence $\Omega(n)$ has noisy query complexity $\Theta(n\log n)$. Previous works often focus on specific problems, and it is of great interest to have a characterization of noisy query complexity for general functions. Our result is the first noisy query complexity lower bound of this generality, beyond what was known for random Boolean functions (Reischuk and Schmeltz, FOCS 1991). Our second contribution is to prove that Graph Connectivity has noisy query complexity $\Theta(n^2 \log n)$. In this problem, the goal is to determine whether an undirected graph is connected, where each query asks for the existence of an edge in the graph. A simple algorithm can solve the problem with error probability $o(1)$ using $O(n^2 \log n)$ noisy queries, but no non-trivial lower bounds were known prior to this work. Last but not least, we determine the exact number of noisy queries (up to lower order terms) needed to solve the $k$-Threshold problem and the Counting problem. The $k$-Threshold problem asks to decide whether there are at least $k$ ones among $n$ bits, given noisy query access to the bits. We prove that $(1\pm o(1)) \frac{n\log (\min\{k,n-k+1\}/\delta)}{(1-2p)\log \frac{1-p}p}$ queries are both sufficient and necessary to achieve error probability $\delta = o(1)$. Previously, such a result was only known when $\min\{k,n-k+1\}=o(n)$ (Wang, Ghaddar, Zhu and Wang, arXiv 2024). We also show a similar $(1\pm o(1)) \frac{n\log (\min\{k+1,n-k+1\}/\delta)}{(1-2p)\log \frac{1-p}p}$ bound for the Counting problem, where one needs to count the number of ones among $n$ bits given noisy query access and $k$ denotes the answer.
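As a point of reference for the $\Theta(n\log n)$-type bounds, the standard repetition baseline reads each bit by a majority vote over $O(\log(n/\delta)/(1-2p)^2)$ noisy queries; the paper's contribution is pinning down the exact constants, which this sketch does not attempt. Names and the crude Chernoff-style repetition count are illustrative.

```python
import random
from math import ceil, log

def noisy_query(bit, p, rng):
    # Each query's answer is flipped independently with probability p.
    return bit ^ (rng.random() < p)

def denoised_read(bit, p, delta, rng):
    # Majority vote drives the per-bit error below delta (conservative count).
    reps = ceil(8 * log(1 / delta) / (1 - 2 * p) ** 2) | 1   # force odd
    votes = sum(noisy_query(bit, p, rng) for _ in range(reps))
    return int(2 * votes > reps)

def k_threshold(bits, k, p, delta=0.01, seed=0):
    # Decide whether at least k of the n bits are one; a union bound over
    # the n reads gives overall error at most delta, using O(n log(n/delta))
    # noisy queries in total.
    rng = random.Random(seed)
    reads = (denoised_read(b, p, delta / len(bits), rng) for b in bits)
    return sum(reads) >= k
```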
Session: 13A - Algorithms (Friday 04 July 16:00–17:36)
Authors: Pilliat, Emmanuel
Abstract:
Crowdsourcing involves aggregating meaningful information from partial and noisy data provided by a pool of $n$ workers across $d$ tasks. Traditional models, such as the Dawid-Skene model, assume that workers' abilities are independent of tasks, limiting their applicability in real-world scenarios where worker ability often varies significantly across tasks. Recent advances have proposed permutation-based models, which relax these assumptions by imposing only isotonicity constraints on worker abilities. In this work, we study a permutation-based model where each worker $i$ has an ability $M_{ik}$ to recover a binary label $x_k^*\in\{-1,1\}$ for task $k$. The ability matrix $M$ is assumed to be isotonic up to a permutation of its rows, and only a fraction $\lambda$ of the worker-task pairs is observed. We focus on three primary objectives: recovering the true labels, ranking the workers, and estimating the ability matrix $M$. We introduce a polynomial-time and minimax optimal procedure to recover the labels, contradicting a conjecture in the literature regarding the existence of a statistical-computational gap for this problem. Additionally, building on the literature on ranking, we further introduce a polynomial-time procedure to rank the workers and to estimate their abilities. Notably, we show that ranking the workers or estimating their abilities is no harder when the true labels are unknown than when they are known, within the main regimes of interest in the isotonic model.
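A useful baseline to keep in mind is plain majority voting per task, which ignores both the permutation structure and the differing worker abilities that the paper's procedure exploits; the array names below are illustrative.

```python
import numpy as np

def majority_vote_labels(Y, observed):
    # Y[i, k] in {-1, +1} is worker i's answer on task k; observed[i, k]
    # marks the lambda-fraction of worker-task pairs actually seen.
    votes = np.where(observed, Y, 0).sum(axis=0)
    return np.where(votes >= 0, 1, -1)   # ties broken toward +1
```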
Session: 13A - Algorithms (Friday 04 July 16:00–17:36)
Authors: Brukhim, Nataly; Pacchiano, Aldo; Dudik, Miroslav; Schapire, Robert
Abstract:
We study the task of bandit learning, also known as best-arm identification, under the assumption that the true reward function $f$ belongs to a known, but arbitrary, function class $F$. While many instances of this problem are well understood, we seek a general theory of bandit learnability, akin to the PAC model for classification. Our investigation is guided by the following two fundamental questions: (1) which classes $F$ are learnable, and (2) how they are learnable. For example, in the case of binary PAC classification, learnability is fully determined by a combinatorial dimension, i.e., the VC dimension, and can be attained via a simple algorithmic principle, i.e., Empirical Risk Minimization (ERM). In contrast to classic learning-theoretic results, our findings reveal fundamental limitations to learning in structured bandits, offering new insights into the boundaries of bandit learnability. First, for the question of ``which'', we show that the paradigm of identifying the learnable via a dimension-like quantity fails for bandit learning. We give a simple proof demonstrating that no combinatorial dimension can characterize bandit learnability, even in finite classes, following a standard definition of dimension introduced by Ben-David et al. (2019). For the question of ``how'', we prove a computational hardness result: we construct a reward function class for which at most two queries are needed to find the optimal action, yet no algorithm can do so in polynomial time, unless RP=NP. Perhaps surprisingly, we also prove that this class admits efficient algorithms for standard (yet possibly hard) algorithmic operations often considered in learning theory, such as ERM. This implies that the computational hardness is in this case inherent to the task of bandit learning. Beyond these results, we investigate additional themes such as learning under noise, trade-offs between noise models, and the relationship between query complexity and regret minimization.
Session: 13A - Algorithms (Friday 04 July 16:00–17:36)
Authors: Hintze, Lukas; Krieg, Lena; Scheftelowitsch, Olga; Zhu, Haodong
Abstract:
In group testing, the task is to identify defective items by testing groups of them together using as few tests as possible. We consider the setting where each item is defective with a constant probability $\alpha$, independent of all other items. In the (over-)idealized noiseless setting, tests are positive exactly if any of the tested items are defective. We study a more realistic model in which observed test results are subject to noise, i.e., tests can display false positive or false negative results with constant positive probabilities. We determine precise constants $c$ such that $cn\log n$ tests are required to recover the infection status of every individual for both adaptive and non-adaptive group testing: in the former, the selection of groups to test can depend on previously observed test results, whereas it cannot in the latter. Additionally, for both settings, we provide efficient algorithms that identify all defective items with the optimal number of tests with high probability. Thus, we completely solve the problem of binary noisy group testing in the studied setting.
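The noise model itself is easy to state in code. Below is a toy simulator together with a simple separate-decoding heuristic (flag an item when the tests containing it come back positive unusually often); the heuristic and all names are illustrative and are not the optimal adaptive/non-adaptive schemes of the paper.

```python
import numpy as np

def run_noisy_tests(pools, defective, fp, fn, rng):
    # pools: (T, n) 0/1 design matrix; a test is truly positive iff its pool
    # contains a defective item, and the observed outcome is then flipped
    # with probability fp (false positive) or fn (false negative).
    truth = (pools @ defective) > 0
    flip_prob = np.where(truth, fn, fp)
    return truth ^ (rng.random(len(truth)) < flip_prob)

def separate_decode(pools, results, threshold):
    # Flag item i when the positive rate among tests containing i exceeds
    # the threshold.
    contains = pools.astype(bool)
    rates = np.array([results[contains[:, i]].mean() if contains[:, i].any()
                      else 0.0 for i in range(pools.shape[1])])
    return rates > threshold
```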
Session: 13A - Algorithms (Friday 04 July 16:00–17:36)
Authors: Garg, Sachin; Derezinski, Michal
Abstract:
The Nyström method is a popular low-rank approximation technique for large matrices that arise in kernel methods and convex optimization. Yet, when the data exhibits heavy-tailed spectral decay, the effective dimension of the problem often becomes so large that even the Nyström method may be outside of our computational budget. To address this, we propose Block-Nyström, an algorithm that injects a block-diagonal structure into the Nyström method, thereby significantly reducing its computational cost while recovering strong approximation guarantees. We show that Block-Nyström improves the computational complexity of kernel ridge regression for statistical learning over Hilbert spaces, and it can be used to construct more efficient preconditioners for second-order optimization. Our key technical insight is that, within the same computational budget, combining several smaller Nyström approximations leads to stronger tail estimates of the input spectrum than using one larger approximation. Along the way, we provide a novel recursive preconditioning scheme for efficiently inverting the Block-Nyström matrix, and provide new statistical learning bounds for a broad class of approximate kernel ridge regression solvers.
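For context, the classical single-block Nyström factorization that Block-Nyström builds on takes only a few lines; the block-diagonal combination and the recursive preconditioner are the paper's contributions and are not reproduced here. Function names are illustrative.

```python
import numpy as np

def nystrom_factors(K, landmarks, reg=1e-10):
    # Classical Nystrom: K ~= C @ pinv(W) @ C.T, with C = K[:, landmarks]
    # and W the landmark-landmark block; reg stabilizes the pseudoinverse.
    C = K[:, landmarks]
    W = K[np.ix_(landmarks, landmarks)]
    W_pinv = np.linalg.pinv(W + reg * np.eye(len(landmarks)))
    return C, W_pinv

# Usage sketch for a PSD kernel matrix K and landmark indices idx:
#   C, W_pinv = nystrom_factors(K, idx)
#   K_hat = C @ W_pinv @ C.T
```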
Session: 13A - Algorithms (Friday 04 July 16:00–17:36)
Authors: Tsimpos, Panagiotis; Ren, Zhi; Zech, Jakob; Marzouk, Youssef
Abstract:
Flow-based methods for sampling and generative modeling use continuous-time dynamical systems to represent a \emph{transport map} that pushes forward a source measure to a target measure. The introduction of a time axis provides considerable design freedom, and a central question is how to exploit this freedom. Though many popular methods seek straight-line (i.e., zero acceleration) trajectories, we show here that a specific class of ``curved'' trajectories can significantly improve approximation and learning. In particular, we consider the unit-time interpolation of any given transport map $T$ and seek the schedule $\tau: [0,1] \to [0,1]$ that minimizes the spatial Lipschitz constant of the corresponding velocity field over all times $t \in [0,1]$. This quantity is crucial as it allows for control of the approximation error when the velocity field is learned from data. We show that, for a broad class of source/target measures and transport maps $T$, the \emph{optimal schedule} can be computed in closed form, and that the resulting optimal Lipschitz constant is \emph{exponentially smaller} than that induced by an identity schedule (corresponding to, for instance, the Wasserstein geodesic). Our proof technique relies on the calculus of variations and the notion of $\Gamma$-convergence, allowing us to approximate the aforementioned degenerate objective by a family of smooth, tractable problems.
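To see where the schedule enters, consider the reparametrized linear interpolation of a transport map $T$ (a standard computation, unpacked here for intuition rather than quoted from the paper):

\[ F_t(x) = \bigl(1-\tau(t)\bigr)\,x + \tau(t)\,T(x), \qquad v_t(y) = \partial_t F_t\bigl(F_t^{-1}(y)\bigr) = \tau'(t)\,(T-\mathrm{Id})\bigl(F_t^{-1}(y)\bigr), \]

so the spatial Lipschitz constant of $v_t$ is at most $|\tau'(t)|\cdot\mathrm{Lip}(T-\mathrm{Id})\cdot\mathrm{Lip}(F_t^{-1})$: the schedule trades the temporal factor $\tau'(t)$ against the spatial factor $\mathrm{Lip}(F_t^{-1})$, and the identity schedule $\tau(t)=t$ need not balance the two well.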
Time: Friday 04 July 16:00–17:36
Session: 13B - Testing (Friday 04 July 16:00–17:36)
Authors: Seyfried, Jan; Sen, Sayantan; Tomamichel, Marco
Abstract:
We investigate the sample complexity of mutual information and conditional mutual information testing. For conditional mutual information testing, given access to independent samples of a triple of random variables $(A, B, C)$ with unknown distribution, we want to distinguish between two cases (i) $A$ and $C$ are conditionally independent, i.e., $I(A\!:\!C|B) = 0$, and (ii) $A$ and $C$ are conditionally dependent, i.e., $I(A\!:\!C|B) \geq \eps$ for some threshold $\eps$. We establish an upper bound on the number of samples required to distinguish between the two cases with high confidence, as a function of $\eps$ and the three alphabet sizes. We conjecture that our bound is tight and show that this is indeed the case in several parameter regimes. For the special case of mutual information testing (when $B$ is trivial), we establish the necessary and sufficient number of samples required up to polylogarithmic terms. Our technical contributions include a novel method to efficiently simulate weakly correlated samples from the conditionally independent distribution $P_{A|B} P_{C|B} P_B$ given access to samples from an unknown distribution $P_{ABC}$, and a new estimator for equivalence testing that can handle such correlated samples, which might be of independent interest.
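For the unconditional special case, the natural plug-in test is easy to state; the sketch below is a naive baseline (its sample complexity is not the paper's optimal bound, and the paper's estimator must additionally handle weakly correlated simulated samples). Names and the $\eps/2$ threshold are illustrative.

```python
import numpy as np

def plug_in_mi(a, c):
    # Plug-in estimate of I(A:C) in nats from paired integer samples.
    a, c = np.asarray(a), np.asarray(c)
    joint = np.zeros((a.max() + 1, c.max() + 1))
    np.add.at(joint, (a, c), 1.0)
    joint /= joint.sum()
    pa = joint.sum(axis=1, keepdims=True)
    pc = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float((joint[mask] * np.log(joint[mask] / (pa @ pc)[mask])).sum())

def mi_test(a, c, eps):
    # Declare "dependent" when the estimate clears half the threshold.
    return plug_in_mi(a, c) >= eps / 2
```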
Session: 13B - Testing (Friday 04 July 16:00–17:36)
Authors: Servedio, Rocco; Chen, Xi; Pires, William; Pitassi, Toniann
Abstract:
This paper considers the junta testing problem in a recently introduced ``relative error'' variant of the standard Boolean function property testing model. In relative-error testing we measure the distance from $f$ to $g$, where $f,g: \zo^n \to \zo$, by the ratio of $|f^{-1}(1) \triangle g^{-1}(1)|$ (the number of inputs on which $f$ and $g$ disagree) to $|f^{-1}(1)|$ (the number of satisfying assignments of $f$), and we give the testing algorithm both black-box access to $f$ and also access to independent uniform samples from $f^{-1}(1)$. \cite{CDHLNSY2024} observed that the class of $k$-juntas is $\poly(2^k,1/\eps)$-query testable in the relative-error model, and asked whether $\poly(k,1/\eps)$ queries is achievable. We answer this question affirmatively by giving a $\tilde{O}(k/\eps)$-query algorithm, matching the optimal complexity achieved in the less challenging standard model. Moreover, as our main result, we show that any \emph{subclass} of $k$-juntas that is closed under permuting variables is relative-error testable with a similar complexity. This gives highly efficient relative-error testing algorithms for a number of well-studied function classes, including size-$k$ decision trees, size-$k$ branching programs, and size-$k$ Boolean formulas.
Session: 13B - Testing (Friday 04 July 16:00–17:36)
Authors: Compton, Spencer; Pabbaraju, Chirag; Zhivotovskiy, Nikita
Abstract:
A fundamental open problem in learning theory is to characterize the best-case teaching dimension $\operatorname{TS}_{\min}$ of a concept class $\mathcal{C}$ with finite VC dimension $d$. Resolving this problem will, in particular, settle the conjectured upper bound on Recursive Teaching Dimension posed by [Simon and Zilles; COLT 2015]. Prior work used a natural greedy algorithm to construct teaching sets recursively, thereby proving upper bounds on $\operatorname{TS}_{\min}$, with the best known bound being $O(d^2)$ [Hu, Wu, Li, and Wang; COLT 2017]. In each iteration, this greedy algorithm chooses to add to the teaching set the $k$ labeled points that restrict the concept class the most. In this work, we prove lower bounds on the performance of this greedy approach. Specifically, we show that for $k = 1$, the algorithm does not improve upon the halving-based bound of $O(\log(|\mathcal{C}|))$. Furthermore, for $k = 2$, we complement the upper bound of $O\left(\log(\log(|\mathcal{C}|))\right)$ from [Moran, Shpilka, Wigderson, and Yehudayoff; FOCS 2015] with a matching lower bound. Most consequentially, our lower bound extends up to $k \le \lceil c d \rceil$ for a small constant $c>0$, suggesting that alternative, non-greedy methods may be necessary to resolve the conjecture that $\operatorname{TS}_{\min}= O(d)$.
Session: 13B - Testing (Friday 04 July 16:00–17:36)
Authors: Gao, Chao; Shan, Liren; Srinivas, Vaidehi; Vijayaraghavan, Aravindan
Abstract:
We study the problem of learning a high-density region of an arbitrary distribution over $\mathbb{R}^d$. Given a target coverage parameter $\delta$, and sample access to an arbitrary distribution $\mathcal{D}$, we want to output a confidence set $S \subset \mathbb{R}^d$ such that $S$ achieves $\delta$ coverage of $\mathcal{D}$, i.e., $\mathbb{P}_{y \sim \mathcal{D}} \left[ y \in S \right] \ge \delta$, and the volume of $S$ is as small as possible. This is a central problem in high-dimensional statistics with applications in high-dimensional analogues of finding confidence intervals, uncertainty quantification, and support estimation. In the most general setting, this problem is statistically intractable, so we restrict our attention to competing with sets from a concept class $\mathcal{C}$ with bounded VC-dimension. An algorithm for learning confidence sets is competitive with class $\mathcal{C}$ if, given samples from an arbitrary distribution $\mathcal{D}$, it outputs in polynomial time a set that achieves $\delta$ coverage of $\mathcal{D}$ and whose volume is competitive with the smallest set in $\mathcal{C}$ with the required coverage $\delta$. This problem is computationally challenging even in the basic setting when $\mathcal{C}$ is the set of all Euclidean balls. Existing algorithms based on coresets find in polynomial time a ball whose volume is $\exp(\tilde{O}( d/ \log d))$-factor competitive with the volume of the best ball. Our main result is an algorithm that finds a confidence set whose volume is $\exp(\tilde{O}(d^{2/3}))$-factor competitive with the optimal ball having the desired coverage. The algorithm is improper and simple, and it extends to finding confidence sets competitive against unions of $k$ balls, and other variants. Combined with an NP-hardness result for properly learning balls with a volume approximation factor of $\exp(\tilde{O}(d^{1-o(1)}))$, this provides a striking separation between proper and (improper) learning of confidence sets.
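The simplest proper baseline makes the trade-off tangible: center a ball at the sample mean and grow its radius to the empirical $\delta$-quantile of distances. This achieves $\delta$ coverage on the sample (population coverage needs a separate VC-type generalization step) but can be badly volume-suboptimal, which is exactly what the paper's improper algorithm addresses. Names are illustrative.

```python
import numpy as np

def ball_confidence_set(samples, delta):
    # samples: (N, d) array; returns a ball covering a delta fraction of it.
    center = samples.mean(axis=0)
    dists = np.linalg.norm(samples - center, axis=1)
    return center, np.quantile(dists, delta)
```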
Session: 13B - Testing (Friday 04 July 16:00–17:36)
Authors: Baguley, Samuel; Gobel, Andreas; Pappik, Marcus; Schiller, Leon
Abstract:
We study high-dimensional random geometric graphs (RGGs) of edge density $p$ with vertices uniformly distributed on the $d$-dimensional torus and edges inserted between sufficiently close vertices with respect to an $L_q$-norm. In this setting, we focus on distinguishing an RGG from an Erdős–Rényi graph if both models have the same marginal edge probability $p$. So far, most results in the literature considered either spherical RGGs with $L_2$-distance or toroidal RGGs under $L_\infty$-distance. However, for general $L_q$-distances, many questions remained open, especially if $p$ is allowed to depend on $n$. The main reason for this is that RGGs under $L_q$-distances cannot easily be represented as the logical ``AND'' of their one-dimensional counterparts, as is the case for $L_\infty$ geometries. To overcome this difficulty, we devise a novel technique for quantifying the dependence between edges based on modified Edgeworth expansions. Our technique yields the first tight algorithmic upper bounds for distinguishing toroidal RGGs under general $L_q$ norms from Erdős–Rényi graphs for any fixed $p$ and $q$. We achieve this by showing that the signed triangle statistic can distinguish the two models when $d \ll n^3p^3$ for the whole regime of edge probabilities $c/n$
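The signed triangle statistic mentioned above has a compact matrix form, since $\sum_{i<j<k}(A_{ij}-p)(A_{jk}-p)(A_{ik}-p) = \operatorname{tr}(B^3)/6$ for the centered adjacency matrix $B$ with zero diagonal; the sketch below computes it directly (the distinguishing thresholds are the paper's analysis and are not shown).

```python
import numpy as np

def signed_triangle_statistic(A, p):
    # A: symmetric 0/1 adjacency matrix; p: marginal edge probability.
    B = A - p                    # center the off-diagonal entries
    np.fill_diagonal(B, 0.0)
    return float(np.trace(B @ B @ B) / 6.0)
```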
Session: 13B - Testing (Friday 04 July 16:00–17:36)
Authors: Feng, Weiming; Liu, Hongyang; Yang, Minji
Abstract:
Spin systems form an important class of undirected graphical models. For two Gibbs distributions $\mu$ and $\nu$ induced by two spin systems on the same graph $G = (V, E)$, we study the problem of approximating the total variation distance $d_{\mathrm{TV}}(\mu, \nu)$ with an $\epsilon$-relative error. We propose a new reduction that connects the problem of approximating the TV-distance to sampling and approximate counting. Our applications include the hardcore model and the antiferromagnetic Ising model in the uniqueness regime, the ferromagnetic Ising model, and the general Ising model satisfying the spectral condition. Additionally, we explore the computational complexity of approximating the total variation distance $d_{\mathrm{TV}}(\mu_S, \nu_S)$ between two marginal distributions on an arbitrary subset $S \subseteq V$. We prove that this problem remains hard even when both $\mu$ and $\nu$ admit polynomial-time sampling and approximate counting algorithms.
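As a sanity check on the definitions, both a Gibbs distribution and the TV-distance can be computed exactly by brute-force enumeration on tiny instances (exponential in the number of vertices, so for intuition only; the paper's point is to achieve relative error with just sampling and approximate-counting oracles). The hardcore example and names are illustrative.

```python
import itertools

def hardcore_gibbs(n, edges, lam):
    # Hardcore model: weight lam^{|S|} on each independent set S of the graph.
    weights = {}
    for x in itertools.product([0, 1], repeat=n):
        if any(x[u] and x[v] for u, v in edges):
            continue                       # occupied endpoints: not independent
        weights[x] = lam ** sum(x)
    Z = sum(weights.values())
    return {x: w / Z for x, w in weights.items()}

def tv_distance(mu, nu):
    # d_TV(mu, nu) = (1/2) * sum_x |mu(x) - nu(x)|.
    support = set(mu) | set(nu)
    return 0.5 * sum(abs(mu.get(x, 0.0) - nu.get(x, 0.0)) for x in support)
```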
Session: 13B - Testing (Friday 04 July 16:00–17:36)
Authors: Lyu, Yichen; Yang, Pengkun
Abstract:
This paper studies the problems of identifiability and estimation in high-dimensional nonparametric latent structure models. We introduce an identifiability theorem that generalizes existing conditions, establishing a unified framework applicable to diverse statistical settings. Our results rigorously demonstrate how increased dimensionality, coupled with diversity in variables, inherently facilitates identifiability. For the estimation problem, we establish near-optimal minimax rate bounds for the high-dimensional nonparametric density estimation under latent structures with smooth marginals. Contrary to the conventional curse of dimensionality, our sample complexity scales only polynomially with the dimension. Additionally, we develop a perturbation theory for component recovery and propose a recovery procedure based on simultaneous diagonalization.
Session: 13B - Testing (Friday 04 July 16:00–17:36)
Authors: Kazemi, Hadi; Pensia, Ankit; Jog, Varun
Abstract:
This paper resolves two open problems from a recent paper~\citep{PenEtal24b} concerning the sample complexity of distributed simple binary hypothesis testing under information constraints. The first open problem asks whether interaction reduces the sample complexity of distributed simple binary hypothesis testing. In this paper, we show that sequential interaction does not help. The second problem suggests tightening existing sample complexity bounds for communication-constrained simple binary hypothesis testing. We derive optimally tight bounds for this setting and resolve this problem. Our main technical contributions are: (i) a one-shot lower bound on the Bayes error in simple binary hypothesis testing that tensorises; (ii) a streamlined proof of the formula for the sample complexity of simple binary hypothesis testing without constraints, first established in~\cite{PenEtal24b}; and (iii) a reverse data-processing inequality for Hellinger-$\lambda$ divergences, generalising the results from \cite{BhaEtal21} and \cite{PenEtal23}.
Time: (no scheduled time)
Session: Virtual
Authors: Jiang, Liwei; Roy, Abhishek; Balasubramanian, Krishna; Davis, Damek; Drusvyatskiy, Dmitriy; Na, Sen
Abstract:
We consider applying stochastic approximation (SA) methods to solve nonsmooth variational inclusion problems. Existing studies have shown that the averaged iterates of SA methods exhibit asymptotic normality, with an optimal limiting covariance matrix in the local minimax sense of Hájek and Le Cam. However, no methods have been proposed to estimate this covariance matrix in a nonsmooth and potentially non-monotone (nonconvex) setting. In this paper, we study an online batch-means covariance matrix estimator introduced in Zhu et al. (2023). The estimator groups the SA iterates appropriately and computes the sample covariance among batches as an estimate of the limiting covariance. Its construction does not require prior knowledge of the total sample size, and updates can be performed recursively as new data arrives. We establish that, as long as the batch size sequence is properly specified (depending on the stepsize sequence), the estimator achieves a convergence rate of order $O(\sqrt{d}n^{-1/8+\varepsilon})$ for any $\varepsilon>0$, where $d$ and $n$ denote the problem dimensionality and the number of iterations (or samples) used. Although the problem is nonsmooth and potentially non-monotone (nonconvex), our convergence rate matches the best-known rate for covariance estimation methods using only first-order information in smooth and strongly-convex settings. The consistency of this covariance estimator enables asymptotically valid statistical inference, including constructing confidence intervals and performing hypothesis testing.
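The batch-means construction is simple to sketch in its offline, fixed-batch form: partition the iterates into batches and rescale the sample covariance of the batch means. The estimator analyzed in the paper is online, with batch sizes growing in tandem with the stepsize schedule; the version below is for intuition only, and its names are illustrative.

```python
import numpy as np

def batch_means_covariance(iterates, batch_size):
    # iterates: (n, d) array of SA iterates; batch means have covariance
    # roughly Sigma / batch_size, so rescaling recovers Sigma.
    iterates = np.asarray(iterates)
    n, d = iterates.shape
    m = n // batch_size
    means = iterates[: m * batch_size].reshape(m, batch_size, d).mean(axis=1)
    centered = means - means.mean(axis=0)
    return batch_size * (centered.T @ centered) / (m - 1)
```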
Session: Virtual
Authors: Pittas, Thanasis; Pensia, Ankit
Abstract:
Algorithmic robust statistics has traditionally focused on the contamination model where a small fraction of the samples are arbitrarily corrupted. We consider a recent contamination model that combines two kinds of corruptions: (i) small fraction of arbitrary outliers, as in classical robust statistics, and (ii) local perturbations, where samples may undergo bounded shifts on average. While each noise model is well understood individually, the combined contamination model poses new algorithmic challenges, with only partial results known. Existing efficient algorithms are limited in two ways: (i) they work only for a weak notion of local perturbations, and (ii) they obtain suboptimal error for isotropic subgaussian distributions (among others). The latter limitation led \cite{NieGS24} to hypothesize that improving the error might, in fact, be computationally hard. Perhaps surprisingly, we show that information theoretically optimal error can indeed be achieved in polynomial time, under an even \emph{stronger} local perturbation model (the sliced-Wasserstein metric as opposed to the Wasserstein metric). Notably, our analysis reveals that the entire family of \emph{stability-based} robust mean estimators continues to work optimally in a black-box manner for the combined contamination model. This generalization is particularly useful in real-world scenarios where the specific form of data corruption is not known in advance. We also present efficient algorithms for distribution learning and principal component analysis in the combined contamination model.
Session: Virtual
Authors: Kumar, Akash; Parhi, Rahul; Belkin, Misha
Abstract:
Recent work has characterized the space of two-layered infinite-width neural networks as a bounded variation space $\rbv{\Omega}$ over domains $\Omega \subset \reals^d$. These spaces encompass several classical multivariate function spaces, including the $L_1$- and $L_2$-Sobolev spaces of order $d+1$, where $d$ represents the ambient dimension of the domain. This Sobolev regularity provides sufficient structure to overcome the curse of dimensionality in approximation theory. Notably, $\rbv{\Omega}$ also contains functions with less classical regularity, particularly those exhibiting significant variations in only a few directions. For bounded domains, it is well-established that Gaussian reproducing kernel Hilbert spaces (RKHS) strictly continuously embed within $\rbv{\Omega}$, demonstrating a clear gap between the Gaussian RKHS and $\rbv{\Omega}$. However, this relationship becomes more nuanced in unbounded domains. In this work, we investigate the setting where $\Omega = \reals^d$ and establish a fundamental result: certain Gaussian kernel functions cannot be represented within $\rbv{\reals^d}$, providing a contrasting non-trivial gap in the complement of the intersection of these two spaces.
Session: Virtual
Authors: Zhou, Yihan; Zhang, Chicheng
Abstract:
Multi-distribution learning extends agnostic Probably Approximately Correct (PAC) learning to the setting in which a family of $k$ distributions, $\cbr{D_i}_{i\in[k]}$, is considered and a classifier's performance is measured by its error under the worst distribution. This problem has attracted extensive study due to its broad applications in collaborative learning, fairness, and robustness. Despite a rather complete picture of the sample complexity of passive multi-distribution learning, research on active multi-distribution learning remains scarce, with algorithms whose optimality remains unknown. In this paper, we develop new algorithms for active multi-distribution learning and establish improved label complexity upper and lower bounds. Specifically, we prove label complexity upper bounds of $\widetilde{O}\Bigl(\theta_{\max}(d+k)\ln\frac{1}{\varepsilon}\Bigr)$ in the realizable setting and $\widetilde{O}\Bigl(\theta_{\max}(d+k)\Bigl(\log\frac{1}{\varepsilon}+\frac{\nu^2}{\varepsilon^2}\Bigr)+\frac{k\nu}{\varepsilon^2}\Bigr)$ in the (near-realizable) agnostic setting, where \(\theta_{\max}\) is the maximum disagreement coefficient among the \(k\) distributions, \(d\) is the VC dimension of the hypothesis class, \(\nu\) is the error of the best hypothesis, and \(\varepsilon\) is the target excess error. Moreover, we show that the bound in the realizable setting is information-theoretically optimal and that the \(k\nu/\varepsilon^2\) term in the agnostic setting is fundamental for all proper learners. We also derive novel distribution-free label complexity upper bounds.
Session: Virtual
Authors: Lu, Zhou; Sun, Y. Jennifer; Zhang, Zhiyu
Abstract:
Focusing on the expert problem in online learning, this paper studies the interpolation of several performance metrics via $\phi$-regret minimization, which measures the performance of an algorithm by its regret with respect to an arbitrary action modification rule $\phi$. With $d$ experts and $T\gg d$ rounds in total, we present a single algorithm achieving the instance-adaptive $\phi$-regret bound \begin{equation*} \tilde O(\min\{\sqrt{d-d^\unif_\phi+1},\sqrt{d-d^\self_\phi}\}\cdot\sqrt{T}), \end{equation*} where $d^\unif_\phi$ is the maximum number of experts modified identically by $\phi$, and $d^\self_\phi$ is the number of experts that $\phi$ trivially modifies to themselves. By recovering the optimal $O(\sqrt{T\log d})$ external regret bound when $d^\unif_\phi=d$, the standard $\tilde O(\sqrt{T})$ internal regret bound when $d^\self_\phi=d-1$, and the optimal $\tilde O(\sqrt{dT})$ swap regret bound in the worst case, we improve existing results in the intermediate regimes. In addition, the same algorithm achieves the optimal quantile regret bound, which corresponds to even easier settings of $\phi$ than the external regret. Building on the classical reduction from $\phi$-regret minimization to external regret minimization on stochastic matrices, our main idea is to further convert the latter to online linear regression using Haar-wavelet-inspired matrix features. Then, we apply a particular $L_1$-version of comparator-adaptive online learning algorithms to exploit the sparsity in this regression subroutine.