Opening remarks (Saturday, July 2, 09:00 AM GMT+1)

Location: Room A

Session chair: Peter Grünwald

Online learning and optimization I (Saturday, July 2, 09:10 AM GMT+1)

Location: Room A

Session chairs: Vitaly Feldman; Chi Jin

Pushing the Efficiency-Regret Pareto Frontier for Online Learning of Portfolios and Quantum States

Time: Saturday, July 2, 09:10 AM GMT+1

Authors: Julian Zimmert; Naman Agarwal; Satyen Kale

We revisit the classical online portfolio selection problem. It is widely assumed that a trade-off between computational complexity and regret is unavoidable, with Cover’s Universal Portfolios algorithm, SOFT-BAYES and ADA-BARRONS currently constituting its state-of-the-art Pareto frontier. In this paper, we present the first efficient algorithm, BISONS, that obtains polylogarithmic regret with memory and per-step running time requirements that are polynomial in the dimension, displacing ADA-BARRONS from the Pareto frontier. Additionally, we resolve a COLT 2020 open problem by showing that a certain Follow-The-Regularized-Leader algorithm with log-barrier regularization suffers an exponentially larger dependence on the dimension than previously conjectured. Thus, we rule out this algorithm as a candidate for the Pareto frontier. We also extend our algorithm and analysis to a more general problem than online portfolio selection, viz. online learning of quantum states with log loss. This algorithm, called SCHRODINGER’S-BISONS, is the first efficient algorithm with polylogarithmic regret for this more general problem.
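
The regret notion in this abstract can be made concrete with a small sketch. It assumes the standard formulation of online portfolio selection: in each round the learner picks a portfolio on the simplex, the market reveals price relatives, and the loss is the negative log of their inner product. The function names and the two-asset grid search are hypothetical illustrations; this is not BISONS or any other algorithm from the paper.

```python
import math


def log_wealth(portfolios, returns):
    """Sum of log(<x_t, r_t>) over rounds: the log of final wealth per unit invested."""
    return sum(
        math.log(sum(x * r for x, r in zip(x_t, r_t)))
        for x_t, r_t in zip(portfolios, returns)
    )


def best_crp_two_assets(returns, grid=1000):
    """Grid search for the best constant rebalanced portfolio (two assets only)."""
    best_weight, best_value = 0.0, -math.inf
    for k in range(grid + 1):
        a = k / grid
        value = log_wealth([(a, 1.0 - a)] * len(returns), returns)
        if value > best_value:
            best_weight, best_value = a, value
    return best_weight, best_value


def regret(portfolios, returns, grid=1000):
    """Log-loss regret against the best constant rebalanced portfolio in hindsight."""
    _, best_value = best_crp_two_assets(returns, grid)
    return best_value - log_wealth(portfolios, returns)
```

On an alternating market such as `[(2.0, 0.5), (0.5, 2.0)] * 5`, the uniform portfolio is itself the best constant rebalanced portfolio, while holding only the first asset leaves the learner with positive regret.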

Beyond No Regret: Instance-Dependent PAC Reinforcement Learning

Time: Saturday, July 2, 09:22 AM GMT+1

Authors: Andrew Wagenmaker; Max Simchowitz; Kevin Jamieson

The theory of reinforcement learning has focused on two fundamental problems: achieving low regret, and identifying $\epsilon$-optimal policies. While a simple reduction allows one to apply a low-regret algorithm to obtain an $\epsilon$-optimal policy and achieve the worst-case optimal rate, it is unknown whether low-regret algorithms can obtain the instance-optimal rate for policy identification. We show this is not possible: there exists a fundamental tradeoff between achieving low regret and identifying an $\epsilon$-optimal policy at the instance-optimal rate.

Motivated by our negative finding, we propose a new measure of instance-dependent sample complexity for PAC tabular reinforcement learning which explicitly accounts for the attainable state visitation distributions in the underlying MDP. We then propose and analyze a novel, planning-based algorithm which attains this sample complexity, yielding a complexity which scales with the suboptimality gaps and the "reachability" of a state. We show our algorithm is nearly minimax optimal, and show on several examples that our instance-dependent sample complexity offers significant improvements over worst-case bounds.

Chasing Convex Bodies and Functions with Black-Box Advice

Time: Saturday, July 2, 09:34 AM GMT+1

Authors: Nicolas Christianson; Tinashe Handina; Adam Wierman

We consider the problem of convex function chasing with black-box advice, where an online decision-maker aims to minimize the total cost of making and switching between decisions in a normed vector space, aided by black-box advice such as the decisions of a machine-learned algorithm. The decision-maker seeks cost comparable to the advice when it performs well, known as \emph{consistency}, while also ensuring worst-case \emph{robustness} even when the advice is adversarial. We first consider the common paradigm of algorithms that switch between the decisions of the advice and a competitive algorithm, showing that no algorithm in this class can improve upon 3-consistency while staying robust. We then propose two novel algorithms that bypass this limitation by exploiting the problem's convexity. The first, $\textsc{Interp}$, achieves $(\sqrt{2}+\epsilon)$-consistency and $\mathcal{O}(\frac{C}{\epsilon^2})$-robustness for any $\epsilon > 0$, where $C$ is the competitive ratio of an algorithm for convex function chasing or a subclass thereof. The second, $\textsc{BdInterp}$, achieves $(1+\epsilon)$-consistency and $\mathcal{O}(\frac{CD}{\epsilon})$-robustness when the problem has bounded diameter $D$. Further, we show that $\textsc{BdInterp}$ achieves near-optimal consistency-robustness trade-off for the special case where cost functions are $\alpha$-polyhedral.

Online Learning to Transport via the Minimal Selection Principle

Time: Saturday, July 2, 09:46 AM GMT+1

Authors: Wenxuan Guo; YoonHaeng Hur; Tengyuan Liang; Chris Ryan

Motivated by robust dynamic resource allocation in operations research, we study the Online Learning to Transport (OLT) problem where the decision variable is a probability measure, an infinite-dimensional object. We draw connections between online learning, optimal transport, and partial differential equations through an insight called the minimal selection principle, originally studied in the Wasserstein gradient flow setting by Ambrosio et al. (2005). This allows us to extend the standard online learning framework to the infinite-dimensional setting seamlessly. Based on our framework, we derive a novel method called the minimal selection or exploration (MSoE) algorithm to solve OLT problems using mean-field approximation and discretization techniques. In the displacement convex setting, the main theoretical message underpinning our approach is that minimizing transport cost over time (via the minimal selection principle) ensures optimal cumulative regret upper bounds. On the algorithmic side, our MSoE algorithm applies beyond the displacement convex setting, making the mathematical theory of optimal transport practically relevant to non-convex settings common in dynamic resource allocation.

Statistics I (Saturday, July 2, 09:10 AM GMT+1)

Location: Room B

Session chairs: Prateek Jain; Miki Racz

Label Noise (Stochastic) Gradient Descent Implicitly Solves the Lasso for Quadratic Parametrization

Time: Saturday, July 2, 09:10 AM GMT+1

Authors: Loucas Vivien; Julien Reygner; Nicolas Flammarion

Understanding the implicit bias of training algorithms is crucial for explaining the success of overparametrised neural networks. In this paper, we study the role of the label noise in the training dynamics of a quadratically parametrised model through its continuous time version. We explicitly characterise the solution chosen by the stochastic flow and prove that it implicitly solves a Lasso program. To complete our analysis, we provide non-asymptotic convergence guarantees for the dynamics as well as conditions for support recovery. We also give experimental results which support our theoretical claims. Our findings highlight the fact that structured noise can induce better generalisation and help explain the better performance of stochastic dynamics as observed in practice.

Single Trajectory Nonparametric Learning of Nonlinear Dynamics

Time: Saturday, July 2, 09:22 AM GMT+1

Authors: Ingvar Ziemann; Henrik Sandberg; Nikolai Matni

Given a single trajectory of a dynamical system, we analyze the performance of the nonparametric least squares estimator (LSE). More precisely, we give nonasymptotic expected $l^2$-distance bounds between the LSE and the true regression function, where expectation is evaluated on a fresh, counterfactual trajectory. We leverage recently developed information-theoretic methods to establish the optimality of the LSE for nonparametric hypothesis classes in terms of supremum norm metric entropy and a subgaussian parameter. Next, we relate this subgaussian parameter to the stability of the underlying process using notions from dynamical systems theory. When combined, these developments lead to rate-optimal error bounds that scale as $T^{-1/(2+q)}$ for suitably stable processes and hypothesis classes with metric entropy growth of order $\delta^{-q}$. Here, $T$ is the length of the observed trajectory, $\delta \in \mathbb{R}_+$ is the packing granularity and $q\in (0,2)$ is a complexity term. Finally, we specialize our results to a number of scenarios of practical interest, such as Lipschitz dynamics, generalized linear models, and dynamics described by functions in certain classes of Reproducing Kernel Hilbert Spaces (RKHS).

Kernel Interpolation in Sobolev Spaces Is Not Consistent in Low Dimensions

Time: Saturday, July 2, 09:34 AM GMT+1

Authors: Simon Buchholz

We consider kernel ridgeless ridge regression with kernels whose associated RKHS is a Sobolev space $H^s$. We show for $d/2 < s < 3d/4$ that interpolation is not consistent in fixed dimension, extending earlier results for the Laplace kernel in odd dimensions and underlining again that benign overfitting is rare in low dimensions.
The proof proceeds by deriving sharp bounds on the spectrum of random kernel matrices using results from the theory of radial basis functions which might be of independent interest.

Universality of Empirical Risk Minimization

Time: Saturday, July 2, 09:46 AM GMT+1

Authors: Basil Saeed; Andrea Montanari

Consider supervised learning from i.i.d. samples {(y_i, x_i)}_{i≤n} where x_i ∈ R^p are feature vectors and y_i ∈ R are labels. We study empirical risk minimization over a class of functions that are parameterized by k = O(1) vectors θ_1, ..., θ_k ∈ R^p, and prove universality results both for the training and test error. Namely, under the proportional asymptotics n, p → ∞, with n/p = Θ(1), we prove that the training error depends on the random features distribution only through its covariance structure. Further, we prove that the minimum test error over near-empirical risk minimizers enjoys similar universality properties. In particular, the asymptotics of these quantities can be computed (to leading order) under a simpler model in which the feature vectors x_i are replaced by Gaussian vectors g_i with the same covariance.
Earlier universality results were limited to strongly convex learning procedures, or to feature vectors x_i with independent entries. Our results do not make any of these assumptions.
Our assumptions are general enough to include feature vectors x_i that are produced by randomized featurization maps. In particular we explicitly check the assumptions for certain random features models (computing the output of a one-layer neural network with random weights) and neural tangent models (first-order Taylor approximation of two-layer networks).

Best paper presentation I (Saturday, July 2, 10:05 AM GMT+1)

Location: Room A

Session chair: Po-Ling Loh

Efficient Convex Optimization Requires Superlinear Memory

Time: Saturday, July 2, 10:05 AM GMT+1

Authors: Annie Marsden; Vatsal Sharan; Aaron Sidford; Gregory Valiant

We show that any memory-constrained, first-order algorithm which minimizes $d$-dimensional, $1$-Lipschitz convex functions over the unit ball to $1/\mathrm{poly}(d)$ accuracy using at most $d^{1.25 - \delta}$ bits of memory must make at least $\Omega(d^{1 + \delta})$ first-order queries (for any constant $\delta \in [0, 1/4]$). Consequently, the performance of such memory-constrained algorithms is a polynomial factor worse than the optimal $\tilde{O}(d)$ query bound for this problem obtained by cutting plane methods that use $\tilde{O}(d^2)$ memory. This resolves a COLT 2019 open problem of Woodworth and Srebro.

Privacy I (Saturday, July 2, 10:45 AM GMT+1)

Location: Room A

Session chairs: Peter Grünwald; Matus Telgarsky

A Private and Computationally-Efficient Estimator for Unbounded Gaussians

Time: Saturday, July 2, 10:45 AM GMT+1

Authors: Gautam Kamath; Argyris Mouzakis; Vikrant Singhal; Thomas Steinke; Jonathan Ullman

We give the first polynomial-time, polynomial-sample, differentially private estimator for the mean and covariance of an arbitrary Gaussian distribution $N(\mu,\Sigma)$ in $\mathbb{R}^d$. All previous estimators are either nonconstructive, with unbounded running time, or require the user to specify a priori bounds on the parameters $\mu$ and $\Sigma$. The primary new technical tool in our algorithm is a new differentially private preconditioner that takes samples from an arbitrary Gaussian $N(0,\Sigma)$ and returns a matrix $A$ such that $A \Sigma A^T$ has constant condition number.

A Bounded-Noise Mechanism for Differential Privacy

Time: Saturday, July 2, 10:57 AM GMT+1

Authors: Yuval Dagan; Gil Kur

We present an asymptotically optimal $(\epsilon,\delta)$ differentially private mechanism for answering multiple, adaptively asked, $\Delta$-sensitive queries, settling the conjecture of Steinke and Ullman [2020].
Our algorithm has the significant advantage that it adds independent bounded noise to each query, thus providing an absolute error bound.
Additionally, we apply our algorithm in adaptive data analysis, obtaining an improved guarantee for answering multiple queries regarding some underlying distribution using a finite sample. Numerical computations show that the bounded-noise mechanism outperforms the Gaussian mechanism in many standard settings.

Private and Polynomial Time Algorithms for Learning Gaussians and Beyond

Time: Saturday, July 2, 11:09 AM GMT+1

Authors: Hassan Ashtiani; Christopher Liaw

We present a fairly general framework for reducing $(\varepsilon, \delta)$-differentially private (DP) statistical estimation to its non-private counterpart. As the main application of this framework, we give a polynomial time and $(\varepsilon,\delta)$-DP algorithm for learning (unrestricted) Gaussian distributions in $\mathbb{R}^d$. The sample complexity of our approach for learning the Gaussian up to total variation distance $\alpha$ is $\tilde{O}(d^2/\alpha^2 + d^2\sqrt{\ln(1/\delta)}/(\alpha \varepsilon) + d\ln(1/\delta)/(\alpha \varepsilon))$, matching (up to logarithmic factors) the best known information-theoretic (non-efficient) sample complexity upper bound due to Aden-Ali, Ashtiani, and Kamath (2021). In an independent work, Kamath, Mouzakis, Singhal, Steinke, and Ullman (2021) proved a similar result using a different approach and with $O(d^{5/2})$ sample complexity dependence on $d$.

As another application of our framework, we provide the first polynomial time $(\varepsilon, \delta)$-DP algorithm for robust learning of (unrestricted) Gaussians with sample complexity $\tilde{O}(d^{3.5})$. In another independent work, Kothari, Manurangsi, and Velingker (2021) also provided a polynomial time $(\epsilon, \delta)$-DP algorithm for robust learning of Gaussians with sample complexity $\tilde{O}(d^8)$.

(Nearly) Optimal Private Linear Regression for Sub-Gaussian Data via Adaptive Clipping

Time: Saturday, July 2, 11:21 AM GMT+1

Authors: Prateek Varshney; Abhradeep Thakurta; Prateek Jain

We study the problem of differentially private linear regression where each data point is sampled from a fixed sub-Gaussian style distribution. We propose and analyze a one-pass mini-batch stochastic gradient descent method (DP-AMBSSGD) where points in each iteration are sampled without replacement. Noise is added for DP, but the noise standard deviation is estimated online. Compared to existing $(\epsilon, \delta)$-DP techniques, which have sub-optimal error bounds, DP-AMBSSGD provides nearly optimal error bounds in terms of key parameters such as the dimensionality $d$, the number of points $N$, and the standard deviation $\sigma$ of the noise in observations. For example, when the $d$-dimensional covariates are sampled i.i.d. from the normal distribution, the excess error of DP-AMBSSGD due to privacy is $(\sigma^2 d/N)(1+d/(\epsilon^2 N))$, i.e., the error is meaningful when the number of samples satisfies $N \geq d \log d$, which is the standard operative regime for linear regression. In contrast, the error bound for existing efficient methods in this setting is $d^3/(\epsilon^2 N^2)$, even for $\sigma=0$. That is, for constant $\epsilon$, the existing techniques require $N = d^{1.5}$ to provide a non-trivial result.
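
The overall shape of a one-pass, clipped, noised mini-batch SGD for least squares can be sketched as follows. This is a generic illustration and not DP-AMBSSGD itself: the clipping threshold `clip_norm` is fixed rather than adapted online, and `noise_multiplier` is a hypothetical knob that is not calibrated to any $(\epsilon, \delta)$ guarantee. All function and parameter names are assumptions made for the example.

```python
import math
import random


def clip(vec, max_norm):
    """Scale vec down so its Euclidean norm is at most max_norm."""
    norm = math.sqrt(sum(v * v for v in vec))
    if norm <= max_norm or norm == 0.0:
        return list(vec)
    return [v * max_norm / norm for v in vec]


def one_pass_noisy_sgd(xs, ys, batch_size, lr, clip_norm, noise_multiplier, seed=0):
    """One pass of mini-batch SGD for least squares with clipped, noised gradients.

    Each example is used at most once (disjoint batches, no replacement).
    clip_norm is a fixed threshold here; noise_multiplier is a hypothetical
    knob and is NOT calibrated to any (epsilon, delta) guarantee.
    """
    rng = random.Random(seed)
    d = len(xs[0])
    w = [0.0] * d
    order = list(range(len(xs)))
    rng.shuffle(order)
    for start in range(0, len(order) - batch_size + 1, batch_size):
        batch = order[start:start + batch_size]
        grad_sum = [0.0] * d
        for i in batch:
            # Per-example gradient of 0.5 * (<w, x> - y)^2, clipped before summing.
            residual = sum(wj * xj for wj, xj in zip(w, xs[i])) - ys[i]
            per_example = clip([residual * xj for xj in xs[i]], clip_norm)
            grad_sum = [g + p for g, p in zip(grad_sum, per_example)]
        # Gaussian noise scaled to the clipping threshold, added once per batch.
        sigma = noise_multiplier * clip_norm
        noisy = [(g + rng.gauss(0.0, sigma)) / batch_size for g in grad_sum]
        w = [wj - lr * gj for wj, gj in zip(w, noisy)]
    return w
```

Clipping bounds the influence of any single example on the summed gradient, which is what makes noise proportional to `clip_norm` meaningful; how the threshold and noise level are chosen is exactly where the paper's adaptive scheme differs from this fixed-threshold sketch.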

Private High-Dimensional Hypothesis Testing

Time: Saturday, July 2, 11:33 AM GMT+1

Authors: Shyam Narayanan

We provide improved differentially private algorithms for identity testing of high-dimensional distributions. Specifically, for $d$-dimensional Gaussian distributions with known covariance $\Sigma$, we can test whether the distribution comes from $\mathcal{N}(\mu^*, \Sigma)$ for some fixed $\mu^*$ or from some $\mathcal{N}(\mu, \Sigma)$ with total variation distance at least $\alpha$ from $\mathcal{N}(\mu^*, \Sigma)$ with $(\varepsilon, 0)$-differential privacy, using only
\[\tilde{O}\left(\frac{d^{1/2}}{\alpha^2} + \frac{d^{1/3}}{\alpha^{4/3} \cdot \varepsilon^{2/3}} + \frac{1}{\alpha \cdot \varepsilon}\right)\]
samples if the algorithm is allowed to be computationally inefficient, and only
\[\tilde{O}\left(\frac{d^{1/2}}{\alpha^2} + \frac{d^{1/4}}{\alpha \cdot \varepsilon}\right)\]
samples for a computationally efficient algorithm. We also provide a matching lower bound showing that our computationally inefficient algorithm has optimal sample complexity. We also extend our algorithms to various related problems, including mean testing of Gaussians with bounded but unknown covariance, uniformity testing of product distributions over $\{\pm 1\}^d$, and tolerant testing. Our results improve over the previous best work of Canonne et al. (2020) for both computationally efficient and inefficient algorithms, and even our computationally efficient algorithm matches the optimal \emph{non-private} sample complexity of $O\left(\frac{\sqrt{d}}{\alpha^2}\right)$ in many standard parameter settings. In addition, our results show that, surprisingly, private identity testing of $d$-dimensional Gaussians can be done with fewer samples than private identity testing of discrete distributions over a domain of size $d$ (Acharya et al., 2018), which refutes a conjectured lower bound of Canonne et al. (2020).

Virtual session I (Saturday, July 2, 10:45 AM GMT+1)

Location: Room B

Session chairs: Claire Vernade; Tomer Koren

Risk Bounds for Aggregated Shallow Neural Networks using Gaussian Priors

Time: Saturday, July 2, 10:45 AM GMT+1

Authors: Laura Tinsi; Arnak Dalalyan

Analysing statistical properties of neural networks is a central topic in statistics and machine learning. However, most results in the literature focus on the properties of the neural network minimizing the training error. The goal of this paper is to consider aggregated neural networks using a Gaussian prior. The departure point of our approach is an arbitrary aggregate satisfying the PAC-Bayesian inequality. The main contribution is a precise nonasymptotic assessment of the estimation error appearing in the PAC-Bayes bound. Our analysis is sharp enough to lead to minimax rates of estimation over Sobolev smoothness classes.

Self-Consistency of the Fokker Planck Equation

Time: Saturday, July 2, 10:57 AM GMT+1

Authors: Zebang Shen; Zhenfu Wang; Satyen Kale; Alejandro Ribeiro; Amin Karbasi; Hamed Hassani

The Fokker-Planck equation (FPE) is the partial differential equation that governs the density evolution of the Itô process and is of great importance to the literature of statistical physics and machine learning.
The FPE can be regarded as a continuity equation where the change of the density is completely determined by a time varying velocity field.
Importantly, this velocity field also depends on the current density function. As a result, the ground-truth velocity field can be shown to be the solution of a fixed-point equation, a property that we call \textit{self-consistency}.
In this paper, we exploit this concept to design a potential function of the hypothesis velocity fields, and prove that, if such a function diminishes to zero during the training procedure, the trajectory of the densities generated by the hypothesis velocity fields converges to the solution of the FPE in the Wasserstein-2 sense.
The proposed potential function is amenable to neural-network based parameterization as the stochastic gradient with respect to the parameter can be efficiently computed.
Once a parameterized model, such as a Neural Ordinary Differential Equation, is trained, we can generate the entire trajectory of the solution to the FPE.
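
As a worked instance of the continuity-equation view described above, consider the overdamped Langevin process $dX_t = -\nabla V(X_t)\,dt + \sqrt{2}\,dW_t$. This specific process and the potential $V$ are assumptions chosen for illustration and do not come from the abstract. Its Fokker-Planck equation can be rewritten as a continuity equation whose velocity field depends on the current density $p_t$:

```latex
\begin{align*}
\partial_t p_t &= \nabla \cdot (p_t \nabla V) + \Delta p_t \\
               &= -\nabla \cdot \bigl(p_t \, v_t\bigr),
\qquad v_t = -\nabla V - \nabla \log p_t .
\end{align*}
```

The second line follows from $p_t \nabla \log p_t = \nabla p_t$. Because $v_t$ is defined through $p_t$, and $p_t$ is in turn transported by $v_t$, the true velocity field solves a fixed-point equation, which is the kind of structure the abstract calls self-consistency.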

Understanding Riemannian Acceleration via a Proximal Extragradient Framework

Time: Saturday, July 2, 11:09 AM GMT+1

Authors: Jikai Jin; Suvrit Sra

We contribute to advancing the understanding of Riemannian accelerated gradient methods. In particular, we revisit "Accelerated Hybrid Proximal Extragradient" (A-HPE), a powerful framework for obtaining Euclidean accelerated methods (Monteiro and Svaiter, 2013). Building on A-HPE, we then propose and analyze Riemannian A-HPE. The core of our analysis consists of two key components: (i) a set of new insights into Euclidean A-HPE itself; and (ii) a careful control of metric distortion caused by Riemannian geometry. We illustrate our framework by obtaining a few existing and new Riemannian accelerated gradient methods as special cases, while characterizing their acceleration as corollaries of our main results.

Corruption-Robust Contextual Search through Density Updates

Time: Saturday, July 2, 11:21 AM GMT+1

Authors: Renato Leme; Chara Podimata; Jon Schneider

We study the problem of contextual search in the adversarial noise model. Let $d$ be the dimension of the problem, $T$ be the time horizon and $C$ be the total amount of noise in the system. For the $\epsilon$-ball loss, we give a tight regret bound of $O(C + d \log(1/\epsilon))$, improving over the $O(d^3 \log(1/\epsilon) \log^2(T) + C \log(T) \log(1/\epsilon))$ bound of Krishnamurthy et al. (STOC'21). For the symmetric loss, we give an efficient algorithm with regret $O(C+d \log T)$.

In terms of techniques, our algorithms are a departure from previous contextual search models in the sense that they keep track of density functions over the candidate vectors instead of a knowledge set consisting of the candidate vectors consistent with the feedback obtained.

Polynomial-Time Reinforcement Learning without Horizon

Time: Saturday, July 2, 11:33 AM GMT+1

Authors: Zihan Zhang; Xiangyang Ji; Simon Du

This paper gives the first polynomial-time algorithm for tabular Markov Decision Processes (MDPs) that enjoys regret independent of the planning horizon. Specifically, we consider a tabular MDP with $S$ states, $A$ actions, a planning horizon $H$, and total reward bounded by $1$, in which the agent plays for $K$ episodes. We design an algorithm that achieves an $O\left(\mathrm{poly}(S,A,\log K)\sqrt{K}\right)$ regret, in contrast to existing bounds which either have an additional $\mathrm{polylog}(H)$ dependency or have an exponential dependency on $S$. Our key technical contributions are (1) a new explicit exploration algorithm and (2) a sequence of new results establishing the approximation power, stability, and concentration property of stationary policies, which may be of independent interest.

Multilevel Optimization for Inverse Problems

Time: Saturday, July 2, 11:45 AM GMT+1

Authors: Simon Weissmann; Ashia Wilson; Jakob Zech

Inverse problems occur in a variety of parameter identification tasks in engineering. Such problems are challenging in practice, as they require repeated evaluation of computationally expensive forward models. We introduce a unifying framework of multilevel optimization that can be applied to a wide range of optimization-based solvers. Our framework provably reduces the computational cost associated with evaluating the expensive forward maps. To demonstrate the versatility of our analysis, we discuss its implications for various methodologies including multilevel gradient descent, a multilevel ensemble Kalman inversion and a multilevel Langevin sampler. We also provide numerical experiments to verify our theoretical findings.

LeT-All career panel (Saturday, July 2, 02:00 PM GMT+1)

Location: Room A

Session chair: Thodoris Lykouris

Robustness (Saturday, July 2, 03:30 PM GMT+1)

Location: Room A

Session chairs: Oliver Hinder; Eric Price

Approximate Cluster Recovery from Noisy Labels

Time: Saturday, July 2, 03:30 PM GMT+1

Authors: Buddhima Gamlath; Silvio Lattanzi; Ashkan Norouzi-Fard; Ola Svensson

Designing algorithms for machine learning problems targeting beyond worst-case analysis and, in particular, analyzing the effect of side-information on the complexity of such problems is a very important line of research with many practical applications. In this paper we study the classic k-means clustering problem in the presence of noisy labels.
In this problem, in addition to a set of points and parameter $k$, we receive a cluster label for each point, generated by either an adversarial or a random perturbation of the optimal solution. Our main goal is to formally study the effect of this extra information on the complexity of the k-means problem. In particular, in the context of random perturbations, we give an efficient algorithm that finds a clustering of cost within a factor $1+o(1)$ of the optimum even when the label of each point is perturbed with a large probability (think 99%). In contrast, we show that the problem with adversarially perturbed side-information is as hard as the original problem even if only a small $\epsilon$ fraction of the labels are perturbed. We complement this negative result by giving a simple algorithm for the case when the adversary is only allowed to perturb an $\epsilon$ fraction of the labels in \emph{each cluster}.

Learning GMMs with Nearly Optimal Robustness Guarantees

Time: Saturday, July 2, 03:42 PM GMT+1

Authors: Allen Liu; Ankur Moitra

In this work we solve the problem of robustly learning a high-dimensional Gaussian mixture model with $k$ components from $\epsilon$-corrupted samples up to accuracy $\widetilde{O}(\epsilon)$ in total variation distance for any constant $k$ and with mild assumptions on the mixture. This robustness guarantee is optimal up to polylogarithmic factors. The main challenge is that most earlier works rely on learning individual components in the mixture, but this is impossible in our setting, at least for the types of strong robustness guarantees we are aiming for. Instead, we introduce a new framework which we call \emph{strong observability} that gives us a route to circumvent this obstacle.

Learning a Single Neuron with Adversarial Label Noise via Gradient Descent

Time: Saturday, July 2, 03:54 PM GMT+1

Authors: Ilias Diakonikolas; Vasilis Kontonis; Christos Tzamos; Nikos Zarifis

We study the fundamental problem of learning a single neuron, i.e., a function of the form $\mathbf{x} \mapsto \sigma(\mathbf{w} \cdot \mathbf{x})$ for monotone activations $\sigma:\mathbb{R} \mapsto \mathbb{R}$, with respect to the $L_2^2$-loss in the presence of adversarial label noise. Specifically, we are given labeled examples from a distribution $D$ on $(\mathbf{x}, y) \in \mathbb{R}^d \times \mathbb{R}$ such that there exists $\mathbf{w}^\ast \in \mathbb{R}^d$ achieving $F(\mathbf{w}^\ast) = \mathrm{OPT}$, where $F(\mathbf{w}) = \mathbb{E}_{(\mathbf{x},y) \sim D}[(\sigma(\mathbf{w}\cdot \mathbf{x}) - y)^2]$. The goal of the learner is to output a hypothesis vector $\widetilde{\mathbf{w}}$ such that $F(\widetilde{\mathbf{w}}) = C \, \mathrm{OPT}+\epsilon$ with high probability, where $C$ is a universal constant. As our main contribution, we give efficient constant-factor approximate learners for a broad class of distributions (including log-concave distributions) and activation functions (including ReLUs and sigmoids). Concretely, for the class of isotropic log-concave distributions, we obtain the following important corollaries:
\begin{itemize}
\item For the logistic activation, i.e., $\sigma(t) = 1/(1+e^{-t})$, we obtain the first polynomial-time constant factor approximation, even under the Gaussian distribution. Moreover, our algorithm has sample complexity $\widetilde{O}(d/\epsilon)$, which is tight within polylogarithmic factors.
\item For the ReLU activation, i.e., $\sigma(t) = \max(0,t)$, we give an efficient algorithm with sample complexity $\widetilde{O}(d \, \mathrm{polylog}(1/\epsilon))$. Prior to our work, the best known constant-factor approximate learner had sample complexity $\Omega(d/\epsilon)$.
\end{itemize}
In both settings, our algorithms are simple, performing gradient-descent on the (regularized) $L_2^2$-loss. The correctness of our algorithms relies on novel structural results that we establish, showing that (essentially all) stationary points of the underlying non-convex loss are approximately optimal.

Robustly-Reliable Learners under Poisoning Attacks

Time: Saturday, July 2, 04:06 PM GMT+1

Authors: Maria-Florina Balcan; Avrim Blum; Steve Hanneke; Dravyansh Sharma

Data poisoning attacks, in which an adversary corrupts a training set with the goal of inducing specific desired mistakes, have raised substantial concern: even just the possibility of such an attack can make a user no longer trust the results of a learning system. In this work, we show how to achieve strong robustness guarantees in the face of such attacks across multiple axes. Specifically, we provide robustly-reliable predictions, in which the predicted label is guaranteed to be correct so long as the adversary has not exceeded a given corruption budget, even in the presence of instance-targeted attacks, where the adversary knows the test example in advance and aims to cause a specific failure on that example. We also extend these results to active and agnostic learning. We provide nearly-tight upper and lower bounds on the guarantees achievable in these scenarios, as well as efficient algorithms given an ERM oracle. Moreover, for the case of linear separators over natural distributions, we provide efficient non-oracle algorithms for such robustly-reliable predictions.

Robust Sparse Mean Estimation via Sum of Squares

Time: Saturday, July 2, 04:18 PM GMT+1

Authors: Ilias Diakonikolas; Daniel Kane; Sushrut Karmalkar; Ankit Pensia; Thanasis Pittas

We study the problem of high-dimensional sparse mean estimation in the presence of an $\epsilon$-fraction of adversarial outliers. Prior work obtained sample and computationally efficient algorithms for this task for identity-covariance subgaussian distributions. In this work, we develop the first efficient algorithms for robust sparse mean estimation without a priori knowledge of the covariance. For distributions on $\mathbb{R}^d$ with "certifiably bounded" $t$-th moments and sufficiently light tails, our algorithm achieves error of $O(\epsilon^{1-1/t})$ with sample complexity $m = (k\log(d))^{O(t)}/\epsilon^{2-2/t}$. For the special case of the Gaussian distribution, our algorithm achieves near-optimal error of $\tilde O(\epsilon)$ with sample complexity $m = O(k^4 \, \mathrm{polylog}(d))/\epsilon^2$. Our algorithms follow the Sum-of-Squares based proofs-to-algorithms approach. We complement our upper bounds with Statistical Query and low-degree polynomial testing lower bounds, providing evidence that the sample-time-error tradeoffs achieved by our algorithms are qualitatively best possible.

On the Power of Adaptivity in Statistical Adversaries

Time: Saturday, July 2, 04:30 PM GMT+1

Authors: Guy Blanc; Jane Lange; Ali Malik; Li-Yang Tan

We initiate the study of a fundamental question concerning adversarial noise models in statistical problems where the algorithm receives i.i.d. draws from a distribution $\mathcal{D}$. The definitions of these adversaries specify the \emph{type} of allowable corruptions (noise model) as well as \emph{when} these corruptions can be made (adaptivity); the latter differentiates between oblivious adversaries that can only corrupt the distribution $\mathcal{D}$ and adaptive adversaries that can have their corruptions depend on the specific sample $S$ that is drawn from $\mathcal{D}$.

We investigate whether oblivious adversaries are effectively equivalent to adaptive adversaries, across all noise models studied in the literature, under a unifying framework that we introduce. Specifically, can the behavior of an algorithm $\mathcal{A}$ in the presence of oblivious adversaries always be well-approximated by that of an algorithm $\mathcal{A}'$ in the presence of adaptive adversaries? Our first result shows that this is indeed the case for the broad class of \emph{statistical query} algorithms, under all reasonable noise models. We then show that in the specific case of \emph{additive noise}, this equivalence holds for \emph{all} algorithms. Finally, we map out an approach towards proving this statement in its fullest generality, for all algorithms and under all reasonable noise models.

Virtual session II (Saturday, July 2, 03:30 PM GMT+1)

Location: Room B

Session chairs: Satyen Kale; Prateek Jain

Policy Optimization for Stochastic Shortest Path

Time: Saturday, July 2, 03:30 PM GMT+1

Authors: Liyu Chen; Haipeng Luo; Aviv Rosenberg

Policy optimization is among the most popular and successful reinforcement learning algorithms, and there is increasing interest in understanding its theoretical guarantees. In this work, we initiate the study of policy optimization for the stochastic shortest path (SSP) problem, a goal-oriented reinforcement learning model that strictly generalizes the finite-horizon model and better captures many applications. We consider a wide range of settings, including stochastic and adversarial environments under full information or bandit feedback, and propose a policy optimization algorithm for each setting that makes use of novel correction terms and/or variants of dilated bonuses (Luo et al., 2021). For most settings, our algorithm is shown to achieve a near-optimal regret bound.
One key technical contribution of this work is a new approximation scheme to tackle SSP problems that we call stacked discounted approximation and use in all our proposed algorithms. Unlike the finite-horizon approximation that is heavily used in recent SSP algorithms, our new approximation enables us to learn a near-stationary policy with only logarithmic changes during an episode and could lead to an exponential improvement in space complexity.

Computational-Statistical Gap in Reinforcement Learning

Time: Saturday, July 2, 03:42 PM GMT+1

Authors: Daniel Kane; Sihan Liu; Shachar Lovett; Gaurav Mahajan

Reinforcement learning with function approximation has recently achieved tremendous results in applications with large state spaces. This empirical success has motivated a growing body of theoretical work proposing necessary and sufficient conditions under which efficient reinforcement learning is possible. From this line of work, a remarkably simple minimal sufficient condition has emerged for sample efficient reinforcement learning: MDPs with optimal value function V* and Q* linear in some known low-dimensional features. In this setting, recent works have designed sample efficient algorithms which require a number of samples polynomial in the feature dimension and independent of the size of state space. They however leave finding computationally efficient algorithms as future work and this is considered a major open problem in the community.

In this work, we make progress on this open problem by presenting the first computational lower bound for RL with linear function approximation: unless NP=RP, no randomized polynomial time algorithm exists for deterministic transition MDPs with a constant number of actions and linear optimal value functions. To prove this, we show a reduction from Unique-Sat, where we convert a CNF formula into an MDP with deterministic transitions, constant number of actions and low dimensional linear optimal value functions. This result also exhibits the first computational-statistical gap in reinforcement learning with linear function approximation, as the underlying statistical problem is information-theoretically solvable with a polynomial number of queries, but no computationally efficient algorithm exists unless NP=RP. Finally, we also prove a quasi-polynomial time lower bound under the Randomized Exponential Time Hypothesis.

Minimax Regret Optimization for Robust Machine Learning under Distribution Shift

Time: Saturday, July 2, 03:54 PM GMT+1

Authors: Alekh Agarwal; Tong Zhang

In this paper, we consider learning scenarios where the learned model is evaluated under an unknown test distribution which potentially differs from the training distribution (i.e. distribution shift). The learner has access to a family of weight functions such that the test distribution is a reweighting of the training distribution under one of these functions, a setting typically studied under the name of Distributionally Robust Optimization (DRO). We consider the problem of deriving regret bounds in the classical learning theory setting, and require that the resulting regret bounds hold uniformly for all potential test distributions. We show that the DRO formulation does not guarantee uniformly small regret under distribution shift. We instead propose an alternative method called Minimax Regret Optimization (MRO), and show that under suitable conditions this method achieves uniformly low regret across all test distributions. We also adapt our technique to have stronger guarantees when the test distributions are heterogeneous in their similarity to the training data. Given the widespread optimization of worst case risks in current approaches to robust machine learning, we believe that MRO can be a strong alternative to address distribution shift scenarios.
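
The contrast between the two objectives can be written out as follows. This is only a sketch in notation introduced here for illustration (a weight class $\mathcal{W}$, a model class $\mathcal{F}$, a loss $\ell$, and a training distribution $P$ are all assumed names, not taken from the abstract): DRO minimizes the worst-case reweighted risk, whereas a regret-based objective such as MRO subtracts, for each weight function, the best risk achievable in the class before taking the worst case.

```latex
% Reweighted risk of a model f under weight function w (assumed notation)
R_w(f) \;=\; \mathbb{E}_{(x,y)\sim P}\bigl[\, w(x,y)\,\ell(f(x),y) \,\bigr]

% DRO: minimize the worst-case risk
\hat f_{\mathrm{DRO}} \;\in\; \arg\min_{f\in\mathcal{F}} \; \sup_{w\in\mathcal{W}} \; R_w(f)

% MRO: minimize the worst-case regret
\hat f_{\mathrm{MRO}} \;\in\; \arg\min_{f\in\mathcal{F}} \; \sup_{w\in\mathcal{W}} \;
  \Bigl[\, R_w(f) \;-\; \inf_{f'\in\mathcal{F}} R_w(f') \,\Bigr]
```

Under this reading, a test distribution that is intrinsically noisy (large $\inf_{f'} R_w(f')$) can dominate the DRO objective while contributing nothing to the regret-based one.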

Offline Reinforcement Learning with Realizability and Single-Policy Concentrability

Time: Saturday, July 2, 04:06 PM GMT+1

Authors: Wenhao Zhan; Baihe Huang; Audrey Huang; Nan Jiang; Jason Lee

Sample-efficiency guarantees for offline reinforcement learning (RL) often rely on strong assumptions on both the function classes (e.g., Bellman-completeness) and the data coverage (e.g., all-policy concentrability). Despite the recent efforts on relaxing these assumptions, existing works are only able to relax one of the two factors, leaving the strong assumption on the other factor intact. As an important open problem, can we achieve sample-efficient offline RL with weak assumptions on both factors?

In this paper we answer the question in the positive. We analyze a simple algorithm based on the primal-dual formulation of MDPs, where the dual variables (discounted occupancy) are modeled using a density-ratio function against offline data. With proper regularization, the algorithm enjoys polynomial sample complexity, under only realizability and single-policy concentrability. We also provide alternative analyses based on different assumptions to shed light on the nature of primal-dual algorithms for offline RL.

Non-Linear Reinforcement Learning in Large Action Spaces: Structural Conditions and Sample-Efficiency of Posterior Sampling

Time: Saturday, July 2, 04:18 PM GMT+1

Authors: Alekh Agarwal; Tong Zhang

Provably sample-efficient Reinforcement Learning (RL) with rich observations and function approximation has witnessed tremendous recent progress, particularly when the underlying function approximators are linear. In this linear regime, computationally and statistically efficient methods exist where the potentially infinite state and action spaces can be captured through a known feature embedding, with the sample complexity scaling with the (intrinsic) dimension of these features. When the action space is finite, significantly more sophisticated results allow non-linear function approximation under appropriate structural constraints on the underlying RL problem, permitting for instance, the learning of good features instead of assuming access to them. In this work, we present the first result for non-linear function approximation which holds for general action spaces under a \emph{linear embeddability} condition, which generalizes linear and finite action settings. We design a novel optimistic posterior sampling strategy, $TS^3$, for such problems, and show worst case sample complexity guarantees that scale with a rank parameter of the RL problem, the linear embedding dimension introduced in this work and standard measures of the function class complexity.

Offline Reinforcement Learning: Fundamental Barriers for Value Function Approximation

Time: Saturday, July 2, 04:30 PM GMT+1

Authors: Dylan Foster; Akshay Krishnamurthy; David Simchi-Levi; Yunzong Xu

We consider the offline reinforcement learning problem, where the aim is to learn a decision making policy from logged data. Offline RL---particularly when coupled with (value) function approximation to allow for generalization in large or continuous state spaces---is becoming increasingly relevant in practice, because it avoids costly and time-consuming online data collection and is well suited to safety-critical domains. Existing sample complexity guarantees for offline value function approximation methods typically require both (1) distributional assumptions (i.e., good coverage) and (2) representational assumptions (i.e., ability to represent some or all $Q$-value functions) stronger than what is required for supervised learning. However, the necessity of these conditions and the fundamental limits of offline RL are not well understood in spite of decades of research. This led Chen and Jiang (2019) to conjecture that concentrability (the most standard notion of coverage) and realizability (the weakest representation condition) alone are not sufficient for sample-efficient offline RL. We resolve this conjecture in the positive by proving that in general, even if both concentrability and realizability are satisfied, any algorithm requires sample complexity either polynomial in the size of the state space or exponential in other parameters to learn a non-trivial policy.

Our results show that sample-efficient offline reinforcement learning requires either restrictive coverage conditions or representation conditions that go beyond supervised learning, and highlight a phenomenon called over-coverage which serves as a fundamental barrier for offline value function approximation methods. A consequence of our results for reinforcement learning with linear function approximation is that the separation between online and offline RL can be arbitrarily large, even in constant dimension.

Computational complexity I (Saturday, July 2, 04:45 PM GMT+1)

Location: Room A

Session chairs: Tomer Koren; Pooria Joulani

Lattice-Based Methods Surpass Sum-of-Squares in Clustering

Time: Saturday, July 2, 04:45 PM GMT+1

Authors: Ilias Zadik; Min Jae Song; Alexander Wein; Joan Bruna

Clustering is a fundamental primitive in unsupervised learning which gives rise to a rich class of computationally-challenging inference tasks. In this work, we focus on the canonical task of clustering d-dimensional Gaussian mixtures with unknown (and possibly degenerate) covariance. Recent works (Ghosh et al. '20; Mao, Wein '21; Davis, Diaz, Wang '21) have established lower bounds against the class of low-degree polynomial methods and the sum-of-squares (SoS) hierarchy for recovering certain hidden structures planted in Gaussian clustering instances. Prior work on many similar inference tasks portends that such lower bounds strongly suggest the presence of an inherent statistical-to-computational gap for clustering, that is, a parameter regime where the clustering task is statistically possible but no polynomial-time algorithm succeeds.
One special case of the clustering task we consider is equivalent to the problem of finding a planted hypercube vector in an otherwise random subspace. We show that, perhaps surprisingly, this particular clustering model does not exhibit a statistical-to-computational gap, even though the aforementioned low-degree and SoS lower bounds continue to apply in this case. To achieve this, we give a polynomial-time algorithm based on the Lenstra--Lenstra--Lovász lattice basis reduction method which achieves the statistically-optimal sample complexity of d+1 samples. This result extends the class of problems whose conjectured statistical-to-computational gaps can be "closed" by "brittle" polynomial-time algorithms, highlighting the crucial but subtle role of noise in the onset of statistical-to-computational gaps.

Streaming Algorithms for Ellipsoidal Approximation of Convex Polytopes

Time: Saturday, July 2, 04:57 PM GMT+1

Authors: Yury Makarychev; Naren Manoj; Max Ovsiankin

We give efficient deterministic one-pass streaming algorithms for finding an ellipsoidal approximation of a symmetric convex polytope. The algorithms are near-optimal in that their approximation factors differ from that of the optimal offline solution only by a factor sub-logarithmic in the aspect ratio of the polytope.

Hardness of Maximum Likelihood Learning of DPPs

Time: Saturday, July 2, 05:09 PM GMT+1

Authors: Elena Grigorescu; Brendan Juba; Karl Wimmer; Ning Xie

Determinantal Point Processes (DPPs) are a widely used probabilistic model for negatively correlated sets. DPPs are used in Machine Learning applications to select a diverse, yet representative subset of data. In these applications, the parameters of the DPP need to be fit to match the data; typically, we seek a set of parameters that maximize the likelihood of the data. The algorithms used for this task either optimize over a limited family of DPPs, or else use local improvement heuristics that do not provide theoretical guarantees of optimality.

It is natural to ask if there exist efficient algorithms for finding a maximum likelihood DPP model for a given data set. In seminal work on DPPs in Machine Learning, Kulesza conjectured in his PhD Thesis (2012) that the problem is NP-complete.

In this work we prove Kulesza's conjecture. Moreover, we prove that even computing a $\left(1-\frac{1}{\mathrm{poly} \log N}\right)$-approximation to the maximum log-likelihood of a DPP on a set of $N$ items is NP-complete. At the same time, we also obtain the first polynomial-time algorithm achieving a nontrivial worst-case approximation to the optimal likelihood: we present a polynomial-time $1/\log m$-approximation algorithm (for data sets of size $m$), which moreover obtains a $\left(1-\frac{1}{\log N}\right)$-approximation if all $N$ elements appear in an $O(1/N)$-fraction of the subsets.

In terms of techniques, the hardness result reduces to solving a gap instance of a ``vector coloring'' problem on a hypergraph obtained from an adaptation of the constructions of Bogdanov, Obata and Trevisan (FOCS 2002), using the strong expanders of Alon and Capalbo (FOCS 2007).

Strong Memory Lower Bounds for Learning Natural Models

Time: Saturday, July 2, 05:21 PM GMT+1

Authors: Gavin Brown; Mark Bun; Adam Smith

We give lower bounds on the amount of memory required by one-pass streaming algorithms for solving several natural learning problems. In a setting where examples lie in $\{0,1\}^d$ and the optimal classifier can be encoded using $\kappa$ bits, we show that algorithms which learn using a near-minimal number of examples, $\tilde O(\kappa)$, must use $\tilde \Omega( d\kappa)$ bits of space. Our space bounds match the dimension of the ambient space of the problem's natural parametrization, even when it is quadratic in the size of examples and the final classifier. For instance, in the setting of $d$-sparse linear classifiers over the degree-2 polynomial kernel, for which $\kappa=\Theta(d\log d)$, our space lower bound is $\tilde\Omega(d^2)$. Our bounds degrade gracefully with the stream length $N$, generally having the form $\tilde\Omega(d\kappa \cdot \frac{\kappa}{N})$.

Bounds of the form $\Omega(d\kappa)$ were known for learning parity and other problems defined over finite fields. Bounds that apply in a narrow range of sample sizes are also known for linear regression. Ours are the first such bounds for problems of the type commonly seen in recent learning applications that apply for a large range of input sizes.

The Query Complexity of Local Search and Brouwer in Rounds

Time: Saturday, July 2, 05:33 PM GMT+1

Authors: Simina Branzei; Jiawei Li

We consider the query complexity of finding a local minimum of a function defined on a graph, where at most $k$ rounds of interaction (aka adaptivity) with the oracle are allowed. Adaptivity is a fundamental concept studied due to the need to parallelize computation and understand the speedups attainable. The query complexity of local search is tightly related to the complexity of computing stationary points of a function, thus bounds for local search can give insights into the performance of algorithms such as gradient descent.

We focus on the $d$-dimensional grid $\{1, 2, \ldots, n \}^d$, where the dimension $d \geq 2$ is a constant. We give algorithms and lower bounds that characterize the trade-off between the number of rounds of adaptivity and the query complexity of local search, when the number of rounds is constant and polynomial in $n$, respectively.

When the number of rounds $k$ is constant, the query complexity is $\Theta\bigl(n^{\frac{d^{k+1} - d^k}{d^k - 1}}\bigr)$, for both deterministic and randomized algorithms. For example, the query complexity on $[n]^2$ in two rounds is $\Theta(n^{4/3})$.
When the number of rounds is polynomial, i.e. $k = n^{\alpha}$ for $0 < \alpha < d/2$, the randomized query complexity is $\Theta\bigl(n^{(d-1) - \frac{d-2}{d}\alpha}\bigr)$ for all $d \geq 5$. For $d=3$ and $d=4$, we show the same upper bound holds and give almost matching lower bounds.
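
As a quick sanity check on the constant-round bound, the exponent $\frac{d^{k+1} - d^k}{d^k - 1}$ can be evaluated exactly for small $d$ and $k$. The sketch below is only illustrative arithmetic; the function name `constant_round_exponent` is a hypothetical helper, not something from the paper.

```python
from fractions import Fraction


def constant_round_exponent(d: int, k: int) -> Fraction:
    """Return e such that the k-round bound on the grid [n]^d reads Theta(n^e).

    Implements e = (d^(k+1) - d^k) / (d^k - 1) with exact rational arithmetic.
    """
    if d < 2 or k < 1:
        raise ValueError("expected dimension d >= 2 and rounds k >= 1")
    return Fraction(d ** (k + 1) - d ** k, d ** k - 1)


if __name__ == "__main__":
    # One round gives exponent d (query the whole grid); more rounds
    # push the exponent down towards d - 1.
    for k in (1, 2, 3, 10):
        e = constant_round_exponent(2, k)
        print(f"d=2, k={k}: exponent {e} ~ {float(e):.4f}")
```

With $k = 1$ the exponent equals $d$, which corresponds to querying every grid point, and as $k$ grows the exponent decreases towards $d - 1$.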

The local search analysis also enables us to characterize the query complexity of computing a Brouwer fixed point in rounds. Our proof technique for lower bounding the query complexity in rounds may be of independent interest as an alternative to the classical relational adversary method of Aaronson from the fully adaptive setting.

Virtual session III (Saturday, July 2, 04:45 PM GMT+1)

Location: Room B

Session chairs: Tor Lattimore; Tim van Erven

Tight Query Complexity Bounds for Learning Graph Partitions

Time: Saturday, July 2, 04:45 PM GMT+1

Authors: Xizhi Liu; Sayan Mukherjee

Given a partition of a graph into connected components, the membership oracle asserts whether any two vertices of the graph lie in the same component or not.
We prove that for $n\ge k\ge 2$, learning the components of an $n$-vertex hidden graph with $k$ components requires at least $(k-1)n-\binom k2$ membership queries.
Our result improves on the best known information-theoretic bound of $\Omega(n\log k)$ queries, and exactly matches the query complexity of the algorithm introduced by [Reyzin and Srivastava, 2007] for this problem.
Additionally, we introduce an oracle that can learn the number of components of $G$ in asymptotically fewer queries than learning the full partition, thus answering another question posed by the same authors.
Lastly, we introduce a more applicable version of this oracle, and prove asymptotically tight bounds of $\widetilde\Theta(m)$ queries for both learning and verifying an $m$-edge hidden graph $G$ using it.
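
To make the count $(k-1)n-\binom k2$ concrete, here is a minimal sketch of a representative-based strategy that never exceeds that many membership queries. It assumes the number of components $k$ is known in advance, and it is not claimed to be the algorithm of Reyzin and Srivastava; the names `learn_partition`, `same_component` and `query_bound` are hypothetical.

```python
from math import comb
from typing import Callable, List, Tuple


def learn_partition(
    n: int, k: int, same_component: Callable[[int, int], bool]
) -> Tuple[List[int], int]:
    """Label vertices 0..n-1 by component using a membership oracle.

    Keeps one representative per discovered component and compares each new
    vertex against them. Returns (labels, number_of_queries).
    """
    reps: List[int] = []   # one representative vertex per discovered component
    labels: List[int] = []
    queries = 0
    for v in range(n):
        # Once all k components are known, failing against k-1 representatives
        # already implies membership in the remaining one.
        to_try = reps if len(reps) < k else reps[:-1]
        assigned = None
        for idx, r in enumerate(to_try):
            queries += 1
            if same_component(r, v):
                assigned = idx
                break
        if assigned is None:
            if len(reps) < k:
                reps.append(v)          # v starts a new component
                assigned = len(reps) - 1
            else:
                assigned = k - 1        # only the last component is left
        labels.append(assigned)
    return labels, queries


def query_bound(n: int, k: int) -> int:
    """The quantity (k-1)n - C(k,2) from the abstract."""
    return (k - 1) * n - comb(k, 2)


if __name__ == "__main__":
    truth = [0, 1, 2, 2, 2, 2]          # a worst-case ordering for n=6, k=3
    labels, used = learn_partition(6, 3, lambda u, v: truth[u] == truth[v])
    print(labels, used, query_bound(6, 3))
```

Vertex $v$ (counting from 0) costs at most $\min(v, k-1)$ queries, and summing this over all vertices gives exactly $(k-1)n-\binom k2$, so the sketch meets the bound on worst-case inputs such as the one in the usage example.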

ROOT-SGD: Sharp Nonasymptotics and Asymptotic Efficiency in a Single Algorithm

Time: Saturday, July 2, 04:57 PM GMT+1

Authors: Chris Li; Wenlong Mou; Martin Wainwright; Michael Jordan

We study the problem of solving strongly convex and smooth unconstrained optimization problems using stochastic first-order algorithms. We devise a novel algorithm, referred to as Recursive One-Over-T SGD (ROOT-SGD), based on an easily implementable and recursive averaging of past stochastic gradients. We prove that it simultaneously achieves state-of-the-art performance in both a finite-sample, nonasymptotic sense and an asymptotic sense. On the nonasymptotic side, we prove risk bounds on the last iterate of ROOT-SGD with leading-order terms that match the optimal statistical risk with a unity pre-factor, along with a higher-order term that scales at the sharp rate of $O(n^{-3/2})$. On the asymptotic side, we show that when a mild, one-point Hessian continuity condition is imposed, the rescaled last iterate of (multi-epoch) ROOT-SGD converges asymptotically to a Gaussian limit with the Cram\'{e}r-Rao optimal asymptotic covariance, for a broad range of stepsize choices.

Strong Gaussian Approximation for the Sum of Random Vectors

Time: Saturday, July 2, 05:09 PM GMT+1

Authors: Nazar Buzun; Nikolay Shvetsov; Dmitry V. Dylov

This paper derives a new strong Gaussian approximation bound for the sum of independent random vectors. The approach relies on the optimal transport theory and yields explicit dependence on the dimension size p and the sample size n. This dependence establishes a new fundamental limit for all practical applications of statistical learning theory. Particularly, based on this bound, we prove approximation in distribution for the maximum norm in a high-dimensional setting (p > n).

Gardner Formula for Ising Perceptron Models at Small Densities

Time: Saturday, July 2, 05:21 PM GMT+1

Authors: Erwin Bolthausen; Shuta Nakajima; Nike Sun; Changji Xu

We consider the Ising perceptron model with N spins and M = N*alpha patterns, with a general activation function U that is bounded above. For U bounded away from zero, or U a one-sided threshold function, it was shown by Talagrand (2000, 2011) that for small densities alpha, the free energy of the model converges in the large-N limit to the replica symmetric formula conjectured in the physics literature (Krauth--Mezard 1989, see also Gardner--Derrida 1988). We give a new proof of this result, which covers the more general class of all functions U that are bounded above and satisfy a certain variance bound. The proof uses the (first and second) moment method conditional on the approximate message passing iterates of the model. In order to deduce our main theorem, we also prove a new concentration result for the perceptron model in the case where U is not bounded away from zero.

Private Matrix Approximation and Geometry of Unitary Orbits

Time: Saturday, July 2, 05:33 PM GMT+1

Authors: Oren Mangoubi; Yikai Wu; Satyen Kale; Abhradeep Thakurta; Nisheeth Vishnoi

Consider the following optimization problem: Given $n \times n$ matrices $A$ and $\Lambda$, maximize $\langle A, U\Lambda U^*\rangle$ where $U$ varies over the unitary group $\mathrm{U}(n)$. This problem seeks to approximate $A$ by a matrix whose spectrum is the same as $\Lambda$ and, by setting $\Lambda$ to be appropriate diagonal matrices, one can recover matrix approximation problems such as PCA and rank-$k$ approximation. We study the problem of designing differentially private algorithms for this optimization problem in settings where the matrix $A$ is constructed using users' private data. We give efficient and private algorithms that come with upper and lower bounds on utility. Our results unify and improve upon several prior works on private matrix approximation problems. They rely on extensions of packing/covering number bounds for Grassmannians to unitary orbits which should be of independent interest.

The Structured Abstain Problem and the Lovász Hinge

Time: Saturday, July 2, 05:45 PM GMT+1

Authors: Enrique Nueve; Rafael Frongillo; Jessica Finocchiaro

The Lovász hinge is a convex surrogate recently proposed for structured binary classification, in which k binary predictions are made simultaneously and the error is judged by a submodular set function. Despite its wide usage in image segmentation and related problems, its consistency has remained open. We resolve this open question, showing that the Lovász hinge is inconsistent for its desired target unless the set function is modular. Leveraging a recent embedding framework, we instead derive the target loss for which the Lovász hinge is consistent. This target, which we call the structured abstain problem, allows one to abstain on any subset of the k predictions. We derive two link functions, each of which is consistent for all submodular set functions simultaneously.

Online learning and optimization II (Sunday, July 3, 09:00 AM GMT+1)

Location: Room A

Session chairs: Pooria Joulani; Wouter Koolen

Parameter-Free Mirror Descent

Time: Sunday, July 3, 09:00 AM GMT+1

Authors: Andrew Jacobsen; Ashok Cutkosky

We develop a modified online mirror descent framework that is suitable for building adaptive and parameter-free algorithms in unbounded domains. We leverage this technique to develop the first unconstrained online linear optimization algorithm achieving an optimal dynamic regret bound, and we further demonstrate that natural strategies based on Follow-the-Regularized-Leader are unable to achieve similar results. We also apply our mirror descent framework to build new parameter-free implicit updates, as well as a simplified and improved unconstrained scale-free algorithm.

Scale-Free Unconstrained Online Learning for Curved Losses

Time: Sunday, July 3, 09:12 AM GMT+1

Authors: Jack Mayo; Hedi Hadiji; Tim van Erven

A sequence of works in unconstrained online convex optimisation have investigated the possibility of adapting simultaneously to the norm U of the comparator and the maximum norm G of the gradients. In full generality, matching upper and lower bounds are known which show that this comes at the unavoidable cost of an additive GU^3, which is not needed when either G or U is known in advance. Surprisingly, recent results by Kempka et al. (2019) show that no such price for adaptivity is needed in the specific case of 1-Lipschitz losses like the hinge loss. We follow up on this observation by showing that there is in fact never a price to pay for adaptivity if we specialise to any of the other common supervised online learning losses: our results cover log loss, (linear and non-parametric) logistic regression, square loss prediction, and (linear and non-parametric) least-squares regression. We also fill in several gaps in the literature by providing matching lower bounds with an explicit dependence on U. In all cases we obtain scale-free algorithms, which are suitably invariant under rescaling of the data. Our general goal is to establish achievable rates without concern for computational efficiency, but for linear logistic regression we also provide an adaptive method that is as efficient as the recent non-adaptive algorithm by Agarwal et al. (2021).

Efficient Projection-Free Online Convex Optimization with Membership Oracle

Time: Sunday, July 3, 09:24 AM GMT+1

Authors: Zakaria Mhammedi

In constrained convex optimization, existing methods based on the ellipsoid or cutting plane method do not scale well with the dimension of the ambient space. Alternative approaches such as Projected Gradient Descent only provide a computational benefit for simple convex sets such as Euclidean balls, where Euclidean projections can be performed efficiently. For other sets, the cost of the projections can be too high. To circumvent these issues, alternative methods based on the famous Frank-Wolfe algorithm have been studied and used. Such methods use a Linear Optimization Oracle at each iteration instead of Euclidean projections; the former can often be performed efficiently. Such methods have also been extended to the online and stochastic optimization settings. However, the Frank-Wolfe algorithm and its variants do not achieve the optimal performance, in terms of regret or rate, for general convex sets. What is more, the Linear Optimization Oracle they use can still be computationally expensive in some cases. In this paper, we move away from Frank-Wolfe style algorithms and present a new reduction that turns any algorithm $\mathcal{A}$ defined on a Euclidean ball (where projections are cheap) to an algorithm on a constrained set $\mathcal{K}$ contained within the ball, without sacrificing the performance of the original algorithm $\mathcal{A}$ by much. Our reduction requires $O(T \ln T)$ calls to a Membership Oracle on $\mathcal{K}$ after $T$ rounds, and no linear optimization on $\mathcal{K}$ is needed. Using our reduction, we recover optimal regret bounds [resp. rates], in terms of the number of iterations, in online [resp. stochastic] convex optimization. Our guarantees are also useful in the offline convex optimization setting when the dimension of the ambient space is large.

Damped Online Newton Step for Portfolio Selection

Time: Sunday, July 3, 09:36 AM GMT+1

Authors: Zakaria Mhammedi; Alexander Rakhlin

We revisit the classic online portfolio selection problem, where at each round a learner selects a distribution over a set of portfolios to allocate its wealth. It is known that for this problem a logarithmic regret with respect to Cover's loss is achievable using the Universal Portfolio Selection algorithm, for example. However, all existing algorithms that achieve a logarithmic regret for this problem have per-round time and space complexities that scale polynomially with the total number of rounds, making them impractical. In this paper, we build on the recent work by Luo et al. (2018) and present the first practical online portfolio selection algorithm with a logarithmic regret and whose per-round time and space complexities depend only logarithmically on the horizon. Behind our approach are two key technical novelties. We first show that the Damped Online Newton steps can approximate mirror descent iterates well, even when dealing with time-varying regularizers. Second, we present a new meta-algorithm that achieves a strongly adaptive, logarithmic regret (i.e. a logarithmic regret on any sub-interval) for mixable losses.

Deep learning I (Sunday, July 3, 09:00 AM GMT+1)

Location: Room B

Session chairs: Matus Telgarsky; Tengyuan Liang

Depth and Feature Learning are Provably Beneficial for Neural Network Discriminators

Time: Sunday, July 3, 09:00 AM GMT+1

Authors: Carles Domingo-Enrich

We construct pairs of distributions $\mu_d, \nu_d$ on $\mathbb{R}^d$ such that the quantity $|\mathbb{E}_{x \sim \mu_d} [F(x)] - \mathbb{E}_{x \sim \nu_d} [F(x)]|$ decreases as $\Omega(1/d^2)$ for some three-layer ReLU network $F$ with polynomial width and weights, while declining exponentially in $d$ if $F$ is any two-layer network with polynomial weights. This shows that deep GAN discriminators are able to distinguish distributions that shallow discriminators cannot. Analogously, we build pairs of distributions $\mu_d, \nu_d$ on $\mathbb{R}^d$ such that $|\mathbb{E}_{x \sim \mu_d} [F(x)] - \mathbb{E}_{x \sim \nu_d} [F(x)]|$ decreases as $\Omega(1/(d\log d))$ for two-layer ReLU networks with polynomial weights, while declining exponentially for bounded-norm functions in the associated RKHS. This confirms that feature learning is beneficial for discriminators. Our bounds are based on Fourier transforms.

Width Is Less Important than Depth in ReLU Neural Networks

Time: Sunday, July 3, 09:12 AM GMT+1

Authors: Gal Vardi; Gilad Yehudai; Ohad Shamir

We solve an open question from Lu et al. (2017), by showing that any target network with inputs in $\mathbb{R}^d$ can be approximated by a width $O(d)$ network (independent of the target network's architecture), whose number of parameters is essentially larger only by a linear factor. In light of previous depth separation theorems, which imply that a similar result cannot hold when the roles of width and depth are interchanged, it follows that depth plays a more significant role than width in the expressive power of neural networks.
We extend our results to constructing networks with bounded weights, and to constructing networks with width at most $d+2$, which is close to the minimal possible width due to previous lower bounds. Both of these constructions cause an extra polynomial factor in the number of parameters over the target network. We also show an exact representation of wide and shallow networks using deep and narrow networks which, in certain cases, does not increase the number of parameters over the target network.

The Merged-Staircase Property: A Necessary and Nearly Sufficient Condition for SGD Learning of Sparse Functions on Two-Layer Neural Networks

Time: Sunday, July 3, 09:24 AM GMT+1

Authors: Emmanuel Abbe; Enric Boix Adsera; Theodor Misiakiewicz

A characterization of functions that neural networks can learn with SGD is currently known for two extremal parametrizations: neural networks in the linear regime, and neural networks with no structural constraints. However, for the main parametrization of interest ---non-linear but regular networks--- no tight characterization has yet been achieved, despite significant insights.

We take a step in this direction by considering depth-2 neural networks trained by SGD in the mean-field regime. We consider functions on binary inputs that have latent low-dimensional structure (i.e., a sparse Fourier representation). This regime is of interest since it remains poorly understood how neural networks routinely tackle high-dimensional datasets and adapt to latent low-dimensional structure without suffering from the curse of dimensionality.
Accordingly, we study SGD-learnability with $O(d)$ sample complexity in a large ambient dimension $d$.

Our main result characterizes a hierarchical property ---the merged-staircase property--- that is both necessary and nearly sufficient for learning in this setting.
A key tool is a new approximation result based on a ``dimension-free'' dynamics that applies to functions defined on a latent space of low-dimension. We further show that non-linear training is necessary: for such classes of functions, linear methods on any feature map (e.g., the NTK) are not capable of learning efficiently.

Neural Networks Can Learn Representations with Gradient Descent

Time: Sunday, July 3, 09:36 AM GMT+1

Authors: Alexandru Damian; Jason Lee; Mahdi Soltanolkotabi

Significant theoretical work has established that in specific regimes, neural networks trained by gradient descent behave like kernel methods. However, in practice, it is known that neural networks strongly outperform their associated kernels. In this work, we explain this gap by demonstrating that there is a large class of functions which cannot be efficiently learned by kernel methods but can be easily learned with gradient descent on a two layer neural network outside the kernel regime by learning representations that are relevant to the target task. We also demonstrate that these representations allow for efficient transfer learning, which is impossible in the kernel regime.

Specifically, we consider the problem of learning polynomials which depend on only a few relevant directions, i.e. of the form f*(x) = g(Ux) where U maps from d to r dimensions with d ≫ r. When the degree of f* is p, it is known that n≍d^p samples are necessary to learn f* in the kernel regime. Our primary result is that gradient descent learns a representation of the data which depends only on the directions relevant to f*. This results in an improved sample complexity of n≍d^2 and enables transfer learning with sample complexity independent of d.

Best paper presentation II (Sunday, July 3, 09:55 AM GMT+1)

Location: None

Session chair: Maxim Raginsky

New Projection-Free Algorithms for Online Convex Optimization with Adaptive Regret Guarantees

Time: Sunday, July 3, 09:55 AM GMT+1

Authors: Ben Kretzu; Dan Garber

We present new efficient \textit{projection-free} algorithms for online convex optimization (OCO), where by projection-free we refer to algorithms that avoid computing orthogonal projections onto the feasible set, and instead rely on different and potentially much more efficient oracles. While most state-of-the-art projection-free algorithms are based on the \textit{follow-the-leader} framework, our algorithms are fundamentally different and are based on the \textit{online gradient descent} algorithm with a novel and efficient approach to computing so-called \textit{infeasible projections}. As a consequence, we obtain the first projection-free algorithms which naturally yield \textit{adaptive regret} guarantees, i.e., regret bounds that hold w.r.t. any sub-interval of the sequence.
Concretely, when assuming the availability of a linear optimization oracle (LOO) for the feasible set, on a sequence of length $T$, our algorithms guarantee $O(T^{3/4})$ adaptive regret and $O(T^{3/4})$ adaptive expected regret, for the full-information and bandit settings, respectively, using only $O(T)$ calls to the LOO. These bounds match the current state-of-the-art regret bounds for LOO-based projection-free OCO, which are \textit{not adaptive}.
We also consider a new natural setting in which the feasible set is accessible through a separation oracle.
We present algorithms which, using overall $O(T)$ calls to the separation oracle, guarantee $O(\sqrt{T})$ adaptive regret and $O(T^{3/4})$ adaptive expected regret for the full-information and bandit settings, respectively.
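
To make the oracle model concrete, the sketch below shows a linear optimization oracle for one simple feasible set, the probability simplex, together with a Frank-Wolfe style update that stays feasible without any projection. It only illustrates what a single LOO call looks like; it is not the algorithm of this paper, and the names `loo_simplex` and `frank_wolfe_step` are hypothetical.

```python
from typing import List


def loo_simplex(gradient: List[float]) -> List[float]:
    """Linear optimization oracle for the probability simplex.

    Returns a minimizer of <gradient, v> over the simplex, which is
    always attained at a vertex (a standard basis vector).
    """
    best = min(range(len(gradient)), key=lambda i: gradient[i])
    return [1.0 if i == best else 0.0 for i in range(len(gradient))]


def frank_wolfe_step(x: List[float], gradient: List[float], step: float) -> List[float]:
    """Move toward the oracle's vertex by a convex combination.

    Because the new point is a convex combination of two feasible
    points, it is feasible and no projection is needed.
    """
    v = loo_simplex(gradient)
    return [(1.0 - step) * xi + step * vi for xi, vi in zip(x, v)]


x = [1.0 / 3, 1.0 / 3, 1.0 / 3]   # feasible starting point
g = [0.5, -1.0, 2.0]              # illustrative gradient (assumed values)
x_next = frank_wolfe_step(x, g, step=0.5)
```

For the simplex the oracle costs one pass over the gradient, whereas an exact Euclidean projection requires sorting or a comparable routine; the gap is much larger for sets such as nuclear-norm balls.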

Generalization (Sunday, July 3, 10:45 AM GMT+1)

Location: Room A

Session chairs: Aryeh Kontorovich; Tengyuan Liang

On the Benefits of Large Learning Rates for Kernel Methods

Time: Sunday, July 3, 10:45 AM GMT+1

Authors: Gaspard Beugnot; Julien Mairal; Alessandro Rudi

This paper studies an intriguing phenomenon related to the good generalization performance of estimators obtained by using large learning rates within gradient descent algorithms.
Although this phenomenon was first observed in the deep learning literature, we show that it can be precisely characterized in the context of kernel methods, even though the resulting optimization problem is convex. Specifically, we consider the minimization of a quadratic objective in a separable Hilbert space, and show that with early stopping, the choice of learning rate influences the spectral decomposition of the obtained solution on the Hessian's eigenvectors. This extends an intuition described by Nakkiran (2020) on a two-dimensional toy problem to realistic learning scenarios such as kernel ridge regression. While large learning rates may be proven beneficial as soon as there is a mismatch between the train and test objectives, we further explain why it already occurs in classification tasks without assuming any particular mismatch between train and test data distributions.
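
The dependence on the learning rate can be seen in a two-dimensional toy quadratic. The sketch below is an illustration based on the standard gradient descent recursion, not the analysis of the paper: it works directly in the Hessian's eigenbasis, starts from zero, and uses assumed eigenvalues, step sizes and iteration count. After `steps` iterations the component along an eigenvector with eigenvalue `lam` equals `(1 - (1 - lr * lam) ** steps)` times the target component.

```python
def gd_quadratic(eigs, target, lr, steps):
    """Gradient descent on 0.5 * sum_i eigs[i] * (w[i] - target[i])**2, from w = 0."""
    w = [0.0] * len(eigs)
    for _ in range(steps):
        w = [wi - lr * lam * (wi - ti) for wi, lam, ti in zip(w, eigs, target)]
    return w


def closed_form(eigs, target, lr, steps):
    """Per-eigendirection solution of the same recursion."""
    return [(1.0 - (1.0 - lr * lam) ** steps) * ti for lam, ti in zip(eigs, target)]


eigs = [1.0, 0.01]     # one large and one small Hessian eigenvalue (assumed)
target = [1.0, 1.0]    # minimizer of the quadratic (assumed)

small = gd_quadratic(eigs, target, lr=0.1, steps=20)
large = gd_quadratic(eigs, target, lr=1.9, steps=20)   # stable since 1.9 < 2 / max(eigs)

print("small lr:", small)
print("large lr:", large)
```

With these values both runs fit the large-eigenvalue direction to the same degree, since |1 - lr * 1.0| is 0.9 in both cases, but the large learning rate recovers far more of the small-eigenvalue direction by the time training stops.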

Monotone Learning

Time: Sunday, July 3, 10:57 AM GMT+1

Authors: Olivier Bousquet; Amit Daniely; Haim Kaplan; Yishay Mansour; Shay Moran; Uri Stemmer

The amount of training-data is one of the key factors which determines
the generalization capacity of learning algorithms.
Intuitively, one expects the error rate to decrease
as the amount of training-data increases.
Perhaps surprisingly, natural attempts to formalize this intuition give rise to
interesting and challenging mathematical questions.
For example, in their classical book on pattern recognition,
Devroye, Gyorfi and Lugosi (1996) ask whether there exists a monotone Bayes-consistent algorithm. This question remained open for over 25 years,
until recently Pestov (2021) resolved it for binary classification,
using an intricate construction of a monotone Bayes-consistent algorithm.

We derive a general result in multiclass classification, showing that
every learning algorithm $A$ can be transformed to a monotone one with similar performance. Further, the transformation is efficient and only uses a black-box oracle access to $A$. This demonstrates that one can provably avoid non-monotonic behaviour without compromising performance, thus answering
questions asked by Devroye, Gyorfi, and Lugosi (1996), Viering, Mey, and Loog (2019), Viering and Loog (2021), and by Mhammedi (2021).

Our general transformation readily implies monotone learners in a variety of contexts: for example, Pestov's result follows by applying it on \emph{any}
Bayes-consistent algorithm (e.g., $k$-Nearest-Neighbours).
In fact, our transformation extends Pestov's result
to classification tasks with an arbitrary number of labels.
This is in contrast with Pestov's work, which is tailored to binary classification.

In addition, we provide uniform bounds on the error of the monotone algorithm. This makes our transformation applicable in distribution-free settings.
For example, in PAC learning it implies that every
learnable class admits a monotone PAC learner.
This resolves questions asked by Viering, Mey, and Loog (2019); Viering and Loog (2021); Mhammedi (2021).

Generalization Bounds for Data-Driven Numerical Linear Algebra

Time: Sunday, July 3, 11:09 AM GMT+1

Authors: Peter Bartlett; Piotr Indyk; Tal Wagner

Data-driven algorithms can adapt their internal structure or parameters to inputs from unknown application-specific distributions, by learning from a training sample of inputs. Several recent works have applied this approach to problems in numerical linear algebra, obtaining significant empirical gains in performance. However, no theoretical explanation for their success was known.

In this work we prove generalization bounds for those algorithms, within the PAC-learning framework for data-driven algorithm selection proposed by Gupta and Roughgarden (SICOMP 2017). Our main result is an almost tight bound on the fat shattering dimension of the learning-based low rank approximation algorithm of Indyk et al.~(NeurIPS 2019). Our techniques are general, and provide generalization bounds for many other recently proposed data-driven algorithms in numerical linear algebra, covering both sketching-based and multigrid-based methods. This considerably broadens the class of data-driven algorithms for which a PAC-learning analysis is available.

Generalization Bounds via Convex Analysis

Time: Sunday, July 3, 11:21 AM GMT+1

Authors: Gabor Lugosi; Gergely Neu

Since the celebrated works of Russo and Zou (2016,2019) and Xu and Raginsky (2017), it has been well known that the generalization error of supervised learning algorithms can be bounded in terms of the mutual information between their input and the output, given that the loss of any fixed hypothesis has a subgaussian tail. In this work, we generalize this result beyond the standard choice of Shannon's mutual information to measure the dependence between the input and the output. Our main result shows that it is indeed possible to replace the mutual information by any strongly convex function of the joint input-output distribution, with the subgaussianity condition on the losses replaced by a bound on an appropriately chosen norm capturing the geometry of the dependence measure. This allows us to derive a range of generalization bounds that are either entirely new or strengthen previously known ones. Examples include bounds stated in terms of $p$-norm divergences and the Wasserstein-2 distance, which are respectively applicable for heavy-tailed loss distributions and highly smooth loss functions. Our analysis is entirely based on elementary tools from convex analysis by tracking the growth of a potential function associated with the dependence measure and the loss function.
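
For reference, the mutual-information bound that this abstract takes as its starting point is commonly stated as follows. The notation is an assumption made here for illustration: $S = (Z_1, \dots, Z_n)$ is the i.i.d. training sample drawn from $\mu$, $W$ is the output of the algorithm, $L_\mu$ and $L_S$ denote the population and empirical risks, and $\ell(w, Z)$ is assumed $\sigma$-subgaussian under $Z \sim \mu$ for every fixed $w$.

```latex
% Mutual-information generalization bound in the form of Xu and Raginsky (2017).
% Symbols L_\mu, L_S, S, W are notational assumptions, not taken from the abstract.
\left| \mathbb{E}\big[ L_\mu(W) - L_S(W) \big] \right|
  \;\le\; \sqrt{\frac{2\sigma^2}{n}\, I(S; W)}
```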

Chained Generalisation Bounds

Time: Sunday, July 3, 11:33 AM GMT+1

Authors: Eugenio Clerico; Amitis Shidani; George Deligiannidis; Arnaud Doucet

This work discusses how to derive upper bounds for the expected generalisation error of supervised learning algorithms by means of the chaining technique. By developing a general theoretical framework, we establish a duality between generalisation bounds based on the regularity of the loss function, and their chained counterparts, which can be obtained by lifting the regularity assumption from the loss onto its gradient. This allows us to re-derive the chaining mutual information bound from the literature, and to obtain novel chained information-theoretic generalisation bounds, based on the Wasserstein distance and other probability metrics. We show on some toy examples that the chained generalisation bound can be significantly tighter than its standard counterpart, particularly when the distribution of the hypotheses selected by the algorithm is very concentrated.

Memorize to Generalize: On the Necessity of Interpolation in High Dimensional Linear Regression

Time: Sunday, July 3, 11:45 AM GMT+1

Authors: Chen Cheng; John Duchi; Rohith Kuditipudi

We examine the necessity of interpolation in overparameterized
models, that is, when achieving optimal predictive risk
in machine learning problems requires (nearly)
interpolating the training data. In particular, we consider simple
overparameterized linear regression $y = X \theta + w$
with random design $X \in \mathbb{R}^{n \times d}$ under the proportional asymptotics
$d/n \to \gamma \in (1, \infty)$. We precisely characterize how prediction error
necessarily scales with training error in this setting. An
implication of this characterization is that as the label noise variance
$\sigma^2 \to 0$, any estimator that incurs at least $\mathsf{c}\sigma^4$
training error for some constant $\mathsf{c}$ is necessarily suboptimal
and will suffer growth in excess prediction error at least
linear in the training error. Thus, optimal performance requires fitting
training data to substantially higher accuracy than the inherent noise
floor of the problem.

Online learning and optimization III (Sunday, July 3, 10:45 AM GMT+1)

Location: Room B

Session chairs: Tim van Erven; Steve Hanneke

Universal Online Learning with Bounded Loss: Reduction to Binary Classification

Time: Sunday, July 3, 10:45 AM GMT+1

Authors: Moise Blanchard; Romain Cosson

We study universal consistency of non-i.i.d. processes in the context of online learning. A stochastic process is said to admit universal consistency if there exists a learner that achieves vanishing average loss for any measurable response function on this process. When the loss function is unbounded, [1] showed that the only processes admitting strong universal consistency are those taking a finite number of values almost surely. However, when the loss function is bounded, the class of processes admitting strong universal consistency is much richer and its characterization could be dependent on the response setting [2]. In this paper, we show that this class of processes is independent of the response setting, thereby closing an open question of [3] (Open Problem 3). Specifically, we show that the class of processes that admit universal online learning is the same for binary classification as for multiclass classification with a countable number of classes. Consequently, any output setting with bounded loss can be reduced to binary classification. Our reduction is constructive and practical. Indeed, we show that the nearest neighbor algorithm is transported by our construction. For binary classification on a process admitting strong universal learning, we prove that nearest neighbor successfully learns at least all finite unions of intervals.

Universal Online Learning: An Optimistically Universal Learning Rule

Time: Sunday, July 3, 10:57 AM GMT+1

Authors: Moise Blanchard

We study the subject of universal online learning with non-i.i.d. processes for bounded losses. The notion of universally consistent learning was defined by Hanneke in an effort to study learning theory under minimal assumptions, where the objective is to obtain low long-run average loss for any target function. We are interested in characterizing processes for which learning is possible and whether there exist learning rules guaranteed to be universally consistent given the only assumption that such learning is possible. The case of unbounded losses is very restrictive since the learnable processes almost surely have to visit a finite number of points and as a result, simple memorization is optimistically universal. We focus on the bounded setting and give a complete characterization of the processes admitting strong and weak universal learning. We further show that the k-nearest neighbor algorithm (kNN) is not optimistically universal and present a novel variant of 1NN which is optimistically universal for general input and value spaces in both strong and weak settings. This closes all the COLT 2021 open problems posed on universal online learning.

How Catastrophic Can Catastrophic Forgetting Be in Linear Regression?

Time: Sunday, July 3, 11:09 AM GMT+1

Authors: Itay Evron; Edward Moroshko; Rachel Ward; Nathan Srebro; Daniel Soudry

To better understand catastrophic forgetting, we study fitting an overparameterized linear model to a sequence of tasks with different input distributions.
We analyze how much the model forgets the true labels of earlier tasks after training on subsequent tasks, obtaining exact expressions and bounds.
We establish connections between continual learning in the linear setting and two other research areas --
alternating projections and the Kaczmarz method.
In specific settings, we highlight differences between forgetting and convergence to the offline solution as studied in those areas.
In particular, when $T$ tasks in $d$ dimensions are presented cyclically for $k$ iterations, we prove an upper bound of $T^2\min\{1/\sqrt{k},d/k\}$ on the forgetting.
This stands in contrast to the convergence to the offline solution, which can be arbitrarily slow according to existing alternating projection results.
We further show that the $T^2$ factor can be lifted when tasks are presented in a random ordering.
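
The link to the Kaczmarz method can be illustrated in the simplest case where every task consists of a single sample. The sketch below rests on stated assumptions: fitting a task exactly with the smallest change to the weights is taken to be a projection onto that task's solution set, which for one sample is the classical Kaczmarz update, and forgetting is measured as the mean squared residual over all tasks, a simplification of the paper's definition. The task data and function names are hypothetical.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def kaczmarz_update(w, x, y):
    """Project w onto the hyperplane {v : <x, v> = y} (fit one task exactly)."""
    scale = (y - dot(w, x)) / dot(x, x)
    return [wi + scale * xi for wi, xi in zip(w, x)]


def forgetting(w, tasks):
    """Mean squared residual of the current weights over all tasks."""
    return sum((y - dot(w, x)) ** 2 for x, y in tasks) / len(tasks)


# T = 2 single-sample tasks in d = 3 dimensions, jointly realizable by w* = [1, 1, 0].
tasks = [([1.0, 0.0, 0.0], 1.0), ([1.0, 1.0, 0.0], 2.0)]

w = [0.0, 0.0, 0.0]
history = []
for cycle in range(5):            # k = 5 cyclic passes over the tasks
    for x, y in tasks:
        w = kaczmarz_update(w, x, y)
    history.append(forgetting(w, tasks))

print(history)
```

In this toy instance the forgetting shrinks by a constant factor per cycle; the bounds in the abstract concern the worst case over task sequences, where no such geometric rate is guaranteed.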

Strategizing Against Learners in Bayesian Games

Time: Sunday, July 3, 11:21 AM GMT+1

Authors: Yishay Mansour; Mehryar Mohri; Jon Schneider; Balasubramanian Sivan

We study repeated two-player games where one of the players, the learner, employs a no-regret learning strategy, while the other, the optimizer, is a rational utility maximizer. We consider general Bayesian games, where the payoffs of both the optimizer and the learner could depend on the type, which is drawn from a publicly known distribution, but revealed privately to the learner. We address the following questions: (a) what is the bare minimum that the optimizer can guarantee to obtain regardless of the no-regret learning algorithm employed by the learner? (b) are there learning algorithms that cap the optimizer payoff at this minimum? (c) can these algorithms be implemented efficiently? While building this theory of optimizer-learner interactions, we define a new combinatorial notion of regret called polytope swap regret, that could be of independent interest in other settings.

Plenary I (Sunday, July 3, 02:00 PM GMT+1)

Location: Room A

Session chair: Po-Ling Loh

Recent advances in streaming and private heavy hitters

Time: Sunday, July 3, 02:00 PM GMT+1

Speaker: Jelani Nelson

We survey recent results on finding heavy hitters in the streaming model, as well as in a distributed model while satisfying local differential privacy. Some tools and concepts commonly employed in learning theory will make appearances, such as spectral clustering and (a pseudorandom variant of) Rademacher complexity.

Jelani Nelson is a Professor in the Department of Electrical Engineering and Computer Sciences at UC Berkeley, and also a part-time Research Scientist at Google. His research interests include sketching and streaming algorithms, random projections and their applications to randomized linear algebra and compressed sensing, and differential privacy. He is a recipient of the Presidential Early Career Award for Scientists and Engineers, a Sloan Research Fellowship, and Best Paper Awards at PODS 2010 and 2022. He is also Founder and President of AddisCoder, Inc., which has provided free algorithms training to over 500 Ethiopian high school students since 2011, and which is co-launching a similar "JamCoders" program in Kingston, Jamaica this summer.

Deep learning II (Sunday, July 3, 03:30 PM GMT+1)

Location: Room A

Session chairs: Dylan Foster; Satyen Kale

Optimization-Based Separations for Neural Networks

Time: Sunday, July 3, 03:30 PM GMT+1

Authors: Itay Safran; Jason Lee

Depth separation results propose a possible theoretical explanation for the benefits of deep neural networks over shallower architectures, establishing that the former possess superior approximation capabilities. However, there are no known results in which the deeper architecture leverages this advantage into a provable optimization guarantee. We prove that when the data are generated by a distribution with radial symmetry which satisfies some mild assumptions, gradient descent can efficiently learn ball indicator functions using a depth 2 neural network with two layers of sigmoidal activations, and where the hidden layer is held fixed throughout training. By building on and refining existing techniques for approximation lower bounds of neural networks with a single layer of non-linearities, we show that there are $d$-dimensional radial distributions on the data such that ball indicators cannot be learned efficiently by any algorithm to accuracy better than $\Omega(d^{-4})$, nor by a standard gradient descent implementation to accuracy better than a constant. These results establish what are, to the best of our knowledge, the first optimization-based separations where the approximation benefits of the stronger architecture provably manifest in practice. Our proof technique introduces new tools and ideas that may be of independent interest in the theoretical study of both the approximation and optimization of neural networks.

Wasserstein GANs with Gradient Penalty Compute Congested Transport

Time: Sunday, July 3, 03:42 PM GMT+1

Authors: Tristan Milne; Adrian Nachman

Wasserstein GANs with Gradient Penalty (WGAN-GP) are a very popular method for training generative models to produce high quality synthetic data. While WGAN-GP were initially developed to calculate the Wasserstein 1 distance between generated and real data, recent works (e.g. [22]) have provided empirical evidence that this does not occur, and have argued that WGAN-GP perform well not in spite of this issue, but because of it. In this paper we show for the first time that WGAN-GP compute the minimum of a different optimal transport problem, the so-called congested transport [7]. Congested transport determines the cost of moving one distribution to another under a transport model that penalizes congestion. For WGAN-GP, we find that the congestion penalty has a spatially varying component determined by the sampling strategy used in [11] which acts like a local speed limit, making congestion cost less in some regions than others. This aspect of the congested transport problem is new in that the congestion penalty turns out to be unbounded and depends on the distributions to be transported, and so we provide the necessary mathematical proofs for this setting. We use our discovery to show that the gradients of solutions to the optimization problem in WGAN-GP determine the time averaged momentum of optimal mass flow. This is in contrast to the gradients of Kantorovich potentials for the Wasserstein 1 distance, which only determine the normalized direction of flow. This may explain, in support of [22], the success of WGAN-GP, since the training of the generator is based on these gradients.

The Implicit Bias of Benign Overfitting

Time: Sunday, July 3, 03:54 PM GMT+1

Authors: Ohad Shamir

The phenomenon of benign overfitting, where a predictor perfectly fits noisy training data while attaining low expected loss, has received much attention in recent years, but still remains not fully understood beyond simple linear regression setups. In this paper, we show that for regression, benign overfitting is ``biased'' towards certain types of problems, in the sense that its existence on one learning problem precludes its existence on other learning problems. On the negative side, we use this to argue that one should not expect benign overfitting to occur in general, for several natural extensions of the plain linear regression problems studied so far. We then turn to classification problems, and show that the situation there is much more favorable. Specifically, we consider a model where an arbitrary input distribution of some fixed dimension k is concatenated with a high-dimensional distribution, and prove that the max-margin predictor (to which gradient-based methods are known to converge in direction) is asymptotically biased towards minimizing the expected \emph{squared hinge loss} w.r.t. the k-dimensional distribution. This allows us to reduce the question of benign overfitting in classification to the simpler question of whether this loss is a good surrogate for the misclassification error, and use it to show benign overfitting in some new settings.

Rate of Convergence of Polynomial Networks to Gaussian Processes

Time: Sunday, July 3, 04:06 PM GMT+1

Authors: Adam Klukowski

We examine one-hidden-layer neural networks with random weights. It is well-known that in the limit of infinitely many neurons they simplify to Gaussian processes. For networks with a polynomial activation, we demonstrate that the rate of this convergence in 2-Wasserstein metric is $O(1/\sqrt{n})$, where $n$ is the number of hidden neurons. We suspect this rate is asymptotically sharp. We improve the known convergence rate for other activations, to power-law in $n$ for ReLU and inverse-square-root up to logarithmic factors for erf. We explore the interplay between spherical harmonics, Stein kernels and optimal transport in the non-isotropic setting.

Inductive Bias of Multi-Channel Linear Convolutional Networks with Bounded Weight Norm

Time: Sunday, July 3, 04:18 PM GMT+1

Authors: Meena Jagadeesan; Ilya Razenshteyn; Suriya Gunasekar

We provide a function space characterization of the inductive bias resulting from minimizing the $\ell_2$ norm of the weights in multi-channel convolutional neural networks with linear activations and empirically test our resulting hypothesis on ReLU networks trained using gradient descent. We define an \textit{induced regularizer} in the function space as the minimum $\ell_2$ norm of weights of a network required to realize a function. For two layer linear convolutional networks with $C$ output channels and kernel size $K$, we show the following: (a) If the inputs to the network are single channeled, the induced regularizer for any $K$ is \textit{independent} of the number of output channels $C$. Furthermore, we show that the regularizer is a norm given by a semidefinite program (SDP). (b) In contrast, for multi-channel inputs, multiple output channels can be necessary to merely realize all matrix-valued linear functions and thus the inductive bias \emph{does} depend on $C$. However, for sufficiently large $C$, the induced regularizer is again given by an SDP that is independent of $C$. In particular, the induced regularizers for $K=1$ and $K=D$ (input dimension) are given in closed form as the nuclear norm and the $\ell_{2,1}$ group-sparse norm, respectively, of the Fourier coefficients of the linear predictor.
We investigate the broader applicability of our theoretical results to implicit regularization from gradient descent on linear and ReLU networks through experiments on MNIST and CIFAR-10 datasets.

Benign Overfitting without Linearity: Neural Network Classifiers Trained by Gradient Descent for Noisy Linear Data

Time: Sunday, July 3, 04:30 PM GMT+1

Authors: Spencer Frei; Niladri Chatterji; Peter Bartlett

Benign overfitting, the phenomenon where interpolating models generalize well in the presence of noisy data, was first observed in neural network models trained with gradient descent. To better understand this empirical observation, we consider the generalization error of two-layer neural networks trained to interpolation by gradient descent on the logistic loss following random initialization. We assume the data comes from well-separated class-conditional log-concave distributions and allow for a constant fraction of the training labels to be corrupted by an adversary. We show that in this setting, neural networks exhibit benign overfitting: they can be driven to zero training error, perfectly fitting any noisy training labels, and simultaneously achieve test error close to the Bayes-optimal error. In contrast to previous work on benign overfitting that requires linear or kernel-based predictors, our analysis holds in a setting where both the model and learning dynamics are fundamentally nonlinear.

Virtual session IV (Sunday, July 3, 03:30 PM GMT+1)

Location: Room B

Session chairs: Benjamin Guedj; Hassan Ashtiani

Optimal Mean Estimation without a Variance

Time: Sunday, July 3, 03:30 PM GMT+1

Authors: Yeshwanth Cherapanamjeri; Nilesh Tripuraneni; Peter Bartlett; Michael Jordan

We study the problem of heavy-tailed mean estimation in settings where the variance of the data-generating distribution does not exist. Concretely, given a sample $\boldsymbol{X} = \{X_i\}_{i = 1}^n$ from a distribution $\mathcal{D}$ over $\mathbb{R}^d$ with mean $\mu$ which satisfies the following \emph{weak-moment} assumption for some ${\alpha \in [0, 1]}$:
$$\forall \|v\| = 1: \quad \mathbb{E}_{X \sim \mathcal{D}}\left[\left|\langle X - \mu, v \rangle\right|^{1 + \alpha}\right] \leq 1,$$
and given a target failure probability, $\delta$, our goal is to design an estimator which attains the smallest possible confidence interval as a function of $n,d,\delta$. For the specific case of $\alpha = 1$, foundational work of Lugosi and Mendelson exhibits an estimator achieving \emph{optimal} subgaussian confidence intervals, and subsequent work has led to computationally efficient versions of this estimator. Here, we study the case of general $\alpha$, and provide a precise characterization of the optimal achievable confidence interval by establishing the following information-theoretic lower bound:
$$\Omega\left(\sqrt{\frac{d}{n}} + \left(\frac{d}{n}\right)^{\frac{\alpha}{1 + \alpha}} + \left(\frac{\log 1 / \delta}{n}\right)^{\frac{\alpha}{1 + \alpha}}\right),$$
and devising an estimator matching the aforementioned lower bound up to constants. Moreover, our estimator is computationally efficient.

Differential Privacy and Robust Statistics in High Dimensions

Time: Sunday, July 3, 03:42 PM GMT+1

Authors: Xiyang Liu; Weihao Kong; Sewoong Oh

We introduce a universal framework for characterizing the statistical efficiency of a statistical estimation problem with differential privacy guarantees. Our framework, which we call High-dimensional Propose-Test-Release (HPTR), builds upon three crucial components: the exponential mechanism, robust statistics, and the Propose-Test-Release mechanism. Connecting all these together is the concept of resilience, which is central to robust statistical estimation. Resilience guides the design of the algorithm, the sensitivity analysis, and the success probability analysis of the test step in Propose-Test-Release. The key insight is that if we design an exponential mechanism that accesses the data only via one-dimensional and robust statistics, then the resulting local sensitivity can be dramatically reduced. Using resilience, we can provide tight local sensitivity bounds. These tight bounds readily translate into near-optimal utility guarantees in several cases. We give a general recipe for applying HPTR to a given instance of a statistical estimation problem and demonstrate it on canonical problems of mean estimation, linear regression, covariance estimation, and principal component analysis. We introduce a general utility analysis technique that proves that HPTR achieves near-optimal sample complexity under several scenarios studied in the literature.

Derivatives and Residual Distribution of Regularized M-Estimators with Application to Adaptive Tuning

Time: Sunday, July 3, 03:54 PM GMT+1

Authors: Pierre Bellec; Yiwei Shen

This paper studies M-estimators with a gradient-Lipschitz loss function regularized with a convex penalty in linear models with a Gaussian design matrix and an arbitrary noise distribution. A practical example is the robust M-estimator constructed with the Huber loss and the Elastic-Net penalty when the noise distribution has heavy tails. Our main contributions are three-fold. (i) We provide general formulae for the derivatives of regularized M-estimators $\hat\beta(y,X)$ where differentiation is taken with respect to both $X$ and $y$; this reveals a simple differentiability structure shared by all convex regularized M-estimators. (ii) Using these derivatives, we characterize the distribution of the residuals in the intermediate high-dimensional regime where dimension and sample size are of the same order. (iii) Motivated by the distribution of the residuals, we propose a novel adaptive criterion to select tuning parameters of regularized M-estimators. The criterion approximates the out-of-sample error up to an additive constant independent of the estimator, so that minimizing the criterion provides a proxy for minimizing the out-of-sample error. The proposed adaptive criterion does not require knowledge of the noise distribution or of the covariance of the design. Simulated data confirms the theoretical findings, regarding both the distribution of the residuals and the success of the criterion as a proxy of the out-of-sample error. Finally, our results reveal new relationships between the derivatives of $\hat\beta$ and the effective degrees of freedom of the M-estimators, which are of independent interest.

Mean-Field Nonparametric Estimation of Interacting Particle Systems

Time: Sunday, July 3, 04:06 PM GMT+1

Authors: Rentian Yao; Xiaohui Chen; Yun Yang

This paper concerns the nonparametric estimation problem of the distribution-state dependent drift vector field in an interacting $N$-particle system. Observing single-trajectory data for each particle, we derive the mean-field rate of convergence for the maximum likelihood estimator (MLE), which depends on both Gaussian complexity and Rademacher complexity of the function class. In particular, when the function class contains $\alpha$-smooth Hölder functions, our rate of convergence is minimax optimal on the order of $N^{-\frac{\alpha}{d+2\alpha}}$. Combining with a Fourier analytical deconvolution estimator, we derive the consistency of MLE for the external force and interaction kernel in the McKean-Vlasov equation.

Non-Gaussian Component Analysis via Lattice Basis Reduction

Time: Sunday, July 3, 04:18 PM GMT+1

Authors: Ilias Diakonikolas; Daniel Kane

Non-Gaussian Component Analysis (NGCA) is the following distribution learning problem: Given i.i.d. samples from a distribution on $\mathbb{R}^d$ that is non-Gaussian in a hidden direction $v$ and an independent standard Gaussian in the orthogonal directions, the goal is to approximate the hidden direction $v$. Prior work \citep{DKS17-sq} provided formal evidence for the existence of an information-computation tradeoff for NGCA under appropriate moment-matching conditions on the univariate non-Gaussian distribution $A$. The latter result does not apply when the distribution $A$ is discrete. A natural question is whether information-computation tradeoffs persist in this setting. In this paper, we answer this question in the negative by obtaining a sample and computationally efficient algorithm for NGCA in the regime that $A$ is discrete or nearly discrete, in a well-defined technical sense. The key tool leveraged in our algorithm is the LLL method \citep{LLL82} for lattice basis reduction.

Orthogonal Statistical Learning with Self-Concordant Loss

Time: Sunday, July 3, 04:30 PM GMT+1

Authors: Lang Liu; Carlos Cinelli; Zaid Harchaoui

Orthogonal statistical learning and double machine learning have emerged as general frameworks for two-stage statistical prediction in the presence of a nuisance component. We establish non-asymptotic bounds on the excess risk of orthogonal statistical learning methods with a loss function satisfying a self-concordance property. Our bounds improve upon existing bounds by a dimension factor while lifting the assumption of strong convexity. We illustrate the results with examples from multiple treatment effect estimation and generalized partially linear modeling.

Bandits I (Sunday, July 3, 05:00 PM GMT+1)

Location: Room A

Session chairs: Csaba Szepesvári; Gergely Neu

The Pareto Frontier of Instance-Dependent Guarantees in Multi-Player Multi-Armed Bandits with No Communication

Time: Sunday, July 3, 05:00 PM GMT+1

Authors: Mark Sellke; Allen Liu

We study the stochastic multi-player multi-armed bandit problem. In this problem, there are $m$ players and $K > m$ arms and the players cooperate to maximize their total reward. However, the players cannot communicate and are penalized (e.g. receive no reward) if they pull the same arm at the same time. We ask whether it is possible to obtain optimal instance-dependent regret $\widetilde{O}(1/\Delta)$ where $\Delta$ is the gap between the $m$-th and $(m+1)$-st best arms. Such guarantees were recently achieved by \cite{pacchiano2021instance, huang2021towards} in a model in which the players are able to implicitly communicate through intentional collisions. We show that with no communication at all, such guarantees are, surprisingly, not achievable. In fact, obtaining the optimal $\widetilde{O}(1/\Delta)$ regret for some regimes of $\Delta$ necessarily implies strictly sub-optimal regret in other regimes. Our main result is a complete characterization of the Pareto optimal instance-dependent trade-offs that are possible with no communication. Our algorithm generalizes that of \cite{bubeck2021cooperative} and enjoys the same strong no-collision property, while our lower bound is completely new.

Return of the Bias: Minimax Optimal High Probability Bounds for Adversarial Linear Bandits

Time: Sunday, July 3, 05:12 PM GMT+1

Authors: Julian Zimmert; Tor Lattimore

We introduce a modification of follow the regularized leader and combine it with the log determinant potential and suitable loss estimators to prove that the minimax regret for adaptive adversarial linear bandits is at most $O(d \sqrt{T \log(T)})$ where $d$ is the dimension and $T$ is the number of rounds. By using exponential weights, we improve this bound to $O(\sqrt{dT\log(kT)})$ when the action set has size $k$. These results confirm an old conjecture. We also show that follow the regularized leader with the entropic barrier and suitable loss estimators has regret against an adaptive adversary of at most $O(d^2 \sqrt{T} \log(T))$ and can be implemented in polynomial time, which improves on the best known bound for an efficient algorithm of $O(d^{7/2} \sqrt{T} \, \mathrm{poly}(\log(T)))$ by Lee et al. (2020).

Efficient Decentralized Multi-Agent Learning in Asymmetric Queuing Systems

Time: Sunday, July 3, 05:24 PM GMT+1

Authors: Daniel Freund; Thodoris Lykouris; Wentao Weng

We study decentralized multi-agent learning in bipartite queuing systems, a standard model for service systems. In particular, $N$ agents request service from $K$ servers in a fully decentralized way, i.e., by running the same algorithm without communication. Previous decentralized algorithms are restricted to symmetric systems, have performance that degrades exponentially in the number of servers, require communication through shared randomness and unique agent identities, and are computationally demanding. In contrast, we provide a simple learning algorithm that, when run decentrally by each agent, leads the queuing system to have efficient performance in general asymmetric bipartite queuing systems while also having additional robustness properties. Along the way, we provide the first UCB-based algorithm for the centralized case of the problem, which resolves an open question by Krishnasamy et al.

Complete Policy Regret Bounds for Tallying Bandits

Time: Sunday, July 3, 05:36 PM GMT+1

Authors: Dhruv Malik; Yuanzhi Li; Aarti Singh

Policy regret is a well-established notion for measuring the performance of an online learning algorithm against an adaptive adversary. We study restrictions on the adversary that enable efficient minimization of the \emph{complete policy regret}, which is the strongest possible version of policy regret. We identify a gap in the current theoretical understanding of what sorts of restrictions permit tractability in this challenging setting. To resolve this gap, we consider a generalization of the stochastic multi-armed bandit, which we call the \emph{tallying bandit}. This is an online learning setting with an $m$-memory bounded adversary, where the average loss for playing an action is an unknown function of the number (or tally) of times that the action was played in the last $m$ timesteps. For tallying bandit problems with $K$ actions and time horizon $T$, we provide an algorithm that with high probability achieves a complete policy regret guarantee of $\widetilde{O}( m K \sqrt{T} )$, where the $\widetilde{O}$ notation hides only logarithmic factors. We additionally prove an $\widetilde{\Omega}(\sqrt{ m K T})$ lower bound on the expected complete policy regret of any tallying bandit algorithm, demonstrating the near optimality of our method.
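
The tallying mechanism described in this abstract can be sketched as follows. This is a toy illustration, not the authors' algorithm: the name `tally_losses`, the choice to count the current round inside the window of the last `m` timesteps, and the use of a deterministic `loss_fn` standing in for the unknown average loss are all assumptions.

```python
from collections import deque


def tally_losses(actions, loss_fn, m):
    """Return, for each round, loss_fn(action, tally), where tally is the number
    of times that action appears among the last m plays (current play included)."""
    window = deque(maxlen=m)  # keeps only the most recent m actions
    losses = []
    for a in actions:
        window.append(a)
        tally = sum(1 for b in window if b == a)
        losses.append(loss_fn(a, tally))
    return losses


# Hypothetical loss: playing an action repeatedly within the window costs more.
example = tally_losses([0, 0, 1, 0], lambda action, tally: tally, m=2)
```

Because the loss depends on the recent history of plays and not only on the current action, comparing against a fixed action sequence requires the policy-regret viewpoint that the abstract discusses.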

Virtual session V (Sunday, July 3, 05:00 PM GMT+1)

Location: Room B

Session chairs: Hassan Ashtiani; Ohad Shamir

Private Robust Estimation by Stabilizing Convex Relaxations

Time: Sunday, July 3, 05:00 PM GMT+1

Authors: Pravesh Kothari; Pasin Manurangsi; Ameya Velingker

We give the first polynomial time and sample $(\epsilon, \delta)$-differentially private (DP) algorithm to estimate the mean, covariance and higher moments in the presence of a constant fraction of adversarial outliers. Our algorithm succeeds for families of distributions that satisfy two well-studied properties in prior works on robust estimation: certifiable subgaussianity of directional moments and certifiable hypercontractivity of degree-2 polynomials. Our recovery guarantees hold in the “right affine-invariant norms”: Mahalanobis distance for mean, multiplicative spectral and relative Frobenius distance guarantees for covariance and injective norms for higher moments. Prior works obtained private robust algorithms for mean estimation of subgaussian distributions with bounded covariance. For covariance estimation, ours is the first efficient algorithm (even in the absence of outliers) that succeeds without any condition-number assumptions.

Our algorithms arise from a new framework that provides a general blueprint for modifying convex relaxations for robust estimation to satisfy strong worst-case stability guarantees in the appropriate parameter norms whenever the algorithms produce witnesses of correctness in their run. We verify such guarantees for a modification of standard sum-of-squares (SoS) semidefinite programming relaxations for robust estimation. Our privacy guarantees are obtained by combining stability guarantees with a new “estimate dependent” noise injection mechanism in which noise scales with the eigenvalues of the estimated covariance. We believe this framework will be useful more generally in obtaining DP counterparts of robust estimators.

Independently of our work, Ashtiani and Liaw [AL21] also obtained a polynomial time and sample private robust estimation algorithm for Gaussian distributions.

The Role of Interactivity in Structured Estimation

Time: Sunday, July 3, 05:12 PM GMT+1

Authors: Jayadev Acharya; Clement Canonne; Ziteng Sun; Himanshu Tyagi

We study high-dimensional sparse estimation under three natural constraints: communication constraints, local privacy constraints, and linear measurements (compressive sensing). Without sparsity assumptions, it has been established that interactivity cannot improve the minimax rates of estimation under these information constraints. The question of whether interactivity helps with natural inference tasks has been a topic of active research. We settle this question in the affirmative for the prototypical problems of high-dimensional sparse mean estimation and compressive sensing, by demonstrating a gap between interactive and noninteractive protocols.
We further establish that the gap increases when we have more structured sparsity: for \emph{block sparsity} this gap can be as large as \emph{polynomial} in the dimensionality. Thus, the more structured the sparsity is, the greater is the advantage of interaction. Proving the lower bounds requires a careful breaking of a sum of correlated random variables into independent components using Baranyai's theorem on decomposition of hypergraphs, which might be of independent interest.

Near Optimal Efficient Decoding from Pooled Data

Time: Sunday, July 3, 05:24 PM GMT+1

Authors: Max Hahn-Klimroth; Noela Müller

Consider $n$ items, each of which is characterised by one of $d+1$ possible features in $\{0, \ldots, d\}$. We study the inference task of learning these types by queries on subsets, or pools, of the items that only reveal a form of coarsened information on the features, in our case the sum of all the features in the pool. This is a realistic scenario in situations where one has memory or technical constraints in the data collection process, or where the data is subject to anonymisation. Related prominent problems are the quantitative group testing problem, of which it is a generalisation, as well as the compressed sensing problem, of which it is a special case. In the present article, we are interested in the minimum number of queries needed to efficiently infer the labels, if one of the features, say $0$, is dominant in the sense that the number $k$ of non-zero features among the items is much smaller than $n$. It is known that in this case, all features can be recovered in exponential time by using no more than $O(k)$ queries. However, so far, all \textit{efficient} inference algorithms required at least $\Omega(k\ln n)$ queries, and it was unknown whether this gap is artificial or of a fundamental nature. Here we show that indeed, the previous gap between the information-theoretic and computational bounds is not inherent to the problem by providing an efficient algorithm that succeeds with high probability and employs no more than $O(k)$ measurements. This also solves a long-standing open question for the quantitative group testing problem.
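
To make the query model in this abstract concrete, the sketch below computes what a set of pooled queries reveals: for each pool, only the sum of the features of the items in it. The names `pool_query` and `run_queries` and the example data are hypothetical; the decoding algorithm of the paper is not reproduced here.

```python
def pool_query(features, pool):
    """Return the sum of the features of the items whose indices are in the pool."""
    return sum(features[i] for i in pool)


def run_queries(features, pools):
    """Answer each pool in turn; the decoder sees only these sums."""
    return [pool_query(features, pool) for pool in pools]


# Hypothetical instance: n = 5 items, features in {0, 1, 2}, k = 2 non-zero items.
hidden = [0, 2, 0, 1, 0]
answers = run_queries(hidden, [[0, 1], [1, 3], [2, 4], [0, 3, 4]])
```

The inference task is then to recover `hidden` from `answers` and the pools alone, using as few pools as possible.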

Optimal SQ Lower Bounds for Robustly Learning Discrete Product Distributions and Ising Models

Time: Sunday, July 3, 05:36 PM GMT+1

Authors: Ilias Diakonikolas; Daniel Kane; Yuxin Sun

We establish optimal Statistical Query (SQ) lower bounds for robustly learning certain families of discrete high-dimensional distributions. In particular, we show that no efficient SQ algorithm with access to an $\epsilon$-corrupted binary product distribution can learn its mean within $\ell_2$-error $o(\epsilon \sqrt{\log(1/\epsilon)})$. Similarly, we show that no efficient SQ algorithm with access to an $\epsilon$-corrupted ferromagnetic high-temperature Ising model can learn the model to total variation distance $o(\epsilon \log(1/\epsilon))$. Our SQ lower bounds match the error guarantees of known algorithms for these problems, providing evidence that current upper bounds for these tasks are best possible. At the technical level, we develop a generic SQ lower bound for discrete high-dimensional distributions starting from low-dimensional moment matching constructions that we believe will find other applications. Additionally, we introduce new ideas to analyze these moment-matching constructions for discrete univariate distributions.

Sampling (Monday, July 4, 09:00 AM GMT+1)

Location: Room A

Session chairs: Kfir Levy; Akshay Krishnamurthy

Analysis of Langevin Monte Carlo from Poincaré to Log-Sobolev

Time: Monday, July 4, 09:00 AM GMT+1

Authors: Sinho Chewi; Murat Erdogdu; Mufan Li; Ruoqi Shen; Shunshi Zhang

Classically, the continuous-time Langevin diffusion converges exponentially fast to its stationary distribution $\pi$ under the sole assumption that $\pi$ satisfies a Poincaré inequality. Using this fact to provide guarantees for the discrete-time Langevin Monte Carlo (LMC) algorithm, however, is considerably more challenging due to the need for working with chi-squared or Rényi divergences, and prior works have largely focused on strongly log-concave targets. In this work, we provide the first convergence guarantees for LMC assuming that $\pi$ satisfies either a Latała-Oleszkiewicz or modified log-Sobolev inequality, which interpolates between the Poincaré and log-Sobolev settings. Unlike prior works, our results allow for weak smoothness and do not require convexity or dissipativity conditions.

The Query Complexity of Sampling from Strongly Log-Concave Distributions in One Dimension

Time: Monday, July 4, 09:12 AM GMT+1

Authors: Sinho Chewi; Patrik Gerber; Chen Lu; Thibaut Le Gouic; Philippe Rigollet

We establish the first tight lower bound of $\Omega(\log\log\kappa)$ on the query complexity of sampling from the class of strongly log-concave and log-smooth distributions with condition number $\kappa$ in one dimension. Whereas existing guarantees for MCMC-based algorithms scale polynomially in $\kappa$, we introduce a novel algorithm based on rejection sampling that closes this doubly exponential gap.

Towards a Theory of Non-Log-Concave Sampling: First-Order Stationarity Guarantees for Langevin Monte Carlo

Time: Monday, July 4, 09:24 AM GMT+1

Authors: Krishna Balasubramanian; Sinho Chewi; Murat Erdogdu; Adil Salim; Shunshi Zhang

For the task of sampling from a density $\pi \propto \exp(-V)$ on $\mathbb{R}^d$, where $V$ is possibly non-convex but $L$-gradient Lipschitz, we prove that averaged Langevin Monte Carlo outputs a sample with $\varepsilon$-relative Fisher information after $O( L^2 d^2/\varepsilon^2)$ iterations. This is the sampling analogue of complexity bounds for finding an $\varepsilon$-approximate first-order stationary point in non-convex optimization and therefore constitutes a first step towards the general theory of non-log-concave sampling. We discuss numerous extensions and applications of our result; in particular, it yields a new state-of-the-art guarantee for sampling from distributions which satisfy a Poincaré inequality.
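
A minimal sketch of a Langevin Monte Carlo loop with averaging, under stated assumptions: the update is the standard LMC step $x_{k+1} = x_k - h \nabla V(x_k) + \sqrt{2h}\,\xi_k$ with standard Gaussian $\xi_k$, and "averaging" is approximated here by returning one iterate chosen uniformly at random, so that the output law is the average of the iterates' laws. The function name, the fixed step size, and this discrete form of averaging are illustrative choices and may differ from the exact scheme analysed in the paper.

```python
import math
import random


def averaged_lmc(grad_V, x0, step, n_iters, rng):
    """Run n_iters LMC steps from x0 and return a uniformly chosen iterate."""
    x = list(x0)
    iterates = []
    noise_scale = math.sqrt(2.0 * step)
    for _ in range(n_iters):
        g = grad_V(x)
        # Gradient step on V plus Gaussian noise in each coordinate.
        x = [xi - step * gi + noise_scale * rng.gauss(0.0, 1.0)
             for xi, gi in zip(x, g)]
        iterates.append(x)
    return rng.choice(iterates)


# Example target (an assumption for illustration): V(x) = |x|^2 / 2, so grad V(x) = x.
sample = averaged_lmc(lambda x: x, [3.0, -3.0], step=0.1, n_iters=200,
                      rng=random.Random(0))
```

Passing an explicit `random.Random` instance keeps the sketch reproducible; any gradient oracle for an $L$-gradient Lipschitz potential could be substituted for the quadratic example.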

Improved Analysis for a Proximal Algorithm for Sampling

Time: Monday, July 4, 09:36 AM GMT+1

Authors: Yongxin Chen; Sinho Chewi; Adil Salim; Andre Wibisono

We study the proximal sampler of Lee, Shen, and Tian (2021) and obtain new convergence guarantees under weaker assumptions than strong log-concavity: namely, our results hold for (1) weakly log-concave targets, and (2) targets satisfying isoperimetric assumptions which allow for non-log-concavity. We demonstrate our results by obtaining new state-of-the-art sampling guarantees for several classes of target distributions. We also strengthen the connection between the proximal sampler and the proximal method in optimization by interpreting the former as an entropically regularized Wasserstein gradient flow and the latter as the limit of one.

Sampling Approximately Low-Rank Ising Models: MCMC Meets Variational Methods

Time: Monday, July 4, 09:48 AM GMT+1

Authors: Frederic Koehler; Holden Lee; Andrej Risteski

We consider Ising models on the hypercube with a general interaction matrix $J$, and give a polynomial time sampling algorithm when all but $O(1)$ eigenvalues of $J$ lie in an interval of length one, a situation which occurs in many models of interest. This was previously known for the Glauber dynamics when \emph{all} eigenvalues fit in an interval of length one; however, a single outlier can force the Glauber dynamics to mix torpidly. Our general result implies the first polynomial time sampling algorithms for low-rank Ising models such as Hopfield networks with a fixed number of patterns and Bayesian clustering models with low-dimensional contexts, and greatly improves the polynomial time sampling regime for the antiferromagnetic/ferromagnetic Ising model with inconsistent field on expander graphs. It also improves on previous approximation algorithm results based on the naive mean-field approximation in variational methods and statistical physics.

Our approach is based on a new fusion of ideas from the MCMC and variational inference worlds. As part of our algorithm, we define a new nonconvex variational problem which allows us to sample from an exponential reweighting of a distribution by a negative definite quadratic form, and show how to make this procedure provably efficient using stochastic gradient descent. On top of this, we construct a new simulated tempering chain (on an extended state space arising from the Hubbard-Stratonovich transform) which overcomes the obstacle posed by large positive eigenvalues, and combine it with the SGD-based sampler to solve the full problem.

From Sampling to Optimization on Discrete Domains with Applications to Determinant Maximization

Time: Monday, July 4, 10:00 AM GMT+1

Authors: Nima Anari; Thuy-Duong Vuong

We show a connection between sampling and optimization on discrete domains. For a family of distributions $\mu$ defined on size $k$ subsets of a ground set of elements that is closed under external fields, we show that rapid mixing of natural local random walks implies the existence of simple approximation algorithms to find $\max \mu(\cdot)$. More precisely we show that if $t$-step down-up random walks have spectral gap at least inverse polynomially large, then $t$-step local search can find $\max \mu(\cdot)$ within a factor of $k^{O(k)}$. As the main application of our result, we show that $2$-step local search achieves a nearly-optimal $k^{O(k)}$-factor approximation for MAP inference on nonsymmetric $k$-DPPs. This is the first nontrivial multiplicative approximation algorithm for this problem.

We establish the connection between sampling and optimization by showing that an exchange inequality, a concept rooted in discrete convex analysis, can be derived from fast mixing of local random walks. We further advance the state-of-the-art on the mixing of random walks for nonsymmetric DPPs and more generally sector-stable distributions, by obtaining the tightest possible bound on the step size needed for polynomial-time mixing of random walks. Our improvement brings the step size down by a factor of $2$ compared to prior works, and is potentially of independent interest in sampling applications. The improvement on step size directly translates to quadratically faster local search steps for MAP inference.

Computational complexity II (Monday, July 4, 09:00 AM GMT+1)

Location: Room B

Session chairs: Wouter Koolen; Aryeh Kontorovich

Near-Optimal Statistical Query Lower Bounds for Agnostically Learning Intersections of Halfspaces with Gaussian Marginals

Time: Monday, July 4, 09:00 AM GMT+1

Authors: Daniel Hsu; Clayton Sanford; Rocco Servedio; Emmanouil Vlatakis-Gkaragkounis

We consider the well-studied problem of learning intersections of halfspaces under the Gaussian distribution in the challenging \emph{agnostic learning} model. Recent work of Diakonikolas et al. (2021) shows that any Statistical Query (SQ) algorithm for agnostically learning the class of intersections of $k$ halfspaces over $\mathbb{R}^n$ to constant excess error either must make queries of tolerance at most $n^{-\tilde{\Omega}(\sqrt{\log k})}$ or must make $2^{n^{\Omega(1)}}$ queries. We strengthen this result by improving the tolerance requirement to $n^{-\tilde{\Omega}(\log k)}$. This lower bound is essentially best possible since an SQ algorithm of Klivans et al. (2008) agnostically learns this class to any constant excess error using $n^{O(\log k)}$ queries of tolerance $n^{-O(\log k)}$. We prove two variants of our lower bound, each of which combines ingredients from Diakonikolas et al. (2021) with (an extension of) a different earlier approach for agnostic SQ lower bounds for the Boolean setting due to Dachman-Soled et al. (2014). Our approach also yields lower bounds for agnostically SQ learning the class of "convex subspace juntas" (studied by Vempala, 2010) and the class of sets with bounded Gaussian surface area; all of these lower bounds are nearly optimal since they essentially match known upper bounds from Klivans et al. (2008).

Optimal SQ Lower Bounds for Learning Halfspaces with Massart Noise

Time: Monday, July 4, 09:12 AM GMT+1

Authors: Stefan Tiegel; Rajai Nasser

We give tight statistical query (SQ) lower bounds for learning halfspaces in the presence of Massart noise. In particular, suppose that all labels are corrupted with probability at most $\eta$. We show that for arbitrary $\eta \in [0,1/2]$ every SQ algorithm achieving misclassification error better than $\eta$ requires queries of superpolynomial accuracy or at least a superpolynomial number of queries. Further, this continues to hold even if the information-theoretically optimal error $\mathrm{OPT}$ is as small as $\exp\left(-\log^c(d)\right)$, where $d$ is the dimension and $0 < c < 1$ is an arbitrary absolute constant, and an overwhelming fraction of examples are noiseless. Our lower bound matches known polynomial time algorithms, which are also implementable in the SQ framework. Previously, such lower bounds only ruled out algorithms achieving error $\mathrm{OPT} + \epsilon$ or error better than $\Omega(\eta)$ or, if $\eta$ is close to $1/2$, error $\eta - o_\eta(1)$, where the term $o_\eta(1)$ is constant in $d$ but going to 0 for $\eta$ approaching $1/2$.

As a consequence, we also show that achieving misclassification error better than $1/2$ in the $(A,\alpha)$-Tsybakov model is SQ-hard for $A$ constant and $\alpha$ bounded away from 1.

Smoothed Online Learning Is as Easy as Statistical Learning

Time: Monday, July 4, 09:24 AM GMT+1

Authors: Adam Block; Yuval Dagan; Noah Golowich; Alexander Rakhlin

Much of modern learning theory has been split between two regimes: the classical offline setting, where data arrive independently, and the online setting, where data arrive adversarially. While the former model is often both computationally and statistically tractable, the latter requires no distributional assumptions. In an attempt to achieve the best of both worlds, previous work proposed the smooth online setting where each sample is drawn from an adversarially chosen distribution, which is smooth, i.e., it has a bounded density with respect to a fixed dominating measure. Existing results for the smooth setting were known only for binary-valued function classes and were computationally expensive in general; in this paper, we fill these lacunae. In particular, we provide tight bounds on the minimax regret of learning a nonparametric function class, with nearly optimal dependence on both the horizon and smoothness parameters. Furthermore, we provide the first oracle-efficient, no-regret algorithms in this setting. In particular, we propose an oracle-efficient improper algorithm whose regret achieves optimal dependence on the horizon and a proper algorithm requiring only a single oracle call per round whose regret has the optimal horizon dependence in the classification setting and is sublinear in general. Both algorithms have exponentially worse dependence on the smoothness parameter of the adversary than the minimax rate. We then prove a lower bound on the oracle complexity of any proper learning algorithm, which matches the oracle-efficient upper bounds up to a polynomial factor, thus demonstrating the existence of a statistical-computational gap in smooth online learning. Finally, we apply our results to the contextual bandit setting to show that if a function class is learnable in the classical setting, then there is an oracle-efficient, no-regret algorithm for contextual bandits in the case that contexts arrive in a smooth manner.

Toward Instance-Optimal State Certification with Incoherent Measurements

Time: Monday, July 4, 09:36 AM GMT+1

Authors: Sitan Chen; Jerry Li; Ryan O'Donnell

We revisit the basic problem of quantum state certification: given copies of an unknown mixed state ρ∈ℂ^{d×d} and the description of a mixed state σ, decide whether σ=ρ or ‖σ−ρ‖_tr ≥ ϵ. When σ is maximally mixed, this is mixedness testing, and it is known that Ω(d^{Θ(1)}/ϵ^2) copies are necessary, where the exact exponent depends on the type of measurements the learner can make [OW15, BCL20], and in many of these settings there is a matching upper bound [OW15, BOW19, BCL20].

Can one avoid this d^{Θ(1)} dependence for certain kinds of mixed states σ, e.g. ones which are approximately low rank? More ambitiously, does there exist a simple functional f : ℂ^{d×d} → ℝ_{≥0} for which one can show that Θ(f(σ)/ϵ^2) copies are necessary and sufficient for state certification with respect to any σ? Such instance-optimal bounds are known in the context of classical distribution testing, e.g. [VV17].

Here we give the first bounds of this nature for the quantum setting, showing (up to log factors) that the copy complexity for state certification using nonadaptive incoherent measurements is essentially given by the copy complexity for mixedness testing times the fidelity between σ and the maximally mixed state. Surprisingly, our bound differs substantially from instance optimal bounds for the classical problem, demonstrating a qualitative difference between the two settings.

Realizable Learning Is All You Need

Time: Monday, July 4, 09:48 AM GMT+1

Authors: Max Hopkins; Daniel Kane; Shachar Lovett; Gaurav Mahajan

The equivalence of realizable and agnostic learnability is a fundamental phenomenon in learning theory. With variants ranging from classical settings like PAC learning and regression to recent trends such as adversarially robust and private learning, it's surprising we still lack a unified theory; traditional proofs of the equivalence tend to be disparate, and rely on strong model-specific assumptions like uniform convergence and sample compression.

In this work, we give the first model-independent framework explaining the equivalence of realizable and agnostic learnability: a three-line blackbox reduction that simplifies, unifies, and extends our understanding across a wide variety of settings. This includes models with no known characterization of learnability such as learning with arbitrary distributional assumptions or general loss, as well as a host of other popular settings such as robust learning, partial learning, fair learning, and the statistical query model.

More generally, we argue that the equivalence of realizable and agnostic learning is actually a special case of a broader phenomenon we call property generalization: any desirable property of a learning algorithm (e.g. noise tolerance, privacy, stability) that can be satisfied over finite hypothesis classes extends (possibly in some variation) to any learnable hypothesis class.

On Characterizations of Learnability with Computable Learners

Time: Monday, July 4, 10:00 AM GMT+1

Authors: Tom Sterkenburg

We study computable PAC (CPAC) learning as introduced by Agarwal et al. (2020). First, we consider the main open question of finding characterizations of proper and improper CPAC learning. We give a characterization of a closely related notion of *strong* CPAC learning, and we provide a negative answer to the COLT open problem posed by Agarwal et al. (2021) of whether all decidably representable PAC learnable classes are improperly CPAC learnable. Second, we consider undecidability of (computable) PAC learnability. We give a simple and general argument to exhibit such undecidability, and we initiate a study of the arithmetical complexity of learnability. We briefly discuss the relation to the undecidability result of Ben-David et al. (2019), which motivated the work of Agarwal et al.

Bandits II (Monday, July 4, 10:45 AM GMT+1)

Location: Room A

Session chairs: Yevgeny Seldin; Csaba Szepesvári

Multi-Agent Learning for Iterative Dominance Elimination: Formal Barriers and New Algorithms

Time: Monday, July 4, 10:45 AM GMT+1

Authors: Jibang Wu; Haifeng Xu; Fan Yao

Dominated actions are natural (and perhaps the simplest possible) multi-agent generalizations of sub-optimal actions as in standard single-agent decision making. Thus, similar to standard bandit learning, a fundamental learning question in multi-agent systems is whether agents can efficiently eliminate all iteratively dominated actions in an unknown game if they can only observe noisy bandit feedback about the payoff of their played actions. Surprisingly, despite the seemingly simple task, we show a quite negative result: standard no-regret algorithms, including the entire family of Dual Averaging algorithms, provably take exponentially many rounds to eliminate all iteratively dominated actions. Moreover, algorithms with the stronger no-swap-regret property also suffer similar exponential inefficiency. To overcome these barriers, we develop a new algorithm that adjusts Exp3 with Diminishing Historical rewards (termed Exp3-DH); Exp3-DH gradually "forgets" history at carefully tailored rates. We prove that when all agents run Exp3-DH (a.k.a., self-play in multi-agent learning), all iteratively dominated actions can be eliminated within polynomially many rounds. Our experimental results further demonstrate the efficiency of Exp3-DH, and that state-of-the-art bandit algorithms, even those explicitly developed for learning in games, fail to eliminate all iteratively dominated actions efficiently.

Adversarially Robust Multi-Armed Bandit Algorithm with Variance-Dependent Regret Bounds

Time: Monday, July 4, 10:57 AM GMT+1

Authors: Shinji Ito; Taira Tsuchiya; Junya Honda

This paper considers the multi-armed bandit (MAB) problem and provides a new best-of-both-worlds (BOBW) algorithm that works nearly optimally in both stochastic and adversarial settings. In stochastic settings, some existing BOBW algorithms achieve tight gap-dependent regret bounds of $O(\sum_{i: \Delta_i>0} \frac{\log T}{\Delta_i})$ for suboptimality gap $\Delta_i$ of arm $i$ and time horizon $T$. As Audibert et al. (2007) have shown, however, the performance can be improved in stochastic environments with low-variance arms. In fact, they have provided a stochastic MAB algorithm with gap-variance-dependent regret bounds of $O(\sum_{i: \Delta_i>0} (\frac{\sigma_i^2}{\Delta_i} + 1) \log T )$ for loss variance $\sigma_i^2$ of arm $i$. In this paper, we propose the first BOBW algorithm with gap-variance-dependent bounds, showing that the variance information can be used even in the possibly adversarial environment. Further, the leading constant factor in our gap-variance-dependent bound is only (almost) twice the value for the lower bound. Additionally, the proposed algorithm enjoys multiple data-dependent regret bounds in adversarial settings and works well in stochastic settings with adversarial corruptions. The proposed algorithm is based on the follow-the-regularized-leader method and employs adaptive learning rates that depend on the empirical prediction error of the loss, which leads to gap-variance-dependent regret bounds reflecting the variance of the arms.
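To make the difference between the two bound shapes concrete, the following minimal sketch evaluates both sums on a hypothetical instance. The gaps, variances, and horizon are illustrative assumptions, and the constants hidden by the $O(\cdot)$ notation are ignored, so only the orders of magnitude are meaningful.

```python
import math

def gap_bound(gaps, horizon):
    """Order of the gap-dependent bound: sum over suboptimal arms of log(T) / gap."""
    return sum(math.log(horizon) / g for g in gaps if g > 0)

def gap_variance_bound(gaps, variances, horizon):
    """Order of the gap-variance-dependent bound: sum of (variance / gap + 1) * log(T)."""
    return sum((v / g + 1.0) * math.log(horizon)
               for g, v in zip(gaps, variances) if g > 0)

# Hypothetical instance: one optimal arm (gap 0) and three suboptimal arms
# with small gaps and very low loss variance.
gaps = [0.0, 0.01, 0.02, 0.05]
variances = [0.0, 1e-4, 1e-4, 1e-4]
horizon = 10**6

plain = gap_bound(gaps, horizon)
with_variance = gap_variance_bound(gaps, variances, horizon)
print(f"gap-dependent order:          {plain:.1f}")
print(f"gap-variance-dependent order: {with_variance:.1f}")
```

On this instance the gap-dependent sum is 170 times $\log T$, while the gap-variance-dependent sum is about 3 times $\log T$, which is the kind of improvement low-variance arms allow.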

A Sharp Memory-Regret Trade-off for Multi-Pass Streaming Bandits

Time: Monday, July 4, 11:09 AM GMT+1

Authors: Arpit Agarwal; Sanjeev Khanna; Prathamesh Patil

The stochastic $K$-armed bandit problem has been studied extensively due to its applications in various domains ranging from online advertising to clinical trials. In practice however, the number of arms can be very large resulting in large memory requirements for simultaneously processing them. In this paper we consider a streaming setting where the arms are presented in a stream and the algorithm uses limited memory to process these arms. Here, the goal is not only to minimize regret, but also to do so in minimal memory. Previous algorithms for this problem operate in one of the two settings: they either use $\Omega(\log \log T)$ passes over the stream \citep{rathod2021reducing, ChaudhuriKa20, Liau+18}, or just a single pass \citep{Maiti+21}.

In this paper we study the trade-off between memory and regret when $B$ passes over the stream are allowed, for any $B \geq 1$, and establish \emph{tight} regret upper and lower bounds for any $B$-pass algorithm. Our results uncover a surprising \emph{sharp transition phenomenon}: $O(1)$ memory is sufficient to achieve $\widetilde\Theta\left(T^{\frac{1}{2} + \frac{1}{2^{B+2}-2}}\right)$ regret in $B$ passes, and increasing the memory to any quantity that is $o(K)$ has almost no impact on further reducing this regret, unless we use $\Omega(K)$ memory. Our main technical contribution is our lower bound which requires the use of \emph{information-theoretic techniques} as well as ideas from \emph{round elimination} to show that the \emph{residual problem} remains challenging over subsequent passes.
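As a worked instance of the regret exponent stated above (plain arithmetic on the formula, ignoring the polylogarithmic factors hidden by $\widetilde\Theta$):

$$\frac{1}{2} + \frac{1}{2^{B+2}-2} = \frac{2^{B}}{2^{B+1}-1}, \qquad B=1:\ \frac{2}{3}, \qquad B=2:\ \frac{4}{7}, \qquad B=3:\ \frac{8}{15}, \qquad B\to\infty:\ \frac{1}{2}.$$

A single pass thus gives regret of order $T^{2/3}$, and each additional pass moves the exponent closer to the $\frac{1}{2}$ of $\sqrt{T}$-type regret.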

Minimax Regret for Partial Monitoring: Infinite Outcomes and Rustichini’s Regret

Time: Monday, July 4, 11:21 AM GMT+1

Authors: Tor Lattimore

We show that a version of the generalised information ratio of Lattimore and Gyorgy (2020) determines the asymptotic minimax regret for all finite-action partial monitoring games provided that (a) the standard definition of regret is used but the latent space where the adversary plays is potentially infinite; or (b) the regret introduced by Rustichini (1999) is used and the latent space is finite. Our results are complemented by a number of examples. For any p ∈ [1/2, 1] there exists an infinite partial monitoring game for which the minimax regret over n rounds is n^p up to subpolynomial factors and there exist finite games for which the minimax Rustichini regret is n^(4/7) up to subpolynomial factors.

Towards Optimal Algorithms for Multi-Player Bandits without Collision Sensing Information

Time: Monday, July 4, 11:33 AM GMT+1

Authors: Wei Huang; Richard Combes; Cindy Trinh

We propose a novel algorithm for multi-player multi-armed bandits without collision sensing information. Our algorithm circumvents two problems shared by all state-of-the-art algorithms: it does not need as an input a lower bound on the minimal expected reward of an arm, and its performance does not scale inversely proportionally to the minimal expected reward. We prove a theoretical regret upper bound to justify these claims. We complement our theoretical results with numerical experiments, showing that the proposed algorithm outperforms the state of the art in practice.

Tracking Most Significant Arm Switches in Bandits

Time: Monday, July 4, 11:45 AM GMT+1

Authors: Joe Suk; Samory Kpotufe

In \emph{bandits with distribution shifts}, one aims to automatically adapt to unknown changes in reward distribution, and \emph{restart} exploration when necessary. While this problem has been studied for many years, a recent breakthrough of \cite{auer2018,auer2019} provides the first adaptive procedure to guarantee an optimal (dynamic) regret $\sqrt{LT}$, for $T$ rounds, and an unknown number $L$ of changes. However, while this rate is tight in the worst case, it remained open whether faster rates are possible, without prior knowledge, if few changes in distribution are actually \emph{severe}.

To resolve this question, we propose a new notion of \emph{significant shift}, which only counts very severe changes that clearly necessitate a restart: roughly, these are changes involving not only best arm switches, but also involving large aggregate differences in reward over time. Thus, our resulting procedure adaptively achieves rates always faster (sometimes significantly) than $O(\sqrt{ST})$, where $S\ll L$ only counts best arm switches, while at the same time, always faster than the optimal $O(V^{\frac{1}{3}}T^{\frac{2}{3}})$ when expressed in terms of \emph{total variation} $V$ (which aggregates differences over time). Our results are expressed in enough generality to also capture non-stochastic adversarial settings.

Optimization I (Monday, July 4, 10:45 AM GMT+1)

Location: Room B

Session chairs: Praneeth Netrapalli; Daniel Hsu

Mirror Descent Strikes Again: Optimal Stochastic Convex Optimization under Infinite Noise Variance

Time: Monday, July 4, 10:45 AM GMT+1

Authors: Nuri Mert Vural; Lu Yu; Krishna Balasubramanian; Stanislav Volgushev; Murat Erdogdu

We study stochastic convex optimization under infinite noise variance. Specifically, when the stochastic gradient is unbiased and has uniformly bounded $(1+\kappa)$-th moment, for some $\kappa \in (0,1]$, we quantify the convergence rate of the Stochastic Mirror Descent algorithm with a particular class of uniformly convex mirror maps, in terms of the number of iterations, dimensionality and related geometric parameters of the optimization problem. Interestingly this algorithm does not require any explicit gradient clipping or normalization, which have been extensively used in several recent empirical and theoretical works. We complement our convergence results with information-theoretic lower bounds showing that no other algorithm using only stochastic first-order oracles can achieve improved rates. Our results have several interesting consequences for devising online/streaming stochastic approximation algorithms for problems arising in robust statistics and machine learning.

The Power of Adaptivity in SGD: Self-Tuning Step Sizes with Unbounded Gradients and Affine Variance

Time: Monday, July 4, 10:57 AM GMT+1

Authors: Matthew Faw; Isidoros Tziotis; Constantine Caramanis; Aryan Mokhtari; Sanjay Shakkottai; Rachel Ward

We study convergence rates of AdaGrad-Norm as an exemplar of adaptive stochastic gradient methods (SGD), where the step sizes change based on observed stochastic gradients, for minimizing non-convex, smooth objectives. Despite their popularity, the analysis of adaptive SGD lags behind that of non-adaptive methods in this setting. Specifically, all prior works rely on some subset of the following assumptions: (i) uniformly-bounded gradient norms, (ii) uniformly-bounded stochastic gradient variance (or even noise support), (iii) conditional independence between the step size and stochastic gradient. In this work, we show that AdaGrad-Norm exhibits an order optimal convergence rate of $\mathcal{O}(\mathrm{poly}\log(T)/\sqrt{T})$ after $T$ iterations under the same assumptions as optimally-tuned non-adaptive SGD (unbounded gradient norms and affine noise variance scaling), and crucially, without needing any tuning parameters. We thus establish that adaptive gradient methods exhibit order-optimal convergence in much broader regimes than previously understood.
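For readers unfamiliar with the method, here is a minimal sketch of the AdaGrad-Norm update in its commonly stated form: a single scalar step size $\eta / b_t$, where $b_t^2$ accumulates the squared norms of all gradients seen so far. The function name `adagrad_norm`, the default values of `eta` and `b0`, and the toy quadratic objective with exact gradients are assumptions made for illustration; the abstract's setting is non-convex with stochastic gradients.

```python
import math

def adagrad_norm(grad, x0, eta=1.0, b0=0.1, steps=500):
    """Run AdaGrad-Norm: one scalar step size shared by all coordinates.

    grad: callable returning a (possibly stochastic) gradient at x, as a list.
    """
    x = list(x0)
    b_sq = b0 * b0  # b0^2 plus the running sum of squared gradient norms
    for _ in range(steps):
        g = grad(x)
        b_sq += sum(gi * gi for gi in g)   # include the current gradient
        step = eta / math.sqrt(b_sq)       # step size shrinks as gradients accumulate
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x

# Toy objective f(x) = 0.5 * ||x||^2, whose exact gradient at x is x itself.
x_final = adagrad_norm(lambda x: list(x), [3.0, -4.0])
final_norm = math.sqrt(sum(xi * xi for xi in x_final))
print(final_norm)
```

Note that the step size is computed from observed gradients only, so no smoothness constant or noise level has to be supplied, which is the sense in which the method is self-tuning.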

Stochastic Variance Reduction for Variational Inequality Methods

Time: Monday, July 4, 11:09 AM GMT+1

Authors: Ahmet Alacaoglu; Yura Malitsky

We propose stochastic variance reduced algorithms for solving convex-concave saddle point problems, monotone variational inequalities, and monotone inclusions. Our framework applies to extragradient, forward-backward-forward, and forward-reflected-backward methods both in Euclidean and Bregman setups. All proposed methods converge in exactly the same setting as their deterministic counterparts and they either match or improve the best-known complexities for solving structured min-max problems. Our results reinforce the correspondence between variance reduction in variational inequalities and minimization. We also illustrate the improvements of our approach with numerical evaluations on matrix games.
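As background, the sketch below shows the deterministic extragradient step that the framework above builds on, applied to the toy bilinear saddle problem $\min_x \max_y xy$. It is not the variance-reduced method of the paper; the step size, starting point, and helper names are illustrative assumptions. Plain gradient descent-ascent is included for contrast.

```python
import math

def operator(z):
    """Monotone operator F(x, y) = (y, -x) of the saddle problem min_x max_y x*y."""
    x, y = z
    return (y, -x)

def gradient_step(z, eta):
    """One step of simultaneous gradient descent-ascent: z - eta * F(z)."""
    fx, fy = operator(z)
    return (z[0] - eta * fx, z[1] - eta * fy)

def extragradient_step(z, eta):
    """Extrapolate to a midpoint, then update z using the operator at the midpoint."""
    mid = gradient_step(z, eta)
    fx, fy = operator(mid)
    return (z[0] - eta * fx, z[1] - eta * fy)

def run(step, z0=(1.0, 1.0), eta=0.5, iters=100):
    """Iterate `step` and return the final distance to the saddle point (0, 0)."""
    z = z0
    for _ in range(iters):
        z = step(z, eta)
    return math.hypot(z[0], z[1])

eg_dist = run(extragradient_step)
gda_dist = run(gradient_step)
print(eg_dist, gda_dist)
```

On this problem the extragradient iterates contract toward the saddle point, whereas plain descent-ascent spirals outward, which is why extragradient-type methods are the natural deterministic baselines here.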

Accelerated SGD for Non-Strongly-Convex Least Squares

Time: Monday, July 4, 11:21 AM GMT+1

Authors: Aditya Varre; Nicolas Flammarion

We consider stochastic approximation for the least squares regression problem in the non-strongly convex setting. We present the first practical algorithm that achieves the optimal prediction error rates in terms of dependence on the noise of the problem, as $O(d/t)$ while accelerating the forgetting of the initial conditions to $O(d/t^2)$. Our new algorithm is based on a simple modification of the accelerated gradient descent. We provide convergence results for both the averaged and the last iterate of the algorithm. In order to describe the tightness of these new bounds, we present a matching lower bound in the noiseless setting and thus show the optimality of our algorithm.

Making SGD Parameter-Free

Time: Monday, July 4, 11:33 AM GMT+1

Authors: Yair Carmon; Oliver Hinder

We develop an algorithm for parameter-free stochastic convex optimization (SCO) whose rate of convergence is only a double-logarithmic factor larger than the optimal rate for the corresponding known-parameter setting. In contrast, the best previously known rates for parameter-free SCO are based on online parameter-free regret bounds, which contain unavoidable excess logarithmic terms compared to their known-parameter counterparts. Our algorithm is conceptually simple, has high-probability guarantees, and is also partially adaptive to unknown gradient norms, smoothness, and strong convexity. At the heart of our results is a novel parameter-free certificate for SGD step size choice, and a time-uniform concentration result that assumes no a-priori bounds on SGD iterates.

Stochastic Linear Optimization Never Overfits with Quadratically-Bounded Losses on General Data

Time: Monday, July 4, 11:45 AM GMT+1

Authors: Matus Telgarsky

This work shows that a diverse collection of linear optimization methods, when run on general data, fail to overfit, despite lacking any explicit constraints or regularization: with high probability, their trajectories stay near the curve of optimal constrained solutions over the population distribution. This analysis is powered by an elementary but flexible proof scheme which can handle diverse settings, summarized as follows. Firstly, the data can be general: unlike other implicit bias works, it need not satisfy large margin or other structural conditions, and moreover can even arrive sequentially IID, sequentially following a Markov chain, or as a batch, and lastly it can even have heavy tails. Secondly, while the main analysis is for mirror descent, rates are also provided for the Temporal-Difference fixed-point method from reinforcement learning; all prior high probability analyses in these settings required bounded iterates, bounded updates, bounded noise, or some equivalent. Thirdly, the losses are general, and for instance the logistic and squared losses can be handled simultaneously, unlike other implicit bias works. In all of these settings, not only is low population error guaranteed with high probability, but moreover low sample complexity is guaranteed so long as there exists any low-complexity near-optimal solution, even if the global problem structure and in particular global optima have high complexity.

Plenary II (Monday, July 4, 02:00 PM GMT+1)

Location: Room A

Session chair: Maxim Raginsky

Policy Optimization for Learning Control Policies: Global Optimality and Convex Parameterization

Time: Monday, July 4, 02:00 PM GMT+1

Speaker: Maryam Fazel

Policy Optimization methods enjoy wide practical use in reinforcement learning (RL) for applications ranging from robotic manipulation to game-playing, partly because they are easy to implement and allow for richly parameterized policies. Yet their theoretical properties, from optimality to statistical complexity, are still not fully understood. To help develop a theoretical basis for these methods, and to bridge the gap between RL and control theoretic approaches, recent work has studied whether gradient-based policy optimization can succeed in designing feedback control policies.

In this talk, we start by showing the convergence and optimality of gradient-based policy optimization methods for controlling linear dynamical systems with quadratic costs (known as the LQR problem in control), where despite nonconvexity, convergence to the optimal policy occurs under mild assumptions. We then make a connection between convex parameterizations in control theory on one hand, and the Polyak-Lojasiewicz (or gradient dominance) property of the nonconvex cost function, on the other. This link between the nonconvex and convex landscapes provides insight and helps extend the results to more complex control problems.
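As a small illustration of policy optimization for LQR, the sketch below runs gradient descent directly on the feedback gain of a scalar system. Everything here is an assumption chosen for illustration: the system $x_{t+1} = a x_t + b u_t$ with $a = b = 1$, the costs $q = r = 1$, the step size, and the finite-difference gradient. A scalar instance does not exhibit the nonconvexity of the general matrix case, so this only shows the mechanics.

```python
import math

def lqr_cost(k, a=1.0, b=1.0, q=1.0, r=1.0):
    """Infinite-horizon cost of the policy u = -k*x for x' = a*x + b*u, with x0 = 1.

    Closed form (q + r*k^2) / (1 - (a - b*k)^2), valid while |a - b*k| < 1.
    """
    closed_loop = a - b * k
    assert abs(closed_loop) < 1.0, "gain must be stabilizing"
    return (q + r * k * k) / (1.0 - closed_loop * closed_loop)

def cost_gradient(k, h=1e-6):
    """Central finite-difference approximation of dC/dk."""
    return (lqr_cost(k + h) - lqr_cost(k - h)) / (2.0 * h)

k = 1.0  # stabilizing initial gain (closed loop a - b*k = 0)
for _ in range(200):
    k -= 0.1 * cost_gradient(k)  # gradient descent on the policy parameter

# Minimizer of the closed-form cost for a = b = q = r = 1 (root of k^2 + k - 1 = 0).
k_star = (math.sqrt(5.0) - 1.0) / 2.0
print(k, k_star)
```

The iterates stay stabilizing throughout and approach the optimal gain, which matches the Riccati solution for this instance.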

Maryam Fazel is the Moorthy Family Professor of Electrical and Computer Engineering at the University of Washington, with adjunct appointments in Computer Science and Engineering, Mathematics, and Statistics. Maryam received her MS and PhD from Stanford University, and her BS from Sharif University of Technology in Iran, and was a postdoctoral scholar at Caltech before joining UW. She is a recipient of the NSF Career Award, UWEE Outstanding Teaching Award, and a UAI conference Best Student Paper Award with her student. She is the director of the Institute for Foundations of Data Science (IFDS), a multi-university NSF TRIPODS Institute. Maryam serves on the Editorial board of the MOS-SIAM Book Series on Optimization, and is an Associate Editor of the SIAM Journal on Mathematics of Data Science.

Optimization II (Monday, July 4, 03:30 PM GMT+1)

Location: Room A

Session chairs: Akshay Krishnamurthy; Kfir Levy

Negative Curvature Obstructs Acceleration for Strongly Geodesically Convex Optimization, Even with Exact First-Order Oracles

Time: Monday, July 4, 03:30 PM GMT+1

Authors: Christopher Criscitiello; Nicolas Boumal

Hamilton and Moitra (2021) showed that, in certain regimes, it is not possible to accelerate Riemannian gradient descent in the hyperbolic plane if we restrict ourselves to algorithms which make queries in a (large) bounded domain and which receive gradients and function values corrupted by a (small) amount of noise. We show that acceleration remains unachievable for any deterministic algorithm which receives exact gradient and function-value information (unbounded queries, no noise). Our results hold for a large class of Hadamard manifolds including hyperbolic spaces and the symmetric space $\mathrm{SL}(n) / \mathrm{SO}(n)$ of positive definite $n \times n$ matrices of determinant one. This cements a surprising gap between the complexity of convex optimization and geodesically convex optimization: for hyperbolic spaces, Riemannian gradient descent is optimal on the class of smooth and strongly geodesically convex functions (in the regime where the condition number scales with the radius of the optimization domain). The key idea for proving the lower bound consists of perturbing squared distance functions with sums of bump functions chosen by a resisting oracle.

Big-Step-Little-Step: Efficient Gradient Methods for Objectives with Multiple Scales

Time: Monday, July 4, 03:42 PM GMT+1

Authors: Jonathan Kelner; Annie Marsden; Vatsal Sharan; Aaron Sidford; Gregory Valiant; Honglin Yuan

We provide new gradient-based methods for efficiently solving a broad class of ill-conditioned optimization problems. We consider the problem of minimizing a function $f : \mathbb{R}^d \rightarrow \mathbb{R}$ which is implicitly decomposable as the sum of $m$ unknown non-interacting smooth, strongly convex functions and provide a method which solves this problem with a number of gradient evaluations that scales (up to logarithmic factors) as the product of the square-root of the condition numbers of the components. This complexity bound (which we prove is nearly optimal) can improve almost exponentially on that of accelerated gradient methods, which grows as the square root of the condition number of $f$. Additionally, we provide efficient methods for solving stochastic, quadratic variants of this multiscale optimization problem. Rather than learn the decomposition of $f$ (which would be prohibitively expensive), our methods apply a clean recursive ``Big-Step-Little-Step'' interleaving of standard methods. The resulting algorithms use $\tilde{\mathcal{O}}(d m)$ space, are numerically stable, and open the door to a more fine-grained understanding of the complexity of convex optimization beyond condition number.
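To illustrate the gap between the two complexity measures, consider a hypothetical two-component instance (an illustration, not taken from the paper): $f(x, y) = f_1(x) + f_2(y)$, where $f_1$ is $1$-strongly convex and $2$-smooth, and $f_2$ is $10^6$-strongly convex and $2 \cdot 10^6$-smooth, with these constants tight. Then

$$\kappa_1 = \kappa_2 = 2, \qquad \prod_{i} \sqrt{\kappa_i} = 2, \qquad \kappa(f) = \frac{2 \cdot 10^{6}}{1} = 2 \cdot 10^{6}, \qquad \sqrt{\kappa(f)} \approx 1414.$$

Up to the logarithmic factors hidden in both bounds, the multiscale rate depends on the small per-component condition numbers, while the accelerated rate pays for the separation between the two scales.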

Uniform Stability for First-Order Empirical Risk Minimization

Time: Monday, July 4, 03:54 PM GMT+1

Authors: Amit Attia; Tomer Koren

We consider the problem of designing uniformly stable first-order optimization algorithms for empirical risk minimization. Uniform stability is often used to obtain generalization error bounds for optimization algorithms, and we are interested in a general approach to achieve it. For Euclidean geometry, we suggest a black-box conversion which, given a smooth optimization algorithm, produces a uniformly stable version of the algorithm while maintaining its convergence rate up to logarithmic factors. Using this reduction we obtain a (nearly) optimal algorithm for smooth optimization with convergence rate $\tilde{O}(1/T^2)$ and uniform stability $O(T^2/n)$, resolving an open problem of Chen et al. (2018); Attia and Koren (2021). For more general geometries, we develop a variant of Mirror Descent for smooth optimization with convergence rate $\tilde{O}(1/T)$ and uniform stability $O(T/n)$, leaving open the question of devising a general conversion method as in the Euclidean case.

Stability vs. Implicit Bias of Gradient Methods on Separable Data and Beyond

Time: Monday, July 4, 04:06 PM GMT+1

Authors: Matan Schliserman; Tomer Koren

An influential line of recent work has focused on the generalization properties of unregularized gradient-based learning procedures applied to separable linear classification with exponentially-tailed loss functions. The ability of such methods to generalize well has been attributed to their implicit bias towards large margin predictors, both asymptotically as well as in finite time. We give an additional explanation for this generalization and relate it to two simple properties of the optimization objective, that we refer to as realizability and self-boundedness. We introduce a general setting of unconstrained stochastic convex optimization with these properties, and analyze generalization of gradient methods through the lens of algorithmic stability. In this broader setting, we obtain sharp stability bounds for gradient descent and stochastic gradient descent which apply even for a very large number of gradient steps, and use them to derive general generalization bounds for these algorithms. Finally, as direct applications of the general bounds, we return to the setting of linear classification with separable data and establish several novel test loss and test accuracy bounds for gradient descent and stochastic gradient descent for a variety of loss functions with different tail decay rates.

Sharper Rates for Separable Minimax and Finite Sum Optimization via Primal-Dual Extragradient Methods

Time: Monday, July 4, 04:18 PM GMT+1

Authors: Yujia Jin; Aaron Sidford; Kevin Tian

We design accelerated algorithms with improved rates for several fundamental classes of optimization problems. Our algorithms all build upon techniques related to the analysis of primal-dual extragradient methods via relative Lipschitzness proposed recently by Cohen, Sidford, and Tian '21.

(1) We study separable minimax optimization problems of the form $\min_x \max_y f(x) - g(y) + h(x, y)$, where $f$ and $g$ have smoothness and strong convexity parameters $(L^x, \mu^x)$, $(L^y, \mu^y)$, and $h$ is convex-concave with a $(\Lambda^{xx}, \Lambda^{xy}, \Lambda^{yy})$-blockwise operator norm bounded Hessian. We provide an algorithm using $\tilde{O}(\sqrt{\frac{L^x}{\mu^x}} + \sqrt{\frac{L^y}{\mu^y}} + \frac{\Lambda^{xx}}{\mu^x} + \frac{\Lambda^{xy}}{\sqrt{\mu^x\mu^y}} + \frac{\Lambda^{yy}}{\mu^y})$ gradient queries. Notably, for convex-concave minimax problems with bilinear coupling (e.g. quadratics), where $\Lambda^{xx} = \Lambda^{yy} = 0$, our rate matches a lower bound of Zhang, Hong, and Zhang '19.

(2) We study finite sum optimization problems of the form $\min_x \frac 1 n \sum_{i \in [n]} f_i(x)$, where each $f_i$ is $L_i$-smooth and the overall problem is $\mu$-strongly convex. We provide an algorithm using $\tilde{O}(n + \sum_{i \in [n]} \sqrt{\frac{L_i}{n\mu}} )$ gradient queries. Notably, when the smoothness bounds $\{L_i\}_{i\in[n]}$ are non-uniform, our rate improves upon accelerated SVRG (Lin et al., Frostig et al. '15) and Katyusha (Allen-Zhu '17) by up to a $\sqrt{n}$ factor.

(3) We generalize our algorithms for minimax and finite sum optimization to solve a natural family of minimax finite sum optimization problems at an accelerated rate, encapsulating both above results up to a logarithmic factor.

Virtual session VI (Monday, July 4, 03:30 PM GMT+1)

Location: Room B

Session chairs: Daniel Hsu; Yevgeny Seldin

Adaptive Bandit Convex Optimization with Heterogeneous Curvature

Time: Monday, July 4, 03:30 PM GMT+1

Authors: Haipeng Luo; Mengxiao Zhang; Peng Zhao

We consider the problem of adversarial bandit convex optimization, that is, online learning over a sequence of arbitrary convex loss functions with only one function evaluation for each of them. While all previous works assume known and homogeneous curvature on these loss functions, we study a heterogeneous setting where each function has its own curvature that is only revealed after the learner makes a decision. We develop an efficient algorithm that is able to adapt to the curvature on the fly. Specifically, our algorithm not only recovers or \emph{even improves} existing results for several homogeneous settings, but also leads to surprising results for some heterogeneous settings --- for example, while Hazan and Levy (2014) showed that $\tilde{O}(d^{\frac{3}{2}}\sqrt{T})$ regret is achievable for a sequence of $T$ smooth and strongly convex $d$-dimensional functions, our algorithm reveals that the same is achievable even if $T^{\frac{3}{4}}$ of them are not strongly convex, and sometimes even if a constant fraction of them are not strongly convex. Our approach is inspired by the framework of Bartlett et al. (2007) who studied a similar heterogeneous setting but with stronger gradient feedback. Extending their framework to the bandit feedback setting requires novel ideas such as lifting the feasible domain and using a logarithmically homogeneous self-concordant barrier regularizer.

Corralling a Larger Band of Bandits: A Case Study on Switching Regret for Linear Bandits

Time: Monday, July 4, 03:42 PM GMT+1

Authors: Haipeng Luo; Mengxiao Zhang; Peng Zhao; Zhi-Hua Zhou

We consider the problem of combining and learning over a set of adversarial bandit algorithms with the goal of adaptively tracking the best one on the fly. The Corral algorithm of Agarwal et al. (2017) and its variants (Foster et al., 2020a) achieve this goal with a regret overhead of order $\tilde{O}(\sqrt{MT})$ where $M$ is the number of base algorithms and $T$ is the time horizon. The polynomial dependence on $M$, however, prevents one from applying these algorithms to many applications where $M$ is $\mathrm{poly}(T)$ or even larger. Motivated by this issue, we propose a new recipe to corral a larger band of bandit algorithms whose regret overhead has only \emph{logarithmic} dependence on $M$ as long as some conditions are satisfied. As the main example, we apply our recipe to the problem of adversarial linear bandits over a $d$-dimensional $\ell_p$ unit-ball for $p \in (1,2]$. By corralling a large set of $T$ base algorithms, each starting at a different time step, our final algorithm achieves the first optimal switching regret $\tilde{O}(\sqrt{d S T})$ when competing against a sequence of comparators with $S$ switches (for some known $S$). We further extend our results to linear bandits over a smooth and strongly convex domain as well as unconstrained linear bandits.

Assemblies of Neurons Learn to Classify Well-Separated Distributions

Time: Monday, July 4, 03:54 PM GMT+1

Authors: Max Dabagia; Santosh Vempala; Christos Papadimitriou

An assembly is a large population of neurons whose synchronous firing represents a memory, concept, word, or other cognitive category. Assemblies are believed to provide a bridge between high-level cognitive phenomena and low-level neural activity. Recently, a computational system called the \emph{Assembly Calculus} (AC), with a repertoire of biologically plausible operations on assemblies, has been shown capable of simulating arbitrary space-bounded computation, but also of simulating complex cognitive phenomena such as language, reasoning, and planning. However, the mechanism whereby assemblies can mediate {\em learning} has not been known. Here we present such a mechanism, and prove rigorously that, for simple classification problems defined on distributions of labeled assemblies, a new assembly representing each class can be reliably formed in response to a few stimuli from the class; this assembly is henceforth reliably recalled in response to new stimuli from the same class. Furthermore, such class assemblies will be distinguishable as long as the respective classes are reasonably separated --- for example, when they are clusters of similar assemblies, or more generally separable with margin by a linear threshold function. To prove these results, we draw on random graph theory with dynamic edge weights to estimate sequences of activated vertices, yielding strong generalizations of previous calculations and theorems in this field over the past five years. These theorems are backed up by experiments demonstrating the successful formation of assemblies which represent concept classes on synthetic data drawn from such distributions, and also on MNIST, which lends itself to classification through one assembly per digit. Seen as a learning algorithm, this mechanism is entirely online, generalizes from very few samples, and requires only mild supervision --- all key attributes of learning in a model of the brain. We argue that this learning mechanism, supported by separate sensory pre-processing mechanisms for extracting attributes, such as edges or phonemes, from real world data, can be the basis of biological learning in cortex.

Faster Online Calibration without Randomization: Interval Forecasts and the Power of Two Choices

Time: Monday, July 4, 04:06 PM GMT+1

Authors: Chirag Gupta; Aaditya Ramdas

We study the problem of making calibrated probabilistic forecasts for a binary sequence generated by an adversarial nature. Following the seminal paper of Foster and Vohra (1998), nature is often modeled as an adaptive (online) adversary---one who sees all activity of the forecaster except the randomization that the forecaster may deploy. A number of papers have proposed randomized forecasting strategies that achieve a calibration error rate of $O(1/\sqrt{T})$, which we prove is tight in general. On the other hand, it is well known that it is not possible to be calibrated without randomization, or if nature also sees the forecaster's randomization; in both cases the calibration error could be $\Omega(1)$. Inspired by the equally seminal works on the power of two choices (Azar et al., 1994) and imprecise probability theory (Walley and Fine, 1982), we study a small variant of the standard online calibration problem. The adversary gives the forecaster the option of making two (nearby) probabilistic predictions, or equivalently an interval forecast of small width, and the endpoint closest to the revealed outcome is used to judge calibration. This power of two choices (or imprecise forecast) accords the forecaster with significant power---we show that a faster calibration rate of $O(1/T)$ can be achieved even without deploying any randomization.

Eigenspace Restructuring: A Principle of Space and Frequency in Neural Networks

Time: Monday, July 4, 04:18 PM GMT+1

Authors: Lechao Xiao

Understanding the fundamental principles behind the massive success of neural networks is one of the most important open questions in deep learning. However, due to the highly complex nature of the problem, progress has been relatively slow. In this note, through the lens of infinite-width networks, a.k.a. neural kernels, we present one such principle resulting from hierarchical localities. It is well-known that the eigenstructure of infinite-width multilayer perceptrons (MLPs) depends solely on the concept {\it frequency}, which measures the order of interactions. We show that the topologies from deep convolutional networks (CNNs) restructure the associated eigenspaces into finer subspaces. In addition to frequency, the new structure also depends on the concept {\it space}, which measures the spatial distance among nonlinear interaction terms. The resulting fine-grained eigenstructure dramatically improves the network's learnability, empowering it to simultaneously model a much richer class of interactions, including Long-Range-Low-Frequency interactions, Short-Range-High-Frequency interactions, and various interpolations and extrapolations in-between. Additionally, model scaling can improve the resolutions of interpolations and extrapolations and, therefore, the network's learnability. Finally, we prove a sharp characterization of the generalization error for infinite-width CNNs (a.k.a. C-NTK and CNN-GP) of any depth in the high-dimensional setting. Two corollaries follow: (1) infinite-width deep CNNs can overcome the curse of dimensionality without losing their expressivity, and (2) scaling improves performance in both the finite and infinite data regimes.

High-Dimensional Projection Pursuit: Outer Bounds and Applications to Interpolation in Neural Networks

Time: Monday, July 4, 04:30 PM GMT+1

Authors: Kangjie Zhou; Andrea Montanari

Given a cloud of $n$ data points in $\R^d$, consider all projections onto $m$-dimensional subspaces of $\R^d$ and, for each such projection, the empirical distribution of the projected points. What does this collection of probability distributions look like when $n,d$ grow large?

We consider this question under the null model in which the points are i.i.d. standard Gaussian vectors, focusing on the asymptotic regime in which $n,d\to\infty$, with $n/d\to\alpha\in (0,\infty)$, while $m$ is fixed. Denoting by $\cuF_{m, \alpha}$ the set of probability distributions in $\R^m$ that arise as low-dimensional projections in this limit, we establish new outer bounds on $\cuF_{m, \alpha}$. In particular, we characterize the radius of $\cuF_{m,\alpha}$ in terms of Wasserstein distance and prove sharp bounds in terms of Kullback-Leibler divergence and R\'{e}nyi information dimension.

The previous question has applications to unsupervised learning methods, such as projection pursuit and independent component analysis. We introduce a version of the same problem that is relevant for supervised learning, and prove a sharp Wasserstein radius bound. As an application, we establish an upper bound on the interpolation threshold of two-layer neural networks with $m$ hidden neurons.

Open problem session I (Monday, July 4, 04:45 PM GMT+1)

Location: Room A

Session chair: Ciara Pike-Burke

Open Problem: Properly learning decision trees in polynomial time?
Guy Blanc; Jane Lange; Mingda Qiao; Li-Yang Tan
Open Problem: Regret Bounds for Noise-Free Kernel-Based Bandits
Sattar Vakili
Open Problem: Running time complexity of accelerated $\ell_1$-regularized PageRank
Kimon Fountoulakis; Shenghao Yang

Statistics II (Tuesday, July 5, 09:00 AM GMT+1)

Location: Room A

Session chairs: Chi Jin; Thodoris Lykouris

Learning with Metric Losses

Time: Tuesday, July 5, 09:00 AM GMT+1

Authors: Dan Tsir Cohen; Aryeh Kontorovich

We propose an efficient algorithm for learning mappings between two metric spaces, $\X$ and $\Y$. Our procedure is strongly Bayes-consistent whenever $\X$ and $\Y$ are topologically separable and $\Y$ is ``bounded in expectation'' (our term; the separability assumption can be somewhat weakened). At this level of generality, ours is the first such learnability result for unbounded loss in the agnostic setting. Our technique is based on metric medoids (a variant of Fréchet means) and presents a significant departure from existing methods, which, as we demonstrate, fail to achieve Bayes-consistency on general instance- and label-space metrics. Our proofs introduce the technique of {\em semi-stable compression}, which may be of independent interest.

Trace Norm Regularization for Multi-Task Learning with Scarce Data

Time: Tuesday, July 5, 09:12 AM GMT+1

Authors: Etienne Boursier; Mikhail Konobeev; Nicolas Flammarion

Multi-task learning leverages structural similarities between multiple tasks to learn despite very few samples. Motivated by the recent success of neural networks applied to data-scarce tasks, we consider a linear low-dimensional shared representation model. Despite an extensive literature, existing theoretical results either guarantee weak estimation rates or require a large number of samples per task. This work provides the first estimation error bound for the trace norm regularized estimator when the number of samples per task is small. The advantages of trace norm regularization for learning data-scarce tasks extend to meta-learning and are confirmed empirically on synthetic datasets.

An Efficient Minimax Optimal Estimator for Multivariate Convex Regression

Time: Tuesday, July 5, 09:24 AM GMT+1

Authors: Gil Kur; Eli Putterman

We study the computational aspects of the task of multivariate convex regression in dimension $d \geq 5$. We present the first computationally efficient minimax optimal (up to logarithmic factors) estimators for the tasks of (i) $L$-Lipschitz convex regression and (ii) $\Gamma$-bounded convex regression under polytopal support. The proof of the correctness of these estimators uses a variety of tools from different disciplines, among them empirical process theory, stochastic geometry, and potential theory.

On the Well-Spread Property and Its Relation to Linear Regression

Time: Tuesday, July 5, 09:36 AM GMT+1

Authors: Hongjie Chen; Tommaso d'Orsi

We consider the robust linear regression model $\bm y = X \beta^* +\bm \eta$, where an adversary oblivious to the design $X\in \R^{n\times d}$ may choose $\bm \eta$ to corrupt all but a (possibly vanishing) fraction of the observations $\bm y$ in an arbitrary way. Recent work \cite{d2021consistent, d2021consistentICML} has introduced efficient algorithms for consistent recovery of the parameter vector. These algorithms crucially rely on the design matrix being well-spread (a matrix is well-spread if its column span is far from any sparse vector).

In this paper, we argue that the well-spread property is information theoretically necessary for recovery. This shows a statistical price to pay to solve linear regression with oblivious outliers.

We further investigate the average-case time complexity of certifying well-spreadness of random matrices. We show that it is possible to efficiently certify whether a given $n$-by-$d$ random matrix is well-spread if the number of observations is quadratic in the ambient dimension $n\geq \Omega(d^2)$. We complement this result by showing rigorous evidence ---in the form of a lower bound against low-degree polynomials--- of the computational hardness of the certification problem when the number of observations is small, highlighting a family of design matrices for which regression is possible but we cannot efficiently certify that the required spreadness condition is satisfied.

Graphs (Tuesday, July 5, 09:00 AM GMT+1)

Location: Room B

Session chairs: Eric Price; Ciara Pike-Burke

Robust Estimation for Random Graphs

Time: Tuesday, July 5, 09:00 AM GMT+1

Authors: Jayadev Acharya; Ayush Jain; Gautam Kamath; Ananda Theertha Suresh; Huanyu Zhang

We study the problem of robustly estimating the parameter $p$ of an Erd\H{o}s-R\'enyi random graph on $n$ nodes, where a $\gamma$ fraction of nodes may be adversarially corrupted.
After showing the deficiencies of canonical estimators, we design a computationally-efficient spectral algorithm which estimates $p$ up to accuracy $\tilde O(\sqrt{p(1-p)}/n + \gamma\sqrt{p(1-p)} /\sqrt{n}+ \gamma/n)$ for $\gamma < 1/60$.
Furthermore, we give an inefficient algorithm with similar accuracy for all $\gamma<1/2$, the information-theoretic limit.
Finally, we prove a nearly-matching statistical lower bound, showing that the error of our algorithms is optimal up to logarithmic factors.

Learning Low Degree Hypergraphs

Time: Tuesday, July 5, 09:12 AM GMT+1

Authors: Eric Balkanski; Oussama Hanguir; Shatian Wang

We study the problem of learning a hypergraph via edge detecting queries. In this problem, a learner queries subsets of vertices of a hidden hypergraph and observes whether these subsets contain an edge or not. In general, learning a hypergraph with $m$ edges of maximum size $d$ requires $\Omega((2m/d)^{d/2})$ queries. In this paper, we aim to identify families of hypergraphs that can be learned without suffering from a query complexity that grows exponentially in the size of the edges.

We show that hypermatchings and low-degree near-uniform hypergraphs with $n$ vertices are learnable with $\mathrm{poly}(n)$ queries. For learning hypermatchings (hypergraphs of maximum degree $\Delta = 1$), we give an $O(\log^3 n)$-round algorithm with $O(n \log^5 n)$ queries. We complement this upper bound by showing that there are no algorithms with $\mathrm{poly}(n)$ queries that learn hypermatchings in $o(\log \log n)$ adaptive rounds. For hypergraphs with maximum degree $\Delta$ and edge size ratio $\rho$, we give a non-adaptive algorithm with $O((2n)^{\rho \Delta+1} \log^2 n)$ queries. To the best of our knowledge, these are the first algorithms with $\mathrm{poly}(n, m)$ query complexity for learning non-trivial families of hypergraphs that have a super-constant number of edges of arbitrarily large size.
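
As a rough illustration of the query model described in this abstract, the following Python sketch builds an edge-detecting oracle for a small hidden hypergraph. The helper name `make_edge_detecting_oracle` and the example hypergraph are hypothetical and chosen only for illustration; the sketch shows the query model itself, not the authors' learning algorithms.

```python
def make_edge_detecting_oracle(hyperedges):
    """Return a query function for a hidden hypergraph (illustrative helper)."""
    edges = [frozenset(e) for e in hyperedges]

    def query(subset):
        # True iff some hidden hyperedge lies entirely inside the queried subset.
        queried = set(subset)
        return any(edge <= queried for edge in edges)

    return query


# Hypothetical hidden hypermatching (maximum degree 1) on vertices 0..7.
oracle = make_edge_detecting_oracle([{0, 1, 2}, {4, 5}])
print(oracle({0, 1, 2, 7}))  # the query contains the edge {0, 1, 2}
print(oracle({0, 1, 4}))     # the query contains no complete edge
```

A learner only ever sees the Boolean answers of `query`, which is why the number and adaptivity of queries are the natural complexity measures here.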

Community Recovery in the Degree-Heterogeneous Stochastic Block Model

Time: Tuesday, July 5, 09:24 AM GMT+1

Authors: Vincent Cohen-Addad; Frederik Mallmann-Trenn; David Saulpic

We consider the problem of recovering communities in a random directed graph with planted communities. To model real-world directed graphs such as the Twitter or Instagram graphs that exhibit very heterogeneous degree sequences, we introduce the Degree-Heterogeneous Stochastic Block Model (DHSBM), a generalization of the classic Stochastic Block Model (SBM), where the vertex set is partitioned into communities and each vertex $u$ has two (unknown) associated probabilities, $p_u$ and $q_u$, $p_u > q_u$.
An arc from $u$ to $v$ is generated with probability $p_u$ if $u$ and $v$ are in the same community and with probability $q_u$ otherwise.
Given a graph generated from this model, the goal is to retrieve the communities.

The DHSBM makes it possible to generate graphs with planted communities and heterogeneous degree distributions, an important feature of real-world networks.

In the case where there are two communities, we present an iterative greedy linear-time algorithm that recovers them whenever $\min_u \frac{p_u - q_u}{\sqrt{p_u}} = \Omega(\sqrt{\log (n)/n})$. We also show that, up to a constant, this condition is necessary.
Our results also extend to the standard (undirected) SBM, where $p_u = p$ and $q_u= q$ for all nodes $u$. Our algorithm is the first linear-time algorithm that exactly recovers the communities at the asymptotic information-theoretic threshold, improving over previous near-linear time spectral approaches.
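
The generative model described in this abstract can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the function name `sample_dhsbm` is hypothetical, self-loops are excluded (the abstract does not say how they are treated), and the recovery algorithm itself is not shown.

```python
import random


def sample_dhsbm(community, p, q, seed=0):
    """Sample the arc set of a directed graph from a DHSBM-style model.

    community[u] is the community label of vertex u; p[u] and q[u] are the
    within- and across-community arc probabilities of vertex u.
    """
    rng = random.Random(seed)
    n = len(community)
    arcs = set()
    for u in range(n):
        for v in range(n):
            if u == v:
                continue  # assumption: no self-loops
            prob = p[u] if community[u] == community[v] else q[u]
            if rng.random() < prob:
                arcs.add((u, v))
    return arcs


# Degenerate example: p_u = 1 and q_u = 0 yield exactly the within-community arcs.
arcs = sample_dhsbm([0, 0, 1, 1], [1.0] * 4, [0.0] * 4)
print(sorted(arcs))
```

Because the probabilities depend on the tail vertex $u$ only, out-degrees can differ widely between vertices of the same community, which is the heterogeneity the model is meant to capture.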

Exact Community Recovery in Correlated Stochastic Block Models

Time: Tuesday, July 5, 09:36 AM GMT+1

Authors: Julia Gaudio; Miklos Racz; Anirudh Sridhar

We consider the problem of learning latent community structure from multiple correlated networks. We study edge-correlated stochastic block models with two balanced communities, focusing on the regime where the average degree is logarithmic in the number of vertices. Our main result derives the precise information-theoretic threshold for exact community recovery using multiple correlated graphs. This threshold captures the interplay between the community recovery and graph matching tasks. In particular, we uncover and characterize a region of the parameter space where exact community recovery is possible using multiple correlated graphs, even though (1) this is information-theoretically impossible using a single graph and (2) exact graph matching is also information-theoretically impossible. In this regime, we develop a novel algorithm that carefully synthesizes algorithms from the community recovery and graph matching literatures.

Open problem session II (Tuesday, July 5, 09:55 AM GMT+1)

Location: Room A

Session chair: Benjamin Guedj

Open Problem: Do you pay for Privacy in Online learning?
Amartya Sanyal; Giorgia Ramponi
Open Problem: Better Differentially Private Learning Algorithms with Margin Guarantees
Raef Bassily; Mehryar Mohri; Ananda Theertha Suresh
Open Problem: Finite-Time Instance Dependent Optimality for Stochastic Online Learning with Feedback Graphs
Teodor Vanislavov Marinov; Mehryar Mohri; Julian Zimmert
Open Problem: Optimal Best Arm Identification with Fixed-Budget
Chao Qin

Information theory (Tuesday, July 5, 10:45 AM GMT+1)

Location: Room A

Session chairs: Daniel Soudry; Oliver Hinder

Minimax Regret on Patterns Using Kullback-Leibler Divergence Covering

Time: Tuesday, July 5, 10:45 AM GMT+1

Authors: Jennifer Tang

This paper considers the problem of finding a tighter upper bound on the minimax regret of patterns, a class used to study large-alphabet distributions which avoids infinite asymptotic regret and redundancy. Our method for finding upper bounds for minimax regret uses cover numbers with Kullback-Leibler (KL) divergence as the distance. Compared to existing results by Acharya et al. (2013), we are able to improve the power of the exponent on the logarithmic term, giving a minimax regret bound which matches the best known minimax redundancy bound on patterns.

Sharp Constants in Uniformity Testing via the Huber Statistic

Time: Tuesday, July 5, 10:57 AM GMT+1

Authors: Shivam Gupta; Eric Price

Uniformity testing is one of the most well-studied problems in property testing, with many known test statistics, including ones based on counting collisions, singletons, and the empirical TV distance. It is known that the optimal sample complexity to distinguish the uniform distribution on $m$ elements from any $\eps$-far distribution with $1-\delta$ probability is $n = \Theta(\frac{\sqrt{m \log (1/\delta)}}{\eps^2} + \frac{\log (1/\delta)}{\eps^2})$, which is achieved by the empirical TV tester. Yet in simulation, these theoretical analyses are misleading: in many cases, they do not correctly rank order the performance of existing testers, even in an asymptotic regime of all parameters tending to $0$ or $\infty$.

We explain this discrepancy by studying the \emph{constant factors} required by the algorithms. We show that the collisions tester achieves a sharp maximal constant in the number of standard deviations of separation between uniform and non-uniform inputs. We then introduce a new tester based on the Huber loss, and show that it not only matches this separation, but also has tails corresponding to a Gaussian with this separation. This leads to a sample complexity of $(1 + o(1))\frac{\sqrt{m \log (1/\delta)}}{\eps^2}$ in the regime where this term is dominant, unlike all other existing testers.
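
For readers unfamiliar with the classical statistics named in this abstract, the Python sketch below computes the collision count, the singleton count, and the empirical TV distance to uniform from a sample. The function name `uniformity_statistics` is hypothetical, samples are assumed to be integers in `range(m)`, and the Huber-based statistic is deliberately omitted because its exact definition is not given here.

```python
from collections import Counter


def uniformity_statistics(samples, m):
    """Return (collisions, singletons, empirical TV distance to uniform on m elements)."""
    n = len(samples)
    counts = Counter(samples)
    # Number of unordered pairs of samples that landed on the same element.
    collisions = sum(c * (c - 1) // 2 for c in counts.values())
    # Number of elements observed exactly once.
    singletons = sum(1 for c in counts.values() if c == 1)
    # TV distance: half the L1 distance between empirical frequencies and 1/m.
    seen = sum(abs(c / n - 1.0 / m) for c in counts.values())
    unseen = (m - len(counts)) * (1.0 / m)
    tv = 0.5 * (seen + unseen)
    return collisions, singletons, tv


stats = uniformity_statistics([0, 0, 1, 2], m=4)
print(stats)
```

Each tester thresholds one of these quantities; the abstract's point is that their constant factors, not just their asymptotic rates, determine which performs best in practice.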

On the Memory Complexity of Uniformity Testing

Time: Tuesday, July 5, 11:09 AM GMT+1

Authors: Tomer Berg; Or Ordentlich; Ofer Shayevitz

In this paper we consider the problem of uniformity testing with limited memory. We observe a sequence of independent identically distributed random variables drawn from a distribution $p$ over $[n]$, which is either uniform or is $\eps$-far from uniform under the total variation distance, and our goal is to determine the correct hypothesis. At each time point we are allowed to update the state of a finite-memory machine with $S$ states, where each state of the machine is assigned one of the hypotheses, and we are interested in obtaining an asymptotic probability of error at most $0<\delta<1/2$ uniformly under both hypotheses.

The main contribution of this paper is deriving upper and lower bounds on the number of states $S$ needed in order to achieve a constant error probability $\delta$, as a function of $n$ and $\eps$, where our upper bound is $O(\frac{n\log n}{\eps})$ and our lower bound is $\Omega (n+\frac{1}{\eps})$. Prior works in the field have almost exclusively used collision counting for upper bounds, and the Paninski mixture for lower bounds. Somewhat surprisingly, in the limited memory with unlimited samples setup, the optimal solution does not involve counting collisions, and the Paninski prior is not hard, thus different proof techniques are needed in order to attain our bounds.

On the Role of Channel Capacity in Learning Gaussian Mixture Models

Time: Tuesday, July 5, 11:21 AM GMT+1

Authors: Elad Romanov; Or Ordentlich; Tamir Bendory

This paper studies the sample complexity of learning the $k$ unknown centers of a balanced Gaussian mixture model (GMM) in $\mathbb{R}^d$ with spherical covariance matrix $\sigma^2\bm{I}$. In particular, we are interested in the following question: what is the maximal noise level $\sigma^2$, for which the sample complexity is essentially the same as when estimating the centers from labeled measurements? To that end, we restrict attention to a Bayesian formulation of the problem, where the centers are uniformly distributed on the sphere $\sqrt{d}\mathcal{S}^{d-1}$. Our main results characterize the \emph{exact noise threshold} $\sigma^2$ below which the GMM learning problem, in the large system limit $d,k\to\infty$, is as easy as learning from labeled observations, and above which it is substantially harder. The threshold occurs at $\frac{\log k}{d} = \frac12\log\left( 1+\frac{1}{\sigma^2} \right)$, which is the capacity of the additive white Gaussian noise (AWGN) channel.
Thinking of the set of $k$ centers as a code, this noise threshold can be interpreted as the largest noise level for which the error probability of the code over the AWGN channel is small. Previous works on the GMM learning problem have identified the \emph{minimum distance} between the centers as a key parameter in determining the statistical difficulty of learning the corresponding GMM.
While our results are only proved for GMMs whose centers are uniformly distributed over the sphere, they hint that perhaps it is the decoding error probability associated with the center constellation as a channel code that determines the statistical difficulty of learning the corresponding GMM, rather than just the minimum distance.
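
The threshold equation in this abstract can be solved for the noise level: $\frac{\log k}{d} = \frac12\log\left(1+\frac{1}{\sigma^2}\right)$ gives $\sigma^2 = 1/(k^{2/d}-1)$. The Python sketch below evaluates both sides numerically. The function names are hypothetical, natural logarithms are assumed on both sides, and the values of $k$ and $d$ are illustrative only (the result itself is stated in the limit $d,k\to\infty$).

```python
import math


def awgn_capacity(sigma2):
    """Capacity (in nats) of the AWGN channel with unit signal power and noise sigma2."""
    return 0.5 * math.log(1.0 + 1.0 / sigma2)


def threshold_noise_level(k, d):
    """Solve log(k)/d = 0.5*log(1 + 1/sigma2) for sigma2; requires k > 1."""
    rate = math.log(k) / d
    # expm1(2*rate) equals k**(2/d) - 1, computed stably for small rates.
    return 1.0 / math.expm1(2.0 * rate)


sigma2_star = threshold_noise_level(k=1000, d=50)
print(sigma2_star, awgn_capacity(sigma2_star), math.log(1000) / 50)
```

Read as a coding statement, `math.log(k) / d` is the rate of the "code" formed by the $k$ centers, and the threshold is where that rate meets capacity.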

Rate-Distortion Theoretic Generalization Bounds for Stochastic Learning Algorithms

Time: Tuesday, July 5, 11:33 AM GMT+1

Authors: Milad Sefidgaran; Amin Gohari; Gaël Richard; Umut Simsekli

Understanding generalization in modern machine learning settings has been one of the major challenges in statistical learning theory. In this context, recent years have witnessed the development of various generalization bounds suggesting different complexity notions such as the mutual information between the data sample and the algorithm output, compressibility of the hypothesis space, and the fractal dimension of the hypothesis space. While these bounds have illuminated the problem at hand from different angles, their suggested complexity notions might appear unrelated, thereby restricting their high-level impact. In this study, we prove novel generalization bounds through the lens of rate-distortion theory, and explicitly relate the concepts of mutual information, compressibility, and fractal dimensions in a single mathematical framework. Our approach consists of (i) defining a generalized notion of compressibility by using source coding concepts, and (ii) showing that the 'compression error rate' can be linked to the generalization error both in expectation and with high probability. We show that in the 'lossless compression' setting, we recover and improve existing mutual information-based bounds, whereas a 'lossy compression' scheme allows us to link generalization to the rate-distortion dimension - a particular notion of fractal dimension. Our results bring a more unified perspective on generalization and open up several future research directions.

Statistical and Computational Phase Transitions in Group Testing

Time: Tuesday, July 5, 11:45 AM GMT+1

Authors: Amin Coja-Oghlan; Oliver Gebhard; Max Hahn-Klimroth; Alexander Wein; Ilias Zadik

We study the group testing problem where the goal is to identify a set of k infected individuals carrying a rare disease within a population of size n, based on the outcomes of pooled tests which return positive whenever there is at least one infected individual in the tested group. We consider two different simple random procedures for assigning individuals to tests: the constant-column design and Bernoulli design.

Our first set of results concerns the fundamental statistical limits. For the constant-column design, we give a new information-theoretic lower bound which implies that the proportion of correctly identifiable infected individuals undergoes a sharp ``all-or-nothing'' phase transition when the number of tests crosses a particular threshold. For the Bernoulli design, we determine the precise number of tests required to solve the associated detection problem (where the goal is to distinguish between a group testing instance and pure noise), improving both the upper and lower bounds of Truong, Aldridge, and Scarlett (2020).

For both group testing models, we also study the power of computationally efficient (polynomial-time) inference procedures. We determine the precise number of tests required for the class of low-degree polynomial algorithms to solve the detection problem. This provides evidence for an inherent computational-statistical gap in both the detection and recovery problems at small sparsity levels. Notably, our evidence is contrary to that of Iliopoulos and Zadik (2021), who predicted the absence of a computational-statistical gap in the Bernoulli design.
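
To make the testing model in this abstract concrete, here is a minimal Python sketch of the Bernoulli design: each individual joins each pooled test independently with some probability, and a test is positive whenever its pool contains at least one infected individual. The function names and the inclusion probability `q` are hypothetical illustration choices, and the simple elimination step shown is a generic baseline, not the procedures analyzed by the authors.

```python
import random


def bernoulli_group_tests(n, infected, num_tests, q, seed=0):
    """Simulate pooled tests where each individual joins each test with probability q."""
    rng = random.Random(seed)
    infected = set(infected)
    pools, outcomes = [], []
    for _ in range(num_tests):
        pool = {i for i in range(n) if rng.random() < q}
        pools.append(pool)
        # A test is positive iff the pool contains at least one infected individual.
        outcomes.append(len(pool & infected) > 0)
    return pools, outcomes


def definitely_healthy(pools, outcomes):
    """Anyone who appears in a negative test cannot be infected."""
    cleared = set()
    for pool, positive in zip(pools, outcomes):
        if not positive:
            cleared |= pool
    return cleared


pools, outcomes = bernoulli_group_tests(n=50, infected={3, 17}, num_tests=30, q=0.1, seed=1)
cleared = definitely_healthy(pools, outcomes)
print(len(cleared), "individuals cleared by negative tests")
```

The questions studied in the abstract concern how many such tests are needed, statistically and for efficient algorithms, to detect or recover the infected set.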

Optimization III (Tuesday, July 5, 10:45 AM GMT+1)

Location: Room B

Session chairs: Ohad Shamir; Praneeth Netrapalli

EM's Convergence in Gaussian Latent Tree Models

Time: Tuesday, July 5, 10:45 AM GMT+1

Authors: Yuval Dagan; Vardis Kandiros; Constantinos Daskalakis

We study the optimization landscape of the log-likelihood function and the convergence of the Expectation-Maximization (EM) algorithm in latent Gaussian tree models, i.e., tree-structured Gaussian graphical models whose leaf nodes are observable and non-leaf nodes are unobservable. We show that the unique non-trivial stationary point of the population log-likelihood is its global maximum, and establish that the expectation-maximization algorithm is guaranteed to converge to it in the single latent variable case. Our results for the landscape of the log-likelihood function in general latent tree models provide support for the extensive practical use of maximum-likelihood-based methods in this setting. Our results for the expectation-maximization algorithm extend an emerging line of work on obtaining global convergence guarantees for this celebrated algorithm. We show our results for the non-trivial stationary points of the log-likelihood by arguing that a certain system of polynomial equations obtained from the EM updates has a unique non-trivial solution. The global convergence of the EM algorithm follows by arguing that all trivial fixed points are higher-order saddle points.

The Dynamics of Riemannian Robbins-Monro Algorithms

Time: Tuesday, July 5, 10:57 AM GMT+1

Authors: Mohammad Karimi; Ya-Ping Hsieh; Panayotis Mertikopoulos; Andreas Krause

Many important learning algorithms, such as stochastic gradient methods, are often deployed to solve nonlinear problems on Riemannian manifolds. Motivated by these applications, we propose a family of Riemannian algorithms generalizing and extending the seminal stochastic approximation framework of Robbins and Monro (1951). Compared to their Euclidean counterparts, Riemannian iterative algorithms are much less understood due to the lack of a global linear structure on the manifold. We overcome this difficulty by introducing an extended Fermi coordinate frame which allows us to map the asymptotic behavior of the proposed Riemannian Robbins–Monro (RRM) class of algorithms to that of an associated deterministic dynamical system under very mild assumptions on the underlying manifold. In so doing, we provide a general template of almost sure convergence results that mirrors and extends the existing theory for Euclidean Robbins-Monro schemes, albeit with a significantly more involved analysis that requires a number of new geometric ingredients. We showcase the flexibility of the proposed RRM framework by using it to establish the convergence of a retraction-based analogue of the popular optimistic / extra-gradient methods for solving minimization problems and games, and we provide a unified treatment for their convergence.

Two-Sided Weak Submodularity for Matroid Constrained Optimization and Regression

Time: Tuesday, July 5, 11:09 AM GMT+1

Authors: Justin Ward; Theophile Thiery

We study the following problem: Given a variable of interest, we would like to find a best linear predictor for it by choosing a subset of k relevant variables obeying a matroid constraint. This problem is a natural generalization of subset selection problems where it is necessary to spread observations amongst multiple different classes. We derive new, strengthened guarantees for this problem by improving the analysis of the residual random greedy algorithm and by developing a novel distorted local-search algorithm. To quantify our approximation guarantees, we refine the definition of weak submodularity by Das and Kempe (2011) and introduce the notion of an upper submodularity ratio, which we connect to the minimum k-sparse eigenvalue of the covariance matrix. More generally, we look at the problem of maximizing a set function f with lower and upper submodularity ratio $\gamma$ and $\beta$ under a matroid constraint. For this problem, our algorithms have asymptotic approximation guarantee 1/2 and (1 - 1/e) as the function is closer to being submodular. As a second application, we show that the Bayesian A-optimal design objective falls into our framework, leading to new guarantees for this problem as well.

Fast Algorithm for Overcomplete Order-3 Tensor Decomposition

Time: Tuesday, July 5, 11:21 AM GMT+1

Authors: Jingqiu Ding; Tommaso d'Orsi; Chih-Hung Liu; David Steurer; Stefan Tiegel

We develop the first fast spectral algorithm to decompose a random third-order tensor over $\mathbb{R}^d$ of rank up to $O(d^{3/2}/\mathrm{polylog}(d))$. Our algorithm only involves simple linear algebra operations and can recover all components in time $O(d^{6.05})$ under the current matrix multiplication time.

Non-Convex Optimization with Certificates and Fast Rates Through Kernel Sums of Squares

Time: Tuesday, July 5, 11:33 AM GMT+1

Authors: Blake Woodworth; Francis Bach; Alessandro Rudi

We consider potentially non-convex optimization problems, for which optimal rates of approximation depend on the dimension of the parameter space and the smoothness of the function to be optimized. In this paper, we propose an algorithm that achieves close to optimal a priori computational guarantees, while also providing a posteriori certificates of optimality. Our general formulation builds on infinite-dimensional sums-of-squares and Fourier analysis, and is instantiated on the minimization of periodic functions.

Plenary III (Tuesday, July 5, 02:00 PM GMT+1)

Location: Room A

Session chair: Maxim Raginsky

Robust learning from untrusted sources: The best things in life are (almost) free

Time: Tuesday, July 5, 02:00 PM GMT+1

Speaker: Alon Orlitsky

In many applications, including natural language processing, sensor networks, collaborative filtering, and federated learning, data are collected from different sources, some potentially corrupt, biased, or even adversarial. Learning algorithms for this setting have therefore garnered considerable recent attention. We develop a general framework for robust learning from untrusted sources, and determine the least number of samples required for robust density estimation and classification over both discrete and continuous domains. Perhaps surprisingly, we show that robust learning can be achieved with essentially the same number of samples as required for genuine data. For the important problems of learning discrete and piecewise-polynomial densities, and of interval-based classification, we achieve these limits with polynomial-time algorithms. Based on joint work with Ayush Jain.

Alon Orlitsky received B.Sc. degrees in Mathematics and Electrical Engineering from Ben Gurion University, and M.Sc. and Ph.D. degrees in Electrical Engineering from Stanford University. After a decade with the Communications Analysis Research Department at Bell Laboratories and a year at D.E. Shaw and Company, he joined the University of California San Diego, where he is currently a professor of Electrical and Computer Engineering and of Computer Science and Engineering and holds the Qualcomm Chair for Information Theory and its Applications. His research concerns information theory, statistical modeling, and machine learning, focusing on fundamental limits and practical algorithms for extracting knowledge from data. Among other distinctions, Alon is a recipient of the 2021 Information Theory Society Claude E. Shannon Award and a co-recipient of the 2017 ICML Best Paper Honorable Mention Award, the 2015 NeurIPS Best Paper Award, the 2006 Information Theory Society Paper Award, and the 1992 IEEE W.R.G. Baker Award.

Reinforcement learning and control (Tuesday, July 5, 03:30 PM GMT+1)

Location: Room A

Session chairs: Gergely Neu; Tor Lattimore

Thompson Sampling Achieves $\tilde{\mathcal{O}}(\sqrt{T})$ Regret in Linear Quadratic Control

Time: Tuesday, July 5, 03:30 PM GMT+1

Authors: Taylan Kargin; Sahin Lale; Kamyar Azizzadenesheli; Animashree Anandkumar; Babak Hassibi

Thompson Sampling (TS) is an efficient method for decision-making under uncertainty, where an action is sampled from a carefully prescribed distribution which is updated based on the observed data. In this work, we study the problem of adaptive control of stabilizable linear-quadratic regulators (LQRs) using TS, where the system dynamics are unknown. Previous works have established that $\tilde{\mathcal{O}}(\sqrt{T})$ frequentist regret is optimal for the adaptive control of LQRs. However, the existing methods either work only in restrictive settings, require a priori known stabilizing controllers, or utilize computationally intractable approaches. We propose an efficient TS algorithm for the adaptive control of LQRs, TS-based Adaptive Control (TSAC), that attains $\tilde{\mathcal{O}}(\sqrt{T})$ regret, even for multidimensional systems, thereby solving the open problem posed in Abeille and Lazaric (2018). TSAC does not require an a priori known stabilizing controller and achieves fast stabilization of the underlying system by effectively exploring the environment in the early stages. Our result hinges on developing a novel lower bound on the probability that the TS provides an optimistic sample. By carefully prescribing an early exploration strategy and a policy update rule, we show that TS achieves order-optimal regret in adaptive control of multidimensional stabilizable LQRs. We empirically demonstrate the performance and the efficiency of the proposed algorithm in several adaptive control tasks.

Efficient Online Linear Control with Stochastic Convex Costs and Unknown Dynamics

Time: Tuesday, July 5, 03:42 PM GMT+1

Authors: Asaf Cassel; Alon Cohen; Tomer Koren

We consider the problem of controlling an unknown linear dynamical system under a stochastic convex cost and full feedback of both the state and cost function. We present a computationally efficient algorithm that attains an optimal $\sqrt{T}$ regret-rate against the best stabilizing linear controller. In contrast to previous work, our algorithm is based on the Optimism in the Face of Uncertainty paradigm. This results in a substantially improved computational complexity and a simpler analysis.

Learning to Control Linear Systems Can Be Hard

Time: Tuesday, July 5, 03:54 PM GMT+1

Authors: Anastasios Tsiamis; Ingvar Ziemann; Manfred Morari; Nikolai Matni; George Pappas

In this paper, we study the statistical difficulty of learning to control linear systems. We focus on two standard benchmarks, the sample complexity of stabilization, and the regret of the online learning of the Linear Quadratic Regulator (LQR). Prior results state that the statistical difficulty for both benchmarks scales polynomially with the system state dimension up to system-theoretic quantities. However, this does not reveal the whole picture. By utilizing minimax lower bounds for both benchmarks, we prove that there exist non-trivial classes of systems for which learning complexity scales dramatically, i.e., exponentially, with the system dimension. This situation arises in the case of underactuated systems, i.e., systems with fewer inputs than states. Such systems are structurally difficult to control and their system-theoretic quantities can scale exponentially with the system dimension, dominating learning complexity. Under some additional structural assumptions (bounding systems away from uncontrollability), we provide qualitatively matching upper bounds. We prove that learning complexity can be at most exponential with the controllability index of the system, that is, the degree of underactuation.

Can Q-learning Be Improved with Advice?

Time: Tuesday, July 5, 04:06 PM GMT+1

Authors: Noah Golowich; Ankur Moitra

Despite rapid progress in theoretical reinforcement learning (RL) over the last few years, most of the known guarantees are worst-case in nature, failing to take advantage of structure that may be known a priori about a given RL problem at hand. In this paper we address the question of whether worst-case lower bounds for regret in online learning of Markov decision processes (MDPs) can be circumvented when information about the MDP, in the form of predictions about its optimal Q-value function, is given to the algorithm. We show that when the predictions about the optimal Q-value function satisfy a reasonably weak condition we call distillation, then we can improve regret bounds by replacing the set of state-action pairs with the set of state-action pairs on which the predictions are grossly inaccurate. This improvement holds for both uniform regret bounds and gap-based ones. Further, we are able to achieve this property with an algorithm that achieves sublinear regret when given arbitrary predictions (i.e., even those which are not a distillation). Our work extends a recent line of work on algorithms with predictions, which has typically focused on simple online problems such as caching and scheduling, to the more complex and general problem of reinforcement learning.

When Is Partially Observable Reinforcement Learning Not Scary?

Time: Tuesday, July 5, 04:18 PM GMT+1

Authors: Qinghua Liu; Alan Chung; Csaba Szepesvari; Chi Jin

Partial observability is ubiquitous in applications of Reinforcement Learning (RL), in which agents learn to make a sequence of decisions despite lacking complete information about the latent states of the controlled system. Partially observable RL is notoriously difficult in theory---well-known complexity-theoretic results show that learning partially observable Markov decision processes (POMDPs) requires an exponential number of samples in the worst case. Yet, this does not rule out the possible existence of interesting subclasses of POMDPs, which include a large set of partially observable applications in practice while being tractable.

In this paper we identify a rich family of tractable POMDPs, which we call weakly revealing POMDPs. This family rules out the pathological instances of POMDPs with non-informative observations. We prove that for weakly revealing POMDPs, a simple algorithm combining optimism and Maximum Likelihood Estimation (MLE) is sufficient to guarantee a polynomial sample complexity. To the best of our knowledge, this is the first provably sample-efficient result for learning in overcomplete POMDPs---where the number of latent states can be larger than the number of observations---in settings where exploration is necessary.

Virtual session VII (Tuesday, July 5, 03:30 PM GMT+1)

Location: Room B

Session chairs: Steve Hanneke; Daniel Soudry

Dimension-Free Convergence Rates for Gradient Langevin Dynamics in RKHS

Time: Tuesday, July 5, 03:30 PM GMT+1

Authors: Boris Muzellec; Kanji Sato; Mathurin Massias; Taiji Suzuki

Gradient Langevin dynamics (GLD) and stochastic GLD (SGLD) have attracted considerable attention lately, as a way to provide convergence guarantees in a non-convex setting. However, the known rates grow exponentially with the dimension of the space under the dissipative condition. In this work, we provide a convergence analysis of GLD and SGLD when the optimization space is an infinite-dimensional Hilbert space.
More precisely, we derive non-asymptotic, dimension-free convergence rates for GLD/SGLD when performing regularized non-convex optimization in a reproducing kernel Hilbert space.
Amongst others, the convergence analysis relies on the properties of a stochastic differential equation, its discrete time Galerkin approximation and the geometric ergodicity of the associated Markov chains.

Statistical Estimation and Online Inference via Local SGD

Time: Tuesday, July 5, 03:42 PM GMT+1

Authors: Xiang Li; Jiadong Liang; Xiangyu Chang; Zhihua Zhang

We analyze the novel Local SGD in federated learning, a multi-round estimation procedure that uses intermittent communication to improve communication efficiency. Under a $2{+}\delta$ moment condition on stochastic gradients, we first establish a \emph{functional central limit theorem} showing that the averaged iterates of Local SGD converge weakly to a rescaled Brownian motion. We next provide two iterative inference methods: the \emph{plug-in} and the \emph{random scaling}. Random scaling constructs an asymptotically pivotal statistic for inference by using the information along the whole Local SGD path. Both methods are communication efficient and applicable to online data. Our results show that Local SGD simultaneously achieves both statistical efficiency and communication efficiency.

Optimal and Instance-Dependent Guarantees for Markovian Linear Stochastic Approximation

Time: Tuesday, July 5, 03:54 PM GMT+1

Authors: Wenlong Mou; Ashwin Pananjady; Martin Wainwright; Peter Bartlett

We study stochastic approximation procedures for approximately solving a $d$-dimensional linear fixed point equation based on observing a trajectory of length $n$ from an ergodic Markov chain. We first exhibit a non-asymptotic bound of the order $t_{\mathrm{mix}} \tfrac{d}{n}$ on the squared error of the last iterate of a standard scheme, where $t_{\mathrm{mix}}$ is a mixing time. We then prove a non-asymptotic instance-dependent bound on a suitably averaged sequence of iterates, with a leading term that matches the local asymptotic minimax limit, including sharp dependence on the parameters $(d, t_{\mathrm{mix}})$ in the higher order terms. We complement these upper bounds with a non-asymptotic minimax lower bound that establishes the instance-optimality of the averaged SA estimator. We derive corollaries of these results for policy evaluation with Markov noise---covering the TD($\lambda$) family of algorithms for all $\lambda \in [0, 1)$---and linear autoregressive models. Our instance-dependent characterizations open the door to the design of fine-grained model selection procedures for hyperparameter tuning (e.g., choosing the value of $\lambda$ when running the TD($\lambda$) algorithm).

On Almost Sure Convergence Rates of Stochastic Gradient Methods

Time: Tuesday, July 5, 04:06 PM GMT+1

Authors: Jun Liu; Ye Yuan

The vast majority of convergence rate analyses for stochastic gradient methods in the literature focus on convergence in expectation, whereas trajectory-wise almost sure convergence is clearly important to ensure that any instantiation of the stochastic algorithms would converge with probability one. Here we provide a unified almost sure convergence rate analysis for stochastic gradient descent (SGD), stochastic heavy-ball (SHB), and stochastic Nesterov's accelerated gradient (SNAG) methods. We show, for the first time, that the almost sure convergence rates obtained for these stochastic gradient methods on strongly convex functions are arbitrarily close to their optimal possible convergence rates. For non-convex objective functions, we show not only that a weighted average of the squared gradient norms converges to zero almost surely, but also that the same holds for the last iterates of the algorithms. We further provide a last-iterate almost sure convergence rate analysis for stochastic gradient methods on weakly convex smooth functions, in contrast with most existing results in the literature that only provide convergence in expectation for a weighted average of the iterates.

Improved Parallel Algorithm for Minimum Cost Submodular Cover Problem

Time: Tuesday, July 5, 04:18 PM GMT+1

Authors: Yingli Ran; Zhao Zhang; Shaojie Tang

In the minimum cost submodular cover (MinSMC) problem, given a monotone nondecreasing submodular function $f\colon 2^V \rightarrow \mathbb{Z}^+$, a cost function $c\colon V\rightarrow \mathbb R^{+}$, and an integer $k\leq f(V)$, the goal is to find a subset $A\subseteq V$ with the minimum cost such that $f(A)\geq k$. MinSMC can be found at the heart of many machine learning and data mining applications. In this paper, we design a parallel algorithm for MinSMC that obtains a solution with an approximation ratio of at most $\frac{H(\min\{\Delta,k\})}{1-5\varepsilon}$ with a probability of $1-3\varepsilon$ in $O(\frac{\log m\log n\log^2 mn}{\varepsilon^4})$ rounds, where $\Delta=\max_{v\in V}f(v)$, $H(\cdot)$ is the Harmonic number, $n=f(V)$, $m=|V|$, and $\varepsilon$ is a constant in $(0,\frac{1}{5})$. This paper is the first to obtain a parallel algorithm for the weighted version of the MinSMC problem with an approximation ratio arbitrarily close to $H(\min\{\Delta,k\})$.
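
For orientation, the problem statement can be made concrete with the classical sequential greedy rule, which repeatedly adds the element with the best ratio of (truncated) marginal gain to cost. This is only a minimal sketch of that textbook baseline, not the parallel algorithm of the paper; it assumes a coverage function as the submodular $f$, and the names `greedy_min_smc`, `sets`, and `cost` are illustrative.

```python
def greedy_min_smc(sets, cost, k):
    """Classical sequential greedy for minimum cost submodular cover.

    Illustrative assumption: f(A) is the number of ground items covered
    by the sets indexed by A (a monotone submodular coverage function).
    """
    covered = set()
    chosen = []
    remaining = set(sets)
    while len(covered) < k:
        best, best_ratio = None, 0.0
        for v in remaining:
            # Marginal gain of v, with f truncated at the target k.
            gain = min(k, len(covered | sets[v])) - len(covered)
            ratio = gain / cost[v]
            if ratio > best_ratio:
                best, best_ratio = v, ratio
        if best is None:
            raise ValueError("k exceeds f(V); no feasible cover exists")
        chosen.append(best)
        covered |= sets[best]
        remaining.discard(best)
    return chosen


# Toy instance (hypothetical data): four candidate sets over items 1..6.
sets = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6}, "d": {1, 6}}
cost = {"a": 3.0, "b": 1.0, "c": 2.0, "d": 2.5}
cover = greedy_min_smc(sets, cost, k=5)
```

On this toy instance the rule picks `b`, then `c`, then `d`, at which point five items are covered.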

Sample-Efficient Reinforcement Learning in the Presence of Exogenous Information

Time: Tuesday, July 5, 04:30 PM GMT+1

Authors: Yonathan Efroni; Dylan Foster; Dipendra Misra; Akshay Krishnamurthy; John Langford

In real-world reinforcement learning applications the learner's observation space is ubiquitously high-dimensional with both relevant and irrelevant information about the task at hand. Learning from high-dimensional observations has been the subject of extensive investigation in supervised learning and statistics (e.g., via sparsity), but analogous issues in reinforcement learning are not well understood, even in finite state/action (tabular) domains. We introduce a new problem setting for reinforcement learning, the Exogenous Markov Decision Process (ExMDP), in which the state space admits an (unknown) factorization into a small controllable (or, endogenous) component and a large irrelevant (or, exogenous) component; the exogenous component is independent of the learner's actions, but evolves in an arbitrary, temporally correlated fashion. We provide a new algorithm, OSSR, which learns a near-optimal policy with sample complexity polynomial in the size of the endogenous component and nearly independent of the size of the exogenous component, thereby offering a doubly-exponential improvement over off-the-shelf algorithms. Our results highlight for the first time that sample-efficient reinforcement learning is possible in the presence of exogenous information, and provide a simple, user-friendly benchmark for investigation going forward.

Privacy II (Tuesday, July 5, 05:00 PM GMT+1)

Location: Room A

Session chairs: Miki Racz; Vitaly Feldman

The Price of Tolerance in Distribution Testing

Time: Tuesday, July 5, 05:00 PM GMT+1

Authors: Clement Canonne; Ayush Jain; Gautam Kamath; Jerry Li

We revisit the problem of tolerant distribution testing. That is, given samples from an unknown distribution $p$ over $\{1, \dots, n\}$, is it $\varepsilon_1$-close to or $\varepsilon_2$-far from a reference distribution $q$ (in total variation distance)?
Despite significant interest over the past decade, this problem is well understood only in the extreme cases.
In the noiseless setting (i.e., $\varepsilon_1 = 0$) the sample complexity is $\Theta(\sqrt{n})$, strongly sublinear in the domain size.
At the other end of the spectrum, when $\varepsilon_1 = \varepsilon_2/2$, the sample complexity jumps to the barely sublinear $\Theta(n/\log n)$.
However, very little is known about the intermediate regime.
We fully characterize the price of tolerance in distribution testing as a function of $n$, $\varepsilon_1$, $\varepsilon_2$, up to a single $\log n$ factor.
Specifically, we show the sample complexity to be
\[\tilde \Theta\left(\frac{\sqrt{n}}{\varepsilon_2^{2}} + \frac{n}{\log n} \cdot \max \left\{\frac{\varepsilon_1}{\varepsilon_2^2},\left(\frac{\varepsilon_1}{\varepsilon_2^2}\right)^{\!\!2}\right\}\right),\]
providing a smooth tradeoff between the two previously known cases.
We also provide a similar characterization for the problem of tolerant equivalence testing, where both $p$ and $q$ are unknown.
Surprisingly, in both cases, the main quantity dictating the sample complexity is the ratio $\varepsilon_1/\varepsilon_2^2$, and not the more intuitive $\varepsilon_1/\varepsilon_2$.
Of particular technical interest is our lower bound framework, which involves novel approximation-theoretic tools required to handle the asymmetry between $\varepsilon_1$ and $\varepsilon_2$, a challenge absent from previous works.
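
As a quick sanity check on the stated bound, the two previously known extremes can be read off directly. The sketch below treats $\varepsilon_2$ as a constant and ignores the logarithmic factors hidden by $\tilde\Theta$; both simplifications are assumptions made here for illustration.

```latex
\begin{align*}
\varepsilon_1 = 0:\quad
  & \frac{\sqrt{n}}{\varepsilon_2^{2}} + \frac{n}{\log n}\cdot 0
    = \Theta(\sqrt{n}), \\
\varepsilon_1 = \frac{\varepsilon_2}{2}:\quad
  & \frac{\varepsilon_1}{\varepsilon_2^{2}} = \frac{1}{2\varepsilon_2} = \Theta(1)
    \;\Longrightarrow\;
    \frac{\sqrt{n}}{\varepsilon_2^{2}} + \frac{n}{\log n}\cdot\Theta(1)
    = \Theta\!\left(\frac{n}{\log n}\right).
\end{align*}
```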

Private Convex Optimization via Exponential Mechanism

Time: Tuesday, July 5, 05:12 PM GMT+1

Authors: Sivakanth Gopi; Yin Tat Lee; Daogao Liu

In this paper, we study private optimization problems for non-smooth convex functions $F(x)=\mathbb{E}_i f_i(x)$ on $\mathbb{R}^d$.
We show that modifying the exponential mechanism by adding an $\ell_2^2$ regularizer to $F(x)$ and sampling from $\pi(x)\propto \exp(-k(F(x)+\mu\|x\|_2^2/2))$ recovers both the known optimal empirical risk and population loss under $(\varepsilon,\delta)$-DP. Furthermore, we show how to implement this mechanism using $\widetilde{O}(n \min(d, n))$ queries to $f_i(x)$ where $n$ is the number of samples/users in the DP-SCO.
We also give a (nearly) matching lower bound $\widetilde{\Omega}(n \min(d, n))$ on the number of evaluation queries.

Our results utilize the following tools, which are of independent interest:
-- We prove Gaussian Differential Privacy (GDP) of the exponential mechanism if the loss function is strongly convex and the perturbation is Lipschitz. Our privacy bound is \emph{optimal} as it includes the privacy of the Gaussian mechanism as a special case.
-- We show how to sample from $\exp(-F(x)-\mu \|x\|^2_2/2)$ for $G$-Lipschitz $F$ with $\eta$ error in TV distance using $\widetilde{O}((G^2/\mu) \log^2(d/\eta))$ unbiased queries to $F(x)$. This is the first sampler whose query complexity has \emph{polylogarithmic dependence} on both dimension $d$ and accuracy $\eta$.

Low-Degree Multicalibration

Time: Tuesday, July 5, 05:24 PM GMT+1

Authors: Parikshit Gopalan; Michael Kim; Mihir Singhal; Shengjia Zhao

Multicalibration, introduced as a notion of algorithmic fairness, has proved to be a powerful and versatile concept, with implications far beyond its original intent.
This stringent notion---that predictions be well-calibrated across a rich class of intersecting subpopulations---provides its strong guarantees at a cost: the computational and sample complexity of learning multicalibrated predictors are high, and grow exponentially with the number of class labels.
In contrast, the relaxed notion of multiaccuracy can be achieved more efficiently, yet many of the most desirable properties of multicalibration cannot be guaranteed assuming multiaccuracy alone.
This tension raises a key question: \emph{Can we learn predictors with multicalibration-style guarantees at a cost commensurate with multiaccuracy?}

In this work, we define and initiate the study of \emph{Low-Degree Multicalibration}.
Low-Degree Multicalibration defines a hierarchy of increasingly-powerful multi-group fairness notions that spans multiaccuracy and the original formulation of multicalibration at the extremes.
Our main technical contribution demonstrates that key properties of multicalibration, related to fairness and accuracy, actually manifest as low-degree properties.
Importantly, we show that low-degree multicalibration can be significantly more efficient than full multicalibration.
In the multi-class setting, the sample complexity to achieve low-degree multicalibration improves exponentially (in the number of classes) over full multicalibration.
Our work presents compelling evidence that low-degree multicalibration represents a sweet spot, pairing computational and sample efficiency with strong fairness and accuracy guarantees.

Better Private Algorithms for Correlation Clustering

Time: Tuesday, July 5, 05:36 PM GMT+1

Authors: Daogao Liu

In machine learning, correlation clustering is an important problem whose goal is to partition the individuals into groups that correlate with their pairwise similarities as much as possible.
In this work, we revisit correlation clustering under differential privacy constraints.
In particular, we improve previous results and achieve an $\tilde{O}(n^{1.5})$ additive error compared to the optimal cost in expectation on general graphs.
As for unweighted complete graphs, we improve the results further and propose a more involved algorithm which achieves $\tilde{O}(n \sqrt{\Delta^*})$ additive error, where $\Delta^*$ is the maximum degree of positive edges among all nodes.
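
To make the notion of cost concrete: under the standard disagreement-minimization objective (assumed here to be the cost the abstract refers to), a clustering pays one unit for every positive edge it cuts and for every negative edge it keeps inside a cluster. The following is a minimal, non-private sketch for the unweighted case; the function and variable names are illustrative and not taken from the paper.

```python
def disagreement_cost(labels, positive_edges, negative_edges):
    """Number of disagreements of a clustering on a signed graph.

    labels maps each node to its cluster; edges are (u, v) pairs.
    """
    # Positive (similar) pairs that the clustering separates.
    cut_positive = sum(1 for u, v in positive_edges if labels[u] != labels[v])
    # Negative (dissimilar) pairs that the clustering keeps together.
    kept_negative = sum(1 for u, v in negative_edges if labels[u] == labels[v])
    return cut_positive + kept_negative


# Hypothetical 4-node instance with clusters {0, 1} and {2, 3}.
labels = {0: "A", 1: "A", 2: "B", 3: "B"}
positive_edges = [(0, 1), (1, 2), (2, 3)]
negative_edges = [(0, 2), (0, 3), (1, 3)]
cost_value = disagreement_cost(labels, positive_edges, negative_edges)
```

Here only the positive edge `(1, 2)` is cut, so the cost is 1. An additive error guarantee bounds how far a private algorithm's cost can exceed the minimum of this quantity over all clusterings.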

Virtual session VIII (Tuesday, July 5, 05:00 PM GMT+1)

Location: Room B

Session chairs: Thodoris Lykouris; Dylan Foster

Random Graph Matching in Geometric Models: The Case of Complete Graphs

Time: Tuesday, July 5, 05:00 PM GMT+1

Authors: Haoyu Wang; Yihong Wu; Jiaming Xu; Israel Yolou

This paper studies the problem of matching two complete graphs with edge weights correlated through latent geometries, extending a recent line of research on random graph matching with independent edge weights to geometric models.
Specifically, given a random permutation $\pi^*$ on $[n]$ and $n$ i.i.d. pairs of correlated Gaussian vectors $\{X_{\pi^*(i)}, Y_i\}$ in $\mathbb{R}^d$ with noise parameter $\sigma$, the edge weights are given by $A_{ij}=\kappa(X_i,X_j)$ and $B_{ij}=\kappa(Y_i,Y_j)$ for some link function $\kappa$. The goal is to recover the hidden vertex correspondence $\pi^*$ based on the observation of $A$ and $B$. We focus on the dot-product model with $\kappa(x,y)=\langle x, y \rangle$ and Euclidean distance model with $\kappa(x,y)=\|x-y\|^2$, in the low-dimensional regime of $d=o(\log n)$ wherein the underlying geometric structures are most evident. We derive an approximate maximum likelihood estimator, which provably achieves, with high probability, perfect recovery of $\pi^*$ when $\sigma=o(n^{-2/d})$ and almost perfect recovery with a vanishing fraction of errors when $\sigma=o(n^{-1/d})$. Furthermore, these conditions are shown to be information-theoretically optimal even when the latent coordinates $\{X_i\}$ and $\{Y_i\}$ are observed, complementing the recent results of Dai et al. (2019) and Kunisky and Niles-Weed (2022) in geometric models of the planted bipartite matching problem. As a side discovery, we show that the celebrated spectral algorithm of Umeyama (1988) emerges as a further approximation to the maximum likelihood in the geometric model.

Near-Optimal Statistical Query Hardness of Learning Halfspaces with Massart Noise

Time: Tuesday, July 5, 05:12 PM GMT+1

Authors: Ilias Diakonikolas; Daniel Kane

We study the problem of PAC learning halfspaces with Massart noise. Given labeled samples $(x, y)$ from a distribution $D$ on $\mathbb{R}^{d} \times \{ \pm 1\}$ such that the marginal $D_x$ on the examples is arbitrary and the label $y$ of example $x$ is generated from the target halfspace corrupted by a Massart adversary with flipping probability $\eta(x) \leq \eta \leq 1/2$, the goal is to compute a hypothesis with small misclassification error. The best known $\mathrm{poly}(d, 1/\epsilon)$-time algorithms for this problem achieve error of $\eta+\epsilon$, which can be far from the optimal bound of $\mathrm{OPT}+\epsilon$, where $\mathrm{OPT} = \mathbb{E}_{x \sim D_x} [\eta(x)]$. While it is known that achieving $\mathrm{OPT}+o(1)$ error requires super-polynomial time in the Statistical Query model, a large gap remains between known upper and lower bounds.

In this work, we essentially characterize the efficient learnability of Massart halfspaces in the Statistical Query (SQ) model. Specifically, we show that no efficient SQ algorithm for learning Massart halfspaces on $\mathbb{R}^d$ can achieve error better than $\Omega(\eta)$, even if $\mathrm{OPT} = 2^{-\log^{c} (d)}$, for any universal constant $c \in (0, 1)$. Furthermore, when the noise upper bound $\eta$ is close to $1/2$, our error lower bound becomes $\eta - o_{\eta}(1)$, where the $o_{\eta}(1)$ term goes to $0$ when $\eta$ approaches $1/2$. Our results provide strong evidence that known learning algorithms for Massart halfspaces are nearly best possible, thereby resolving a longstanding open problem in learning theory.

Hierarchical Clustering in Graph Streams: Single-Pass Algorithms and Space Lower Bounds

Time: Tuesday, July 5, 05:24 PM GMT+1

Authors: Sepehr Assadi; Vaggos Chatziafratis; Jakub Łącki; Vahab Mirrokni; Chen Wang

The Hierarchical Clustering (HC) problem consists of building a hierarchy of clusters to represent a given dataset. Motivated by modern large-scale applications, we study the problem in the \emph{streaming model}, in which memory is heavily limited and only a single pass or very few passes over the input are allowed.

Specifically, we investigate whether a good hierarchical clustering can be obtained, or at least whether we can approximately estimate the value of the optimal hierarchy. To measure the quality of a hierarchy, we use the HC minimization objective introduced by Dasgupta [STOC'16]. Assuming that the input is an n-vertex weighted graph whose edges arrive in a stream, we derive the following results on space-vs-accuracy tradeoffs:

-- With $O(n\,\mathrm{polylog}\,n)$ space, we develop a single-pass algorithm whose approximation ratio matches the currently best \emph{offline} algorithm by Charikar and Chatziafratis [SODA'17].
-- When the space is more limited, namely, $n^{1-o(1)}$, we prove that no algorithm can even estimate the value of the optimum hierarchical tree to within an $o(\log n/\log\log n)$ factor, even when allowed $\mathrm{polylog}(n)$ passes over the input and exponential time.
-- In the most stringent setting of $\mathrm{polylog}(n)$ space, studied extensively in the literature, we rule out algorithms that can even distinguish between ``highly''-vs-``poorly'' clusterable graphs, namely, graphs that have an $n^{1/2-o(1)}$ factor gap between their HC objective values.
-- Finally, we prove that any single-pass streaming algorithm that computes an optimal HC clustering must store almost the entire input even if allowed exponential time.

Our algorithmic results rest on a general structural result: cut sparsifiers of the input graph preserve the cost of ``balanced'' hierarchical trees to within a constant factor, and thus can be used in place of the original (dense) graphs when solving HC. Our lower bound results involve establishing a new streaming lower bound for a novel problem, ``One-vs-Many-Expanders'', which can be of independent interest.
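
For readers unfamiliar with the HC minimization objective of Dasgupta mentioned above, it charges each edge its weight times the number of leaves under the lowest common ancestor of its endpoints. Below is a minimal offline sketch for evaluating that objective on a given hierarchy. It is not a streaming algorithm and is not taken from the paper; the nested-tuple tree encoding and the name `dasgupta_cost` are assumptions made for illustration.

```python
def dasgupta_cost(tree, weights):
    """Dasgupta's HC cost of a hierarchy given as nested tuples.

    Leaves are any non-tuple values; weights maps (u, v) pairs to
    edge weights. Each edge pays weight * (number of leaves under
    the node where its endpoints are first separated).
    """
    def visit(node):
        if not isinstance(node, tuple):
            return {node}, 0.0
        child_leaf_sets, cost = [], 0.0
        for child in node:
            child_leaves, child_cost = visit(child)
            child_leaf_sets.append(child_leaves)
            cost += child_cost
        leaves = set().union(*child_leaf_sets)
        for (u, v), w in weights.items():
            side_u = [i for i, s in enumerate(child_leaf_sets) if u in s]
            side_v = [i for i, s in enumerate(child_leaf_sets) if v in s]
            # The edge is split here if its endpoints fall in different children.
            if side_u and side_v and side_u != side_v:
                cost += w * len(leaves)
        return leaves, cost

    return visit(tree)[1]


# Hypothetical example: unit-weight path 0-1-2-3 under two hierarchies.
path = {(0, 1): 1, (1, 2): 1, (2, 3): 1}
balanced_cost = dasgupta_cost(((0, 1), (2, 3)), path)
caterpillar_cost = dasgupta_cost((((0, 1), 2), 3), path)
```

On this path the balanced tree costs 8 and the caterpillar tree costs 9, so the balanced hierarchy is preferred under this objective.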

Clustering with Queries under Semi-Random Noise

Time: Tuesday, July 5, 05:36 PM GMT+1

Authors: Alberto Del Pia; Mingchen Ma; Christos Tzamos

The seminal paper by Mazumdar and Saha (2017a) introduced an extensive line of work on clustering with noisy queries. Yet, despite significant progress on the problem, the proposed methods depend crucially on knowing the exact probabilities of errors of the underlying fully-random oracle.
In this work, we develop robust learning methods that tolerate general semi-random noise obtaining qualitatively the same guarantees as the best possible methods in the fully-random model.

More specifically, given a set of n points with an unknown underlying partition, we are allowed to query pairs of points u,v to check if they are in the same cluster, but with probability p, the answer may be adversarially chosen. We show that information theoretically O(nk log n /(1-2p)^2) queries suffice to learn any cluster of sufficiently large size. Our main result is a computationally efficient algorithm that can identify large clusters with O(nk log n/ (1-2p)^2) + poly(log n, k, 1/(1-2p)) queries, matching the guarantees of the best known algorithms in the fully-random model. As a corollary of our approach, we develop the first parameter-free algorithm for the fully-random model, answering an open question in Mazumdar and Saha (2017a).

Closing remarks (Tuesday, July 5, 05:55 PM GMT+1)

Location: Room A

Session chair: Maxim Raginsky