Open Problems

Time (IST) Title, Authors
11.30-11.42 \(\log(n)\) factor in “Local Glivenko-Cantelli” (Doron Cohen, Aryeh Kontorovich)
11.42-11.54 The Sample Complexity of Multi-Distribution Learning for VC Classes (Pranjal Awasthi, Nika Haghtalab, Eric Zhao)
11.54-12.06 Polynomial linearly-convergent method for g-convex optimization? (Christopher Criscitiello, David Martínez-Rubio, Nicolas Boumal)
12.06-12.18 Is There a First-Order Method that Only Converges to Local Minimax Optima? (Jiseok Chae, Kyuwon Kim, Donghwan Kim)
12.18-12.30 Learning sparse linear concepts by priming the features (Manfred K. Warmuth, Ehsan Amid)

\(\log(n)\) factor in “Local Glivenko-Cantelli”

Doron Cohen, Aryeh Kontorovich

Can the \(\log(n)\) factor in the upper bound of Cohen and Kontorovich (COLT, 2023) be removed?

The Sample Complexity of Multi-Distribution Learning for VC Classes

Pranjal Awasthi, Nika Haghtalab, Eric Zhao

Multi-distribution learning is a natural generalization of PAC learning to settings with multiple data distributions. There remains a significant gap between the known upper and lower bounds for PAC-learnable classes. In particular, while the best known upper bound on the sample complexity of learning a class of VC dimension \(d\) on \(k\) distributions is \(O(\epsilon^{-2} \ln(k) (d + k) + \min \{\epsilon^{-1} d k, \epsilon^{-4} \ln(k) d\})\), the best known lower bound is \(\Omega(\epsilon^{-2}(d + k \ln(k)))\). We discuss recent progress on this problem and some hurdles that are fundamental to the use of game dynamics in statistical learning.
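
To illustrate the game-dynamics template mentioned above, the following is a minimal sketch (not the algorithm attaining the bounds above) in which an adversary runs multiplicative weights over the \(k\) distributions while the learner best-responds with empirical risk minimization on the resulting mixture; the finite loss matrix, step size, and round count are illustrative choices.

```python
# A minimal sketch of the game-dynamics template (not the algorithm attaining the
# bounds above): an adversary runs multiplicative weights over the k distributions
# while the learner best-responds with ERM on the weighted mixture.
import numpy as np

def multi_distribution_learn(loss, rounds=200, eta=0.1):
    """loss[h, j] = expected loss of hypothesis h on distribution j (values in [0, 1])."""
    n_hyp, k = loss.shape
    weights = np.ones(k) / k            # adversary's weights over the k distributions
    avg_play = np.zeros(n_hyp)          # learner's empirical mixture of hypotheses
    for _ in range(rounds):
        h = int(np.argmin(loss @ weights))   # best response: ERM on the weighted mixture
        avg_play[h] += 1.0 / rounds
        weights *= np.exp(eta * loss[h])     # shift weight toward harder distributions
        weights /= weights.sum()
    return avg_play                     # randomized hypothesis with small worst-case loss

# Toy example: 3 hypotheses, 2 distributions.
losses = np.array([[0.1, 0.9],
                   [0.9, 0.1],
                   [0.4, 0.4]])
print(multi_distribution_learn(losses))
```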

Polynomial linearly-convergent method for g-convex optimization?

Christopher Criscitiello, David Martínez-Rubio, Nicolas Boumal

Let \(f \colon \mathcal{M} \to \mathbb{R}\) be a Lipschitz and geodesically convex function defined on a \(d\)-dimensional Riemannian manifold \(\mathcal{M}\). Does there exist a first-order deterministic algorithm which (a) uses at most \(O(\mathrm{poly}(d) \log(\epsilon^{-1}))\) subgradient queries to find a point with target accuracy \(\epsilon\), and (b) requires only \(O(\mathrm{poly}(d))\) arithmetic operations per query? In convex optimization, the classical ellipsoid method achieves this. After detailing related work, we provide an ellipsoid-like algorithm with query complexity \(O(d^2 \log^2(\epsilon^{-1}))\) and per-query complexity \(O(d^2)\) for the limited case where \(\mathcal{M}\) has constant curvature (hemisphere or hyperbolic space). We then detail possible approaches and corresponding obstacles for designing an ellipsoid-like method for general Riemannian manifolds.
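
For reference, the following is a sketch of the classical Euclidean ellipsoid method mentioned above, the baseline achieving \(O(\mathrm{poly}(d) \log(\epsilon^{-1}))\) queries with \(O(\mathrm{poly}(d))\) arithmetic per query in \(\mathbb{R}^d\); it is not the Riemannian algorithm of the abstract, and the toy function, ball radius, and iteration budget are illustrative choices.

```python
# A sketch of the classical Euclidean ellipsoid method with central subgradient
# cuts: linearly convergent with O(d^2) arithmetic per query. This is not the
# Riemannian algorithm from the abstract. Assumes d >= 2.
import numpy as np

def ellipsoid_minimize(f, subgrad, x0, R, iters=500):
    """Minimize a convex f over the ball B(x0, R) using central subgradient cuts."""
    d = x0.shape[0]
    x = x0.astype(float)
    P = (R ** 2) * np.eye(d)            # current ellipsoid {y : (y - x)^T P^{-1} (y - x) <= 1}
    best_x, best_f = x.copy(), f(x)
    for _ in range(iters):
        g = subgrad(x)
        if np.allclose(g, 0.0):         # the center is already a minimizer
            return x
        gn = g / np.sqrt(g @ P @ g)     # normalize the cut in the ellipsoid geometry
        x = x - (P @ gn) / (d + 1)      # center of the minimum-volume enclosing ellipsoid
        P = (d ** 2 / (d ** 2 - 1.0)) * (P - (2.0 / (d + 1)) * np.outer(P @ gn, P @ gn))
        fx = f(x)
        if fx < best_f:
            best_x, best_f = x.copy(), fx
    return best_x

# Toy usage: minimize the nonsmooth convex f(x) = ||x - 1||_1 over B(0, 10).
f = lambda x: np.abs(x - 1.0).sum()
g = lambda x: np.sign(x - 1.0)
print(ellipsoid_minimize(f, g, x0=np.zeros(3), R=10.0))
```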

Is There a First-Order Method that Only Converges to Local Minimax Optima?

Jiseok Chae, Kyuwon Kim, Donghwan Kim

Can we train a generative adversarial network (GAN), i.e., optimize a minimax problem, by gradient methods as effectively as we train classification neural networks, i.e., minimize a function? Currently, the answer to this question is “No”. Despite extensive study, training GANs remains challenging, and diffusion-based generative models are largely replacing them. When training GANs, we not only struggle to find stationary points but also suffer from the so-called mode-collapse phenomenon: the generated samples lack the diversity of the training data. Due to the nature of GANs, mode collapse is likely to occur when we find an optimal point of the maximin problem rather than of the original minimax problem.

This suggests that answering the open question of whether there exists a first-order method that converges only to (local) optima of minimax problems could resolve these issues. No existing method possesses such a property, either theoretically or in practice, in contrast to standard gradient descent, which successfully finds local minima. For nonconvex-nonconcave minimax problems, Jin et al. were the first to propose an appropriate notion of local optimality that takes into account the order of minimization and maximization. They also gave a partial answer to the question above, showing that two-timescale gradient descent ascent converges to strict local minimax optima. However, convergence to general local minimax optima has remained largely unexplored, even though non-strict local minimax optima are prevalent. Our recent findings show that some non-strict local minimax optima can be found by a two-timescale extragradient method.
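
The following toy illustration shows two-timescale gradient descent ascent on a simple quadratic minimax problem; the objective, step sizes, and initialization are illustrative choices and are not taken from the works discussed above.

```python
# A toy illustration of two-timescale gradient descent ascent: the max player y
# uses a larger step size than the min player x. The quadratic objective, step
# sizes, and initialization are illustrative, not taken from the works above.
def grad(x, y):
    """Gradients of f(x, y) = 2*x**2 - y**2 + 4*x*y (min over x, max over y)."""
    return 4 * x + 4 * y, 4 * x - 2 * y

x, y = 1.0, -1.0
eta_x, eta_y = 0.01, 0.1                    # timescale separation: y moves 10x faster than x
for _ in range(2000):
    gx, gy = grad(x, y)
    x, y = x - eta_x * gx, y + eta_y * gy   # simultaneous descent in x, ascent in y

print(x, y)                                 # converges to (0, 0), a strict local minimax point of f
```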

This positive result brings new attention to the open question. Furthermore, we wish to revive the discussion on the appropriate notion of local minimax optimum, which was initiated by Jin et al. but has seen little follow-up; we believe settling this notion is crucial to answering the open question.

Learning sparse linear concepts by priming the features

Manfred K. Warmuth, Ehsan Amid

Sparse linear problems can be learned well with online multiplicative updates. The question is whether there are closed-form updates based on the past examples that can learn such sparse linear problems equally sample-efficiently.

We show experimentally that this can be achieved by applying linear least squares, then “priming” the \(i\)-th feature of the past instances by multiplying it by the \(i\)-th linear least squares weight, and finally applying linear least squares a second time. However, it remains an open problem whether such priming methods have provably good regret bounds when applied online.
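
For concreteness, the following is a minimal sketch of the two-stage priming procedure described above; the use of the pseudoinverse and the toy data are illustrative choices.

```python
# A minimal sketch of the two-stage priming procedure: fit linear least squares,
# rescale each feature by its learned weight, then fit least squares again.
# The pseudoinverse and the toy data are illustrative choices.
import numpy as np

def primed_least_squares(X, y):
    w1 = np.linalg.pinv(X) @ y          # first pass: plain (minimum-norm) least squares
    X_primed = X * w1                   # prime: scale the i-th feature of each instance by w1[i]
    w2 = np.linalg.pinv(X_primed) @ y   # second pass on the primed features
    return w1 * w2                      # equivalent weights on the raw features: x @ (w1 * w2)

# Toy underdetermined sparse problem: 30 examples, 50 features, 2 relevant features.
rng = np.random.default_rng(0)
X = rng.standard_normal((30, 50))
w_true = np.zeros(50)
w_true[:2] = 1.0
y = X @ w_true
print(np.round(primed_least_squares(X, y)[:5], 3))
```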