Program
Talks in 231ABCPoster sessions in 301 Foyer
ABSTRACTS
On Mean Estimation for General Norms with Statistical Queries
Jerry Li, Aleksandar Nikolov, Ilya Razenshteyn, Erik WaingartenTalk: Tuesday 6/25/19 at 9:00 AM
Poster session: Tuesday 6/25/19 at 6:00 PM
We study the problem of mean estimation for high-dimensional distributions given access to a statistical query oracle. For a normed space \(X = (\mathbb{R}^d, \|\cdot\|_X)\) and a distribution supported on vectors \(x \in \mathbb{R}^d\) with \(\|x\|_{X} \leq 1\), the task is to output an estimate \(\hat{\mu} \in \R^d\) which is \(\varepsilon\)-close in distance \(\|\cdot\|_X\) to the true mean of the distribution. We obtain sharp upper and lower bounds for the statistical query complexity of this problem when the the underlying norm is symmetric as well as for Schatten-\(p\) norms, answering two questions raised by Feldman, Guzman, and Vempala (SODA 2017).
How Hard is Robust Mean Estimation?
Samuel B. Hopkins, Jerry LiTalk: Tuesday 6/25/19 at 9:10 AM
Poster session: Tuesday 6/25/19 at 6:00 PM
Robust mean estimation is the problem of estimating the mean \(\mu \in R^d\) of a \(d\)-dimensional distribution \(D\) from a list of independent samples, an \(\varepsilon\)-fraction of which have been arbitrarily corrupted by a malicious adversary. Recent algorithmic progress has resulted in the first polynomial-time algorithms which achieve \emph{dimension-independent} rates of error: for instance, if \(D\) has covariance \(I\), in polynomial-time one may find \(\hat{\mu}\) with \(\|\mu - \hat{\mu}\| \leq O(\sqrt{\varepsilon})\). However, error rates achieved by current polynomial-time algorithms, while dimension-independent, are sub-optimal in many natural settings, such as when \(D\) is sub-Gaussian, or has bounded \(4\)-th moments. In this work we worst-case complexity-theoretic evidence that improving on the error rates of current polynomial-time algorithms for robust mean estimation may be computationally intractable in natural settings. We show that several natural approaches to improving error rates of current polynomial-time robust mean estimation algorithms would imply efficient algorithms for the small-set expansion problem, refuting Raghavendra and Steurer's small-set expansion problem (so long as \(P \neq NP\)). We also give the first direct reduction to the robust mean estimation problem, starting from a plausible but nonstandard variant of the small-set expansion problem.
Fast Mean Estimation with Sub-Gaussian Rates
Yeshwanth Cherapanamjeri, Nicolas Flammarion, Peter BartlettTalk: Tuesday 6/25/19 at 9:20 AM
Poster session: Tuesday 6/25/19 at 6:00 PM
We propose an estimator for the mean of a random vector in R^d that can be computed in time O(n^4+n^2d) for n i.i.d.~samples and that has error bounds matching the sub-Gaussian case. The only assumptions we make about the data distribution are that it has finite mean and covariance; in particular, we make no assumptions about higher-order moments. Like the polynomial time estimator introduced by Hopkins, 2018, which is based on the sum-of-squares hierarchy, our estimator achieves optimal statistical efficiency in this challenging setting, but it has a significantly faster runtime and a simpler analysis.
Learning to Prune: Speeding up Repeated Computations
Daniel Alabi, Adam Tauman Kalai, Katrina Ligett, Cameron Musco, Christos Tzamos, Ellen VitercikTalk: Tuesday 6/25/19 at 9:30 AM
Poster session: Tuesday 6/25/19 at 6:00 PM
It is common to face settings where one must solve a sequence of similar computational problems. Running a standard algorithm with worst-case runtime guarantees on each instance will fail to take advantage of valuable structure shared across the problem instances. For example, when a commuter drives from work to home, there are typically only a handful of routes that will ever be the shortest path. A naive algorithm that does not exploit this common structure may spend most of its time checking roads that will never be in the shortest path. More generally, we can often ignore large swaths of the search space that will likely never contain an optimal solution. We present an algorithm that learns to maximally prune the search space on repeated computations, thereby reducing runtime while provably outputting the correct solution each period with high probability. Our algorithm employs a simple explore-exploit technique resembling those used in online algorithms, though our setting is quite different. We prove that, with respect to our model of pruning search spaces, our approach is optimal up to constant factors. Finally, we illustrate the applicability of our model and algorithm to three classic problems: shortest-path routing, string search, and linear programming. We present experiments confirming that our simple algorithm is effective at significantly reducing the runtime of solving repeated computations.
Fast determinantal point processes via distortion-free intermediate sampling
Michal DerezinskiTalk: Tuesday 6/25/19 at 9:40 AM
Poster session: Tuesday 6/25/19 at 6:00 PM
Given a fixed n x d matrix X, where n >> d, we study the complexity of sampling from a distribution over all subsets of rows where the probability of a subset is proportional to the squared volume of the parallelopiped spanned by the rows (a.k.a. a determinantal point process). In this task, it is important to minimize the preprocessing cost of the procedure (performed once) as well as the sampling cost (performed repeatedly). To that end, we propose a new determinantal point process algorithm which has the following two properties, both of which are novel: (1) a preprocessing step which runs in time O(number-of-non-zeros(X) log n) + poly(d), and (2) a sampling step which runs in poly(d) time, independent of the number of rows n. We achieve this by introducing a new regularized determinantal point process (R-DPP), which serves as an intermediate distribution in the sampling procedure by reducing the number of rows from n to poly(d). Crucially, this intermediate distribution does not distort the probabilities of the target sample. Our key novelty in defining the R-DPP is the use of a Poisson random variable for controlling the probabilities of different subset sizes, leading to new determinantal formulas such as the normalization constant for this distribution. Our algorithm has applications in many diverse areas where determinantal point processes have been used, such as machine learning, stochastic optimization, data summarization and low-rank matrix reconstruction.
A Rank-1 Sketch for Matrix Multiplicative Weights
Yair Carmon, John C. Duchi, Aaron Sidford, Kevin TianTalk: Tuesday 6/25/19 at 9:50 AM
Poster session: Tuesday 6/25/19 at 6:00 PM
We show that a simple rank-1 randomized sketch of the matrix multiplicative weight (MMW) update enjoys the same regret bounds as MMW, up to a factor logarithmic in the rank of the feedbacks. Unlike MMW, where every step requires full matrix exponentiation, our steps require only a single product of the form \(e^A b\), which the Lanczos method approximates efficiently. Our key technique is to view the sketch as a randomized mirror projection, and perform mirror descent analysis on the expected projection. Our sketch solves the online eigenvector problem, improving the best known complexity bounds. We also apply it to a simple no-regret scheme for semidefinite programming in saddle-point form, where it matches best known-guarantees. Similar results (with worse polylogarithmic factors) are possible via the rank-3 sketch Allen-Zhu and Li (2017) propose. However, we believe that our approach is more practical and its analysis more transparent.
Discrepancy, Coresets, and Sketches in Machine Learning
Zohar Karnin, Edo LibertyTalk: Tuesday 6/25/19 at 10:00 AM
Poster session: Tuesday 6/25/19 at 6:00 PM
This paper defines the notion of class discrepancy for families of functions. It shows that low discrepancy classes admit small offline and streaming coresets. We provide general techniques for bounding the class discrepancy of machine learning problems and specifically do so for matrix covariance approximation, logistic regression, kernel density and any analytic function of the dot product or the squared distance. Our result resolves a long standing open problem regarding the coreset complexity of Gaussian kernel density estimation. We provide two more related but independent results. First, an exponential improvement of the widely used merge-and-reduce trick which gives improved streaming sketches for any low discrepancy problem. Second, an extremely simple algorithm for finding low discrepancy sequences (and therefore coresets) for any positive semi-definite kernel. This paper establishes some explicit connections between low class discrepancy, coreset complexity, learnability, and streaming algorithms.
Optimal Average-Case Reductions to Sparse PCA: From Weak Assumptions to Strong Hardness
Matthew Brennan, Guy BreslerTalk: Tuesday 6/25/19 at 10:10 AM
Poster session: Tuesday 6/25/19 at 6:00 PM
In the past decade, sparse principal component analysis has emerged as an archetypal problem for illustrating statistical-computational tradeoffs, and a line of research has aimed to characterize the average-case complexity of sparse PCA via reductions from the planted clique problem \cite{berthet2013optimal, berthet2013complexity, wang2016statistical, gao2017sparse, brennan2018reducibility}. These reductions show computational hardness for sparse PCA assuming the conjectured hardness of detecting a planted clique in a \(G(N,\frac12)\) random graph for any clique size \(K = o(N^{1/2})\). Still a full characterization of the computational feasibility of sparse PCA in the canonical generative model, the spiked covariance model, has remained open. Our first result is an alternative reduction for sparse PCA from planted clique. The resulting lower bound completes the computational feasibility picture for the spiked covariance model of sparse PCA and is tight (matching algorithmic achievability) at all sparsities \(k\), partially resolving a question raised in Brennan et al. (2018). The main result of this paper is a second reduction to the spiked covariance model from planted clique with several implications: (1) even a mild improvement in the signal strength needed by efficient sparse PCA algorithms would imply that the hardness threshold for the planted clique problem is \emph{subpolynomial}, rather than on the order \(\sqrt{N}\) as is widely conjectured; (2) starting with the weak hardness assumption that planted clique is hard for cliques of size \(K=o(N^\alpha)\) for any \(\alpha \in (0, 1/2]\) yields tight computational lower bounds for sparse PCA at sparsities \(k = o(n^{\alpha/3})\), where \(n\) is the number of samples; and (3) whether or not the hardness threshold for planted clique is in fact at \(\sqrt{N}\) is irrelevant to the statistical-computational gap for sparse PCA in the practically relevant highly sparse regime. This result is the first instance of a suboptimal hardness assumption implying nontrivial optimal lower bounds for another problem in unsupervised learning. As a key intermediate step, our reduction maps an instance of planted clique to the empirical covariance matrix of sparse PCA samples, which proves to be a delicate task because of dependence between the entries of this matrix. Our reduction is more algorithmically involved than prior reductions to sparse PCA, developing new techniques for average-case reductions and leveraging several decomposition and comparison properties of random matrices. Our lower bounds remain unchanged assuming hardness of planted dense subgraph instead of planted clique, which also has implications for the existence of algorithms that are slower than polynomial time.
Universality of Computational Lower Bounds for Submatrix Detection
Matthew Brennan, Guy Bresler, Wasim HuleihelTalk: Tuesday 6/25/19 at 10:20 AM
Poster session: Tuesday 6/25/19 at 6:00 PM
In the general submatrix detection problem, the task is to detect the presence of a small \(k \times k\) submatrix with entries sampled from a distribution \(\mP\) in an \(n \times n\) matrix of samples from \(\mQ\). This formulation includes a number of well-studied problems, such as biclustering when \(\mP\) and \(\mQ\) are Gaussians and the planted dense subgraph formulation of community detection when the submatrix is a principal minor and \(\mP\) and \(\mQ\) are Bernoulli random variables. These problems all seem to exhibit a universal phenomenon: there is a statistical-computational gap depending on \(\mP\) and \(\mQ\) between the minimum \(k\) at which this task can be solved and the minimum \(k\) at which it can be solved in polynomial time. Our main result is to tightly characterize this computational barrier as a tradeoff between \(k\) and the KL divergences between \(\mP\) and \(\mQ\) through average-case reductions from the planted clique conjecture. These computational lower bounds hold given mild assumptions on \(\mP\) and \(\mQ\) arising naturally from classical binary hypothesis testing. In particular, our results recover and generalize the planted clique lower bounds for Gaussian biclustering in \cite{ma2015computational, brennan2018reducibility} and for the sparse and general regimes of planted dense subgraph in \cite{hajek2015computational, brennan2018reducibility}. This yields the first universality principle for computational lower bounds obtained through average-case reductions. To reduce from planted clique to the submatrix detection for a specific pair \(\mP\) and \(\mQ\), we introduce two techniques for average-case reductions: (1) multivariate rejection kernels which perform an algorithmic change of measure and lift to a larger submatrix while obtaining an optimal tradeoff in KL divergence, and (2) a technique for embedding adjacency matrices of graphs as principal minors in larger matrices that handles distributional issues arising from their diagonal entries and the matching row and column supports of the \(k \times k\) submatrix. We suspect that these techniques have applications in average-case reductions to other problems and are likely of independent interest. We also characterize the statistical barrier in our general formulation of submatrix detection.
Reasoning in Bayesian Opinion Exchange Networks Is PSPACE-Hard
Jan Hązła, Ali Jadbabaie, Elchanan Mossel, M. Amin RahimianTalk: Tuesday 6/25/19 at 10:30 AM
Poster session: Tuesday 6/25/19 at 6:00 PM
We study the Bayesian model of opinion exchange of fully rational agents arranged on a network. In this model, the agents receive private signals that are indicative of an unknown state of the world. Then, they repeatedly announce the state of the world they consider most likely to their neighbors, at the same time updating their beliefs based on their neighbors' announcements. This model is extensively studied in economics since the work of Aumann (1976) and Geanakoplos and Polemarchakis (1982). It is known that the agents eventually agree with high probability on any network. It is often argued that the computations needed by agents in this model are difficult, but prior to our results, there was no rigorous work showing this hardness. We show that it is PSPACE-hard for the agents to compute their actions in this model. Furthermore, we show that it is equally difficult even to approximate an agent's posterior: It is PSPACE-hard to distinguish between the posterior being almost entirely concentrated on one state of the world or another.
Sum-of-squares meets square loss: Fast rates for agnostic tensor completion
Dylan Foster, Andrej RisteskiTalk: Tuesday 6/25/19 at 10:40 AM
Poster session: Tuesday 6/25/19 at 6:00 PM
We study tensor completion in the agnostic setting. In the basic tensor completion problem, we receive \(n\) entries of an unknown rank-\(r\) tensor and wish to complete the remaining entries using as few measurements as possible. In agnostic tensor completion, we make no assumptions on the rank of the unknown tensor, but attempt to predict as well as the best rank-\(r\) tensor. For agnostic learning of third-order tensors with the square loss, we give the first polynomial time algorithm that obtains "fast" (i.e., \(O(1/n)\)-type) rate improving over the rate obtained by reduction to matrix completion. Our prediction error rate to compete with the best \(d\times{}d\times{}d\) tensor of rank-\(r\) is \(\tilde{O}(r^{2}d^{3/2}/n)\). More broadly, we obtain so-called "exact oracle inequalities" that trade off estimation and approximation error. Our algorithm is based on the sum-of-squares hierarchy, namely the degree-six sum-of-squares relaxation for the (NP-hard) tensor nuclear norm. The key technique is showing that a certain characterization for the subgradient of the tensor nuclear norm can be encoded in the sum-of-squares proof system. This unlocks a plethora of techniques based on localization of empirical processes under the square loss, and in particular allows us to establish restricted eigenvalue guarantees for various tensor classes. The new analysis of the relaxation complements Barak and Moitra (2016), who gave slow rates for agnostic tensor completion, and Potechin and Steurer (2017), who gave exact recovery guarantees for the noiseless setting. The main inequalities we establish are fairly user-friendly, and we anticipate they will find use elsewhere. As an example, we give new guarantees for agnostic tensor sensing with Gaussian measurements, where we also obtain \(\tilde{O}(r^{2}d^{3/2}/n)\)-type fast rates. We also give a new computational lower bound that suggests that it is unlikely that there exists any polynomial time algorithm whose excess risk improves upon the \(d^{3/2}\) dependence on dimension in our results.
A Robust Spectral Algorithm for Overcomplete Tensor Decomposition
Samuel B. Hopkins, Tselil Schramm, Jonathan ShiTalk: Tuesday 6/25/19 at 10:50 AM
Poster session: Tuesday 6/25/19 at 6:00 PM
We give a spectral algorithm for decomposing overcomplete order-4 tensors, so long as their components satisfy an algebraic non-degeneracy condition that holds for nearly all (all but an algebraic set of measure \(0\)) tensors over \((\R^d)^{\otimes 4}\) with rank \(n \le d^2\). Our algorithm is robust to adversarial perturbations of bounded spectral norm. Our algorithm is inspired by one which uses the sum-of-squares semidefinite programming hierarchy (Ma, Shi, and Steurer STOC'16), and we achieve comparable robustness and overcompleteness guarantees under similar algebraic assumptions. However, our algorithm avoids semidefinite programming and may be implemented as a series of basic linear-algebraic operations. We consequently obtain a much faster running time than semidefinite programming methods: our algorithm runs in time \(\tilde O(n^2d^3) \le \tilde O(d^7)\), which is subquadratic in the input size \(d^4\) (where we have suppressed factors related to the condition number of the input tensor).
On the Regret Minimization of Nonconvex Online Gradient Ascent for Online PCA
Dan GarberTalk: Tuesday 6/25/19 at 2:00 PM
Poster session: Tuesday 6/25/19 at 6:00 PM
In this paper we focus on the problem of Online Principal Component Analysis in the regret minimization framework. For this problem, all existing regret minimization algorithms for the fully-adversarial setting are based on a positive semidefinite convex relaxation, and hence require quadratic memory and SVD computation (either thin of full) on each iteration, which amounts to at least quadratic runtime per iteration. This is in stark contrast to a corresponding stochastic i.i.d. variant of the problem, which was studied extensively lately, and admits very efficient gradient ascent algorithms that work directly on the natural non-convex formulation of the problem, and hence require only linear memory and linear runtime per iteration. This raises the question: \textit{can non-convex online gradient ascent algorithms be shown to minimize regret in online adversarial settings?} In this paper we take a step forward towards answering this question. We introduce an \textit{adversarially-perturbed spiked-covariance model} in which, each data point is assumed to follow a fixed stochastic distribution with a non-zero spectral gap in the covariance matrix, but is then perturbed with some adversarial vector. This model is a natural extension of a well studied standard \textit{stochastic} setting that allows for non-stationary (adversarial) patterns to arise in the data and hence, might serve as a significantly better approximation for real-world data-streams. We show that in an interesting regime of parameters, when the non-convex online gradient ascent algorithm is initialized with a ``warm-start" vector, it provably minimizes the regret with high probability. We further discuss the possibility of computing such a ``warm-start" vector, and also the use of regularization to obtain fast regret rates. Our theoretical findings are supported by empirical experiments on both synthetic and real-world data.
Learning in Non-convex Games with an Optimization Oracle
Naman Agarwal, Alon Gonen, Elad HazanTalk: Tuesday 6/25/19 at 2:10 PM
Poster session: Tuesday 6/25/19 at 6:00 PM
We consider online learning in an adversarial, non-convex setting under the assumption that the learner has an access to an offline optimization oracle. In the general setting of prediction with expert advice, \cite{Hazan} established that in the optimization-oracle model, online learning requires exponentially more computation than statistical learning. In this paper we show that by slightly strengthening the oracle model, the online and the statistical learning models become computationally equivalent. Our result holds for any Lipschitz and bounded (but not necessarily convex) function. As an application we demonstrate how the offline oracle enables efficient computation of an equilibrium in non-convex games, that include GAN (generative adversarial networks) as a special case.
Combining Online Learning Guarantees
Ashok CutkoskyTalk: Tuesday 6/25/19 at 2:20 PM
Poster session: Tuesday 6/25/19 at 6:00 PM
We show how to take any two parameter-free online learning algorithms with different regret guarantees and obtain a single algorithm whose regret is the minimum of the two base algorithms. Our method is embarrassingly simple: just add the iterates. This trick can generate efficient algorithms that adapt to many norms simultaneously, as well as providing diagonal-style algorithms that still maintain dimension-free guarantees. We then proceed to show how a variant on this idea yields a black-box procedure for generating optimistic online learning algorithms. This yields the first optimistic regret guarantees in the unconstrained setting and generically increases adaptivity. Further, our optimistic algorithms are guaranteed to do no worse than their non-optimistic counterparts regardless of the quality of the optimistic estimates provided to the algorithm.
Artificial Constraints and Hints for Unbounded Online Learning
Ashok CutkoskyTalk: Tuesday 6/25/19 at 2:30 PM
Poster session: Tuesday 6/25/19 at 6:00 PM
We provide algorithms that guarantee regret \(R_T(\w)\le \tilde O(G\|\w\|^3 + G(\|\w\|+1)\sqrt{T})\) or \(R_T(\w)\le \tilde O(G\|\w\|^3T^{1/3} + GT^{1/3}+ G\|\w\|\sqrt{T})\) for online convex optimization with \(G\)-Lipschitz losses for any comparison point \(\w\) without prior knowledge of either \(G\) or \(\|\w\|\). Previous algorithms dispense with the \(O(\|\w\|^3)\) term at the expense of knowledge of one or both of these parameters, while a lower bound shows that some additional penalty term over \(G\|\w\|\sqrt{T}\) is necessary. Previous penalties were \emph{exponential} while our bounds are polynomial in all quantities. Further, given a known bound \(\|\w\|\le D\), our same techniques allow us to design algorithms that adapt optimally to the unknown value of \(\|\w\|\) without requiring knowledge of \(G\).
Lipschitz Adaptivity with Multiple Learning Rates in Online Learning
Zakaria Mhammedi, Wouter M. Koolen, Tim van ErvenTalk: Tuesday 6/25/19 at 2:40 PM
Poster session: Tuesday 6/25/19 at 6:00 PM
We aim to design adaptive online learning algorithms that take advantage of any special structure that might be present in the learning task at hand, with as little manual tuning by the user as possible. A fundamental obstacle that comes up in the design of such adaptive algorithms is to calibrate a so-called step-size or learning rate hyperparameter depending on variance, gradient norms, etc. A recent technique promises to overcome this difficulty by maintaining multiple learning rates in parallel. This technique has been applied in the MetaGrad algorithm for online convex optimization and the Squint algorithm for prediction with expert advice. However, in both cases the user still has to provide in advance a Lipschitz hyperparameter that bounds the norm of the gradients. Although this hyperparameter is typically not available in advance, tuning it correctly is crucial: if it is set too small, the methods may fail completely; but if it is taken too large, performance deteriorates significantly. In the present work we remove this Lipschitz hyperparameter by designing new versions of MetaGrad and Squint that adapt to its optimal value automatically. We achieve this by dynamically updating the set of active learning rates. For MetaGrad, we further improve the computational efficiency of handling constraints on the domain of prediction, and we remove the need to specify the number of rounds in advance.
Pure Entropic Regularization for MTS
Christian Coester, James R. LeeTalk: Tuesday 6/25/19 at 2:50 PM
Poster session: Tuesday 6/25/19 at 6:00 PM
We show that on every \(n\)-point HST metric, there is a randomized online algorithm for metrical task systems (MTS) that is \(1\)-competitive for service costs and \(O(\log n)\)-competitive for movement costs. In general, these refined guarantees are optimal up to the implicit constant. While an \(O(\log n)\)-competitive algorithm for MTS on HST metrics was developed in Bubeck et al. (2018), that approach could only establish an \(O((\log n)^2)\)-competitive ratio when the service costs are required to be \(O(1)\)-competitive. Our algorithm is an instantiation of online mirror descent with the regularizer derived from a multiscale conditional entropy. In fact, our algorithm satisfies a set of even more refined guarantees; we are able to exploit this property to combine it with known random embedding theorems and obtain, for {\em any} \(n\)-point metric space, a randomized algorithm that is \(1\)-competitive for service costs and \(O((\log n)^2)\)-competitive for movement costs.
Vortices Instead of Equilibria in MinMax Optimization: Chaos and Butterfly Effects of Online Learning in Zero-Sum Games
Yun Kuen Cheung, Georgios PiliourasTalk: Tuesday 6/25/19 at 3:00 PM
Poster session: Tuesday 6/25/19 at 6:00 PM
We prove that algorithmic experiments in zero-sum games "fail miserably" to confirm the unique, sharp prediction of maxmin equilibration. Contradicting nearly a century of economic thought that treats zero-sum games nearly axiomatically as the exemplar symbol of economic stability, we prove that no meaningful prediction can be made about the day-to-day behaviour of learning dynamics in zero-sum games. Multiplicative Weights Updates (MWU) with constant step-size is Lyapunov chaotic in the dual (payoff) space. Simply put, let's assume that an observer asks the agents playing Matching-Pennies whether they prefer Heads or Tails (and by how much in terms of aggregate payoff so far). The range of possible answers consistent with any arbitrary small set of initial conditions blows up exponentially with time everywhere in the payoff space. This result is robust both algorithmically as well as game theoretically: Algorithmic robustness: Chaos is robust to agents using any Follow-the-Regularized-Leader (FTRL) algorithms (e.g., gradient descent), smooth fictitious play, even when agents mix-and-match dynamics, use different or vanishing step-sizes. Game theoretic robustness: Chaos is robust to all affine variants of zero-sum games (strictly competitive), network variants with arbitrary large number of agents and even to competitive settings beyond these.
A Theory of Selective Prediction
Mingda Qiao, Gregory ValiantTalk: Tuesday 6/25/19 at 3:10 PM
Poster session: Tuesday 6/25/19 at 6:00 PM
We consider a model of \emph{selective prediction}, where the prediction algorithm is given a data sequence in an online fashion and asked to predict a pre-specified statistic of the upcoming data points. The algorithm is allowed to choose when to make the prediction as well as the length of the prediction window, possibly depending on the observations so far. We prove that, even without \emph{any} distributional assumption on the input data stream, a large family of statistics can be estimated to non-trivial accuracy. To give one concrete example, suppose that we are given access to an arbitrary binary sequence \(x_1, \ldots, x_n\) of length \(n\). Our goal is to accurately predict the average observation, and we are allowed to choose the window over which the prediction is made: for some \(t < n\) and \(m \le n - t\), after seeing \(t\) observations we predict the average of \(x_{t+1}, \ldots, x_{t+m}\). We show that the expected squared error of our prediction can be bounded by \(O\left(\frac{1}{\log n}\right)\), and prove a matching lower bound. This result holds for any sequence (that is not adaptive to when the prediction is made, or the predicted value), and the expectation of the error is with respect to the randomness of the prediction algorithm. Our results apply to more general statistics of a sequence of observations, and we highlight a number of open directions for future work.
On the Computational Power of Online Gradient Descent
Vaggos Chatziafratis, Tim Roughgarden, Joshua R. WangTalk: Tuesday 6/25/19 at 3:20 PM
Poster session: Tuesday 6/25/19 at 6:00 PM
We prove that the evolution of weight vectors in online gradient descent can encode arbitrary polynomial-space computations, even in very simple learning settings. Our results imply that, under weak complexity-theoretic assumptions, it is impossible to reason efficiently about the fine-grained behavior of online gradient descent.
Maximum Entropy Distributions: Bit Complexity and Stability
Damian Straszak, Nisheeth K. VishnoiTalk: Tuesday 6/25/19 at 4:00 PM
Poster session: Tuesday 6/25/19 at 6:00 PM
Maximum entropy distributions with discrete support in m dimensions arise in machine learning, statistics, information theory, and theoretical computer science. While structural and computational properties of max-entropy distributions have been extensively studied, basic questions such as: Do max-entropy distributions over a large support (e.g., 2^m) with a specified marginal vector have succinct descriptions (polynomial-size in the input description)? and: Are max-entropy distributions ``stable'' under the perturbation of the marginal vector? have resisted a rigorous resolution. The first question is important to compute max-entropy distributions efficiently and the stability of max-entropy distributions is important in applications where the marginal vector is obtained by aggregating samples from an unknown distribution, and hence, the stability determines the sample complexity. Here we show that these questions are related and resolve both of them. Our main result shows a poly(m, log 1/eps) bound on the bit complexity of eps-optimal dual solutions to the maximum entropy convex program -- for very general support sets and with no restriction on the marginal vector. The proof of this result allows us to show that changing the marginal vector by delta changes the max-entropy distribution in the total variation distance roughly by a factor of poly(m, log 1/delta)*sqrt{delta} -- even when the size of the support set is exponential. Together, our results put max-entropy distributions on a mathematically sound footing -- these distributions are robust and computationally feasible models for data.
Estimation of smooth densities in Wasserstein distance
Jonathan Weed, Quentin BerthetTalk: Tuesday 6/25/19 at 4:10 PM
Poster session: Tuesday 6/25/19 at 6:00 PM
The Wasserstein distances are a set of metrics on probability distributions supported on R^d with ap- plications throughout statistics and machine learning. Often, such distances are used in the context of variational problems, in which the statistician employs in place of an unknown measure a proxy constructed on the basis of independent samples. This raises the basic question of how well measures can be approximated in Wasserstein distance. While it is known that an empirical measure comprising i.i.d. samples is rate-optimal for general measures, no improved results were known for measures possessing smooth densities. We prove the first minimax rates for estimation of smooth densities for general Wasserstein distances, thereby showing how the curse of dimensionality can be alleviated for sufficiently regular measures. We also show how to construct discretely supported measures, suitable for computational purposes, which enjoy improved rates. Our approach is based on novel bounds between the Wasserstein distances and suitable Besov norms, which may be of independent interest.
The Optimal Approximation Factor in Density Estimation
Olivier Bousquet, Daniel Kane, Shay MoranTalk: Tuesday 6/25/19 at 4:20 PM
Poster session: Tuesday 6/25/19 at 6:00 PM
Consider the following problem: given two arbitrary densities \(q_1,q_2\) and a sample-access to an unknown target density \(p\), find which of the \(q_i\)'s is closer to \(p\) in total variation. A remarkable result due to Yatracos shows that this problem is tractable in the following sense: there exists an algorithm that uses \(O(\epsilon^{-2})\) samples from \(p\) and outputs~\(q_i\) such that with high probability, \(TV(q_i,p) \leq 3\cdot\opt + \epsilon\), where \(\opt= \min\{TV(q_1,p),TV(q_2,p)\}\). Moreover, this result extends to any finite class of densities \(Q\): there exists an algorithm that outputs the best density in \(Q\) up to a multiplicative approximation factor of 3. We complement and extend this result by showing that: (i) the factor 3 can not be improved if one restricts the algorithm to output a density from \(Q\), and (ii) if one allows the algorithm to output arbitrary densities (e.g.\ a mixture of densities from \(Q\)), then the approximation factor can be reduced to 2, which is optimal. In particular this demonstrates an advantage of improper learning over proper in this setup. We develop two approaches to achieve the optimal approximation factor of \(2\): an adaptive one and a static one. Both approaches are based on a geometric point of view of the problem and rely on estimating surrogate metrics to the total variation. Our sample complexity bounds exploit techniques from {\it Adaptive Data Analysis}.
Communication and Memory Efficient Testing of Discrete Distributions
Ilias Diakonikolas, Themis Gouleakis, Daniel M. Kane, Sankeerth RaoTalk: Tuesday 6/25/19 at 4:30 PM
Poster session: Tuesday 6/25/19 at 6:00 PM
We study distribution testing with communication and memory constraints in the following computational models: (1) The one-pass streaming model where the goal is to minimize the sample complexity of the protocol subject to a memory constraint, and (2) A distributed model where the data samples reside at multiple machines and the goal is to minimize the communication cost of the protocol. In both these models, we provide efficient algorithms for uniformity/identity testing (goodness of fit) and closeness testing (two sample testing). Moreover, we show nearly-tight lower bounds on (1) the sample complexity of any one- pass streaming tester for uniformity, subject to the memory constraint, and (2) the communication cost of any uniformity testing protocol, in a restricted “one-pass” model of communication.
Inference under Local Constraints: Lower Bounds from Chi-Square Contractions
Jayadev Acharya, Clément L. Canonne, Himanshu TyagiTalk: Tuesday 6/25/19 at 4:40 PM
Poster session: Tuesday 6/25/19 at 6:00 PM
Multiple users getting one sample each from an unknown distribution seek to enable a central server to conduct statistical inference. However, each player can only provide limited amount of information about its sample to the server. We propose a unified framework to study such distributed inference problems under local information constraints. We model the local information constraints by a set of channels W: each player chooses a channel from W, and then passes their data through this channel before transmitting the output to the server. The goal in this distributed setting is to understand the blow-up in data requirement imposed by the information constraints, compared to the centralized setting where all data samples are available to the server. We introduce two notions of chi-square fluctuations which provide bounds for the average distance and the distance to the average of a local perturbation. When information constraints are imposed, by the standard data-processing inequality, pairwise distances contract and so do our chi-square fluctuations. We provide a precise characterization of this contraction for discrete k-ary distributions and use it to obtain to general lower bounds for distribution learning and testing under information constraints. Our results involve notions of minmax and maxmin chi-square fluctuations, where the maximum is over the choice of channels and the minimum is over perturbations. The former emerges when considering public-coin protocols for testing and is bounded in terms of Frobenius norm of a positive semidefinite matrix H, a function of the channel family W. The latter appears for private-coin protocols and is bounded by the nuclear norm of H which is smaller than its Frobenius norm, establishing a separation between the sample complexity of testing using public and private coins.
Towards Testing Monotonicity of Distributions Over General Posets
Maryam Aliakbarpour, Themis Gouleakis, John Peebles, Ronitt Rubinfeld, Anak YodpinyaneeTalk: Tuesday 6/25/19 at 4:50 PM
Poster session: Tuesday 6/25/19 at 6:00 PM
In this work, we consider the sample complexity required for testing the monotonicity of distributions over partial orders. A distribution \(p\) over a poset is {\em monotone} if, for any pair of domain elements \(x\) and \(y\) such that \(x \preceq y\), \(p(x) \leq p(y)\). To understand the sample complexity of this problem, we introduce a new property called \emph{bigness} over a finite domain, where the distribution is \(T\)-big if the minimum probability for any domain element is at least \(T\). Relying on the framework of \cite{wu2015chebyshev}, we establish a lower bound of \(\Omega(n/\log n)\) for testing bigness of distributions on domains of size \(n\). We then build on these lower bounds to give \(\Omega(n/\log{n})\) lower bounds for testing monotonicity over a matching poset of size \(n\) and significantly improved lower bounds over the hypercube poset. We give sublinear sample complexity bounds for testing bigness and for testing monotonicity over the matching poset. We then give a number of tools for analyzing upper bounds on the sample complexity of the monotonicity testing problem.
Testing Mixtures of Discrete Distributions
Maryam Aliakbarpour, Ravi Kumar, Ronitt RubinfeldTalk: Tuesday 6/25/19 at 5:00 PM
Poster session: Tuesday 6/25/19 at 6:00 PM
There has been significant study on the sample complexity of testing properties of distributions over large domains. For many properties, it is known that the sample complexity can be substantially smaller than the domain size. For example, over a domain of size \(n\), distinguishing the uniform distribution from distributions that are far from uniform in \(\ell_1\)-distance uses only \(O(\sqrt{n})\) samples. However, the picture is very different in the presence of arbitrary noise, even when the amount of noise is quite small. In this case, one must distinguish if samples are coming from a distribution that is \(\epsilon\)-close to uniform from the case where the distribution is \((1-\epsilon)\)-far from uniform. The latter task requires nearly linear in \(n\) samples~\citep{Valiant08,ValiantValiant11}. In this work, we present a noise model that on one hand is more tractable for the testing problem, and on the other hand represents a rich class of noise families. In our model, the noisy distribution is a mixture of the original distribution and noise, where the latter is known to the tester either explicitly or via sample access; the form of the noise is also known \emph{a priori}. Focusing on the identity and closeness testing problems leads to the following mixture testing question: Given samples of distributions \(p, q_1,q_2\), can we test if \(p\) is a mixture of \(q_1\) and \(q_2\)? We consider this general question in various scenarios that differ in terms of how the tester can access the distributions, and show that indeed this problem is more tractable. Our results show that the sample complexity of our testers are exactly the same as for the classical non-mixture case.
Learning Ising Models with Independent Failures
Surbhi Goel, Daniel M. Kane, Adam R. KlivansTalk: Tuesday 6/25/19 at 5:10 PM
Poster session: Tuesday 6/25/19 at 6:00 PM
We give the first efficient algorithm for learning the structure of an Ising model that tolerates independent failures; that is, each entry of the observed sample is missing with some unknown probability \(p\). Our algorithm matches the essentially optimal runtime and sample complexity bounds of recent work for learning Ising models due to Klivans and Meka (2017). We devise a novel unbiased estimator for the gradient of the Interaction Screening Objective (ISO) due to Vuffray et al. (2016) and apply a stochastic multiplicative gradient descent algorithm to minimize this objective. Solutions to this minimization recover the neighborhood information of the underlying Ising model on a node by node basis.
Sharp Theoretical Analysis for Nonparametric Testing under Random Projection
Meimei Liu, Zuofeng Shang, Guang ChengTalk: Tuesday 6/25/19 at 5:20 PM
Poster session: Tuesday 6/25/19 at 6:00 PM
A common challenge in nonparametric inference is its high computational complexity when data volume is large. In this paper, we develop computationally efficient nonparametric testing by employing a random projection strategy. In the specific kernel ridge regression setup, a simple distance-based test statistic is proposed. Notably, we derive the minimum number of random projections that is sufficient for achieving testing optimality in terms of the minimax rate. As a by-product, the lower bound of projection dimension for minimax optimal estimation derived in \cite{yang2015randomized} is proven to be sharp. One technical contribution is to establish upper bounds for a range of tail sums of empirical kernel eigenvalues.
Lower bounds for testing graphical models: colorings and antiferromagnetic Ising models
Ivona Bezakova, Antonio Blanca, Zongchen Chen, Daniel Stefankovic, Eric VigodaTalk: Tuesday 6/25/19 at 5:30 PM
Poster session: Tuesday 6/25/19 at 6:00 PM
We study the identity testing problem in the context of spin systems or undirected graphical models, where it takes the following form: given the parameter specification of the model M and a sampling oracle for the distribution \(\mu_{M^* }\) of an unknown model M^*, can we efficiently determine if the two models M and M^* are the same? We consider identity testing for both soft-constraint and hard-constraint systems. In particular, we prove hardness results in two prototypical cases, the Ising model and proper colorings, and explore whether identity testing is any easier than structure learning. For the ferromagnetic (attractive) Ising model, Daskalasis et al. (2018) presented a polynomial time algorithm for identity testing. We prove hardness results in the antiferromagnetic (repulsive) setting in the same regime of parameters where structure learning is known to require a super-polynomial number of samples. In particular, for n-vertex graphs of maximum degree d, we prove that if |\beta| d = \omega(\log n) (where \beta is the inverse temperature parameter), then there is no identity testing algorithm for the antiferromagnetic Ising model that runs in polynomial time unless RP = NP. We also establish computational lower bounds for a broader set of parameters under the (randomized) exponential time hypothesis. In our proofs, we use random graphs as gadgets; this is inspired by similar constructions in seminal works on the hardness of approximate counting. In the hard-constraint setting, we present hardness results for identity testing for proper colorings. Our results are based on the presumed hardness of #BIS, the problem of (approximately) counting independent sets in bipartite graphs. In particular, we prove that identity testing for colorings is hard in the same range of parameters where structure learning is known to be hard, which in turn matches the parameter regime for NP-hardness of the corresponding decision problem.
Testing Markov Chains Without Hitting
Yeshwanth Cherapanamjeri, Peter BartlettTalk: Tuesday 6/25/19 at 5:40 PM
Poster session: Tuesday 6/25/19 at 6:00 PM
We study the problem of identity testing of markov chains. In this setting, we are given access to a single trajectory from a markov chain with unknown transition matrix Q and the goal is to determine whether Q = P for some known matrix P or Distt(P, Q) > \epsilon where Dist is suitably defined. In recent work by Daskalakis et al, it was shown that it is possible to distinguish between the two cases provided the length of the observed trajectory is at least super-linear in the hitting time of P which may be arbitrarily large. In this paper, we propose an algorithm that avoids this dependence on hitting time thus enabling efficient testing of markov chains even in cases where it is infeasible to observe every state in the chain. Our algorithm is based on combining classical ideas from approximation algorithms with techniques for the spectral analysis of markov chains.
Is your function low dimensional?
Anindya De, Elchanan Mossel, Joe NeemanTalk: Tuesday 6/25/19 at 5:50 PM
Poster session: Tuesday 6/25/19 at 6:00 PM
We study the problem of testing if a function depends on a small number of linear directions of its input data. We call a function \(f\) a \emph{linear \(k\)-junta} if it is completely determined by some \(k\)-dimensional subspace of the input space. In this paper, we study the problem of testing whether a given \(n\) variable function \(f : \mathbb{R}^n \to \{0,1\}\), is a linear \(k\)-junta or \(\epsilon\)-far from all linear \(k\)-juntas, where the closeness is measured with respect to the Gaussian measure on \(\mathbb{R}^n\). Linear \(k\)-juntas are a common generalization of two fundamental classes from Boolean function analysis (both of which have been studied in property testing) \textbf{1.} \(k\)- juntas which are functions on the Boolean cube which depend on at most k of the variables and \textbf{2.} intersection of \(k\) halfspaces, a fundamental geometric concept class. We show that the class of linear \(k\)-juntas is not testable, but adding a surface area constraint makes it testable: we give a \(\mathsf{poly}(k \cdot s/\epsilon)\)-query non-adaptive tester for linear \(k\)-juntas with surface area at most \(s\). We show that the polynomial dependence on \(s\) is necessary. Moreover, we show that if the function is a linear \(k\)-junta with surface area at most \(s\), we give a \((s \cdot k)^{O(k)}\)-query non-adaptive algorithm to learn the function \emph{up to a rotation of the basis}. In particular, this implies that we can test the class of intersections of \(k\) halfspaces in \(\mathbb{R}^n\) with query complexity independent of \(n\).
Planting trees in graphs, and finding them back
Laurent Massoulié, Ludovic Stephan, Don TowsleyTalk: Wednesday 6/26/19 at 9:00 AM
Poster session: Wednesday 6/26/19 at 6:00 PM
In this paper we study the two inference problems of detection and reconstruction in the context of planted structures in sparse Erd\H{o}s-Rényi random graphs \(\mathcal{G}(n,\lambda/n)\) with fixed average degree \(\lambda>0\). Motivated by a problem of communication security, we focus on the case where the planted structure consists in the addition of a tree graph. In the case of planted line graphs, we establish the following phase diagram for detection and reconstruction. In a low density region where the average degree \(\lambda\) of the original graph is below some critical value \(\lambda_c=1\), both detection and reconstruction go from impossible to easy as the line length \(K\) crosses some critical value \(K^*=f(\lambda)\ln(n)\), where \(n\) is the number of nodes in the graph. In a high density region where \(\lambda > \lambda_c\), detection goes from impossible to easy as \(K\) goes from \(o(\sqrt{n})\) to \(\omega(\sqrt{n})\). In contrast, reconstruction remains impossible so long as \(K=o(n)\). We then consider planted \(D\)-ary trees of varying depth \(h\) and \(2\le D\le O(1)\). For these we identify a low-density region \(\lambda<\lambda_D\), where \(\lambda_D\) is the threshold for emergence of the \(D\)-core in Erd\H{o}s-Rényi random graphs \(\mathcal{G}(n,\lambda/n)\) for which the following holds. There is a threshold \(h*=g(D)\ln(\ln(n))\) with the following properties. Detection goes from feasible to impossible as \(h\) crosses \(h*\). Interestingly, we show that only partial reconstruction is feasible at best for \(h\ge h*\). We conjecture a similar picture to hold for \(D\)-ary trees as for lines in the high-density region \(\lambda>\lambda_D\), but confirm only the following part of this picture: Detection is easy for \(D\)-ary trees of size \(\omega(\sqrt{n})\), while at best only partial reconstruction is feasible for \(D\)-ary trees of any size \(o(n)\). These results provide a clear contrast with the corresponding picture for detection and reconstruction of {\em low rank} planted structures, such as dense subgraphs and block communities. In the examples we study, there is i) an absence of hard phases for both detection and reconstruction, and ii) a discrepancy between detection and reconstruction, the latter being impossible for a wide range of parameters where detection is easy. The latter property does not hold for previously studied low rank planted structures.
Reconstructing Trees from Traces
Sami Davies, Miklos Racz, Cyrus RashtchianTalk: Wednesday 6/26/19 at 9:10 AM
Poster session: Wednesday 6/26/19 at 6:00 PM
We study the problem of learning a node-labeled tree given independent traces from an appropriately defined deletion channel. This problem, tree trace reconstruction, generalizes string trace reconstruction, which corresponds to the tree being a path. For many classes of trees, including complete trees and spiders, we provide algorithms that reconstruct the labels using only a polynomial number of traces. This exhibits a stark contrast to known results on string trace reconstruction, which require exponentially many traces, and where a central open problem is to determine whether a polynomial number of traces suffice. Our techniques combine novel combinatorial and complex analytic methods.
Robustness of spectral methods for community detection
Ludovic Stephan, Laurent MassouliéTalk: Wednesday 6/26/19 at 9:20 AM
Poster session: Wednesday 6/26/19 at 6:00 PM
The present work is concerned with community detection. Specifically, we consider a random graph drawn according to the stochastic block model: its vertex set is partitioned into blocks, or communities, and edges are placed randomly and independently of each other with probability depending only on the communities of their two endpoints. In this context, our aim is to recover the community labels better than by random guess, based only on the observation of the graph. In the sparse case, where edge probabilities are in \(O(1/n)\), we introduce a new spectral method based on the distance matrix \(D^{(l)}\), where \(D^{(l)}_{ij} = 1\) iff the graph distance between \(i\) and \(j\), noted \(d(i, j)\) is equal to \(\ell\). We show that when \(\ell \sim c\log(n)\) for carefully chosen \(c\), the eigenvectors associated to the largest eigenvalues of \(D^{(l)}\) provide enough information to perform non-trivial community recovery with high probability, provided we are above the so-called Kesten-Stigum threshold. This yields an efficient algorithm for community detection, since computation of the matrix \(D^{(l)}\) can be done in \(O(n^{1+\kappa})\) operations for a small constant \(\kappa\). We then study the sensitivity of the eigendecomposition of \(D^{(l)}\) when we allow an adversarial perturbation of the edges of \(G\). We show that when the considered perturbation does not affect more than \(O(n^\varepsilon)\) vertices for some small \(\varepsilon > 0\), the highest eigenvalues and their corresponding eigenvectors incur negligible perturbations, which allows us to still perform efficient recovery.
Achieving the Bayes Error Rate in Stochastic Block Model by SDP, Robustly
Yingjie Fei, Yudong ChenTalk: Wednesday 6/26/19 at 9:30 AM
Poster session: Wednesday 6/26/19 at 6:00 PM
We study the statistical performance of the semidefinite programming (SDP) relaxation approach for clustering under the binary symmetric Stochastic Block Model (SBM). We show that the SDP achieves an error rate of the form \[ \exp \left[ -(1-o(1)) \frac{nI}{2} \right], \] where \(I\) is an appropriate information-theoretic measure of the signal-to-noise ratio. This bound matches the minimax lower bound on the optimal Bayes error rate for this problem, and improves upon existing results that are sub-optimal by a multiplicative constant in the exponent. As a corollary, our result implies that SDP achieves the optimal exact recovery threshold with the correct leading constant. We further show that this error rate of SDP is robust; that is, it remains unchanged under the so-called semirandom model where the graph is modified by a monotone adversary, as well as under the setting with heterogeneous edge probabilities. Our proof is based on a novel primal-dual analysis of the SDP.
Optimal Learning for Mallows Block Model
Robert Busa-Fekete, Dimitris Fotakis, Balazs Szorenyi, Manolis ZampetakisTalk: Wednesday 6/26/19 at 9:40 AM
Poster session: Wednesday 6/26/19 at 6:00 PM
The Mallows model, introduced in the seminal paper of [Mallow 1957], is one of the most fundamental ranking distribution over the symmetric group S_m. To analyze more complex ranking data, several studies considered the Generalized Mallows model [Fligner&Verducci 1986], [Marden 1995]. Despite the significant research interest of ranking distributions, the exact sample complexity of estimating the parameters of a Mallows and a Generalized Mallows Model is not well-understood. The main result of the paper is a tight sample complexity bound for learning Mallows and Generalized Mallows Model. We approach the learning problem by analyzing a more general model which interpolates between the single parameter Mallows Model and the \(m\) parameter Mallows model. We call our model Mallows Block Model -- referring to the the Block Models that are a popular model in theoretical statistics. Our sample complexity analysis gives tight bound for learning the Mallows Block Model for any number of blocks. We provide essentially matching lower bounds for our sample complexity results. As a corollary of our analysis, it turns out that, if the central ranking is known, one single sample from the Mallows Block Model is sufficient to estimate the spread parameters with error that goes to zero as the size of the permutations goes to infinity. In addition, we calculate the exact rate of the parameter estimation error.
Accuracy-Memory Tradeoffs and Phase Transitions in Belief Propagation
Vishesh Jain, Frederic Koehler, Jingbo Liu, Elchanan MosselTalk: Wednesday 6/26/19 at 9:50 AM
Poster session: Wednesday 6/26/19 at 6:00 PM
The analysis of Belief Propagation and other algorithms for the {\em reconstruction problem} plays a key role in the analysis of community detection in inference on graphs, phylogenetic reconstruction in bioinformatics, and the cavity method in statistical physics. We prove a conjecture of Evans, Kenyon, Peres, and Schulman (2000) which states that any bounded memory message passing algorithm is statistically much weaker than Belief Propagation for the reconstruction problem. More formally, any recursive algorithm with bounded memory for the reconstruction problem on the trees with the binary symmetric channel has a phase transition strictly below the Belief Propagation threshold, also known as the Kesten-Stigum bound. The proof combines in novel fashion tools from recursive reconstruction, information theory, and optimal transport, and also establishes an asymptotic normality result for BP and other message-passing algorithms near the critical threshold.
Testing Identity of Multidimensional Histograms
Ilias Diakonikolas, Daniel M. Kane, John PeeblesTalk: Wednesday 6/26/19 at 10:00 AM
Poster session: Wednesday 6/26/19 at 6:00 PM
We investigate the problem of identity testing for multidimensional histogram distributions. A distribution \(p: D \to \R_+\), where \(D \subseteq \R^d\), is called a \(k\)-histogram if there exists a partition of the domain into \(k\) axis-aligned rectangles such that \(p\) is constant within each such rectangle. Histograms are one of the most fundamental non-parametric families of distributions and have been extensively studied in computer science and statistics. We give the first identity tester for this problem with {\em sub-learning} sample complexity in any fixed dimension and a nearly-matching sample complexity lower bound. In more detail, let \(q\) be an unknown \(d\)-dimensional \(k\)-histogram distribution in fixed dimension \(d\), and \(p\) be an explicitly given \(d\)-dimensional \(k\)-histogram. We want to correctly distinguish, with probability at least \(2/3\), between the case that \(p = q\) versus \(\|p-q\|_1 \geq \eps\). We design an algorithm for this hypothesis testing problem with sample complexity \(O((\sqrt{k}/\eps^2) 2^{d/2} \log^{2.5 d}(k/\eps))\) that runs in sample-polynomial time. Our algorithm is robust to model misspecification, i.e., succeeds even if \(q\) is only promised to be {\em close} to a \(k\)-histogram. Moreover, for \(k = 2^{\Omega(d)}\), we show a sample complexity lower bound of \((\sqrt{k}/\eps^2) \cdot \Omega(\log(k)/d)^{d-1}\) when \(d\geq 2\). That is, for any fixed dimension \(d\), our upper and lower bounds are nearly matching. Prior to our work, the sample complexity of the \(d=1\) case was well-understood, but no algorithm with sub-learning sample complexity was known, even for \(d=2\). Our new upper and lower bounds have interesting conceptual implications regarding the relation between learning and testing in this setting.
Learning rates for Gaussian mixtures under group invariance
Victor-Emmanuel BrunelTalk: Wednesday 6/26/19 at 10:10 AM
Poster session: Wednesday 6/26/19 at 6:00 PM
We study the pointwise maximum likelihood estimation rates for a class of Gaussian mixtures that are invariant under the action of some isometry group. This model is also known as multi-reference alignment, where random rotations of a given vector are observed, up to Gaussian noise. We completely characterize the speed of the maximum likelihood estimator, by giving a comprehensive description of the likelihood geometry of the model. We show that the unknown parameter can always be decomposed into two components, one of which can be estimated at the fast rate \(n^{-1/2}\), the other one being estimated at the slower rate \(n^{-1/4}\). We provide an algebraic description and a geometric interpretation of these facts.
Global Convergence of the EM Algorithm for Mixtures of Two Component Linear Regression
Jeongyeol Kwon, Wei Qian, Constantine Caramanis, Yudong Chen, Damek DavisTalk: Wednesday 6/26/19 at 10:20 AM
Poster session: Wednesday 6/26/19 at 6:00 PM
The Expectation-Maximization algorithm is perhaps the most broadly used algorithm for inference of latent variable problems. A theoretical understanding of its performance, however, largely remains lacking. Recent results established that EM enjoys global convergence for Gaussian Mixture Models. For Mixed Linear Regression, however, only local convergence results have been established, and those only for the high SNR regime. We show here that EM converges for mixed linear regression with two components (it is known that it may fail to converge for three or more), and moreover that this convergence holds for random initialization. Our analysis reveals that EM exhibits very different behavior in Mixed Linear Regression from its behavior in Gaussian Mixture Models, and hence our proofs require the development of several new ideas.
Adaptive Hard Thresholding for Near-optimal Consistent Robust Regression
Arun Sai Suggala, Kush Bhatia, Pradeep Ravikumar, Prateek JainTalk: Wednesday 6/26/19 at 10:30 AM
Poster session: Wednesday 6/26/19 at 6:00 PM
We study the problem of robust linear regression with response variable corruptions. We consider the oblivious adversary model, where the adversary corrupts a fraction of the responses in complete ignorance of the data. We provide an estimator which consistently estimates the true regression vector, even with 1-o(1) fraction of corruptions. Existing results in this setting either don't guarantee consistent estimates or can only handle a small fraction of corruptions. We also extend our estimator to robust sparse linear regression and show that similar guarantees hold in this setting. Finally, we extend our estimator to the problem of linear regression with heavy-tailed noise and show that our estimator consistently estimates the regression vector even when the noise has unbounded variance (e.g., Cauchy distribution), for which most existing results don't even apply. Our estimator is based on a novel variant of outlier removal via hard thresholding in which the threshold is chosen adaptively and crucially relies on adding randomness to the threshold to escape bad fixed points of the non-convex hard thresholding operation.
Computationally and Statistically Efficient Truncated Regression
Constantinos Daskalakis, Themis Gouleakis, Christos Tzamos, Emmanouil ZampetakisTalk: Wednesday 6/26/19 at 10:40 AM
Poster session: Wednesday 6/26/19 at 6:00 PM
We provide a computationally and statistically efficient estimator for the classical problem of truncated linear regression, where the dependent variable y = w^T x + eps and its corresponding vector of covariates x \in R^k are only revealed if the dependent variable falls in some subset S \subseteq R; otherwise the existence of the pair (x, y) is hidden. This problem has remained a challenge since the early works of [Tobin 1958], [Amemiya 1973], [Hausman et al 1977], its applications are abundant, and its history dates back even further to the work of Galton, Pearson, Lee, and Fisher. While consistent estimators of the regression coefficients have been identified, the error rates are not well-understood, especially in high-dimensional settings. Under a `thickness assumption' about the covariance matrix of the covariates in the revealed sample, we provide a computationally efficient estimator for the coefficient vector w from n revealed samples that attains l_2 error O(\sqrt{k / n}), recovering the guarantees of least squares in the standard (untruncated) linear regression setting. Our estimator uses Projected Stochastic Gradient Descent (PSGD) on the negative log-likelihood of the truncated sample, and only needs oracle access to the set S, which may otherwise be arbitrary, and in particular may be non-convex. PSGD must be restricted to an appropriately defined convex cone to guarantee that the negative log-likelihood is strongly convex, which in turn is established using concentration of matrices on variables with sub-exponential tails. We perform experiments on simulated data to illustrate the accuracy of our estimator. As a corollary of our work, we show that SGD provably learns the parameters of single-layer neural networks with noisy Relu activation functions [Nair et al. 2010], [Bengio et al. 2013], [Gulcehre et al. 2016], given linearly many, in the number of network parameters, input-output pairs in the realizable setting.
The All-or-Nothing Phenomenon in Sparse Linear Regression
Galen Reeves, Jiaming Xu, Ilias ZadikTalk: Wednesday 6/26/19 at 10:50 AM
Poster session: Wednesday 6/26/19 at 6:00 PM
We study the problem of recovering a hidden binary \(k\)-sparse \(p\)-dimensional vector \(\beta\) from \(n\) noisy linear observations \(Y=X\beta+W\) with \(X_{ij}\) are i.i.d.\ \(\mathcal{N}(0,1)\) and \(W_i\) are i.i.d.\ \(\mathcal{N}(0,\sigma^2)\). A closely related hypothesis testing problem is to distinguish the pair \((X,Y)\) generated from this structured model from a corresponding null model where \((X,Y)\) consist of purely independent Gaussian entries. In the low sparsity \(k=o(p)\) and high signal to noise ratio \(k/\sigma^2=\Omega\left(1\right)\) regime, we establish an ``All-to-Nothing'' information-theoretic phase transition at a critical sample size \(n^*=2 k\log (p/k) /(1+k/\sigma^2)\), resolving a conjecture of \cite{gamarnikzadik}. Specifically, we show that if \(\liminf_{p\to \infty} n/n^*>1\), then the maximum likelihood estimator almost perfectly recovers the hidden vector with high probability and moreover the true hypothesis can be detected with a vanishing error probability. Conversely, if \(\liminf_{p\to \infty} n/n^*<1\), then it becomes information-theoretical impossible to even recover an arbitrarily small but fixed fraction of the hidden vector support, or to test hypotheses strictly better than random guess. Our converse proof builds upon two key techniques. First, we use a conditional second moment method to upper bound the KL divergence between the structured and null model. Second, inspired by the celebrated area theorem, we establish a lower bound to the mean squared estimation error in terms of the KL divergence.
Computational Limitations in Robust Classification and Win-Win Results
Akshay Degwekar, Preetum Nakkiran, Vinod VaikuntanathanTalk: Thursday 6/27/19 at 2:10 PM
Poster session: Thursday 6/27/19 at 6:00 PM
We continue the study of computational limitations in learning robust classifiers, following the recent work of Bubeck, Lee, Price and Razenshteyn. First, we demonstrate classification tasks where computationally efficient robust classifiers do not exist, even when computationally unbounded robust classifiers do. We rely on the hardness of decoding problems with preprocessing on codes and lattices. Second, we show classification tasks where efficient robust classifiers exist, but they are computationally hard to learn. Bubeck et al. showed examples of such tasks {\em in the small-perturbation regime} where the robust classifier can recover from a constant number of perturbed bits. Indeed, as we observe, the question of whether a large-perturbation robust classifier for their task exists is related to important open questions in computational number theory. We show two such classification tasks in the large-perturbation regime: the first relies on the existence of one-way functions, a minimal assumption in cryptography; and the second on the hardness of the learning parity with noise problem. For the second task, not only does a non-robust classifier exist, but also an efficient algorithm that generates fresh new labeled samples given access to polynomially many training examples (termed as generation by Kearns et.\ al.\ (1994)). Third, we show that any such task implies the existence of cryptographic primitives such as one-way functions or even forms of public-key encryption. This leads us to a win-win scenario: either we can quickly learn an efficient robust classifier (assuming one exists), or we can construct new instances of popular and useful cryptographic primitives.
VC Classes are Adversarially Robustly Learnable, but Only Improperly
Omar Montasser, Steve Hanneke, Nathan SrebroTalk: Thursday 6/27/19 at 2:20 PM
Poster session: Thursday 6/27/19 at 6:00 PM
We study the question of learning an adversarially robust predictor. We show that any hypothesis class H with finite VC dimension is robustly PAC learnable with an improper learning rule. The requirement of being improper is necessary as we exhibit examples of hypothesis classes H with finite VC dimension that are not robustly PAC learnable with any proper learning rule.
Faster Algorithms for High-Dimensional Robust Covariance Estimation
Yu Cheng, Ilias Diakonikolas, Rong Ge, David P. WoodruffTalk: Thursday 6/27/19 at 2:30 PM
Poster session: Thursday 6/27/19 at 6:00 PM
We study the problem of estimating the covariance matrix of a high-dimensional distribution when a small constant fraction of the samples can be arbitrarily corrupted. Recent work gave the first polynomial time algorithms for this problem with near-optimal error guarantees for several natural structured distributions. Our main contribution is to develop faster algorithms for this problem whose running time nearly matches that of computing the empirical covariance. Given \(N = \tilde{\Omega}(d^2/\epsilon^2)\) samples from a \(d\)-dimensional Gaussian distribution, an \(\epsilon\)-fraction of which may be arbitrarily corrupted, our algorithm runs in time \(\tilde{O}(d^{3.26})/\poly(\epsilon)\) and approximates the unknown covariance matrix to optimal error up to a logarithmic factor. Previous robust algorithms with comparable error guarantees all have runtimes \(\tilde{\Omega}(d^{2 \omega})\) when \(\epsilon = \Omega(1)\), where \(\omega\) is the exponent of matrix multiplication. We also provide evidence that improving our running time may require new algorithmic techniques.
Statistical Learning with a Nuisance Component
Dylan Foster, Vasilis SyrgkanisTalk: Thursday 6/27/19 at 2:40 PM
Poster session: Thursday 6/27/19 at 6:00 PM
We provide excess risk guarantees for statistical learning when the population risk with respect to which we evaluate the target model depends on an unknown model that also needs to be estimated from data (a "nuisance model"). We analyze a two-stage sample splitting meta-algorithm that takes as input two arbitrary estimation algorithms: one for the target model and one for the nuisance model. We show that if the population risk satisfies a condition called Neyman orthogonality, the impact of the nuisance estimation error on the excess risk bound achieved by the meta-algorithm is of second order. Our theorem is agnostic to the particular algorithms used for the target and nuisance and only makes an assumption on their individual performance. This enables the use of a plethora of existing results from statistical learning and machine learning literature to give new guarantees for learning with a nuisance component. Moreover, by focusing on excess risk rather than parameter estimation, we can give guarantees under weaker assumptions than in previous works and accommodate the case where the target parameter belongs to a complex nonparametric class. We characterize conditions on the metric entropy such that oracle rates---rates of the same order as if we knew the nuisance model---are achieved. We also analyze the rates achieved by specific estimation algorithms such as variance-penalized empirical risk minimization, neural network estimation and sparse high-dimensional linear model estimation. We highlight the applicability of our results in four settings of central importance in the literature: 1) heterogeneous treatment effect estimation, 2) offline policy optimization, 3) domain adaptation, and 4) learning with missing data.
Privately Learning High-Dimensional Distributions
Gautam Kamath, Jerry Li, Vikrant Singhal, Jonathan UllmanTalk: Thursday 6/27/19 at 2:50 PM
Poster session: Thursday 6/27/19 at 6:00 PM
We present novel, computationally efficient, and differentially private algorithms for two fundamental high-dimensional learning problems: learning a multivariate Gaussian in R^d and learning a product distribution in {0,1}^d in total variation distance. The sample complexity of our algorithms nearly matches the sample complexity of the optimal non-private learners for these tasks in a wide range of parameters. Thus, our results show that private comes essentially for free for these problems, providing a counterpoint to the many negative results showing that privacy is often costly in high dimensions. Our algorithms introduce a novel technical approach to reducing the sensitivity of the estimation procedure that we call recursive private preconditioning, which may find additional applications.
Lower Bounds for Locally Private Estimation via Communication Complexity
John Duchi, Ryan RogersTalk: Thursday 6/27/19 at 3:00 PM
Poster session: Thursday 6/27/19 at 6:00 PM
We develop lower bounds for estimation under local privacy constraints---including differential privacy and its relaxations to approximate or R\'{e}nyi differential privacy of all orders---by showing an equivalence between private estimation and communication-restricted estimation problems. Our results are strong enough to apply to arbitrarily interactive privacy mechanisms, and they also give sharp lower bounds for all levels of differential privacy protections, that is, privacy mechanisms with privacy levels \(\epsilon \in [0, \infty)\). As a particular consequence of our results, we show that the minimax mean-squared error for estimating the mean of a bounded or Gaussian random vector in \(d\) dimensions scales as \(\frac{d}{n} \cdot \frac{d}{ \min\{\epsilon, \epsilon^2\}}\).
Private Center Points and Learning of Halfspaces
Amos Beimel, Shay Moran, Kobbi NIssim, Uri StemmerTalk: Thursday 6/27/19 at 3:10 PM
Poster session: Thursday 6/27/19 at 6:00 PM
We present a private learner for halfspaces over an arbitrary finite domain \(X\subset \R^d\) with sample complexity \(\poly(d,2^{\log^*|X|})\). The building block for this learner is a differentially private algorithm for locating an approximate center point in the convex hull of \(m>\poly(d,2^{\log^*|X|})\) points -- a high dimensional generalization of the median function. Our construction establishes a relationship between these two problems that is reminiscent of the relation in the one-dimensional setting between the median and learning one-dimensional thresholds [Bun et al.\ FOCS '15]. This relationship suggests that the problem of privately locating a center point may have further applications in the design of differentially private algorithms. We also provide a lower bound on the sample complexity for privately finding a point in the convex hull. For approximate differential privacy, we show a lower bound of \(m=\Omega(d+\log^*|X|)\), whereas for pure differential privacy \(m=\Omega(d\log|X|)\).
Parameter-free Online Convex Optimization with Sub-Exponential Noise
Kwang-Sung Jun, Francesco OrabonaTalk: Thursday 6/27/19 at 3:20 PM
Poster session: Thursday 6/27/19 at 6:00 PM
We consider the problem of unconstrained online convex optimization (OCO) with sub-exponential noise, a strictly more general problem than the standard OCO. In this setting, the learner receives a subgradient of the loss functions corrupted by sub-exponential noise and strives to achieve optimal regret guarantee, without knowledge of the competitor norm, i.e. in a parameter-free way. Recently, Cutkosky and Boahen (COLT 2017) proved that it is impossible to guarantee a sublinear regret with unbounded subgradients without paying an exponential price. This paper shows that it is possible to go around the lower bound by allowing the observed subgradients to be unbounded via stochastic noise. Presence of unbounded noise in unconstrained OCO is challenging; existing algorithms do not provide near-optimal regret bounds or fail to have a guarantee. We design a novel parameter-free OCO algorithm for Banach space, which we call BANCO, via a reduction to betting on noisy coins. We show that BANCO achieves the optimal regret rate in our problem. Finally, we show the application of our results to obtain a parameter-free locally private stochastic subgradient descent algorithm, and the connection to the law of iterated logarithms.
An Information-Theoretic Approach to Minimax Regret in Partial Monitoring
Tor Lattimore, Csaba SzepesvariTalk: Wednesday 6/26/19 at 4:00 PM
Poster session: Wednesday 6/26/19 at 6:00 PM
We prove a new minimax theorem connecting the worst-case Bayesian regret and minimax regret under finite-action partial monitoring with no assumptions on the space of signals or decisions of the adversary. We then generalise the information-theoretic tools of Russo and Van Roy (2016) for proving Bayesian regret bounds and combine them with the minimax theorem to derive minimax regret bounds for various partial monitoring settings. The highlight is a clean analysis of `non-degenerate easy' and `hard' finite partial monitoring, with new regret bounds that are independent of arbitrarily large game-dependent constants. The power of the generalised machinery is further demonstrated by proving that the minimax regret for k-armed adversarial bandits is at most sqrt(2nk), improving on existing results by a factor of 2. Finally, we provide a simple analysis of the cops and robbers game, also improving best known constants.
Sample complexity of partition identification using multi-armed bandits
Sandeep Juneja, Subhashini KrishnasamyTalk: Wednesday 6/26/19 at 4:10 PM
Poster session: Wednesday 6/26/19 at 6:00 PM
Given a vector of probability distributions, or arms, each of which can be sampled independently, we consider the problem of identifying the partition to which this vector belongs from a finitely partitioned universe of such vector of distributions. We study this as a pure exploration problem in multi armed bandit settings and develop sample complexity bounds on the total mean number of samples required for identifying the correct partition with high probability. This framework subsumes well studied problems such as finding the best arm or the best few arms. We consider distributions belonging to the single parameter exponential family and primarily consider partitions where the vector of means of arms lie either in a given set or its complement. The sets considered correspond to distributions where there exists a mean above a specified threshold, where the set is a half space and where either the set or its complement is a polytope, or more generally, a convex set. In these settings, we characterize the lower bounds on mean number of samples for each arm highlighting their dependence on the problem geometry. Further, inspired by the lower bounds, we propose algorithms that can match these bounds asymptotically with decreasing probability of error. Applications of this framework may be diverse. We briefly discuss one associated with finance.
Batch-Size Independent Regret Bounds for the Combinatorial Multi-Armed Bandit Problem
Nadav Merlis, Shie MannorTalk: Wednesday 6/26/19 at 4:20 PM
Poster session: Wednesday 6/26/19 at 6:00 PM
We consider the combinatorial multi-armed bandit (CMAB) problem, where the reward function is nonlinear. In this setting, the agent chooses a batch of arms on each round and receives feedback from each arm of the batch. The reward that the agent aims to maximize is a function of the selected arms and their expectations. In many applications, the reward function is highly nonlinear, and the performance of existing algorithms relies on a global Lipschitz constant to encapsulate the function's nonlinearity. This may lead to loose regret bounds, since by itself, a large gradient does not necessarily cause a large regret, but only in regions where the uncertainty in the reward's parameters is high. To overcome this problem, we introduce a new smoothness criterion, which we term \textit{Gini-weighted smoothness}, that takes into account both the nonlinearity of the reward and concentration properties of the arms. We show that a linear dependence of the regret in the batch size in existing algorithms can be replaced by this smoothness parameter. This, in turn, leads to much tighter regret bounds when the smoothness parameter is batch-size independent. For example, in the probabilistic maximum coverage (PMC) problem, that has many applications, including influence maximization, diverse recommendations and more, we achieve dramatic improvements in the upper bounds. We also prove matching lower bounds for the PMC problem and show that our algorithm is tight, up to a logarithmic factor in the problem's parameters.
Multi-armed Bandit Problems with Strategic Arms
Mark Braverman, Jieming Mao, Jon Schneider, S. Matthew WeinbergTalk: Wednesday 6/26/19 at 4:30 PM
Poster session: Wednesday 6/26/19 at 6:00 PM
We study a strategic version of the multi-armed bandit problem, where each arm is an individual strategic agent and we, the principal, pull one arm each round. When pulled, the arm receives some private reward \(v_a\) and can choose an amount \(x_a\) to pass on to the principal (keeping \(v_a-x_a\) for itself). All non-pulled arms get reward \(0\). Each strategic arm tries to maximize its own utility over the course of \(T\) rounds. Our goal is to design an algorithm for the principal incentivizing these arms to pass on as much of their private rewards as possible. When private rewards are stochastically drawn each round (\(v_a^t \leftarrow D_a\)), we show that: - Algorithms that perform well in the classic adversarial multi-armed bandit setting necessarily perform poorly: For all algorithms that guarantee low regret in an adversarial setting, there exist distributions \(D_1,\ldots,D_k\) and an \(o(T)\)-approximate Nash equilibrium for the arms where the principal receives reward \(o(T)\). - There exists an algorithm for the principal that induces a game among the arms where each arm has a dominant strategy. Moreover, for every \(o(T)\)-approximate Nash equilibrium, the principal receives expected reward \(\mu'T - o(T)\), where \(\mu'\) is the second-largest of the means \(E[D_{a}]\). This algorithm maintains its guarantee if the arms are non-strategic (\(x_a = v_a\)), and also if there is a mix of strategic and non-strategic arms.
Better Algorithms for Stochastic Bandits with Adversarial Corruptions
Anupam Gupta, Tomer Koren, Kunal TalwarTalk: Wednesday 6/26/19 at 4:40 PM
Poster session: Wednesday 6/26/19 at 6:00 PM
We study the stochastic multi-armed bandits problem in the presence of adversarial corruption. We present a new algorithm for this problem whose regret is nearly optimal, substantially improving upon previous work. Our algorithm is agnostic to the level of adversarial contamination and can tolerate a significant amount of corruption with virtually no degradation in performance.
On the Performance of Thompson Sampling on Logistic Bandits
Shi Dong, Tengyu Ma, Benjamin Van RoyTalk: Wednesday 6/26/19 at 4:50 PM
Poster session: Wednesday 6/26/19 at 6:00 PM
We study the logistic bandit, in which rewards are binary with success probability \(\exp(\beta a^\top \theta) / (1 + \exp(\beta a^\top \theta))\) and actions \(a\) and coefficients \(\theta\) are within the \(d\)-dimensional unit ball. While prior regret bounds for algorithms that address the logistic bandit exhibit exponential dependence on the slope parameter \(\beta\), we establish a regret bound for Thompson sampling that is independent of \(\beta\). Specifically, we establish that, when the set of feasible actions is identical to the set of possible coefficient vectors, the Bayesian regret of Thompson sampling is \(\tilde{O}(d\sqrt{T})\). We also establish a \(\tilde{O}(\sqrt{d\eta T}/\lambda)\) bound that applies more broadly, where \(\lambda\) is the worst-case optimal log-odds and \(\eta\) is the ````fragility dimension," a new statistic we define to capture the degree to which an optimal action for one model fails to satisfice for others. We demonstrate that the fragility dimension plays an essential role by showing that, for any \(\epsilon > 0\), no algorithm can achieve \(\mathrm{poly}(d, 1/\lambda)\cdot T^{1-\epsilon}\) regret.
Nearly Minimax-Optimal Regret for Linearly Parameterized Bandits
Yingkai Li, Yining Wang, Yuan ZhouTalk: Wednesday 6/26/19 at 5:00 PM
Poster session: Wednesday 6/26/19 at 6:00 PM
We study the linear contextual bandit problem with finite action sets. When the problem dimension is d, the time horizon is T, and there are n\leq 2^{d/2} candidate actions per time period, we (1) show that the minimax expected regret is \Omega(dT log T log n) for every algorithm, and (2) introduce a Variable-Confidence-Level (VCL) SupLinUCB algorithm whose regret matches the lower bound up to iterated logarithmic factors. Our algorithmic result saves two\sqrt{\log T} factors from previous analysis, and our information-theoretical lower bound also improves previous results by one \sqrt{\log t} factor, revealing a regret scaling quite different from classical multi-armed bandits in which no logarithmic T term is present in minimax regret (Audibert and Bubeck, 2009). Our proof techniques include variable confidence levels and a careful analysis of layer sizes of SupLinUCB (Chu et al., 2011) on the upper bound side, and delicately constructed adversarial sequences showing the tightness of elliptical potential lemmas on the lower bound side.
Gaussian Process Optimization with Adaptive Sketching: Scalable and No Regret
Daniele Calandriello, Luigi Carratino, Alessandro Lazaric, Michal Valko, Lorenzo RosascoTalk: Wednesday 6/26/19 at 5:10 PM
Poster session: Wednesday 6/26/19 at 6:00 PM
Gaussian processes (GP) are a popular Bayesian approach for the optimization of black-box functions. Despite their effectiveness in simple problems, GP-based algorithms hardly scale to complex high-dimensional functions, as their per-iteration time and space cost is at least \emph{quadratic} in the number of dimensions \(d\) and iterations~\(t\). Given a set of \(A\) alternative to choose from, the overall runtime \(O(t^3 A)\) quickly becomes prohibitive. In this paper, we introduce BKB (\textit{budgeted kernelized bandit}), a novel approximate GP algorithm for optimization under bandit feedback that achieves near-optimal regret (and hence near-optimal convergence rate) with near-constant per-iteration complexity and no assumption on the input space or covariance of the GP. Combining a kernelized linear bandit algorithm (GP-UCB) and a matrix sketching technique (i.e., leverage score sampling), we prove that selecting inducing points based on their posterior variance, gives an accurate randomized low-rank approximation of the GP preserving variance estimates and confidence intervals. We show that this procedure selects at most \(\widetilda{O}(d_{eff})\) points, where \(d_{eff}\) is the \emph{effective} dimension of the explored space, which is typically much smaller than both \(d\) and \(t\). This greatly reduces the dimensionality of the problem, thus leading to a\(\widetilda{O}(T A d_{eff}^2)\) runtime and \(\widetilda{O}(A d_{eff})\) space complexity.
Improved Path-length Regret Bounds for Bandits
Sébastien Bubeck, Yuanzhi Li, Haipeng Luo, Chen-Yu WeiTalk: Wednesday 6/26/19 at 5:20 PM
Poster session: Wednesday 6/26/19 at 6:00 PM
We study adaptive regret bounds in terms of the variation of the losses (the so-called path-length bounds) for both multi-armed bandit and more generally linear bandit. We first show that the seemingly suboptimal path-length bound of (Wei and Luo, 2018) is in fact not improvable for adaptive adversary. Despite this negative result, we then develop two new algorithms, one that strictly improves over (Wei and Luo, 2018) with a smaller path-length measure, and the other which improves over (Wei and Luo, 2018) for oblivious adversary when the path-length is large. Our algorithms are based on the well-studied optimistic mirror descent framework, but importantly with several novel techniques, including new optimistic predictions, a slight bias towards recently selected arms, and the use of a hybrid regularizer similar to that of (Bubeck et al., 2018). Furthermore, we extend our results to linear bandit by showing a reduction to obtaining dynamic regret for a full-information problem, followed by a further reduction to convex body chasing. We propose a simple greedy chasing algorithm for squared 2-norm, leading to new dynamic regret results and as a consequence the first path-length regret for general linear bandit as well.
Contextual Bandits with Continuous Actions: Smoothing, Zooming, and Adapting
Akshay Krishnamurthy, John Langford, Aleksandrs Slivkins, Chicheng ZhangTalk: Wednesday 6/26/19 at 5:30 PM
Poster session: Wednesday 6/26/19 at 6:00 PM
We study contextual bandit learning with an abstract policy class and continuous action space. We obtain two qualitatively different regret bounds: one competes with a smoothed version of the policy class under no continuity assumptions, while the other requires standard Lipschitz assumptions. Both bounds exhibit data-dependent “zooming'' behavior and, with no tuning, yield improved guarantees for benign problems. We also study adapting to unknown smoothness parameters, establishing a price-of-adaptivity and deriving optimal adaptive algorithms that require no additional information.
A New Algorithm for Non-stationary Contextual Bandits: Efficient, Optimal, and Parameter-free
Yifang Chen, Chung-Wei Lee, Haipeng Luo, Chen-Yu WeiTalk: Wednesday 6/26/19 at 5:40 PM
Poster session: Wednesday 6/26/19 at 6:00 PM
We propose the first contextual bandit algorithm that is parameter-free, efficient, and optimal in terms of dynamic regret. Specifically, our algorithm achieves dynamic regret \(\mathcal{O}(\min\{\sqrt{ST}, \Delta^{\frac{1}{3}}T^{\frac{2}{3}}\})\) for a contextual bandit problem with \(T\) rounds, \(S\) switches and \(\Delta\) total variation in data distributions. Importantly, our algorithm is adaptive and does not need to know \(S\) or \(\Delta\) ahead of time, and can be implemented efficiently assuming access to an ERM oracle. Our results strictly improve the \(\mathcal{O} (\min \{S^{\frac{1}{4}}T^{\frac{3}{4}}, \Delta^{\frac{1}{5}}T^{\frac{4}{5}}\})\) bound of (Luo et al., 2018), and greatly generalize and improve the \(\mathcal{O}(\sqrt{ST})\) result of (Auer et al., 2018) that holds only for the two-armed bandit problem without contextual information. The key novelty of our algorithm is to introduce ``replay phases'', in which the algorithm acts according to its previous decisions for a certain amount of time in order to detect non-stationarity while maintaining a good balance between exploration and exploitation.
Adaptively Tracking the Best Bandit Arm with an Unknown Number of Distribution Changes
Peter Auer, Pratik Gajane, and Ronald OrtnerTalk: Wednesday 6/26/19 at 5:40 PM
Poster session: Wednesday 6/26/19 at 6:00 PM
We consider the variant of the stochastic multi-armed bandit problem where the stochastic reward distributions may change abruptly several times. In contrast to previous work, we are able to achieve (nearly) optimal mini-max regret bounds without knowing the number of changes. For this setting, we propose an algorithm called ADSWITCH and provide performance guarantees for the regret evaluated against the optimal non-stationary policy. Our regret bound is the first optimal bound for an algorithm that is not tuned with respect to the number of changes.
Bandit Principal Component Analysis
Wojciech Kotlowski, Gergely NeuTalk: Wednesday 6/26/19 at 5:50 PM
Poster session: Wednesday 6/26/19 at 6:00 PM
We consider a partial-feedback variant of the well-studied online PCA problem where a learner attempts to predict a sequence of \(d\)-dimensional vectors in terms of a quadratic loss, while only having limited feedback about the environment's choices. We focus on a natural notion of bandit feedback where the learner only observes the loss associated with its own prediction. Based on the classical observation that this decision-making problem can be lifted to the space of density matrices, we propose an algorithm that is shown to achieve a regret of \(O(d^{3/2}\sqrt{T})\) after \(T\) rounds in the worst case. We also prove data-dependent bounds that improve on the basic result when the loss matrices of the environment have bounded rank or the loss of the best action is bounded. One version of our algorithm runs in \(O(d)\) time per trial which massively improves over every previously known online PCA method. We complement these results by a lower bound of \(\Omega(d\sqrt{T})\).
Disagreement-Based Combinatorial Pure Exploration: Sample Complexity Bounds and an Efficient Algorithm
Tongyi Cao, Akshay KrishnamurthyTalk: Thursday 6/27/19 at 9:00 AM
Poster session: Thursday 6/27/19 at 6:00 PM
We design new algorithms for the combinatorial pure exploration problem in the multi-arm bandit framework. In this problem, we are given \(K\) distributions and a collection of subsets \(\mathcal{V} \subset 2^{[K]}\) of these distributions, and we would like to find the subset \(v \in \mathcal{V}\) that has largest mean, while collecting, in a sequential fashion, as few samples from the distributions as possible. In both the fixed budget and fixed confidence settings, our algorithms achieve new sample-complexity bounds that provide polynomial improvements on previous results in some settings. Via an information-theoretic lower bound, we show that no approach based on uniform sampling can improve on ours in any regime, yielding the first interactive algorithms for this problem with this basic property. Computationally, we show how to efficiently implement our fixed confidence algorithm whenever \(\mathcal{V}\) supports efficient linear optimization. Our results involve precise concentration-of-measure arguments and a new algorithm for linear programming with exponentially many constraints.
Active Regression via Linear-Sample Sparsification
Xue Chen, Eric PriceTalk: Thursday 6/27/19 at 9:10 AM
Poster session: Thursday 6/27/19 at 6:00 PM
We present an approach that improves the sample complexity for a variety of curve fitting problems, including active learning for linear regression, polynomial regression, and continuous sparse Fourier transforms. In the active linear regression problem, one would like to estimate the least squares solution \(\beta^*\) minimizing \(\norm{X\beta - y}_2\) given the entire unlabeled dataset \(X \in \R^{n \times d}\) but only observing a small number of labels \(y_i\). We show that \(O(d)\) labels suffice to find a constant factor approximation \(\wt{\beta}\): \[ \E[\norm{X\wt{\beta} - y}_2^2] \leq 2 \E[\norm{X \beta^* - y}_2^2]. \] This improves on the best previous result of \(O(d \log d)\) from leverage score sampling. We also present results for the \emph{inductive} setting, showing when \(\wt{\beta}\) will generalize to fresh samples; these apply to continuous settings such as polynomial regression. Finally, we show how the techniques yield improved results for the non-linear sparse Fourier transform setting.
Combinatorial Algorithms for Optimal Design
Vivek Madan, Mohit Singh, Uthaipon Tantipongpipat, Weijun XieTalk: Thursday 6/27/19 at 9:20 AM
Poster session: Thursday 6/27/19 at 6:00 PM
In an optimal design problem, we are given a set of linear experiments v_1,...,v_n in R^d and k >= d, and our goal is to select a set or a multiset S subset of [n] of size k such that \Phi((\sum_{i \in [n]} v_i v_i^T)^{-1}) is minimized. When \Phi(M) = det(M)^{1/d}, the problem is known as the D-optimal design problem, and when \Phi(M) = \tr(M), it is known as the A-optimal design problem. One of the most common heuristics used in practice to solve these problems is the local search heuristic, also known as the Fedorov's exchange method. This is due to its simplicity and its empirical performance. However, despite its wide usage no theoretical bound has been proven for this algorithm. In this paper, we bridge this gap and prove approximation guarantees for the local search algorithms for D-optimal design and A-optimal design problems. We show that the local search algorithms are asymptotically optimal when k/d is large. In addition to this, we also prove similar approximation guarantees for the greedy algorithms for D-optimal design and A-optimal design problems when k/d is large.
Minimax experimental design: Bridging the gap between statistical and worst-case approaches to least squares regression
Michal Derezinski, Kenneth L. Clarkson, Michael W. Mahoney, Manfred K. WarmuthTalk: Thursday 6/27/19 at 9:30 AM
Poster session: Thursday 6/27/19 at 6:00 PM
In experimental design, we are given a large collection of vectors, each with a hidden response value that we assume derives from an underlying linear model, and we wish to pick a small subset of the vectors such that querying the corresponding responses will lead to a good estimator of the model. A classical approach in statistics is to assume the responses are linear, plus zero-mean i.i.d. Gaussian noise, in which case the goal is to provide an unbiased estimator with smallest mean squared error (A-optimal design). A related approach, more common in computer science, is to assume the responses are arbitrary but fixed, in which case the goal is to estimate the least squares solution using few responses, as quickly as possible, for worst-case inputs. Despite many attempts, characterizing the relationship between these two approaches has proven elusive. We address this by proposing a framework for experimental design where the responses are produced by an arbitrary unknown distribution. We show that there is an efficient randomized experimental design procedure that achieves strong variance bounds for an unbiased estimator using few responses in this general model. Nearly tight bounds for the classical A-optimality criterion, as well as improved bounds for worst-case responses, emerge as special cases of this result. In the process, we develop a new algorithm for a joint sampling distribution called volume sampling, and we propose a new i.i.d. importance sampling method: inverse score sampling. A key novelty of our analysis is in developing new expected error bounds for worst-case regression by controlling the tail behavior of i.i.d. sampling via the jointness of volume sampling. Our result motivates a new minimax-optimality criterion for experimental design which can be viewed as an extension of both A-optimal design and sampling for worst-case regression.
Sorted Top-k in Rounds
Mark Braverman, Jieming Mao, Yuval PeresTalk: Thursday 6/27/19 at 9:40 AM
Poster session: Thursday 6/27/19 at 6:00 PM
We consider the sorted top-\(k\) problem whose goal is to recover the top-\(k\) items with the correct order out of \(n\) items using pairwise comparisons. In many applications, multiple rounds of interaction can be costly. We restrict our attention to algorithms with a constant number of rounds \(r\) and try to minimize the sample complexity, i.e. the number of comparisons. When the comparisons are noiseless, we characterize how the optimal sample complexity depends on the number of rounds (up to a polylogarithmic factor for general \(r\) and up to a constant factor for \(r=1\) or 2). In particular, the sample complexity is \(\Theta(n^2)\) for \(r=1\), \(\Theta(n\sqrt{k} + n^{4/3})\) for \(r=2\) and \(\tilde{\Theta}\left(n^{2/r} k^{(r-1)/r} + n\right)\) for \(r \geq 3\). We extend our results of sorted top-\(k\) to the noisy case where each comparison is correct with probability \(2/3\). When \(r=1\) or 2, we show that the sample complexity gets an extra \(\Theta(\log(k))\) factor when we transition from the noiseless case to the noisy case. We also prove new results for top-\(k\) and sorting in the noisy case. We believe our techniques can be generally useful for understanding the trade-off between round complexities and sample complexities of rank aggregation problems.
The implicit bias of gradient descent on nonseparable data
Ziwei Ji, Matus TelgarskyTalk: Wednesday 6/26/19 at 2:00 PM
Poster session: Wednesday 6/26/19 at 6:00 PM
Gradient descent, when applied to the task of logistic regression, outputs iterates which are biased to follow a unique ray defined by the data. The direction of this ray is the maximum margin predictor of a maximal linearly separable subset of the data; the gradient descent iterates converge to this ray in direction at the rate O(ln ln t/ln t). The ray does not pass through the origin in general, and its offset is the bounded global optimum of the risk over the remaining data; gradient descent recovers this offset at a rate O((ln t)^2/sqrt{t}).
How do infinite width bounded norm networks look in function space?
Pedro Savarese, Itay Evron, Daniel Soudry, Nathan SrebroTalk: Wednesday 6/26/19 at 2:10 PM
Poster session: Wednesday 6/26/19 at 6:00 PM
We consider the question of what functions can be captured by ReLU networks with an unbounded number of units (infinite width), but where the overall network Euclidean norm (sum of squares of all weights in the system, except for an unregularized bias term for each unit) is bounded; or equivalently what is the minimal norm required to approximate a given function. For functions f : R -> R and a single hidden layer, we show that the minimal network norm for representing f is max(integral |f''(x)|dx, |f'(-infty) + f'(+infty)|), and hence the minimal norm fit for a sample is given by a linear spline interpolation.
Mean-field theory of two-layers neural networks: dimension-free bounds and kernel limit
Song Mei, Theodor Misiakiewicz, Andrea MontanariTalk: Wednesday 6/26/19 at 2:20 PM
Poster session: Wednesday 6/26/19 at 6:00 PM
We consider learning two layer neural networks using stochastic gradient descent. The mean-field description of this learning dynamics approximates the evolution of the network weights by an evolution in the space of probability distributions in R^D (where D is the number of parameters associated to each neuron). This evolution can be defined through a partial differential equation or, equivalently, as the gradient flow in the Wasserstein space of probability distributions. Earlier work shows that (under some regularity assumptions), the mean field description is accurate as soon as the number of hidden units is much larger than the dimension D. In this paper we establish stronger and more general approximation guarantees. First of all, we show that the number of hidden units only needs to be larger than a quantity dependent on the regularity properties of the data, and independent of the dimensions. Next, we generalize this analysis to the case of unbounded activation functions, which was not covered by earlier bounds. We extend our results to noisy stochastic gradient descent. Finally, we show that kernel ridge regression can be recovered as a special limit of the mean field analysis.
Learning Neural Networks with Two Nonlinear Layers in Polynomial Time
Surbhi Goel, Adam R. KlivansTalk: Wednesday 6/26/19 at 2:30 PM
Poster session: Wednesday 6/26/19 at 6:00 PM
We give a polynomial-time algorithm for learning neural networks with one layer of sigmoids feeding into any Lipschitz, monotone activation function (e.g., sigmoid or ReLU). The algorithm succeeds with respect to {\em any} distribution on the unit ball in \(n\) dimensions (hidden weight vectors in the first layer have unit norm). This is the first efficient algorithm for learning a general class of neural networks with more than one nonlinear layer that makes no restrictions on the VC-dimension of the network. Algorithms for learning relaxations of our model (e.g., allowing larger weight vectors in the first layer) would lead to breakthroughs on notoriously hard problems in Boolean function learning. As one example, we give the first superpolynomial SQ lower bound for learning even a single layer of ReLUs. Thus, our results are ``best possible'' with respect to current techniques. Our algorithm-- {\em Alphatron}-- is an iterative update rule that combines isotonic regression with kernel methods. We use this algorithm to give a simple reduction for translating PAC learning algorithms to the more general, real-valued setting of {\em probabilistic concepts}, a model that (unlike PAC learning) requires non-i.i.d. noise-tolerance. This substantially improves many longstanding results for PAC learning Boolean functions.
Learning Two Layer Rectified Neural Networks in Polynomial Time
Ainesh Bakshi, Rajesh Jayaram, David P. WoodruffTalk: Wednesday 6/26/19 at 2:40 PM
Poster session: Wednesday 6/26/19 at 6:00 PM
We consider the following fundamental problem in the study of neural networks: given input examples \(x \in \mathbb{R}^d\) and their vector-valued labels, as defined by an underlying generative neural network, recover the weight matrices of this network. We consider two-layer networks, mapping \(\mathbb{R}^d\) to \(\mathbb{R}^m\), with a single hidden layer and \(k\) non-linear activation units \(f(\cdot)\), where \(f(x) = \max \{x , 0\}\) is the ReLU activation function. Such a network is specified by two weight matrices, \(\mathbf{U}^* \in \mathbb{R}^{m \times k}, \mathbf{V}^* \in \mathbb{R}^{k \times d}\), such that the label of an example \(x \in \mathbb{R}^{d}\) is given by \(\mathbf{U}^* f(\mathbf{V}^* x)\), where \(f(\cdot)\) is applied coordinate-wise. Given \(n\) samples \(x^1,\dots,x^n \in \mathbb{R}^d\) as a matrix \(\mathbf{X} \in \mathbb{R}^{d \times n}\) and the label \(\mathbf{U}^* f(\mathbf{V}^* \mathbf{X})\) of the network on these samples, our goal is to recover the weight matrices \(\mathbf{U}^*\) and \(\mathbf{V}^*\). More generally, our labels \(\mathbf{U}^* f(\mathbf{V}^* \mathbf{X})\) may be corrupted by noise, and instead we observe \(\mathbf{U}^* f(\mathbf{V}^* \mathbf{X}) + \mathbf{E}\) where \(\mathbf{E}\) is some noise matrix. Even in this case, we may still be interested in recovering good approximations to the weight matrices \(\mathbf{U}^*\) and \(\mathbf{V}^*\). In this work, we develop algorithms and hardness results under varying assumptions on the input and noise. Although the problem is NP-hard even for \(k=2\), by assuming Gaussian marginals over the input \(\mathbf{X}\) we are able to develop polynomial time algorithms for the approximate recovery of \(\mathbf{U}^*\) and \(\mathbf{V}^*\). Perhaps surprisingly, in the noiseless case our algorithms recover \(\mathbf{U}^*,\mathbf{V}^*\) \textit{exactly}, i.e. with no error, in \textit{strongly} polynomial time. To the best of the our knowledge, this is the first algorithm to accomplish exact recovery for the ReLU activation function. For the noisy case, we give the first polynomial time algorithm that approximately recovers the weights in the presence of mean-zero noise \(\mathbf{E}\). Our algorithms generalize to a larger class of \textit{rectified} activation functions, \(f(x) = 0\) when \(x\leq 0\), and \(f(x) > 0\) otherwise. Although our polynomial time results require \(\mathbf{U}^*\) to have full column rank, we also give a fixed-parameter tractable algorithm (in \(k\)) when \(\mathbf{U}^*\) does not have this property. Lastly, we give a fixed-parameter tractable algorithm for more arbitrary noise matrices \(\mathbf{E}\), so long as they are independent of \(\mathbf{X}\).
Gradient Descent for One-Hidden-Layer Neural Networks: Polynomial Convergence and SQ Lower Bounds
Santosh Vempala, John WilmesTalk: Wednesday 6/26/19 at 2:50 PM
Poster session: Wednesday 6/26/19 at 6:00 PM
We study the complexity of training neural network models with one hidden nonlinear activation layer and an output weighted sum layer. We analyze Gradient Descent applied to learning a bounded target function on n real-valued inputs. We give an agnostic learning guarantee for GD: starting from a randomly initialized network, it converges in mean squared loss to the minimum error (in 2-norm) of the best approximation of the target function using a polynomial of degree at most k. Moreover, for any k, the size of the network and number of iterations needed are both bounded by n^{O(k)} log(1/ε). The core of our analysis is the following existence theorem, which is of independent interest: for any ε > 0, any bounded function that has a degree k polynomial approximation with error ε_0 (in 2-norm), can be approximated to within error ε_0 + ε as a linear combination of n^{O(k)}poly(1/ε) randomly chosen gates from any class of gates whose corresponding activation function has nonzero coefficients in its harmonic expansion for degrees up to k. In particular, this applies to training networks of unbiased sigmoids and ReLUs. We complement this result with a nearly matching lower bound in the Statistical Query model. GD fits well in the SQ framework since each training step is determined by an expectation over the input distribution. We show that any SQ algorithm that achieves significant improvement over a constant function with queries of tolerance some inverse polynomial in the input dimensionality n must use n^{Ω(k)} queries even when the target functions are degree-k polynomials, and the input distribution is uniform over the unit sphere. Our approach for both parts is based on spherical harmonics. We view gradient descent as an operator on the space of functions, and study its dynamics. An essential tool is the Funk-Hecke theorem, which explains the eigenfunctions of this operator in the case of the mean squared loss.
Exponential Convergence Time of Gradient Descent for One-Dimensional Deep Linear Neural Networks
Ohad ShamirTalk: Wednesday 6/26/19 at 3:00 PM
Poster session: Wednesday 6/26/19 at 6:00 PM
We study the dynamics of gradient descent on objective functions of the form f(\prod_{i=1}^{k} w_i) (with respect to scalar parameters w_1,..,w_k), which arise in the context of training depth-k linear neural networks. We prove that for standard random initializations, and under mild assumptions on f, the number of iterations required for convergence scales exponentially with the depth k. We also show empirically that this phenomenon can occur in higher dimensions, where each w_i is a matrix. This highlights a potential obstacle in understanding the convergence of gradient-based methods for deep linear neural networks, where k is large.
Depth Separations in Neural Networks: What is Actually Being Separated?
Itay Safran, Ronen Eldan, Ohad ShamirTalk: Wednesday 6/26/19 at 3:10 PM
Poster session: Wednesday 6/26/19 at 6:00 PM
Existing depth separation results for constant-depth networks essentially show that certain radial functions in \(\mathbb{R}^d\), which can be easily approximated with depth \(3\) networks, cannot be approximated by depth \(2\) networks, even up to constant accuracy, unless their size is exponential in \(d\). However, the functions used to demonstrate this are rapidly oscillating, with a Lipschitz parameter scaling polynomially with the dimension \(d\) (or equivalently, by scaling the function, the hardness result applies to \(\mathcal{O}(1)\)-Lipschitz functions only when the target accuracy \(\epsilon\) is at most \(\text{poly}(1/d)\)). In this paper, we study whether such depth separations might still hold for Lipschitz radial functions when \(\epsilon\) is a constant. Perhaps surprisingly, we show that the answer is negative: In contrast to the intuition suggested by previous work, it \emph{is} possible to approximate radial functions with depth \(2\), size \(\text{poly}(d)\) networks, for every constant \(\epsilon\). We complement it by showing that approximation is also possible with depth \(2\), size \(\text{poly}(1/\epsilon)\) networks, for every constant \(d\). Finally, we show that it is not possible to have polynomial dependence in both \(d,1/\epsilon\) simultaneously. Overall, our results indicate that in order to show depth separations for expressing Lipschitz functions with constant accuracy -- if at all possible -- one would need fundamentally different techniques than existing ones in the literature.
Stochastic Gradient Descent Learns State Equations with Nonlinear Activations
Samet OymakTalk: Wednesday 6/26/19 at 3:20 PM
Poster session: Wednesday 6/26/19 at 6:00 PM
We study discrete time dynamical systems governed by the state equation \(h_{t+1}=ϕ(Ah_t+Bu_t)\). Here A,B are weight matrices, ϕ is an activation function, and \(u_t\) is the input data. This relation is the backbone of recurrent neural networks (e.g. LSTMs) which have broad applications in sequential learning tasks. We utilize stochastic gradient descent to learn the weight matrices from a finite input/state trajectory \((u_t,h_t)_{t=0}^N\). We prove that SGD estimate linearly converges to the ground truth weights while using near-optimal sample size. Our results apply to increasing activations whose derivatives are bounded away from zero. The analysis is based on i) an SGD convergence result with nonlinear activations and ii) careful statistical characterization of the state vector. Numerical experiments verify the fast convergence of SGD on ReLU and leaky ReLU in consistence with our theory.
Tight analyses for non smooth stochastic gradient descent
Nicholas J. A. Harvey, Christopher Liaw, Yaniv Plan, Sikander RandhawaTalk: Thursday 6/27/19 at 4:00 PM
Poster session: Thursday 6/27/19 at 6:00 PM
Consider the problem of minimizing functions that are Lipschitz and strongly convex, but not necessarily differentiable. We prove that after T steps of stochastic gradient descent, the error of the final iterate is O(log(T)/T) with high probability. We also construct a function from this class for which the error of the final iterate of deterministic gradient descent is Omega(log(T)/T). This shows that the upper bound is tight and that, in this setting, the last iterate of stochastic gradient descent has the same general error rate (with high probability) as deterministic gradient descent. This resolves both open questions posed by Shamir (2012). An intermediate step of our analysis proves that the suffix averaging method achieves error O(1/T) with high probability, which is optimal (for any first-order optimization method). This improves results of Rakhlin et al. (2012) and Hazan and Kale (2014), both of which achieved error O(1/T), but only in expectation, and achieved a high probability error bound of O(log(log(T))/T), which is sub-optimal.
Making the Last Iterate of SGD Information Theoretically Optimal
Prateek Jain, Dheeraj Nagaraj, Praneeth NetrapalliTalk: Thursday 6/27/19 at 4:10 PM
Poster session: Thursday 6/27/19 at 6:00 PM
Stochastic gradient descent (SGD) is one of the most widely used algorithms for large scale optimization problems. While classical theoretical analysis of SGD for convex problems studies (suffix) \emph{averages} of iterates and obtains information theoretically optimal bounds on suboptimality, the \emph{last point} of SGD is, by far, the most preferred choice in practice. The best known results for last point of SGD~\citep{shamir2013stochastic} however, are suboptimal compared to information theoretic lower bounds by a \(\log T\) factor, where \(T\) is the number of iterations.~\cite{harvey2018tight} shows that in fact, this additional \(\log T\) factor is tight for standard step size sequences of \(\OTheta{\frac{1}{\sqrt{t}}}\) and \(\OTheta{\frac{1}{t}}\) for non-strongly convex and strongly convex settings, respectively. Similarly, even for subgradient descent (GD) when applied to non-smooth, convex functions, the best known step-size sequences still lead to \(O(\log T)\)-suboptimal convergence rates (on the final iterate). The main contribution of this work is to design new step size sequences that enjoy information theoretically optimal bounds on the suboptimality of \emph{last point} of SGD as well as GD. We achieve this by designing a modification scheme, that converts one sequence of step sizes to another so that the last point of SGD/GD with modified sequence has the same suboptimality guarantees as the average of SGD/GD with original sequence. We also show that our result holds with high-probability. We validate our results through simulations which demonstrate that the new step size sequence indeed improves the final iterate significantly compared to the standard step size sequences.
Stochastic Approximation of Smooth and Strongly Convex Functions: Beyond the \(O(1/T)\) Convergence Rate
Lijun Zhang, Zhi-Hua ZhouTalk: Thursday 6/27/19 at 4:20 PM
Poster session: Thursday 6/27/19 at 6:00 PM
Stochastic approximation (SA) is a classical approach for stochastic convex optimization. Previous studies have demonstrated that the convergence rate of SA can be improved by introducing either smoothness or strong convexity condition. In this paper, we make use of smoothness and strong convexity simultaneously to boost the convergence rate. Let \(\lambda\) be the modulus of strong convexity, \(\kappa\) be the condition number, \(F_*\) be the minimal risk, and \(\alpha>1\) be some small constant. First, we demonstrate that, in expectation, an \(O(1/[\lambda T^\alpha] + \kappa F_*/T)\) risk bound is attainable when \(T = \Omega(\kappa^\alpha)\). Thus, when \(F_*\) is small, the convergence rate could be faster than \(O(1/[\lambda T])\) and approaches \(O(1/[\lambda T^\alpha])\) in the ideal case. Second, to further benefit from small risk, we show that, in expectation, an \(O(1/2^{T/\kappa}+F_*)\) risk bound is achievable. Thus, the excess risk reduces exponentially until reaching \(O(F_*)\), and if \(F_*=0\), we obtain a global linear convergence. Finally, we emphasize that our proof is constructive and each risk bound is equipped with an efficient stochastic algorithm attaining that bound.
An Optimal High-Order Tensor Method for Convex Optimization
Bo Jiang, Haoyue Wang, Shuzhong ZhangTalk: Thursday 6/27/19 at 4:30 PM
Poster session: Thursday 6/27/19 at 6:00 PM
This paper is concerned with finding an optimal algorithm for minimizing a composite convex objective function. The basic setting is that the objective is the sum of two convex functions: the first function is smooth with up to the d-th order derivative information available, and the second function is possibly non-smooth, but its proximal tensor mappings can be computed approximately in an efficient manner. The problem is to find -- in that setting -- the best possible (optimal) iteration complexity for convex optimization. Along that line, for the smooth case (without the second non-smooth part in the objective), Nesterov (1983) proposed an optimal algorithm for the first-order methods (d=1) with iteration complexity O( 1 / k^2) and recently Nesterov (2018) proposed a high-order tensor algorithm with iteration complexity O( 1 / k^{d+1}). In this paper, we propose a new high-order tensor algorithm for the general composite case, with the iteration complexity of O( 1 / k^{(3d+1)/2} ), which matches the lower bound for the d-th order methods as established in Nesterov (2018), Y.Arjevani et al. (2018), and hence is optimal. Our approach is based on the Accelerated Hybrid Proximal Extragradient (A-HPE) framework proposed in R.D.C.Monteiro and B.F.Svaiter (2013), where a bisection procedure is installed for each A-HPE iteration. At each bisection step a proximal tensor subproblem is approximately solved, and the total number of bisection steps per A-HPE iteration is bounded by a logarithmic factor in the precision required.
Near-optimal method for highly smooth convex optimization
Sebastien Bubeck, Qijia Jiang, Yin Tat Lee, Yuanzhi Li, Aaron SidfordTalk: Thursday 6/27/19 at 4:30 PM
Poster session: Thursday 6/27/19 at 6:00 PM
We propose a near-optimal method for highly smooth convex optimization. More precisely, in the oracle model where one obtains the \(p^{th}\) order Taylor expansion of a function at the query point, we propose a method with rate of convergence \(\tilde{O}(1/k^{\frac{3p +1}{2}})\) after \(k\) queries to the oracle for any convex function whose \(p^{th}\) order derivative is Lipschitz.
Optimal Tensor Methods in Smooth Convex and Uniformly Convex Optimization
Alexander Gasnikov, Pavel Dvurechensky, Eduard Gorbunov, Evgeniya Vorontsova, Daniil Selikhanovych, Cesar A. UribeTalk: Thursday 6/27/19 at 4:30 PM
Poster session: Thursday 6/27/19 at 6:00 PM
We consider convex optimization problems with the objective function having Lipshitz-continuous \(p\)-th order derivative, where \(p\geq 1\). We propose a new tensor method, which closes the gap between the lower \(O\left(\e^{-\frac{2}{3p+1}} \right)\) and upper \(O\left(\e^{-\frac{1}{p+1}} \right)\) iteration complexity bounds for this class of optimization problems. We also consider uniformly convex functions, and show how the proposed method can be accelerated under this additional assumption. Moreover, we introduce a \(p\)-th order condition number which naturally arises in the complexity analysis of tensor methods under this assumption. Finally, we make a numerical study of the proposed optimal method and show that in practice it is faster than the best known accelerated tensor method. We also compare the performance of tensor methods for \(p=2\) and \(p=3\) and show that the 3rd-order method is superior to the 2nd-order method in practice.
Stochastic first-order methods: non-asymptotic and computer-aided analyses via potential functions
Adrien Taylor, Francis BachTalk: Thursday 6/27/19 at 4:40 PM
Poster session: Thursday 6/27/19 at 6:00 PM
We provide a novel computer-assisted technique for systematically analyzing first-order methods for optimization. In contrast with previous works, the approach is particularly suited for handling sublinear convergence rates and stochastic oracles. The technique relies on semidefinite programming and potential functions. It allows simultaneously obtaining worst-case guarantees on the behavior of those algorithms, and assisting in choosing appropriate parameters for tuning their worst-case performances. The technique also benefits from comfortable tightness guarantees, meaning that unsatisfactory results can be improved only by changing the setting. We use the approach for analyzing deterministic and stochastic first-order methods under different assumptions on the nature of the stochastic noise. Among others, we treat unstructured noise with bounded variance, different noise models arising in over-parametrized expectation minimization problems, and randomized block-coordinate descent schemes.
Beyond Least-Squares: Fast Rates for Regularized Empirical Risk Minimization through Self-Concordance
Ulysse Marteau-Ferey, Dmitrii M. Ostrovskii, Francis Bach, Alessandro RudiTalk: Thursday 6/27/19 at 4:50 PM
Poster session: Thursday 6/27/19 at 6:00 PM
We consider learning methods based on the regularization of a convex empirical risk by a squared Hilbertian norm, a setting that includes linear predictors and non-linear predictors through positive-definite kernels. In order to go beyond the generic analysis leading to convergence rates of the excess risk as \(O(1/\sqrt{n})\) from \(n\) observations, we assume that the individual losses are self-concordant, that is, their third-order derivatives are bounded by their second-order derivatives. This setting includes least-squares, as well as all generalized linear models such as logistic and softmax regression. For this class of losses, we provide a bias-variance decomposition and show that the assumptions commonly made in least-squares regression, such as the source and capacity conditions, can be adapted to obtain fast non-asymptotic rates of convergence by improving the bias terms, the variance terms or both.
Solving Empirical Risk Minimization in the Current Matrix Multiplication Time
Yin Tat Lee, Zhao Song, Qiuyi ZhangTalk: Thursday 6/27/19 at 5:00 PM
Poster session: Thursday 6/27/19 at 6:00 PM
Many convex problems in machine learning and computer science share the same form: \begin{align*} \min_{x} \sum_{i} f_i( A_i x + b_i), \end{align*} where \(f_i\) are convex functions on \(\R^{n_i}\) with constant \(n_i\), \(A_i \in \R^{n_i \times d}\), \(b_i \in \R^{n_i}\) and \(\sum_i n_i = n\). This problem generalizes linear programming and includes many problems in empirical risk minimization. In this paper, we give an algorithm that runs in time \begin{align*} O^* ( ( n^{\omega} + n^{2.5 - \alpha/2} + n^{2+ 1/6} ) \log (n / \delta) ) \end{align*} where \(\omega\) is the exponent of matrix multiplication, \(\alpha\) is the dual exponent of matrix multiplication, and \(\delta\) is the relative accuracy. Note that the runtime has only a log dependence on the condition numbers or other data dependent parameters and these are captured in \(\delta\). For the current bound \(\omega \sim 2.38\) [Vassilevska Williams'12, Le Gall'14] and \(\alpha \sim 0.31\) [Le Gall, Urrutia'18], our runtime \(O^* ( n^{\omega} \log (n / \delta))\) matches the current best for solving a dense least squares regression problem, a special case of the problem we consider. Very recently, [Alman'18] proved that all the current known techniques can not give a better \(\omega\) below \(2.168\) which is larger than our \(2+1/6\). Our result generalizes the very recent result of solving linear programs in current matrix multiplication time [Cohen, Lee, Song'18] to a more broad class of problems. Our algorithm proposes two concepts which are different from [Cohen, Lee, Song'18] :\\ \(\bullet\) We give a robust deterministic central path method, whereas the previous one is a stochastic central path which updates weights by a random sparse vector. \\ \(\bullet\) We propose an efficient data-structure to maintain the central path of interior point methods even when the update to weights are dense.
Lower Bounds for Parallel and Randomized Convex Optimization
Jelena Diakonikolas, Cristóbal GuzmánTalk: Thursday 6/27/19 at 5:10 PM
Poster session: Thursday 6/27/19 at 6:00 PM
We study the question of whether parallelization in the exploration of the feasible set can be used to speed up convex optimization, in the local oracle model of computation. We show that the answer is negative for both deterministic and randomized algorithms applied to essentially any of the interesting geometries and nonsmooth, weakly-smooth, or smooth objective functions. In particular, we show that it is not possible to obtain a polylogarithmic (in the sequential complexity of the problem) number of parallel rounds with a polynomial (in the dimension) number of queries per round. In the majority of these settings and when the dimension of the space is polynomial in the inverse target accuracy, our lower bounds match the oracle complexity of sequential convex optimization, up to at most a logarithmic factor in the dimension, which makes them (nearly) tight. Prior to our work, lower bounds for parallel convex optimization algorithms were only known in a small fraction of the settings considered in this paper, mainly applying to Euclidean (\(\ell_2\)) and \(\ell_\infty\) spaces. Our work provides a more general and streamlined approach for proving lower bounds in the setting of parallel convex optimization.
The Complexity of Making the Gradient Small in Stochastic Convex Optimization
Dylan Foster, Ayush Sekhari, Ohad Shamir, Nathan Srebro, Karthik Sridharan, Blake WoodworthTalk: Thursday 6/27/19 at 5:20 PM
Poster session: Thursday 6/27/19 at 6:00 PM
We give nearly matching upper and lower bounds on the complexity of finding epsilon-stationary points (\(\|\nabla F(x)\|\leq\epsilon\)) in stochastic convex optimization. We jointly study the complexity in both the local stochastic oracle model and global stochastic oracle (statistical learning, or "sample complexity") model. This allows us to decompose the complexity of stochastic optimization cleanly into optimization complexity and sample complexity, and reveals some surprising differences between the two models. In particular, we show that the in the global stochastic oracle/statistical learning model, only logarithmic dependence on smoothness is required to find an approximate stationary point, thereby showing that the folklore understanding that smoothness is required to find stationary points is only weakly true in this model. Our upper bounds are based on extensions of recent techniques proposed by Allen-Zhu (2018). In particular, we show how to extend these techniques to the statistical learning setting to get improved rates. Interestingly, our algorithm for global stochastic oracles can be implemented efficiently using finite-sum optimization methods, and hence provides an interesting tradeoff of having smaller sample complexity at the cost of slightly higher computation.
A Universal Algorithm for Variational Inequalities Adaptive to Smoothness and Noise
Francis Bach, Kfir Y. LevyTalk: Thursday 6/27/19 at 5:30 PM
Poster session: Thursday 6/27/19 at 6:00 PM
We consider variational inequalities coming from monotone operators, a setting that includes convex minimization and convex-concave saddle-point problems. We assume an access to potentially noisy unbiased values of the monotone operators and assess convergence through a compatible gap function which corresponds to the standard optimality criteria in the aforementioned subcases. We present a universal algorithm for these inequalities based on the Mirror-Prox algorithm. Concretely, our algorithm simultaneously achieves the optimal rates for the smooth/non-smooth, and noisy/noiseless setting. This is done without any prior knowledge of these properties, and in the general set-up of arbitrary norms and compatible Bregman divergences. For convex minimization and convex-concave saddle-point problems, this leads to new adaptive algorithms. Our method relies on a novel yet simple adaptive choice of the step-size, which can be seen as the appropriate extension of AdaGrad to handling constrained problems.
Stabilized SVRG: Simple Variance Reduction for Nonconvex Optimization
Rong Ge, Zhize Li, Weiyao Wang, Xiang WangTalk: Thursday 6/27/19 at 5:40 PM
Poster session: Thursday 6/27/19 at 6:00 PM
Variance reduction techniques like SVRG provide simple and fast algorithms for optimizing a convex finite-sum objective. For nonconvex objectives, it can also find a first-order stationary point (with small gradient). However, in nonconvex optimization it is often crucial to find a second-order stationary point (with small gradient and almost PSD hessian). In this paper, we show that Stabilized SVRG (a simple variant of SVRG) can converge to an \(\epsilon\)-second-order stationary point using only \(O(n^{2/3}/\epsilon^2+n/\epsilon^{1.5})\) stochastic gradients. To our best knowledge this is the first second-order guarantee for a simple variant of SVRG. The running time almost matches the known guarantees for \(\epsilon\)-first-order stationary points.
Sharp Analysis for Nonconvex SGD Escaping from Saddle Points
Cong Fang, Zhouchen Lin, Tong ZhangTalk: Thursday 6/27/19 at 5:50 PM
Poster session: Thursday 6/27/19 at 6:00 PM
In this paper, we prove that the simplest Stochastic Gradient Descent (SGD) algorithm is able to efficiently escape from saddle points and find an \((\epsilon, O(\epsilon^{0.5}))\)-approximate second-order stationary point in \(\tilde{O}(\epsilon^{-3.5})\) stochastic gradient computations for generic nonconvex optimization problems, under both gradient-Lipschitz and Hessian-Lipschitz assumptions. This unexpected result subverts the classical belief that SGD requires at least \(O(\epsilon^{-4})\) stochastic gradient computations for obtaining an \((\epsilon, O(\epsilon ^{0.5}))\)-approximate second-order stationary point. Such SGD rate matches, up to a polylogarithmic factor of problem-dependent parameters, the rate of most accelerated nonconvex stochastic optimization algorithms that adopt additional techniques, such as Nesterov's momentum acceleration, negative curvature search, as well as quadratic and cubic regularization tricks. Our novel analysis gives new insights into nonconvex SGD and can be potentially generalized to a broad class of stochastic optimization algorithms.
Learning Linear Dynamical Systems with Semi-Parametric Least Squares
Max Simchowitz, Ross Boczar, Benjamin RechtTalk: Friday 6/28/19 at 9:00 AM
Poster session: Tuesday 6/25/19 at 6:00 PM
We analyze a simple prefiltered variation of the least squares estimator for the problem of estimation with biased, semi-parametric noise, an error model studied more broadly in causal statistics and active learning. We prove an oracle inequality which demonstrates that this procedure provably mitigates the variance introduced by long-term dependencies. We then demonstrate that prefiltered least-squares yields, to our knowledge, the first algorithm that provably estimates the parameters of partially-observed linear systems that attains rates which do not not incur a worst-case dependence on the rate at which these dependencies decay. The algorithm is provably consistent even for systems which satisfy the weaker marginal stability condition obeyed by many classical models based on Newtonian mechanics. In this context, our semi-parametric framework yields guarantees for both stochastic and worst-case noise.
Finite-Time Error Bounds For Linear Stochastic Approximation and TD Learning
R. Srikant, Lei YingTalk: Friday 6/28/19 at 9:10 AM
Poster session: Tuesday 6/25/19 at 6:00 PM
We consider the dynamics of a linear stochastic approximation algorithm driven by Markovian noise, and derive finite-time bounds on the moments of the error, i.e., deviation of the output of the algorithm from the equilibrium point of an associated ordinary differential equation (ODE). To obtain finite-time bounds on the mean-square error in the case of constant step-size algorithms, our analysis uses Stein's method to identify a Lyapunov function that can potentially yield good steady-state bounds, and uses this Lyapunov function to obtain finite-time bounds by mimicking the corresponding steps in the analysis of the associated ODE. We also provide a comprehensive treatment of the moments of the square of the 2-norm of the approximation error. Our analysis yields the following results: (i) for a given step-size, we show that the lower-order moments can be made small as a function of the step-size and can be upper-bounded by the moments of a Gaussian random variable; (ii) we show that the higher-order moments beyond a threshold may be infinite in steady-state; and (iii) we characterize the number of samples needed for the finite-time bounds to be of the same order as the steady-state bounds. As a by-product of our analysis, we also solve the open problem of obtaining finite-time bounds for the performance of temporal difference learning algorithms with linear function approximation and a constant step-size, without requiring a projection step or an i.i.d. noise assumption.
Model-based RL in Contextual Decision Processes: PAC bounds and Exponential Improvements over Model-free Approaches
Wen Sun, Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, John LangfordTalk: Friday 6/28/19 at 9:20 AM
Poster session: Tuesday 6/25/19 at 6:00 PM
We study the sample complexity of model-based reinforcement learning (henceforth RL) in general contextual decision processes. We design new algorithms for RL with a generic model class and analyze their statistical properties. Our algorithms have sample complexity governed by a new structural parameter called the witness rank, which we show to be small in several settings of interest, including factored MDPs and reactive POMDPs. We also show that the witness rank of a problem is never larger than the recently proposed Bellman rank parameter governing the sample complexity of the model-free algorithm OLIVE (Jiang et al., 2017), the only other provably sample efficient algorithm at this level of generality. Focusing on the special case of factored MDPs, we prove an exponential lower bound for all model-free approaches, including OLIVE, which when combined with our algorithmic results demonstrates exponential separation between model-based and model-free RL in some rich-observation settings.
Non-asymptotic Analysis of Biased Stochastic Approximation Scheme
Belhal Karimi, Blazej Miasojedow, Eric Moulines, Hoi-To WaiTalk: Friday 6/28/19 at 9:30 AM
Poster session: Tuesday 6/25/19 at 6:00 PM
Stochastic approximation (SA) is a key method used in statistical learning. Recently, its non- asymptotic convergence analysis has been considered in many papers. However, most of the prior analyses are made under restrictive assumptions such as unbiased gradient estimates and convex objective function, which significantly limit their applications to sophisticated tasks such as online and reinforcement learning. These restrictions are all essentially relaxed in this work. In particular, we analyze a general SA scheme to minimize a non-convex, smooth objective function. We consider update procedure whose drift term depends on a state-dependent Markov chain and the mean field is not necessarily of gradient type, covering approximate second-order method and allowing asymptotic bias for the one-step updates. We illustrate these settings with the online EM algorithm and the policy-gradient method for average reward maximization in reinforcement learning.
The Gap Between Model-Based and Model-Free Methods on the Linear Quadratic Regulator: An Asymptotic Viewpoint
Stephen Tu, Benjamin RechtTalk: Friday 6/28/19 at 9:40 AM
Poster session: Tuesday 6/25/19 at 6:00 PM
The effectiveness of model-based versus model-free methods is a long-standing question in reinforcement learning (RL). Motivated by recent empirical success of RL on continuous control tasks, we study the sample complexity of popular model-based and model-free algorithms on the Linear Quadratic Regulator (LQR). We show that for policy evaluation, a simple model-based plugin method requires asymptotically less samples than the classical least-squares temporal difference (LSTD) estimator to reach the same quality of solution; the sample complexity gap between the two methods can be at least a factor of state dimension. For policy evaluation, we study a simple family of problem instances and show that nominal (certainty equivalence principle) control also requires several factors of state and input dimension fewer samples than the policy gradient method to reach the same level of control performance on these instances. Furthermore, the gap persists even when employing baselines commonly used in practice. To the best of our knowledge, this is the first theoretical result which demonstrates a separation in the sample complexity between model-based and model-free methods on a continuous control task.
Towards Efficient Effective Reinforcement Learning Algorithms That Interact With People
Emma BrunskillTalk (Keynote): Friday 6/28/19 at 10:00 AM
There is increasing excitement about reinforcement learning-- a subarea of machine learning for enabling an agent to learn to make good decisions. Yet numerous questions and challenges remain for reinforcement learning to help support progress in applications that involve interacting with people, like education, consumer marketing and healthcare. I will discuss our work on some of the technical challenges that arise in this pursuit, including minimax PAC and regret bounds for reinforcement learning in tabular environments, and counterfactual reasoning from prior data.
Estimating the Mixing Time of Ergodic Markov Chains
Geoffrey Wolfer, Aryeh KontorovichTalk: Friday 6/28/19 at 2:20 PM
Poster session: Wednesday 6/26/19 at 6:00 PM
We address the problem of estimating the mixing time tmix of an arbitrary ergodic finite Markov chain from a single trajectory of length m. The reversible case was addressed by Hsu et al. (2017), who left the general case as an open problem. In the reversible case, the analysis is greatly facilitated by the fact that the Markov operator is self-adjoint, and Weyl’s inequality allows for a dimension-free perturbation analysis of the empirical eigenvalues. As Hsu et al. point out, in the absence of reversibility (and hence, the non-symmetry of the pair probabilities matrix), the existing perturbation analysis has a worst-case exponential dependence on the number of states d. Furthermore, even if an eigenvalue perturbation analysis with better dependence on d were available, in the non-reversible case the connection between the spectral gap and the mixing time is not nearly as straightforward as in the reversible case. Our key insight is to estimate the pseudo-spectral gap instead, which allows us to overcome the loss of self-adjointness and to achieve a polynomial dependence on d and the minimal stationary probability. Additionally, in the reversible case, we obtain simultaneous nearly (up to logarithmic factors) minimax rates in tmix and precision ε, closing a gap in Hsu et al., who treated ε as constant in the lower bounds. Finally, we construct fully empirical confidence intervals for the pseudo-spectral gap, which shrink to zero at a rate of roughly 1/sqrt(m), and improve the state of the art in even the reversible case.
Distribution-Dependent Analysis of Gibbs-ERM Principle
Ilja Kuzborskij, Nicolò Cesa-Bianchi, Csaba SzepesvariTalk: Friday 6/28/19 at 2:30 PM
Poster session: Wednesday 6/26/19 at 6:00 PM
Gibbs-ERM learning is a natural idealized model of learning with stochastic optimization algorithms (such as Stochastic Gradient Langevin Dynamics and —to some extent— Stochastic Gradient Descent), while it also arises in other contexts, including PAC-Bayesian theory, and sampling mechanisms. In this work we study the excess risk suffered by a Gibbs-ERM learner that uses non-convex, regularized empirical risk with the goal to understand the interplay between the data- generating distribution and learning in large hypothesis spaces. Our main results are distribution-dependent upper bounds on several notions of excess risk. We show that, in all cases, the distribution-dependent excess risk is essentially controlled by the effective dimension tr(H* (H* + λI)^{−1}) of the problem, where H* is the Hessian matrix of the risk at a local minimum. This is a well-established notion of effective dimension appearing in several previous works, including the analyses of SGD and ridge regression, but ours is the first work that brings this dimension to the analysis of learning using Gibbs densities. The distribution-dependent view we advocate here improves upon earlier results of Raginsky et al. (2017), and can yield much tighter bounds depending on the interplay between the data-generating distribution and the loss function. The first part of our analysis focuses on the localized excess risk in the vicinity of a fixed local minimizer. This result is then extended to bounds on the global excess risk, by characterizing probabilities of local minima (and their complement) under Gibbs densities, a results which might be of independent interest.
Theoretical guarantees for sampling and inference in generative models with latent diffusions
Belinda Tzen, Maxim RaginskyTalk: Friday 6/28/19 at 2:40 PM
Poster session: Wednesday 6/26/19 at 6:00 PM
We introduce and study a class of probabilistic generative models, where the latent object is a finite-dimensional diffusion process on a finite time interval and the observed variable is drawn conditionally on the terminal point of the diffusion. We make the following contributions: We provide a unified viewpoint on both sampling and variational inference in such generative models through the lens of stochastic control. We quantify the expressiveness of diffusion-based generative models. Specifically, we show that one can efficiently sample from a wide class of terminal target distributions by choosing the drift of the latent diffusion from the class of multilayer feedforward neural nets, with the accuracy of sampling measured by the Kullback--Leibler divergence to the target distribution. Finally, we present and analyze a scheme for unbiased simulation of generative models with latent diffusions, and provide bounds on the variance of the resulting estimators, and show that this scheme can be implemented as a deep generative model with a random number of layers.
Sampling and Optimization on Convex Sets in Riemannian Manifolds of Non-Negative Curvature
Navin Goyal, Abhishek ShettyTalk: Friday 6/28/19 at 2:50 PM
Poster session: Wednesday 6/26/19 at 6:00 PM
The Euclidean space notion of convex sets (and functions) generalizes to Riemannian manifolds in a natural sense and is called geodesic convexity. Extensively studied computational problems such as convex optimization and sampling in convex sets also have meaningful counterparts in the manifold setting. Geodesically convex optimization is a well-studied problem with ongoing research and considerable recent interest in machine learning and theoretical computer science. In this paper, we study sampling and convex optimization problems over manifolds of non-negative curvature proving polynomial running time in the dimension and other relevant parameters. Our algorithms assume a warm start. We first present a random walk based sampling algorithm and then combine it with simulated annealing for solving convex optimization problems. To our knowledge, these are the first algorithms in the general setting of positively curved manifolds with provable polynomial guarantees under reasonable assumptions, and the first study of the connection between sampling and optimization in this setting.
Nonconvex sampling with the Metropolis-adjusted Langevin algorithm
Oren Mangoubi, Nisheeth K. VishnoiTalk: Friday 6/28/19 at 3:00 PM
Poster session: Wednesday 6/26/19 at 6:00 PM
The Langevin Markov chain algorithms are widely deployed methods to sample from distributions in challenging high-dimensional and non-convex Statistics and Machine Learning applications. Despite this, current bounds for Langevin algorithms are slower than those of competing algorithms in many important situations, for instance when sampling from weakly log-concave distributions, or when sampling or optimizing from non-convex log-densities. In this paper, we obtain improved bounds in many of these situations, showing that the Metropolis-adjusted Langevin algorithm (MALA) is faster than the best bounds for its competitor algorithms when the target distribution \(\pi\) satisfies weak third- and fourth- order regularity properties associated with the input data. Our regularity conditions are weaker than the usual Euclidean operator norm regularity properties and allow us to show faster bounds for a much larger class of distributions than would be possible with the usual Euclidean operator norm constants, including in Statistics and Machine learning applications where the data satisfy a certain incoherence condition. In particular, we show that using our regularity conditions one can obtain faster bounds for applications for sampling problems in Bayesian logistic regression with weakly convex priors, and the nonconvex optimization problem of learning linear classifiers with zero-one loss functions. Our main technical contribution in this paper is our analysis of the Metropolis acceptance probability of MALA in terms of its energy-conservation error, and our bound for this error in terms of third- and fourth- order regularity conditions. Our combination of this higher-order analysis of the energy conservation error with the conductance method is key to obtaining bounds which have a sub-linear dependence on \(d\) in the non-strongly logconcave setting.
Normal Approximation for Stochastic Gradient Descent via Non-Asymptotic Rates of Martingale CLT
Andreas Anastasiou, Krishnakumar Balasubramanian, Murat ErdogduTalk: Friday 6/28/19 at 3:10 PM
Poster session: Wednesday 6/26/19 at 6:00 PM
We provide non-asymptotic convergence rates of the Polyak-Ruppert averaged stochastic gradient descent (SGD) to a normal random vector for a class of twice-differentiable test functions. A crucial intermediate step is proving a non-asymptotic martingale central limit theorem (CLT), i.e., establishing the rates of convergence of a multivariate martingale difference sequence to a normal random vector, which might be of independent interest. We obtain the explicit rates for the multivariate martingale CLT using a combination of Stein's method and Lindeberg's argument, which is then used in conjunction with a non-asymptotic analysis of averaged SGD proposed in Polyak and Juditsky (1992). Our results have potentially interesting consequences for computing confidence intervals for parameter estimation with SGD and constructing hypothesis tests with SGD that are valid in a non-asymptotic sense.
The Relative Complexity of Maximum Likelihood Estimation, MAP Estimation, and Sampling
Christopher Tosh, Sanjoy DasguptaTalk: Friday 6/28/19 at 3:20 PM
Poster session: Thursday 6/27/19 at 6:00 PM
We prove that, for a broad range of problems, maximum-a-posteriori (MAP) estimation and approximate sampling of the posterior are at least as computationally difficult as maximum-likelihood (ML) estimation. By way of illustration, we show how hardness results for ML estimation of mixtures of Gaussians and topic models carry over to MAP estimation and approximate sampling.
Uniform concentration and symmetrization for weak interactions
Andreas Maurer, Massimiliano PontilTalk: Friday 6/28/19 at 4:00 PM
Poster session: Thursday 6/27/19 at 6:00 PM
The method to derive uniform bounds with Gaussian and Rademacher complexities is extended to the case where the sample average is replaced by a nonlinear statistic. Tight bounds are obtained for U-statistics, smoothened L-statistics and error functionals of l2-regularized algorithms.
Learning from Weakly Dependent Data under Dobrushin's Condition
Yuval Dagan, Constantinos Daskalakis, Nishanth Dikkala, Siddhartha JayantiTalk: Friday 6/28/19 at 4:10 PM
Poster session: Thursday 6/27/19 at 6:00 PM
Statistical learning theory has largely focused on learning and generalization given independent and identically distributed (i.i.d.) samples. Motivated by applications involving time-series data, there has been a growing literature on learning and generalization in settings where data is sampled from an ergodic process. This work has also developed complexity measures, which appropriately extend Rademacher complexity to bound the generalization error and learning rates of hypothesis classes in this setting. Rather than time-series data, our work is motivated by settings where data is sampled on a network or a spatial domain, and thus do not fit well the framework of prior work. We provide learning and generalization bounds for data that are complexly dependent, yet their distribution satisfies the standard Dobrushin condition. Indeed, we show that the standard complexity measures of (Gaussian) Rademacher complexity and VC dimension are sufficient measures of complexity for the purposes of bounding the generalization error and learning rates of hypothesis classes in our setting. Moreover, our generalization bounds only degrade by constant factors compared to their i.i.d.~analogs, and our learnability bounds degrade by log factors in the size of the training set.
High probability generalization bounds for uniformly stable algorithms with nearly optimal rate
Vitaly Feldman, Jan VondrakTalk: Friday 6/28/19 at 4:20 PM
Poster session: Thursday 6/27/19 at 6:00 PM
Algorithmic stability is a classical approach to understanding and analysis of the generalization error in learning theory. A notable weakness of most stability-based generalization bounds is that they hold only in expectation, whereas model-complexity based approaches establish generalization with high probability. This issue was addressed in a landmark paper by Bousquet and Elisseeff (2002) who introduced the notion of uniform stability and used it to prove generalization bounds that hold with high probability. Specifically, their bound on the estimation error of any \(\gamma\)-uniformly stable learning algorithm on \(n\) samples is \(O(\gamma \sqrt{n \log(1/\delta)})\) with probability \(\geq 1-\delta\). The \(\sqrt{n}\) overhead significantly limits the applicability of this bound (or even makes it vacuous in the common settings where \(\gamma \geq 1/\sqrt{n}\)). A stronger bound was recently proved by Feldman and Vondrak (2018) that reduces the overhead to at most \(O(n^{1/4})\). We prove a bound of \(O(\gamma \log(n)\log(n/\delta))\) on the estimation error of any \(\gamma\)-uniformly stable algorithm. Our result leads to the first high-probability generalization bounds for several well-studied optimization algorithms with nearly optimal rate. Our proof technique is new and we introduce several analysis tools that might find additional applications.
When can unlabeled data improve the learning rate?
Christina Göpfert, Shai Ben-David, Olivier Bousquet, Sylvain Gelly, Ilya Tolstikhin, Ruth UrnerTalk: Friday 6/28/19 at 4:30 PM
Poster session: Thursday 6/27/19 at 6:00 PM
In semi-supervised classification, one is given access both to labeled and unlabeled data. As unlabeled data is typically cheaper to acquire than labeled data, this setup becomes advantageous as soon as one can exploit the unlabeled data in order to produce a better classifier than with labeled data alone. However, the conditions under which such an improvement is possible are not fully understood yet. Our analysis focuses on improvements in the {\em minimax} learning rate in terms of the number of labeled examples (with the number of unlabeled examples being allowed to depend on the number of labeled ones). We argue that for such improvements to be realistic and indisputable, certain specific conditions should be satisfied and previous analyses have failed to meet those conditions. We then demonstrate examples where these conditions can be met, in particular showing rate changes from \(1/\sqrt{\l}\) to \(e^{-c\l}\) and from \(1/\sqrt{\l}\) to \(1/\l\). These results improve our understanding of what is and isn't possible in semi-supervised learning.
Classification with unknown class conditional label noise on non-compact feature spaces
Henry Reeve, Ata KabanTalk: Friday 6/28/19 at 4:40 PM
Poster session: Thursday 6/27/19 at 6:00 PM
We investigate the problem of classification in the presence of unknown class conditional label noise in which the labels observed by the learner have been corrupted with some unknown class dependent probability. In order to obtain finite sample rates, previous approaches to classification with unknown class conditional label noise have required that the regression function attains its extrema uniformly on sets of positive measure. We shall consider this problem in the setting of non-compact metric spaces, where the regression function need not attain its extrema. In this setting we determine the minimax optimal learning rates (up to logarithmic factors). The rate displays interesting threshold behaviour: When the regression function approaches its extrema at a sufficient rate, the optimal learning rates are of the same order as those obtained in the label-noise free setting. If the regression function approaches its extrema more gradually then classification performance necessarily degrades. In addition, we present an algorithm which attains these rates without prior knowledge of either the distributional parameters or the local density. This identifies for the first time a scenario in which finite sample rates are achievable in the label noise setting, but they differ from the optimal rates without label noise.
Consistency of Interpolation with Laplace Kernels is a High-Dimensional Phenomenon
Alexander Rakhlin, Xiyu ZhaiTalk: Friday 6/28/19 at 4:50 PM
Poster session: Thursday 6/27/19 at 6:00 PM
We show that minimum-norm interpolation in the Reproducing Kernel Hilbert Space corresponding to the Laplace kernel is not consistent if input dimension is constant. The lower bound holds for any choice of kernel bandwidth, even if selected based on data. The result supports the empirical observation that minimum-norm interpolation (that is, exact fit to training data) in RKHS generalizes well for some high-dimensional datasets, but not for low-dimensional ones.
On Communication Complexity of Classification Problems
Daniel Kane, Roi Livni, Shay Moran, Amir YehudayoffTalk: Friday 6/28/19 at 5:00 PM
Poster session: Thursday 6/27/19 at 6:00 PM
This work studies distributed learning in the spirit of Yao's model of communication complexity: consider a two-party setting, where each of the players gets a list of labelled examples and they communicate in order to jointly perform some learning task. To naturally fit into the framework of learning theory, the players can send each other examples (as well as bits) where each example/bit costs one unit of communication. This enables a uniform treatment of infinite classes such as half-spaces in \(\R^d\), which are ubiquitous in machine learning. We study several fundamental questions in this model. For example, we provide combinatorial characterizations of the classes that can be learned with efficient communication in the proper-case as well as in the improper-case. These findings imply unconditional separations in this context between various learning tasks, e.g.\ realizable versus agnostic learning, proper versus improper learning, etcetera. The derivation of these results hinges on a type of decision problems we term ``{\it realizability problems}'' where the goal is deciding whether a distributed input sample is consistent with an hypothesis from a pre-specified class. From a technical perspective, the protocols we devise (i.e.\ the upper bounds) are based on ideas from machine learning and the impossibility results (i.e.\ the lower bounds) are based on ideas from communication complexity.
Space lower bounds for linear prediction in the streaming model
Yuval Dagan, Gil Kur, Ohad ShamirTalk: Friday 6/28/19 at 5:10 PM
Poster session: Thursday 6/27/19 at 6:00 PM
We show that fundamental learning tasks, such as finding an approximate linear separator or linear regression, require memory at least \emph{quadratic} in the dimension, in a natural streaming setting. This implies that such problems cannot be solved (at least in this setting) by scalable memory-efficient streaming algorithms. Our results build on a memory lower bound for a simple linear-algebraic problem -- finding orthogonal vectors -- and utilize the estimates on the packing of the Grassmannian, the manifold of all linear subspaces of fixed dimension.
Affine Invariant Covariance Estimation for Heavy-Tailed Distributions
Dmitrii M. Ostrovskii, Alessandro RudiTalk: Friday 6/28/19 at 5:20 PM
Poster session: Thursday 6/27/19 at 6:00 PM
In this work we provide an estimator for the covariance matrix of a heavy-tailed multivariate distribution. We prove that the proposed estimator \(\widehat{\mathbf{S}}\) admits an \textit{affine-invariant} bound of the form \[ (1-\varepsilon) \mathbf{S} \preccurlyeq \widehat{\mathbf{S}} \preccurlyeq (1+\varepsilon) \mathbf{S} \] in high probability, where \(\mathbf{S}\) is the unknown covariance matrix, and \(\preccurlyeq\) is the positive semidefinite order on symmetric matrices. The result only requires the existence of fourth-order moments, and allows for \(\varepsilon = O(\sqrt{\kappa^4 d/n})\) where \(\kappa^4\) is some measure of kurtosis of the distribution, \(d\) the dimensionality of the space, and \(n\) is the sample size. More generally, we can allow for regularization with level~\(\lambda\), then \(\varepsilon\) depends on the degrees of freedom number which is generally smaller than \(d\). The computational cost of the novel estimator is essentially~\(O(d^2 n + d^3)\), comparable to that of the sample covariance matrix in the statistically interesing regime~\(d = O(n)\). We consider applications of our estimator to eigenvalue estimation with relative error, and to random design ridge regression.
Approximate Guarantees for Dictionary Learning
Aditya Bhaskara, Wai Ming TaiTalk: Friday 6/28/19 at 5:30 PM
Poster session: Thursday 6/27/19 at 6:00 PM
In the dictionary learning (or sparse coding) problem, we are given a collection of signals (vectors in \(\R^d\)), and the goal is to find a ``basis'' in which the signals have a sparse (approximate) representation. The problem has received a lot of attention in signal processing, learning, and theoretical computer science. The problem is formalized as factorizing a matrix \(X (d \times n)\) (whose columns are the signals) as \(X = AY\), where \(A\) has a prescribed number \(m\) of columns (typically \(m \ll n\)), and \(Y\) has columns that are \(k\)-sparse (typically \(k \ll d\)). Most of the known theoretical results involve assuming that the columns of the unknown \(A\) have certain {\em incoherence} properties, and that the coefficient matrix \(Y\) has random (or partly random) structure. The goal of our work is to understand what can be said in the absence of such assumptions. Can we still find \(A\) and \(Y\) such that \(X \approx AY\)? We show that this is possible, if we allow violating the bounds on \(m\) and \(k\) by appropriate factors that depend on \(k\) and the desired approximation. Our results rely on an algorithm for what we call the ``threshold correlation'' problem, which turns out to be related to hypercontractive norms of matrices. We also show that our algorithmic ideas apply to a setting in which some of the columns of \(X\) are outliers, thus giving similar guarantees even in this challenging setting.
Sample-Optimal Low-Rank Approximation of Distance Matrices
Piotr Indyk, Ali Vakilian, Tal Wagner, David WoodruffTalk: Friday 6/28/19 at 5:40 PM
Poster session: Thursday 6/27/19 at 6:00 PM
A distance matrix \(A \in \R^{n \times m}\) represents all pairwise distances, \(A_{ij}=\dist(x_i,y_j)\), between two point sets \(x_1,...,x_n\) and \(y_1,...,y_m\) in an arbitrary metric space \((\mathcal Z, \dist)\). Such matrices arise in various computational contexts such as learning image manifolds, handwriting recognition, and multi-dimensional unfolding. In this work we study algorithms for low rank approximation of distance matrices. Recent work by Bakshi and Woodruff (NeurIPS 2018) showed it is possible to compute a rank-\(k\) approximation of a distance matrix in time \(O((n+m)^{1+\gamma}) \cdot\poly{k,1/\epsilon}\), where \(\epsilon>0\) is an error parameter and \(\gamma>0\) is an arbitrarily small constant. Notably, their bound is sublinear in the matrix size, which is unachievable for general matrices. We present an algorithm that is both simpler and more efficient. It reads only \(O((n+m) k/\epsilon)\) entries of the input matrix, and has a running time of \(O(n+m) \cdot \poly{k,1/\epsilon}\). We complement the sample complexity of our algorithm with a matching lower bound on the number of entries that must be read by any algorithm. We provide experimental results to validate the approximation quality and running time of our algorithm.
A near-optimal algorithm for approximating the John Ellipsoid
Michael B. Cohen, Ben Cousins, Yin Tat Lee, Xin YangTalk: Friday 6/28/19 at 5:50 PM
Poster session: Thursday 6/27/19 at 6:00 PM
We develop a simple and efficient algorithm for approximating the John Ellipsoid of a symmetric polytope. Our algorithm is near optimal in the sense that our time complexity matches the current best verification algorithm. Experimental results suggest that our algorithm significantly outperforms existing algorithms. We also provide the MATLAB code for further research.