| Tutorial: Learning Topics in Game-Theoretic Decision Making |
|---|
| Michael Littman - Rutgers University |
The tutorial will cover some topics of recent interest in AI and economics concerning design making in a computational game-theory framework. It will highlight areas in which computational learning theory has played a role and could play a greater role in the future. Covered areas include recent representational and algorithmic advances, stochastic games and reinforcement learning, no regret algorithms, and the role of various equilibrium concepts. |
| Invited Talk: A General Class of No-Regret Learning Algorithms and Game-Theoretic Equilibria |
|---|
| Amy Greenwald - Brown University Amir Jafari - Brown University |
A general class of no-regret learning algorithms, called $\Phi$-no-regret learning algorithms is defined, which spans the spectrum from no-internal-regret learning to no-external-regret learning, and beyond. $\Phi$ describes the set of strategies to which the play of a learning algorithm is compared: a learning algorithm satisfies $\Phi$-no-regret iff no regret is experienced for playing as the algorithm prescribes, rather than playing according to any of the transformations of the algorithm's play prescribed by elements of $\Phi$. Analogously, a class of game-theoretic equilibria, called $\Phi$-equilibria, is defined, and it is shown that the empirical distribution of play of $\Phi$-no-regret algorithms converges to the set of $\Phi$-equilibria. Perhaps surprisingly, the strongest form of no-regret algorithms in this class are no-internal-regret algorithms. Thus, the tightest game-theoretic solution concept to which $\Phi$-no-regret algorithms (provably) converge is correlated equilibrium. In particular, Nash equilibrium is not a necessary outcome of learning via any $\Phi$-no-regret learning algorithms. |
| Preference Elicitation and Query Learning |
|---|
| Avrim Blum - Carnegie Mellon University Jeffrey Jackson - Duquesne University Tuomas Sandholm - Carnegie Mellon University Martin Zinkevich - Carnegie Mellon University |
In this paper we initiate an exploration of relationships between ``preference elicitation'', a learning-style problem that arises in combinatorial auctions, and the problem of learning via queries studied in computational learning theory. Preference elicitation is the process of asking questions about the preferences of bidders so as to best divide some set of goods. As a learning problem, it can be thought of as a setting in which there are multiple target concepts that can each be queried separately, but where the goal is not so much to learn each concept as it is to produce an ``optimal example''. In this work, we prove a number of similarities and differences between preference elicitation and query learning, giving both separation results and proving some connections between these problems. |
| Efficient Algorithms for Online Decision Problems |
|---|
| Adam Kalai - Massachusetts Institute of Technology Santosh Vempala - Massachusetts Institute of Technology |
In an online decision problem, one makes a sequence of decisions without knowledge of the future. Tools from learning such as Weighted Majority and its many variants demonstrate that online algorithms can perform nearly as well as the best single decision chosen in hindsight, even when there are exponentially many possible decisions. However, the naive application of these algorithms is inefficient for such large problems. For some problems with nice structure, specialized efficient solutions have been developed. We show that a very simple idea, used in Hannan's seminal 1957 paper, gives efficient solutions to all of these problems. Essentially, in each period, one chooses the decision that worked best in the past. To guarantee low regret, it is necessary to add randomness. Surprisingly, this simple approach gives additive epsilon regret per period, efficiently. We present a simple general analysis and several extensions, including a (1+epsilon)-competitive algorithm as well as a lazy one that rarely switches between decisions. |
| Positive Definite Rational Kernels |
|---|
| Corinna Cortes - AT\&T Labs Patrick Haffner - AT\&T Labs Mehryar Mohri - AT\&T Labs |
Kernel methods are widely used in statistical learning techniques. We recently introduced a general kernel framework based on weighted transducers or rational relations, 'rational kernels', to extend kernel methods to the analysis of variable-length sequences or more generally weighted automata. These kernels are efficient to compute and have been successfully used in applications such as spoken-dialog classification. Not all rational kernels are 'positive definite and symmetric' (PDS) however, a sufficient property for guaranteeing the convergence of discriminant classification algorithms such as Support Vector Machines. We present several theoretical results related to PDS rational kernels. We show in particular that under some conditions these kernels are closed under sum, product, or Kleene-closure and give a general method for constructing a PDS rational kernel from an arbitrary transducer defined on some non-idempotent semirings. We also show that some commonly used string kernels or similarity measures such as the edit-distance, the convolution kernels of Haussler, and some string kernels used in the context of computational biology are specific instances of rational kernels. Our results include the proof that the edit-distance over a non-trivial alphabet is not 'negative definite', which, to the best of our knowledge, was never stated or proved before. |
| Bhattacharyya and Expected Likelihood Kernels |
|---|
| Tony Jebara - Columbia University Risi Kondor - Columbia University |
We introduce a new class of kernels between distributions. These induce a kernel on the input space between data points by associating to each datum a generative model fit to the data point individually. The kernel is then computed by integrating the product of the two generative models corresponding to two data points. This kernel permits discriminative estimation via, for instance, support vector machines, while exploiting the properties, assumptions, and invariances inherent in the choice of generative model. It satisfies Mercer's condition and can be computed in closed form for a large class of models, including exponential family models, mixtures, hidden Markov models and Bayesian networks. For other models the kernel can be approximated by sampling methods. Experiments are shown for multinomial models in text classification and for hidden Markov models for protein sequence classification. |
| Maximal Margin Classification for Metric Spaces |
|---|
| Matthias Hein - Max Planck Institute for Biological Cybernetics Olivier Bousquet - Max Planck Institute for Biological Cybernetics |
In this article we construct a maximal margin classification algorithm for arbitrary metric spaces. At first we show that the Support Vector Machine (SVM) is a maximal margin algorithm for the class of metric spaces where the negative squared distance is conditionally positive definite (CPD). This means that the metric space can be isometrically embedded into a Hilbert space, where one performs linear maximal margin separation. We will show that the solution only depends on the metric, but not on the kernel. Following the framework we develop for the SVM, we construct an algorithm for maximal margin classification in arbitrary metric spaces. The main difference compared with SVM is that we no longer embed isometrically into a Hilbert space, but a Banach space. We further give an estimate of the capacity of the function class involved in this algorithm via Rademacher averages. We recover an algorithm of Graepel et al.(1999). |
| Maximum Margin Algorithms with Boolean Kernels |
|---|
| Roni Khardon - Tufts University Rocco Servedio - Columbia University |
Recent work has introduced Boolean kernels with which one can learn
over a feature space containing all conjunctions of length up to $k$
(for any $1\leq k \leq n$) over the original $n$ Boolean features in the
input space. This motivates the question of whether maximum margin
algorithms such as support vector machines can learn Disjunctive
Normal Form expressions in the PAC learning model using this kernel. We
study this question, as well as a variant in which structural risk
minimization (SRM) is performed where the class hierarchy is taken
over the length of conjunctions.
We show that such maximum margin algorithms do not PAC learn
$t(n)$-term DNF for any $t(n) = \omega(1),$
even when used with such a SRM scheme.
We also consider PAC learning under the uniform distribution
and show that if the kernel uses conjunctions of length
$\tilde{\omega}(\sqrt{n})$ then the maximum margin hypothesis will
fail on the uniform distribution as well.
Our results concretely illustrate that margin based algorithms may overfit
when learning simple target functions with natural kernels.
|
| Knowledge-Based Nonlinear Kernel Classifiers |
|---|
| Glenn Fung - The University of Wisconsin-Madison Olvi Mangasarian - The University of Wisconsin-Madison Jude Shavlik - The University of Wisconsin-Madison |
Prior knowledge in the form of multipe polyhedral sets, each belonging to one of two categories, is introduced into a reformulation of a nonlinear kernel support vector machine (SVM) classifier. The resulting formulation leads to a linear program that can be solved efficiently. This extends, in a rather unobvious fashion, previous work that incorporated similar prior knowledge into a linear SVM classifier. Numerical tests on standard-type test problems, such as exclusive-or prior knowledge sets and a checkerboard with 16 points and prior knowledge instead of the usual 1000 points, show the effectiveness of the proposed approach in generating sharp nonlinear classifiers based mostly or totally on prior knowledge. |
| Fast Kernels for Inexact String Matching |
|---|
| Christina Leslie - Columbia University Rui Kuang - Columbia University |
We introduce several new families of string kernels designed in particular for use with support vector machines (SVMs) for classification of protein sequence data. These kernels -- restricted gappy kernels, substitution kernels, and wildcard kernels -- are based on feature spaces indexed by k-length subsequences either from the string alphabet or the alphabet augmented by a wildcard character, and hence they are related to the recently presented (k,m)-mismatch kernel and string kernels used in text classification. However, for all kernels we define here, the kernel value K(x,y) can be computed in O(c(K)(|x| + |y|)) time, where the constant c(K) depends on the parameters of the kernel but is independent of the size of the alphabet. Thus the computation of these kernels is linear in the length of the sequences, like the mismatch kernel, but we improve upon the alphabet-dependent constant c(K) of the mismatch kernel. We compute the kernels efficiently using a recursive function based on a trie data structure and relate our new kernels to the recently described transducer formalism. Finally, we report protein classification experiments on a benchmark SCOP dataset, where we show that our new faster kernels achieve SVM classification performance comparable to the mismatch kernel and the Fisher kernel derived from profile hidden Markov models. |
| On Graph Kernels: Hardness Results and Efficient Alternatives |
|---|
| Thomas Gaertner - Fraunhofer AIS, University of Bonn, and University of Bristol Peter Flach - University of Bristol Stefan Wrobel - Fraunhofer AIS and University of Bonn |
As most `real-world' data is structured, research in kernel methods has begun
investigating kernels for various kinds of structured data. One of the most
widely used tools for modeling structured data are graphs. An interesting and
important challenge is thus to investigate kernels on instances that are
represented by graphs. So far, only very specific graphs such as trees and
strings have been considered.
This paper investigates kernels on labeled directed graphs with general
structure. It is shown that computing a strictly positive definite graph
kernel is at least as hard as solving the graph isomorphism problem. It is
also shown that computing an inner product in a feature space indexed by all
possible graphs, where each feature counts the number of subgraphs isomorphic
to that graph, is {\it NP}-hard. On the other hand, inner products in an
alternative feature space, based on walks in the graph, can be computed in
polynomial time. Such kernels are defined in this paper. |
| Kernels and Regularization on Graphs |
|---|
| Alexander J. Smola - The Australian National University Risi Kondor - Columbia University |
We introduce a family of kernels on graphs based on the notion of regularization operators. This generalizes in a natural way the notion of regularization and Greens functions, as commonly used for real valued functions, to graphs. It turns out that diffusion kernels can be found as a special case of our reasoning. We show that the class of positive, monotonically decreasing functions on the unit interval leads to kernels and corresponding regularization operators. |
| Data-Dependent Bounds for Multi-Category Classification\\Based on Convex Losses |
|---|
| Ilya Desyatnikov - Technion Ron Meir - Technion |
Algorithms for solving multi-category classification problems using output coding have become very popular in recent years. Following initial attempts with discrete coding matrices, recent work has attempted to alleviate some of their shortcomings by considering real-valued `coding' matrices. We consider an approach to multi-category classification, based on minimizing a convex upper bound on the $0-1$ loss. We show that this approach is closely related to output coding, and derive data-dependent bounds on the performance. These bounds can be optimized in order to obtain effective coding matrices, which guarantee small generalization error. Moreover, our results apply directly to kernel based approaches. |
| Comparing Clusterings by the Variation of Information |
|---|
| Marina Meila - University of Washington |
This paper proposes an information theoretic criterion for comparing two partitions, or clusterings, of the same data set. The criterion, called variation of information (VI), measures the amount of information lost and gained in changing from one clustering to another. The criterion makes no assumptions about how the clusterings were generated and applies to both soft and hard clusterings. The basic properties of VI are presented and discussed from the point of view of comparing clusterings. In particular, the VI is positive, symmetric and obeys the triangle inequality. Thus, surprisingly enough, it is a true metric on the space of clusterings. |
| Multiplicative Updates for Large Margin Classifiers |
|---|
| Fei Sha - University of Pennsylvania Lawrence Saul - University of Pennsylvania Daniel Lee - University of Pennsylvania |
Various problems in nonnegative quadratic programming arise in the training of large margin classifiers. We derive multiplicative updates for these problems that converge monotonically to the desired solutions for hard and soft margin classifiers. The updates differ strikingly in form from other multiplicative updates used in machine learning. In this paper, we provide complete proofs of convergence for these updates and extend previous work to incorporate sum and box constraints in addition to nonnegativity. |
| Simplified PAC-Bayesian Margin Bounds |
|---|
| David McAllester - Toyota Technological Institute at Chicago |
The theoretical understanding of support vector machines is largely based on margin bounds for linear classifiers with unit-norm weight vectors and unit-norm feature vectors. Unit-norm margin bounds have been proved previously using fat-shattering arguments and Rademacher complexity. Recently Langford and Shawe-Taylor proved a dimension-independent unit-norm margin bound using a relatively simple PAC-Bayesian argument. Unfortunately, the Langford-Shawe-Taylor bound is stated in a variational form making direct comparison to fat-shattering bounds difficult. This paper provides an explicit solution to the variational problem implicit in the Langford-Shawe-Taylor bound and shows that the PAC-Bayesian margin bounds are significantly tighter. Because a PAC-Bayesian bound is derived from a particular prior distribution over hypotheses, a PAC-Bayesian margin bound also seems to provide insight into the nature of the learning bias underlying the bound. |
| Sparse Kernel Partial Least Squares Regression |
|---|
| Michinari Momma - Rensselaer Polytechnic Institute Kristin Bennett - Rensselaer Polytechnic Institute |
Partial Least Squares Regression (PLS) and its kernel version (KPLS) have become competitive regression approaches. KPLS performs as well as or better than support vector regression (SVR) for moderately-sized problems with the advantages of simple implementation, less training cost, and easier tuning of parameters. Unlike SVR, KPLS requires manipulation of the full kernel matrix and the resulting regression function requires the full training data. In this paper we rigorously derive a sparse KPLS algorithm. The underlying KPLS algorithm is modified to maintain sparsity in all steps of the algorithm. The resulting nu-KPLS algorithm explicitly models centering and bias rather than using kernel centering. An epsilon-insensitive loss function is used to produce sparse solutions in the dual space. The final regression function for the nu-KPLS algorithm only requires a relatively small set of support vectors. |
| Sparse Probabilistic Regression by Label Partitioning |
|---|
| Shantanu Chakrabartty - The Johns Hopkins University Gert Cauwenberghs - The Johns Hopkins University Prof Jayadeva - Indian Institute of Technology |
A large-margin learning machine for sparse probability regression is presented. Unlike support vector machines and other forms of kernel machines, nonlinear features are obtained by transforming labels into higher-dimensional label space rather than transforming data vectors into feature space. Linear multi-class logistic regression with partitioned classes of labels yields a nonlinear classifier in the original labels. With a linear kernel in data space, storage and run-time requirements are reduced from the number of support vectors to the number of partitioned labels. Using the partitioning property of KL-divergence in label space, an iterative alignment procedure produces sparse training coefficients. Experiments show that label partitioning is effective in modeling non-linear decision boundaries with same, and in some cases superior, generalization performance to Support Vector Machines with significantly reduced memory and run-time requirements. |
| Learning with Rigorous Support Vector Machines |
|---|
| Jinbo Bi - Rensselaer Polytechnic Insitute Vladimir Vapnik - NEC Laboratories America |
We examine the so-called rigorous support vector machine (RSVM) approach proposed by Vapnik (1998). The formulation of RSVM is derived by explicitly implementing the structural risk minimization principle with a parameter H used to directly control the VC dimension of the set of separating hyperplanes. By optimizing the dual problem, RSVM finds the optimal separating hyperplane from a set of functions with VC dimension approximate to H-squared+1. RSVM produces classifiers equivalent to those obtained by classic SVMs for appropriate parameter choices, but the use of the parameter H facilitates model selection, thus minimizing VC bounds on the generalization risk more effectively. In our empirical studies, good models are achieved for an appropriate H-squared in [5% L, 30% L] where L is the size of training data. |
| Robust Regression by Boosting the Median |
|---|
| Balazs Kegl - University of Montreal |
Most boosting regression algorithms use the weighted average of base regressors as their final regressor. In this paper we analyze the choice of the weighted median. We propose a general boosting algorithm based on this approach. We prove boosting-type convergence of the algorithm and give clear conditions for the convergence of the robust training error. The algorithm recovers AdaBoost and AdaBoost_rho as special cases. For boosting confidence-rated predictions, it leads to a new approach that outputs a different decision and interprets robustness in a different manner than the approach based on the weighted average. In the general, non-binary case we suggest practical strategies based on the analysis of the algorithm and experiments. |
| Boosting with Diverse Base Classifiers |
|---|
| Sanjoy Dasgupta - University of California at San Diego Philip Long - Genome Institute of Singapore |
We establish a new bound on the generalization error rate of the Boost-by-Majority algorithm. The bound holds when the algorithm is applied to a collection of base classifiers that contains a "diverse" subset of "good" classifiers, in a precisely defined sense. We describe cross-validation experiments that suggest that Boost-by-Majority can be the basis of a practically useful learning method, often improving on the generalization of AdaBoost on large datasets. |
| Reducing Kernel Matrix Diagonal Dominance using Semi-Definite Programming |
|---|
| Jaz Kandola - Royal Holloway University of London Thore Graepel - Microsoft Research John Shawe-Taylor - Royal Holloway University of London |
Kernel-based learning methods revolve around the notion of a {\em
kernel} or Gram matrix between data points. These square,
symmetric, positive semi-definite matrices can informally be
regarded as encoding pairwise similarity between all of the
objects in a data-set. In this paper we propose an algorithm for
manipulating the diagonal entries of a kernel matrix using
semi-definite programming. Kernel matrix diagonal dominance
reduction attempts to deal with the problem of learning with
almost orthogonal features, a phenomenon commonplace in kernel
matrices derived from string kernels or Gaussian kernels with
small width parameter. We show how this task can be formulated as
a semi-definite programming optimization problem that can be
solved with readily available optimizers. Theoretically we provide
an analysis using Rademacher based bounds to provide an
alternative motivation for the 1-norm SVM motivated from kernel
diagonal reduction. We assess the performance of the algorithm on
standard data sets with encouraging results in terms of
approximation and prediction.
|
| Optimal Rates of Aggregation |
|---|
| Alexandre Tsybakov - Universit\'{e} Paris 6 |
We study the problem of aggregation of M arbitrary estimators of a regression function with respect to the mean squared risk. Three main types of aggregation are considered: model selection, convex and linear aggregation. We define the notion of optimal rate of aggregation in an abstract context and prove lower bounds valid for any method of aggregation. We then construct procedures that attain these bounds, thus establishing optimal rates of linear, convex and model selection type aggregation. |
| Distance-Based Classification with Lipschitz Functions |
|---|
| Ulrike von Luxburg - Max Planck Institute for Biological Cybernetics Olivier Bousquet - Max Planck Institute for Biological Cybernetics |
The goal of this article is to develop a framework for large margin classification in metric spaces. We want to find a generalization of linear decision functions for metric spaces and define a corresponding notion of margin such that the decision function separates the training points with a large margin. It will turn out that using Lipschitz functions as decision functions, the inverse of the Lipschitz constant can be interpreted as the size of a margin. In order to construct a clean mathematical setup we isometrically embed the given metric space into a Banach space and the space of Lipschitz functions into its dual space. Our approach leads to a general large margin algorithm for classification in metric spaces. To analyze this algorithm, we first prove a representer theorem. It states that there exists a solution which can be expressed as linear combination of distances to sets of training points. Then we analyze the Rademacher complexity of some Lipschitz function classes. The generality of the Lipschitz approach can be seen from the fact that several well-known algorithms are special cases of the Lipschitz algorithm, among them the support vector machine, the linear programming machine, and the 1-nearest neighbor classifier. |
| Random Subclass Bounds |
|---|
| Shahar Mendelson - The Australian National University Petra Philips - The Australian National University |
It has been recently shown that sharp generalization bounds can be obtained when the function class from which the algorithm chooses its hypotheses is "small" in the sense that the Rademacher averages of this function class are small (Mendelson,2002). Seemingly based on different arguments, generalization bounds were obtained in the compression scheme (Littlestone and Warmuth,1986), luckiness (Shawe-Taylor et al, 1998), and algorithmic luckiness (Herbrich and Williamson, 2002) frameworks in which the "size" of the function class is not specified a priori. We show that the bounds obtained in all these frameworks follow from the same general principle, namely that coordinate projections of this function subclass evaluated on random samples are "small" with high probability. |
| PAC-MDL Bounds |
|---|
| Avrim Blum - Carnegie Mellon University John Langford - IBM TJ Watson Research Center |
We point out that a number of standard sample complexity bounds (VC-dimension, PAC-Bayes, and others) are all related to the number of bits required to communicate the labels given the unlabeled data for a natural communication game. Motivated by this observation, we give a general sample complexity bound based on this game that allows us to unify these different bounds in one common framework. |
| Universal Well-Calibrated Algorithm for On-line Classification |
|---|
| Vladimir Vovk - Royal Holloway University of London |
We study the problem of on-line classification in which the prediction algorithm is given a "confidence level" 1-delta and is required to output as its prediction a range of labels (intuitively, those labels deemed compatible with the available data at the level delta) rather than just one label; as usual, the examples are assumed to be generated independently from the same probability distribution P. The prediction algorithm is said to be "well-calibrated" for P and delta if the long-run relative frequency of errors does not exceed delta almost surely w.r. to P. For well-calibrated algorithms we take the number of "uncertain" predictions (i.e., those containing more than one label) as the principal measure of predictive performance. The main result of this paper is the construction of a prediction algorithm which, for any (unknown) P and any delta: (a) makes errors independently and with probability delta at every trial (in particular, is well-calibrated for P and delta); (b) makes in the long run no more uncertain predictions than any other prediction algorithm that is well-calibrated for P and delta; (c) processes example n in time O(log n). |
| Learning Probabilistic Linear-Threshold Classifiers via\\Selective Sampling |
|---|
| Nicol\`{o} Cesa-Bianchi - Universit\`{a} degli Studi di Milano Alex Conconi - Universit\`{a} degli Studi di Milano Claudio Gentile - Universit\`{a} dell'Insubria |
In this paper we investigate selective sampling, a learning model where the learner observes a sequence of i.i.d.\ unlabeled instances each time deciding whether to query the label of the current instance. We assume that labels are binary and stochastically related to instances via a linear probabilistic function whose coefficients are arbitrary and unknown. We then introduce a new selective sampling rule and show that its expected regret (with respect to the classifier knowing the underlying linear function and observing the label realization after each prediction) grows not much faster than the number of sampled labels. Furthermore, under additional assumptions on the true margin distribution, we prove that the number of sampled labels grows only logarithmically in the number of observed instances. Experiments carried out on a text categorization problem show that: (1) our selective sampling algorithm performs better than the Perceptron algorithm even when the latter is given the true label after each classification; (2) when allowed to observe the true label after each classification, the performance of our algorithm remains the same. Finally, we note that by expressing our selective sampling rule in dual variables we can learn nonlinear probabilistic functions via the kernel machinery. |
| Learning Algorithms for Enclosing Points in Bregmanian Spheres |
|---|
| Koby Crammer - The Hebrew University of Jerusalem Yoram Singer - The Hebrew University of Jerusalem |
We discuss the problem of finding a generalized sphere that encloses points originating from a single source. The points contained in such a sphere are within a maximal divergence from a center point. The divergences we study are known as the Bregman divergences which include as a special case both the Euclidean distance and the relative entropy. We cast the learning task as an optimization problem and show that it results in a simple dual form which bears interesting algebraic properties. We then discuss a general algorithmic framework to solve the optimization problem. Our training algorithm employs an auxiliary function that bounds the dual's objective function and can be used with a broad class of Bregman functions. As a specific application of the algorithm we give a detailed derivation for the relative entropy. We analyze the generalization ability of the algorithm by adopting margin-style proof techniques. We also describe and analyze two schemes of online algorithms for the case when the radius of the sphere is set in advance. |
| Internal Regret in On-line Portfolio Selection |
|---|
| Gilles Stoltz - Universit\'{e} Paris-Sud Gabor Lugosi - University Pompeu Fabra |
This paper extends the game-theoretic notion of internal regret to the case of on-line potfolio selection problems. New sequential investment strategies are designed to minimize the cumulative internal regret for all possible market behaviors. Some of the introduced strategies, apart from achieving a small internal regret, achieve an accumulated wealth almost as large as that of the best constantly rebalanced portfolio. It is argued that the low-internal-regret property is related to stability and experiments on real stock exchange data demonstrate that the new strategies achieve usually better returns compared to some known algorithms. |
| Lower Bounds on the Sample Complexity of Exploration in the\\Multi-Armed Bandit Problem |
|---|
| Shie Mannor - Massachusetts Institute of Technology John Tsitsiklis - Massachusetts Institute of Technology |
We consider the Multi-armed bandit problem under the PAC
("probably approximately correct") model. It was shown by Even-Dar
et al. that given n arms, it suffices to play the arms a total of
times which is linear in n, inversely quadratic in epsilon, and
linear in the logarithm of one over delta, to find an
epsilon-optimal arm with probability of at least 1-delta.
Our contribution is a matching lower bound that holds for any sampling policy.
We also generalize the lower bound to a Bayesian setting, and to the case
where the statistics of the arms are known but the identities of the arms are
not.
|
| Smooth $\epsilon$-Insensitive Regression by Loss Symmetrization |
|---|
| Ofer Dekel - The Hebrew University of Jerusalem Shai Shalev-Shwartz - The Hebrew University of Jerusalem Yoram Singer - The Hebrew University of Jerusalem |
We describe a framework for solving regression problems by reduction to classification. Our reduction is based on symmetrization of margin-based loss functions commonly used in boosting algorithms, namely, the logistic loss and the exponential loss. Our construction yields a smooth version of the epsilon-insensitive hinge loss that is used in support vector regression. A byproduct of this construction is a new simple form of regularization for boosting-based classification and regression algorithms. We present two parametric families of batch learning algorithms for minimizing these losses. The first family employs a log-additive update and is based on recent boosting algorithms while the second family uses a new form of additive update. We also describe and analyze online gradient descent (GD) and exponentiated gradient (EG) algorithms for the epsilon-insensitive logistic loss. Our regression framework also has implications on classification algorithms, namely, a new additive batch algorithm for the log-loss and exp-loss used in boosting. |
| On Finding Large Conjunctive Clusters |
|---|
| Nina Mishra - HP Labs/Stanford University Dana Ron - Tel-Aviv University Ram Swaminathan - HP Labs |
We propose a new formulation of the clustering problem that differs from previous work in several aspects. First, the goal is to explicitly output a collection of simple and meaningful conjunctive descriptions of the clusters. Second, the clusters might overlap, i.e., a point can belong to multiple clusters. Third, the clusters might not cover all points, i.e., not every point is clustered. Finally, we allow a point to be assigned to a conjunctive cluster description even if it does not completely satisfy all of the attributes, but rather only satisfies most. A convenient way to view our clustering problem is that of finding a collection of large bicliques in a bipartite graph. Identifying one largest conjunctive cluster is equivalent to finding a maximum edge biclique. Since this problem is NP-hard [Peeters00] and there is evidence that it is difficult to approximate [Feige02], we solve a relaxed version where the objective is to find a large subgraph that is close to being a biclique. We give a randomized algorithm that finds a relaxed biclique with almost as many edges as the maximum biclique. We then extend this algorithm to identify a good collection of large relaxed bicliques. A key property of these algorithms is that their running time is independent of the number of data points and linear in the number of attributes. |
| Learning Arithmetic Circuits via Partial Derivatives |
|---|
| Adam Klivans - Harvard University Amir Shpilka - Harvard University and Massachusetts Institute of Technology |
We present a polynomial time algorithm for learning several
models of algebraic computation. We show that any arithmetic
circuit whose partial derivatives induce a ``low''-dimensional vector
space is exactly learnable from membership and equivalence queries.
As a consequence we obtain the first polynomial time algorithm for
learning depth three multilinear arithmetic circuits.
In addition, we give the first polynomial time algorithms for learning
restricted algebraic branching programs and noncommutative
arithmetic formulae. Previously only restricted versions of
depth-2 arithmetic circuits were known to be learnable in polynomial
time. Our learning algorithms can be viewed as solving a
generalization of the well known it polynomial interpolation
problem where the unknown polynomial has a succinct representation.
We can learn representations of polynomials encoding exponentially many
monomials. Our techniques combine a careful algebraic analysis of arithmetic
circuits' partial derivatives with the ``multiplicity automata" techniques due
to Beimel et al \cite{BBB00}. |
| Using a Linear Fit to Determine Monotonicity Directions |
|---|
| Malik Magdon-Ismail - RPI Joseph Sill - Plumtree Software |
Let $f$ be a function on $R^d$ that is monotonic in every variable. There are $2^d$ possible assignments to the directions of monotonicity (two per variable). We provide sufficient conditions under which the optimal linear model obtained from a least squares regression on $f$ will identify the monotonicity directions correctly. We show that when the input dimensions are independent, the linear fit correctly identifies the monotonicity directions. We provide an example to illustrate that in the general case, when the input dimensions are not independent, the linear fit may not identify the directions correctly. However, when the inputs are jointly Gaussian, as is often assumed in practice, the linear fit will correctly identify the monotonicity directions, even if the input dimensions are dependent. Gaussian densities are a special case of a more general class of densities (Mahalanobis densities) for which the result holds. Our results hold when $f$ is a classification or regression function. If a finite data set is sampled from the function, we show that if the exact linear regression would have yielded the correct monotonicity directions, then the sample regression will also do so asymptotically (in a probabilistic sense). This result holds even if the data are noisy. |
| Generalization Bounds for Voting Classifiers Based on\\Sparsity and Clustering |
|---|
| Vladimir Koltchinskii - The University of New Mexico Dmitry Panchenko - Massachusetts Institute of Technology Savina Andonova - Boston University |
We prove new margin type bounds on the generalization error of voting classifiers that take into account the sparsity of weights and certain measures of clustering of weak classifiers in the convex combination. We also present experimental results to illustrate the behavior of the parameters of interest for several data sets. |
| Sequence Prediction Based on Monotone Complexity |
|---|
| Marcus Hutter - IDSIA |
This paper studies sequence prediction based on the monotone Kolmogorov complexity Km=-log m, i.e. based on universal deterministic/one-part MDL. m is extremely close to Solomonoff's prior M, the latter being an excellent predictor in deterministic as well as probabilistic environments, where performance is measured in terms of convergence of posteriors or losses. Despite this closeness to M, it is difficult to assess the prediction quality of m, since little is known about the closeness of their posteriors, which are the important quantities for prediction. We show that for deterministic computable environments, the "posterior" and losses of m converge, but rapid convergence could only be shown on-sequence; the off-sequence behavior is unclear. In probabilistic environments, neither the posterior nor the losses converge, in general. |
| How Many Strings Are Easy to Predict? |
|---|
| Yuri Kalnishkan - Royal Holloway University of London Vladimir Vovk - Royal Holloway University of London Michael V. Vyugin - Royal Holloway University of London |
It is well known in the theory of Kolmogorov complexity that most strings
cannot be compressed; more precisely, only exponentially few
($\Theta(2^{n-m})$) strings of length $n$ can be compressed by $m$ bits. This
paper extends the `incompressibility' property of Kolmogorov complexity to the
`unpredictability' property of predictive complexity. The `unpredictability'
property states that predictive complexity (defined as the loss suffered by a
universal prediction algorithm working infinitely long) of most strings is
close to a trivial upper bound (the loss suffered by a trivial minimax
constant prediction strategy). We show that only exponentially few strings
can be successfully predicted and find the base of the exponent. |
| Polynomial Certificates for Propositional Classes |
|---|
| Marta Arias - Tufts University Roni Khardon - Tufts University Rocco Servedio - Columbia University |
This paper studies the query complexity of learning
classes of expressions in propositional logic
from equivalence and membership queries.
We give new constructions of
polynomial size certificates of non-membership
for monotone, unate and Horn CNF functions.
Our constructions yield quantitatively different bounds from
previous constructions of certificates for these classes.
We prove lower bounds on certificate size
which show that for some parameter settings the certificates we
construct for these classes are exactly optimal.
Finally, we also prove that a natural generalization of these classes,
the class of renamable Horn CNF functions,
does {\em not} have polynomial size certificates of non-membership, thus
answering an open question of Feigelson. |
| On-line Learning with Imperfect Monitoring |
|---|
| Shie Mannor - Massachusetts Institute of Technology Nahum Shimkin - Technion |
We study on-line play of repeated matrix games in which the observations of past actions of the other player and the obtained reward are partial and stochastic. We define the Partial Observation Bayes Envelope (POBE) as the best reward against the worst-case stationary strategy of the opponent that agrees with past observations. Our goal is to have the (unobserved) average reward above the POBE. For the case where the observations (but not necessarily the rewards) depend on the opponent play alone, an algorithm for attaining the POBE is derived. This algorithm is based on an application of approachability theory combined with a worst-case view over the unobserved rewards. We also suggest a simplified solution concept for general signaling structure. This concept may fall short of the POBE. |
| Exploiting Task Relatedness for Multiple Task Learning |
|---|
| Shai Ben-David - Cornell University and Technion Reba Schuller - Cornell University |
The approach of learning of multiple "related" tasks simultaneously has proven quite successful in practice; however, theoretical justification for this success has remained elusive. The starting point for previous work on multiple task learning has been that the tasks to be learned jointly are somehow "algorithmically related", in the sense that the results of applying a specific learning algorithm to these tasks are assumed to be similar. We offer an alternative approach, defining relatedness of tasks on the basis of similarity between the example generating distributions that underline these task. We provide a formal framework for this notion of task relatedness, which captures a sub-domain of the wide scope of issues in which one may apply a multiple task learning approach. Our notion of task similarity is relevant to a variety of real life multitask learning scenarios and allows the formal derivation of generalization bounds that are strictly stronger than the previously known bounds for both the learning-to-learn and the multitask learning scenarios. We give precise conditions under which our bounds guarantee generalization on the basis of smaller sample sizes than the standard single-task approach. |
| Approximate Equivalence of Markov Decision Processes |
|---|
| Eyal Even-Dar - Tel Aviv University Yishay Mansour - Tel Aviv University |
We consider the problem of finding the minimal $\eps$-equivalent MDP for an MDP given in its tabular form. We show that the problem is NP-Hard and then give a bicriteria approximation algorithm to the problem. We suggest that the right measure for finding minimal $\eps$-equivalent model is $L_1$ rather than $L_\infty$ by giving both an example, which demonstrates the drawback of using $L_\infty$, and performance guarantees for using $L_1$. In addition, we give a polynomial algorithm that decides whether two MDPs are equivalent. |
| An Information Theoretic Tradeoff between Complexity and Accuracy |
|---|
| Ran Gilad-Bachrach - The Hebrew University of Jerusalem Amir Navot - The Hebrew University of Jerusalem Naftali Tishby - The Hebrew University of Jerusalem |
A fundamental question in learning theory is the quantification of the basic tradeoff between the complexity of a model and its predictive accuracy. One valid way of quantifying this tradeoff, known as the Information Bottleneck, is to measure both the complexity of the model and its prediction accuracy by using Shannon's mutual information. In this paper we show that the Information Bottleneck framework answers a well defined and known coding problem and at same time it provides a general relationship between complexity and prediction accuarcy, measured by mutual information. We study the nature of this complexity-accuracy tradeoff and discuss some of its theoretical properties. Furthermore, we present relations to classical information theoretic problems, such as rate-distortion theory, cost-capacity tradeoff and source coding with side information. |
| Learning Random log-depth Decision Trees under the Uniform\\Distribution |
|---|
| Jeffrey Jackson - Duquesne University Rocco Servedio - Columbia University |
We consider three natural models of random log-depth decision trees. We give an efficient algorithm that for each of these models learns---as a decision tree---all but an inverse polynomial fraction of such trees using only uniformly distributed random examples. |
| Projective DNF Formulae and Their Revision |
|---|
| Robert Sloan - University of Illinois at Chicago Balazs Szorenyi - University of Szeged Gyorgy Turan - University of Illinois at Chicago |
Projection learning, introduced by Valiant, aims to learn a target concept by learning some of the concept's projections to a class of smaller domains, and by combining these projections. A natural special case of this framework is when the projection domains are subcubes of a fixed dimension, and the restrictions of the target to these domains are conjunctions. The first objective of this work is to study the class of functions learned in this case, called projective DNFs. We give some basic properties of projective DNFs by comparing them to standard DNF subclasses such as k-DNF and k-term DNF, and we present lower and upper bounds for the exclusion dimension of projective DNF. For the subclass of 1-projective DNF we prove a characterization theorem giving an explicit description of all such expressions. Our second objective is to extend Valiant's result on attribute-efficient learning of projective DNFs to the related case of efficient revision. We show that Littlestone's Winnow, the paradigmatic attribute-efficient disjunction learning algorithm, is also an efficient algorithm for revising disjunctions, if the initial weights are suitably modified, even with attribute errors. This implies that Valiant's result can be extended to the efficient revision of projective DNF. |
| Learning with Equivalence Constraints, and the Relation to\\Multiclass Learning |
|---|
| Aharon Bar Hillel - The Hebrew University of Jerusalem Daphna Weinshall - The Hebrew University of Jerusalem |
We study the problem of learning partitions using equivalence constraints as input. This is a binary classification problem in the product space of pairs of datapoints. The training data includes pairs of datapoints which are labeled as coming from the same class or not. This kind of data appears naturally in applications where explicit labeling of datapoints is hard to get, but relations between datapoints can be more easily obtained, using, for example, Markovian dependency (as in video clips). Our problem is an unlabeled partition problem, and is therefore tightly related to multiclass classification. We show that the solutions of the two problems are related, in the sense that a good solution to the binary classification problem entails the existence of a good solution to the multiclass problem, and vice versa. We also show that bounds on the sample complexity of the two problems are similar, by showing that their relevant 'dimensions' (VC dimension for the binary problem, Natarajan dimension for the multiclass problem) bound each other. Finally, we show the feasibility of solving multiclass learning efficiently by using a solution of the equivalent binary classification problem. In this way advanced techniques developed for binary classification, such as SVM and boosting, can be used directly to enhance multiclass learning. |
| Tutorial: Machine Learning Methods in Natural Language Processing |
|---|
| Michael Collins - Massachusetts Institute of Technology |
Statistical or machine learning approaches have become quite prominent in the Natural Language Processing literature. Common techniques include generative models such as Hidden Markov Models or Probabilistic Context-Free Grammars, and more general noisy-channel models such as the statistical approach to machine translation pioneered by researchers at IBM in the early 90s. Recent work has considered discriminative methods such as (conditional) markov random fields, or large-margin methods. This tutorial will describe several of these techniques. The methods will be motivated through a number of natural language problems: from part-of-speech tagging and parsing, to machine translation, dialogue systems and information extraction problems. I will also concentrate on links to the COLT and kernel methods literature: for example covering kernels over the discrete structures found in NLP, online algorithms for NLP problems, and the issues in extending generalization bounds from classification problems to NLP problems such as parsing. |
| Invited talk: Learning from Uncertain Data |
|---|
| Mehryar Mohri - AT\&T Labs |
The application of statistical methods to natural language processing has been remarkably successful over the past two decades. But, to deal with recent problems arising in this field, machine learning techniques must be generalized to deal with uncertain data, or datasets whose elements are distributions over sequences, such as weighted automata. This paper reviews a number of recent results related to this question. We discuss how to compute efficiently basic statistics from a weighted automaton such as the expected count of an arbitrary sequence and higher moments of that distribution, by using weighted transducers. Both the corresponding transducers and related algorithms are described. We show how general classification techniques such as Support Vector Machines can be extended to deal with distributions by using general kernels between weighted automata. We describe several examples of positive definite kernels between weighted automata such as kernels based on counts of common $n$-gram sequences, counts of common factors or suffixes, or other more complex kernels, and describe a general algorithm for computing them efficiently. We also demonstrate how machine learning techniques such as clustering based on the edit-distance can be extended to deal with unweighted and weighted automata representing distributions. |
| Invited talk: Learning and Parsing Stochastic Unification-Based Grammars |
|---|
| Mark Johnson - Brown University |
Stochastic Unification-Based Grammars combine knowledge-rich and data-rich approaches to natural language processing. This provides a rich structure to the learning and parsing (decoding) tasks that can be described with undirected graphical models. While most work to date has treated parsing as a straight-forward multi-class classification problem, we are beginning to see how this structure can be exploited in learning and parsing. Exploiting this structure is likely to become more important as the research focus moves from parsing to more realistic tasks such as machine translation and summarization. |
| Generality's Price: Inescapable Deficiencies in Machine-Learned\\Programs |
|---|
| John Case - University of Delaware Keh-Jiann Chen - Academia Sinica Sanjay Jain - National University of Singapore Wolfgang Merkle - Universit\"{a}t Heidelberg James Royer - Syracuse University |
This paper investigates some delicate tradeoffs between the generality of an algorithmic learning device and the quality of the programs it learns successfully. There are results to the effect that, thanks to small increases in generality of a learning device, the computational complexity of some successfully learned programs is provably unalterably suboptimal. There are also results in which the complexity of successfully learned programs is asymptotically optimal and the learning device is general, but, still thanks to the generality, some of those optimal, learned programs are provably unalterably information deficient - in some cases, deficient as to safe, algorithmic extractability/provability of the fact that they are even approximately optimal. The paper is on the borderline between learning theory, complexity theory and logic. |
| On Learning To Coordinate: Random Bits Help, Insightful Normal\\Forms, and Competency Isomorphisms |
|---|
| John Case - University of Delaware Sanjay Jain - National University of Singapore Franco Montagna - University of Siena Giulia Simi - University of Siena Andrea Sorbi - University of Siena |
A mere bounded number of random bits judiciously employed by a probabilistically correct algorithmic coordinator is shown to increase the power of learning to coordinate compared to deterministic algorithmic coordinators. Furthermore, these probabilistic algorithmic coordinators are provably not characterized in power by teams of deterministic ones. An insightful, enumeration technique based, normal form characterization of the classes that are learnable by total computable coordinators is given. These normal forms are for insight only since it is shown that the complexity of the normal form of a total computable coordinator can be infeasible compared to the original coordinator. Montagna and Osherson showed that the competence class of a total coordinator cannot be strictly improved by another total coordinator. It is shown in the present paper that the competencies of any two total coordinators are the same modulo isomorphism. Furthermore, a completely effective, index set version of this competency isomorphism result is given, where all the coordinators are total computable. |
| Learning All Subfunctions of a Function |
|---|
| Sanjay Jain - National University of Singapore Efim Kinber - Sacred Heart University Rolf Wiehagen - University of Kaiserslautern |
Sublearning, a model for learning of subconcepts of a concept, is presented. Sublearning a class of total recursive functions informally means to learn all functions from that class together with all of their subfunctions. While in language learning it is known to be impossible to learn any infinite language together with all of its sublanguages, the situation changes for sublearning of functions. Several types of sublearning are defined and compared to each other as well as to other learning types. For example, in some cases, sublearning coincides with robust learning. Furthermore, whereas in usual function learning there are classes that cannot be learned consistently, all sublearnable classes of some natural types can be learned consistently. Moreover, the power of sublearning is characterized in several terms, thereby establishing a close connection to measurable classes and variants of this notion. As a consequence, there are rich classes which do not need any self-referential coding for sublearning them. |
| When Is Small Beautiful? |
|---|
| Amiran Ambroladze - Lund University John Shawe-Taylor - Royal Holloway University of London |
- |
| Learning a Function of $r$ Relevant Variables |
|---|
| Avrim Blum - Carnegie Mellon University |
- |
| Subspace Detection: A Robust Statistics Formulation |
|---|
| Sanjoy Dasgupta - University of California at San Diego |
- |
| How Fast Is $k$-Means? |
|---|
| Sanjoy Dasgupta - University of California at San Diego |
- |
| Universal Coding of Zipf Distributions |
|---|
| Yoav Freund - Mitsubishi Electric Research Labs Alon Orlitsky - University of California at San Diego Prasad Santhanam - University of California at San Diego Junan Zhang - University of California at San Diego |
- |
| An Open Problem Regarding the Convergence of Universal A\\Priori Probability |
|---|
| Marcus Hutter - IDSIA |
- |
| Entropy Bounds for Restricted Convex Hulls |
|---|
| Vladimir Koltchinskii - University of New Mexico |
- |
| Compressing to VC Dimension Many Points |
|---|
| Manfred K. Warmuth - University of California at Santa Cruz |
- |
| ||
|