All Camera-ready Abstracts

Tutorial: Learning Topics in Game-Theoretic Decision Making
Michael Littman - Rutgers University
The tutorial will cover some topics of recent interest in AI and
economics concerning design making in a computational game-theory
framework.  It will highlight areas in which computational learning
theory has played a role and could play a greater role in the future.
Covered areas include recent representational and algorithmic
advances, stochastic games and reinforcement learning, no regret
algorithms, and the role of various equilibrium concepts.
Invited Talk: A General Class of No-Regret Learning Algorithms and Game-Theoretic Equilibria
Amy Greenwald - Brown University
Amir Jafari - Brown University
A general class of no-regret learning algorithms, called
$\Phi$-no-regret learning algorithms is defined, which spans the
spectrum from no-internal-regret learning to no-external-regret
learning, and beyond.  $\Phi$ describes the set of strategies to which
the play of a learning algorithm is compared: a learning algorithm
satisfies $\Phi$-no-regret iff no regret is experienced for playing as
the algorithm prescribes, rather than playing according to any of the
transformations of the algorithm's play prescribed by elements of $\Phi$.

Analogously, a class of game-theoretic equilibria, called
$\Phi$-equilibria, is defined, and it is shown that the empirical
distribution of play of $\Phi$-no-regret algorithms converges to the
set of $\Phi$-equilibria.  Perhaps surprisingly, the strongest form of
no-regret algorithms in this class are no-internal-regret algorithms.
Thus, the tightest game-theoretic solution concept to which
$\Phi$-no-regret algorithms (provably) converge is correlated
equilibrium.  In particular, Nash equilibrium is not a necessary
outcome of learning via any $\Phi$-no-regret learning algorithms.
Preference Elicitation and Query Learning
Avrim Blum - Carnegie Mellon University
Jeffrey Jackson - Duquesne University
Tuomas Sandholm - Carnegie Mellon University
Martin Zinkevich - Carnegie Mellon University
In this paper we initiate an exploration of relationships between
``preference elicitation'', a learning-style problem that arises in
combinatorial auctions, and the problem of learning via queries
studied in computational learning theory.  Preference elicitation is
the process of asking questions about the preferences of bidders so as
to best divide some set of goods.  As a learning problem, it can be
thought of as a setting in which there are multiple target concepts
that can each be queried separately, but where the goal is not so much
to learn each concept as it is to produce an ``optimal example''.  In
this work, we prove a number of similarities and differences between
preference elicitation and query learning, giving both separation
results and proving some connections between these problems.
Efficient Algorithms for Online Decision Problems
Adam Kalai - Massachusetts Institute of Technology
Santosh Vempala - Massachusetts Institute of Technology
In an online decision problem, one makes a sequence of
decisions without knowledge of the future. Tools from learning
such as Weighted Majority and its many variants demonstrate that online
algorithms can perform nearly as well as the best single decision chosen in
hindsight, even when there are exponentially many possible decisions. However,
the naive application of these algorithms is inefficient for such large
problems.  For some problems with nice structure, specialized efficient
solutions have been developed.

We show that a very simple idea, used in Hannan's seminal 1957
paper, gives efficient solutions to all of these problems.  Essentially, in
each period, one chooses the decision that worked best in the past. To
guarantee low regret, it is necessary to add randomness. Surprisingly, this
simple approach gives additive epsilon regret per period, efficiently. We
present a simple general analysis and several extensions, including a
(1+epsilon)-competitive algorithm as well as a lazy one that rarely switches
between decisions.
Positive Definite Rational Kernels
Corinna Cortes - AT\&T Labs
Patrick Haffner - AT\&T Labs
Mehryar Mohri - AT\&T Labs
Kernel methods are widely used in statistical learning techniques. We
recently introduced a general kernel framework based on weighted
transducers or rational relations, 'rational kernels', to extend
kernel methods to the analysis of variable-length sequences or more
generally weighted automata. These kernels are efficient to compute
and have been successfully used in applications such as spoken-dialog
classification. Not all rational kernels are 'positive definite
and symmetric' (PDS) however, a sufficient property for guaranteeing
the convergence of discriminant classification algorithms such as
Support Vector Machines. We present several theoretical results
related to PDS rational kernels. We show in particular that under some
conditions these kernels are closed under sum, product, or
Kleene-closure and give a general method for constructing a PDS
rational kernel from an arbitrary transducer defined on some
non-idempotent semirings. We also show that some commonly used string
kernels or similarity measures such as the edit-distance, the
convolution kernels of Haussler, and some string kernels used in the
context of computational biology are specific instances of rational
kernels. Our results include the proof that the edit-distance over a
non-trivial alphabet is not 'negative definite', which, to the best of our
knowledge, was never stated or proved before.
Bhattacharyya and Expected Likelihood Kernels
Tony Jebara - Columbia University
Risi Kondor - Columbia University
We introduce a new class of kernels between distributions. These
induce a kernel on the input space between data points by associating
to each datum a generative model fit to the data point
individually. The kernel is then computed by integrating the product
of the two generative models corresponding to two data points. This
kernel permits discriminative estimation via, for instance, support
vector machines, while exploiting the properties, assumptions, and
invariances inherent in the choice of generative model. It satisfies
Mercer's condition and can be computed in closed form for a large
class of models, including exponential family models, mixtures, hidden Markov
models and Bayesian networks. For other models the kernel can be approximated
by sampling methods. Experiments are
shown for multinomial models in text classification and for hidden
Markov models for protein sequence classification.
Maximal Margin Classification for Metric Spaces
Matthias Hein - Max Planck Institute for Biological Cybernetics
Olivier Bousquet - Max Planck Institute for Biological Cybernetics
In this article we construct a maximal margin classification
algorithm for arbitrary metric spaces. At first we show that the
Support Vector Machine (SVM) is a maximal margin algorithm for the
class of metric spaces where the negative squared distance is
conditionally positive definite (CPD). This means that the metric
space can be isometrically embedded into a Hilbert space, where
one performs linear maximal margin separation. We will show that
the solution only depends on the metric, but not on the kernel.
Following the framework we develop for the SVM, we construct an
algorithm for maximal margin classification in arbitrary metric
spaces. The main difference compared with SVM is that we no longer
embed isometrically into a Hilbert space, but a Banach space. We
further give an estimate of the capacity of the function class
involved in this algorithm via Rademacher averages. We recover an
algorithm of Graepel et al.(1999).
Maximum Margin Algorithms with Boolean Kernels
Roni Khardon - Tufts University
Rocco Servedio - Columbia University
Recent work has introduced Boolean kernels with which one can learn
over a feature space containing all conjunctions of length up to $k$
(for any $1\leq k \leq n$) over the original $n$ Boolean features in the
input space.  This motivates the question of whether maximum margin
algorithms such as support vector machines can learn Disjunctive 
Normal Form expressions in the PAC learning model using this kernel.  We 
study this question, as well as a variant in which structural risk
minimization (SRM) is performed where the class hierarchy is taken
over the length of conjunctions. 

We show that such maximum margin algorithms do not PAC learn 
$t(n)$-term DNF for any $t(n) = \omega(1),$ 
even when used with such a SRM scheme.  
We also consider PAC learning under the uniform distribution
and show that if the kernel uses conjunctions of length
$\tilde{\omega}(\sqrt{n})$ then the maximum margin hypothesis will
fail on the uniform distribution as well.
Our results concretely illustrate that margin based algorithms may overfit
when learning simple target functions with natural kernels. 
Knowledge-Based Nonlinear Kernel Classifiers
Glenn Fung - The University of Wisconsin-Madison
Olvi Mangasarian - The University of Wisconsin-Madison
Jude Shavlik - The University of Wisconsin-Madison
Prior knowledge in the form of multipe polyhedral sets,
each belonging to one of two categories, is introduced
into a reformulation of a nonlinear kernel support vector
machine (SVM) classifier. The resulting formulation leads
to a linear program that can be solved efficiently. This
extends, in a rather unobvious fashion, previous work
that incorporated similar prior knowledge
into a linear SVM classifier. Numerical tests on
standard-type test problems, such as exclusive-or prior
knowledge sets and a checkerboard with 16 points and prior 
knowledge instead of the usual 1000 points, show the
effectiveness of the proposed approach in generating
sharp nonlinear classifiers based mostly or totally
on prior knowledge.
Fast Kernels for Inexact String Matching
Christina Leslie - Columbia University
Rui Kuang - Columbia University
We introduce several new families of string kernels designed in
particular for use with support vector machines (SVMs) for
classification of protein sequence data.  These kernels -- restricted
gappy kernels, substitution kernels, and wildcard kernels -- are based
on feature spaces indexed by k-length subsequences either from the
string alphabet or the alphabet augmented by a wildcard character, and
hence they are related to the recently presented (k,m)-mismatch kernel
and string kernels used in text classification.  However, for all
kernels we define here, the kernel value K(x,y) can be computed in
O(c(K)(|x| + |y|)) time, where the constant c(K) depends on the
parameters of the kernel but is independent of the size of the
alphabet.  Thus the computation of these kernels is linear in the
length of the sequences, like the mismatch kernel, but we improve upon
the alphabet-dependent constant c(K) of the mismatch kernel.  We
compute the kernels efficiently using a recursive function based on a
trie  data structure and relate our new kernels to the recently
described transducer formalism.  Finally, we report protein
classification experiments on a benchmark SCOP dataset, where we
show that our new faster kernels achieve SVM classification
performance comparable to the mismatch kernel and the Fisher kernel
derived from profile hidden Markov models.
On Graph Kernels: Hardness Results and Efficient Alternatives
Thomas Gaertner - Fraunhofer AIS, University of Bonn, and University of Bristol
Peter Flach - University of Bristol
Stefan Wrobel - Fraunhofer AIS and University of Bonn
As most `real-world' data is structured, research in kernel methods has begun
investigating kernels for various kinds of structured data. One of the most
widely used tools for modeling structured data are graphs. An interesting and
important challenge is thus to investigate kernels on instances that are
represented by graphs. So far, only very specific graphs such as trees and
strings have been considered.

This paper investigates kernels on labeled directed graphs with general
structure. It is shown that computing a strictly positive definite graph
kernel is at least as hard as solving the graph isomorphism  problem. It is
also shown that computing an inner product in a feature space indexed by all
possible graphs, where each feature counts the number of subgraphs isomorphic
to that graph, is {\it NP}-hard. On the other hand, inner products in an
alternative feature space, based on walks in the graph, can be computed in
polynomial time. Such kernels are defined in this paper.
Kernels and Regularization on Graphs
Alexander J. Smola - The Australian National University
Risi Kondor - Columbia University
  We introduce a family of kernels on graphs based on the notion of
  regularization operators. This generalizes in a natural way the
  notion of regularization and Greens functions, as commonly used for
  real valued functions, to graphs. It turns out that diffusion
  kernels can be found as a special case of our reasoning. We show
  that the class of positive, monotonically decreasing functions on
  the unit interval leads to kernels and corresponding regularization
  operators. 
Data-Dependent Bounds for Multi-Category Classification\\Based on Convex Losses
Ilya Desyatnikov - Technion
Ron Meir - Technion
Algorithms for solving multi-category classification problems using
output coding have become very popular in recent years. Following
initial attempts with discrete coding matrices, recent work has
attempted to alleviate some of their shortcomings by considering
real-valued `coding' matrices. We consider an approach to
multi-category classification, based on minimizing a convex upper
bound on the $0-1$ loss. We show that this approach is closely
related to output coding, and derive data-dependent bounds on the
performance. These bounds can be optimized in order to obtain
effective coding matrices, which guarantee small generalization
error. Moreover, our results apply directly to kernel based
approaches.
Comparing Clusterings by the Variation of Information
Marina Meila - University of Washington
This paper proposes an information theoretic criterion for comparing
two partitions, or clusterings, of the same data set. The
criterion, called variation of information (VI), measures the amount
of information lost and gained in changing from one clustering  to
another. The criterion makes no assumptions about how the
clusterings were generated and applies to both soft and hard
clusterings. The basic properties of VI are presented and discussed
from the point of view of comparing clusterings. In particular, the VI
is positive, symmetric and obeys the triangle inequality. Thus,
surprisingly enough, it is a true metric on the space of clusterings.
Multiplicative Updates for Large Margin Classifiers
Fei Sha - University of Pennsylvania
Lawrence Saul - University of Pennsylvania
Daniel Lee - University of Pennsylvania
Various problems in nonnegative quadratic programming
arise in the training of large margin classifiers. 
We derive multiplicative updates for these problems that
converge monotonically to the desired solutions for 
hard and soft margin classifiers.  The updates differ strikingly in
form from other multiplicative updates used in machine learning.  
In this paper, we provide complete proofs of
convergence for these updates and
extend previous work to incorporate
sum and box constraints in addition to nonnegativity.
Simplified PAC-Bayesian Margin Bounds
David McAllester - Toyota Technological Institute at Chicago
The theoretical understanding of support vector machines is largely based on
margin bounds for linear classifiers with unit-norm weight vectors and
unit-norm feature vectors.  Unit-norm margin bounds have
been proved previously using fat-shattering arguments and Rademacher
complexity.
Recently Langford and
Shawe-Taylor proved a dimension-independent unit-norm margin bound
using a relatively simple PAC-Bayesian argument.  Unfortunately, the
Langford-Shawe-Taylor bound is stated in a variational form making
direct comparison to fat-shattering bounds difficult. This paper
provides an explicit solution to the variational problem implicit in
the Langford-Shawe-Taylor bound and shows that the PAC-Bayesian margin
bounds are significantly tighter.  Because a PAC-Bayesian bound is derived
from a particular
prior distribution over hypotheses, a PAC-Bayesian margin bound also
seems to provide insight into the nature of the learning bias
underlying the bound.
Sparse Kernel Partial Least Squares Regression
Michinari Momma - Rensselaer Polytechnic Institute
Kristin Bennett - Rensselaer Polytechnic Institute
Partial Least Squares Regression (PLS) and its kernel version (KPLS) have
become competitive regression approaches.  KPLS performs as well as or better
than support vector regression (SVR) for moderately-sized problems with the
advantages of simple implementation, less training cost, and easier tuning of
parameters.   Unlike SVR, KPLS requires manipulation of the full kernel matrix
and the resulting regression function requires the full training data. In this
paper we rigorously derive a sparse KPLS algorithm. The underlying KPLS
algorithm is modified to maintain sparsity in all steps of the algorithm.   
The resulting nu-KPLS algorithm explicitly models centering and bias rather
than using kernel centering.  An epsilon-insensitive loss function is used to
produce sparse solutions in the dual space.  The final regression function for
the nu-KPLS algorithm only requires a relatively small set of support vectors.
Sparse Probabilistic Regression by Label Partitioning
Shantanu Chakrabartty - The Johns Hopkins University
Gert Cauwenberghs - The Johns Hopkins University
Prof Jayadeva - Indian Institute of Technology
A large-margin learning machine for sparse probability regression is
presented.  Unlike support vector machines and other forms of kernel
machines, nonlinear features are obtained by transforming labels into
higher-dimensional label space rather than transforming data vectors
into feature space.  Linear multi-class logistic regression with
partitioned classes of labels yields a nonlinear classifier in the
original labels.  With a linear kernel in data space, storage and
run-time requirements are reduced from the number of support vectors
to the number of partitioned labels.  Using the partitioning property
of KL-divergence in label space, an iterative alignment procedure
produces sparse training coefficients.  Experiments show that label
partitioning is effective in modeling non-linear decision boundaries
with same, and in some cases superior, generalization performance to
Support Vector Machines with significantly reduced memory and run-time
requirements.
Learning with Rigorous Support Vector Machines
Jinbo Bi - Rensselaer Polytechnic Insitute
Vladimir Vapnik - NEC Laboratories America
We examine the so-called rigorous support vector machine (RSVM)
approach proposed by Vapnik (1998). The formulation of RSVM is
derived by explicitly implementing the structural risk
minimization principle with a parameter H used to directly
control the VC dimension of the set of separating hyperplanes. By
optimizing the dual problem, RSVM finds the optimal separating
hyperplane from a set of functions with VC dimension approximate
to H-squared+1. RSVM produces classifiers equivalent to those obtained
by classic SVMs for appropriate parameter choices, but the use of
the parameter H facilitates model selection, thus minimizing VC
bounds on the generalization risk more effectively. In our
empirical studies, good models are achieved for an appropriate
H-squared in [5% L, 30% L] where L is the size of
training data.
Robust Regression by Boosting the Median
Balazs Kegl - University of Montreal
Most boosting regression algorithms use the weighted average of base
regressors as their final regressor. In this paper we analyze the
choice of the weighted median. We propose a general boosting algorithm
based on this approach. We prove boosting-type convergence of the
algorithm and give clear conditions for the convergence of the robust
training error. The algorithm recovers AdaBoost and AdaBoost_rho as
special cases. For boosting confidence-rated predictions, it leads to
a new approach that outputs a different decision and interprets
robustness in a different manner than the approach based on the
weighted average. In the general, non-binary case we suggest practical
strategies based on the analysis of the algorithm and experiments.
Boosting with Diverse Base Classifiers
Sanjoy Dasgupta - University of California at San Diego
Philip Long - Genome Institute of Singapore
We establish a new bound on the generalization error rate of the
Boost-by-Majority algorithm.  The bound holds when the algorithm is
applied to a collection of base classifiers that contains a
"diverse" subset of "good" classifiers, in a precisely defined
sense.  We describe cross-validation experiments that suggest that
Boost-by-Majority can be the basis of a practically useful learning
method, often improving on the generalization of AdaBoost on large
datasets.
Reducing Kernel Matrix Diagonal Dominance using Semi-Definite Programming
Jaz Kandola - Royal Holloway University of London
Thore Graepel - Microsoft Research
John Shawe-Taylor - Royal Holloway University of London
Kernel-based learning methods revolve around the notion of a {\em
kernel} or Gram matrix between data points. These square,
symmetric, positive semi-definite matrices can informally be
regarded as encoding pairwise similarity between all of the
objects in a data-set. In this paper we propose an algorithm for
manipulating the diagonal entries of a kernel matrix using
semi-definite programming. Kernel matrix diagonal dominance
reduction attempts to deal with the problem of learning with
almost orthogonal features, a phenomenon commonplace in kernel
matrices derived from string kernels or Gaussian kernels with
small width parameter. We show how this task can be formulated as
a semi-definite programming optimization problem that can be
solved with readily available optimizers. Theoretically we provide
an analysis using Rademacher based bounds to provide an
alternative motivation for the 1-norm SVM motivated from kernel
diagonal reduction. We assess the performance of the algorithm on
standard data sets with encouraging results in terms of
approximation and prediction.
Optimal Rates of Aggregation
Alexandre Tsybakov - Universit\'{e} Paris 6
We study the problem of aggregation of M arbitrary estimators  
of a regression function with respect to the mean squared risk.
Three main types of aggregation are considered: model selection,
convex and linear aggregation. We define the notion of
optimal rate of aggregation in an abstract context and prove
lower bounds valid for any method of aggregation. We then 
construct procedures that attain these bounds, thus establishing
optimal rates of linear, convex and model selection type aggregation.
Distance-Based Classification with Lipschitz Functions
Ulrike von Luxburg - Max Planck Institute for Biological Cybernetics
Olivier Bousquet - Max Planck Institute for Biological Cybernetics
The goal of this article is to develop a framework for large margin
classification in metric spaces.  We want to find a generalization of
linear decision functions for metric spaces and define a corresponding
notion of margin such that the decision function separates the
training points with a large margin. It will turn out that using
Lipschitz functions as decision functions, the inverse of the Lipschitz
constant can be interpreted as the size of a margin. In order to
construct a clean mathematical setup we isometrically embed the given
metric space into a Banach space and the space of Lipschitz functions
into its dual space.  Our approach leads to a general large margin
algorithm for classification in metric spaces. To analyze this
algorithm, we first prove a representer theorem. It states that there
exists a solution which can be expressed as linear combination of
distances to sets of training points. Then we analyze the Rademacher
complexity of some Lipschitz function classes. The generality of the
Lipschitz approach can be seen from the fact that several well-known
algorithms are special cases of the Lipschitz algorithm, among them
the support vector machine, the linear programming machine, and
the 1-nearest neighbor classifier.
Random Subclass Bounds
Shahar Mendelson - The Australian National University
Petra Philips - The Australian National University
It has been recently shown that sharp generalization bounds can be obtained
when the function class from which the algorithm chooses its hypotheses is
"small" in the sense that the  Rademacher averages of this function class are
small (Mendelson,2002). Seemingly based on different arguments, generalization
bounds were obtained in the compression scheme (Littlestone and Warmuth,1986),
luckiness (Shawe-Taylor et al, 1998), and algorithmic luckiness (Herbrich and
Williamson, 2002) frameworks in which the "size" of the function class is not
specified a priori.

We show that the bounds obtained in all these frameworks follow from 
the same general principle, namely that coordinate projections of this
function subclass evaluated on random samples are "small" with high
probability.
PAC-MDL Bounds
Avrim Blum - Carnegie Mellon University
John Langford - IBM TJ Watson Research Center
We point out that a number of standard sample complexity bounds
(VC-dimension, PAC-Bayes, and others) are all related to the number of
bits required to communicate the labels given the unlabeled data for a
natural communication game.  Motivated by this observation, we give a
general sample complexity bound based on this game that allows us to
unify these different bounds in one common framework.
Universal Well-Calibrated Algorithm for On-line Classification
Vladimir Vovk - Royal Holloway University of London
We study the problem of on-line classification in which the prediction
algorithm is given a "confidence level" 1-delta and is required to output as
its prediction a range of labels (intuitively, those labels deemed compatible
with the available data
at the level delta) rather than just one label; as usual, the examples are
assumed to be generated independently from the same probability distribution
P. The prediction algorithm is said to be "well-calibrated" for P and delta if
the long-run relative frequency of errors does not exceed delta almost surely
w.r. to P. For well-calibrated algorithms we take the number of "uncertain"
predictions (i.e., those containing more than one label) as the principal
measure of predictive performance. The main result of this paper is the
construction of a prediction algorithm which, for any (unknown) P and any
delta: (a) makes errors independently and with probability delta at every
trial (in particular, is well-calibrated for P and delta); (b) makes in the
long run no more uncertain predictions than any other prediction algorithm
that is well-calibrated for P and delta; (c) processes example n in time O(log
n).
Learning Probabilistic Linear-Threshold Classifiers via\\Selective Sampling
Nicol\`{o} Cesa-Bianchi - Universit\`{a} degli Studi di Milano
Alex Conconi - Universit\`{a} degli Studi di Milano
Claudio Gentile - Universit\`{a} dell'Insubria
In this paper we investigate selective sampling, a learning model 
where the learner observes a sequence of i.i.d.\ unlabeled instances
each time deciding whether to query the label of the current 
instance. We assume that labels are binary and stochastically related
to instances via a linear probabilistic function whose coefficients 
are arbitrary and unknown. We then introduce a new selective sampling
rule and show that its expected regret (with respect to the 
classifier knowing the underlying linear function and observing the 
label realization after each prediction) grows not much faster than 
the number of sampled labels. Furthermore, under additional 
assumptions on the true margin distribution, we prove that the number
of sampled labels grows only logarithmically in the number of 
observed instances. Experiments carried out on a text categorization 
problem show that: (1) our selective sampling algorithm performs 
better than the Perceptron algorithm even when the latter is given 
the true label after each classification; (2) when allowed to 
observe the true label after each classification, the performance 
of our algorithm remains the same. Finally, we note that by 
expressing our selective sampling rule in dual variables we can 
learn nonlinear probabilistic functions via the kernel machinery.
Learning Algorithms for Enclosing Points in Bregmanian Spheres
Koby Crammer - The Hebrew University of Jerusalem
Yoram Singer - The Hebrew University of Jerusalem
We discuss the problem of finding a generalized sphere
that encloses points originating from a single source. 
The points contained in such a sphere are within a
maximal divergence from a center point. The divergences
we study are known as the Bregman divergences which include
as a special case both the Euclidean distance and the relative
entropy. We cast the learning task as an optimization problem
and show that it results in a simple dual form which bears
interesting algebraic properties. We then discuss a general
algorithmic framework to solve the optimization problem.
Our training algorithm employs an auxiliary function that
bounds the dual's objective function and can be used
with a broad class of Bregman functions. As a specific
application of the algorithm we give a detailed derivation
for the relative entropy. We analyze the generalization
ability of the algorithm by adopting margin-style proof
techniques. We also describe and analyze two schemes of
online algorithms for the case when the radius of the
sphere is set in advance.
Internal Regret in On-line Portfolio Selection
Gilles Stoltz - Universit\'{e} Paris-Sud
Gabor Lugosi - University Pompeu Fabra
This paper extends the game-theoretic notion of internal regret to the case of
on-line potfolio selection problems. New sequential investment strategies are
designed to minimize the cumulative internal regret for all possible market
behaviors. Some of the
introduced strategies, apart from achieving a small internal regret, achieve
an accumulated wealth almost as large as that of the best constantly
rebalanced portfolio. It is argued that the low-internal-regret property is
related to stability and experiments on real stock exchange data demonstrate
that the new strategies achieve usually better returns compared to some known
algorithms.
Lower Bounds on the Sample Complexity of Exploration in the\\Multi-Armed Bandit Problem
Shie Mannor - Massachusetts Institute of Technology
John Tsitsiklis - Massachusetts Institute of Technology
We consider the Multi-armed bandit problem under the PAC
("probably approximately correct") model. It was shown by Even-Dar
et al. that given n arms, it suffices to play the arms a total of
times which is linear in n, inversely quadratic in epsilon, and
linear in the logarithm of one over delta, to find an
epsilon-optimal arm with probability of at least 1-delta. 
Our contribution is a matching lower bound that holds for any sampling policy.
We also generalize the lower bound to a Bayesian setting, and to the case
where the statistics of the arms are known but the identities of the arms are
not.
Smooth $\epsilon$-Insensitive Regression by Loss Symmetrization
Ofer Dekel - The Hebrew University of Jerusalem
Shai Shalev-Shwartz - The Hebrew University of Jerusalem
Yoram Singer - The Hebrew University of Jerusalem
We describe a framework for solving regression problems by
reduction to classification. Our reduction is based on
symmetrization of margin-based loss functions commonly used
in boosting algorithms, namely, the logistic loss and the
exponential loss. Our construction yields a smooth version
of the epsilon-insensitive hinge loss that is used in support vector
regression. A byproduct of this construction is a new simple
form of regularization for boosting-based classification and
regression algorithms.
We present two parametric families of batch learning
algorithms for minimizing these losses. The first family employs a
log-additive
update and is based on recent boosting algorithms while the second
family uses a new form of additive update. We also describe
and analyze online gradient descent (GD) and exponentiated gradient
(EG) algorithms for the epsilon-insensitive logistic loss.
Our regression framework also has implications on classification
algorithms, namely, a new additive batch algorithm for the log-loss
and exp-loss used in boosting.
On Finding Large Conjunctive Clusters
Nina Mishra - HP Labs/Stanford University
Dana Ron - Tel-Aviv University
Ram Swaminathan - HP Labs
We propose a new formulation of the clustering problem that differs
from previous work in several aspects.  First, the goal is to
explicitly output a collection of simple and meaningful conjunctive 
descriptions of the clusters. Second, the clusters
might overlap, i.e., a point can belong to multiple clusters. Third,
the clusters might not cover all points, i.e., not every point is
clustered.  Finally, we allow a point to be assigned to a conjunctive
cluster description even if it does not completely satisfy all of the
attributes, but rather only satisfies most.

A convenient way to view our clustering problem is that of finding a
collection of large bicliques in a bipartite graph.  Identifying one
largest conjunctive cluster is equivalent to finding a maximum edge
biclique.  Since this problem is NP-hard [Peeters00] and there is
evidence that it is difficult to approximate [Feige02], we solve
a relaxed version where the objective is to find a large subgraph that
is close to being a biclique.  We give a randomized algorithm that finds a
relaxed biclique with almost as many edges as the maximum biclique.
We then extend this algorithm to identify a good collection of large
relaxed bicliques.  A key property of these algorithms is that their
running time is independent of the number of data points and
linear in the number of attributes.
Learning Arithmetic Circuits via Partial Derivatives
Adam Klivans - Harvard University
Amir Shpilka - Harvard University and Massachusetts Institute of Technology
We present a polynomial time algorithm for learning several
models of algebraic computation.  We show that any arithmetic
circuit whose partial derivatives induce a ``low''-dimensional vector
space is exactly learnable from membership and equivalence queries.
As a consequence we obtain the first polynomial time algorithm for
learning depth three multilinear arithmetic circuits.  
In addition, we give the first polynomial time algorithms for learning
restricted algebraic branching programs and noncommutative 
arithmetic formulae. Previously only restricted versions of
depth-2 arithmetic circuits were known to be learnable in polynomial
time. Our learning algorithms can be viewed as solving a
generalization of the well known it polynomial interpolation
problem where the unknown polynomial has a succinct representation.
We can learn representations of polynomials encoding exponentially many
monomials.  Our techniques combine a careful algebraic analysis of arithmetic
circuits' partial derivatives with the ``multiplicity automata" techniques due
to Beimel et al \cite{BBB00}.
Using a Linear Fit to Determine Monotonicity Directions
Malik Magdon-Ismail - RPI
Joseph Sill - Plumtree Software
Let $f$ be a function on $R^d$ that is monotonic
in every variable. 
There are $2^d$ possible assignments to the 
directions of monotonicity (two per variable). 
We provide sufficient conditions under which the 
optimal
linear model obtained from a least squares regression on $f$
will identify the monotonicity directions correctly.
We show that 
when the input dimensions are independent,
the linear fit correctly
identifies the monotonicity directions.
We provide an example to illustrate that in the general
case, when the input dimensions are not 
independent, the linear fit may not
identify the directions correctly. However, 
when the inputs are  
jointly Gaussian, as is often assumed
in practice, the linear fit will correctly
identify the monotonicity directions,
even if the input dimensions are dependent.
Gaussian densities are a special case of a more general class
of densities (Mahalanobis densities) for which the result
holds. 
Our results hold when $f$ is a classification or regression function.


If a finite data set is sampled from the function, we show that if the exact
linear
regression would have yielded the correct monotonicity directions, then
the sample regression will also do so 
asymptotically (in a 
probabilistic sense). This result holds even 
if the data are noisy.
Generalization Bounds for Voting Classifiers Based on\\Sparsity and Clustering
Vladimir Koltchinskii - The University of New Mexico
Dmitry Panchenko - Massachusetts Institute of Technology
Savina Andonova - Boston University
We prove new margin type bounds on the generalization error of
voting classifiers that take into account the sparsity of weights
and certain measures of clustering of weak classifiers in the
convex combination. We also present experimental results to
illustrate the behavior of the parameters of interest for several
data sets.
Sequence Prediction Based on Monotone Complexity
Marcus Hutter - IDSIA
This paper studies sequence prediction based on the monotone
Kolmogorov complexity Km=-log m, i.e. based on universal
deterministic/one-part MDL. m is extremely close to Solomonoff's
prior M, the latter being an excellent predictor in deterministic
as well as probabilistic environments, where performance is
measured in terms of convergence of posteriors or losses. Despite
this closeness to M, it is difficult to assess the prediction
quality of m, since little is known about the closeness of their
posteriors, which are the important quantities for prediction. We
show that for deterministic computable environments, the
"posterior" and losses of m converge, but rapid convergence could
only be shown on-sequence; the off-sequence behavior is unclear.
In probabilistic environments, neither the posterior nor the
losses converge, in general.
How Many Strings Are Easy to Predict?
Yuri Kalnishkan - Royal Holloway University of London
Vladimir Vovk - Royal Holloway University of London
Michael V. Vyugin - Royal Holloway University of London
It is well known in the theory of Kolmogorov complexity that most strings
cannot be compressed; more precisely, only exponentially few
($\Theta(2^{n-m})$) strings of length $n$ can be compressed by $m$ bits.  This
paper extends the `incompressibility' property of Kolmogorov complexity to the
`unpredictability' property of predictive complexity.  The `unpredictability'
property states that predictive complexity (defined as the loss suffered by a
universal prediction algorithm working infinitely long) of most strings is
close to a trivial upper bound (the loss suffered by a trivial minimax
constant prediction strategy).  We show that only exponentially few strings
can be successfully predicted and find the base of the exponent.
Polynomial Certificates for Propositional Classes
Marta Arias - Tufts University
Roni Khardon - Tufts University
Rocco Servedio - Columbia University
This paper studies the query complexity of learning
classes of expressions in propositional logic
from equivalence and membership queries.
We give new constructions of
polynomial size certificates of non-membership
for monotone, unate and Horn CNF functions.
Our constructions yield quantitatively different bounds from 
previous constructions of certificates for these classes.
We prove lower bounds on certificate size 
which show that for some parameter settings the certificates we 
construct for these classes are exactly optimal.
Finally, we also prove that a natural generalization of these classes, 
the class of renamable Horn CNF functions,
does {\em not} have polynomial size certificates of non-membership, thus
answering an open question of Feigelson.
On-line Learning with Imperfect Monitoring
Shie Mannor - Massachusetts Institute of Technology
Nahum Shimkin - Technion
We study on-line play of repeated matrix games in which the
observations of past actions of the other player and the obtained
reward are partial and stochastic. We define the Partial
Observation Bayes Envelope (POBE) as the best reward against the
worst-case stationary strategy of the opponent that agrees with
past observations. Our goal is to have the (unobserved) average
reward above the POBE. For the case where the observations (but
not necessarily the rewards) depend on the opponent play alone, an algorithm
for attaining the POBE is derived. This algorithm is based on an application
of approachability theory combined with a worst-case view over the unobserved
rewards. We also suggest a simplified solution concept for general signaling
structure. This concept may fall short of the POBE.
Exploiting Task Relatedness for Multiple Task Learning
Shai Ben-David - Cornell University and Technion
Reba Schuller - Cornell University
The approach of learning of multiple "related" tasks
simultaneously has proven quite successful in practice; however,
theoretical justification for this success has remained elusive.
The starting point for previous work on multiple task learning has
been that the tasks to be learned jointly are somehow
"algorithmically related", in the sense that the results of
applying a specific learning algorithm to these tasks are assumed
to be similar. We offer an alternative approach, defining relatedness of tasks
on the basis of similarity between the example generating distributions
that underline these task.

We provide a formal framework for this notion of task relatedness, which
captures
a sub-domain of the wide scope of issues in which one may apply a
multiple task learning approach. Our notion of
task similarity is relevant to a variety of real life multitask learning
scenarios and allows the formal derivation of
generalization bounds that are strictly stronger than the
previously known bounds for both the learning-to-learn and the
multitask learning scenarios. We give precise conditions under
which our bounds guarantee generalization on the basis of smaller
sample sizes than the standard single-task approach.
Approximate Equivalence of Markov Decision Processes
Eyal Even-Dar - Tel Aviv University
Yishay Mansour - Tel Aviv University
We consider the problem of finding the minimal $\eps$-equivalent
MDP for an MDP given in its tabular form. We show that the problem
is NP-Hard and then give a bicriteria approximation algorithm to
the problem. We suggest that the right measure for finding minimal
$\eps$-equivalent model is $L_1$ rather than $L_\infty$ by giving
both an example, which demonstrates the drawback of using
$L_\infty$, and performance guarantees for using $L_1$. In
addition, we give a polynomial algorithm that decides whether two
MDPs are equivalent.
An Information Theoretic Tradeoff between Complexity and Accuracy
Ran Gilad-Bachrach - The Hebrew University of Jerusalem
Amir Navot - The Hebrew University of Jerusalem
Naftali Tishby - The Hebrew University of Jerusalem
A fundamental question in learning theory is the quantification of the basic
tradeoff between the complexity of a model and its predictive accuracy. One
valid way of quantifying this tradeoff, known as the Information Bottleneck,
is to measure both the complexity of the model and its prediction accuracy by
using Shannon's mutual information. In this paper we show that the Information
Bottleneck framework answers a well defined and known
coding problem and at same time it provides a general relationship between
complexity and prediction accuarcy, measured by mutual information. We study
the nature of this 
complexity-accuracy tradeoff and discuss some of its theoretical properties.
Furthermore, we present relations to classical information theoretic problems,
such as rate-distortion theory, cost-capacity tradeoff and source coding with
side information.

Learning Random log-depth Decision Trees under the Uniform\\Distribution
Jeffrey Jackson - Duquesne University
Rocco Servedio - Columbia University
We consider three natural models of random log-depth decision trees.  
We give an efficient algorithm that for each of these models 
learns---as a decision tree---all but an inverse polynomial fraction 
of such trees using only uniformly distributed random examples.
Projective DNF Formulae and Their Revision
Robert Sloan - University of Illinois at Chicago
Balazs Szorenyi - University of Szeged
Gyorgy Turan - University of Illinois at Chicago
Projection learning, introduced by Valiant, aims to learn a target
concept by learning some of the concept's projections to a class of
smaller domains, and by combining these projections.  A natural special
case of this framework is when the projection domains are subcubes of
a fixed dimension, and the restrictions of the target to these domains
are conjunctions.

The first objective of this work is to study the class of functions
learned in this case, called projective DNFs.  We give some basic
properties of projective DNFs by comparing them to standard DNF
subclasses such as k-DNF and k-term DNF, and we present lower and
upper bounds for the exclusion dimension of projective DNF. For the
subclass of 1-projective DNF we prove a characterization theorem
giving an explicit description of all such expressions.

Our second objective is to extend Valiant's result on
attribute-efficient learning of projective DNFs to the related case of
efficient revision.  We show that Littlestone's Winnow, the
paradigmatic attribute-efficient disjunction learning algorithm, is
also an efficient algorithm for revising disjunctions, if the initial
weights are suitably modified, even with attribute errors.  This
implies that Valiant's result can be extended to the efficient
revision of projective DNF.
Learning with Equivalence Constraints, and the Relation to\\Multiclass Learning
Aharon Bar Hillel - The Hebrew University of Jerusalem
Daphna Weinshall - The Hebrew University of Jerusalem
We study the problem of learning partitions using equivalence
constraints as input. This is a binary classification problem in the product
space of pairs of datapoints. The training data includes pairs of datapoints
which are labeled as coming from the same class or not. This kind of data
appears naturally in applications where explicit labeling of datapoints is
hard to get,
but relations between datapoints can be more easily obtained, using, for
example, Markovian dependency (as in video clips).
Our problem is an unlabeled partition problem, and is therefore
tightly related to multiclass classification. We show that the
solutions of the two problems are related, in the sense that a good
solution to the binary classification problem entails the existence of a good
solution to the multiclass problem, and vice versa. We also
show that bounds on the sample complexity of the two problems are
similar, by showing that their relevant 'dimensions' (VC dimension for the
binary problem, Natarajan dimension for the multiclass problem) bound each
other.  Finally, we show the feasibility of solving multiclass learning
efficiently by using a solution of the equivalent binary classification
problem.  In this way advanced techniques developed for binary classification,
such as SVM and boosting, can be used directly to enhance multiclass learning.

Tutorial: Machine Learning Methods in Natural Language Processing
Michael Collins - Massachusetts Institute of Technology
Statistical or machine learning approaches have become quite prominent
in the Natural Language Processing literature. Common techniques
include generative models such as Hidden Markov Models or
Probabilistic Context-Free Grammars, and more general noisy-channel
models such as the statistical approach to machine translation
pioneered by researchers at IBM in the early 90s. Recent work has
considered discriminative methods such as (conditional) markov random
fields, or large-margin methods. This tutorial will describe several
of these techniques. The methods will be motivated through a number of
natural language problems: from part-of-speech tagging and parsing, to
machine translation, dialogue systems and information extraction
problems. I will also concentrate on links to the COLT and kernel
methods literature: for example covering kernels over the discrete
structures found in NLP, online algorithms for NLP problems, and the
issues in extending generalization bounds from classification problems
to NLP problems such as parsing.
Invited talk: Learning from Uncertain Data
Mehryar Mohri - AT\&T Labs
The application of statistical methods to natural language processing
has been remarkably successful over the past two decades. But, to deal
with recent problems arising in this field, machine learning
techniques must be generalized to deal with uncertain data, or
datasets whose elements are distributions over sequences, such as
weighted automata. This paper reviews a number of recent results
related to this question. We discuss how to compute efficiently basic
statistics from a weighted automaton such as the expected count of an
arbitrary sequence and higher moments of that distribution, by using
weighted transducers. Both the corresponding transducers and related
algorithms are described. We show how general classification
techniques such as Support Vector Machines can be extended to deal
with distributions by using general kernels between weighted
automata. We describe several examples of positive definite kernels
between weighted automata such as kernels based on counts of common
$n$-gram sequences, counts of common factors or suffixes, or other
more complex kernels, and describe a general algorithm for computing
them efficiently. We also demonstrate how machine learning techniques
such as clustering based on the edit-distance can be extended to deal
with unweighted and weighted automata representing distributions.
Invited talk: Learning and Parsing Stochastic Unification-Based Grammars
Mark Johnson - Brown University
Stochastic Unification-Based Grammars combine knowledge-rich and
data-rich approaches to natural language processing.  This provides a
rich structure to the learning and parsing (decoding) tasks that can
be described with undirected graphical models.  While most work to
date has treated parsing as a straight-forward multi-class
classification problem, we are beginning to see how this structure can
be exploited in learning and parsing.  Exploiting this structure is
likely to become more important as the research focus moves from
parsing to more realistic tasks such as machine translation and
summarization.
Generality's Price: Inescapable Deficiencies in Machine-Learned\\Programs
John Case - University of Delaware
Keh-Jiann Chen - Academia Sinica
Sanjay Jain - National University of Singapore
Wolfgang Merkle - Universit\"{a}t Heidelberg
James Royer - Syracuse University
This paper investigates some delicate tradeoffs between the
generality of an algorithmic learning device and the
quality of the programs it learns successfully.  There are
results to the effect that, thanks to small increases in generality
of a learning device, the computational complexity of some
successfully learned programs is provably unalterably 
suboptimal.  There are also results in which the complexity of
successfully learned programs is asymptotically optimal and
the learning device is general, but, still thanks to the generality,
some of those optimal, learned programs are provably unalterably
information deficient - in some cases, deficient as to
safe, algorithmic extractability/provability of the fact that they
are even approximately optimal.  The paper is on the borderline 
between learning theory, complexity theory and logic.
On Learning To Coordinate: Random Bits Help, Insightful Normal\\Forms, and Competency Isomorphisms
John Case - University of Delaware
Sanjay Jain - National University of Singapore
Franco Montagna - University of Siena
Giulia Simi - University of Siena
Andrea Sorbi - University of Siena
A mere bounded number of random bits judiciously employed by
a probabilistically correct algorithmic coordinator is shown
to increase the power of learning to coordinate compared to
deterministic algorithmic coordinators.  Furthermore, these
probabilistic algorithmic coordinators are provably not
characterized in power by teams of deterministic ones.

An insightful, enumeration technique based, normal form
characterization of the classes that are learnable by
total computable coordinators is given.  These normal forms are
for insight only since it is shown that the complexity of the
normal form of a total computable coordinator can be infeasible
compared to the original coordinator.

Montagna and Osherson showed that the competence class of
a total coordinator cannot be strictly improved by another total
coordinator.  It is shown in the present paper that the
competencies of any two total coordinators are the same
modulo isomorphism.  Furthermore, a completely effective,
index set version of this competency isomorphism result is
given, where all the coordinators are total computable.
Learning All Subfunctions of a Function
Sanjay Jain - National University of Singapore
Efim Kinber - Sacred Heart University
Rolf Wiehagen - University of Kaiserslautern
Sublearning, a model for learning of subconcepts of a concept, is
presented.
Sublearning a class of total recursive functions informally means to learn
all functions from that class together with all of their 
subfunctions. While in language learning it is known to be
impossible to learn any infinite language together with all of its
sublanguages, the situation changes for sublearning of functions.

Several types of sublearning are defined and compared to each other as
well as to other learning types. For example, in some cases, sublearning
coincides with robust learning. Furthermore, whereas in usual
function learning there are classes that cannot be learned consistently,
all sublearnable classes of some natural types can be
learned consistently.

Moreover, the power of sublearning is characterized in several terms,
thereby establishing a close connection to measurable classes and
variants of this notion. As a consequence, there are rich classes which do
not need any self-referential coding for sublearning them.
When Is Small Beautiful?
Amiran Ambroladze - Lund University
John Shawe-Taylor - Royal Holloway University of London
-
Learning a Function of $r$ Relevant Variables
Avrim Blum - Carnegie Mellon University
-
Subspace Detection: A Robust Statistics Formulation
Sanjoy Dasgupta - University of California at San Diego
-
How Fast Is $k$-Means?
Sanjoy Dasgupta - University of California at San Diego
-
Universal Coding of Zipf Distributions
Yoav Freund - Mitsubishi Electric Research Labs
Alon Orlitsky - University of California at San Diego
Prasad Santhanam - University of California at San Diego
Junan Zhang - University of California at San Diego
-
An Open Problem Regarding the Convergence of Universal A\\Priori Probability
Marcus Hutter - IDSIA
-
Entropy Bounds for Restricted Convex Hulls
Vladimir Koltchinskii - University of New Mexico
-
Compressing to VC Dimension Many Points
Manfred K. Warmuth - University of California at Santa Cruz
-
CyberChairPRO Copyright © by Richard van de Stadt  (Borbala Online Conference Services)
Hosted by Baskin School of Engineering, University of California, Santa Cruz)