- ★ Efficient Active Learning (Beygelzimer, Hsu, Karampatziakis, Langford and Zhang)
- ★ From Bandits to Experts: On the Value of More Information (Mannor and Shamir)
- ☆ Stumping along a summary (Salperwyck and Urvoy)
- Deviations of stochastic bandit regret (Audibert and Salomon)
- An Empirical Evaluation of Thompson Sampling (Chapelle and Li)
- Efficient Optimal Learning for Contextual Bandits (Dudik, Hsu, Kale, Karampatziakis, Langford, Reyzin and Zhang)
- An Unbiased Offline Evaluation of Contextual Bandit Algorithms based on Generalized Linear Models (Li, Chu, Langford, Wang)
- Exploration and Exploitation in an Artificial Experimenter (Lovell, Zauner and Gunn)
- Online Clustering with Experts (Choromanska and Monteleoni)
- PAC-Bayesian Analysis of the Exploration-Exploitation Trade-off (Seldin, Cesa-Bianchi, Laviolette, Auer, Shawe-taylor and Peters)
See the original call for papers for more information on the workshop. ★ indicates a paper selected for a contributed talk and ☆ indicates a challenge submission selected for a contributed talk. This list is also available on Mendeley where you will find the BibTeX references. Click on the titles below to download the corresponding PDFs.
★ Efficient Active Learning
Beygelzimer, Alina and Hsu, Daniel and Karampatziakis, Nikos and Langford, John and Zhang, Tong
We present and analyze an active learning algorithm that is theoretically sound in an agnostic setting, empirically effective, and as efficient as standard online learning algorithms. This allows us to soundly and effectively optimize the explore/exploit trade-off in active learning at a scale of 10^6 examples/second. The present work is primarily based on (Beygelzimer et al., 2010) and (Karampatziakis & Langford, 2011).
★ From Bandits to Experts: On the Value of More Information
Mannor, Shie and Shamir, Ohad
Learning from Experts and Multi-armed Bandits are two of the most common settings studied in online learning. Whereas the first setting assumes that the performance of all k actions are revealed at the end of each round, the bandit setting assumes that only the performance of the chosen action is revealed, with corresponding √k-degradation in the provable regret guarantee. In this paper, we study a natural setting which interpolates between the experts and the bandits setting, where choosing an action also reveals some side-information on the performance of some of the other actions. We develop practical algorithms with provable regret guarantees, as well as partially-matching lower bounds. The regret depends on non- trivial graph theoretic properties of the information feedback structure, and has an interesting trade-off between regret optimality and computational efficiency. We end by discussing some of the many open questions that remain.
☆ Stumping along a summary
Salperwyck, Christophe and Urvoy, Tanguy
The methods we used to compete in the « Exploration & Exploitation » challenge are based on three layers. The first layer provides an online summary of the data stream for continuous and nominal data. Continuous data are handled using the Greenwald and Khanna online quantile summary which provides error guarantees for a fixed memory size. Nominal data are summarized with a hash-based counting structure. With these techniques we managed to build an accurate stream summary with a small memory footprint. The second layer uses the summary to build predictors. We explored several kinds of trees from simple decision stumps to deep multivariate ones. The stumps proved to be remarkably stable and efficient. But on the other hand, a progressive unfolding of the trees seemed to improve the model on the long run. For the last layer, we explored several combination strategies: online bagging, exponential weighting, linear ranker, etc. We observed a tradeoff between the expressiveness of the predictors and the power of the combination strategy but most strategies being difficult to tune, we went back to a simple averaging. It seems, from our experiments, that both the need for exploration and the click scarcity sharpens the need for very stable models.
Deviations of stochastic bandit regret
Audibert, Jean-Yves and Salomon, Antoine
This paper studies the deviations of the regret in a stochastic multi-armed bandit problem. When the total number of plays n is known beforehand by the agent, Audibert et al. (2009) exhibit a policy such that with probability at least 1−1/n, the regret of the policy is of order log n. They have also shown that such a property is not shared by the popular ucb1 policy of Auer et al. (2002). This work first answers an open question: it extends this negative result to any anytime policy. The second contribution of this paper is to design anytime robust policies for specific multi-armed bandit problems in which some restrictions are put on the set of possible distributions of the different arms.
An Empirical Evaluation of Thompson Sampling
Chapelle, Olivier and Li, Lihong
Thompson sampling is one of oldest heuristic to address the exploration / exploitation trade-off, but it is surprisingly not very popular in the literature. We present here some empirical results using Thompson sampling and show that it is highly competitive. And since this heuristic is very easy to implement, we argue that it should be part of the standard baselines to compare against.
Efficient Optimal Learning for Contextual Bandits
Dudik, Miroslav and Hsu, Daniel and Kale, Satyen and Karampatziakis, Nikos and Langford, John and Reyzin, Lev and Zhang, Tong
We address the problem of learning in an online setting where the learner repeatedly observes features x, selects among K actions, and receives reward r for the action taken. We provide the first efficient algorithm with an optimal regret. Our algorithm uses an oracle which returns an optimal policy given rewards for all actions for each x. The algorithm has running time polylog(N), where N is the number of policies that we compete with. This is exponentially faster than all previous algorithms that achieve optimal regret in this setting. Our formulation also enables us to create an algorithm with regret that is additive rather than multiplicative in feedback delay as in all previous work.
An Unbiased Offline Evaluation of Contextual Bandit Algorithms based on Generalized Linear Models
Li, Lihong and Chu, Wei and Langford, John and Wang, Xuanhui
Contextual bandit algorithms have become popular for online recommendation systems such as Digg, Yahoo! Buzz, and news recommendation in general. Offline evaluation of the effectiveness of new algorithms in these applications is critical for protecting online user experiences but very challenging due to their “partial-label” nature. Common practice is to create a simulator which simulates the online environment for the problem at hand and then run an algorithm against this simulator. However, creating simulator itself is often difficult and modeling bias is usually unavoidably introduced. In this paper, we review a replay method for contextual bandit algorithms. Different from simulatorbased approaches, our method is completely data-driven and is easy to adapt to different applications. More importantly, our method can provide provably unbiased evaluations, as demonstrated by real data from Yahoo! Front Page. With this method, a number of popular algorithms are compared, suggesting strong and comparable performance of these algorithms.
Exploration and Exploitation in an Artificial Experimenter
Lovell, Chris and Zauner, Klaus-Peter and Gunn, Steve
An artificial experimenter is a computational implementation of the decision making processes a laboratory experimenter will make. Artificial experimenter’s analyse the available data, propose hypotheses to represent the behaviours investigated and design experiments to evaluate or improve those hypotheses. In doing so they perform active discovery. A key problem faced is deciding when to perform experiments that exploit the information held within the current hypotheses to evaluate them and when to perform experiments that explore the parameter space to discover features of the behaviour being investigated not yet identified. As resources in physical experimentation are extremely limited, addressing this trade-off is critical to obtaining a representative model of the system under investigation. To achieve this, a Bayesian notion of surprise has been used to effectively manage the transition between exploration and exploitation in simulated and physical experimental trials.
Online Clustering with Experts
Choromanska, Anna and Monteleoni, Claire
We propose an online clustering algorithm that manages the exploration/exploitation tradeoff using an adaptive weighting over batch clustering algorithms. We extend algorithms for online supervised learning, with access to expert predictors, to the unsupervised learning setting. Instead of computing prediction errors in order to re-weight the experts, the algorithm computes an approximation to the current value of the k-means ob- jective obtained by each expert. When the experts are batch clustering algorithms with b-approximation guarantees with respect to the k-means objective (for example, the k-means++ or k-means\# algorithms), applied to a sliding window of the data stream, our algorithm achieves an approximation guarantee with respect to the k-means objective. The form of this online clustering approximation guarantee is novel, and extends an evaluation framework proposed by Dasgupta as an analog to regret. Our algorithm tracks the best clustering algorithm on real and simulated data sets.
PAC-Bayesian Analysis of the Exploration-Exploitation Trade-off
Seldin, Yevgeny and Cesa-Bianchi, Nicolo and Laviolette, François and Auer, Peter and Shawe-taylor, John and Peters, Jan
We develop a coherent framework for integrative simultaneous analysis of the exploration-exploitation and model order selection trade-offs. We improve over our preceding results on the same subject (Seldin et al., 2011) by combining PAC-Bayesian analysis with Bernstein-type inequality for martingales. Such a combination is also of independent interest for studies of multiple simultaneously evolving martingales.