Source author record

Yishay Mansour

Yishay Mansour appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher, unclaimed source record

Machine Learning, Computer Science and Game Theory, Data Structures and Algorithms, Artificial Intelligence, Cryptography and Security, Formal Languages and Automata Theory, Information Theory, math.OC, Networking and Internet Architecture, Social and Information Networks

Catalog footprint

What is connected

60 works

10 topics

4 close collaborators

preprint, 2026, arXiv

Collaborating in Multi-Armed Bandits with Strategic Agents

We study collaborative learning in multi-agent Bayesian bandit problems, where strategic agents collectively solve the same bandit instance. While multiple agents can accelerate learning by sharing information, strategic agents might prefer to free-ride and avoid exploration. We consider a setting with persistent agents that participate in multiple time periods. This is in contrast to most previous works on incentives in multi-agent MAB, which assume short-lived agents, namely each agent has a single decision to make and optimizes their expected reward in that single decision. As in the multi-agent MAB model with incentives, our model does not have monetary transfers, and the only incentives are through information sharing. We propose \texttt{CAOS}, a mechanism that sustains collaboration as a Nash equilibrium while achieving strong regret guarantees. Our results demonstrate that collaborative exploration can be sustained purely through information sharing, achieving performance close to that of fully cooperative systems despite strategic behavior.

preprint, 2026, arXiv

Cost-Aware Learning

We consider the problem of Cost-Aware Learning, where sampling different component functions of a finite-sum objective incurs different costs. The objective is to reach a target error while minimizing the total cost. First, we propose the Cost-Aware Stochastic Gradient Descent algorithm for convex functions, and derive its cost complexity to attain an error of $ε$. Furthermore, we establish a lower bound for this setting and provide a subset selection algorithm to further reduce the cost of training. We apply our theoretical insights to reinforcement learning with language models, where the computational cost of policy gradients varies with sequence length. To this end, we introduce Cost-Aware GRPO, an algorithm designed to reduce the cost of policy optimization while preserving performance. Empirical results on 1.5B and 8B LLMs demonstrate that our approach reduces the tokens used in policy optimization by up to about 30% while matching or exceeding baseline accuracy.

preprint, 2026, arXiv

Online Set Learning from Precision and Recall Feedback

We consider the problem of learning an unknown subset $N_\text{target}$ of a domain in an online setting. In each round $t$, the learner predicts a set of items ${N}_t$ and receives one of two types of feedback, each with equal probability: precision feedback, in which a randomly chosen item from the predicted set $N_t$ is revealed and the learner is told whether it belongs to $N_\text{target}$ (incurring a reward if it does), or recall feedback, in which a randomly chosen item from the target set $N_\text{target}$ is revealed and the learner is told whether it belongs to $N_t$ (incurring a reward if it does). The goal is to maximize the cumulative reward over time. This simple online set learning problem abstracts a variety of learning scenarios with precision- and recall-type feedback. We show that a hypothesis class (a family of subsets of the domain) is learnable in this setting if and only if it has finite Vapnik-Chervonenkis (VC) dimension, mirroring the classical PAC characterization. However, the resulting algorithmic structure is markedly more intricate: in contrast to standard Probably Approximately Correct (PAC) learning -- where the algorithmic landscape is governed by the simple principle of Empirical Risk Minimization (ERM) -- our partial feedback model can invalidate ERM and even all proper learning rules. We develop algorithms to address the dependencies induced by the feedback, obtaining regret guarantees in both the realizable and agnostic settings. Our results provide a qualitative characterization of learnability in this model, addressing its most basic question, while pointing to a range of natural and intriguing open questions, including the determination of optimal regret rates.
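The feedback protocol in this abstract can be illustrated with a small simulation. This is only a sketch of the interaction, not the paper's learning algorithm: the function names are hypothetical, and it assumes that "randomly chosen" means uniformly at random and that both sets are non-empty.

```python
import random


def feedback_round(predicted, target, rng):
    """Simulate one round of precision/recall feedback.

    Returns (kind, item, reward). Assumes uniform sampling and
    non-empty sets; both are assumptions of this sketch.
    """
    if rng.random() < 0.5:
        # Precision feedback: reveal a random predicted item and
        # reward the learner if it belongs to the target set.
        item = rng.choice(sorted(predicted))
        return "precision", item, int(item in target)
    # Recall feedback: reveal a random target item and reward the
    # learner if it belongs to the predicted set.
    item = rng.choice(sorted(target))
    return "recall", item, int(item in predicted)


def expected_reward(predicted, target):
    """Expected per-round reward under the uniform-sampling assumption."""
    overlap = len(predicted & target)
    precision = overlap / len(predicted)
    recall = overlap / len(target)
    return 0.5 * precision + 0.5 * recall
```

Under these assumptions the expected reward of a prediction is the average of its precision and its recall with respect to the target set.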

preprint, 2026, arXiv

Scale-Sensitive Shattering: Learnability and Evaluability at Optimal Scale

We study the optimal scale at which real-valued function classes exhibit uniform convergence and learnability. Our main result establishes a scale-sensitive generalization of the fundamental theorem of PAC learning: for every bounded real-valued class and every $γ>0$, uniform convergence at scale $γ$, agnostic learnability at scale $γ/2$, and finiteness of the fat-shattering dimension at every scale $γ'>γ$ are equivalent. This resolves a question by Anthony and Bartlett (Cambridge Univ. Press 1999) on the precise scales governing learnability, refuting a conjecture attributed there to Phil Long that a multiplicative 2-factor gap is unavoidable, and improves the upper bounds of Bartlett and Long (JCSS 1998), which incur such a loss. The key technical ingredient is a direct bound on empirical $\ell_\infty$ covering numbers, avoiding the standard detour through packing numbers. As a consequence, we obtain sharp asymptotic metric-entropy bounds in terms of the fat-shattering scale $γ$: an $O(\log^2 n)$ bound holds already at scale $γ/2$, while an $O(\log n)$ bound holds at scale $2γ$. We further show that the $O(\log^2 n)$ bound is sometimes tight. These results resolve open questions by Alon et al. (JACM 1997) and Rudelson and Vershynin (Ann. of Math. 2006). As an application, we establish a sharp dichotomy for bounded integral probability metrics: every such IPM is either estimable or cannot be weakly evaluated within any multiplicative factor $c<3$, while $3$-weak evaluability always holds, resolving an open question from Aiyer et al. (ICML 2026). We also highlight several open questions on quantitative sample complexity and evaluability.

preprint, 2023, arXiv

Benign Underfitting of Stochastic Gradient Descent

We study to what extent stochastic gradient descent (SGD) may be understood as a "conventional" learning rule that achieves generalization performance by obtaining a good fit to training data. We consider the fundamental stochastic convex optimization framework, where (one pass, without-replacement) SGD is classically known to minimize the population risk at rate $O(1/\sqrt n)$, and prove that, surprisingly, there exist problem instances where the SGD solution exhibits both empirical risk and generalization gap of $Ω(1)$. Consequently, it turns out that SGD is not algorithmically stable in any sense, and its generalization ability cannot be explained by uniform convergence or any other currently known generalization bound technique for that matter (other than that of its classical analysis). We then continue to analyze the closely related with-replacement SGD, for which we show that an analogous phenomenon does not occur and prove that its population risk does in fact converge at the optimal rate. Finally, we interpret our main results in the context of without-replacement SGD for finite-sum convex optimization problems, and derive upper and lower bounds for the multi-epoch regime that significantly improve upon previously known results.

preprint, 2023, arXiv

Principal-Agent Reward Shaping in MDPs

Principal-agent problems arise when one party acts on behalf of another, leading to conflicts of interest. The economic literature has extensively studied principal-agent problems, and recent work has extended this to more complex scenarios such as Markov Decision Processes (MDPs). In this paper, we further explore this line of research by investigating how reward shaping under budget constraints can improve the principal's utility. We study a two-player Stackelberg game where the principal and the agent have different reward functions, and the agent chooses an MDP policy for both players. The principal offers an additional reward to the agent, and the agent picks their policy selfishly to maximize their reward, which is the sum of the original and the offered reward. Our results establish the NP-hardness of the problem and offer polynomial approximation algorithms for two classes of instances: Stochastic trees and deterministic decision processes with a finite horizon.

preprint, 2022, arXiv

Cooperative Online Learning in Stochastic and Adversarial MDPs

We study cooperative online learning in stochastic and adversarial Markov decision processes (MDPs). That is, in each episode, $m$ agents interact with an MDP simultaneously and share information in order to minimize their individual regret. We consider environments with two types of randomness: \emph{fresh} -- where each agent's trajectory is sampled i.i.d., and \emph{non-fresh} -- where the realization is shared by all agents (but each agent's trajectory is also affected by its own actions). More precisely, with non-fresh randomness the realization of every cost and transition is fixed at the start of each episode, and agents that take the same action in the same state at the same time observe the same cost and next state. We thoroughly analyze all relevant settings, highlight the challenges and differences between the models, and prove nearly-matching regret lower and upper bounds. To our knowledge, we are the first to consider cooperative reinforcement learning (RL) with either non-fresh randomness or in adversarial MDPs.

preprint, 2022, arXiv

Guarantees for Epsilon-Greedy Reinforcement Learning with Function Approximation

Myopic exploration policies such as epsilon-greedy, softmax, or Gaussian noise fail to explore efficiently in some reinforcement learning tasks and yet, they perform well in many others. In fact, in practice, they are often selected as the top choices, due to their simplicity. But, for what tasks do such policies succeed? Can we give theoretical guarantees for their favorable performance? These crucial questions have been scarcely investigated, despite the prominent practical importance of these policies. This paper presents a theoretical analysis of such policies and provides the first regret and sample-complexity bounds for reinforcement learning with myopic exploration. Our results apply to value-function-based algorithms in episodic MDPs with bounded Bellman Eluder dimension. We propose a new complexity measure called myopic exploration gap, denoted by alpha, that captures a structural property of the MDP, the exploration policy and the given value function class. We show that the sample-complexity of myopic exploration scales quadratically with the inverse of this quantity, 1 / alpha^2. We further demonstrate through concrete examples that myopic exploration gap is indeed favorable in several tasks where myopic exploration succeeds, due to the corresponding dynamics and reward structure.
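For readers unfamiliar with the exploration rule this abstract analyzes, epsilon-greedy action selection can be sketched as follows. The function name and the list-of-values representation are illustrative assumptions, not taken from the paper.

```python
import random


def epsilon_greedy(q_values, epsilon, rng):
    """Pick an action index from a list of estimated action values.

    With probability epsilon the action is uniformly random
    (exploration); otherwise it is the current greedy action.
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

The rule is "myopic" in the sense that exploration never depends on how uncertain the value estimates are; it only perturbs the greedy choice.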

preprint, 2022, arXiv

Improved Generalization Bounds for Adversarially Robust Learning

We consider a model of robust learning in an adversarial environment. The learner gets uncorrupted training data with access to possible corruptions that may be affected by the adversary during testing. The learner's goal is to build a robust classifier, which will be tested on future adversarial examples. The adversary is limited to $k$ possible corruptions for each input. We model the learner-adversary interaction as a zero-sum game. This model is closely related to the adversarial examples model of Schmidt et al. (2018); Madry et al. (2017). Our main results consist of generalization bounds for binary and multiclass classification, as well as the real-valued case (regression). For the binary classification setting, we both tighten the generalization bound of Feige et al. (2015), and are also able to handle infinite hypothesis classes. The sample complexity is improved from $O(\frac{1}{ε^4}\log(\frac{|H|}{δ}))$ to $O\big(\frac{1}{ε^2}(kVC(H)\log^{\frac{3}{2}+α}(kVC(H))+\log(\frac{1}{δ}))\big)$ for any $α> 0$. Additionally, we extend the algorithm and generalization bound from the binary to the multiclass and real-valued cases. Along the way, we obtain results on fat-shattering dimension and Rademacher complexity of $k$-fold maxima over function classes; these may be of independent interest. For binary classification, the algorithm of Feige et al. (2015) uses a regret minimization algorithm and an ERM oracle as a black box; we adapt it for the multiclass and regression settings. The algorithm provides us with near-optimal policies for the players on a given training sample.

preprint, 2022, arXiv

Learning Revenue Maximization using Posted Prices for Stochastic Strategic Patient Buyers

We consider a seller faced with buyers who have the ability to delay their decision, which we call patience. Each buyer's type is composed of value and patience, and it is sampled i.i.d. from a distribution. The seller, using posted prices, would like to maximize her revenue from selling to the buyers. In this paper, we formalize this setting and characterize the resulting Stackelberg equilibrium, where the seller first commits to her strategy, and then the buyers best respond. Following this, we show how to compute both the optimal pure and mixed strategies. We then consider a learning setting, where the seller does not have access to the distribution over buyers' types. Our main results are the following. We derive a sample complexity bound for the learning of an approximate optimal pure strategy, by computing the fat-shattering dimension of this setting. Moreover, we provide a general sample complexity bound for the approximate optimal mixed strategy. We also consider an online setting and derive a vanishing regret bound with respect to both the optimal pure strategy and the optimal mixed strategy.

preprint, 2022, arXiv

Stochastic Shortest Path with Adversarially Changing Costs

Stochastic shortest path (SSP) is a well-known problem in planning and control, in which an agent has to reach a goal state in minimum total expected cost. In this paper we present the adversarial SSP model that also accounts for adversarial changes in the costs over time, while the underlying transition function remains unchanged. Formally, an agent interacts with an SSP environment for $K$ episodes, the cost function changes arbitrarily between episodes, and the transitions are unknown to the agent. We develop the first algorithms for adversarial SSPs and prove high probability regret bounds of $\widetilde O (\sqrt{K})$ assuming all costs are strictly positive, and $\widetilde O (K^{3/4})$ in the general case. We are the first to consider this natural setting of adversarial SSP and obtain sub-linear regret for it.

preprint, 2022, arXiv

Strategizing against Learners in Bayesian Games

We study repeated two-player games where one of the players, the learner, employs a no-regret learning strategy, while the other, the optimizer, is a rational utility maximizer. We consider general Bayesian games, where the payoffs of both the optimizer and the learner could depend on the type, which is drawn from a publicly known distribution, but revealed privately to the learner. We address the following questions: (a) what is the bare minimum that the optimizer can guarantee to obtain regardless of the no-regret learning algorithm employed by the learner? (b) are there learning algorithms that cap the optimizer payoff at this minimum? (c) can these algorithms be implemented efficiently? While building this theory of optimizer-learner interactions, we define a new combinatorial notion of regret called polytope swap regret, that could be of independent interest in other settings.

preprint, 2022, arXiv

There is no Accuracy-Interpretability Tradeoff in Reinforcement Learning for Mazes

Interpretability is an essential building block for trustworthiness in reinforcement learning systems. However, interpretability might come at the cost of deteriorated performance, leading many researchers to build complex models. Our goal is to analyze the cost of interpretability. We show that in certain cases, one can achieve policy interpretability while maintaining its optimality. We focus on a classical problem from reinforcement learning: mazes with $k$ obstacles in $\mathbb{R}^d$. We prove the existence of a small decision tree with a linear function at each inner node and depth $O(\log k + 2^d)$ that represents an optimal policy. Note that for the interesting case of a constant $d$, we have $O(\log k)$ depth. Thus, in this setting, there is no accuracy-interpretability tradeoff. To prove this result, we use a new "compressing" technique that might be useful in additional settings.

preprint, 2022, arXiv

What killed the Convex Booster ?

A landmark negative result of Long and Servedio established a worst-case spectacular failure of a supervised learning trio (loss, algorithm, model) otherwise praised for its high precision machinery. Hundreds of papers followed up on the two suspected culprits: the loss (for being convex) and/or the algorithm (for fitting a classical boosting blueprint). Here, we call to the half-century+ founding theory of losses for class probability estimation (properness), an extension of Long and Servedio's results and a new general boosting algorithm to demonstrate that the real culprit in their specific context was in fact the (linear) model class. We advocate for a more general standpoint on the problem as we argue that the source of the negative result lies in the dark side of a pervasive -- and otherwise prized -- aspect of ML: \textit{parameterisation}.

preprint, 2021, arXiv

Online Markov Decision Processes with Aggregate Bandit Feedback

We study a novel variant of online finite-horizon Markov Decision Processes with adversarially changing loss functions and initially unknown dynamics. In each episode, the learner suffers the loss accumulated along the trajectory realized by the policy chosen for the episode, and observes aggregate bandit feedback: the trajectory is revealed along with the cumulative loss suffered, rather than the individual losses encountered along the trajectory. Our main result is a computationally efficient algorithm with $O(\sqrt{K})$ regret for this setting, where $K$ is the number of episodes. We establish this result via an efficient reduction to a novel bandit learning setting we call Distorted Linear Bandits (DLB), which is a variant of bandit linear optimization where actions chosen by the learner are adversarially distorted before they are committed. We then develop a computationally-efficient online algorithm for DLB for which we prove an $O(\sqrt{T})$ regret bound, where $T$ is the number of time steps. Our algorithm is based on online mirror descent with a self-concordant barrier regularization that employs a novel increasing learning rate schedule.

preprint, 2021, arXiv

Planning and Learning with Stochastic Action Sets

In many practical uses of reinforcement learning (RL) the set of actions available at a given state is a random variable, with realizations governed by an exogenous stochastic process. Somewhat surprisingly, the foundations for such sequential decision processes have been unaddressed. In this work, we formalize and investigate MDPs with stochastic action sets (SAS-MDPs) to provide these foundations. We show that optimal policies and value functions in this model have a structure that admits a compact representation. From an RL perspective, we show that Q-learning with sampled action sets is sound. In model-based settings, we consider two important special cases: when individual actions are available with independent probabilities; and a sampling-based model for unknown distributions. We develop poly-time value and policy iteration methods for both cases; and in the first, we offer a poly-time linear programming solution.
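One natural reading of "Q-learning with sampled action sets" is a tabular update whose maximum ranges only over the actions that happen to be available in the next state. The sketch below illustrates that reading; the function name, the dictionary representation, and the default step size and discount are assumptions of this example, and the paper's exact formulation may differ.

```python
from collections import defaultdict


def q_update(Q, s, a, r, s_next, available_next, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step with a realized (sampled) action set.

    The bootstrap target maximizes only over `available_next`, the
    actions actually available at `s_next` in this sample. An empty
    set is treated as a terminal state with value 0 (an assumption).
    """
    best_next = max((Q[(s_next, b)] for b in available_next), default=0.0)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q[(s, a)]
```

For example, with `Q = defaultdict(float)`, the same transition yields a smaller target when the highest-valued next action is missing from the sampled set than when it is present.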

preprint, 2021, arXiv

Separating Adaptive Streaming from Oblivious Streaming

We present a streaming problem for which every adversarially-robust streaming algorithm must use polynomial space, while there exists a classical (oblivious) streaming algorithm that uses only polylogarithmic space. This is the first separation between oblivious streaming and adversarially-robust streaming, and resolves one of the central open questions in adversarial robust streaming.

preprint, 2020, arXiv

Adversarially Robust Streaming Algorithms via Differential Privacy

A streaming algorithm is said to be adversarially robust if its accuracy guarantees are maintained even when the data stream is chosen maliciously, by an adaptive adversary. We establish a connection between adversarial robustness of streaming algorithms and the notion of differential privacy. This connection allows us to design new adversarially robust streaming algorithms that outperform the current state-of-the-art constructions for many interesting regimes of parameters.

preprint, 2020, arXiv

Beyond Individual and Group Fairness

We present a new data-driven model of fairness that, unlike existing static definitions of individual or group fairness, is guided by the unfairness complaints received by the system. Our model supports multiple fairness criteria and takes into account their potential incompatibilities. We consider both a stochastic and an adversarial setting of our model. In the stochastic setting, we show that our framework can be naturally cast as a Markov Decision Process with stochastic losses, for which we give efficient vanishing regret algorithmic solutions. In the adversarial setting, we design efficient algorithms with competitive ratio guarantees. We also report the results of experiments with our algorithms and the stochastic framework on artificial datasets, to demonstrate their effectiveness empirically.

preprint, 2020, arXiv

Detecting malicious PDF using CNN

Malicious PDF files represent one of the biggest threats to computer security. To detect them, significant research has been done using handwritten signatures or machine learning based on manual feature extraction. Both approaches are time-consuming and require significant prior knowledge, and the list of features has to be updated with each newly discovered vulnerability. In this work, we propose a novel algorithm that uses an ensemble of Convolutional Neural Networks (CNNs) on the byte level of the file, without any handcrafted features. We show, using a data set of 90000 files downloadable online, that our approach maintains a high detection rate (94%) of PDF malware and even detects new malicious files, still undetected by most antiviruses. Using automatically generated features from our CNN network, and applying a clustering algorithm, we also obtain high similarity between the antiviruses' labels and the resulting clusters.

preprint, 2020, arXiv

Near-optimal Regret Bounds for Stochastic Shortest Path

Stochastic shortest path (SSP) is a well-known problem in planning and control, in which an agent has to reach a goal state in minimum total expected cost. In the learning formulation of the problem, the agent is unaware of the environment dynamics (i.e., the transition function) and has to repeatedly play for a given number of episodes while reasoning about the problem's optimal solution. Unlike other well-studied models in reinforcement learning (RL), the length of an episode is not predetermined (or bounded) and is influenced by the agent's actions. Recently, Tarbouriech et al. (2019) studied this problem in the context of regret minimization and provided an algorithm whose regret bound is inversely proportional to the square root of the minimum instantaneous cost. In this work we remove this dependence on the minimum cost---we give an algorithm that guarantees a regret bound of $\widetilde{O}(B_\star |S| \sqrt{|A| K})$, where $B_\star$ is an upper bound on the expected cost of the optimal policy, $S$ is the set of states, $A$ is the set of actions and $K$ is the number of episodes. We additionally show that any learning algorithm must have at least $Ω(B_\star \sqrt{|S| |A| K})$ regret in the worst case.

preprint, 2020, arXiv

Planning in Hierarchical Reinforcement Learning: Guarantees for Using Local Policies

We consider a setting of hierarchical reinforcement learning, in which the reward is a sum of components. For each component we are given a policy that maximizes it and our goal is to assemble a policy from the individual policies that maximizes the sum of the components. We provide theoretical guarantees for assembling such policies in deterministic MDPs with collectible rewards. Our approach builds on formulating this problem as a traveling salesman problem with discounted reward. We focus on local solutions, i.e., policies that only use information from the current state; thus, they are easy to implement and do not require substantial computational resources. We propose three local stochastic policies and prove that they guarantee better performance than any deterministic local policy in the worst case; experimental results suggest that they also perform better on average.

preprint, 2020, arXiv

Reinforcement Learning with Feedback Graphs

We study episodic reinforcement learning in Markov decision processes when the agent receives additional feedback per step in the form of several transition observations. Such additional observations are available in a range of tasks through extended sensors or prior knowledge about the environment (e.g., when certain actions yield similar outcomes). We formalize this setting using a feedback graph over state-action pairs and show that model-based algorithms can leverage the additional feedback for more sample-efficient learning. We give a regret bound that, ignoring logarithmic factors and lower-order terms, depends only on the size of the maximum acyclic subgraph of the feedback graph, in contrast with a polynomial dependency on the number of states and actions in the absence of a feedback graph. Finally, we highlight challenges when leveraging a small dominating set of the feedback graph as compared to the bandit setting and propose a new algorithm that can use knowledge of such a dominating set for more sample-efficient learning of a near-optimal policy.

preprint, 2020, arXiv

Three Approaches for Personalization with Applications to Federated Learning

The standard objective in machine learning is to train a single model for all users. However, in many learning scenarios, such as cloud computing and federated learning, it is possible to learn a personalized model per user. In this work, we present a systematic learning-theoretic study of personalization. We propose and analyze three approaches: user clustering, data interpolation, and model interpolation. For all three approaches, we provide learning-theoretic guarantees and efficient algorithms for which we also demonstrate the performance empirically. All of our algorithms are model-agnostic and work for any hypothesis class.

preprint, 2020, arXiv

Unknown mixing times in apprenticeship and reinforcement learning

We derive and analyze learning algorithms for apprenticeship learning, policy evaluation, and policy gradient for average reward criteria. Existing algorithms explicitly require an upper bound on the mixing time. In contrast, we build on ideas from Markov chain theory and derive sampling algorithms that do not require such an upper bound. For these algorithms, we provide theoretical bounds on their sample-complexity and running time.

preprint, 2019, arXiv

Thompson Sampling for Adversarial Bit Prediction

We study the Thompson sampling algorithm in an adversarial setting, specifically, for adversarial bit prediction. We characterize the bit sequences with the smallest and largest expected regret. Among sequences of length $T$ with $k < \frac{T}{2}$ zeros, the sequences of largest regret consist of alternating zeros and ones followed by the remaining ones, and the sequence of smallest regret consists of ones followed by zeros. We also bound the regret of those sequences: the worst-case sequences have regret $O(\sqrt{T})$ and the best-case sequence has regret $O(1)$. We extend our results to a model where false positive and false negative errors have different weights. We characterize the sequences with largest expected regret in this generalized setting, and derive their regret bounds. We also show that there are sequences with $O(1)$ regret.
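The setting can be made concrete with a small simulation of Thompson sampling on a fixed bit sequence. This is a sketch under stated assumptions rather than the paper's exact formulation: it assumes a uniform Beta(1, 1) prior on the frequency of ones, a prediction of 1 whenever the sampled frequency exceeds 1/2, and regret measured against the best constant prediction in hindsight. The function names are hypothetical.

```python
import random


def thompson_bit_mistakes(bits, rng):
    """Count prediction mistakes of Thompson sampling on a bit sequence.

    Assumes a Beta(1, 1) prior over the frequency of ones; each round
    samples a frequency from the posterior and predicts the likelier bit.
    """
    ones = zeros = mistakes = 0
    for b in bits:
        p = rng.betavariate(ones + 1, zeros + 1)
        guess = 1 if p > 0.5 else 0
        mistakes += int(guess != b)
        if b == 1:
            ones += 1
        else:
            zeros += 1
    return mistakes


def regret(bits, rng):
    """Mistakes minus those of the best constant prediction in hindsight."""
    n_ones = sum(bits)
    best_fixed = min(n_ones, len(bits) - n_ones)
    return thompson_bit_mistakes(bits, rng) - best_fixed
```

Averaging `regret` over many seeds for, say, an alternating sequence versus a sorted one gives an empirical feel for the gap between the worst-case and best-case sequences described in the abstract.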

preprint, 2016, arXiv

Delay and Cooperation in Nonstochastic Bandits

We study networks of communicating learning agents that cooperate to solve a common nonstochastic bandit problem. Agents use an underlying communication network to get messages about actions selected by other agents, and drop messages that took more than $d$ hops to arrive, where $d$ is a delay parameter. We introduce \textsc{Exp3-Coop}, a cooperative version of the {\sc Exp3} algorithm and prove that with $K$ actions and $N$ agents the average per-agent regret after $T$ rounds is at most of order $\sqrt{\bigl(d+1 + \tfrac{K}{N}α_{\le d}\bigr)(T\ln K)}$, where $α_{\le d}$ is the independence number of the $d$-th power of the connected communication graph $G$. We then show that for any connected graph, for $d=\sqrt{K}$ the regret bound is $K^{1/4}\sqrt{T}$, strictly better than the minimax regret $\sqrt{KT}$ for noncooperating agents. More informed choices of $d$ lead to bounds which are arbitrarily close to the full information minimax regret $\sqrt{T\ln K}$ when $G$ is dense. When $G$ has sparse components, we show that a variant of \textsc{Exp3-Coop}, allowing agents to choose their parameters according to their centrality in $G$, strictly improves the regret. Finally, as a by-product of our analysis, we provide the first characterization of the minimax regret for bandit learning with delay.
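As background for the cooperative variant, the single-agent Exp3 baseline can be sketched as below. This is the standard loss-based form with importance-weighted estimates, not Exp3-Coop itself: there is no communication graph or delay, and the function signature and `loss_fn` callback are assumptions of this example.

```python
import math
import random


def exp3(loss_fn, n_actions, n_rounds, eta, rng):
    """Run single-agent Exp3 with losses in [0, 1].

    `loss_fn(t, action)` returns the loss of the chosen action only
    (bandit feedback). Returns the total loss and the last sampling
    distribution used.
    """
    log_w = [0.0] * n_actions
    probs = [1.0 / n_actions] * n_actions
    total_loss = 0.0
    for t in range(n_rounds):
        # Normalize in log space for numerical stability.
        m = max(log_w)
        w = [math.exp(x - m) for x in log_w]
        z = sum(w)
        probs = [x / z for x in w]
        action = rng.choices(range(n_actions), weights=probs)[0]
        loss = loss_fn(t, action)
        total_loss += loss
        # Importance-weighted estimate: only the played action is updated.
        log_w[action] -= eta * loss / probs[action]
    return total_loss, probs
```

A common choice for the learning rate is `eta = math.sqrt(2 * math.log(K) / (T * K))` for `K` actions and `T` rounds, which gives regret of order $\sqrt{KT\ln K}$ for a single agent without cooperation.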

preprint, 2016, arXiv

Dynamics of Evolving Social Groups

Exclusive social groups are ones in which the group members decide whether or not to admit a candidate to the group. Examples of exclusive social groups include academic departments and fraternal organizations. In the present paper we introduce an analytic framework for studying the dynamics of exclusive social groups. In our model, every group member is characterized by his opinion, which is represented as a point on the real line. The group evolves in discrete time steps through a voting process carried out by the group's members. Due to homophily, each member votes for the candidate who is more similar to him (i.e., closer to him on the line). An admission rule is then applied to determine which candidate, if any, is admitted. We consider several natural admission rules including majority and consensus. We ask: how do different admission rules affect the composition of the group in the long term? We study both growing groups (where new members join old ones) and fixed-size groups (where new members replace those who quit). Our analysis reveals intriguing phenomena and phase transitions, some of which are quite counterintuitive.

preprint, 2016, arXiv

Label Efficient Learning by Exploiting Multi-class Output Codes

We present a new perspective on the popular multi-class algorithmic techniques of one-vs-all and error correcting output codes. Rather than studying the behavior of these techniques for supervised learning, we establish a connection between the success of these methods and the existence of label-efficient learning procedures. We show that in both the realizable and agnostic cases, if output codes are successful at learning from labeled data, they implicitly assume structure on how the classes are related. By making that structure explicit, we design learning algorithms to recover the classes with low label complexity. We provide results for the commonly studied cases of one-vs-all learning and when the codewords of the classes are well separated. We additionally consider the more challenging case where the codewords are not well separated, but satisfy a boundary features condition that captures the natural intuition that every bit of the codewords should be significant.

preprint2016arXiv

Online Learning with Low Rank Experts

We consider the problem of prediction with expert advice when the losses of the experts have low-dimensional structure: they are restricted to an unknown $d$-dimensional subspace. We devise algorithms with regret bounds that are independent of the number of experts and depend only on the rank $d$. For the stochastic model we show a tight bound of $Θ(\sqrt{dT})$, and extend it to a setting of an approximate $d$ subspace. For the adversarial model we show an upper bound of $O(d\sqrt{T})$ and a lower bound of $Ω(\sqrt{dT})$.

preprint2016arXiv

Predicting Counterfactuals from Large Historical Data and Small Randomized Trials

When a new treatment is considered for use, whether a pharmaceutical drug or a search engine ranking algorithm, a typical question that arises is, will its performance exceed that of the current treatment? The conventional way to answer this counterfactual question is to estimate the effect of the new treatment in comparison to that of the conventional treatment by running a controlled, randomized experiment. While this approach theoretically ensures an unbiased estimator, it suffers from several drawbacks, including the difficulty in finding representative experimental populations as well as the cost of running such trials. Moreover, such trials neglect the huge quantities of available control-condition data which are often completely ignored. In this paper we propose a discriminative framework for estimating the performance of a new treatment given a large dataset of the control condition and data from a small (and possibly unrepresentative) randomized trial comparing new and old treatments. Our objective, which requires minimal assumptions on the treatments, models the relation between the outcomes of the different conditions. This allows us to not only estimate mean effects but also to generate individual predictions for examples outside the randomized sample. We demonstrate the utility of our approach through experiments in three areas: Search engine operation, treatments to diabetes patients, and market value estimation for houses. Our results demonstrate that our approach can reduce the number and size of the currently performed randomized controlled experiments, thus saving significant time, money and effort on the part of practitioners.

preprint2016arXiv

When should an expert make a prediction?

We consider a setting where, at a known future time, a certain continuous random variable will be realized. There is a public prediction that gradually converges to its realized value, and an expert that has access to a more accurate prediction. Our goal is to study {\em when} the expert should reveal his information, assuming that his reward is based on a logarithmic market scoring rule (i.e., his reward is proportional to the gain in log-likelihood of the realized value). Our contributions are: (1) we characterize the expert's optimal policy and show that it is threshold based, (2) we analyze the expert's asymptotic expected optimal reward and show a tight connection to the Law of the Iterated Logarithm, and (3) we give an efficient dynamic programming algorithm to compute the optimal policy.

preprint2015arXiv

Classification with Low Rank and Missing Data

We consider classification and regression tasks where we have missing data and assume that the (clean) data resides in a low rank subspace. Finding a hidden subspace is known to be computationally hard. Nevertheless, using a non-proper formulation we give an efficient agnostic algorithm that classifies as well as the best linear classifier coupled with the best low-dimensional subspace in which the data resides. A direct implication is that our algorithm can linearly (and non-linearly through kernels) classify provably as well as the best classifier that has access to the full data.

preprint2015arXiv

Making the Most of Your Samples

We study the problem of setting a price for a potential buyer with a valuation drawn from an unknown distribution $D$. The seller has "data" about $D$ in the form of $m \ge 1$ i.i.d. samples, and the algorithmic challenge is to use these samples to obtain expected revenue as close as possible to what could be achieved with advance knowledge of $D$. Our first set of results quantifies the number of samples $m$ that are necessary and sufficient to obtain a $(1-ε)$-approximation. For example, for an unknown distribution that satisfies the monotone hazard rate (MHR) condition, we prove that $\tilde{Θ}(ε^{-3/2})$ samples are necessary and sufficient. Remarkably, this is fewer samples than is necessary to accurately estimate the expected revenue obtained by even a single reserve price. We also prove essentially tight sample complexity bounds for regular distributions, bounded-support distributions, and a wide class of irregular distributions. Our lower bound approach borrows tools from differential privacy and information theory, and we believe it could find further applications in auction theory. Our second set of results considers the single-sample case. For regular distributions, we prove that no pricing strategy is better than $\tfrac{1}{2}$-approximate, and this is optimal by the Bulow-Klemperer theorem. For MHR distributions, we show how to do better: we give a simple pricing strategy that guarantees expected revenue at least $0.589$ times the maximum possible. We also prove that no pricing strategy achieves an approximation guarantee better than $\frac{e}{4} \approx .68$.
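As a rough illustration of the pricing task described in this abstract, the sketch below picks the price that maximizes revenue against the empirical distribution of the samples. This is plain empirical revenue maximization, a generic baseline, and not necessarily the strategy analyzed in the paper; the function name `empirical_revenue_price` is hypothetical.

```python
def empirical_revenue_price(samples):
    """Return (price, empirical revenue) for the best price among the samples.

    Hypothetical helper: the empirical revenue of a price p is p times the
    fraction of sampled valuations that are at least p.
    """
    if not samples:
        raise ValueError("need at least one sample")
    n = len(samples)
    best_price, best_revenue = None, -1.0
    # Only sample values need to be tried: between two consecutive sample
    # values, raising the price does not lose any sampled buyer.
    for price in sorted(set(samples)):
        buyers = sum(1 for v in samples if v >= price)
        revenue = price * buyers / n
        if revenue > best_revenue:
            best_price, best_revenue = price, revenue
    return best_price, best_revenue
```

With very few samples this baseline can overfit to a single large valuation, which is one reason sample complexity bounds of the kind stated in the abstract are of interest.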

preprint2015arXiv

Optimistic-Conservative Bidding in Sequential Auctions

In this work we consider selling items using a sequential first price auction mechanism. We generalize the assumption of conservative bidding to extensive form games (henceforth optimistic conservative bidding), and show that for both linear and unit demand valuations, the only pure subgame perfect equilibrium where buyers are bidding in an optimistic conservative manner is the minimal Walrasian equilibrium. In addition, we show examples where without the requirement of conservative bidding, subgame perfect equilibria can admit a variety of unlikely predictions, including high price of anarchy and low revenue in markets composed of additive bidders, equilibria which elicit all the surplus as revenue, and more. We also show that the order in which the items are sold can influence the outcome.

preprint2014arXiv

Learning Valuation Distributions from Partial Observation

Auction theory traditionally assumes that bidders' valuation distributions are known to the auctioneer, such as in the celebrated, revenue-optimal Myerson auction. However, this theory does not describe how the auctioneer comes to possess this information. Recently, Cole and Roughgarden [2014] showed that an approximation based on a finite sample of independent draws from each bidder's distribution is sufficient to produce a near-optimal auction. In this work, we consider the problem of learning bidders' valuation distributions from much weaker forms of observations. Specifically, we consider a setting where there is a repeated, sealed-bid auction with $n$ bidders, but all we observe for each round is who won, but not how much they bid or paid. We can also participate (i.e., submit a bid) ourselves, and observe when we win. From this information, our goal is to (approximately) recover the inherently recoverable part of the underlying bid distributions. We also consider extensions where different subsets of bidders participate in each round, and where bidders' valuations have a common-value component added to their independent private values.

preprint2014arXiv

Learning What's going on: reconstructing preferences and priorities from opaque transactions

We consider a setting where $n$ buyers, with combinatorial preferences over $m$ items, and a seller, running a priority-based allocation mechanism, repeatedly interact. Our goal, from observing limited information about the results of these interactions, is to reconstruct both the preferences of the buyers and the mechanism of the seller. More specifically, we consider an online setting where at each stage, a subset of the buyers arrive and are allocated items, according to some unknown priority that the seller has among the buyers. Our learning algorithm observes only which buyers arrive and the allocation produced (or some function of the allocation, such as just which buyers received positive utility and which did not), and its goal is to predict the outcome for future subsets of buyers. For this task, the learning algorithm needs to reconstruct both the priority among the buyers and the preferences of each buyer. We derive mistake bound algorithms for additive, unit-demand and single minded buyers. We also consider the case where buyers' utilities for a fixed bundle can change between stages due to different (observed) prices. Our algorithms are efficient both in computation time and in the maximum number of mistakes (both polynomial in the number of buyers and items).

preprint2014arXiv

Local computation mechanism design

We introduce the notion of Local Computation Mechanism Design - designing game theoretic mechanisms which run in polylogarithmic time and space. Local computation mechanisms reply to each query in polylogarithmic time and space, and the replies to different queries are consistent with the same global feasible solution. In addition, the computation of the payments is also done in polylogarithmic time and space. Furthermore, the mechanisms need to maintain incentive compatibility with respect to the allocation and payments. We present local computation mechanisms for a variety of classical game-theoretical problems: 1. stable matching, 2. job scheduling, 3. combinatorial auctions for unit-demand and k-minded bidders, and 4. the housing allocation problem. For stable matching, some of our techniques may have general implications. Specifically, we show that when the men's preference lists are bounded, we can achieve an arbitrarily good approximation to the stable matching within a fixed number of iterations of the Gale-Shapley algorithm.

preprint2014arXiv

Nonstochastic Multi-Armed Bandits with Graph-Structured Feedback

We present and study a partial-information model of online learning, where a decision maker repeatedly chooses from a finite set of actions, and observes some subset of the associated losses. This naturally models several situations where the losses of different actions are related, and knowing the loss of one action provides information on the loss of other actions. Moreover, it generalizes and interpolates between the well studied full-information setting (where all losses are revealed) and the bandit setting (where only the loss of the action chosen by the player is revealed). We provide several algorithms addressing different variants of our setting, and provide tight regret bounds depending on combinatorial properties of the information feedback structure.

preprint2014arXiv

On the Complexity of Learning with Kernels

A well-recognized limitation of kernel learning is the requirement to handle a kernel matrix, whose size is quadratic in the number of training examples. Many methods have been proposed to reduce this computational cost, mostly by using a subset of the kernel matrix entries, or some form of low-rank matrix approximation, or a random projection method. In this paper, we study lower bounds on the error attainable by such methods as a function of the number of entries observed in the kernel matrix or the rank of an approximate kernel matrix. We show that there are kernel learning problems where no such method will lead to non-trivial computational savings. Our results also quantify how the problem difficulty depends on parameters such as the nature of the loss function, the regularization parameter, the norm of the desired predictor, and the kernel matrix rank. Our results also suggest cases where more efficient kernel learning might be possible.

preprint2014arXiv

Probe Scheduling for Efficient Detection of Silent Failures

Most discovery systems for silent failures work in two phases: a continuous monitoring phase that detects the presence of failures through probe packets and a localization phase that pinpoints the faulty element(s). This separation is important because localization requires significantly more resources than detection and should be initiated only when a fault is present. We focus on improving the efficiency of the detection phase, where the goal is to balance the overhead with the cost associated with longer failure detection times. We formulate a general model which unifies the treatment of probe scheduling mechanisms, stochastic or deterministic, and different cost objectives - minimizing average detection time (SUM) or worst-case detection time (MAX). We then focus on two classes of schedules. {\em Memoryless schedules} -- a subclass of stochastic schedules which is simple and suitable for distributed deployment. We show that the optimal memoryless schedulers can be efficiently computed by convex programs (for SUM objectives) or linear programs (for MAX objectives), and surprisingly perhaps, are guaranteed to have expected detection times that are not too far off the (NP hard) stochastic optima. {\em Deterministic schedules} allow us to bound the maximum (rather than expected) cost of undetected faults, but like stochastic schedules, are NP hard to optimize. We develop novel efficient deterministic schedulers with provable approximation ratios. An extensive simulation study on real networks demonstrates significant performance gains of our memoryless and deterministic schedulers over previous approaches. Our unified treatment also facilitates a clear comparison between different objectives and scheduling mechanisms.

preprint2013arXiv

A Local Computation Approximation Scheme to Maximum Matching

We present a polylogarithmic local computation matching algorithm which guarantees a $(1-ε)$-approximation to the maximum matching in graphs of bounded degree.

preprint2013arXiv

An Information-Theoretic Analysis of Hard and Soft Assignment Methods for Clustering

Assignment methods are at the heart of many algorithms for unsupervised learning and clustering - in particular, the well-known K-means and Expectation-Maximization (EM) algorithms. In this work, we study several different methods of assignment, including the "hard" assignments used by K-means and the "soft" assignments used by EM. While it is known that K-means minimizes the distortion on the data and EM maximizes the likelihood, little is known about the systematic differences of behavior between the two algorithms. Here we shed light on these differences via an information-theoretic analysis. The cornerstone of our results is a simple decomposition of the expected distortion, showing that K-means (and its extension for inferring general parametric densities from unlabeled sample data) must implicitly manage a trade-off between how similar the data assigned to each cluster are, and how the data are balanced among the clusters. How well the data are balanced is measured by the entropy of the partition defined by the hard assignments. In addition to letting us predict and verify systematic differences between K-means and EM on specific examples, the decomposition allows us to give a rather general argument showing that K-means will consistently find densities with less "overlap" than EM. We also study a third natural assignment method that we call posterior assignment, that is close in spirit to the soft assignments of EM, but leads to a surprisingly different algorithm.

preprint2013arXiv

Exact Inference of Hidden Structure from Sample Data in Noisy-OR Networks

In the literature on graphical models, there has been increased attention paid to the problems of learning hidden structure (see Heckerman [H96] for a survey) and causal mechanisms from sample data [H96, P88, S93, P95, F98]. In most settings we should expect the former to be difficult, and the latter potentially impossible without experimental intervention. In this work, we examine some restricted settings in which it is possible to perfectly reconstruct the hidden structure solely on the basis of observed sample data.

preprint2013arXiv

Fast Planning in Stochastic Games

Stochastic games generalize Markov decision processes (MDPs) to a multiagent setting by allowing the state transitions to depend jointly on all player actions, and having rewards determined by multiplayer matrix games at each state. We consider the problem of computing Nash equilibria in stochastic games, the analogue of planning in MDPs. We begin by providing a generalization of finite-horizon value iteration that computes a Nash strategy for each player in general-sum stochastic games. The algorithm takes an arbitrary Nash selection function as input, which allows the translation of local choices between multiple Nash equilibria into the selection of a single global Nash equilibrium. Our main technical result is an algorithm for computing near-Nash equilibria in large or infinite state spaces. This algorithm builds on our finite-horizon value iteration algorithm, and adapts the sparse sampling methods of Kearns, Mansour and Ng (1999) to stochastic games. We conclude by describing a counterexample showing that infinite-horizon discounted value iteration, which was shown by Shapley to converge in the zero-sum case (a result we extend slightly here), does not converge in the general-sum case.

preprint2013arXiv

From Bandits to Experts: A Tale of Domination and Independence

We consider the partial observability model for multi-armed bandits, introduced by Mannor and Shamir. Our main result is a characterization of regret in the directed observability model in terms of the dominating and independence numbers of the observability graph. We also show that in the undirected case, the learner can achieve optimal regret without even accessing the observability graph before selecting an action. Both results are shown using variants of the Exp3 algorithm operating on the observability graph in a time-efficient manner.
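To make the idea of running Exp3 on an observability graph more concrete, here is a minimal sketch of an exponential-weights learner whose loss estimates are importance-weighted by the probability of observing each action. It is written in the spirit of the Exp3 variants mentioned in the abstract, not as the paper's exact algorithm; the names `exp3_on_graph`, `loss_fn`, and `observes`, and the choice of a fixed learning rate `eta`, are assumptions made for illustration.

```python
import math
import random


def exp3_on_graph(loss_fn, observes, n_actions, horizon, eta, seed=0):
    """Exponential weights with graph-structured feedback (illustrative sketch).

    loss_fn(t)  -> list of losses in [0, 1], one per action, for round t.
    observes[j] -> set of actions whose losses are revealed when j is played
                   (assumed to contain j itself).
    Returns (total loss suffered, cumulative loss estimates per action).
    """
    rng = random.Random(seed)
    cum_est = [0.0] * n_actions
    total_loss = 0.0
    for t in range(horizon):
        # Exponential weights on the estimated cumulative losses
        # (shifted by the minimum for numerical stability).
        base = min(cum_est)
        weights = [math.exp(-eta * (c - base)) for c in cum_est]
        norm = sum(weights)
        probs = [w / norm for w in weights]

        played = rng.choices(range(n_actions), weights=probs)[0]
        losses = loss_fn(t)
        total_loss += losses[played]

        # Importance weighting: divide each observed loss by the probability
        # that the action would have been observed this round.
        for i in observes[played]:
            q_i = sum(probs[j] for j in range(n_actions) if i in observes[j])
            cum_est[i] += losses[i] / q_i
    return total_loss, cum_est
```

When every action observes every other action, the observation probability is 1 and the estimates equal the true losses (the full-information, or experts, setting). When each action observes only itself, the update reduces to the usual bandit estimate.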

preprint2013arXiv

Nash Convergence of Gradient Dynamics in Iterated General-Sum Games

Multi-agent games are becoming an increasingly prevalent formalism for the study of electronic commerce and auctions. The speed at which transactions can take place and the growing complexity of electronic marketplaces make the study of computationally simple agents an appealing direction. In this work, we analyze the behavior of agents that incrementally adapt their strategy through gradient ascent on expected payoff, in the simple setting of two-player, two-action, iterated general-sum games, and present a surprising result. We show that either the agents will converge to Nash equilibrium, or if the strategies themselves do not converge, then their average payoffs will nevertheless converge to the payoffs of a Nash equilibrium.
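The dynamics described in this abstract can be sketched as follows, assuming each player's mixed strategy is summarized by the probability of playing its first action and both players take simultaneous projected gradient steps on their own expected payoff. The finite step size is a discretization chosen for illustration, and the function name `gradient_ascent_2x2` is hypothetical.

```python
def gradient_ascent_2x2(R, C, alpha, beta, step=0.01, iters=1000):
    """Simultaneous projected gradient ascent in a two-player, two-action game.

    R[i][j], C[i][j]: payoffs to the row and column player when row plays i
    and column plays j. alpha (beta) is the probability that the row (column)
    player picks action 0.
    """
    # Expected payoffs are bilinear in (alpha, beta); these are the
    # coefficients of the alpha*beta term for each player.
    u_row = R[0][0] - R[0][1] - R[1][0] + R[1][1]
    u_col = C[0][0] - C[0][1] - C[1][0] + C[1][1]
    for _ in range(iters):
        # Partial derivative of each player's expected payoff with respect
        # to that player's own strategy parameter.
        grad_alpha = beta * u_row + (R[0][1] - R[1][1])
        grad_beta = alpha * u_col + (C[1][0] - C[1][1])
        # Gradient step, then projection back onto [0, 1].
        alpha = min(1.0, max(0.0, alpha + step * grad_alpha))
        beta = min(1.0, max(0.0, beta + step * grad_beta))
    return alpha, beta
```

For a prisoner's dilemma with action 0 as "cooperate", for example `R = [[3, 0], [5, 1]]` and `C = [[3, 5], [0, 1]]`, both gradients are negative everywhere, so both probabilities are driven to 0, which is the mutual-defection Nash equilibrium.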

preprint2013arXiv

On the Complexity of Policy Iteration

Decision-making problems in uncertain or stochastic domains are often formulated as Markov decision processes (MDPs). Policy iteration (PI) is a popular algorithm for searching over policy-space, the size of which is exponential in the number of states. We are interested in bounds on the complexity of PI that do not depend on the value of the discount factor. In this paper we prove the first such non-trivial, worst-case, upper bounds on the number of iterations required by PI to converge to the optimal policy. Our analysis also sheds new light on the manner in which PI progresses through the space of policies.

preprint2013arXiv

Thompson Sampling for Complex Bandit Problems

We consider stochastic multi-armed bandit problems with complex actions over a set of basic arms, where the decision maker plays a complex action rather than a basic arm in each round. The reward of the complex action is some function of the basic arms' rewards, and the feedback observed may not necessarily be the reward per-arm. For instance, when the complex actions are subsets of the arms, we may only observe the maximum reward over the chosen subset. Thus, feedback across complex actions may be coupled due to the nature of the reward function. We prove a frequentist regret bound for Thompson sampling in a very general setting involving parameter, action and observation spaces and a likelihood function over them. The bound holds for discretely-supported priors over the parameter space and without additional structural properties such as closed-form posteriors, conjugate prior structure or independence across arms. The regret bound scales logarithmically with time but, more importantly, with an improved constant that non-trivially captures the coupling across complex actions due to the structure of the rewards. As applications, we derive improved regret bounds for classes of complex bandit problems involving selecting subsets of arms, including the first nontrivial regret bounds for nonlinear MAX reward feedback from subsets.

preprint2012arXiv

Converting online algorithms to local computation algorithms

We propose a general method for converting online algorithms to local computation algorithms by selecting a random permutation of the input, and simulating running the online algorithm. We bound the number of steps of the algorithm using a query tree, which models the dependencies between queries. We improve previous analyses of query trees on graphs of bounded degree, and extend the analysis to the cases where the degrees are distributed binomially, and to a special case of bipartite graphs. Using this method, we give a local computation algorithm for maximal matching in graphs of bounded degree, which runs in time and space O(log^3 n). We also show how to convert a large family of load balancing algorithms (related to balls and bins problems) to local computation algorithms. This gives several local load balancing algorithms which achieve the same approximation ratios as the online algorithms, but run in O(log n) time and space. Finally, we modify existing local computation algorithms for hypergraph 2-coloring and k-CNF and use our improved analysis to obtain better time and space bounds, of O(log^4 n), removing the dependency on the maximal degree of the graph from the exponent.
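One way to picture the conversion described in this abstract, for the maximal matching case, is the sketch below: edges receive a random order, and a query about one edge simulates the greedy online algorithm by recursively asking only about adjacent edges that come earlier in that order. The function name `make_matching_oracle` is hypothetical, the graph is assumed to be simple, and the memoization is a convenience of this sketch, not a claim about the paper's time and space bounds.

```python
import random
from functools import lru_cache


def make_matching_oracle(edges, seed=0):
    """Return (query, rank) where query(e) tells whether edge e is matched.

    The answers are consistent with the greedy maximal matching that scans
    the edges in the random order given by `rank`.
    """
    rng = random.Random(seed)
    edges = [tuple(sorted(e)) for e in edges]
    order = list(range(len(edges)))
    rng.shuffle(order)
    rank = dict(zip(edges, order))  # random permutation, no ties

    incident = {}
    for e in edges:
        for v in e:
            incident.setdefault(v, []).append(e)

    @lru_cache(maxsize=None)
    def query(e):
        # Greedy adds e exactly when no adjacent edge that arrived earlier
        # was added, so only lower-ranked neighbours need to be explored.
        earlier = {f for v in e for f in incident[v]
                   if f != e and rank[f] < rank[e]}
        for f in sorted(earlier, key=rank.get):
            if query(f):
                return False
        return True

    return query, rank
```

Each query touches only the part of the graph reachable through decreasing ranks, which is the dependency structure that a query tree is meant to capture.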

preprint2012arXiv

Distributed Learning, Communication Complexity and Privacy

We consider the problem of PAC-learning from distributed data and analyze fundamental communication complexity questions involved. We provide general upper and lower bounds on the amount of communication needed to learn well, showing that in addition to VC-dimension and covering number, quantities such as the teaching-dimension and mistake-bound of a class play an important role. We also present tight results for a number of common concept classes including conjunctions, parity functions, and decision lists. For linear separators, we show that for non-concentrated distributions, we can use a version of the Perceptron algorithm to learn with much less communication than the number of updates given by the usual margin bound. We also show how boosting can be performed in a generic manner in the distributed setting to achieve communication with only logarithmic dependence on 1/epsilon for any concept class, and demonstrate how recent work on agnostic learning from class-conditional queries can be used to achieve low communication in agnostic settings as well. We additionally present an analysis of privacy, considering both differential privacy and a notion of distributional privacy that is especially appealing in this context.

preprint2012arXiv

Doubleclick Ad Exchange Auction

Display advertisements on the web are sold via ad exchanges that use real-time auctions. We describe the challenges of designing a suitable auction, and present a simple auction called the Optional Second Price (OSP) auction that is currently used in Doubleclick Ad Exchange.

preprint2012arXiv

Efficient Nash Computation in Large Population Games with Bounded Influence

We introduce a general representation of large-population games in which each player's influence on the others is centralized and limited, but may otherwise be arbitrary. This representation significantly generalizes the class known as congestion games in a natural way. Our main results are provably correct and efficient algorithms for computing and learning approximate Nash equilibria in this general framework.

preprint2012arXiv

Multiple Source Adaptation and the Renyi Divergence

This paper presents a novel theoretical study of the general problem of multiple source adaptation using the notion of Renyi divergence. Our results build on our previous work [12], but significantly broaden the scope of that work in several directions. We extend previous multiple source loss guarantees based on distribution weighted combinations to arbitrary target distributions P, not necessarily mixtures of the source distributions, analyze both known and unknown target distribution cases, and prove a lower bound. We further extend our bounds to deal with the case where the learner receives an approximate distribution for each source instead of the exact one, and show that similar loss guarantees can be achieved depending on the divergence between the approximate and true distributions. We also analyze the case where the labeling functions of the source domains are somewhat different. Finally, we report the results of experiments with both an artificial data set and a sentiment analysis task, showing the performance benefits of the distribution weighted combinations and the quality of our bounds based on the Renyi divergence.

preprint2012arXiv

Planning in POMDPs Using Multiplicity Automata

Planning and learning in Partially Observable MDPs (POMDPs) are among the most challenging tasks in both the AI and Operations Research communities. Although solutions to these problems are intractable in general, there might be special cases, such as structured POMDPs, which can be solved efficiently. A natural and possibly efficient way to represent a POMDP is through the predictive state representation (PSR) - a representation which recently has been receiving increasing attention. In this work, we relate POMDPs to multiplicity automata, showing that POMDPs can be represented by multiplicity automata with no increase in the representation size. Furthermore, we show that the size of the multiplicity automaton is equal to the rank of the predictive state representation. Therefore, we relate both the predictive state representation and POMDPs to the well-founded multiplicity automata literature. Based on the multiplicity automata representation, we provide a planning algorithm which is exponential only in the multiplicity automata rank rather than the number of states of the POMDP. As a result, whenever the predictive state representation is logarithmic in the standard POMDP representation, our planning algorithm is efficient.

preprint2012arXiv

The AND-OR game: Equilibrium Characterization (Working Paper)

We consider a simple simultaneous first price auction for multiple items in a complete information setting. Our goal is to completely characterize the mixed equilibria in this setting, for a simple, yet highly interesting, {\tt AND}-{\tt OR} game, where one agent is single minded and the other is unit demand.

preprint2011arXiv

Non-Price Equilibria in Markets of Discrete Goods

We study markets of indivisible items in which price-based (Walrasian) equilibria often do not exist due to the discrete non-convex setting. Instead we consider Nash equilibria of the market viewed as a game, where players bid for items, and where the highest bidder on an item wins it and pays his bid. We first observe that pure Nash equilibria of this game exactly correspond to price-based equilibria (and thus need not exist), but that mixed Nash equilibria always do exist, and we analyze their structure in several simple cases where no price-based equilibrium exists. We also undertake an analysis of the welfare properties of these equilibria showing that while pure equilibria are always perfectly efficient ("first welfare theorem"), mixed equilibria need not be, and we provide upper and lower bounds on their amount of inefficiency.

preprint2011arXiv

Welfare and Profit Maximization with Production Costs

Combinatorial Auctions are a central problem in Algorithmic Mechanism Design: pricing and allocating goods to buyers with complex preferences in order to maximize some desired objective (e.g., social welfare, revenue, or profit). The problem has been well-studied in the case of limited supply (one copy of each item), and in the case of digital goods (the seller can produce additional copies at no cost). Yet in the case of resources---oil, labor, computing cycles, etc.---neither of these abstractions is just right: additional supplies of these resources can be found, but at increasing difficulty (marginal cost) as resources are depleted. In this work, we initiate the study of the algorithmic mechanism design problem of combinatorial pricing under increasing marginal cost. The goal is to sell these goods to buyers with unknown and arbitrary combinatorial valuation functions to maximize either the social welfare, or the seller's profit; specifically we focus on the setting of \emph{posted item prices} with buyers arriving online. We give algorithms that achieve {\em constant factor} approximations for a class of natural cost functions---linear, low-degree polynomial, logarithmic---and that give logarithmic approximations for more general increasing marginal cost functions (along with a necessary additive loss). We show that these bounds are essentially best possible for these settings.

preprint2010arXiv

Approximation Schemes for Sequential Posted Pricing in Multi-Unit Auctions

We design algorithms for computing approximately revenue-maximizing {\em sequential posted-pricing mechanisms (SPM)} in $K$-unit auctions, in a standard Bayesian model. A seller has $K$ copies of an item to sell, and there are $n$ buyers, each interested in only one copy, who have some value for the item. The seller must post a price for each buyer, the buyers arrive in a sequence enforced by the seller, and a buyer buys the item if its value exceeds the price posted to it. The seller does not know the values of the buyers, but has Bayesian information about them. An SPM specifies the ordering of buyers and the posted prices, and may be {\em adaptive} or {\em non-adaptive} in its behavior. The goal is to design an SPM in polynomial time to maximize expected revenue. We compare against the expected revenue of the optimal SPM, and provide a polynomial time approximation scheme (PTAS) for both non-adaptive and adaptive SPMs. This is achieved by two algorithms: an efficient algorithm that gives a $(1-\frac{1}{\sqrt{2πK}})$-approximation (and hence a PTAS for sufficiently large $K$), and another that is a PTAS for constant $K$. The first algorithm yields a non-adaptive SPM that achieves its approximation guarantees against an optimal adaptive SPM -- this implies that the {\em adaptivity gap} in SPMs vanishes as $K$ becomes larger.

preprint2010arXiv

Selective Call Out and Real Time Bidding

Ads on the Internet are increasingly sold via ad exchanges such as RightMedia, AdECN and Doubleclick Ad Exchange. These exchanges allow real-time bidding, that is, each time the publisher contacts the exchange, the exchange "calls out" to solicit bids from ad networks. Soliciting bids in this way introduces a novel aspect, in contrast to existing literature. This suggests developing a joint optimization framework which optimizes over the allocation as well as the solicitation. We model this selective call out as an online recurrent Bayesian decision framework with bandwidth type constraints. We obtain natural algorithms with bounded performance guarantees for several natural optimization criteria. We show that these results hold under different call out constraint models, and different arrival processes. Interestingly, the paper shows that under MHR assumptions, the expected revenue of the generalized second price auction with reserve is a constant factor of the expected welfare. Also, the analysis herein allows us to prove adaptivity gap type results for the adwords problem.

Institution

Affiliation not imported yet

This author record came from a source that does not expose affiliation metadata. Once the author claims the profile or we enrich the record from another provider, this section will link to the concrete institution.

Topic footprint

Fields this researcher appears in

Source provenance

Where this author record came from

arxiv, confidence 95%

external id: arxiv:2605.09565:author:2:yishay-mansour

Imported May 20, 2026. Synced May 21, 2026

arxiv, confidence 95%

external id: arxiv:2605.13684:author:2:yishay-mansour

Imported May 20, 2026. Synced May 21, 2026

arxiv, confidence 95%

external id: arxiv:2605.13145:author:3:yishay-mansour

Imported May 20, 2026. Synced May 21, 2026

arxiv, confidence 95%

external id: arxiv:2604.28020:author:5:yishay-mansour

Imported May 20, 2026. Synced May 20, 2026

11 works

Haim Kaplan

Researcher

Haim Kaplan contributes to research discovery and scholarly infrastructure.

Open to collaborate

7 works

Avinatan Hassidim

Researcher

Avinatan Hassidim contributes to research discovery and scholarly infrastructure.

Open to collaborate

6 works

Mehryar Mohri

Researcher

Mehryar Mohri contributes to research discovery and scholarly infrastructure.

Open to collaborate

5 works

Michael Kearns

Researcher

Michael Kearns contributes to research discovery and scholarly infrastructure.

Open to collaborate

Yishay Mansour

60 published item(s)

Collaborating in Multi-Armed Bandits with Strategic Agents

Cost-Aware Learning

Online Set Learning from Precision and Recall Feedback

Scale-Sensitive Shattering: Learnability and Evaluability at Optimal Scale

Benign Underfitting of Stochastic Gradient Descent

Principal-Agent Reward Shaping in MDPs

Cooperative Online Learning in Stochastic and Adversarial MDPs

Guarantees for Epsilon-Greedy Reinforcement Learning with Function Approximation

Improved Generalization Bounds for Adversarially Robust Learning

Learning Revenue Maximization using Posted Prices for Stochastic Strategic Patient Buyers

Stochastic Shortest Path with Adversarially Changing Costs

Strategizing against Learners in Bayesian Games

There is no Accuracy-Interpretability Tradeoff in Reinforcement Learning for Mazes

What killed the Convex Booster ?

Online Markov Decision Processes with Aggregate Bandit Feedback

Planning and Learning with Stochastic Action Sets

Separating Adaptive Streaming from Oblivious Streaming

Adversarially Robust Streaming Algorithms via Differential Privacy

Beyond Individual and Group Fairness

Detecting malicious PDF using CNN

Near-optimal Regret Bounds for Stochastic Shortest Path

Planning in Hierarchical Reinforcement Learning: Guarantees for Using Local Policies

Reinforcement Learning with Feedback Graphs

Three Approaches for Personalization with Applications to Federated Learning

Unknown mixing times in apprenticeship and reinforcement learning

Thompson Sampling for Adversarial Bit Prediction

Delay and Cooperation in Nonstochastic Bandits

Dynamics of Evolving Social Groups

Label Efficient Learning by Exploiting Multi-class Output Codes

Online Learning with Low Rank Experts

Predicting Counterfactuals from Large Historical Data and Small Randomized Trials

When should an expert make a prediction?

Classification with Low Rank and Missing Data

Making the Most of Your Samples

Optimistic-Conservative Bidding in Sequential Auctions

Learning Valuation Distributions from Partial Observation

Learning What's going on: reconstructing preferences and priorities from opaque transactions

Local computation mechanism design

Nonstochastic Multi-Armed Bandits with Graph-Structured Feedback

On the Complexity of Learning with Kernels

Probe Scheduling for Efficient Detection of Silent Failures

A Local Computation Approximation Scheme to Maximum Matching

An Information-Theoretic Analysis of Hard and Soft Assignment Methods for Clustering

Exact Inference of Hidden Structure from Sample Data in Noisy-OR Networks

Fast Planning in Stochastic Games

From Bandits to Experts: A Tale of Domination and Independence

Nash Convergence of Gradient Dynamics in Iterated General-Sum Games

On the Complexity of Policy Iteration

Thompson Sampling for Complex Bandit Problems

Converting online algorithms to local computation algorithms

Distributed Learning, Communication Complexity and Privacy

Doubleclick Ad Exchange Auction

Efficient Nash Computation in Large Population Games with Bounded Influence

Multiple Source Adaptation and the Renyi Divergence

Planning in POMDPs Using Multiplicity Automata

The AND-OR game: Equilibrium Characterization (Working Paper)

Non-Price Equilibria in Markets of Discrete Goods

Welfare and Profit Maximization with Production Costs

Approximation Schemes for Sequential Posted Pricing in Multi-Unit Auctions

Selective Call Out and Real Time Bidding