Source author record

Haipeng Luo

Haipeng Luo appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher

Unclaimed source record

Machine Learning Computer Science and Game Theory Artificial Intelligence physics.app-ph physics.optics

Catalog footprint

What is connected

24 works

5 topics

4 close collaborators

preprint 2026 arXiv

Adaptive Calibration in Non-Stationary Environments

Making calibrated online predictions is a central challenge in modern AI systems. Much of the existing literature focuses on fully adversarial environments where outcomes may be arbitrary, leading to conservative algorithms that can perform suboptimally in more benign settings, such as when outcomes are nearly stationary. This gap raises a natural question: can we design online prediction algorithms whose calibration error automatically adapts to the degree of non-stationarity in the environment, smoothly interpolating between i.i.d. and adversarial regimes? We answer this question in the affirmative and develop a suite of algorithms that achieve adaptive calibration guarantees under multiple calibration measures. Specifically, with $T$ being the number of rounds and $C\in[0,T]$ being an unknown non-stationary measure defined as the minimal $\ell_1$ deviation of the mean outcomes, our algorithms attain $\widetilde{O}(\sqrt{T}+(TC)^{\frac{1}{3}})$ for $\ell_1$ calibration error and $\widetilde{O}((1+C)^{\frac{1}{3}})$ for both $\ell_2$ and pseudo KL calibration error. These bounds match the optimal rates in the stationary case ($C=0$) and recover known guarantees in the fully adversarial regime ($C=T$). Our approach builds on and extends prior work [Hu et al., 2026, Luo et al., 2025], introducing an epoch-based scheduling together with a novel non-uniform partition of the prediction space that allocates finer resolution near the underlying ground truth.
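To make the calibration measures in this abstract concrete, the following is a minimal sketch of how $\ell_1$ and $\ell_2$ calibration error could be computed for a finished sequence of binary forecasts. It assumes the common textbook definitions (group rounds by predicted value, then compare each predicted value with the average outcome in its group); the paper's exact definitions may differ, and the function name `calibration_errors` is hypothetical.

```python
from collections import defaultdict


def calibration_errors(predictions, outcomes):
    """Return (l1, l2) calibration error for binary outcomes.

    Assumed definitions (not taken from the paper): with n_p rounds on
    which the forecaster predicted p, and ybar_p the average outcome on
    those rounds,
        l1 = sum_p n_p * |p - ybar_p|
        l2 = sum_p n_p * (p - ybar_p) ** 2
    """
    count = defaultdict(int)
    outcome_sum = defaultdict(float)
    for p, y in zip(predictions, outcomes):
        count[p] += 1
        outcome_sum[p] += y

    l1 = 0.0
    l2 = 0.0
    for p, n in count.items():
        gap = p - outcome_sum[p] / n  # predicted value minus empirical frequency
        l1 += n * abs(gap)
        l2 += n * gap ** 2
    return l1, l2


# Forecasts of 0.5 are perfectly calibrated here; forecasts of 1.0 are not.
preds = [0.5, 0.5, 0.5, 0.5, 1.0, 1.0]
outs = [1, 0, 1, 0, 1, 0]
l1, l2 = calibration_errors(preds, outs)
print(l1, l2)
```

In this toy sequence only the two rounds with prediction 1.0 contribute, since their empirical frequency is 0.5.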

preprint 2026 arXiv

Near-Optimal Last-Iterate Convergence for Zero-Sum Games with Bandit Feedback and Opponent Actions

Last-iterate convergence of learning dynamics in games has attracted significant recent attention. In two-player zero-sum games with bandit feedback, where only the loss of the selected action pair is observed, Fiegel et al. (2025) show a separation between average-iterate and last-iterate convergence in duality gap: while the optimal $t^{-1/2}$ rate after $t$ rounds is achievable for the former via standard no-regret algorithms, the latter cannot converge faster than $t^{-1/3}$ in expectation or $t^{-1/4}$ with high probability. However, in many practical settings, such as preference learning, the players observe not only their loss but also the opponent's action. This raises a natural question: can such additional information enable faster last-iterate convergence? We answer this question affirmatively, showing that $t^{-1/2}$ last-iterate convergence is achievable with high probability in this setting, via an efficient algorithm that updates its strategy infrequently by solving an estimated log-barrier-regularized game. We identify fundamental obstacles preventing standard analysis for multi-armed bandits, the single-player case, from generalizing to games, and develop a novel analysis to overcome them. Experiments confirm that our algorithm indeed converges faster than naive baselines and prior methods that do not exploit opponent-action feedback. Finally, we note that our results also improve those for dueling bandits, a special case with skew-symmetric game matrices.

preprint 2022 arXiv

Adaptive Bandit Convex Optimization with Heterogeneous Curvature

We consider the problem of adversarial bandit convex optimization, that is, online learning over a sequence of arbitrary convex loss functions with only one function evaluation for each of them. While all previous works assume known and homogeneous curvature on these loss functions, we study a heterogeneous setting where each function has its own curvature that is only revealed after the learner makes a decision. We develop an efficient algorithm that is able to adapt to the curvature on the fly. Specifically, our algorithm not only recovers or even improves existing results for several homogeneous settings, but also leads to surprising results for some heterogeneous settings -- for example, while Hazan and Levy (2014) showed that $\widetilde{O}(d^{3/2}\sqrt{T})$ regret is achievable for a sequence of $T$ smooth and strongly convex $d$-dimensional functions, our algorithm reveals that the same is achievable even if $T^{3/4}$ of them are not strongly convex, and sometimes even if a constant fraction of them are not strongly convex. Our approach is inspired by the framework of Bartlett et al. (2007) who studied a similar heterogeneous setting but with stronger gradient feedback. Extending their framework to the bandit feedback setting requires novel ideas such as lifting the feasible domain and using a logarithmically homogeneous self-concordant barrier regularizer.

preprint 2022 arXiv

Clairvoyant Regret Minimization: Equivalence with Nemirovski's Conceptual Prox Method and Extension to General Convex Games

A recent paper by Piliouras et al. [2021, 2022] introduces an uncoupled learning algorithm for normal-form games -- called Clairvoyant MWU (CMWU). In this note we show that CMWU is equivalent to the conceptual prox method described by Nemirovski [2004]. This connection immediately shows that it is possible to extend the CMWU algorithm to any convex game, a question left open by Piliouras et al. We call the resulting algorithm -- again equivalent to the conceptual prox method -- Clairvoyant OMD. At the same time, we show that our analysis yields an improved regret bound compared to the original bound by Piliouras et al., in that the regret of CMWU scales only with the square root of the number of players, rather than the number of players themselves.

preprint 2022 arXiv

Corralling a Larger Band of Bandits: A Case Study on Switching Regret for Linear Bandits

We consider the problem of combining and learning over a set of adversarial bandit algorithms with the goal of adaptively tracking the best one on the fly. The CORRAL algorithm of Agarwal et al. (2017) and its variants (Foster et al., 2020a) achieve this goal with a regret overhead of order $\widetilde{O}(\sqrt{MT})$ where $M$ is the number of base algorithms and $T$ is the time horizon. The polynomial dependence on $M$, however, prevents one from applying these algorithms to many applications where $M$ is $\mathrm{poly}(T)$ or even larger. Motivated by this issue, we propose a new recipe to corral a larger band of bandit algorithms whose regret overhead has only logarithmic dependence on $M$ as long as some conditions are satisfied. As the main example, we apply our recipe to the problem of adversarial linear bandits over a $d$-dimensional $\ell_p$ unit-ball for $p \in (1,2]$. By corralling a large set of $T$ base algorithms, each starting at a different time step, our final algorithm achieves the first optimal switching regret $\widetilde{O}(\sqrt{d S T})$ when competing against a sequence of comparators with $S$ switches (for some known $S$). We further extend our results to linear bandits over a smooth and strongly convex domain as well as unconstrained linear bandits.

preprint 2022 arXiv

Kernelized Multiplicative Weights for 0/1-Polyhedral Games: Bridging the Gap Between Learning in Extensive-Form and Normal-Form Games

While extensive-form games (EFGs) can be converted into normal-form games (NFGs), doing so comes at the cost of an exponential blowup of the strategy space. So, progress on NFGs and EFGs has historically followed separate tracks, with the EFG community often having to catch up with advances (e.g., last-iterate convergence and predictive regret bounds) from the larger NFG community. In this paper we show that the Optimistic Multiplicative Weights Update (OMWU) algorithm -- the premier learning algorithm for NFGs -- can be simulated on the normal-form equivalent of an EFG in linear time per iteration in the game tree size using a kernel trick. The resulting algorithm, Kernelized OMWU (KOMWU), applies more broadly to all convex games whose strategy space is a polytope with 0/1 integral vertices, as long as the kernel can be evaluated efficiently. In the particular case of EFGs, KOMWU closes several standing gaps between NFG and EFG learning, by enabling direct, black-box transfer to EFGs of desirable properties of learning dynamics that were so far known to be achievable only in NFGs. Specifically, KOMWU gives the first algorithm that guarantees at the same time last-iterate convergence, lower dependence on the size of the game tree than all prior algorithms, and $\tilde{\mathcal{O}}(1)$ regret when followed by all players.

preprint 2022 arXiv

Learning Infinite-Horizon Average-Reward Markov Decision Processes with Constraints

We study regret minimization for infinite-horizon average-reward Markov Decision Processes (MDPs) under cost constraints. We start by designing a policy optimization algorithm with a carefully designed action-value estimator and bonus term, and show that for ergodic MDPs, our algorithm ensures $\widetilde{O}(\sqrt{T})$ regret and constant constraint violation, where $T$ is the total number of time steps. This strictly improves over the algorithm of Singh et al. (2020), whose regret and constraint violation are both $\widetilde{O}(T^{2/3})$. Next, we consider the most general class of weakly communicating MDPs. Through a finite-horizon approximation, we develop another algorithm with $\widetilde{O}(T^{2/3})$ regret and constraint violation, which can be further improved to $\widetilde{O}(\sqrt{T})$ via a simple modification, albeit making the algorithm computationally inefficient. As far as we know, these are the first set of provable algorithms for weakly communicating MDPs with cost constraints.

preprint 2022 arXiv

Near-Optimal Goal-Oriented Reinforcement Learning in Non-Stationary Environments

We initiate the study of dynamic regret minimization for goal-oriented reinforcement learning modeled by a non-stationary stochastic shortest path problem with changing cost and transition functions. We start by establishing a lower bound $\Omega((B_{\star} SAT_{\star}(\Delta_c + B_{\star}^2\Delta_P))^{1/3}K^{2/3})$, where $B_{\star}$ is the maximum expected cost of the optimal policy of any episode starting from any state, $T_{\star}$ is the maximum hitting time of the optimal policy of any episode starting from the initial state, $SA$ is the number of state-action pairs, $\Delta_c$ and $\Delta_P$ are the amount of changes of the cost and transition functions respectively, and $K$ is the number of episodes. The different roles of $\Delta_c$ and $\Delta_P$ in this lower bound inspire us to design algorithms that estimate costs and transitions separately. Specifically, assuming the knowledge of $\Delta_c$ and $\Delta_P$, we develop a simple but sub-optimal algorithm and another more involved minimax optimal algorithm (up to logarithmic terms). These algorithms combine the ideas of finite-horizon approximation [Chen et al., 2022a], special Bernstein-style bonuses of the MVP algorithm [Zhang et al., 2020], adaptive confidence widening [Wei and Luo, 2021], as well as some new techniques such as properly penalizing long-horizon policies. Finally, when $\Delta_c$ and $\Delta_P$ are unknown, we develop a variant of the MASTER algorithm [Wei and Luo, 2021] and integrate the aforementioned ideas into it to achieve $\widetilde{O}(\min\{B_{\star} S\sqrt{ALK}, (B_{\star}^2S^2AT_{\star}(\Delta_c+B_{\star}\Delta_P))^{1/3}K^{2/3}\})$ regret, where $L$ is the unknown number of changes of the environment.

preprint 2022 arXiv

No-Regret Learning in Time-Varying Zero-Sum Games

Learning from repeated play in a fixed two-player zero-sum game is a classic problem in game theory and online learning. We consider a variant of this problem where the game payoff matrix changes over time, possibly in an adversarial manner. We first present three performance measures to guide the algorithmic design for this problem: 1) the well-studied individual regret, 2) an extension of duality gap, and 3) a new measure called dynamic Nash Equilibrium regret, which quantifies the cumulative difference between the player's payoff and the minimax game value. Next, we develop a single parameter-free algorithm that simultaneously enjoys favorable guarantees under all these three performance measures. These guarantees are adaptive to different non-stationarity measures of the payoff matrices and, importantly, recover the best known results when the payoff matrix is fixed. Our algorithm is based on a two-layer structure with a meta-algorithm learning over a group of black-box base-learners satisfying a certain property, along with several novel ingredients specifically designed for the time-varying game setting. Empirical results further validate the effectiveness of our algorithm.

preprint 2022 arXiv

Policy Optimization for Stochastic Shortest Path

Policy optimization is among the most popular and successful reinforcement learning algorithms, and there is increasing interest in understanding its theoretical guarantees. In this work, we initiate the study of policy optimization for the stochastic shortest path (SSP) problem, a goal-oriented reinforcement learning model that strictly generalizes the finite-horizon model and better captures many applications. We consider a wide range of settings, including stochastic and adversarial environments under full information or bandit feedback, and propose a policy optimization algorithm for each setting that makes use of novel correction terms and/or variants of dilated bonuses (Luo et al., 2021). For most settings, our algorithm is shown to achieve a near-optimal regret bound. One key technical contribution of this work is a new approximation scheme to tackle SSP problems that we call stacked discounted approximation and use in all our proposed algorithms. Unlike the finite-horizon approximation that is heavily used in recent SSP algorithms, our new approximation enables us to learn a near-stationary policy with only logarithmic changes during an episode and could lead to an exponential improvement in space complexity.

preprint 2021 arXiv

Active Online Learning with Hidden Shifting Domains

Online machine learning systems need to adapt to domain shifts. Meanwhile, acquiring a label at every timestep is expensive. We propose a surprisingly simple algorithm that adaptively balances its regret and its number of label queries in settings where the data streams are from a mixture of hidden domains. For online linear regression with oblivious adversaries, we provide a tight tradeoff that depends on the durations and dimensionalities of the hidden domains. Our algorithm can adaptively deal with interleaving spans of inputs from different domains. We also generalize our results to non-linear regression for hypothesis classes with bounded eluder dimension and adaptive adversaries. Experiments on synthetic and realistic datasets demonstrate that our algorithm achieves lower regret than uniform queries and greedy queries with equal labeling budget.

preprint 2020 arXiv

A Closer Look at Small-loss Bounds for Bandits with Graph Feedback

We study small-loss bounds for adversarial multi-armed bandits with graph feedback, that is, adaptive regret bounds that depend on the loss of the best arm or related quantities, instead of the total number of rounds. We derive the first small-loss bound for general strongly observable graphs, resolving an open problem of Lykouris et al. (2018). Specifically, we develop an algorithm with regret $\tilde{\mathcal{O}}(\sqrt{\kappa L_*})$ where $\kappa$ is the clique partition number and $L_*$ is the loss of the best arm, and for the special case of self-aware graphs where every arm has a self-loop, we improve the regret to $\tilde{\mathcal{O}}(\min\{\sqrt{\alpha T}, \sqrt{\kappa L_*}\})$ where $\alpha\leq \kappa$ is the independence number. Our results significantly improve and extend those by Lykouris et al. (2018) who only consider self-aware undirected graphs. Furthermore, we also take the first attempt at deriving small-loss bounds for weakly observable graphs. We first prove that no typical small-loss bounds are achievable in this case, and then propose algorithms with alternative small-loss bounds in terms of the loss of some specific subset of arms. A surprising side result is that $\tilde{\mathcal{O}}(\sqrt{T})$ regret is achievable even for weakly observable graphs as long as the best arm has a self-loop. Our algorithms are based on the Online Mirror Descent framework but require a suite of novel techniques that might be of independent interest. Moreover, all our algorithms can be made parameter-free without the knowledge of the environment.

preprint 2020 arXiv

Comparator-adaptive Convex Bandits

We study bandit convex optimization methods that adapt to the norm of the comparator, a topic that has only been studied before for its full-information counterpart. Specifically, we develop convex bandit algorithms with regret bounds that are small whenever the norm of the comparator is small. We first use techniques from the full-information setting to develop comparator-adaptive algorithms for linear bandits. Then, we extend the ideas to convex bandits with Lipschitz or smooth loss functions, using a new single-point gradient estimator and carefully designed surrogate losses.

preprint 2020 arXiv

Model-free Reinforcement Learning in Infinite-horizon Average-reward Markov Decision Processes

Model-free reinforcement learning is known to be memory and computation efficient and more amenable to large-scale problems. In this paper, two model-free algorithms are introduced for learning infinite-horizon average-reward Markov Decision Processes (MDPs). The first algorithm reduces the problem to the discounted-reward version and achieves $\mathcal{O}(T^{2/3})$ regret after $T$ steps, under the minimal assumption of weakly communicating MDPs. To our knowledge, this is the first model-free algorithm for general MDPs in this setting. The second algorithm makes use of recent advances in adaptive algorithms for adversarial multi-armed bandits and improves the regret to $\mathcal{O}(\sqrt{T})$, albeit with a stronger ergodic assumption. This result significantly improves over the $\mathcal{O}(T^{3/4})$ regret achieved by the only existing model-free algorithm by Abbasi-Yadkori et al. (2019a) for ergodic MDPs in the infinite-horizon average-reward setting.

preprint 2020 arXiv

Open Problem: Model Selection for Contextual Bandits

In statistical learning, algorithms for model selection allow the learner to adapt to the complexity of the best hypothesis class in a sequence. We ask whether similar guarantees are possible for contextual bandit learning.

preprint 2019 arXiv

Efficient Electro-optical Tuning of Optical Frequency Microcomb on a Monolithically Integrated High-Q Lithium Niobate Microdisk

We demonstrate efficient tuning of a monolithically integrated lithium niobate (LN) microdisk optical frequency microcomb. Utilizing the high optical quality (Q) factor (i.e., $Q \sim 7.1 \times 10^6$) of the microdisk, the microcomb spans over a spectral bandwidth of ~200 nm at a pump power as low as 20.4 mW. Combining the large electro-optic coefficient of LN with an optimum design of the microelectrode geometry, we demonstrate electro-optical tuning of the comb with a spectral range of 400 pm and a tuning efficiency of ~38 pm per 100 V.

preprint 2016 arXiv

Improved Regret Bounds for Oracle-Based Adversarial Contextual Bandits

We give an oracle-based algorithm for the adversarial contextual bandit problem, where either contexts are drawn i.i.d. or the sequence of contexts is known a priori, but where the losses are picked adversarially. Our algorithm is computationally efficient, assuming access to an offline optimization oracle, and enjoys a regret of order $O((KT)^{\frac{2}{3}}(\log N)^{\frac{1}{3}})$, where $K$ is the number of actions, $T$ is the number of iterations and $N$ is the number of baseline policies. Our result is the first to break the $O(T^{\frac{3}{4}})$ barrier that is achieved by recently introduced algorithms. Breaking this barrier was left as a major open problem. Our analysis is based on the recent relaxation based approach of (Rakhlin and Sridharan, 2016).

preprint 2015 arXiv

Achieving All with No Parameters: Adaptive NormalHedge

We study the classic online learning problem of predicting with expert advice, and propose a truly parameter-free and adaptive algorithm that achieves several objectives simultaneously without using any prior information. The main component of this work is an improved version of the NormalHedge.DT algorithm (Luo and Schapire, 2014), called AdaNormalHedge. On one hand, this new algorithm ensures small regret when the competitor has small loss and almost constant regret when the losses are stochastic. On the other hand, the algorithm is able to compete with any convex combination of the experts simultaneously, with a regret in terms of the relative entropy of the prior and the competitor. This resolves an open problem proposed by Chaudhuri et al. (2009) and Chernov and Vovk (2010). Moreover, we extend the results to the sleeping expert setting and provide two applications to illustrate the power of AdaNormalHedge: 1) competing with time-varying unknown competitors and 2) predicting almost as well as the best pruning tree. Our results on these applications significantly improve previous work from different aspects, and a special case of the first application resolves another open problem proposed by Warmuth and Koolen (2014) on whether one can simultaneously achieve optimal shifting regret for both adversarial and stochastic losses.

preprint 2015 arXiv

Fast Convergence of Regularized Learning in Games

We show that natural classes of regularized learning algorithms with a form of recency bias achieve faster convergence rates to approximate efficiency and to coarse correlated equilibria in multiplayer normal form games. When each player in a game uses an algorithm from our class, their individual regret decays at $O(T^{-3/4})$, while the sum of utilities converges to an approximate optimum at $O(T^{-1})$--an improvement upon the worst case $O(T^{-1/2})$ rates. We show a black-box reduction for any algorithm in the class to achieve $\tilde{O}(T^{-1/2})$ rates against an adversary, while maintaining the faster rates against algorithms in the class. Our results extend those of [Rakhlin and Sridharan 2013] and [Daskalakis et al. 2014], who only analyzed two-player zero-sum games for specific algorithms.
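As an illustration of what "recency bias" can look like in practice, here is a minimal sketch of optimistic Hedge run in self-play on a small two-player zero-sum game. It is an assumption that this is a representative member of the class the abstract describes; the payoff matrix, the step size `eta`, and the function names are all hypothetical choices made for the example.

```python
import numpy as np


def softmax_neg(scores):
    """Distribution proportional to exp(-scores), computed stably."""
    w = np.exp(-(scores - scores.min()))
    return w / w.sum()


def optimistic_hedge_selfplay(A, rounds=2000, eta=0.1):
    """Both players run optimistic Hedge on the zero-sum game x^T A y.

    The row player minimizes and the column player maximizes. The recency
    bias is the extra `last_*` term: each player counts its most recent
    loss vector twice, as a guess for the loss it is about to see.
    Returns the time-averaged strategies of both players.
    """
    n, m = A.shape
    cum_x, cum_y = np.zeros(n), np.zeros(m)    # cumulative loss vectors
    last_x, last_y = np.zeros(n), np.zeros(m)  # most recent loss vectors
    avg_x, avg_y = np.zeros(n), np.zeros(m)
    for _ in range(rounds):
        x = softmax_neg(eta * (cum_x + last_x))
        y = softmax_neg(eta * (cum_y + last_y))
        last_x = A @ y       # row player's loss for each pure action
        last_y = -(A.T @ x)  # column player's loss is the negated payoff
        cum_x += last_x
        cum_y += last_y
        avg_x += x / rounds
        avg_y += y / rounds
    return avg_x, avg_y


def duality_gap(A, x, y):
    """Best-response payoff against x minus best-response payoff against y."""
    return float((A.T @ x).max() - (A @ y).min())


A = np.array([[2.0, -1.0], [-1.0, 1.0]])  # hypothetical payoff matrix
x_bar, y_bar = optimistic_hedge_selfplay(A)
print(x_bar, y_bar, duality_gap(A, x_bar, y_bar))
```

The duality gap of the averaged strategies equals the sum of the two players' regrets divided by the number of rounds, so a small gap reflects small total regret.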

preprint 2015 arXiv

Online Gradient Boosting

We extend the theory of boosting for regression problems to the online learning setting. Generalizing from the batch setting for boosting, the notion of a weak learning algorithm is modeled as an online learning algorithm with linear loss functions that competes with a base class of regression functions, while a strong learning algorithm is an online learning algorithm with convex loss functions that competes with a larger class of regression functions. Our main result is an online gradient boosting algorithm which converts a weak online learning algorithm into a strong one where the larger class of functions is the linear span of the base class. We also give a simpler boosting algorithm that converts a weak online learning algorithm into a strong one where the larger class of functions is the convex hull of the base class, and prove its optimality.

preprint 2015 arXiv

Optimal and Adaptive Algorithms for Online Boosting

We study online boosting, the task of converting any weak online learner into a strong online learner. Based on a novel and natural definition of weak online learnability, we develop two online boosting algorithms. The first algorithm is an online version of boost-by-majority. By proving a matching lower bound, we show that this algorithm is essentially optimal in terms of the number of weak learners and the sample complexity needed to achieve a specified accuracy. This optimal algorithm is not adaptive, however. Using tools from online loss minimization, we derive an adaptive online boosting algorithm that is also parameter-free, but not optimal. Both algorithms work with base learners that can handle example importance weights directly, as well as by rejection sampling examples with probability defined by the booster. Results are complemented with an extensive experimental study.

preprint 2014 arXiv

A Drifting-Games Analysis for Online Learning and Applications to Boosting

We provide a general mechanism to design online learning algorithms based on a minimax analysis within a drifting-games framework. Different online learning settings (Hedge, multi-armed bandit problems and online convex optimization) are studied by converting into various kinds of drifting games. The original minimax analysis for drifting games is then used and generalized by applying a series of relaxations, starting from choosing a convex surrogate of the 0-1 loss function. With different choices of surrogates, we not only recover existing algorithms, but also propose new algorithms that are totally parameter-free and enjoy other useful properties. Moreover, our drifting-games framework naturally allows us to study high probability bounds without resorting to any concentration results, and also a generalized notion of regret that measures how good the algorithm is compared to all but the top small fraction of candidates. Finally, we translate our new Hedge algorithm into a new adaptive boosting algorithm that is computationally faster as shown in experiments, since it ignores a large number of examples on each round.

preprint 2014 arXiv

Accelerated Parallel Optimization Methods for Large Scale Machine Learning

The growing amount of high dimensional data in different machine learning applications requires more efficient and scalable optimization algorithms. In this work, we consider combining two techniques, parallelism and Nesterov's acceleration, to design faster algorithms for L1-regularized loss. We first simplify BOOM, a variant of gradient descent, and study it in a unified framework, which allows us to not only propose a refined measurement of sparsity to improve BOOM, but also show that BOOM is provably slower than FISTA. Moving on to parallel coordinate descent methods, we then propose an efficient accelerated version of Shotgun, improving the convergence rate from $O(1/t)$ to $O(1/t^2)$. Our algorithm enjoys a concise form and analysis compared to previous work, and also allows one to study several related works in a unified way.
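For readers unfamiliar with the FISTA baseline named in this abstract, the following is a minimal sketch of FISTA applied to L1-regularized least squares. It is the standard sequential method, not the paper's BOOM variant or accelerated Shotgun; the synthetic data, the regularization weight `lam`, and the function names are hypothetical.

```python
import numpy as np


def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1: shrink each entry toward zero."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)


def lasso_objective(X, y, w, lam):
    """0.5 * ||Xw - y||^2 + lam * ||w||_1."""
    r = X @ w - y
    return 0.5 * float(r @ r) + lam * float(np.abs(w).sum())


def fista(X, y, lam, iters=200):
    """FISTA with constant step 1/L for the lasso objective above."""
    L = np.linalg.norm(X, 2) ** 2  # Lipschitz constant of the smooth part's gradient
    w = np.zeros(X.shape[1])
    z = w.copy()                   # extrapolated point
    t = 1.0
    for _ in range(iters):
        grad = X.T @ (X @ z - y)
        w_next = soft_threshold(z - grad / L, lam / L)
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        # Nesterov-style momentum: extrapolate along the last step.
        z = w_next + ((t - 1.0) / t_next) * (w_next - w)
        w, t = w_next, t_next
    return w


# Synthetic sparse regression problem (hypothetical data).
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 20))
w_true = np.zeros(20)
w_true[:3] = [2.0, -1.5, 1.0]
y = X @ w_true + 0.01 * rng.standard_normal(50)

lam = 0.5
w_hat = fista(X, y, lam)
print(lasso_objective(X, y, np.zeros(20), lam), lasso_objective(X, y, w_hat, lam))
```

The momentum term is what distinguishes FISTA from plain proximal gradient descent and is the source of the $O(1/t^2)$ rate for this class of problems.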

preprint 2013 arXiv

Towards Minimax Online Learning with Unknown Time Horizon

We consider online learning when the time horizon is unknown. We apply a minimax analysis, beginning with the fixed horizon case, and then moving on to two unknown-horizon settings, one that assumes the horizon is chosen randomly according to some known distribution, and the other which allows the adversary full control over the horizon. For the random horizon setting with restricted losses, we derive a fully optimal minimax algorithm. And for the adversarial horizon setting, we prove a nontrivial lower bound which shows that the adversary obtains strictly more power than when the horizon is fixed and known. Based on the minimax solution of the random horizon setting, we then propose a new adaptive algorithm which "pretends" that the horizon is drawn from a distribution from a special family, but no matter how the actual horizon is chosen, the worst-case regret is of the optimal rate. Furthermore, our algorithm can be combined and applied in many ways, for instance, to online convex optimization, follow the perturbed leader, exponential weights algorithm and first order bounds. Experiments show that our algorithm outperforms many other existing algorithms in an online linear optimization setting.

Institution

Affiliation not imported yet

This author record came from a source that does not expose affiliation metadata. Once the author claims the profile or we enrich the record from another provider, this section will link to the concrete institution.

Source provenance

Where this author record came from

arxiv, confidence 95%

external id: arxiv:2605.09363:author:3:haipeng-luo

Imported May 20, 2026. Synced May 21, 2026.

arxiv, confidence 95%

external id: arxiv:2605.11490:author:2:haipeng-luo

Imported May 20, 2026. Synced May 21, 2026.

5 works

Mengxiao Zhang

Researcher

Mengxiao Zhang contributes to research discovery and scholarly infrastructure.

5 works

Robert E. Schapire

Researcher

Robert E. Schapire contributes to research discovery and scholarly infrastructure.

3 works

Chung-Wei Lee

Researcher

Chung-Wei Lee contributes to research discovery and scholarly infrastructure.

3 works

Liyu Chen

Researcher

Liyu Chen contributes to research discovery and scholarly infrastructure.
