Researcher profile

Yu-Xiang Wang

Yu-Xiang Wang contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
28works
0followers
10topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

28 published item(s)

preprint2026arXiv

Dataset Watermarking for Closed LLMs with Provable Detection

Large language models (LLMs) are pre-trained and post-trained on vast amounts of loosely curated data, raising the possibility that these models may have been trained on proprietary datasets or the same benchmarks used for evaluation. This motivates the need for dataset watermarking: designing datasets such that training on them leaves detectable signatures in the resulting model. Prior work has explored this problem for open models. We introduce the first dataset watermarking method for closed LLMs with provable detection. In particular, we embed a dataset-level watermark signal by increasing the co-occurrence frequency of randomly selected word pairs through rephrasing, and detect it using a statistical test on co-occurrence patterns in model-generated outputs. We evaluate our method with multiple base models and benchmark datasets and show that it reliably detects the watermark ($p <0.01$) in the fine-tuning stage. Notably, our method remains effective in a data mixture setting where the watermarked dataset constitutes only approximately $1\%$ of the total fine-tuning tokens. Furthermore, we show that our method preserves the utility and semantic integrity of the benchmark.

preprint2026arXiv

Optimal Contextual Pricing under Agnostic Non-Lipschitz Demand

We study contextual dynamic pricing with linear valuations and bounded-support agnostic noise, whose induced demand curve may be non-Lipschitz with arbitrary jumps and atoms. Such discontinuities break the cross-context interpolation arguments used by smooth-demand pricing algorithms, while the best previous method achieved only $\tilde O(T^{3/4})$ regret. We propose Conservative-Markdown Redirect-UCB Pricing, a polynomial-time algorithm that combines randomized parameter estimation, conservative residual-grid probing, and confidence-based one-step redirection. Our algorithm achieves $\tilde O(T^{2/3})$ optimal regret, matching the known lower bounds of Kleinberg and Leighton (2003) up to logarithmic factors and improving over the previous upper bound of Xu and Wang (2022). Under stochastic well-conditioned contexts, this closes the long-existing open regret gap in linear-valuation contextual pricing under agnostic non-Lipschitz noise distribution.

preprint2026arXiv

Stable Minima of ReLU Neural Networks Suffer from the Curse of Dimensionality: The Neural Shattering Phenomenon

We study the implicit bias of flatness / low (loss) curvature and its effects on generalization in two-layer overparameterized ReLU networks with multivariate inputs -- a problem well motivated by the minima stability and edge-of-stability phenomena in gradient-descent training. Existing work either requires interpolation or focuses only on univariate inputs. This paper presents new and somewhat surprising theoretical results for multivariate inputs. On two natural settings (1) generalization gap for flat solutions, and (2) mean-squared error (MSE) in nonparametric function estimation by stable minima, we prove upper and lower bounds, which establish that while flatness does imply generalization, the resulting rates of convergence necessarily deteriorate exponentially as the input dimension grows. This gives an exponential separation between the flat solutions compared to low-norm solutions (i.e., weight decay), which are known not to suffer from the curse of dimensionality. In particular, our minimax lower bound construction, based on a novel packing argument with boundary-localized ReLU neurons, reveals how flat solutions can exploit a kind of &#34;neural shattering&#34; where neurons rarely activate, but with high weight magnitudes. This leads to poor performance in high dimensions. We corroborate these theoretical findings with extensive numerical simulations. To the best of our knowledge, our analysis provides the first systematic explanation for why flat minima may fail to generalize in high dimensions.

preprint2023arXiv

Improving the Privacy and Practicality of Objective Perturbation for Differentially Private Linear Learners

In the arena of privacy-preserving machine learning, differentially private stochastic gradient descent (DP-SGD) has outstripped the objective perturbation mechanism in popularity and interest. Though unrivaled in versatility, DP-SGD requires a non-trivial privacy overhead (for privately tuning the model&#39;s hyperparameters) and a computational complexity which might be extravagant for simple models such as linear and logistic regression. This paper revamps the objective perturbation mechanism with tighter privacy analyses and new computational tools that boost it to perform competitively with DP-SGD on unconstrained convex generalized linear problems.

preprint2023arXiv

Offline Reinforcement Learning with Differential Privacy

The offline reinforcement learning (RL) problem is often motivated by the need to learn data-driven decision policies in financial, legal and healthcare applications. However, the learned policy could retain sensitive information of individuals in the training data (e.g., treatment and outcome of patients), thus susceptible to various privacy risks. We design offline RL algorithms with differential privacy guarantees which provably prevent such risks. These algorithms also enjoy strong instance-dependent learning bounds under both tabular and linear Markov decision process (MDP) settings. Our theory and simulation suggest that the privacy guarantee comes at (almost) no drop in utility comparing to the non-private counterpart for a medium-size dataset.

preprint2022arXiv

Adaptive Private-K-Selection with Adaptive K and Application to Multi-label PATE

We provide an end-to-end Renyi DP based-framework for differentially private top-$k$ selection. Unlike previous approaches, which require a data-independent choice on $k$, we propose to privately release a data-dependent choice of $k$ such that the gap between $k$-th and the $(k+1)$st &#34;quality&#34; is large. This is achieved by a novel application of the Report-Noisy-Max. Not only does this eliminate one hyperparameter, the adaptive choice of $k$ also certifies the stability of the top-$k$ indices in the unordered set so we can release them using a variant of propose-test-release (PTR) without adding noise. We show that our construction improves the privacy-utility trade-offs compared to the previous top-$k$ selection algorithms theoretically and empirically. Additionally, we apply our algorithm to &#34;Private Aggregation of Teacher Ensembles (PATE)&#34; in multi-label classification tasks with a large number of labels and show that it leads to significant performance gains.

preprint2022arXiv

Doubly Robust Crowdsourcing

Large-scale labeled dataset is the indispensable fuel that ignites the AI revolution as we see today. Most such datasets are constructed using crowdsourcing services such as Amazon Mechanical Turk which provides noisy labels from non-experts at a fair price. The sheer size of such datasets mandates that it is only feasible to collect a few labels per data point. We formulate the problem of test-time label aggregation as a statistical estimation problem of inferring the expected voting score. By imitating workers with supervised learners and using them in a doubly robust estimation framework, we prove that the variance of estimation can be substantially reduced, even if the learner is a poor approximation. Synthetic and real-world experiments show that by combining the doubly robust approach with adaptive worker/item selection rules, we often need much lower label cost to achieve nearly the same accuracy as in the ideal world where all workers label all data points.

preprint2022arXiv

Generalized PTR: User-Friendly Recipes for Data-Adaptive Algorithms with Differential Privacy

The &#39;&#39;Propose-Test-Release&#39;&#39; (PTR) framework is a classic recipe for designing differentially private (DP) algorithms that are data-adaptive, i.e. those that add less noise when the input dataset is nice. We extend PTR to a more general setting by privately testing data-dependent privacy losses rather than local sensitivity, hence making it applicable beyond the standard noise-adding mechanisms, e.g. to queries with unbounded or undefined sensitivity. We demonstrate the versatility of generalized PTR using private linear regression as a case study. Additionally, we apply our algorithm to solve an open problem from &#39;&#39;Private Aggregation of Teacher Ensembles (PATE)&#39;&#39; -- privately releasing the entire model with a delicate data-dependent analysis.

preprint2022arXiv

Mixed Differential Privacy in Computer Vision

We introduce AdaMix, an adaptive differentially private algorithm for training deep neural network classifiers using both private and public image data. While pre-training language models on large public datasets has enabled strong differential privacy (DP) guarantees with minor loss of accuracy, a similar practice yields punishing trade-offs in vision tasks. A few-shot or even zero-shot learning baseline that ignores private data can outperform fine-tuning on a large private dataset. AdaMix incorporates few-shot training, or cross-modal zero-shot learning, on public data prior to private fine-tuning, to improve the trade-off. AdaMix reduces the error increase from the non-private upper bound from the 167-311\% of the baseline, on average across 6 datasets, to 68-92\% depending on the desired privacy level selected by the user. AdaMix tackles the trade-off arising in visual classification, whereby the most privacy sensitive data, corresponding to isolated points in representation space, are also critical for high classification accuracy. In addition, AdaMix comes with strong theoretical privacy guarantees and convergence analysis.

preprint2022arXiv

Near-optimal Offline Reinforcement Learning with Linear Representation: Leveraging Variance Information with Pessimism

Offline reinforcement learning, which seeks to utilize offline/historical data to optimize sequential decision-making strategies, has gained surging prominence in recent studies. Due to the advantage that appropriate function approximators can help mitigate the sample complexity burden in modern reinforcement learning problems, existing endeavors usually enforce powerful function representation models (e.g. neural networks) to learn the optimal policies. However, a precise understanding of the statistical limits with function representations, remains elusive, even when such a representation is linear. Towards this goal, we study the statistical limits of offline reinforcement learning with linear model representations. To derive the tight offline learning bound, we design the variance-aware pessimistic value iteration (VAPVI), which adopts the conditional variance information of the value function for time-inhomogeneous episodic linear Markov decision processes (MDPs). VAPVI leverages estimated variances of the value functions to reweight the Bellman residuals in the least-square pessimistic value iteration and provides improved offline learning bounds over the best-known existing results (whereas the Bellman residuals are equally weighted by design). More importantly, our learning bounds are expressed in terms of system quantities, which provide natural instance-dependent characterizations that previous results are short of. We hope our results draw a clearer picture of what offline learning should look like when linear representations are provided.

preprint2022arXiv

Offline Stochastic Shortest Path: Learning, Evaluation and Towards Optimality

Goal-oriented Reinforcement Learning, where the agent needs to reach the goal state while simultaneously minimizing the cost, has received significant attention in real-world applications. Its theoretical formulation, stochastic shortest path (SSP), has been intensively researched in the online setting. Nevertheless, it remains understudied when such an online interaction is prohibited and only historical data is provided. In this paper, we consider the offline stochastic shortest path problem when the state space and the action space are finite. We design the simple value iteration-based algorithms for tackling both offline policy evaluation (OPE) and offline policy learning tasks. Notably, our analysis of these simple algorithms yields strong instance-dependent bounds which can imply worst-case bounds that are near-minimax optimal. We hope our study could help illuminate the fundamental statistical limits of the offline SSP problem and motivate further studies beyond the scope of current consideration.

preprint2022arXiv

Optimal Accounting of Differential Privacy via Characteristic Function

Characterizing the privacy degradation over compositions, i.e., privacy accounting, is a fundamental topic in differential privacy (DP) with many applications to differentially private machine learning and federated learning. We propose a unification of recent advances (Renyi DP, privacy profiles, $f$-DP and the PLD formalism) via the \emph{characteristic function} ($ϕ$-function) of a certain \emph{dominating} privacy loss random variable. We show that our approach allows \emph{natural} adaptive composition like Renyi DP, provides \emph{exactly tight} privacy accounting like PLD, and can be (often \emph{losslessly}) converted to privacy profile and $f$-DP, thus providing $(ε,δ)$-DP guarantees and interpretable tradeoff functions. Algorithmically, we propose an \emph{analytical Fourier accountant} that represents the \emph{complex} logarithm of $ϕ$-functions symbolically and uses Gaussian quadrature for numerical computation. On several popular DP mechanisms and their subsampled counterparts, we demonstrate the flexibility and tightness of our approach in theory and experiments.

preprint2022arXiv

Optimal Dynamic Regret in LQR Control

We consider the problem of nonstochastic control with a sequence of quadratic losses, i.e., LQR control. We provide an efficient online algorithm that achieves an optimal dynamic (policy) regret of $\tilde{O}(\text{max}\{n^{1/3} \mathcal{TV}(M_{1:n})^{2/3}, 1\})$, where $\mathcal{TV}(M_{1:n})$ is the total variation of any oracle sequence of Disturbance Action policies parameterized by $M_1,...,M_n$ -- chosen in hindsight to cater to unknown nonstationarity. The rate improves the best known rate of $\tilde{O}(\sqrt{n (\mathcal{TV}(M_{1:n})+1)} )$ for general convex losses and we prove that it is information-theoretically optimal for LQR. Main technical components include the reduction of LQR to online linear regression with delayed feedback due to Foster and Simchowitz (2020), as well as a new proper learning algorithm with an optimal $\tilde{O}(n^{1/3})$ dynamic regret on a family of ``minibatched&#39;&#39; quadratic losses, which could be of independent interest.

preprint2022arXiv

Optimal Dynamic Regret in Proper Online Learning with Strongly Convex Losses and Beyond

We study the framework of universal dynamic regret minimization with strongly convex losses. We answer an open problem in Baby and Wang 2021 by showing that in a proper learning setup, Strongly Adaptive algorithms can achieve the near optimal dynamic regret of $\tilde O(d^{1/3} n^{1/3}\text{TV}[u_{1:n}]^{2/3} \vee d)$ against any comparator sequence $u_1,\ldots,u_n$ simultaneously, where $n$ is the time horizon and $\text{TV}[u_{1:n}]$ is the Total Variation of comparator. These results are facilitated by exploiting a number of new structures imposed by the KKT conditions that were not considered in Baby and Wang 2021 which also lead to other improvements over their results such as: (a) handling non-smooth losses and (b) improving the dimension dependence on regret. Further, we also derive near optimal dynamic regret rates for the special case of proper online learning with exp-concave losses and an $L_\infty$ constrained decision set.

preprint2022arXiv

Provably Confidential Language Modelling

Large language models are shown to memorize privacy information such as social security numbers in training data. Given the sheer scale of the training corpus, it is challenging to screen and filter these privacy data, either manually or automatically. In this paper, we propose Confidentially Redacted Training (CRT), a method to train language generation models while protecting the confidential segments. We borrow ideas from differential privacy (which solves a related but distinct problem) and show that our method is able to provably prevent unintended memorization by randomizing parts of the training process. Moreover, we show that redaction with an approximately correct screening policy amplifies the confidentiality guarantee. We implement the method for both LSTM and GPT language models. Our experimental results show that the models trained by CRT obtain almost the same perplexity while preserving strong confidentiality.

preprint2022arXiv

Revisiting Model-Agnostic Private Learning: Faster Rates and Active Learning

The Private Aggregation of Teacher Ensembles (PATE) framework is one of the most promising recent approaches in differentially private learning. Existing theoretical analysis shows that PATE consistently learns any VC-classes in the realizable setting, but falls short in explaining its success in more general cases where the error rate of the optimal classifier is bounded away from zero. We fill in this gap by introducing the Tsybakov Noise Condition (TNC) and establish stronger and more interpretable learning bounds. These bounds provide new insights into when PATE works and improve over existing results even in the narrower realizable setting. We also investigate the compelling idea of using active learning for saving privacy budget, and empirical studies show the effectiveness of this new idea. The novel components in the proofs include a more refined analysis of the majority voting classifier - which could be of independent interest - and an observation that the synthetic &#34;student&#34; learning problem is nearly realizable by construction under the Tsybakov noise condition.

preprint2022arXiv

Sample-Efficient Reinforcement Learning with loglog(T) Switching Cost

We study the problem of reinforcement learning (RL) with low (policy) switching cost - a problem well-motivated by real-life RL applications in which deployments of new policies are costly and the number of policy updates must be low. In this paper, we propose a new algorithm based on stage-wise exploration and adaptive policy elimination that achieves a regret of $\widetilde{O}(\sqrt{H^4S^2AT})$ while requiring a switching cost of $O(HSA \log\log T)$. This is an exponential improvement over the best-known switching cost $O(H^2SA\log T)$ among existing methods with $\widetilde{O}(\mathrm{poly}(H,S,A)\sqrt{T})$ regret. In the above, $S,A$ denotes the number of states and actions in an $H$-horizon episodic Markov Decision Process model with unknown transitions, and $T$ is the number of steps. As a byproduct of our new techniques, we also derive a reward-free exploration algorithm with a switching cost of $O(HSA)$. Furthermore, we prove a pair of information-theoretical lower bounds which say that (1) Any no-regret algorithm must have a switching cost of $Ω(HSA)$; (2) Any $\widetilde{O}(\sqrt{T})$ regret algorithm must incur a switching cost of $Ω(HSA\log\log T)$. Both our algorithms are thus optimal in their switching costs.

preprint2022arXiv

Towards Agnostic Feature-based Dynamic Pricing: Linear Policies vs Linear Valuation with Unknown Noise

In feature-based dynamic pricing, a seller sets appropriate prices for a sequence of products (described by feature vectors) on the fly by learning from the binary outcomes of previous sales sessions (&#34;Sold&#34; if valuation $\geq$ price, and &#34;Not Sold&#34; otherwise). Existing works either assume noiseless linear valuation or precisely-known noise distribution, which limits the applicability of those algorithms in practice when these assumptions are hard to verify. In this work, we study two more agnostic models: (a) a &#34;linear policy&#34; problem where we aim at competing with the best linear pricing policy while making no assumptions on the data, and (b) a &#34;linear noisy valuation&#34; problem where the random valuation is linear plus an unknown and assumption-free noise. For the former model, we show a $\tildeΘ(d^{\frac13}T^{\frac23})$ minimax regret up to logarithmic factors. For the latter model, we present an algorithm that achieves an $\tilde{O}(T^{\frac34})$ regret, and improve the best-known lower bound from $Ω(T^{\frac35})$ to $\tildeΩ(T^{\frac23})$. These results demonstrate that no-regret learning is possible for feature-based dynamic pricing under weak assumptions, but also reveal a disappointing fact that the seemingly richer pricing feedback is not significantly more useful than the bandit-feedback in regret reduction.

preprint2022arXiv

Towards Differential Relational Privacy and its use in Question Answering

Memorization of the relation between entities in a dataset can lead to privacy issues when using a trained model for question answering. We introduce Relational Memorization (RM) to understand, quantify and control this phenomenon. While bounding general memorization can have detrimental effects on the performance of a trained model, bounding RM does not prevent effective learning. The difference is most pronounced when the data distribution is long-tailed, with many queries having only few training examples: Impeding general memorization prevents effective learning, while impeding only relational memorization still allows learning general properties of the underlying concepts. We formalize the notion of Relational Privacy (RP) and, inspired by Differential Privacy (DP), we provide a possible definition of Differential Relational Privacy (DrP). These notions can be used to describe and compute bounds on the amount of RM in a trained model. We illustrate Relational Privacy concepts in experiments with large-scale models for Question Answering.

preprint2021arXiv

An Optimal Reduction of TV-Denoising to Adaptive Online Learning

We consider the problem of estimating a function from $n$ noisy samples whose discrete Total Variation (TV) is bounded by $C_n$. We reveal a deep connection to the seemingly disparate problem of Strongly Adaptive online learning (Daniely et al, 2015) and provide an $O(n \log n)$ time algorithm that attains the near minimax optimal rate of $\tilde O (n^{1/3}C_n^{2/3})$ under squared error loss. The resulting algorithm runs online and optimally adapts to the unknown smoothness parameter $C_n$. This leads to a new and more versatile alternative to wavelets-based methods for (1) adaptively estimating TV bounded functions; (2) online forecasting of TV bounded trends in time series.

preprint2021arXiv

Near-Optimal Offline Reinforcement Learning via Double Variance Reduction

We consider the problem of offline reinforcement learning (RL) -- a well-motivated setting of RL that aims at policy optimization using only historical data. Despite its wide applicability, theoretical understandings of offline RL, such as its optimal sample complexity, remain largely open even in basic settings such as \emph{tabular} Markov Decision Processes (MDPs). In this paper, we propose Off-Policy Double Variance Reduction (OPDVR), a new variance reduction based algorithm for offline RL. Our main result shows that OPDVR provably identifies an $ε$-optimal policy with $\widetilde{O}(H^2/d_mε^2)$ episodes of offline data in the finite-horizon stationary transition setting, where $H$ is the horizon length and $d_m$ is the minimal marginal state-action distribution induced by the behavior policy. This improves over the best known upper bound by a factor of $H$. Moreover, we establish an information-theoretic lower bound of $Ω(H^2/d_mε^2)$ which certifies that OPDVR is optimal up to logarithmic factors. Lastly, we show that OPDVR also achieves rate-optimal sample complexity under alternative settings such as the finite-horizon MDPs with non-stationary transitions and the infinite horizon MDPs with discounted rewards.

preprint2021arXiv

Understanding the 2016 US Presidential Election using ecological inference and distribution regression with census microdata

We combine fine-grained spatially referenced census data with the vote outcomes from the 2016 US presidential election. Using this dataset, we perform ecological inference using distribution regression (Flaxman et al, KDD 2015) with a multinomial-logit regression so as to model the vote outcome Trump, Clinton, Other / Didn&#39;t vote as a function of demographic and socioeconomic features. Ecological inference allows us to estimate &#34;exit poll&#34; style results like what was Trump&#39;s support among white women, but for entirely novel categories. We also perform exploratory data analysis to understand which census variables are predictive of voting for Trump, voting for Clinton, or not voting for either. All of our methods are implemented in Python and R, and are available online for replication.

preprint2021arXiv

Voting-based Approaches For Differentially Private Federated Learning

Differentially Private Federated Learning (DPFL) is an emerging field with many applications. Gradient averaging based DPFL methods require costly communication rounds and hardly work with large-capacity models, due to the explicit dimension dependence in its added noise. In this work, inspired by knowledge transfer non-federated privacy learning from Papernot et al.(2017; 2018), we design two new DPFL schemes, by voting among the data labels returned from each local model, instead of averaging the gradients, which avoids the dimension dependence and significantly reduces the communication cost. Theoretically, by applying secure multi-party computation, we could exponentially amplify the (data-dependent) privacy guarantees when the margin of the voting scores are large. Extensive experiments show that our approaches significantly improve the privacy-utility trade-off over the state-of-the-arts in DPFL.

preprint2020arXiv

An end-to-end Differentially Private Latent Dirichlet Allocation Using a Spectral Algorithm

We provide an end-to-end differentially private spectral algorithm for learning LDA, based on matrix/tensor decompositions, and establish theoretical guarantees on utility/consistency of the estimated model parameters. The spectral algorithm consists of multiple algorithmic steps, named as &#34;{edges}&#34;, to which noise could be injected to obtain differential privacy. We identify \emph{subsets of edges}, named as &#34;{configurations}&#34;, such that adding noise to all edges in such a subset guarantees differential privacy of the end-to-end spectral algorithm. We characterize the sensitivity of the edges with respect to the input and thus estimate the amount of noise to be added to each edge for any required privacy level. We then characterize the utility loss for each configuration as a function of injected noise. Overall, by combining the sensitivity and utility characterization, we obtain an end-to-end differentially private spectral algorithm for LDA and identify the corresponding configuration that outperforms others in any specific regime. We are the first to achieve utility guarantees under the required level of differential privacy for learning in LDA. Overall our method systematically outperforms differentially private variational inference.

preprint2020arXiv

Asymptotically Efficient Off-Policy Evaluation for Tabular Reinforcement Learning

We consider the problem of off-policy evaluation for reinforcement learning, where the goal is to estimate the expected reward of a target policy $π$ using offline data collected by running a logging policy $μ$. Standard importance-sampling based approaches for this problem suffer from a variance that scales exponentially with time horizon $H$, which motivates a splurge of recent interest in alternatives that break the &#34;Curse of Horizon&#34; (Liu et al. 2018, Xie et al. 2019). In particular, it was shown that a marginalized importance sampling (MIS) approach can be used to achieve an estimation error of order $O(H^3/ n)$ in mean square error (MSE) under an episodic Markov Decision Process model with finite states and potentially infinite actions. The MSE bound however is still a factor of $H$ away from a Cramer-Rao lower bound of order $Ω(H^2/n)$. In this paper, we prove that with a simple modification to the MIS estimator, we can asymptotically attain the Cramer-Rao lower bound, provided that the action space is finite. We also provide a general method for constructing MIS estimators with high-probability error bounds.

preprint2020arXiv

Enhancing the Locality and Breaking the Memory Bottleneck of Transformer on Time Series Forecasting

Time series forecasting is an important problem across many domains, including predictions of solar plant energy output, electricity consumption, and traffic jam situation. In this paper, we propose to tackle such forecasting problem with Transformer [1]. Although impressed by its performance in our preliminary study, we found its two major weaknesses: (1) locality-agnostics: the point-wise dot-product self-attention in canonical Transformer architecture is insensitive to local context, which can make the model prone to anomalies in time series; (2) memory bottleneck: space complexity of canonical Transformer grows quadratically with sequence length $L$, making directly modeling long time series infeasible. In order to solve these two issues, we first propose convolutional self-attention by producing queries and keys with causal convolution so that local context can be better incorporated into attention mechanism. Then, we propose LogSparse Transformer with only $O(L(\log L)^{2})$ memory cost, improving forecasting accuracy for time series with fine granularity and strong long-term dependencies under constrained memory budget. Our experiments on both synthetic data and real-world datasets show that it compares favorably to the state-of-the-art.

preprint2020arXiv

Provably Efficient Q-Learning with Low Switching Cost

We take initial steps in studying PAC-MDP algorithms with limited adaptivity, that is, algorithms that change its exploration policy as infrequently as possible during regret minimization. This is motivated by the difficulty of running fully adaptive algorithms in real-world applications (such as medical domains), and we propose to quantify adaptivity using the notion of local switching cost. Our main contribution, Q-Learning with UCB2 exploration, is a model-free algorithm for H-step episodic MDP that achieves sublinear regret whose local switching cost in K episodes is $O(H^3SA\log K)$, and we provide a lower bound of $Ω(HSA)$ on the local switching cost for any no-regret algorithm. Our algorithm can be naturally adapted to the concurrent setting, which yields nontrivial results that improve upon prior work in certain aspects.

preprint2020arXiv

Towards Optimal Off-Policy Evaluation for Reinforcement Learning with Marginalized Importance Sampling

Motivated by the many real-world applications of reinforcement learning (RL) that require safe-policy iterations, we consider the problem of off-policy evaluation (OPE) -- the problem of evaluating a new policy using the historical data obtained by different behavior policies -- under the model of nonstationary episodic Markov Decision Processes (MDP) with a long horizon and a large action space. Existing importance sampling (IS) methods often suffer from large variance that depends exponentially on the RL horizon $H$. To solve this problem, we consider a marginalized importance sampling (MIS) estimator that recursively estimates the state marginal distribution for the target policy at every step. MIS achieves a mean-squared error of $$ \frac{1}{n} \sum\nolimits_{t=1}^H\mathbb{E}_μ\left[\frac{d_t^π(s_t)^2}{d_t^μ(s_t)^2} \mathrm{Var}_μ\left[\frac{π_t(a_t|s_t)}{μ_t(a_t|s_t)}\big( V_{t+1}^π(s_{t+1}) + r_t\big) \middle| s_t\right]\right] + \tilde{O}(n^{-1.5}) $$ where $μ$ and $π$ are the logging and target policies, $d_t^μ(s_t)$ and $d_t^π(s_t)$ are the marginal distribution of the state at $t$th step, $H$ is the horizon, $n$ is the sample size and $V_{t+1}^π$ is the value function of the MDP under $π$. The result matches the Cramer-Rao lower bound in \citet{jiang2016doubly} up to a multiplicative factor of $H$. To the best of our knowledge, this is the first OPE estimation error bound with a polynomial dependence on $H$. Besides theory, we show empirical superiority of our method in time-varying, partially observable, and long-horizon RL environments.