Researcher profile

Alex Olshevsky

Alex Olshevsky contributes to research discovery and scholarly infrastructure.

Trust snapshot

Quick read

Trust 21 - Emerging
Verification L1
Unclaimed author
13 works
0 followers
11 topics
4 close collaborators

Published work

13 published items

preprint · 2026 · arXiv

Bridging the Gap Between Average and Discounted TD Learning

The analysis of Temporal Difference (TD) learning in the average-reward setting faces notable theoretical difficulties because the Bellman operator is not contractive with respect to any norm. This complicates standard analyses of stochastic updates that are effective in discounted settings. Although a considerable body of literature addresses these challenges, existing theoretical approaches come with limitations. We introduce a novel algorithm designed explicitly for policy evaluation in the average-reward setting, utilizing sampling from two Markovian trajectories. Our proposed method overcomes previous limitations by guaranteeing convergence to the unique solution of a properly defined projected Bellman equation. Notably, and in contrast to earlier work, our convergence analysis is uniformly applicable to both linear function approximation and tabular settings and does not involve explicit dimension-dependent terms in its convergence bounds. These results align with what is known to hold in the discounted setting. Furthermore, our algorithm achieves improved dependence on the problem's condition number, reducing the sample complexity from quartic, as in prior literature, to quadratic scaling, and thus matching the efficiency seen in the discounted setting.

preprint · 2026 · arXiv

Data Deletion Can Help in Adaptive RL

Deploying reinforcement learning policies in the real world requires adapting to time-varying environments. We study this problem in the contextual Markov Decision Process (cMDP) framework, where a family of environments is indexed by a low-dimensional context unknown at test time. The standard approach decomposes the problem: train a so-called "universal policy" which assumes knowledge of the true context, then pair it with a context estimator which approximates the context using the observed trajectory. We identify a simple, counterintuitive trick that substantially improves the estimator: randomly delete a fraction of the training buffer after each round. This works because data is collected across multiple rounds using progressively better policies, and older trajectories come from a different distribution than what the estimator will face at deployment time; random deletion creates an implicit exponential decay on older data while preserving diversity, without requiring any explicit identification of which samples are stale. This reduces the robustness gap by 30% for MLPs and by 6% on average for recurrent networks. Strikingly, it allows a narrow MLP with 5x fewer parameters to outperform a wide MLP trained without deletion. To understand when and why deletion helps, we analyze regularized empirical risk minimization with a mismatch between the training distribution and the distribution at deployment; in this idealized setting, we prove that removing a single uniformly random training point decreases the expected test loss under mild conditions. For ridge regression we make this quantitative: deletion helps when the regularization coefficient is moderate and the signal-to-noise ratio (SNR) is sufficiently low, and, crucially, this SNR threshold gives a direct measure of how large the distribution mismatch between training and deployment must be for deletion to be beneficial.
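The "implicit exponential decay" mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the use of independent per-sample deletion (rather than removing an exact fraction), and tagging each sample with its round index are all assumptions made for illustration.

```python
import random


def retention_probability(num_deletion_passes, delete_fraction):
    """Chance that a sample survives the given number of deletion passes."""
    return (1.0 - delete_fraction) ** num_deletion_passes


def delete_random_fraction(buffer, delete_fraction, rng):
    """Drop each sample independently with probability delete_fraction (assumed scheme)."""
    return [sample for sample in buffer if rng.random() >= delete_fraction]


def collect_with_deletion(num_rounds, samples_per_round, delete_fraction, seed=0):
    """Simulate data collection; each stored sample is just its round index."""
    rng = random.Random(seed)
    buffer = []
    for round_index in range(num_rounds):
        # New data from the current (hypothetically better) policy.
        buffer.extend([round_index] * samples_per_round)
        # Random deletion after the round, with no notion of which samples are stale.
        buffer = delete_random_fraction(buffer, delete_fraction, rng)
    return buffer


if __name__ == "__main__":
    final_buffer = collect_with_deletion(num_rounds=5, samples_per_round=1000,
                                         delete_fraction=0.3)
    for r in range(5):
        print(r, final_buffer.count(r))
```

Under these assumptions, a sample that has been through k deletion passes remains with probability (1 - p)^k, so older rounds are geometrically down-weighted while some of their samples still survive.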

preprint · 2022 · arXiv

Distributed TD(0) with Almost No Communication

We provide a new non-asymptotic analysis of distributed TD(0) with linear function approximation. Our approach relies on "one-shot averaging," where $N$ agents run local copies of TD(0) and average the outcomes only once at the very end. We consider two models: one in which the agents interact with an environment they can observe and whose transitions depend on all of their actions (which we call the global state model), and one in which each agent can run a local copy of an identical Markov Decision Process, which we call the local state model. In the global state model, we show that the convergence rate of our distributed one-shot averaging method matches the known convergence rate of TD(0). By contrast, the best convergence rate in the previous literature could, according to the worst-case bounds given, underperform the non-distributed version by $O(N^3)$ in terms of the number of agents $N$. In the local state model, we demonstrate a version of the linear time speedup phenomenon, where the convergence time of the distributed process is a factor of $N$ faster than the convergence time of TD(0). As far as we are aware, this is the first result rigorously showing benefits from parallelism for temporal difference methods.
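The one-shot averaging idea can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the paper's algorithm or experiments: it uses the tabular case (linear function approximation with one-hot features), a constant step size, a toy two-state chain, and hypothetical function names. Each agent runs on its own copy of the chain, which loosely mirrors the local state model.

```python
import random


def td0_tabular(transition, reward, gamma, step_size, num_steps, rng):
    """Run tabular TD(0) along one sampled trajectory and return the value estimates."""
    num_states = len(reward)
    values = [0.0] * num_states
    state = 0
    for _ in range(num_steps):
        next_state = rng.choices(range(num_states), weights=transition[state])[0]
        td_error = reward[state] + gamma * values[next_state] - values[state]
        values[state] += step_size * td_error
        state = next_state
    return values


def one_shot_average(local_estimates):
    """Average the agents' final estimates coordinate-wise (the only communication)."""
    num_agents = len(local_estimates)
    return [sum(column) / num_agents for column in zip(*local_estimates)]


def distributed_td0(num_agents, transition, reward, gamma, step_size, num_steps, seed=0):
    """Each agent runs TD(0) independently; results are averaged once at the end."""
    local = [
        td0_tabular(transition, reward, gamma, step_size, num_steps,
                    random.Random(seed + agent))
        for agent in range(num_agents)
    ]
    return one_shot_average(local)


if __name__ == "__main__":
    # Toy chain (assumed): uniform transitions, reward 1 in state 0, discount 0.5.
    # Solving V = r + gamma * P V by hand gives V = [1.5, 0.5].
    P = [[0.5, 0.5], [0.5, 0.5]]
    print(distributed_td0(8, P, [1.0, 0.0], 0.5, 0.05, 5000))
```

No messages are exchanged during the runs; the single averaging step at the end is what reduces the variance of the combined estimate.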

preprint · 2022 · arXiv

Optimal Lockdown for Pandemic Control

As a common strategy of contagious disease containment, lockdowns will inevitably weaken the economy. The ongoing COVID-19 pandemic underscores the trade-off between public health and economic cost. An optimal lockdown policy to resolve this trade-off is highly desired. Here we propose a mathematical framework of pandemic control through an optimal stabilizing non-uniform lockdown, where our goal is to reduce the economic activity as little as possible while decreasing the number of infected individuals at a prescribed rate. This framework allows us to efficiently compute the optimal stabilizing lockdown policy for general epidemic spread models, including both the classical SIS/SIR/SEIR models and a new model of COVID-19 transmissions. We demonstrate the power of this framework by using publicly available data on inter-county travel frequencies to analyze a model of COVID-19 spread in the 62 counties of New York State. We find that an optimal stabilizing lockdown based on epidemic status in April 2020 would have reduced economic activity more stringently outside of New York City compared to within it, even though the epidemic was much more prevalent in New York City at that point. Such a counterintuitive result highlights the intricacies of pandemic control and sheds light on future lockdown policy design.

preprint · 2021 · arXiv

A Sharp Estimate on the Transient Time of Distributed Stochastic Gradient Descent

This paper is concerned with minimizing the average of $n$ cost functions over a network in which agents may communicate and exchange information with each other. We consider the setting where only noisy gradient information is available. To solve the problem, we study the distributed stochastic gradient descent (DSGD) method and perform a non-asymptotic convergence analysis. For strongly convex and smooth objective functions, DSGD asymptotically achieves the optimal network-independent convergence rate compared to centralized stochastic gradient descent (SGD). Our main contribution is to characterize the transient time needed for DSGD to approach the asymptotic convergence rate, which we show behaves as $K_T=\mathcal{O}\left(\frac{n}{(1-\rho_w)^2}\right)$, where $1-\rho_w$ denotes the spectral gap of the mixing matrix. Moreover, we construct a "hard" optimization problem for which we show the transient time needed for DSGD to approach the asymptotic convergence rate is lower bounded by $\Omega\left(\frac{n}{(1-\rho_w)^2}\right)$, implying the sharpness of the obtained result. Numerical experiments demonstrate the tightness of the theoretical results.
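A minimal Python sketch of one common form of DSGD may help make the setting concrete. It is not taken from the paper: the scalar quadratic costs $f_i(x) = \frac{1}{2}(x - a_i)^2$, the step size $1/(k+1)$, the "local step, then mix" ordering, the ring mixing matrix, and all function names are assumptions chosen for illustration. The helper `transient_time_scale` only evaluates the expression $n/(1-\rho_w)^2$ and ignores the constants hidden in the $\mathcal{O}$ notation.

```python
import random


def dsgd_quadratic(targets, mixing, num_iters, noise_std=0.0, seed=0):
    """DSGD on f_i(x) = 0.5 * (x - a_i)^2 with additive gradient noise."""
    rng = random.Random(seed)
    n = len(targets)
    x = [0.0] * n
    for k in range(num_iters):
        step = 1.0 / (k + 1)
        # Each agent takes a noisy gradient step on its own cost.
        local = [
            x[i] - step * ((x[i] - targets[i]) + rng.gauss(0.0, noise_std))
            for i in range(n)
        ]
        # Agents then average with their neighbors using the mixing matrix.
        x = [sum(mixing[i][j] * local[j] for j in range(n)) for i in range(n)]
    return x


def transient_time_scale(num_agents, rho_w):
    """Evaluate n / (1 - rho_w)^2, the scaling stated in the abstract (constants omitted)."""
    return num_agents / (1.0 - rho_w) ** 2


if __name__ == "__main__":
    # Doubly stochastic mixing matrix for a 4-node ring (assumed example).
    W = [
        [0.5, 0.25, 0.0, 0.25],
        [0.25, 0.5, 0.25, 0.0],
        [0.0, 0.25, 0.5, 0.25],
        [0.25, 0.0, 0.25, 0.5],
    ]
    print(dsgd_quadratic([1.0, 2.0, 3.0, 6.0], W, 2000, noise_std=0.1))
    print(transient_time_scale(4, 0.5))
```

In this toy problem the minimizer of the average cost is the mean of the targets, so every agent's iterate should approach that mean as the step size decays.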

preprint · 2020 · arXiv

Asymptotic Convergence Rate of Alternating Minimization for Rank One Matrix Completion

We study alternating minimization for matrix completion in the simplest possible setting: completing a rank-one matrix from a revealed subset of the entries. We bound the asymptotic convergence rate by the variational characterization of eigenvalues of a reversible consensus problem. This leads to a polynomial upper bound on the asymptotic rate in terms of the number of nodes as well as the largest degree of the graph of revealed entries.
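As a rough illustration of the setting, the following Python sketch runs alternating least-squares updates on a rank-one matrix with some entries hidden. It is a sketch under assumptions rather than the paper's exact procedure: the data layout (a dictionary of revealed entries), the all-ones initialization, and the function name are illustrative, and the example assumes positive entries, a connected graph of revealed entries, and at least one revealed entry in every row and column.

```python
def alternating_minimization(observed, num_rows, num_cols, num_iters):
    """Fit M[i][j] ~ u[i] * v[j] on the revealed entries by alternating least squares."""
    u = [1.0] * num_rows
    v = [1.0] * num_cols
    for _ in range(num_iters):
        # Fix v and solve the one-dimensional least-squares problem for each u[i].
        for i in range(num_rows):
            num = sum(val * v[j] for (r, j), val in observed.items() if r == i)
            den = sum(v[j] ** 2 for (r, j) in observed if r == i)
            u[i] = num / den
        # Fix u and solve for each v[j] in the same way.
        for j in range(num_cols):
            num = sum(val * u[i] for (i, c), val in observed.items() if c == j)
            den = sum(u[i] ** 2 for (i, c) in observed if c == j)
            v[j] = num / den
    return u, v


if __name__ == "__main__":
    a, b = [1.0, 2.0, 3.0], [1.0, 1.0, 2.0]
    hidden = {(0, 2), (2, 0)}
    revealed = {(i, j): a[i] * b[j]
                for i in range(3) for j in range(3) if (i, j) not in hidden}
    u, v = alternating_minimization(revealed, 3, 3, 500)
    # The products u[i] * v[j] estimate the hidden entries a[i] * b[j].
    print(u[0] * v[2], u[2] * v[0])
```

The factors themselves are only determined up to a scaling (u, v) -> (c u, v / c), so the sketch compares products u[i] * v[j] rather than the factors.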

preprint · 2020 · arXiv

Asymptotic Network Independence and Step-Size for A Distributed Subgradient Method

We consider whether distributed subgradient methods can achieve a linear speedup over a centralized subgradient method. While it might be hoped that a distributed network of $n$ nodes that can compute $n$ times more subgradients in parallel compared to a single node might, as a result, be $n$ times faster, existing bounds for distributed optimization methods are often consistent with a slowdown rather than speedup compared to a single node. We show that a distributed subgradient method has this "linear speedup" property when using a class of square-summable-but-not-summable step-sizes which include $1/t^\beta$ when $\beta\in (1/2,1)$; for such step-sizes, we show that after a transient period whose size depends on the spectral gap of the network, the method achieves a performance guarantee that does not depend on the network or the number of nodes. We also show that the same method can fail to have this "asymptotic network independence" property under the optimally decaying step-size $1/\sqrt{t}$ and, as a consequence, can fail to provide a linear speedup compared to a single node with $1/\sqrt{t}$ step-size.

preprint · 2020 · arXiv

Asymptotic Network Independence in Distributed Stochastic Optimization for Machine Learning

We provide a discussion of several recent results which, in certain scenarios, are able to overcome a barrier in distributed stochastic optimization for machine learning. Our focus is the so-called asymptotic network independence property, which is achieved whenever a distributed method executed over a network of $n$ nodes asymptotically converges to the optimal solution at a comparable rate to a centralized method with the same computational power as the entire network. We explain this property through an example involving the training of ML models and sketch a short mathematical analysis for comparing the performance of distributed stochastic gradient descent (DSGD) with centralized stochastic gradient descent (SGD).

preprint · 2020 · arXiv

Deterministic and Randomized Actuator Scheduling With Guaranteed Performance Bounds

In this paper, we investigate the problem of actuator selection for linear dynamical systems. We develop a framework to design a sparse actuator schedule for a given large-scale linear system with guaranteed performance bounds using deterministic polynomial-time and randomized approximately linear-time algorithms. First, we introduce systemic controllability metrics for linear dynamical systems that are monotone and homogeneous with respect to the controllability Gramian. We show that several popular and widely used optimization criteria in the literature belong to this class of controllability metrics. Our main result is to provide a polynomial-time actuator schedule that on average selects only a constant number of actuators at each time step, independent of the dimension, to furnish a guaranteed approximation of the controllability metrics in comparison to when all actuators are in use. Our results naturally apply to the dual problem of sensor selection, in which we provide a guaranteed approximation to the observability Gramian. We illustrate the effectiveness of our theoretical findings via several numerical simulations using benchmark examples.

preprint · 2020 · arXiv

Gradient Descent for Sparse Rank-One Matrix Completion for Crowd-Sourced Aggregation of Sparsely Interacting Workers

We consider worker skill estimation for the single-coin Dawid-Skene crowdsourcing model. In practice, skill estimation is challenging because worker assignments are sparse and irregular due to the arbitrary and uncontrolled availability of workers. We formulate skill estimation as a rank-one correlation-matrix completion problem, where the observed components correspond to observed label correlations between workers. We show that the correlation matrix can be successfully recovered and skills are identifiable if and only if the sampling matrix (observed components) does not have a bipartite connected component. We then propose a projected gradient descent scheme and show that skill estimates converge to the desired global optima for such sampling matrices. Our proof is original and the results are surprising in light of the fact that even the weighted rank-one matrix factorization problem is NP-hard in general. Next, we derive sample complexity bounds in terms of spectral properties of the signless Laplacian of the sampling matrix. Our proposed scheme achieves state-of-the-art performance on a number of real-world datasets.

preprint · 2020 · arXiv

Local SGD With a Communication Overhead Depending Only on the Number of Workers

We consider speeding up stochastic gradient descent (SGD) by parallelizing it across multiple workers. We assume the same data set is shared among $n$ workers, who can take SGD steps and coordinate with a central server. Unfortunately, this could require a lot of communication between the workers and the server, which can dramatically reduce the gains from parallelism. The Local SGD method, proposed and analyzed in the earlier literature, suggests machines should make many local steps between such communications. While the initial analysis of Local SGD showed it needs $\Omega(\sqrt{T})$ communications for $T$ local gradient steps in order for the error to scale proportionately to $1/(nT)$, this has been successively improved in a string of papers, with the state-of-the-art requiring $\Omega\left(n \cdot \mathrm{poly}(\log T)\right)$ communications. In this paper, we give a new analysis of Local SGD. A consequence of our analysis is that Local SGD can achieve an error that scales as $1/(nT)$ with only a fixed number of communications independent of $T$: specifically, only $\Omega(n)$ communications are required.
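The basic Local SGD loop described above can be sketched in Python as follows. This is a toy illustration, not the paper's analysis or communication schedule: the shared scalar objective $f(x) = \frac{1}{2}(x - 3)^2$, the constant step size, the fixed communication period, and the function name are all assumptions.

```python
import random


def local_sgd(num_workers, num_steps, comm_period, step_size,
              target=3.0, noise_std=0.0, seed=0):
    """Local SGD on f(x) = 0.5 * (x - target)^2 shared by all workers.

    Returns the averaged iterate and the number of communication rounds used.
    """
    rng = random.Random(seed)
    x = [0.0] * num_workers
    communications = 0
    for t in range(1, num_steps + 1):
        # Every worker takes an independent noisy gradient step.
        x = [xi - step_size * ((xi - target) + rng.gauss(0.0, noise_std)) for xi in x]
        # Every comm_period steps the server averages and broadcasts the iterates.
        if t % comm_period == 0:
            mean = sum(x) / num_workers
            x = [mean] * num_workers
            communications += 1
    return sum(x) / num_workers, communications


if __name__ == "__main__":
    estimate, rounds = local_sgd(num_workers=4, num_steps=200, comm_period=50,
                                 step_size=0.1, noise_std=0.5)
    print(estimate, rounds)
```

With a fixed period H the number of communication rounds is T // H, so making the count independent of $T$, as in the abstract's result, requires the gaps between communications to grow with $T$; the fixed period here is only the simplest choice for a sketch.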

preprint · 2020 · arXiv

On A Relaxation of Time-Varying Actuator Placement

We consider the time-varying actuator placement in continuous time, where the goal is to maximize the trace of the controllability Gramian. A natural relaxation of the problem is to allow the binary $\{0,1\}$ variable indicating whether an actuator is used at a given time to take on values in the closed interval $[0,1]$. We show that all optimal solutions of both the original and the relaxed problems can be given via an explicit formula, and that, as long as the input matrix has no zero columns, the solution sets of the original and relaxed problems coincide.

preprint · 2019 · arXiv

Robust Asynchronous Stochastic Gradient-Push: Asymptotically Optimal and Network-Independent Performance for Strongly Convex Functions

We consider the standard model of distributed optimization of a sum of functions $F(\mathbf{z}) = \sum_{i=1}^n f_i(\mathbf{z})$, where node $i$ in a network holds the function $f_i(\mathbf{z})$. We allow for a harsh network model characterized by asynchronous updates, message delays, unpredictable message losses, and directed communication among nodes. In this setting, we analyze a modification of the Gradient-Push method for distributed optimization, assuming that (i) node $i$ is capable of generating gradients of its function $f_i(\mathbf{z})$ corrupted by zero-mean bounded-support additive noise at each step, (ii) $F(\mathbf{z})$ is strongly convex, and (iii) each $f_i(\mathbf{z})$ has Lipschitz gradients. We show that our proposed method asymptotically performs as well as the best bounds on centralized gradient descent that takes steps in the direction of the sum of the noisy gradients of all the functions $f_1(\mathbf{z}), \ldots, f_n(\mathbf{z})$ at each step.