Researcher profile

Shahin Shahrampour

Shahin Shahrampour contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
19works
0followers
7topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

19 published item(s)

preprint2022arXiv

Distributed Online System Identification for LTI Systems Using Reverse Experience Replay

Identification of linear time-invariant (LTI) systems plays an important role in control and reinforcement learning. Both asymptotic and finite-time offline system identification are well-studied in the literature. For online system identification, the idea of stochastic-gradient descent with reverse experience replay (SGD-RER) was recently proposed, where the data sequence is stored in several buffers and the stochastic-gradient descent (SGD) update performs backward in each buffer to break the time dependency between data points. Inspired by this work, we study distributed online system identification of LTI systems over a multi-agent network. We consider agents as identical LTI systems, and the network goal is to jointly estimate the system parameters by leveraging the communication between agents. We propose DSGD-RER, a distributed variant of the SGD-RER algorithm, and theoretically characterize the improvement of the estimation error with respect to the network size. Our numerical experiments certify the reduction of estimation error as the network size grows.

preprint2022arXiv

On Centralized and Distributed Mirror Descent: Convergence Analysis Using Quadratic Constraints

Mirror descent (MD) is a powerful first-order optimization technique that subsumes several optimization algorithms including gradient descent (GD). In this work, we develop a semi-definite programming (SDP) framework to analyze the convergence rate of MD in centralized and distributed settings under both strongly convex and non-strongly convex assumptions. We view MD with a dynamical system lens and leverage quadratic constraints (QCs) to provide explicit convergence rates based on Lyapunov stability. For centralized MD under strongly convex assumption, we develop a SDP that certifies exponential convergence rates. We prove that the SDP always has a feasible solution that recovers the optimal GD rate as a special case. We complement our analysis by providing the $O(1/k)$ convergence rate for convex problems. Next, we analyze the convergence of distributed MD and characterize the rate using SDP. To the best of our knowledge, the numerical rate of distributed MD has not been previously reported in the literature. We further prove an $O(1/k)$ convergence rate for distributed MD in the convex setting. Our numerical experiments on strongly convex problems indicate that our framework certifies superior convergence rates compared to the existing rates for distributed GD.

preprint2022arXiv

Regret Analysis of Distributed Online LQR Control for Unknown LTI Systems

Online optimization has recently opened avenues to study optimal control for time-varying cost functions that are unknown in advance. Inspired by this line of research, we study the distributed online linear quadratic regulator (LQR) problem for linear time-invariant (LTI) systems with unknown dynamics. Consider a multi-agent network where each agent is modeled as a LTI system. The network has a global time-varying quadratic cost, which may evolve adversarially and is only partially observed by each agent sequentially. The goal of the network is to collectively (i) estimate the unknown dynamics and (ii) compute local control sequences competitive to the best centralized policy in hindsight, which minimizes the sum of network costs over time. This problem is formulated as a regret minimization. We propose a distributed variant of the online LQR algorithm, where agents compute their system estimates during an exploration stage. Each agent then applies distributed online gradient descent on a semi-definite programming (SDP) whose feasible set is based on the agent system estimate. We prove that with high probability the regret bound of our proposed algorithm scales as $O(T^{2/3}\log T)$, implying the consensus of all agents over time. We also provide simulation results verifying our theoretical guarantee.

preprint2022arXiv

RFN: A Random-Feature Based Newton Method for Empirical Risk Minimization in Reproducing Kernel Hilbert Spaces

In supervised learning using kernel methods, we often encounter a large-scale finite-sum minimization over a reproducing kernel Hilbert space (RKHS). Large-scale finite-sum problems can be solved using efficient variants of Newton method, where the Hessian is approximated via sub-samples of data. In RKHS, however, the dependence of the penalty function to kernel makes standard sub-sampling approaches inapplicable, since the gram matrix is not readily available in a low-rank form. In this paper, we observe that for this class of problems, one can naturally use kernel approximation to speed up the Newton method. Focusing on randomized features for kernel approximation, we provide a novel second-order algorithm that enjoys local superlinear convergence and global linear convergence (with high probability). We derive the theoretical lower bound for the number of random features required for the approximated Hessian to be close to the true Hessian in the norm sense. Our numerical experiments on real-world data verify the efficiency of our method compared to several benchmarks.

preprint2022arXiv

Statistical and Topological Properties of Sliced Probability Divergences

The idea of slicing divergences has been proven to be successful when comparing two probability measures in various machine learning applications including generative modeling, and consists in computing the expected value of a `base divergence' between one-dimensional random projections of the two measures. However, the topological, statistical, and computational consequences of this technique have not yet been well-established. In this paper, we aim at bridging this gap and derive various theoretical properties of sliced probability divergences. First, we show that slicing preserves the metric axioms and the weak continuity of the divergence, implying that the sliced divergence will share similar topological properties. We then precise the results in the case where the base divergence belongs to the class of integral probability metrics. On the other hand, we establish that, under mild conditions, the sample complexity of a sliced divergence does not depend on the problem dimension. We finally apply our general results to several base divergences, and illustrate our theory on both synthetic and real data experiments.

preprint2021arXiv

Decentralized Riemannian Gradient Descent on the Stiefel Manifold

We consider a distributed non-convex optimization where a network of agents aims at minimizing a global function over the Stiefel manifold. The global function is represented as a finite sum of smooth local functions, where each local function is associated with one agent and agents communicate with each other over an undirected connected graph. The problem is non-convex as local functions are possibly non-convex (but smooth) and the Steifel manifold is a non-convex set. We present a decentralized Riemannian stochastic gradient method (DRSGD) with the convergence rate of $\mathcal{O}(1/\sqrt{K})$ to a stationary point. To have exact convergence with constant stepsize, we also propose a decentralized Riemannian gradient tracking algorithm (DRGTA) with the convergence rate of $\mathcal{O}(1/K)$ to a stationary point. We use multi-step consensus to preserve the iteration in the local (consensus) region. DRGTA is the first decentralized algorithm with exact convergence for distributed optimization on Stiefel manifold.

preprint2021arXiv

On Distributed Non-convex Optimization: Projected Subgradient Method For Weakly Convex Problems in Networks

The stochastic subgradient method is a widely-used algorithm for solving large-scale optimization problems arising in machine learning. Often these problems are neither smooth nor convex. Recently, Davis et al. [1-2] characterized the convergence of the stochastic subgradient method for the weakly convex case, which encompasses many important applications (e.g., robust phase retrieval, blind deconvolution, biconvex compressive sensing, and dictionary learning). In practice, distributed implementations of the projected stochastic subgradient method (stoDPSM) are used to speed-up risk minimization. In this paper, we propose a distributed implementation of the stochastic subgradient method with a theoretical guarantee. Specifically, we show the global convergence of stoDPSM using the Moreau envelope stationarity measure. Furthermore, under a so-called sharpness condition, we show that deterministic DPSM (with a proper initialization) converges linearly to the sharp minima, using geometrically diminishing step-size. We provide numerical experiments to support our theoretical analysis.

preprint2021arXiv

On the Local Linear Rate of Consensus on the Stiefel Manifold

We study the convergence properties of Riemannian gradient method for solving the consensus problem (for an undirected connected graph) over the Stiefel manifold. The Stiefel manifold is a non-convex set and the standard notion of averaging in the Euclidean space does not work for this problem. We propose Distributed Riemannian Consensus on Stiefel Manifold (DRCS) and prove that it enjoys a local linear convergence rate to global consensus. More importantly, this local rate asymptotically scales with the second largest singular value of the communication matrix, which is on par with the well-known rate in the Euclidean space. To the best of our knowledge, this is the first work showing the equality of the two rates. The main technical challenges include (i) developing a Riemannian restricted secant inequality for convergence analysis, and (ii) to identify the conditions (e.g., suitable step-size and initialization) under which the algorithm always stays in the local region.

preprint2020arXiv

A Mean-Field Theory for Kernel Alignment with Random Features in Generative and Discriminative Models

We propose a novel supervised learning method to optimize the kernel in the maximum mean discrepancy generative adversarial networks (MMD GANs), and the kernel support vector machines (SVMs). Specifically, we characterize a distributionally robust optimization problem to compute a good distribution for the random feature model of Rahimi and Recht. Due to the fact that the distributional optimization is infinite dimensional, we consider a Monte-Carlo sample average approximation (SAA) to obtain a more tractable finite dimensional optimization problem. We subsequently leverage a particle stochastic gradient descent (SGD) method to solve the derived finite dimensional optimization problem. Based on a mean-field analysis, we then prove that the empirical distribution of the interactive particles system at each iteration of the SGD follows the path of the gradient descent flow on the Wasserstein manifold. We also establish the non-asymptotic consistency of the finite sample estimator. We evaluate our kernel learning method for the hypothesis testing problem by evaluating the kernel MMD statistics, and show that our learning method indeed attains better power of the test for larger threshold values compared to an untrained kernel. Moreover, our empirical evaluation on benchmark data-sets shows the advantage of our kernel learning approach compared to alternative kernel learning methods.

preprint2020arXiv

Distributed Mirror Descent with Integral Feedback: Asymptotic Convergence Analysis of Continuous-time Dynamics

This work addresses distributed optimization, where a network of agents wants to minimize a global strongly convex objective function. The global function can be written as a sum of local convex functions, each of which is associated with an agent. We propose a continuous-time distributed mirror descent algorithm that uses purely local information to converge to the global optimum. Unlike previous work on distributed mirror descent, we incorporate an integral feedback in the update, allowing the algorithm to converge with a constant step-size when discretized. We establish the asymptotic convergence of the algorithm using Lyapunov stability analysis. We further illustrate numerical experiments that verify the advantage of adopting integral feedback for improving the convergence rate of distributed mirror descent.

preprint2020arXiv

Distributed Parameter Estimation in Randomized One-hidden-layer Neural Networks

This paper addresses distributed parameter estimation in randomized one-hidden-layer neural networks. A group of agents sequentially receive measurements of an unknown parameter that is only partially observable to them. In this paper, we present a fully distributed estimation algorithm where agents exchange local estimates with their neighbors to collectively identify the true value of the parameter. We prove that this distributed update provides an asymptotically unbiased estimator of the unknown parameter, i.e., the first moment of the expected global error converges to zero asymptotically. We further analyze the efficiency of the proposed estimation scheme by establishing an asymptotic upper bound on the variance of the global error. Applying our method to a real-world dataset related to appliances energy prediction, we observe that our empirical findings verify the theoretical results.

preprint2020arXiv

Generalization Guarantees for Sparse Kernel Approximation with Entropic Optimal Features

Despite their success, kernel methods suffer from a massive computational cost in practice. In this paper, in lieu of commonly used kernel expansion with respect to $N$ inputs, we develop a novel optimal design maximizing the entropy among kernel features. This procedure results in a kernel expansion with respect to entropic optimal features (EOF), improving the data representation dramatically due to features dissimilarity. Under mild technical assumptions, our generalization bound shows that with only $O(N^{\frac{1}{4}})$ features (disregarding logarithmic factors), we can achieve the optimal statistical accuracy (i.e., $O(1/\sqrt{N})$). The salient feature of our design is its sparsity that significantly reduces the time and space cost. Our numerical experiments on benchmark datasets verify the superiority of EOF over the state-of-the-art in kernel approximation.

preprint2020arXiv

Generalized Sliced Distances for Probability Distributions

Probability metrics have become an indispensable part of modern statistics and machine learning, and they play a quintessential role in various applications, including statistical hypothesis testing and generative modeling. However, in a practical setting, the convergence behavior of the algorithms built upon these distances have not been well established, except for a few specific cases. In this paper, we introduce a broad family of probability metrics, coined as Generalized Sliced Probability Metrics (GSPMs), that are deeply rooted in the generalized Radon transform. We first verify that GSPMs are metrics. Then, we identify a subset of GSPMs that are equivalent to maximum mean discrepancy (MMD) with novel positive definite kernels, which come with a unique geometric interpretation. Finally, by exploiting this connection, we consider GSPM-based gradient flows for generative modeling applications and show that under mild assumptions, the gradient flow converges to the global optimum. We illustrate the utility of our approach on both real and synthetic problems.

preprint2020arXiv

Learning from Non-Random Data in Hilbert Spaces: An Optimal Recovery Perspective

The notion of generalization in classical Statistical Learning is often attached to the postulate that data points are independent and identically distributed (IID) random variables. While relevant in many applications, this postulate may not hold in general, encouraging the development of learning frameworks that are robust to non-IID data. In this work, we consider the regression problem from an Optimal Recovery perspective. Relying on a model assumption comparable to choosing a hypothesis class, a learner aims at minimizing the worst-case error, without recourse to any probabilistic assumption on the data. We first develop a semidefinite program for calculating the worst-case error of any recovery map in finite-dimensional Hilbert spaces. Then, for any Hilbert space, we show that Optimal Recovery provides a formula which is user-friendly from an algorithmic point-of-view, as long as the hypothesis class is linear. Interestingly, this formula coincides with kernel ridgeless regression in some cases, proving that minimizing the average error and worst-case error can yield the same solution. We provide numerical experiments in support of our theoretical findings.

preprint2020arXiv

Unconstrained Online Optimization: Dynamic Regret Analysis of Strongly Convex and Smooth Problems

The regret bound of dynamic online learning algorithms is often expressed in terms of the variation in the function sequence ($V_T$) and/or the path-length of the minimizer sequence after $T$ rounds. For strongly convex and smooth functions, , Zhang et al. establish the squared path-length of the minimizer sequence ($C^*_{2,T}$) as a lower bound on regret. They also show that online gradient descent (OGD) achieves this lower bound using multiple gradient queries per round. In this paper, we focus on unconstrained online optimization. We first show that a preconditioned variant of OGD achieves $O(C^*_{2,T})$ with one gradient query per round. We then propose online optimistic Newton (OON) method for the case when the first and second order information of the function sequence is predictable. The regret bound of OON is captured via the quartic path-length of the minimizer sequence ($C^*_{4,T}$), which can be much smaller than $C^*_{2,T}$. We finally show that by using multiple gradients for OGD, we can achieve an upper bound of $O(\min\{C^*_{2,T},V_T\})$ on regret.

preprint2013arXiv

Exponentially Fast Parameter Estimation in Networks Using Distributed Dual Averaging

In this paper we present an optimization-based view of distributed parameter estimation and observational social learning in networks. Agents receive a sequence of random, independent and identically distributed (i.i.d.) signals, each of which individually may not be informative about the underlying true state, but the signals together are globally informative enough to make the true state identifiable. Using an optimization-based characterization of Bayesian learning as proximal stochastic gradient descent (with Kullback-Leibler divergence from a prior as a proximal function), we show how to efficiently use a distributed, online variant of Nesterov's dual averaging method to solve the estimation with purely local information. When the true state is globally identifiable, and the network is connected, we prove that agents eventually learn the true parameter using a randomized gossip scheme. We demonstrate that with high probability the convergence is exponentially fast with a rate dependent on the KL divergence of observations under the true state from observations under the second likeliest state. Furthermore, our work also highlights the possibility of learning under continuous adaptation of network which is a consequence of employing constant, unit stepsize for the algorithm.

preprint2013arXiv

Online Learning of Dynamic Parameters in Social Networks

This paper addresses the problem of online learning in a dynamic setting. We consider a social network in which each individual observes a private signal about the underlying state of the world and communicates with her neighbors at each time period. Unlike many existing approaches, the underlying state is dynamic, and evolves according to a geometric random walk. We view the scenario as an optimization problem where agents aim to learn the true state while suffering the smallest possible loss. Based on the decomposition of the global loss function, we introduce two update mechanisms, each of which generates an estimate of the true state. We establish a tight bound on the rate of change of the underlying state, under which individuals can track the parameter with a bounded variance. Then, we characterize explicit expressions for the steady state mean-square deviation(MSD) of the estimates from the truth, per individual. We observe that only one of the estimators recovers the optimal MSD, which underscores the impact of the objective function decomposition on the learning quality. Finally, we provide an upper bound on the regret of the proposed methods, measured as an average of errors in estimating the parameter in a finite time.

preprint2013arXiv

Reconstruction of Directed Networks from Consensus Dynamics

This paper addresses the problem of identifying the topology of an unknown, weighted, directed network running a consensus dynamics. We propose a methodology to reconstruct the network topology from the dynamic response when the system is stimulated by a wide-sense stationary noise of unknown power spectral density. The method is based on a node-knockout, or grounding, procedure wherein the grounded node broadcasts zero without being eliminated from the network. In this direction, we measure the empirical cross-power spectral densities of the outputs between every pair of nodes for both grounded and ungrounded consensus to reconstruct the unknown topology of the network. We also establish that in the special cases of undirected or purely unidirectional networks, the reconstruction does not need grounding. Finally, we extend our results to the case of a directed network assuming a general dynamics, and prove that the developed method can detect edges and their direction.

preprint2013arXiv

Topology Identification of Directed Dynamical Networks via Power Spectral Analysis

We address the problem of identifying the topology of an unknown weighted, directed network of LTI systems stimulated by wide-sense stationary noises of unknown power spectral densities. We propose several reconstruction algorithms based on the cross-power spectral densities of the network's response to the input noises. Our first algorithm reconstructs the Boolean structure (i.e., existence and directions of links) of a directed network from a series of dynamical responses. Moreover, we propose a second algorithm to recover the exact structure of the network (including edge weights), as well as the power spectral density of the input noises, when an eigenvalue-eigenvector pair of the connectivity matrix is known (for example, Laplacian connectivity matrices). Finally, for the particular cases of nonreciprocal networks (i.e., networks with no directed edges pointing in opposite directions) and undirected networks, we propose specialized algorithms that result in a lower computational cost.