Researcher profile

Anuran Makur

Anuran Makur contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
7works
0followers
12topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

7 published item(s)

preprint2026arXiv

Unified High-Probability Analysis of Stochastic Variance-Reduced Estimation

Stochastic estimators are fundamental to large-scale optimization, where population quantities must be inferred from noisy oracle observations. Although influential methods such as momentum, SPIDER, STORM, and PAGE have been highly successful, their analyses are largely estimator-specific and expectation-based, obscuring the structural tradeoffs that determine reliability. In this paper, we develop a unified framework for stochastic variance-reduced estimation based on a recursion with three components: memory retention, reset probability, and a correction term for iterate movement. This framework recovers several classical estimators, motivates new second-order variants, and yields a bias-variance decomposition of estimation error. Our main result is a unified high-probability bound proved using a new dimension-free vector-valued Freedman inequality, valid for smooth normed spaces involving random sums of vector martingales. The result applies in both Euclidean and non-Euclidean settings, including the analysis of mirror-descent-based methods in Banach spaces. As applications, we obtain high-probability oracle complexities for unconstrained optimization with mirror descent, establishing the logarithmic dependence on the confidence level. We also derive the first $\tilde{\mathcal{O}}(\varepsilon^{-3})$ oracle-complexity bounds for stochastic optimization with expectation constraints, improving upon the existing $\tilde{\mathcal{O}}(\varepsilon^{-4})$ complexity by leveraging variance-reduced estimation for the first time in this setting.

preprint2024arXiv

Federated Optimization of Smooth Loss Functions

In this work, we study empirical risk minimization (ERM) within a federated learning framework, where a central server minimizes an ERM objective function using training data that is stored across $m$ clients. In this setting, the Federated Averaging (FedAve) algorithm is the staple for determining $ε$-approximate solutions to the ERM problem. Similar to standard optimization algorithms, the convergence analysis of FedAve only relies on smoothness of the loss function in the optimization parameter. However, loss functions are often very smooth in the training data too. To exploit this additional smoothness, we propose the Federated Low Rank Gradient Descent (FedLRGD) algorithm. Since smoothness in data induces an approximate low rank structure on the loss function, our method first performs a few rounds of communication between the server and clients to learn weights that the server can use to approximate clients' gradients. Then, our method solves the ERM problem at the server using inexact gradient descent. To show that FedLRGD can have superior performance to FedAve, we present a notion of federated oracle complexity as a counterpart to canonical oracle complexity. Under some assumptions on the loss function, e.g., strong convexity in parameter, $η$-Hölder smoothness in data, etc., we prove that the federated oracle complexity of FedLRGD scales like $ϕm(p/ε)^{Θ(d/η)}$ and that of FedAve scales like $ϕm(p/ε)^{3/4}$ (neglecting sub-dominant factors), where $ϕ\gg 1$ is a "communication-to-computation ratio," $p$ is the parameter dimension, and $d$ is the data dimension. Then, we show that when $d$ is small and the loss function is sufficiently smooth in the data, FedLRGD beats FedAve in federated oracle complexity. Finally, in the course of analyzing FedLRGD, we also establish a result on low rank approximation of latent variable models.

preprint2022arXiv

Gradient Descent for Low-Rank Functions

Several recent empirical studies demonstrate that important machine learning tasks, e.g., training deep neural networks, exhibit low-rank structure, where the loss function varies significantly in only a few directions of the input space. In this paper, we leverage such low-rank structure to reduce the high computational cost of canonical gradient-based methods such as gradient descent (GD). Our proposed \emph{Low-Rank Gradient Descent} (LRGD) algorithm finds an $ε$-approximate stationary point of a $p$-dimensional function by first identifying $r \leq p$ significant directions, and then estimating the true $p$-dimensional gradient at every iteration by computing directional derivatives only along those $r$ directions. We establish that the "directional oracle complexities" of LRGD for strongly convex and non-convex objective functions are $\mathcal{O}(r \log(1/ε) + rp)$ and $\mathcal{O}(r/ε^2 + rp)$, respectively. When $r \ll p$, these complexities are smaller than the known complexities of $\mathcal{O}(p \log(1/ε))$ and $\mathcal{O}(p/ε^2)$ of {\gd} in the strongly convex and non-convex settings, respectively. Thus, LRGD significantly reduces the computational cost of gradient-based methods for sufficiently low-rank functions. In the course of our analysis, we also formally define and characterize the classes of exact and approximately low-rank functions.

preprint2022arXiv

Inference in Opinion Dynamics under Social Pressure

We introduce a new opinion dynamics model where a group of agents holds two kinds of opinions: inherent and declared. Each agent's inherent opinion is fixed and unobservable by the other agents. At each time step, agents broadcast their declared opinions on a social network, which are governed by the agents' inherent opinions and social pressure. In particular, we assume that agents may declare opinions that are not aligned with their inherent opinions to conform with their neighbors. This raises the natural question: Can we estimate the agents' inherent opinions from observations of declared opinions? For example, agents' inherent opinions may represent their true political alliances (Democrat or Republican), while their declared opinions may model the political inclinations of tweets on social media. In this context, we may seek to predict the election results by observing voters' tweets, which do not necessarily reflect their political support due to social pressure. We analyze this question in the special case where the underlying social network is a complete graph. We prove that, as long as the population does not include large majorities, estimation of aggregate and individual inherent opinions is possible. On the other hand, large majorities force minorities to lie over time, which makes asymptotic estimation impossible.

preprint2020arXiv

Broadcasting on Random Directed Acyclic Graphs

We study a generalization of the well-known model of broadcasting on trees. Consider a directed acyclic graph (DAG) with a unique source vertex $X$, and suppose all other vertices have indegree $d\geq 2$. Let the vertices at distance $k$ from $X$ be called layer $k$. At layer $0$, $X$ is given a random bit. At layer $k\geq 1$, each vertex receives $d$ bits from its parents in layer $k-1$, which are transmitted along independent binary symmetric channel edges, and combines them using a $d$-ary Boolean processing function. The goal is to reconstruct $X$ with probability of error bounded away from $1/2$ using the values of all vertices at an arbitrarily deep layer. This question is closely related to models of reliable computation and storage, and information flow in biological networks. In this paper, we analyze randomly constructed DAGs, for which we show that broadcasting is only possible if the noise level is below a certain degree and function dependent critical threshold. For $d\geq 3$, and random DAGs with layer sizes $Ω(\log k)$ and majority processing functions, we identify the critical threshold. For $d=2$, we establish a similar result for NAND processing functions. We also prove a partial converse for odd $d\geq 3$ illustrating that the identified thresholds are impossible to improve by selecting different processing functions if the decoder is restricted to using a single vertex. Finally, for any noise level, we construct explicit DAGs (using expander graphs) with bounded degree and layer sizes $Θ(\log k)$ admitting reconstruction. In particular, we show that such DAGs can be generated in deterministic quasi-polynomial time or randomized polylogarithmic time in the depth. These results portray a doubly-exponential advantage for storing a bit in DAGs compared to trees, where $d=1$ but layer sizes must grow exponentially with depth in order to enable broadcasting.

preprint2020arXiv

Coding Theorems for Noisy Permutation Channels

In this paper, we formally define and analyze the class of noisy permutation channels. The noisy permutation channel model constitutes a standard discrete memoryless channel (DMC) followed by an independent random permutation that reorders the output codeword of the DMC. While coding theoretic aspects of this model have been studied extensively, particularly in the context of reliable communication in network settings where packets undergo transpositions, and closely related models of DNA based storage systems have also been analyzed recently, we initiate an information theoretic study of this model by defining an appropriate notion of noisy permutation channel capacity. Specifically, on the achievability front, we prove a lower bound on the noisy permutation channel capacity of any DMC in terms of the rank of the stochastic matrix of the DMC. On the converse front, we establish two upper bounds on the noisy permutation channel capacity of any DMC whose stochastic matrix is strictly positive (entry-wise). Together, these bounds yield coding theorems that characterize the noisy permutation channel capacities of every strictly positive and "full rank" DMC, and our achievability proof yields a conceptually simple, computationally efficient, and capacity achieving coding scheme for such DMCs. Furthermore, we also demonstrate the relation between the output degradation preorder over channels and noisy permutation channel capacity. In fact, the proof of one of our converse bounds exploits a degradation result that constructs a symmetric channel for any DMC such that the DMC is a degraded version of the symmetric channel. Finally, we illustrate some examples such as the special cases of binary symmetric channels and (general) erasure channels. Somewhat surprisingly, our results suggest that noisy permutation channel capacities are generally quite agnostic to the parameters that define the DMCs.

preprint2020arXiv

Estimation of Skill Distributions

In this paper, we study the problem of learning the skill distribution of a population of agents from observations of pairwise games in a tournament. These games are played among randomly drawn agents from the population. The agents in our model can be individuals, sports teams, or Wall Street fund managers. Formally, we postulate that the likelihoods of game outcomes are governed by the Bradley-Terry-Luce (or multinomial logit) model, where the probability of an agent beating another is the ratio between its skill level and the pairwise sum of skill levels, and the skill parameters are drawn from an unknown skill density of interest. The problem is, in essence, to learn a distribution from noisy, quantized observations. We propose a simple and tractable algorithm that learns the skill density with near-optimal minimax mean squared error scaling as $n^{-1+\varepsilon}$, for any $\varepsilon>0$, when the density is smooth. Our approach brings together prior work on learning skill parameters from pairwise comparisons with kernel density estimation from non-parametric statistics. Furthermore, we prove minimax lower bounds which establish minimax optimality of the skill parameter estimation technique used in our algorithm. These bounds utilize a continuum version of Fano's method along with a covering argument. We apply our algorithm to various soccer leagues and world cups, cricket world cups, and mutual funds. We find that the entropy of a learnt distribution provides a quantitative measure of skill, which provides rigorous explanations for popular beliefs about perceived qualities of sporting events, e.g., soccer league rankings. Finally, we apply our method to assess the skill distributions of mutual funds. Our results shed light on the abundance of low quality funds prior to the Great Recession of 2008, and the domination of the industry by more skilled funds after the financial crisis.