Source author record

Santosh S. Vempala

Santosh S. Vempala appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.


Machine Learning, Data Structures and Algorithms, Artificial Intelligence, math.PR, Neural and Evolutionary Computing, Neurons and Cognition, Computational Complexity, Discrete Mathematics, math.CO, math.DG, math.OC

Catalog footprint

What is connected

11 works

11 topics

4 close collaborators


Preprint, 2022, arXiv

Approximating Sparse Graphs: The Random Overlapping Communities Model

How can we approximate sparse graphs and sequences of sparse graphs (with unbounded average degree)? We consider convergence in the first $k$ moments of the graph spectrum (equivalent to the numbers of closed $k$-walks) appropriately normalized. We introduce a simple, easy to sample, random graph model that captures the limiting spectra of many sequences of interest, including the sequence of hypercube graphs. The Random Overlapping Communities (ROC) model is specified by a distribution on pairs $(s,q)$, $s \in \mathbb{Z}_+, q \in (0,1]$. A graph on $n$ vertices with average degree $d$ is generated by repeatedly picking pairs $(s,q)$ from the distribution, adding an Erdős-Rényi random graph of edge density $q$ on a subset of vertices chosen by including each vertex with probability $s/n$, and repeating this process so that the expected degree is $d$. Our proof of convergence to a ROC random graph is based on the Stieltjes moment condition. We also show that the model is an effective approximation for individual graphs. For almost all possible triangle-to-edge and four-cycle-to-edge ratios, there exists a pair $(s,q)$ such that the ROC model with this single community type produces graphs with both desired ratios, a property that cannot be achieved by stochastic block models of bounded description size. Moreover, ROC graphs exhibit an inverse relationship between degree and clustering coefficient, a characteristic of many real-world networks.
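The generation process described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a single community type $(s,q)$, and the function names `sample_roc_graph` and `average_degree` are hypothetical. The number of communities is chosen by a rough estimate (each community contributes about $q s^2 / n$ to a vertex's expected degree), which is an assumption made here rather than a formula taken from the paper.

```python
import random


def sample_roc_graph(n, d, s, q, seed=None):
    """Sample an edge set from a single-community-type ROC-style model.

    n: number of vertices, d: target average degree,
    s: expected community size, q: edge density inside a community.
    """
    rng = random.Random(seed)
    # Assumption: one community adds roughly q*s*s/n to each vertex's
    # expected degree, so about d*n/(q*s*s) communities reach degree d.
    num_communities = max(1, round(d * n / (q * s * s)))
    edges = set()
    for _ in range(num_communities):
        # Include each vertex in the community independently with prob. s/n.
        members = [v for v in range(n) if rng.random() < s / n]
        # Add an Erdős-Rényi graph of density q on the chosen members.
        for i in range(len(members)):
            for j in range(i + 1, len(members)):
                if rng.random() < q:
                    edges.add((members[i], members[j]))
    return edges


def average_degree(n, edges):
    """Average degree of a simple graph given its edge set."""
    return 2 * len(edges) / n
```

Calling `sample_roc_graph(200, 4, 10, 0.5, seed=0)` returns a set of pairs `(u, v)` with `u < v`. Because communities overlap, an edge may be proposed more than once; the set keeps a single copy, so the realized average degree can fall slightly below the target.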

Preprint, 2022, arXiv

Assemblies of neurons learn to classify well-separated distributions

An assembly is a large population of neurons whose synchronous firing is hypothesized to represent a memory, concept, word, and other cognitive categories. Assemblies are believed to provide a bridge between high-level cognitive phenomena and low-level neural activity. Recently, a computational system called the Assembly Calculus (AC), with a repertoire of biologically plausible operations on assemblies, has been shown capable of simulating arbitrary space-bounded computation, but also of simulating complex cognitive phenomena such as language, reasoning, and planning. However, the mechanism whereby assemblies can mediate learning has not been known. Here we present such a mechanism, and prove rigorously that, for simple classification problems defined on distributions of labeled assemblies, a new assembly representing each class can be reliably formed in response to a few stimuli from the class; this assembly is henceforth reliably recalled in response to new stimuli from the same class. Furthermore, such class assemblies will be distinguishable as long as the respective classes are reasonably separated -- for example, when they are clusters of similar assemblies. To prove these results, we draw on random graph theory with dynamic edge weights to estimate sequences of activated vertices, yielding strong generalizations of previous calculations and theorems in this field over the past five years. These theorems are backed up by experiments demonstrating the successful formation of assemblies which represent concept classes on synthetic data drawn from such distributions, and also on MNIST, which lends itself to classification through one assembly per digit. Seen as a learning algorithm, this mechanism is entirely online, generalizes from very few samples, and requires only mild supervision -- all key attributes of learning in a model of the brain.

Preprint, 2022, arXiv

Constant-Factor Approximation Algorithms for Socially Fair $k$-Clustering

We study approximation algorithms for the socially fair $(\ell_p, k)$-clustering problem with $m$ groups, whose special cases include the socially fair $k$-median ($p=1$) and socially fair $k$-means ($p=2$) problems. We present (1) a polynomial-time $(5+2\sqrt{6})^p$-approximation with at most $k+m$ centers, (2) a $(5+2\sqrt{6}+ε)^p$-approximation with $k$ centers in time $n^{2^{O(p)}\cdot m^2}$, and (3) a $(15+6\sqrt{6})^p$-approximation with $k$ centers in time $k^{m}\cdot\text{poly}(n)$. The first result is obtained via a refinement of the iterative rounding method using a sequence of linear programs. The latter two results are obtained by converting a solution with up to $k+m$ centers to one with $k$ centers, using sparsification methods for (2) and an exhaustive search for (3). We also compare the performance of our algorithms with existing bicriteria algorithms and with approximation algorithms that use exactly $k$ centers on benchmark datasets, and find that our algorithms also outperform existing methods in practice.

Preprint, 2022, arXiv

Convergence of the Riemannian Langevin Algorithm

We study the Riemannian Langevin Algorithm for the problem of sampling from a distribution with density $ν$ with respect to the natural measure on a manifold with metric $g$. We assume that the target density satisfies a log-Sobolev inequality with respect to the metric and prove that the manifold generalization of the Unadjusted Langevin Algorithm converges rapidly to $ν$ for Hessian manifolds. This allows us to reduce the problem of sampling non-smooth (constrained) densities in ${\bf R}^n$ to sampling smooth densities over appropriate manifolds, while needing access only to the gradient of the log-density, and this, in turn, to sampling from the natural Brownian motion on the manifold. Our main analytic tools are (1) an extension of self-concordance to manifolds, and (2) a stochastic approach to bounding smoothness on manifolds. A special case of our approach is sampling isoperimetric densities restricted to polytopes by using the metric defined by the logarithmic barrier.

Preprint, 2022, arXiv

How and When Random Feedback Works: A Case Study of Low-Rank Matrix Factorization

The success of gradient descent in ML and especially for learning neural networks is remarkable and robust. In the context of how the brain learns, one aspect of gradient descent that appears biologically difficult to realize (if not implausible) is that its updates rely on feedback from later layers to earlier layers through the same connections. Such bidirected links are relatively few in brain networks, and even when reciprocal connections exist, they may not be equi-weighted. Random Feedback Alignment (Lillicrap et al., 2016), where the backward weights are random and fixed, has been proposed as a bio-plausible alternative and found to be effective empirically. We investigate how and when feedback alignment (FA) works, focusing on one of the most basic problems with layered structure -- low-rank matrix factorization. In this problem, given a matrix $Y_{n\times m}$, the goal is to find a low rank factorization $Z_{n \times r}W_{r \times m}$ that minimizes the error $\|ZW-Y\|_F$. Gradient descent solves this problem optimally. We show that FA converges to the optimal solution when $r\ge \mbox{rank}(Y)$. We also shed light on how FA works. It is observed empirically that the forward weight matrices and (random) feedback matrices come closer during FA updates. Our analysis rigorously derives this phenomenon and shows how it facilitates convergence of FA*, a closely related variant of FA. We also show that FA can be far from optimal when $r < \mbox{rank}(Y)$. This is the first provable separation result between gradient descent and FA. Moreover, the representations found by gradient descent and FA can be almost orthogonal even when their error $\|ZW-Y\|_F$ is approximately equal. As a corollary, these results also hold for training two-layer linear neural networks when the training input is isotropic, and the output is a linear function of the input.

Preprint, 2022, arXiv

Provable Lifelong Learning of Representations

In lifelong learning, tasks (or classes) to be learned arrive sequentially over time in arbitrary order. During training, knowledge from previous tasks can be captured and transferred to subsequent ones to improve sample efficiency. We consider the setting where all target tasks can be represented in the span of a small number of unknown linear or nonlinear features of the input data. We propose a lifelong learning algorithm that maintains and refines the internal feature representation. We prove that for any desired accuracy on all tasks, the dimension of the representation remains close to that of the underlying representation. The resulting sample complexity improves significantly on existing bounds. In the setting of linear features, our algorithm is provably efficient and the sample complexity for input dimension $d$, $m$ tasks with $k$ features up to error $ε$ is $\tilde{O}(dk^{1.5}/ε+km/ε)$. We also prove a matching lower bound for any lifelong learning algorithm that uses a single task learner as a black box. We complement our analysis with an empirical study, including a heuristic lifelong learning algorithm for deep neural networks. Our method performs favorably on challenging realistic image datasets compared to state-of-the-art continual learning methods.

Preprint, 2022, arXiv

Rapid Convergence of the Unadjusted Langevin Algorithm: Isoperimetry Suffices

We study the Unadjusted Langevin Algorithm (ULA) for sampling from a probability distribution $ν= e^{-f}$ on $\mathbb{R}^n$. We prove a convergence guarantee in Kullback-Leibler (KL) divergence assuming $ν$ satisfies a log-Sobolev inequality and the Hessian of $f$ is bounded. Notably, we do not assume convexity or bounds on higher derivatives. We also prove convergence guarantees in Rényi divergence of order $q > 1$ assuming the limit of ULA satisfies either the log-Sobolev or Poincaré inequality. We also prove a bound on the bias of the limiting distribution of ULA assuming third-order smoothness of $f$, without requiring isoperimetry.
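For readers unfamiliar with ULA, the standard update is $x_{k+1} = x_k - h \nabla f(x_k) + \sqrt{2h}\,\xi_k$ with $\xi_k$ a standard Gaussian vector and $h$ the step size. The sketch below is a minimal illustration of that update, not code from the paper. The names `ula_sample` and `grad_quadratic` are hypothetical, and the target $f(x) = \|x\|^2/2$ (a standard Gaussian) is chosen here only as an example.

```python
import math
import random


def ula_sample(grad_f, x0, step, num_steps, rng):
    """Run num_steps of the Unadjusted Langevin Algorithm from x0.

    grad_f maps a list of floats to the gradient of f at that point.
    Returns the final iterate as a list of floats.
    """
    x = list(x0)
    noise_scale = math.sqrt(2.0 * step)
    for _ in range(num_steps):
        g = grad_f(x)
        # Gradient step on f plus Gaussian noise; no accept/reject step.
        x = [xi - step * gi + noise_scale * rng.gauss(0.0, 1.0)
             for xi, gi in zip(x, g)]
    return x


def grad_quadratic(x):
    """Gradient of the example potential f(x) = ||x||^2 / 2."""
    return list(x)
```

A call such as `ula_sample(grad_quadratic, [0.0, 0.0], 0.1, 1000, random.Random(0))` returns one approximate sample. For this quadratic example the chain is linear, and its stationary variance per coordinate works out to $1/(1 - h/2)$ rather than 1, a small instance of the step-size bias that the abstract refers to.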

Preprint, 2016, arXiv

Accelerated Newton Iteration: Roots of Black Box Polynomials and Matrix Eigenvalues

We study the problem of computing the largest root of a real rooted polynomial $p(x)$ to within error $\varepsilon $ given only black box access to it, i.e., for any $x \in {\mathbb R}$, the algorithm can query an oracle for the value of $p(x)$, but the algorithm is not allowed access to the coefficients of $p(x)$. A folklore result for this problem is that the largest root of a polynomial can be computed in $O(n \log (1/\varepsilon ))$ polynomial queries using the Newton iteration. We give a simple algorithm that queries the oracle at only $O(\log n \log(1/\varepsilon ))$ points, where $n$ is the degree of the polynomial. Our algorithm is based on a novel approach for accelerating the Newton method by using higher derivatives. As a special case, we consider the problem of computing the top eigenvalue of a symmetric matrix in ${\mathbb Q}^{n \times n}$ to within error $\varepsilon $ in time polynomial in the input description, i.e., the number of bits to describe the matrix and $\log(1/\varepsilon )$. Well-known methods such as the power iteration and Lanczos iteration incur running time polynomial in $1/\varepsilon $, while Gaussian elimination takes $Ω(n^4)$ bit operations. As a corollary of our main result, we obtain a $\tilde{O}(n^ω \log^2 ( ||A||_F/\varepsilon ))$ bit complexity algorithm to compute the top eigenvalue of the matrix $A$ or to check if it is approximately PSD ($A \succeq -\varepsilon I$).

Preprint, 2015, arXiv

Max vs Min: Tensor Decomposition and ICA with nearly Linear Sample Complexity

We present a simple, general technique for reducing the sample complexity of matrix and tensor decomposition algorithms applied to distributions. We use the technique to give a polynomial-time algorithm for standard ICA with sample complexity nearly linear in the dimension, thereby improving substantially on previous bounds. The analysis is based on properties of random polynomials, namely the spacings of an ensemble of polynomials. Our technique also applies to other applications of tensor decompositions, including spherical Gaussian mixture models.

Preprint, 2014, arXiv

Unsupervised Learning through Prediction in a Model of Cortex

We propose a primitive called PJOIN, for "predictive join," which combines and extends the operations JOIN and LINK, which Valiant proposed as the basis of a computational theory of cortex. We show that PJOIN can be implemented in Valiant's model. We also show that, using PJOIN, certain reasonably complex learning and pattern matching tasks can be performed, in a way that involves phenomena which have been observed in cognition and the brain, namely memory-based prediction and downward traffic in the cortical hierarchy.

Preprint, 2012, arXiv

Structure from Local Optima: Learning Subspace Juntas via Higher Order PCA

We present a generalization of the well-known problem of learning $k$-juntas in $\mathbb{R}^n$, and a novel tensor algorithm for unraveling the structure of high-dimensional distributions. Our algorithm can be viewed as a higher-order extension of Principal Component Analysis (PCA). Our motivating problem is learning a labeling function in $\mathbb{R}^n$, which is determined by an unknown $k$-dimensional subspace. This problem of learning a $k$-subspace junta is a common generalization of learning a $k$-junta (a function of $k$ coordinates in $\mathbb{R}^n$) and learning intersections of $k$ halfspaces. In this context, we introduce an irrelevant noisy attributes model where the distribution over the "relevant" $k$-dimensional subspace is independent of the distribution over the $(n-k)$-dimensional "irrelevant" subspace orthogonal to it. We give a spectral tensor algorithm which identifies the relevant subspace, and thereby learns $k$-subspace juntas under some additional assumptions. We do this by exploiting the structure of local optima of higher moment tensors over the unit sphere; PCA finds the global optima of the second moment tensor (covariance matrix). Our main result is that when the distribution in the irrelevant $(n-k)$-dimensional subspace is any Gaussian, the complexity of our algorithm is $T(k,ε) + \mathrm{poly}(n)$, where $T$ is the complexity of learning the concept in $k$ dimensions, and the polynomial is a function of the $k$-dimensional concept class being learned. This substantially generalizes existing results on learning low-dimensional concepts.
