Source author record

Navin Goyal

Navin Goyal appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Data Structures and Algorithms Machine Learning Computational Complexity Artificial Intelligence Computation Computation and Language Discrete Mathematics Distributed, Parallel, and Cluster Computing math.OC math.ST Networking and Internet Architecture Statistics Theory

Catalog footprint

What is connected

19works

12topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2022arXiv

Revisiting the Compositional Generalization Abilities of Neural Sequence Models

Compositional generalization is a fundamental trait in humans, allowing us to effortlessly combine known phrases to form novel sentences. Recent works have claimed that standard seq-to-seq models severely lack the ability to compositionally generalize. In this paper, we focus on one-shot primitive generalization as introduced by the popular SCAN benchmark. We demonstrate that modifying the training distribution in simple and intuitive ways enables standard seq-to-seq models to achieve near-perfect generalization performance, thereby showing that their compositional generalization abilities were previously underestimated. We perform detailed empirical analysis of this phenomenon. Our results indicate that the generalization performance of models is highly sensitive to the characteristics of the training data which should be carefully considered while designing such benchmarks in future.

preprint2020arXiv

Effect of Activation Functions on the Training of Overparametrized Neural Nets

It is well-known that overparametrized neural networks trained using gradient-based methods quickly achieve small training error with appropriate hyperparameter settings. Recent papers have proved this statement theoretically for highly overparametrized networks under reasonable assumptions. These results either assume that the activation function is ReLU or they crucially depend on the minimum eigenvalue of a certain Gram matrix depending on the data, random initialization and the activation function. In the later case, existing works only prove that this minimum eigenvalue is non-zero and do not provide quantitative bounds. On the empirical side, a contemporary line of investigations has proposed a number of alternative activation functions which tend to perform better than ReLU at least in some settings but no clear understanding has emerged. This state of affairs underscores the importance of theoretically understanding the impact of activation functions on training. In the present paper, we provide theoretical results about the effect of activation function on the training of highly overparametrized 2-layer neural networks. A crucial property that governs the performance of an activation is whether or not it is smooth. For non-smooth activations such as ReLU, SELU and ELU, all eigenvalues of the associated Gram matrix are large under minimal assumptions on the data. For smooth activations such as tanh, swish and polynomials, the situation is more complex. If the subspace spanned by the data has small dimension then the minimum eigenvalue of the Gram matrix can be small leading to slow training. But if the dimension is large and the data satisfies another mild condition, then the eigenvalues are large. If we allow deep networks, then the small data dimension is not a limitation provided that the depth is sufficient. We discuss a number of extensions and applications of these results.

preprint2020arXiv

Robust Identifiability in Linear Structural Equation Models of Causal Inference

In this work, we consider the problem of robust parameter estimation from observational data in the context of linear structural equation models (LSEMs). LSEMs are a popular and well-studied class of models for inferring causality in the natural and social sciences. One of the main problems related to LSEMs is to recover the model parameters from the observational data. Under various conditions on LSEMs and the model parameters the prior work provides efficient algorithms to recover the parameters. However, these results are often about generic identifiability. In practice, generic identifiability is not sufficient and we need robust identifiability: small changes in the observational data should not affect the parameters by a huge amount. Robust identifiability has received far less attention and remains poorly understood. Sankararaman et al. (2019) recently provided a set of sufficient conditions on parameters under which robust identifiability is feasible. However, a limitation of their work is that their results only apply to a small sub-class of LSEMs, called ``bow-free paths.'' In this work, we significantly extend their work along multiple dimensions. First, for a large and well-studied class of LSEMs, namely ``bow free'' models, we provide a sufficient condition on model parameters under which robust identifiability holds, thereby removing the restriction of paths required by prior work. We then show that this sufficient condition holds with high probability which implies that for a large set of parameters robust identifiability holds and that for such parameters, existing algorithms already achieve robust identifiability. Finally, we validate our results on both simulated and real-world datasets.

preprint2020arXiv

Stability of Linear Structural Equation Models of Causal Inference

We consider the numerical stability of the parameter recovery problem in Linear Structural Equation Model ($\LSEM$) of causal inference. A long line of work starting from Wright (1920) has focused on understanding which sub-classes of $\LSEM$ allow for efficient parameter recovery. Despite decades of study, this question is not yet fully resolved. The goal of this paper is complementary to this line of work; we want to understand the stability of the recovery problem in the cases when efficient recovery is possible. Numerical stability of Pearl's notion of causality was first studied in Schulman and Srivastava (2016) using the concept of condition number where they provide ill-conditioned examples. In this work, we provide a condition number analysis for the $\LSEM$. First we prove that under a sufficient condition, for a certain sub-class of $\LSEM$ that are \emph{bow-free} (Brito and Pearl (2002)), the parameter recovery is stable. We further prove that \emph{randomly} chosen input parameters for this family satisfy the condition with a substantial probability. Hence for this family, on a large subset of parameter space, recovery is numerically stable. Next we construct an example of $\LSEM$ on four vertices with \emph{unbounded} condition number. We then corroborate our theoretical findings via simulations as well as real-world experiments for a sociology application. Finally, we provide a general heuristic for estimating the condition number of any $\LSEM$ instance.

preprint2019arXiv

Sampling and Optimization on Convex Sets in Riemannian Manifolds of Non-Negative Curvature

The Euclidean space notion of convex sets (and functions) generalizes to Riemannian manifolds in a natural sense and is called geodesic convexity. Extensively studied computational problems such as convex optimization and sampling in convex sets also have meaningful counterparts in the manifold setting. Geodesically convex optimization is a well-studied problem with ongoing research and considerable recent interest in machine learning and theoretical computer science. In this paper, we study sampling and convex optimization problems over manifolds of non-negative curvature proving polynomial running time in the dimension and other relevant parameters. Our algorithms assume a warm start. We first present a random walk based sampling algorithm and then combine it with simulated annealing for solving convex optimization problems. To our knowledge, these are the first algorithms in the general setting of positively curved manifolds with provable polynomial guarantees under reasonable assumptions, and the first study of the connection between sampling and optimization in this setting.

preprint2016arXiv

Better Analysis of GREEDY Binary Search Tree on Decomposable Sequences

In their seminal paper [Sleator and Tarjan, J.ACM, 1985], the authors conjectured that the splay tree is dynamically optimal binary search tree (BST). In spite of decades of intensive research, the problem remains open. Perhaps a more basic question, which has also attracted much attention, is if there exists any dynamically optimal BST algorithm. One such candidate is GREEDY which is a simple and intuitive BST algorithm [Lucas, Rutgers Tech. Report, 1988; Munro, ESA, 2000; Demaine, Harmon, Iacono, Kane and Patrascu, SODA, 2009]. [Demaine et al., SODA, 2009] showed a novel connection between a geometric problem. Since dynamic optimality conjecture in its most general form remains elusive despite much effort, researchers have studied this problem on special sequences. Recently, [Chalermsook, Goswami, Kozma, Mehlhorn and Saranurak, FOCS, 2015] studied a type of sequences known as $k$-{\em decomposable sequences} in this context, where $k$ parametrizes easiness of the sequence. Using tools from forbidden submatrix theory, they showed that GREEDY takes $n2^{O(k^2)}$ time on this sequence and explicitly raised the question of improving this bound. In this paper, we show that GREEDY takes $O(n \log{k})$ time on $k$-decomposable sequences. In contrast to the previous approach, ours is based on first principles. One of the main ingredients of our result is a new construction of a lower bound certificate on the performance of any algorithm. This certificate is constructed using the execution of GREEDY, and is more nuanced and possibly more flexible than the previous independent set certificate of Demaine et al. This result, which is applicable to all sequences, may be of independent interest and may lead to further progress in analyzing GREEDY on $k$-decomposable as well as general sequences.

preprint2015arXiv

Heavy-tailed Independent Component Analysis

Independent component analysis (ICA) is the problem of efficiently recovering a matrix $A \in \mathbb{R}^{n\times n}$ from i.i.d. observations of $X=AS$ where $S \in \mathbb{R}^n$ is a random vector with mutually independent coordinates. This problem has been intensively studied, but all existing efficient algorithms with provable guarantees require that the coordinates $S_i$ have finite fourth moments. We consider the heavy-tailed ICA problem where we do not make this assumption, about the second moment. This problem also has received considerable attention in the applied literature. In the present work, we first give a provably efficient algorithm that works under the assumption that for constant $γ> 0$, each $S_i$ has finite $(1+γ)$-moment, thus substantially weakening the moment requirement condition for the ICA problem to be solvable. We then give an algorithm that works under the assumption that matrix $A$ has orthogonal columns but requires no moment assumptions. Our techniques draw ideas from convex geometry and exploit standard properties of the multivariate spherical Gaussian distribution in a novel way.

preprint2014arXiv

Fourier PCA and Robust Tensor Decomposition

Fourier PCA is Principal Component Analysis of a matrix obtained from higher order derivatives of the logarithm of the Fourier transform of a distribution.We make this method algorithmic by developing a tensor decomposition method for a pair of tensors sharing the same vectors in rank-$1$ decompositions. Our main application is the first provably polynomial-time algorithm for underdetermined ICA, i.e., learning an $n \times m$ matrix $A$ from observations $y=Ax$ where $x$ is drawn from an unknown product distribution with arbitrary non-Gaussian components. The number of component distributions $m$ can be arbitrarily higher than the dimension $n$ and the columns of $A$ only need to satisfy a natural and efficiently verifiable nondegeneracy condition. As a second application, we give an alternative algorithm for learning mixtures of spherical Gaussians with linearly independent means. These results also hold in the presence of Gaussian noise.

preprint2014arXiv

On Computing Maximal Independent Sets of Hypergraphs in Parallel

Whether or not the problem of finding maximal independent sets (MIS) in hypergraphs is in (R)NC is one of the fundamental problems in the theory of parallel computing. Unlike the well-understood case of MIS in graphs, for the hypergraph problem, our knowledge is quite limited despite considerable work. It is known that the problem is in \emph{RNC} when the edges of the hypergraph have constant size. For general hypergraphs with $n$ vertices and $m$ edges, the fastest previously known algorithm works in time $O(\sqrt{n})$ with $\text{poly}(m,n)$ processors. In this paper we give an EREW PRAM algorithm that works in time $n^{o(1)}$ with $\text{poly}(m,n)$ processors on general hypergraphs satisfying $m \leq n^{\frac{\log^{(2)}n}{8(\log^{(3)}n)^2}}$, where $\log^{(2)}n = \log\log n$ and $\log^{(3)}n = \log\log\log n$. Our algorithm is based on a sampling idea that reduces the dimension of the hypergraph and employs the algorithm for constant dimension hypergraphs as a subroutine.

preprint2014arXiv

The More, the Merrier: the Blessing of Dimensionality for Learning Large Gaussian Mixtures

In this paper we show that very large mixtures of Gaussians are efficiently learnable in high dimension. More precisely, we prove that a mixture with known identical covariance matrices whose number of components is a polynomial of any fixed degree in the dimension n is polynomially learnable as long as a certain non-degeneracy condition on the means is satisfied. It turns out that this condition is generic in the sense of smoothed complexity, as soon as the dimensionality of the space is high enough. Moreover, we prove that no such condition can possibly exist in low dimension and the problem of learning the parameters is generically hard. In contrast, much of the existing work on Gaussian Mixtures relies on low-dimensional projections and thus hits an artificial barrier. Our main result on mixture recovery relies on a new "Poissonization"-based technique, which transforms a mixture of Gaussians to a linear map of a product distribution. The problem of learning this map can be efficiently solved using some recent results on tensor decompositions and Independent Component Analysis (ICA), thus giving an algorithm for recovering the mixture. In addition, we combine our low-dimensional hardness results for Gaussian mixtures with Poissonization to show how to embed difficult instances of low-dimensional Gaussian mixtures into the ICA setting, thus establishing exponential information-theoretic lower bounds for underdetermined ICA in low dimension. To the best of our knowledge, this is the first such result in the literature. In addition to contributing to the problem of Gaussian mixture learning, we believe that this work is among the first steps toward better understanding the rare phenomenon of the "blessing of dimensionality" in the computational aspects of statistical inference.

preprint2014arXiv

Thompson Sampling for Contextual Bandits with Linear Payoffs

Thompson Sampling is one of the oldest heuristics for multi-armed bandit problems. It is a randomized algorithm based on Bayesian ideas, and has recently generated significant interest after several studies demonstrated it to have better empirical performance compared to the state-of-the-art methods. However, many questions regarding its theoretical performance remained open. In this paper, we design and analyze a generalization of Thompson Sampling algorithm for the stochastic contextual multi-armed bandit problem with linear payoff functions, when the contexts are provided by an adaptive adversary. This is among the most important and widely studied versions of the contextual bandits problem. We provide the first theoretical guarantees for the contextual version of Thompson Sampling. We prove a high probability regret bound of $\tilde{O}(d^{3/2}\sqrt{T})$ (or $\tilde{O}(d\sqrt{T \log(N)})$), which is the best regret bound achieved by any computationally efficient algorithm available for this problem in the current literature, and is within a factor of $\sqrt{d}$ (or $\sqrt{\log(N)}$) of the information-theoretic lower bound for this problem.

preprint2013arXiv

Annotations for Sparse Data Streams

Motivated by cloud computing, a number of recent works have studied annotated data streams and variants thereof. In this setting, a computationally weak verifier (cloud user), lacking the resources to store and manipulate his massive input locally, accesses a powerful but untrusted prover (cloud service). The verifier must work within the restrictive data streaming paradigm. The prover, who can annotate the data stream as it is read, must not just supply the answer but also convince the verifier of its correctness. Ideally, both the amount of annotation and the space used by the verifier should be sublinear in the relevant input size parameters. A rich theory of such algorithms -- which we call schemes -- has emerged. Prior work has shown how to leverage the prover's power to efficiently solve problems that have no non-trivial standard data stream algorithms. However, while optimal schemes are now known for several basic problems, such optimality holds only for streams whose length is commensurate with the size of the data universe. In contrast, many real-world datasets are relatively sparse, including graphs that contain only O(n^2) edges, and IP traffic streams that contain much fewer than the total number of possible IP addresses, 2^128 in IPv6. We design the first schemes that allow both the annotation and the space usage to be sublinear in the total number of stream updates rather than the size of the data universe. We solve significant problems, including variations of INDEX, SET-DISJOINTNESS, and FREQUENCY-MOMENTS, plus several natural problems on graphs. On the other hand, we give a new lower bound that, for the first time, rules out smooth tradeoffs between annotation and space usage for a specific problem. Our technique brings out new nuances in Merlin-Arthur communication complexity models, and provides a separation between online versions of the MA and AMA models.

preprint2013arXiv

Dynamic vs Oblivious Routing in Network Design

Consider the robust network design problem of finding a minimum cost network with enough capacity to route all traffic demand matrices in a given polytope. We investigate the impact of different routing models in this robust setting: in particular, we compare \emph{oblivious} routing, where the routing between each terminal pair must be fixed in advance, to \emph{dynamic} routing, where routings may depend arbitrarily on the current demand. Our main result is a construction that shows that the optimal cost of such a network based on oblivious routing (fractional or integral) may be a factor of $\BigOmega(\log{n})$ more than the cost required when using dynamic routing. This is true even in the important special case of the asymmetric hose model. This answers a question in \cite{chekurisurvey07}, and is tight up to constant factors. Our proof technique builds on a connection between expander graphs and robust design for single-sink traffic patterns \cite{ChekuriHardness07}.

preprint2013arXiv

Efficient learning of simplices

We show an efficient algorithm for the following problem: Given uniformly random points from an arbitrary n-dimensional simplex, estimate the simplex. The size of the sample and the number of arithmetic operations of our algorithm are polynomial in n. This answers a question of Frieze, Jerrum and Kannan [FJK]. Our result can also be interpreted as efficiently learning the intersection of n+1 half-spaces in R^n in the model where the intersection is bounded and we are given polynomially many uniform samples from it. Our proof uses the local search technique from Independent Component Analysis (ICA), also used by [FJK]. Unlike these previous algorithms, which were based on analyzing the fourth moment, ours is based on the third moment. We also show a direct connection between the problem of learning a simplex and ICA: a simple randomized reduction to ICA from the problem of learning a simplex. The connection is based on a known representation of the uniform measure on a simplex. Similar representations lead to a reduction from the problem of learning an affine transformation of an n-dimensional l_p ball to ICA.

preprint2012arXiv

Analysis of Thompson Sampling for the multi-armed bandit problem

The multi-armed bandit problem is a popular model for studying exploration/exploitation trade-off in sequential decision problems. Many algorithms are now available for this well-studied problem. One of the earliest algorithms, given by W. R. Thompson, dates back to 1933. This algorithm, referred to as Thompson Sampling, is a natural Bayesian algorithm. The basic idea is to choose an arm to play according to its probability of being the best arm. Thompson Sampling algorithm has experimentally been shown to be close to optimal. In addition, it is efficient to implement and exhibits several desirable properties such as small regret for delayed feedback. However, theoretical understanding of this algorithm was quite limited. In this paper, for the first time, we show that Thompson Sampling algorithm achieves logarithmic expected regret for the multi-armed bandit problem. More precisely, for the two-armed bandit problem, the expected regret in time $T$ is $O(\frac{\ln T}Δ + \frac{1}{Δ^3})$. And, for the $N$-armed bandit problem, the expected regret in time $T$ is $O([(\sum_{i=2}^N \frac{1}{Δ_i^2})^2] \ln T)$. Our bounds are optimal but for the dependence on $Δ_i$ and the constant factors in big-Oh.

preprint2012arXiv

Further Optimal Regret Bounds for Thompson Sampling

Thompson Sampling is one of the oldest heuristics for multi-armed bandit problems. It is a randomized algorithm based on Bayesian ideas, and has recently generated significant interest after several studies demonstrated it to have better empirical performance compared to the state of the art methods. In this paper, we provide a novel regret analysis for Thompson Sampling that simultaneously proves both the optimal problem-dependent bound of $(1+ε)\sum_i \frac{\ln T}{Δ_i}+O(\frac{N}{ε^2})$ and the first near-optimal problem-independent bound of $O(\sqrt{NT\ln T})$ on the expected regret of this algorithm. Our near-optimal problem-independent bound solves a COLT 2012 open problem of Chapelle and Li. The optimal problem-dependent regret bound for this problem was first proven recently by Kaufmann et al. [ALT 2012]. Our novel martingale-based analysis techniques are conceptually simple, easily extend to distributions other than the Beta distribution, and also extend to the more general contextual bandits setting [Manuscript, Agrawal and Goyal, 2012].

preprint2011arXiv

Lower Bounds for the Average and Smoothed Number of Pareto Optima

Smoothed analysis of multiobjective 0-1 linear optimization has drawn considerable attention recently. The number of Pareto-optimal solutions (i.e., solutions with the property that no other solution is at least as good in all the coordinates and better in at least one) for multiobjective optimization problems is the central object of study. In this paper, we prove several lower bounds for the expected number of Pareto optima. Our basic result is a lower bound of Ω_d(n^(d-1)) for optimization problems with d objectives and n variables under fairly general conditions on the distributions of the linear objectives. Our proof relates the problem of lower bounding the number of Pareto optima to results in geometry connected to arrangements of hyperplanes. We use our basic result to derive (1) To our knowledge, the first lower bound for natural multiobjective optimization problems. We illustrate this for the maximum spanning tree problem with randomly chosen edge weights. Our technique is sufficiently flexible to yield such lower bounds for other standard objective functions studied in this setting (such as, multiobjective shortest path, TSP tour, matching). (2) Smoothed lower bound of min {Ω_d(n^(d-1.5) ϕ^{(d-log d) (1-Θ(1/ϕ))}), 2^{Θ(n)}}$ for the 0-1 knapsack problem with d profits for phi-semirandom distributions for a version of the knapsack problem. This improves the recent lower bound of Brunsch and Roeglin.

preprint2011arXiv

On Dynamic Optimality for Binary Search Trees

Does there exist O(1)-competitive (self-adjusting) binary search tree (BST) algorithms? This is a well-studied problem. A simple offline BST algorithm GreedyFuture was proposed independently by Lucas and Munro, and they conjectured it to be O(1)-competitive. Recently, Demaine et al. gave a geometric view of the BST problem. This view allowed them to give an online algorithm GreedyArb with the same cost as GreedyFuture. However, no o(n)-competitive ratio was known for GreedyArb. In this paper we make progress towards proving O(1)-competitive ratio for GreedyArb by showing that it is O(\log n)-competitive.

preprint2010arXiv

Satisfiability Thresholds for k-CNF Formula with Bounded Variable Intersections

We determine the thresholds for the number of variables, number of clauses, number of clause intersection pairs and the maximum clause degree of a k-CNF formula that guarantees satisfiability under the assumption that every two clauses share at most $α$ variables. More formally, we call these formulas $α$-intersecting and define, for example, a threshold $μ_i(k,α)$ for the number of clause intersection pairs $i$, such that every $α$-intersecting k-CNF formula in which at most $μ_i(k,α)$ pairs of clauses share a variable is satisfiable and there exists an unsatisfiable $α$-intersecting k-CNF formula with $μ_m(k,α)$ such intersections. We provide a lower bound for these thresholds based on the Lovasz Local Lemma and a nearly matching upper bound by constructing an unsatisfiable k-CNF to show that $μ_i(k,α) = \tildeΘ(2^{k(2+1/α)})$. Similar thresholds are determined for the number of variables ($μ_n = \tildeΘ(2^{k/α})$) and the number of clauses ($μ_m = \tildeΘ(2^{k(1+\frac{1}α)})$) (see [Scheder08] for an earlier but independent report on this threshold). Our upper bound construction gives a family of unsatisfiable formula that achieve all four thresholds simultaneously.

Navin Goyal

What is connected

Connect this record

See the researcher in context

Building this map preview

19 published item(s)

Revisiting the Compositional Generalization Abilities of Neural Sequence Models

Effect of Activation Functions on the Training of Overparametrized Neural Nets

Robust Identifiability in Linear Structural Equation Models of Causal Inference

Stability of Linear Structural Equation Models of Causal Inference

Sampling and Optimization on Convex Sets in Riemannian Manifolds of Non-Negative Curvature

Better Analysis of GREEDY Binary Search Tree on Decomposable Sequences

Heavy-tailed Independent Component Analysis

Fourier PCA and Robust Tensor Decomposition

On Computing Maximal Independent Sets of Hypergraphs in Parallel

The More, the Merrier: the Blessing of Dimensionality for Learning Large Gaussian Mixtures

Thompson Sampling for Contextual Bandits with Linear Payoffs

Annotations for Sparse Data Streams

Dynamic vs Oblivious Routing in Network Design

Efficient learning of simplices

Analysis of Thompson Sampling for the multi-armed bandit problem

Further Optimal Regret Bounds for Thompson Sampling

Lower Bounds for the Average and Smoothed Number of Pareto Optima

On Dynamic Optimality for Binary Search Trees

Satisfiability Thresholds for k-CNF Formula with Bounded Variable Intersections