Source author record

Rina Panigrahy

Rina Panigrahy appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Machine Learning, Data Structures and Algorithms, Formal Languages and Automata Theory, Artificial Intelligence, Computation and Language, Computational Complexity, Computational Geometry, Computer Science and Game Theory, Digital Libraries, eess.AS, Information Retrieval, Logic in Computer Science, physics.soc-ph, Sound

Catalog footprint

What is connected

16 works

14 topics

4 close collaborators

preprint · 2026 · arXiv

Simple Mechanisms for Representing, Indexing and Manipulating Concepts

Supervised and unsupervised learning using deep neural networks typically aims to exploit the underlying structure in the training data; this structure is often explained using a latent generative process that produces the data, and the generative process is often hierarchical, involving latent concepts. Despite the significant work on understanding the learning of the latent structure and underlying concepts using theory and experiments, a framework that mathematically captures the definition of a concept and provides ways to operate on concepts is missing. In this work, we propose to characterize a simple primitive concept by the zero set of a collection of polynomials and use moment statistics of the data to uniquely represent the concepts; we show how this view can be used to obtain a signature of the concept. These signatures can be used to discover a common structure across the set of concepts and could recursively produce the signature of higher-level concepts from the signatures of lower-level concepts. To utilize such desired properties, we propose a method that keeps a dictionary of concepts, and we show that the proposed method can learn different types of hierarchical structures of the data.

preprint · 2022 · arXiv

A Theoretical View on Sparsely Activated Networks

Deep and wide neural networks successfully fit very complex functions today, but dense models are starting to be prohibitively expensive for inference. To mitigate this, one promising direction is networks that activate a sparse subgraph of the network. The subgraph is chosen by a data-dependent routing function, enforcing a fixed mapping of inputs to subnetworks (e.g., the Mixture of Experts (MoE) paradigm in Switch Transformers). However, prior work is largely empirical, and while existing routing functions work well in practice, they do not lead to theoretical guarantees on approximation ability. We aim to provide a theoretical explanation for the power of sparse networks. As our first contribution, we present a formal model of data-dependent sparse networks that captures salient aspects of popular architectures. We then introduce a routing function based on locality sensitive hashing (LSH) that enables us to reason about how well sparse networks approximate target functions. After representing LSH-based sparse networks with our model, we prove that sparse networks can match the approximation power of dense networks on Lipschitz functions. Applying LSH on the input vectors means that the experts interpolate the target function in different subregions of the input space. To support our theory, we define various datasets based on Lipschitz target functions, and we show that sparse networks give a favorable trade-off between number of active units and approximation quality.
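The routing idea in this abstract can be illustrated with a minimal sketch. It assumes random-hyperplane (sign) hashing, which is one common LSH family; the abstract does not say which family the paper uses, and the names `make_hyperplanes` and `route` are hypothetical. Each input is mapped to an expert index by the sign pattern of a few random projections, so nearby inputs tend to reach the same expert.

```python
import random

def make_hyperplanes(num_bits, dim, seed=0):
    """Draw num_bits random Gaussian hyperplanes in dim dimensions (assumed LSH family)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]

def route(x, hyperplanes):
    """Map input x to an expert index built from the signs of its projections."""
    index = 0
    for h in hyperplanes:
        dot = sum(hi * xi for hi, xi in zip(h, x))
        index = (index << 1) | (1 if dot >= 0 else 0)
    return index

# 3 hash bits give up to 2**3 = 8 experts, each responsible for one region of input space.
planes = make_hyperplanes(num_bits=3, dim=4)
x = [0.5, -1.0, 0.25, 2.0]
expert = route(x, planes)
```

Only the selected expert would then be evaluated on `x`, which is what keeps the number of active units small.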

preprint · 2022 · arXiv

A Unified Cascaded Encoder ASR Model for Dynamic Model Sizes

In this paper, we propose a dynamic cascaded encoder Automatic Speech Recognition (ASR) model, which unifies models for different deployment scenarios. Moreover, the model can significantly reduce model size and power consumption without loss of quality. Namely, with the dynamic cascaded encoder model, we explore three techniques to maximally boost the performance of each model size: 1) Use separate decoders for each sub-model while sharing the encoders; 2) Use funnel-pooling to improve the encoder efficiency; 3) Balance the size of causal and non-causal encoders to improve quality and fit deployment constraints. Overall, the proposed large-medium model has 30% smaller size and reduces power consumption by 33%, compared to the baseline cascaded encoder model. The triple-size model that unifies the large, medium, and small models achieves 37% total size reduction with minimal quality loss, while substantially reducing the engineering efforts of having separate models.

preprint · 2020 · arXiv

Learning the gravitational force law and other analytic functions

Large neural network models have been successful in learning functions of importance in many branches of science, including physics, chemistry and biology. Recent theoretical work has shown explicit learning bounds for wide networks and kernel methods on some simple classes of functions, but not on more complex functions which arise in practice. We extend these techniques to provide learning bounds for analytic functions on the sphere for any kernel method or equivalent infinitely-wide network with the corresponding activation function trained with SGD. We show that a wide, one-hidden layer ReLU network can learn analytic functions with a number of samples proportional to the derivative of a related function. Many functions important in the sciences are therefore efficiently learnable. As an example, we prove explicit bounds on learning the many-body gravitational force function given by Newton's law of gravitation. Our theoretical bounds suggest that very wide ReLU networks (and the corresponding NTK kernel) are better at learning analytic functions as compared to kernel learning with Gaussian kernels. We present experimental evidence that the many-body gravitational force function is easier to learn with ReLU networks as compared to networks with exponential activations.

preprint · 2020 · arXiv

Recovering the Lowest Layer of Deep Networks with High Threshold Activations

Giving provable guarantees for learning neural networks is a core challenge of machine learning theory. Most prior work gives parameter recovery guarantees for one-hidden-layer networks; however, the networks used in practice have multiple non-linear layers. In this work, we show how we can strengthen such results to deeper networks: we address the problem of uncovering the lowest layer in a deep neural network under the assumption that the lowest layer uses a high threshold before applying the activation, the upper network can be modeled as a well-behaved polynomial, and the input distribution is Gaussian.

preprint · 2014 · arXiv

Sparse Matrix Factorization

We investigate the problem of factorizing a matrix into several sparse matrices and propose an algorithm for this under randomness and sparsity assumptions. This problem can be viewed as a simplification of the deep learning problem where finding a factorization corresponds to finding edges in different layers and values of hidden units. We prove that under certain assumptions for a sparse linear deep network with $n$ nodes in each layer, our algorithm is able to recover the structure of the network and values of top layer hidden units for depths up to $\tilde O(n^{1/6})$. We further discuss the relation among sparse matrix factorization, deep learning, sparse recovery and dictionary learning.

preprint · 2013 · arXiv

A Differential Equations Approach to Optimizing Regret Trade-offs

We consider the classical question of predicting binary sequences and study the optimal algorithms for obtaining the best possible regret and payoff functions for this problem. The question turns out to be equivalent to the problem of optimal trade-offs between the regrets of two experts in an "experts problem", studied before by \cite{kearns-regret}. While, say, a regret of $Θ(\sqrt{T})$ is known, we argue that it is important to ask what the provably optimal algorithm for this problem is, both because it leads to natural algorithms and because regret is in fact often comparable in magnitude to the final payoffs and hence is a non-negligible term. In the basic setting, the result essentially follows from a classical result of Cover from '65. Here instead, we focus on another standard setting, of time-discounted payoffs, where the final "stopping time" is not specified. We exhibit an explicit characterization of the optimal regret for this setting. To obtain our main result, we show that the optimal payoff functions have to satisfy the Hermite differential equation, and hence are given by the solutions to this equation. It turns out that the characterization of the payoff function is qualitatively different from the classical (non-discounted) setting; namely, there is essentially a unique optimal solution.

preprint · 2013 · arXiv

Fractal structures in Adversarial Prediction

Fractals are self-similar recursive structures that have been used in modeling several real world processes. In this work we study how "fractal-like" processes arise in a prediction game where an adversary is generating a sequence of bits and an algorithm is trying to predict them. We will see that under a certain formalization of the predictive payoff for the algorithm it is optimal for the adversary to produce a fractal-like sequence to minimize the algorithm's ability to predict. Indeed it has been suggested before that financial markets exhibit a fractal-like behavior. We prove that a fractal-like distribution arises naturally out of an optimization from the adversary's perspective. In addition, we give optimal trade-offs between predictability and expected deviation (i.e., sum of bits) for our formalization of predictive payoff. This result is motivated by the observation that several time series exhibit higher deviations than expected for a completely random walk.

preprint · 2013 · arXiv

Optimal amortized regret in every interval

Consider the classical problem of predicting the next bit in a sequence of bits. A standard performance measure is regret (loss in payoff) with respect to a set of experts. For example, if we measure performance with respect to two constant experts, one that always predicts 0's and another that always predicts 1's, it is well known that one can get regret $O(\sqrt T)$ with respect to the best expert by using, say, the weighted majority algorithm. But this algorithm does not provide a performance guarantee in any interval. There are other algorithms that ensure regret $O(\sqrt {x \log T})$ in any interval of length $x$. In this paper we show a randomized algorithm that in an amortized sense gets a regret of $O(\sqrt x)$ for any interval when the sequence is partitioned into intervals arbitrarily. We empirically estimated the constant in the $O()$ for $T$ up to 2000 and found it to be small, around 2.1. We also experimentally evaluate the efficacy of this algorithm in predicting high frequency stock data.
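As a rough illustration of the baseline mentioned in this abstract, the sketch below runs an exponential-weights (Hedge) variant of weighted majority against the two constant experts. It is not the paper's interval algorithm. The function name `hedge_two_experts` and the learning rate are assumptions; regret is counted in expected mistakes, so regret in a +1/-1 payoff would be twice as large.

```python
import math

def hedge_two_experts(bits, eta=None):
    """Exponential-weights prediction against the 'always 0' and 'always 1' experts.

    Returns (expected_mistakes, regret), where regret is measured in mistakes
    relative to the better of the two constant experts.
    """
    T = len(bits)
    if eta is None:
        # Assumed tuning: the standard choice for N = 2 experts over T rounds.
        eta = math.sqrt(8.0 * math.log(2) / max(T, 1))
    w = [1.0, 1.0]          # weight of expert 0 and expert 1
    mistakes = [0, 0]       # mistakes made by each constant expert
    expected_mistakes = 0.0
    for b in bits:
        wrong = 1 - b       # the expert that mispredicts this bit
        expected_mistakes += w[wrong] / (w[0] + w[1])
        mistakes[wrong] += 1
        w[wrong] *= math.exp(-eta)
    return expected_mistakes, expected_mistakes - min(mistakes)

# An alternating sequence makes both constant experts equally bad.
bits = [0, 1] * 50
expected, regret = hedge_two_experts(bits)
```

With this tuning the standard Hedge analysis bounds the regret by $\sqrt{T \ln 2 / 2}$ over the whole sequence, but it says nothing about individual intervals, which is the gap the abstract addresses.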

preprint · 2012 · arXiv

A non-expert view on Turing machines, Proof Verifiers, and Mental reasoning

The paper explores known results related to the problem of identifying if a given program terminates on all inputs; this is a simple generalization of the halting problem. We will see how this problem is related to the notion of proof verifiers. We also see how verifying if a program is terminating involves reasoning through a tower of axiomatic theories; such towers of theories are known as Turing progressions and were first studied by Alan Turing in the 1930s. We will see that this process has a natural connection to ordinal numbers. The paper is presented from the perspective of a non-expert in the field of logic and proof theory.

preprint · 2012 · arXiv

Prediction strategies without loss

Consider a sequence of bits where we are trying to predict the next bit from the previous bits. Assume we are allowed to say 'predict 0' or 'predict 1', and our payoff is +1 if the prediction is correct and -1 otherwise. We will say that at each point in time the loss of an algorithm is the number of wrong predictions minus the number of right predictions so far. In this paper we are interested in algorithms that have essentially zero (expected) loss over any string at any point in time and yet have small regret with respect to always predicting 0 or always predicting 1. For a sequence of length $T$ our algorithm has regret $14εT$ and loss $2\sqrt{T}e^{-ε^2 T}$ in expectation for all strings. We show that the tradeoff between loss and regret is optimal up to constant factors. Our techniques extend to the general setting of $N$ experts, where the related problem of trading off regret to the best expert for regret to the 'special' expert has been studied by Even-Dar et al. (COLT'07). We obtain essentially zero loss with respect to the special expert and optimal loss/regret tradeoff, improving upon the results of Even-Dar et al. and settling the main question left open in their paper. The strong loss bounds of the algorithm have some surprising consequences. A simple iterative application of our algorithm gives essentially optimal regret bounds at multiple time scales, bounds with respect to $k$-shifting optima, as well as regret bounds with respect to higher norms of the input sequence.

preprint · 2012 · arXiv

The Mind Grows Circuits

There is a vast supply of prior art that studies models for mental processes. Some studies in psychology and philosophy approach it from an inner perspective in terms of experiences and percepts. Others, such as neurobiology or connectionist machines, approach it externally by viewing the mind as a complex circuit of neurons where each neuron is a primitive binary circuit. In this paper, we also model the mind as a place where a circuit grows, starting as a collection of primitive components at birth and then building up incrementally in a bottom-up fashion. A new node is formed by a simple composition of prior nodes when we undergo a repeated experience that can be described by that composition. Unlike neural networks, however, these circuits take "concepts" or "percepts" as inputs and outputs. Thus the growing circuits can be likened to a growing collection of lambda expressions that are built on top of one another in an attempt to compress the sensory input as a heuristic to bound its Kolmogorov Complexity.

preprint · 2011 · arXiv

Can Knowledge be preserved in the long run?

Can (scientific) knowledge be reliably preserved over the long term? We have today very efficient and reliable methods to encode, store and retrieve data in a storage medium that is fault tolerant against many types of failures. But does this guarantee (or does it even seem likely) that all knowledge can be preserved over thousands of years and beyond? History shows that many types of knowledge that were known before have been lost. We observe that the nature of stored and communicated information and the way it is interpreted are such that it always tends to decay and therefore must be lost eventually in the long term. The likely fundamental conclusion is that knowledge cannot be reliably preserved indefinitely.

preprint · 2010 · arXiv

Lower Bounds on Near Neighbor Search via Metric Expansion

In this paper we show how the complexity of performing nearest neighbor search (NNS) on a metric space is related to the expansion of the metric space. Given a metric space we look at the graph obtained by connecting every pair of points within a certain distance $r$. We then look at various notions of expansion in this graph, relating them to the cell probe complexity of NNS for randomized and deterministic, exact and approximate algorithms. For example, if the graph has node expansion $Φ$ then we show that any deterministic $t$-probe data structure for $n$ points must use space $S$ where $(St/n)^t > Φ$. We show similar results for randomized algorithms as well. These relationships can be used to derive most of the known lower bounds in the well known metric spaces such as $l_1$, $l_2$, $l_\infty$ by simply computing their expansion. In the process, we strengthen and generalize our previous results (FOCS 2008). Additionally, we unify the approach in that work and the communication complexity based approach. Our work reduces the problem of proving cell probe lower bounds of near neighbor search to computing the appropriate expansion parameter. In our results, as in all previous results, the dependence on $t$ is weak; that is, the bound drops exponentially in $t$. We show a much stronger (tight) time-space tradeoff for the class of dynamic low contention data structures. These are data structures that support updates to the data set and that do not look up any single cell too often.
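To make the space bound explicit, the inequality quoted in this abstract can be rearranged for $S$. This is only algebra on the stated condition, assuming $Φ > 0$ and $t \ge 1$; it is not an additional result from the paper.

```latex
\left(\frac{S t}{n}\right)^{t} > \Phi
\;\Longrightarrow\;
\frac{S t}{n} > \Phi^{1/t}
\;\Longrightarrow\;
S > \frac{n}{t}\,\Phi^{1/t}
```

For $t = 1$ this reads $S > nΦ$, and the exponent on $Φ$ shrinks as $1/t$ when more probes are allowed.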

preprint · 2010 · arXiv

Revisiting the Examination Hypothesis with Query Specific Position Bias

Click through rates (CTR) offer useful user feedback that can be used to infer the relevance of search results for queries. However, it is not very meaningful to look at the raw click through rate of a search result because the likelihood of a result being clicked depends not only on its relevance but also on the position in which it is displayed. One model of the browsing behavior, the Examination Hypothesis \cite{RDR07,Craswell08,DP08}, states that each position has a certain probability of being examined and is then clicked based on the relevance of the search snippets. This is based on eye tracking studies \cite{Claypool01, GJG04} which suggest that users are less likely to view results in lower positions. Such a position dependent variation in the probability of examining a document is referred to as position bias. Our main observation in this study is that the position bias tends to differ with the kind of information the user is looking for. This makes the position bias query specific. In this study, we present a model for analyzing a query specific position bias from the click data and use these biases to derive position independent relevance values of search results. Our model is based on the assumption that for a given query, the positional click through rate of a document is proportional to the product of its relevance and a query specific position bias. We compare our model with the vanilla examination hypothesis model (EH) on a set of queries obtained from search logs of a commercial search engine. We also compare it with the User Browsing Model (UBM) \cite{DP08}, which extends the cascade model of Craswell et al. \cite{Craswell08} by incorporating multiple clicks in a query session. We show that our model, although much simpler to implement, consistently outperforms both EH and UBM on well-used measures such as relative error and cross entropy.
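The modeling assumption in this abstract can be written out as follows. The symbols are introduced here for illustration and are not taken from the paper: $c_q(d,p)$ is the click through rate of document $d$ shown at position $p$ for query $q$, $r_q(d)$ is its relevance, and $b_q(p)$ is the query specific position bias.

```latex
c_q(d, p) \;\propto\; r_q(d)\, b_q(p)
\quad\Longrightarrow\quad
\frac{c_q(d, p)}{c_q(d, p')} = \frac{b_q(p)}{b_q(p')},
\qquad
\frac{c_q(d, p)}{c_q(d', p)} = \frac{r_q(d)}{r_q(d')}
```

Under this assumption, the same document observed at two positions reveals the ratio of the biases, and two documents observed at the same position reveal the ratio of their relevances, which is what allows position independent relevance values to be separated from the click data.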

preprint · 2010 · arXiv

Understanding Fashion Cycles as a Social Choice

We present a formal model for studying fashion trends, in terms of three parameters of fashionable items: (1) their innate utility; (2) individual boredom associated with repeated usage of an item; and (3) social influences associated with the preferences from other people. While there are several works that emphasize the effect of social influence in understanding fashion trends, in this paper we show how boredom plays a strong role in both individual and social choices. We show how boredom can be used to explain the cyclic choices in several scenarios such as an individual who has to pick a restaurant to visit every day, or a society that has to repeatedly 'vote' on a single fashion style from a collection. We formally show that a society that votes for a single fashion style can be viewed as a single individual cycling through different choices. In our model, the utility of an item gets discounted by the amount of boredom that has accumulated over the past; this boredom increases with every use of the item and decays exponentially when not used. We address the problem of optimally choosing items for usage, so as to maximize overall satisfaction, i.e., composite utility, over a period of time. First we show that the simple greedy heuristic of always choosing the item with the maximum current composite utility can be arbitrarily worse than the optimal. Second, we prove that even with just a single individual, determining the optimal strategy for choosing items is NP-hard. Third, we show that a simple modification to the greedy algorithm that simply doubles the boredom of each item is a provably close approximation to the optimal strategy. Finally, we present an experimental study over real-world data collected from query logs to compare our algorithms.
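The greedy heuristic described in this abstract can be sketched under simple assumptions. The abstract does not give the exact functional form, so the sketch assumes composite utility is innate utility minus weighted boredom, that each use adds a fixed `bump`, and that boredom of unused items is multiplied by `decay` each step. All names and parameter values are hypothetical.

```python
def greedy_schedule(utilities, steps, bump=1.0, decay=0.5, boredom_weight=1.0):
    """Repeatedly pick the item with the highest current composite utility.

    Composite utility is assumed to be utility - boredom_weight * boredom.
    Setting boredom_weight=2.0 mimics the 'doubled boredom' greedy variant.
    """
    boredom = [0.0] * len(utilities)
    picks = []
    for _ in range(steps):
        scores = [u - boredom_weight * b for u, b in zip(utilities, boredom)]
        choice = max(range(len(scores)), key=scores.__getitem__)
        picks.append(choice)
        # Boredom grows for the item just used and decays exponentially for the rest.
        boredom = [b + bump if i == choice else b * decay
                   for i, b in enumerate(boredom)]
    return picks

# Two items with similar innate utility: accumulated boredom forces the choice to cycle.
picks = greedy_schedule([1.0, 0.8], steps=4)
```

In this toy run the choice alternates between the two items, which matches the cyclic behavior the abstract attributes to boredom. The sketch does not reproduce the paper's hardness or approximation results.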
