Source author record

Sayan Mukherjee

Sayan Mukherjee appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Catalog footprint

What is connected

40 works

30 topics

4 close collaborators

preprint · 2026 · arXiv

At the Intersection of Deep Sequential Model Framework and State-space Model Framework: Study on Option Pricing

Inference and forecasting problems for nonlinear dynamical systems arise in a variety of contexts. On the one hand, reservoir computing and deep sequential models have demonstrated efficient, robust, and superior performance in modeling simple and chaotic dynamical systems. However, their inherently deterministic nature partially detracts from their robustness to noisy systems, and their inability to offer uncertainty measurement is a further shortcoming of the framework. On the other hand, the traditional state-space model framework is robust to noise and carries measured uncertainty, making it a natural complement to the reservoir computing and deep sequential model framework. We propose the unscented reservoir smoother (URS), a model that unifies deep sequential and state-space models to achieve the strengths of both frameworks. Evaluated in the option pricing setting on noisy datasets, URS achieves highly competitive forecasting accuracy, especially over longer horizons, and uncertainty measurement. Further extensions and implications of URS are also discussed, toward a full integration of both frameworks.

preprint · 2026 · arXiv

Synthesizing POMDP Policies: Sampling Meets Model-checking via Learning

Partially Observable Markov Decision Processes (POMDPs) are the standard framework for decision-making under uncertainty. While sampling-based methods scale well, they lack formal correctness guarantees, making them unsuitable for safety-critical applications. Conversely, formal synthesis techniques provide correctness-by-construction but often struggle with scalability, as general POMDP synthesis is undecidable. To bridge this gap, we propose a synthesis framework that integrates sampling, automata learning, and model-checking. Inspired by Angluin's $L^*$ algorithm, our approach utilizes sampling as a membership oracle and model-checking as an equivalence oracle. This enables the synthesis of finite-state controllers with formal guarantees, provided the sampling-induced policy is regular. We establish a relative completeness result for this framework. Experimental results from our prototypical implementation demonstrate that this method successfully solves threshold-safety problems that remain challenging for existing formal synthesis tools. We believe our algorithm serves as a valuable component in a portfolio approach to tackling the inherent difficulty of POMDP synthesis problems.

preprint · 2023 · arXiv

Extended probabilities and their application to statistical inference

We propose a new, more general definition of extended probability measures. We study their properties and provide a behavioral interpretation. We put them to use in an inference procedure, whose environment is canonically represented by the probability space $(Ω,\mathcal{F},P)$, when both $P$ and the composition of $Ω$ are unknown. We develop an ex ante analysis -- taking place before the statistical analysis requiring knowledge of $Ω$ -- in which the true composition of $Ω$ is progressively learned. We describe how to update extended probabilities in this setting, and introduce the concept of lower extended probabilities. We apply our findings to a species sampling problem and to the study of the boomerang effect (the empirical observation that sometimes persuasion yields the opposite effect: the persuaded agent moves their opinion away from the opinion of the persuading agent).

preprint · 2022 · arXiv

A Grover search-based algorithm for the list coloring problem

Graph coloring is a computationally difficult problem, and currently the best known classical algorithm for $k$-coloring of graphs on $n$ vertices has runtime $Ω(2^n)$ for $k\ge 5$. The list coloring problem asks the following more general question: given a list of available colors for each vertex in a graph, does it admit a proper coloring? We propose a quantum algorithm based on Grover search to quadratically speed up exhaustive search. Our algorithm loses in complexity to classical ones in specific restricted cases, but improves exhaustive search for cases where the lists and graphs considered are arbitrary in nature.
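
As a rough illustration of the "quadratic speedup over exhaustive search" claim, the sketch below enumerates list colorings classically and compares the size of the search space with the textbook Grover iteration estimate, about $(\pi/4)\sqrt{N/M}$ for $N$ candidates and $M$ solutions. It does not implement the quantum algorithm or its oracle; the function names and the example graph are assumptions made only for this illustration.

```python
import math
from itertools import product


def search_space_size(lists):
    """Number of candidate assignments: the product of the list sizes."""
    return math.prod(len(colors) for colors in lists)


def brute_force_list_coloring(edges, lists):
    """Classical exhaustive search: try every assignment, return the first
    proper list coloring as a tuple, or None if none exists."""
    for assignment in product(*lists):
        if all(assignment[u] != assignment[v] for u, v in edges):
            return assignment
    return None


def grover_iterations(n_candidates, n_solutions=1):
    """Textbook estimate of Grover iterations, floor((pi/4) * sqrt(N/M)).
    This is the standard Grover bound, not a figure taken from the paper."""
    return math.floor((math.pi / 4) * math.sqrt(n_candidates / n_solutions))


if __name__ == "__main__":
    # Hypothetical example: a triangle whose third vertex has a one-color list.
    triangle = [(0, 1), (1, 2), (0, 2)]
    lists = [[1, 2], [1, 2], [3]]
    print(brute_force_list_coloring(triangle, lists))
    print(search_space_size(lists), grover_iterations(search_space_size(lists)))
```

The classical loop may evaluate all $N$ assignments in the worst case, whereas the Grover estimate grows like $\sqrt{N}$, which is the sense in which the speedup is quadratic.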

preprint · 2022 · arXiv

Multiple testing with persistent homology

In this paper we propose a computationally efficient multiple hypothesis testing procedure for persistent homology. The computational efficiency of our procedure is based on the observation that one can empirically simulate a null distribution that is universal across many hypothesis testing applications involving persistent homology. Our observation suggests that one can simulate the null distribution efficiently based on a small number of summaries of the collected data and use this null in the same way that p-value tables were used in classical statistics. To illustrate the efficiency and utility of the null distribution we provide procedures for rejecting acyclicity with both control of the Family-Wise Error Rate (FWER) and the False Discovery Rate (FDR). We will argue that the empirical null we propose is very general conditional on a few summaries of the data based on simulations and limit theorems for persistent homology for point processes.

preprint · 2022 · arXiv

Tight query complexity bounds for learning graph partitions

Given a partition of a graph into connected components, the membership oracle asserts whether any two vertices of the graph lie in the same component or not. We prove that for $n\ge k\ge 2$, learning the components of an $n$-vertex hidden graph with $k$ components requires at least $(k-1)n-\binom k2$ membership queries. Our result improves on the best known information-theoretic bound of $Ω(n\log k)$ queries, and exactly matches the query complexity of the algorithm introduced by [Reyzin and Srivastava, 2007] for this problem. Additionally, we introduce an oracle, with access to which one can learn the number of components of $G$ in asymptotically fewer queries than learning the full partition, thus answering another question posed by the same authors. Lastly, we introduce a more applicable version of this oracle, and prove asymptotically tight bounds of $\widetilde{Θ}(m)$ queries for both learning and verifying an $m$-edge hidden graph $G$ using it.
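
To make the bound $(k-1)n-\binom k2$ concrete, here is a minimal sketch of a representative-comparison strategy that meets it on a worst-case input. It assumes $k$ is known in advance and simulates the membership oracle from a hidden label list. It is meant as an illustration of the counting argument, not as a faithful reproduction of the cited algorithm, and all names are hypothetical.

```python
from math import comb


def learn_partition(labels, k):
    """Learn the components of a hidden partition using same-component queries.

    `labels` simulates the hidden partition (assumption: the oracle is
    emulated by comparing labels). Returns (components, query_count)."""
    queries = 0

    def same_component(u, v):
        nonlocal queries
        queries += 1
        return labels[u] == labels[v]

    reps = []    # one representative vertex per discovered component
    comps = []   # discovered components, as lists of vertices
    for v in range(len(labels)):
        # Once all k components are known, k-1 negative answers already
        # imply that v belongs to the remaining component.
        limit = len(reps) if len(reps) < k else k - 1
        placed = False
        for i in range(limit):
            if same_component(reps[i], v):
                comps[i].append(v)
                placed = True
                break
        if not placed:
            if len(reps) < k:
                reps.append(v)
                comps.append([v])
            else:
                comps[k - 1].append(v)
    return comps, queries


if __name__ == "__main__":
    n, k = 10, 3
    # Worst case for this strategy: k distinct components first,
    # then every remaining vertex in the last-discovered component.
    hidden = [0, 1, 2] + [2] * (n - k)
    comps, used = learn_partition(hidden, k)
    print(used, (k - 1) * n - comb(k, 2))
```

On this input the first $k$ vertices cost $0+1+\dots+(k-1)=\binom k2$ queries and each of the other $n-k$ vertices costs $k-1$, for a total of $(k-1)n-\binom k2$.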

preprint · 2020 · arXiv

A Case for Quantifying Statistical Robustness of Specialized Probabilistic AI Accelerators

Statistical machine learning often uses probabilistic algorithms, such as Markov Chain Monte Carlo (MCMC), to solve a wide range of problems. Many accelerators are proposed using specialized hardware to address sampling inefficiency, the critical performance bottleneck of probabilistic algorithms. These accelerators usually improve the hardware efficiency by using some approximation techniques, such as reducing bit representation, truncating small values to zero, or simplifying the Random Number Generator (RNG). Understanding the influence of these approximations on result quality is crucial to meeting the quality requirements of real applications. Although a common approach is to compare the end-point result quality using community-standard benchmarks and metrics, we claim a probabilistic architecture should provide some measure (or guarantee) of statistical robustness. This work takes a first step towards quantifying the statistical robustness of specialized hardware MCMC accelerators by proposing three pillars of statistical robustness: sampling quality, convergence diagnostic, and goodness of fit. Each pillar has at least one quantitative metric without the need to know the ground truth data. We apply this method to analyze the statistical robustness of an MCMC accelerator proposed by previous work, with some modifications, as a case study. The method also applies to other probabilistic accelerators and can be used in design space exploration.

preprint · 2020 · arXiv

Beyond Application End-Point Results: Quantifying Statistical Robustness of MCMC Accelerators

Statistical machine learning often uses probabilistic algorithms, such as Markov Chain Monte Carlo (MCMC), to solve a wide range of problems. Probabilistic computations, often considered too slow on conventional processors, can be accelerated with specialized hardware by exploiting parallelism and optimizing the design using various approximation techniques. Current methodologies for evaluating correctness of probabilistic accelerators are often incomplete, mostly focusing only on end-point result quality ("accuracy"). It is important for hardware designers and domain experts to look beyond end-point "accuracy" and be aware of the impact of hardware optimizations on other statistical properties. This work takes a first step towards defining metrics and a methodology for quantitatively evaluating correctness of probabilistic accelerators beyond end-point result quality. We propose three pillars of statistical robustness: 1) sampling quality, 2) convergence diagnostic, and 3) goodness of fit. We apply our framework to a representative MCMC accelerator and surface design issues that cannot be exposed using only application end-point result quality. Applying the framework to guide design space exploration shows that statistical robustness comparable to floating-point software can be achieved by slightly increasing the bit representation, without floating-point hardware requirements.

preprint · 2020 · arXiv

Random Lie Brackets that Induce Torsion: A Model for Noisy Vector Fields

We define and study a random Lie bracket that induces torsion in expectation. Almost all stochastic analysis on manifolds has assumed parallel transport. Mathematically this assumption is very reasonable. However, in many applied geometry and graphics problems parallel transport is not achieved: the "change in coordinates" is not exact due to noise. We formulate a stochastic model on a manifold for which parallel transport does not hold and analyze the consequences of this model with respect to classic quantities studied in Riemannian geometry. We first define a stochastic Lie bracket that induces a stochastic covariant derivative. We then study the connection implied by the stochastic covariant derivative and note that the stochastic Lie bracket induces torsion. We then state the induced stochastic geodesic equations and a stochastic differential equation for parallel transport. We also derive the curvature tensors for our construction and a stochastic Laplace-Beltrami operator. We close with a discussion of the motivation and relevance of our construction.

preprint · 2020 · arXiv

Stanza: A Nonlinear State Space Model for Probabilistic Inference in Non-Stationary Time Series

Time series with long-term structure arise in a variety of contexts, and capturing this temporal structure is a critical challenge in time series analysis for both inference and forecasting settings. Traditionally, state space models have been successful in providing uncertainty estimates of trajectories in the latent space. More recently, deep learning, attention-based approaches have achieved state-of-the-art performance for sequence modeling, though they often require large amounts of data and parameters to do so. We propose Stanza, a nonlinear, non-stationary state space model as an intermediate approach to fill the gap between traditional models and modern deep learning approaches for complex time series. Stanza strikes a balance between competitive forecasting accuracy and probabilistic, interpretable inference for highly structured time series. In particular, Stanza achieves forecasting accuracy competitive with deep LSTMs on real-world datasets, especially for multi-step ahead forecasting.

preprint · 2020 · arXiv

Subspace Clustering through Sub-Clusters

The problem of dimension reduction is of increasing importance in modern data analysis. In this paper, we consider modeling the collection of points in a high dimensional space as a union of low dimensional subspaces. In particular we propose a highly scalable sampling based algorithm that clusters the entire data via first spectral clustering of a small random sample followed by classifying or labeling the remaining out of sample points. The key idea is that this random subset borrows information across the entire data set and that the problem of clustering points can be replaced with the more efficient and robust problem of "clustering sub-clusters". We provide theoretical guarantees for our procedure. The numerical results indicate we outperform other state-of-the-art subspace clustering algorithms with respect to accuracy and speed.

preprint · 2019 · arXiv

A study on dynamical complexity of noise induced blood flow

In this article, the dynamics and complexity of a noise induced blood flow system have been investigated. Changes in the dynamics have been recognized by measuring the periodicity over significant parameters. Chaotic as well as non-chaotic regimes have also been classified. Further, dynamical complexity has been studied by phase space based weighted entropy. Numerical results show a strong correlation between the dynamics and complexity of the noise induced system. The correlation has been confirmed by a cross-correlation analysis.

preprint · 2016 · arXiv

Fast moment estimation for generalized latent Dirichlet models

We develop a generalized method of moments (GMM) approach for fast parameter estimation in a new class of Dirichlet latent variable models with mixed data types. Parameter estimation via GMM has been demonstrated to have computational and statistical advantages over alternative methods, such as expectation maximization, variational inference, and Markov chain Monte Carlo. The key computational advantage of our method (MELD) is that parameter estimation does not require instantiation of the latent variables. Moreover, a representational advantage of the GMM approach is that the behavior of the model is agnostic to distributional assumptions of the observations. We derive population moment conditions after marginalizing out the sample-specific Dirichlet latent variables. The moment conditions only depend on component mean parameters. We illustrate the utility of our approach on simulated data, comparing results from MELD to alternative methods, and we show the promise of our approach through the application of MELD to several data sets.

preprint · 2016 · arXiv

Topological consistency via kernel estimation

We introduce a consistent estimator for the homology (an algebraic structure representing connected components and cycles) of level sets of both density and regression functions. Our method is based on kernel estimation. We apply this procedure to two problems: (1) inferring the homology structure of manifolds from noisy observations, (2) inferring the persistent homology (a multi-scale extension of homology) of either density or regression functions. We prove consistency for both of these problems. In addition to the theoretical results, we demonstrate these methods on simulated data for binary regression and clustering applications.

preprint · 2015 · arXiv

Adaptive Randomized Dimension Reduction on Massive Data

The scalability of statistical estimators is of increasing importance in modern applications. One approach to implementing scalable algorithms is to compress data into a low dimensional latent space using dimension reduction methods. In this paper we develop an approach for dimension reduction that exploits the assumption of low rank structure in high dimensional data to gain both computational and statistical advantages. We adapt recent randomized low-rank approximation algorithms to provide an efficient solution to principal component analysis (PCA), and we use this efficient solver to improve parameter estimation in large-scale linear mixed models (LMM) for association mapping in statistical and quantitative genomics. A key observation in this paper is that randomization serves a dual role, improving both computational and statistical performance by implicitly regularizing the covariance matrix estimate of the random effect in an LMM. These statistical and computational advantages are highlighted in our experiments on simulated data and large-scale genomic studies.

preprint · 2015 · arXiv

Bayesian group latent factor analysis with structured sparsity

Latent factor models are the canonical statistical tool for exploratory analyses of low-dimensional linear structure for an observation matrix with p features across n samples. We develop a structured Bayesian group factor analysis model that extends the factor model to multiple coupled observation matrices; in the case of two observations, this reduces to a Bayesian model of canonical correlation analysis. The main contribution of this work is to carefully define a structured Bayesian prior that encourages both element-wise and column-wise shrinkage and leads to desirable behavior on high-dimensional data. In particular, our model puts a structured prior on the joint factor loading matrix, regularizing at three levels, which enables element-wise sparsity and unsupervised recovery of latent factors corresponding to structured variance across arbitrary subsets of the observations. In addition, our structured prior allows for both dense and sparse latent factors so that covariation among either all features or only a subset of features can both be recovered. We use fast parameter-expanded expectation-maximization for parameter estimation in this model. We validate our method on both simulated data with substantial structure and real data, comparing against a number of state-of-the-art approaches. These results illustrate useful properties of our model, including i) recovering sparse signal in the presence of dense effects; ii) the ability to scale naturally to large numbers of observations; iii) flexible observation- and factor-specific regularization to recover factors with a wide variety of sparsity levels and percentage of variance explained; and iv) tractable inference that scales to modern genomic and document data sizes.

preprint · 2015 · arXiv

Can complexity decrease in Congestive Heart failure?

The complexity of a signal can be measured by the recurrence period density entropy (RPDE) computed from the reconstructed phase space. We have chosen a window-based RPDE method for the classification of signals, as RPDE is an average entropic measure of the whole phase space. We have observed the changes in complexity in the cardiac signals of normal healthy persons (NHP) and congestive heart failure patients (CHFP). The results show that the cardiac dynamics of a healthy subject are more complex and random compared to those of a heart failure patient, whose dynamics are more deterministic. We have constructed a general threshold to mark the border line between healthy and congestive heart failure dynamics. The results may be useful for a wide range of physiological and biomedical analyses.

preprint · 2015 · arXiv

Characterization of Cardio signals by time-frequency domain analysis

The long-term behavior of nonlinear deterministic continuous-time signals can be studied in terms of their reconstructed attractors. Reconstructed attractors of a continuous signal are meant to be topologically equivalent representations of the dynamics of the unknown dynamical system which generates the signal. Sometimes, the geometry of the attractor or its complexity may give important information on the system of interest. However, if the trajectories of the attractor behave as if they are not coming from a continuous system, or there exist many spike-like structures on the path of the system trajectories, then there is no way to characterize the shape of the attractor. In this article, the traditional attractor reconstruction method is first used for two types of ECG signals: normal healthy persons (NHP) and congestive heart failure patients (CHFP). As is common in such a framework, the reconstructed attractors are not at all well formed and hence it is not possible to adequately characterize their geometrical features. Thus, we incorporate frequency domain information into the given time signals. This is done by transforming the signals to a time-frequency domain by means of suitable wavelet transforms (WT). The transformed signal concerns two non-homogeneous variables and is still quite difficult to use to reconstruct some dynamics out of it. By applying a suitable mapping, this signal is further converted into the integer domain, and a new type of 3D plot, called the integer lag plot, which characterizes and distinguishes the ECG signals of NHP and CHFP, is finally obtained.

preprint · 2015 · arXiv

Geometric Representations of Random Hypergraphs

A parametrization of hypergraphs based on the geometry of points in $\mathbf{R}^d$ is developed. Informative prior distributions on hypergraphs are induced through this parametrization by priors on point configurations via spatial processes. This prior specification is used to infer conditional independence models or Markov structure of multivariate distributions. Specifically, we can recover both the junction tree factorization as well as the hyper Markov law. This approach offers greater control on the distribution of graph features than Erdős-Rényi random graphs, supports inference of factorizations that cannot be retrieved by a graph alone, and leads to new Metropolis/Hastings Markov chain Monte Carlo algorithms with both local and global moves in graph space. We illustrate the utility of this parametrization and prior specification using simulations.

preprint · 2015 · arXiv

Learning Subspaces of Different Dimension

We introduce a Bayesian model for inferring mixtures of subspaces of different dimensions. The key challenge in such a mixture model is specification of prior distributions over subspaces of different dimensions. We address this challenge by embedding subspaces or Grassmann manifolds into a sphere of relatively low dimension and specifying priors on the sphere. We provide an efficient sampling algorithm for the posterior distribution of the model parameters. We illustrate that a simple extension of our mixture of subspaces model can be applied to topic modeling. We also prove posterior consistency for the mixture of subspaces model. The utility of our approach is demonstrated with applications to real and simulated data.

preprint · 2014 · arXiv

A high dimensional delay selection for the reconstruction of proper Phase Space with Cross auto-correlation

For the purpose of phase space reconstruction from nonlinear time series, delay selection is one of the most vital criteria. This is normally done by using a general measure, viz. mutual information (MI). However, in that case, the delay selection is limited to the estimation of a single delay using MI between two variables only. The corresponding reconstructed phase space is also not satisfactory. To overcome this, a high-dimensional estimator of the MI is used; it selects more than one delay between more than two variables. The quality of the reconstructed phase space is tested by the shape distortion parameter (SD), and it is found that even this multidimensional MI sometimes fails to produce a less distorted phase space. In this paper, an alternative nonlinear measure, cross autocorrelation (CAC), is introduced. A comparative study is made between the reconstructed phase spaces of a known three-dimensional neuro-dynamical model, the Lorenz dynamical model and a three-dimensional food web model under MI for two and higher dimensions, and also under cross autocorrelation separately. It is found that the least distorted phase space is obtained only under cross autocorrelation.

preprint · 2014 · arXiv

Approximate discrete dynamics of EMG signal

Approximation of continuous dynamics by discrete dynamics in the form of a Poincaré map is a fascinating mathematical tool, which can describe the approximate behaviour of a dynamical system in a lower dimension than the embedding dimension. The present article considers a very rare biomedical signal, the electromyography (EMG) signal. It determines a suitable time delay and reconstructs the attractor with embedding dimension three. By measuring its Lyapunov exponent, the attractor so reconstructed is found to be chaotic. Naturally, the Poincaré map obtained from the corresponding Poincaré section should be chaotic too. This may be verified by calculating the Lyapunov exponent of the map. The main objective of this article is to show that the Poincaré map exists in this case as a 2D map for a suitable Poincaré section only. In fact, the article considers two Poincaré sections of the attractor for construction of the Poincaré map. It is seen that one such map is chaotic but the other is not; both findings are verified by calculating the Lyapunov exponent of the map.

preprint · 2014 · arXiv

Consistency of maximum likelihood estimation for some dynamical systems

We consider the asymptotic consistency of maximum likelihood parameter estimation for dynamical systems observed with noise. Under suitable conditions on the dynamical systems and the observations, we show that maximum likelihood parameter estimation is consistent. Our proof involves ideas from both information theory and dynamical systems. Furthermore, we show how some well-studied properties of dynamical systems imply the general statistical properties related to maximum likelihood estimation. Finally, we exhibit classical families of dynamical systems for which maximum likelihood estimation is consistent. Examples include shifts of finite type with Gibbs measures and Axiom A attractors with SRB measures.

preprint · 2014 · arXiv

Is one dimensional return map sufficient to describe the chaotic dynamics of a three dimensional system?

The study of continuous dynamical systems through the Poincaré map is one of the most popular topics in nonlinear analysis. This is done by taking intersections of the orbit of the flow with a hyperplane of co-dimension one parallel to one of the coordinate hyperplanes. Naturally, for a 3D attractor, the Poincaré map gives rise to 2D points, which can describe the dynamics of the attractor properly. In special cases, these 2D points are replaced by their 1D projections to obtain a 1D map. However, this is an artificial way of reducing the 2D map by dropping one of the variables. Sometimes it is found that the two coordinates of the points on the Poincaré section are functionally related. This also reduces the 2D Poincaré map to a 1D map. This reduction is natural, and not artificial as in the case above. The present study highlights this issue. In fact, we give examples which show that even this natural reduction of the 2D Poincaré map is not always justified, because the resultant 1D map may fail to generate the original dynamics. This shows that, to describe the dynamics of a 3D chaotic attractor, the minimum dimension of the Poincaré map must be two, in general.

preprint · 2014 · arXiv

New types of nonlinear auto-correlations of bivariate data and their applications

The paper introduces new types of nonlinear correlations between bivariate data sets and derives nonlinear auto-correlations on the same data set. These auto-correlations are of different types to match signals with different types of nonlinearities. Examples are cited in all cases to make the definitions meaningful. Next correlogram diagrams are drawn separately in all cases; from these diagrams proper time lags/delays are determined. These give rise to independent coordinates of the attractors. Finally three dimensional attractors are reconstructed in each case separately with the help of these independent coordinates. Moreover for the purpose of making proper distinction between the signals, the attractors so reconstructed are quantified by a new technique called ellipsoid fit.

preprint · 2014 · arXiv

Persistent Homology Transform for Modeling Shapes and Surfaces

In this paper we introduce a statistic, the persistent homology transform (PHT), to model surfaces in $\mathbb{R}^3$ and shapes in $\mathbb{R}^2$. This statistic is a collection of persistence diagrams - multiscale topological summaries used extensively in topological data analysis. We use the PHT to represent shapes and execute operations such as computing distances between shapes or classifying shapes. We prove the map from the space of simplicial complexes in $\mathbb{R}^3$ into the space spanned by this statistic is injective. This implies that the statistic is a sufficient statistic for probability densities on the space of piecewise linear shapes. We also show that a variant of this statistic, the Euler Characteristic Transform (ECT), admits a simple exponential family formulation which is of use in providing likelihood based inference for shapes and surfaces. We illustrate the utility of this statistic on simulated and real data.

preprint · 2014 · arXiv

Phase synchronization of instrumental music signals

Signal analysis is one of the finest scientific techniques in communication theory. The quantitative and qualitative measures that describe the pattern of a music signal vary from one signal to another. The same musical recital, when played by different instrumentalists, generates different types of music patterns. The reason for these differing patterns is that the psychoacoustic measures (Dynamics, Timbre, Tonality and Rhythm) vary each time. However, the psychoacoustic study of the music signals does not reveal anything about the similarity between the signals. For such cases, the study of synchronization of long-term nonlinear dynamics may provide effective results. In this context, phase synchronization (PS) is one of the measures used to show synchronization between two non-identical signals. In fact, it is very difficult to investigate any other kind of synchronization under experimental conditions, because the signals are completely non-identical. Also, there exists an equivalence between the phases and the distances of the diagonal lines in the recurrence plot (RP) of the signals, which is quantifiable by the recurrence quantification measure tau-recurrence rate. This paper considers two nonlinear music signals based on the same raga, played by two eminent sitar instrumentalists, as two non-identical sources. The psychoacoustic study shows how the Dynamics, Timbre, Tonality and Rhythm vary for the two music signals. Then, long-term analysis in the form of phase space reconstruction is performed, which reveals chaotic phase spaces for both signals. From the RP of both phase spaces, the tau-recurrence rate is calculated. Finally, the correlation between the normalized tau-recurrence rate of their 3D phase spaces and the PS of the two music signals is established. The numerical results support the analysis well.

preprint · 2014 · arXiv

Probabilistic Fréchet Means for Time Varying Persistence Diagrams

In order to use persistence diagrams as a true statistical tool, it would be very useful to have a good notion of mean and variance for a set of diagrams. In 2011, Mileyko and his collaborators made the first study of the properties of the Fréchet mean in $(\mathcal{D}_p,W_p)$, the space of persistence diagrams equipped with the p-th Wasserstein metric. In particular, they showed that the Fréchet mean of a finite set of diagrams always exists, but is not necessarily unique. The means of a continuously-varying set of diagrams do not themselves (necessarily) vary continuously, which presents obvious problems when trying to extend the Fréchet mean definition to the realm of vineyards. We fix this problem by altering the original definition of Fréchet mean so that it now becomes a probability measure on the set of persistence diagrams; in a nutshell, the mean of a set of diagrams will be a weighted sum of atomic measures, where each atom is itself a persistence diagram determined using a perturbation of the input diagrams. This definition gives for each $N$ a map $(\mathcal{D}_p)^N \to \mathbb{P}(\mathcal{D}_p)$. We show that this map is Hölder continuous on finite diagrams and thus can be used to build a useful statistic on time-varying persistence diagrams, better known as vineyards.

preprint2014arXiv

Statistical analysis of crystallization database links protein physico-chemical features with crystallization mechanisms

X-ray crystallography is the predominant method for obtaining atomic-scale information about biological macromolecules. Despite the success of the technique, obtaining well diffracting crystals still critically limits going from protein to structure. In practice, the crystallization process proceeds through knowledge-informed empiricism. Better physico-chemical understanding remains elusive because of the large number of variables involved, hence little guidance is available to systematically identify solution conditions that promote crystallization. To help determine relationships between macromolecular properties and their crystallization propensity, we have trained statistical models on samples for 182 proteins supplied by the Northeast Structural Genomics consortium. Gaussian processes, which capture trends beyond the reach of linear statistical models, distinguish between two main physico-chemical mechanisms driving crystallization. One is characterized by low levels of side chain entropy and has been extensively reported in the literature. The other identifies specific electrostatic interactions not previously described in the crystallization context. Because evidence for two distinct mechanisms can be gleaned both from crystal contacts and from solution conditions leading to successful crystallization, the model offers future avenues for optimizing crystallization screens based on partial structural information. The availability of crystallization data coupled with structural outcomes analyzed through state-of-the-art statistical models may thus guide macromolecular crystallization toward a more rational basis.

preprint2014arXiv

The Information Geometry of Mirror Descent

Information geometry applies concepts in differential geometry to probability and statistics and is especially useful for parameter estimation in exponential families, where parameters are known to lie on a Riemannian manifold. Connections between the geometric properties of the induced manifold and statistical properties of the estimation problem are well established. However, developing first-order methods that scale to larger problems has been less of a focus in the information geometry community. The best-known algorithm that incorporates manifold structure is the second-order natural gradient descent algorithm introduced by Amari. On the other hand, stochastic approximation methods have led to the development of first-order methods for optimizing noisy objective functions. A recent generalization of the Robbins-Monro algorithm known as mirror descent, developed by Nemirovski and Yudin, is a first-order method that induces non-Euclidean geometries. However, current analysis of mirror descent does not precisely characterize the induced non-Euclidean geometry, nor does it consider performance in terms of statistical relative efficiency. In this paper, we prove that mirror descent induced by Bregman divergences is equivalent to the natural gradient descent algorithm on the dual Riemannian manifold. Using this equivalence, it follows that (1) mirror descent is the steepest descent direction along the Riemannian manifold of the exponential family; (2) mirror descent with log-likelihood loss applied to parameter estimation in exponential families asymptotically achieves the classical Cramér-Rao lower bound; and (3) natural gradient descent for manifolds corresponding to exponential families can be implemented as a first-order method through mirror descent.
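To make the mirror descent update concrete, here is a minimal sketch using the negative-entropy mirror map on the probability simplex, whose Bregman divergence is the KL divergence. This is the familiar exponentiated-gradient special case, not the paper's construction or proof; the quadratic objective, the `target` vector, the step size and the function name are assumptions chosen for illustration.

```python
import math

def entropic_mirror_descent(grad, x0, step, iters):
    """Mirror descent on the probability simplex with the negative-entropy
    mirror map: take a gradient step in the dual (log) coordinates, then
    map back by normalizing so the iterate stays on the simplex."""
    x = list(x0)
    for _ in range(iters):
        g = grad(x)
        # Multiplicative update: x_i * exp(-step * g_i), then renormalize.
        w = [xi * math.exp(-step * gi) for xi, gi in zip(x, g)]
        total = sum(w)
        x = [wi / total for wi in w]
    return x

# Hypothetical objective: f(x) = 0.5 * ||x - target||^2, minimized at target.
target = [0.2, 0.3, 0.5]

def grad(x):
    return [xi - ti for xi, ti in zip(x, target)]

x_hat = entropic_mirror_descent(grad, [1 / 3] * 3, step=0.5, iters=500)
print([round(v, 4) for v in x_hat])
```

The update only ever uses first-order information, and each step is linear in the dimension. That is the practical appeal the abstract points to when it relates mirror descent to natural gradient descent.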

preprint2014arXiv

The Topology of Probability Distributions on Manifolds

Let $P$ be a set of $n$ random points in $R^d$, generated from a probability measure on an $m$-dimensional manifold $M \subset R^d$. In this paper we study the homology of $U(P,r)$ -- the union of $d$-dimensional balls of radius $r$ around $P$, as $n \to \infty$, and $r \to 0$. In addition we study the critical points of $d_P$ -- the distance function from the set $P$. These two objects are known to be related via Morse theory. We present limit theorems for the Betti numbers of $U(P,r)$, as well as for the number of critical points of index $k$ for $d_P$. Depending on how fast $r$ decays to zero as $n$ grows, these two objects exhibit different types of limiting behavior. In one particular case ($n r^m > C \log n$), we show that the Betti numbers of $U(P,r)$ perfectly recover the Betti numbers of the original manifold $M$, a result which is of significant interest in topological manifold learning.
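The simplest of these quantities, the zeroth Betti number of $U(P,r)$, can be sketched directly: two closed balls of radius $r$ intersect exactly when their centers are at distance at most $2r$, so counting connected components reduces to a union-find over point pairs. The example below assumes points sampled uniformly on the unit circle (so $m = 1$, $d = 2$); the sample size, seed, radii and function name are hypothetical, and higher Betti numbers would need a full simplicial-homology computation that is not shown here.

```python
import math
import random

def betti0_union_of_balls(points, r):
    """Number of connected components of the union of closed balls of
    radius r centered at the given points (union-find over close pairs)."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            # Balls of radius r overlap when the centers are within 2r.
            if math.dist(points[i], points[j]) <= 2 * r:
                parent[find(i)] = find(j)
    return len({find(i) for i in range(len(points))})

# Hypothetical sample: n random points on the unit circle in the plane.
rng = random.Random(0)
angles = [rng.uniform(0, 2 * math.pi) for _ in range(400)]
points = [(math.cos(a), math.sin(a)) for a in angles]

for r in (1e-12, 0.01, 0.2):
    print(r, betti0_union_of_balls(points, r))
```

With a very small radius every point is its own component, while a sufficiently large radius merges them into the single component of the circle. This mirrors the dependence on how fast $r$ shrinks relative to $n$ described in the abstract.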

preprint2013arXiv

Bayesian Sparse Factor Analysis of Genetic Covariance Matrices

Quantitative genetic studies that model complex, multivariate phenotypes are important for both evolutionary prediction and artificial selection. For example, changes in gene expression can provide insight into developmental and physiological mechanisms that link genotype and phenotype. However, classical analytical techniques are poorly suited to quantitative genetic studies of gene expression where the number of traits assayed per individual can reach many thousand. Here, we derive a Bayesian genetic sparse factor model for estimating the genetic covariance matrix (G-matrix) of high-dimensional traits, such as gene expression, in a mixed effects model. The key idea of our model is that we need only consider G-matrices that are biologically plausible. An organism's entire phenotype is the result of processes that are modular and have limited complexity. This implies that the G-matrix will be highly structured. In particular, we assume that a limited number of intermediate traits (or factors, e.g., variations in development or physiology) control the variation in the high-dimensional phenotype, and that each of these intermediate traits is sparse -- affecting only a few observed traits. The advantages of this approach are two-fold. First, sparse factors are interpretable and provide biological insight into mechanisms underlying the genetic architecture. Second, enforcing sparsity helps prevent sampling errors from swamping out the true signal in high-dimensional data. We demonstrate the advantages of our model on simulated data and in an analysis of a published Drosophila melanogaster gene expression data set.

preprint2013arXiv

Fréchet Means for Distributions of Persistence diagrams

Given a distribution $ρ$ on persistence diagrams and observations $X_1,...X_n \stackrel{iid}{\sim} ρ$ we introduce an algorithm in this paper that estimates a Fréchet mean from the set of diagrams $X_1,...X_n$. If the underlying measure $ρ$ is a combination of Dirac masses $ρ= \frac{1}{m} \sum_{i=1}^m δ_{Z_i}$ then we prove the algorithm converges to a local minimum and a law of large numbers result for a Fréchet mean computed by the algorithm given observations drawn iid from $ρ$. We illustrate the convergence of an empirical mean computed by the algorithm to a population mean by simulations from Gaussian random fields.

preprint2013arXiv

Random Walks on Simplicial Complexes and Harmonics

In this paper, we introduce random walks with absorbing states on simplicial complexes. Given a simplicial complex of dimension $d$, a random walk with an absorbing state is defined which relates to the spectrum of the $k$-dimensional Laplacian for $1 \leq k \leq d$ and which relates to the local random walk on a graph defined by Fan Chung. We also examine an application of random walks on simplicial complexes to a semi-supervised learning problem. Specifically, we consider a label propagation algorithm on oriented edges, which applies to a generalization of the partially labelled classification problem on graphs.

preprint2013arXiv

Randomized Dimension Reduction on Massive Data

Scalability of statistical estimators is of increasing importance in modern applications and dimension reduction is often used to extract relevant information from data. A variety of popular dimension reduction approaches can be framed as symmetric generalized eigendecomposition problems. In this paper we outline how taking into account the low rank structure assumption implicit in these dimension reduction approaches provides both computational and statistical advantages. We adapt recent randomized low-rank approximation algorithms to provide efficient solutions to three dimension reduction methods: Principal Component Analysis (PCA), Sliced Inverse Regression (SIR), and Localized Sliced Inverse Regression (LSIR). A key observation in this paper is that randomization serves a dual role, improving both computational and statistical performance. This point is highlighted in our experiments on real and simulated data.

preprint2012arXiv

A Cheeger-Type Inequality on Simplicial Complexes

In this paper, we consider a variation on Cheeger numbers related to the coboundary expanders recently defined by Dotterer and Kahle. A Cheeger-type inequality is proved, which is similar to a result on graphs due to Fan Chung. This inequality is then used to study the relationship between coboundary expanders on simplicial complexes and their corresponding eigenvalues, complementing and extending results found by Gundert and Wagner. In particular, we find these coboundary expanders do not satisfy natural Buser or Cheeger inequalities.

preprint2012arXiv

Statistical inference for dynamical systems: a review

The topic of statistical inference for dynamical systems has been studied extensively across several fields. In this survey we focus on the problem of parameter estimation for non-linear dynamical systems. Our objective is to place results across distinct disciplines in a common setting and highlight opportunities for further research.

preprint2010arXiv

Learning gradients on manifolds

A common belief in high-dimensional data analysis is that data are concentrated on a low-dimensional manifold. This motivates simultaneous dimension reduction and regression on manifolds. We provide an algorithm for learning gradients on manifolds for dimension reduction for high-dimensional data with few observations. We obtain generalization error bounds for the gradient estimates and show that the convergence rate depends on the intrinsic dimension of the manifold and not on the dimension of the ambient space. We illustrate the efficacy of this approach empirically on simulated and real data and compare the method to other dimension reduction procedures.

preprint2010arXiv

Predictor-dependent shrinkage for linear regression via partial factor modeling

In prediction problems with more predictors than observations, it can sometimes be helpful to use a joint probability model, $π(Y,X)$, rather than a purely conditional model, $π(Y \mid X)$, where $Y$ is a scalar response variable and $X$ is a vector of predictors. This approach is motivated by the fact that in many situations the marginal predictor distribution $π(X)$ can provide useful information about the parameter values governing the conditional regression. However, under very mild misspecification, this marginal distribution can also lead conditional inferences astray. Here, we explore these ideas in the context of linear factor models, to understand how they play out in a familiar setting. The resulting Bayesian model performs well across a wide range of covariance structures, on real and simulated data.

preprint2010arXiv

Towards Stratification Learning through Homology Inference

A topological approach to stratification learning is developed for point cloud data drawn from a stratified space. Given such data, our objective is to infer which points belong to the same strata. First we define a multi-scale notion of a stratified space, giving a stratification for each radius level. We then use methods derived from kernel and cokernel persistent homology to cluster the data points into different strata, and we prove a result which guarantees the correctness of our clustering, given certain topological conditions; some geometric intuition for these topological conditions is also provided. Our correctness result is then given a probabilistic flavor: we give bounds on the minimum number of sample points required to infer, with probability, which points belong to the same strata. Finally, we give an explicit algorithm for the clustering, prove its correctness, and apply it to some simulated data.

40 published item(s)

At the Intersection of Deep Sequential Model Framework and State-space Model Framework: Study on Option Pricing

Synthesizing POMDP Policies: Sampling Meets Model-checking via Learning

Extended probabilities and their application to statistical inference

A Grover search-based algorithm for the list coloring problem

Multiple testing with persistent homology

Tight query complexity bounds for learning graph partitions

A Case for Quantifying Statistical Robustness of Specialized Probabilistic AI Accelerators

Beyond Application End-Point Results: Quantifying Statistical Robustness of MCMC Accelerators

Random Lie Brackets that Induce Torsion: A Model for Noisy Vector Fields

Stanza: A Nonlinear State Space Model for Probabilistic Inference in Non-Stationary Time Series

Subspace Clustering through Sub-Clusters

A study on dynamical complexity of noise induced blood flow

Fast moment estimation for generalized latent Dirichlet models

Topological consistency via kernel estimation

Adaptive Randomized Dimension Reduction on Massive Data

Bayesian group latent factor analysis with structured sparsity

Can complexity decrease in Congestive Heart failure?

Characterization of Cardio signals by time-frequency domain analysis

Geometric Representations of Random Hypergraphs

Learning Subspaces of Different Dimension

A high dimensional delay selection for the reconstruction of proper Phase Space with Cross auto-correlation

Approximate discrete dynamics of EMG signal

Consistency of maximum likelihood estimation for some dynamical systems

Is one dimensional return map sufficient to describe the chaotic dynamics of a three dimensional system?

New types of nonlinear auto-correlations of bivariate data and their applications

Persistent Homology Transform for Modeling Shapes and Surfaces

Phase synchronization of instrumental music signals

Probabilistic Fréchet Means for Time Varying Persistence Diagrams

Statistical analysis of crystallization database links protein physico-chemical features with crystallization mechanisms

The Information Geometry of Mirror Descent

The Topology of Probability Distributions on Manifolds

Bayesian Sparse Factor Analysis of Genetic Covariance Matrices

Fréchet Means for Distributions of Persistence diagrams

Random Walks on Simplicial Complexes and Harmonics

Randomized Dimension Reduction on Massive Data

A Cheeger-Type Inequality on Simplicial Complexes

Statistical inference for dynamical systems: a review

Learning gradients on manifolds

Predictor-dependent shrinkage for linear regression via partial factor modeling

Towards Stratification Learning through Homology Inference