Source author record

Simon Tavaré

Simon Tavaré appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher, unclaimed source record

Methodology, Applications, Machine Learning, math.PR, Computation, Populations and Evolution, Quantitative Methods

Catalog footprint

What is connected

8 works

7 topics

4 close collaborators


preprint, 2022, arXiv

Gradient Estimation for Binary Latent Variables via Gradient Variance Clipping

Gradient estimation is often necessary for fitting generative models with discrete latent variables, in contexts such as reinforcement learning and variational autoencoder (VAE) training. The DisARM estimator (Yin et al. 2020; Dong, Mnih, and Tucker 2020) achieves state-of-the-art gradient variance for Bernoulli latent variable models in many contexts. However, DisARM and other estimators have potentially exploding variance near the boundary of the parameter space, where solutions tend to lie. To ameliorate this issue, we propose a new gradient estimator, bitflip-1, that has lower variance at the boundaries of the parameter space. As bitflip-1 has complementary properties to existing estimators, we introduce an aggregated estimator, unbiased gradient variance clipping (UGC), that uses either a bitflip-1 or a DisARM gradient update for each coordinate. We theoretically prove that UGC has uniformly lower variance than DisARM. Empirically, we observe that UGC achieves the optimal value of the optimization objectives in toy experiments, discrete VAE training, and in a best subset selection problem.
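
The boundary behaviour described in this abstract already shows up in the simplest setting. The sketch below is not DisARM, bitflip-1 or UGC. As an illustrative assumption it uses the plain score-function (REINFORCE) estimator for a single Bernoulli variable with success probability p, and computes its exact mean and variance by enumerating both outcomes. The toy objective `f` and all function names are hypothetical.

```python
def f(b):
    """Hypothetical toy objective on a single bit (not from the paper)."""
    return (b - 0.4) ** 2


def score_function_moments(p):
    """Exact mean and variance of g(b) = f(b) * d/dp log P(b; p), b ~ Bernoulli(p).

    Requires 0 < p < 1.
    """
    outcomes = [(1, p), (0, 1.0 - p)]  # (value of b, probability of b)
    # d/dp log P(b; p) equals 1/p when b = 1 and -1/(1 - p) when b = 0.
    grads = [
        (f(b) * (1.0 / p if b == 1 else -1.0 / (1.0 - p)), w)
        for b, w in outcomes
    ]
    mean = sum(g * w for g, w in grads)
    var = sum((g - mean) ** 2 * w for g, w in grads)
    return mean, var


if __name__ == "__main__":
    for p in (0.5, 0.1, 0.01, 0.001):
        mean, var = score_function_moments(p)
        print(f"p={p}: mean={mean:.4f}, variance={var:.4f}")
```

For every p the mean equals f(1) - f(0), so the estimator is unbiased, while the variance f(1)^2/p + f(0)^2/(1-p) - (f(1)-f(0))^2 grows without bound as p approaches 0 or 1.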

preprint, 2020, arXiv

A note on the Screaming Toes game

We investigate properties of random mappings whose core is composed of derangements as opposed to permutations. Such mappings arise as the natural framework to study the Screaming Toes game described, for example, by Peter Cameron. This mapping differs from the classical case primarily in the behaviour of the small components, and a number of explicit results are provided to illustrate these differences.
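
A minimal simulation sketch of such a mapping follows. It assumes the game is modelled by each of n players (n at least 2) independently choosing one of the other n-1 players uniformly at random; the abstract does not spell out the rules, so this model and all function names are assumptions. The code checks the structural fact the abstract relies on: the mapping has no fixed points, so its core (the set of cyclic elements) carries a derangement.

```python
import random


def random_fixed_point_free_mapping(n, rng):
    """Return f with f[i] uniform on {0, ..., n-1} minus {i}. Requires n >= 2."""
    f = []
    for i in range(n):
        j = rng.randrange(n - 1)           # uniform over n-1 slots
        f.append(j if j < i else j + 1)    # skip over i itself
    return f


def core(f):
    """Cyclic elements of f: iterate the image set until it stops shrinking."""
    current = set(range(len(f)))
    while True:
        image = {f[i] for i in current}
        if image == current:
            return current
        current = image


def mutual_pairs(f):
    """Number of 2-cycles, i.e. pairs {i, j} with f[i] = j and f[j] = i."""
    return sum(1 for i, j in enumerate(f) if i < j and f[j] == i)


if __name__ == "__main__":
    rng = random.Random(2020)
    f = random_fixed_point_free_mapping(12, rng)
    c = core(f)
    print("mapping:", f)
    print("core size:", len(c), "mutual pairs:", mutual_pairs(f))
```

Restricted to the core, f is a bijection without fixed points, which is why the smallest possible component cycle has length 2 rather than 1.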

preprint, 2020, arXiv

Random derangements and the Ewens Sampling Formula

We study derangements of $\{1,2,\ldots,n\}$ under the Ewens distribution with parameter $\theta$. We give the moments and marginal distributions of the cycle counts, the number of cycles, and asymptotic distributions for large $n$. We develop a $\{0,1\}$-valued non-homogeneous Markov chain with the property that the counts of lengths of spacings between the 1s have the derangement distribution. This chain, an analog of the so-called Feller Coupling, provides a simple way to simulate derangements in time independent of $\theta$ for a given $n$ and linear in the size of the derangement.
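
For orientation, here is a sketch of the classical Feller Coupling that the abstract refers to, not the derangement chain developed in the paper. It assumes the standard construction for unrestricted Ewens permutations: independent Bernoulli variables with success probability $\theta/(\theta+i-1)$ for $i = 1, \ldots, n$, with a 1 appended at position $n+1$, and cycle lengths read off as the spacings between consecutive 1s. Function and variable names are hypothetical.

```python
import random
from collections import Counter


def feller_cycle_counts(n, theta, rng=None):
    """Sample Ewens(theta) cycle counts for a permutation of n items.

    Returns a Counter mapping cycle length j to the number of j-cycles.
    Requires n >= 1 and theta > 0.
    """
    rng = rng or random.Random()
    # xi_i ~ Bernoulli(theta / (theta + i - 1)); xi_1 is always 1.
    ones = [i for i in range(1, n + 1)
            if rng.random() < theta / (theta + i - 1)]
    ones.append(n + 1)  # closing 1 appended after position n
    # Each spacing between consecutive 1s is one cycle length.
    spacings = [b - a for a, b in zip(ones, ones[1:])]
    return Counter(spacings)


if __name__ == "__main__":
    counts = feller_cycle_counts(20, theta=1.0, rng=random.Random(7))
    print(sorted(counts.items()))
```

In this classical version spacings of length 1 (fixed points) can occur, which is exactly what a derangement analog has to rule out.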

preprint, 2015, arXiv

Testing the Mean Matrix in High-Dimensional Transposable Data

The structural information in high-dimensional transposable data allows us to write the data recorded for each subject in a matrix such that both the rows and the columns correspond to variables of interest. One important problem is to test the null hypothesis that the mean matrix has a particular structure without ignoring the potential dependence structure among and/or between the row and column variables. To address this, we develop a simple and computationally efficient nonparametric testing procedure to assess the hypothesis that, in each predefined subset of columns (rows), the column (row) mean vector remains constant. In simulation studies, the proposed testing procedure seems to have good performance and, unlike traditional approaches, it is powerful without leading to inflated nominal sizes. Finally, we illustrate the use of the proposed methodology via two empirical examples from gene expression microarrays.

preprint, 2014, arXiv

Hypothesis Testing for the Covariance Matrix in High-Dimensional Transposable Data with Kronecker Product Dependence Structure

The matrix-variate normal distribution is a popular model for high-dimensional transposable data because it decomposes the dependence structure of the random matrix into the Kronecker product of two covariance matrices: one for each of the row and column variables. We develop tests for assessing the form of the row (column) covariance matrix in high-dimensional settings while treating the column (row) dependence structure as a nuisance. Our tests are robust to normality departures provided that the Kronecker product dependence structure holds. In simulations, we observe that the proposed tests maintain the nominal level and are powerful against the alternative hypotheses tested. We illustrate the utility of our approach by examining whether genes associated with a given signalling network show correlated patterns of expression in different tissues and by studying correlation patterns within measurements of brain activity collected using electroencephalography.
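
The Kronecker decomposition mentioned in this abstract can be written out as follows. The notation is chosen here for illustration and is not taken from the paper: $\Sigma_r$ and $\Sigma_c$ denote the row and column covariance matrices, $A$ and $B$ are any square roots of them, and $\operatorname{vec}$ stacks columns.

```latex
\begin{align*}
X &= M + A Z B^{\top}, \qquad Z_{ij} \overset{\text{iid}}{\sim} N(0,1), \qquad
     A A^{\top} = \Sigma_r, \quad B B^{\top} = \Sigma_c, \\
\operatorname{vec}(X) &= \operatorname{vec}(M) + (B \otimes A)\,\operatorname{vec}(Z), \\
\operatorname{Cov}\bigl(\operatorname{vec}(X)\bigr)
  &= (B \otimes A)(B \otimes A)^{\top}
   = (B B^{\top}) \otimes (A A^{\top})
   = \Sigma_c \otimes \Sigma_r .
\end{align*}
```

The second line uses the identity $\operatorname{vec}(AZB^{\top}) = (B \otimes A)\operatorname{vec}(Z)$, and the third uses the mixed-product property of the Kronecker product.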

preprint, 2013, arXiv

Bayesian clustering of replicated time-course gene expression data with weak signals

To identify novel dynamic patterns of gene expression, we develop a statistical method to cluster noisy measurements of gene expression collected from multiple replicates at multiple time points, with an unknown number of clusters. We propose a random-effects mixture model coupled with a Dirichlet-process prior for clustering. The mixture model formulation allows for probabilistic cluster assignments. The random-effects formulation allows for attributing the total variability in the data to the sources that are consistent with the experimental design, particularly when the noise level is high and the temporal dependence is not strong. The Dirichlet-process prior induces a prior distribution on partitions and helps to estimate the number of clusters (or mixture components) from the data. We further tackle two challenges associated with Dirichlet-process prior-based methods. One is efficient sampling. We develop a novel Metropolis-Hastings Markov Chain Monte Carlo (MCMC) procedure to sample the partitions. The other is efficient use of the MCMC samples in forming clusters. We propose a two-step procedure for posterior inference, which involves resampling and relabeling, to estimate the posterior allocation probability matrix. This matrix can be directly used in cluster assignments, while describing the uncertainty in clustering. We demonstrate the effectiveness of our model and sampling procedure through simulated data. Applying our method to a real data set collected from Drosophila adult muscle cells after five-minute Notch activation, we identify 14 clusters of different transcriptional responses among 163 differentially expressed genes, which provides novel insights into underlying transcriptional mechanisms in the Notch signaling pathway. The algorithm developed here is implemented in the R package DIRECT, available on CRAN.
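
The statement that a Dirichlet-process prior induces a prior distribution on partitions can be illustrated with the Chinese restaurant process, the standard sequential description of that partition distribution. The sketch below only draws a partition from the prior with concentration parameter `alpha`. It is not the Metropolis-Hastings sampler of the paper and is unrelated to the interface of the DIRECT package; all names are hypothetical.

```python
import random


def crp_partition(n, alpha, rng=None):
    """Draw a random partition of n items from a Chinese restaurant process.

    Item i (0-based) joins an existing cluster with probability proportional
    to its current size, or opens a new cluster with probability proportional
    to alpha. Returns (labels, sizes). Requires alpha > 0.
    """
    rng = rng or random.Random()
    labels, sizes = [], []
    for i in range(n):
        # Total weight: i items already placed, plus alpha for a new cluster.
        u = rng.random() * (i + alpha)
        acc = 0.0
        choice = len(sizes)  # default: open a new cluster
        for k, s in enumerate(sizes):
            acc += s
            if u < acc:
                choice = k
                break
        if choice == len(sizes):
            sizes.append(1)
        else:
            sizes[choice] += 1
        labels.append(choice)
    return labels, sizes


if __name__ == "__main__":
    labels, sizes = crp_partition(30, alpha=1.0, rng=random.Random(3))
    print("cluster sizes:", sizes)
```

The number of clusters is itself random under this prior, which is what lets the data inform it; larger `alpha` favours more clusters.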

preprint, 2011, arXiv

Sparse Partitioning: Nonlinear regression with binary or tertiary predictors, with application to association studies

This paper presents Sparse Partitioning, a Bayesian method for identifying predictors that either individually or in combination with others affect a response variable. The method is designed for regression problems involving binary or tertiary predictors and allows the number of predictors to exceed the size of the sample, two properties which make it well suited for association studies. Sparse Partitioning differs from other regression methods by placing no restrictions on how the predictors may influence the response. To compensate for this generality, Sparse Partitioning implements a novel way of exploring the model space. It searches for high posterior probability partitions of the predictor set, where each partition defines groups of predictors that jointly influence the response. The result is a robust method that requires no prior knowledge of the true predictor-response relationship. Testing on simulated data suggests Sparse Partitioning will typically match the performance of an existing method on a data set which obeys the existing method's model assumptions. When these assumptions are violated, Sparse Partitioning will generally offer superior performance.

preprint, 2010, arXiv

Assessing molecular variability in cancer genomes

The dynamics of tumour evolution are not well understood. In this paper we provide a statistical framework for evaluating the molecular variation observed in different parts of a colorectal tumour. A multi-sample version of the Ewens Sampling Formula forms the basis for our modelling of the data, and we provide a simulation procedure for use in obtaining reference distributions for the statistics of interest. We also describe the large-sample asymptotics of the joint distributions of the variation observed in different parts of the tumour. While actual data should be evaluated with reference to the simulation procedure, the asymptotics serve to provide theoretical guidelines, for instance with reference to the choice of possible statistics.