Source author record

Martin T. Wells

Martin T. Wells appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.


Topics: Methodology, Machine Learning, math.ST, Statistics Theory, Computation, Applications, Computer Vision

Catalog footprint

What is connected

20 works

7 topics

4 close collaborators


preprint, 2026, arXiv

Adapt or Forget: Provable Tradeoffs Between Adam and SGD in Nonstationary Optimization

We provide a theoretical analysis of Adam under non-stationary stochastic objectives, separating two regimes: Euclidean tracking under adaptive strong monotonicity of the Adam-preconditioned mean-gradient operator, and high-probability projected stationarity guarantees under general $L$-smooth objectives. In the tracking regime, we derive finite-time expected and high-probability bounds that decompose sharply into four components: initialization, objective drift, a first-moment tracking error governed by $β_1$, and a preconditioner perturbation governed by $β_2$. We characterize the burn-in time to reach Adam's irreducible tracking floor under constant and step-decay schedules. We also prove a high-probability bound on the average projected stationarity gap for Adam under distribution shift. Across both analyses, our bounds reveal a noise--drift tradeoff: in noise-dominated regimes, first-moment averaging and adaptive preconditioning can improve the high-probability error, whereas in drift-dominated regimes, stale first-moment information and preconditioner perturbations can compound the cost of nonstationarity, allowing vanilla SGD to achieve a smaller tracking floor. Our explicit $(β_1,β_2,ε)$-dependent bounds delineate when adaptive step-sizing is beneficial versus harmful, and provide a theoretical mechanism for Adam's empirical instability and stabilization under distribution shift.
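
For reference, the quantities named in the abstract ($β_1$, $β_2$, $ε$, drift, gradient noise) can be made concrete with a minimal sketch of the standard Adam update next to plain SGD on a drifting one-dimensional quadratic. The toy objective, the drift and noise parameters, and the function name `track` are illustrative assumptions, not the paper's setup or results.

```python
import math
import random

def track(optimizer, steps=2000, lr=0.05, drift=0.01, noise=1.0,
          beta1=0.9, beta2=0.999, eps=1e-8, seed=0):
    """Track the moving minimizer of f_t(x) = 0.5 * (x - theta_t)^2.

    Hypothetical toy problem: theta_t = drift * t and gradients carry
    Gaussian noise. Returns the mean squared tracking error over the
    second half of the run.
    """
    rng = random.Random(seed)
    x, m, v = 0.0, 0.0, 0.0
    sq_err, count = 0.0, 0
    for t in range(1, steps + 1):
        theta = drift * t                               # objective drift
        g = (x - theta) + noise * rng.gauss(0.0, 1.0)   # stochastic gradient
        if optimizer == "sgd":
            x -= lr * g
        else:
            # Standard Adam step with bias correction.
            m = beta1 * m + (1.0 - beta1) * g           # first-moment average
            v = beta2 * v + (1.0 - beta2) * g * g       # second-moment average
            m_hat = m / (1.0 - beta1 ** t)
            v_hat = v / (1.0 - beta2 ** t)
            x -= lr * m_hat / (math.sqrt(v_hat) + eps)
        if t > steps // 2:
            sq_err += (x - theta) ** 2
            count += 1
    return sq_err / count
```

Calling `track("sgd", ...)` and `track("adam", ...)` over a range of `drift` and `noise` values gives a rough feel for the two regimes the abstract describes; no particular outcome is claimed here.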

preprint, 2022, arXiv

Interpretable Latent Variables in Deep State Space Models

We introduce a new version of deep state-space models (DSSMs) that combines a recurrent neural network with a state-space framework to forecast time series data. The model estimates the observed series as functions of latent variables that evolve non-linearly through time. Due to the complexity and non-linearity inherent in DSSMs, previous works on DSSMs typically produced latent variables that are very difficult to interpret. Our paper focuses on producing interpretable latent parameters with two key modifications. First, we simplify the predictive decoder by restricting the response variables to be a linear transformation of the latent variables plus some noise. Second, we utilize shrinkage priors on the latent variables to reduce redundancy and improve robustness. These changes make the latent variables much easier to understand and allow us to interpret the resulting latent variables as random effects in a linear mixed model. We show through two public benchmark datasets that the resulting model improves forecasting performance.

preprint, 2022, arXiv

K-ARMA Models for Clustering Time Series Data

We present an approach to clustering time series data using a model-based generalization of the K-Means algorithm which we call K-Models. We prove the convergence of this general algorithm and relate it to the hard-EM algorithm for mixture modeling. We then apply our method first with an AR($p$) clustering example and show how the clustering algorithm can be made robust to outliers using a least-absolute deviations criterion. We then build our clustering algorithm up for ARMA($p,q$) models and extend this to ARIMA($p,d,q$) models. We develop a goodness of fit statistic for the models fitted to clusters based on the Ljung-Box statistic. We perform experiments with simulated data to show how the algorithm can be used for outlier detection, detecting distributional drift, and discuss the impact of initialization method on empty clusters. We also perform experiments on real data which show that our method is competitive with other existing methods for similar time series clustering tasks.
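
The alternation the abstract describes (fit one model per cluster, then reassign each series to its best-fitting model) can be sketched for the simplest case. This is a minimal illustration assuming a zero-mean AR(1) model fitted by least squares; the function names and the rule that an empty cluster keeps its previous coefficient are assumptions, not the authors' implementation.

```python
import random

def fit_ar1(series_list):
    """Least-squares AR(1) coefficient pooled over the series in one cluster."""
    num = sum(x[t] * x[t - 1] for x in series_list for t in range(1, len(x)))
    den = sum(x[t - 1] ** 2 for x in series_list for t in range(1, len(x)))
    return num / den if den > 0 else 0.0

def ar1_loss(x, phi):
    """Sum of squared one-step-ahead residuals of series x under phi."""
    return sum((x[t] - phi * x[t - 1]) ** 2 for t in range(1, len(x)))

def k_models_ar1(series, k, iters=20, seed=0):
    """Alternate per-cluster model fitting and hard reassignment."""
    rng = random.Random(seed)
    labels = [rng.randrange(k) for _ in series]   # random initial partition
    phis = [0.0] * k
    for _ in range(iters):
        for j in range(k):
            members = [x for x, lab in zip(series, labels) if lab == j]
            if members:                           # empty cluster keeps old model
                phis[j] = fit_ar1(members)
        new_labels = [min(range(k), key=lambda j: ar1_loss(x, phis[j]))
                      for x in series]
        if new_labels == labels:                  # assignments stable: stop
            break
        labels = new_labels
    return labels, phis

def simulate_ar1(phi, n, rng):
    """Generate a zero-mean AR(1) path with unit-variance Gaussian noise."""
    x = [0.0]
    for _ in range(n - 1):
        x.append(phi * x[-1] + rng.gauss(0.0, 1.0))
    return x
```

Swapping the squared residuals in `ar1_loss` for absolute residuals, with a matching fit, would be one way to approach the least-absolute-deviations variant mentioned above.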

preprint, 2022, arXiv

Kendall's Tau for Two-Sample Inference Problems

We consider a Kendall's tau measure between a binary group indicator and the continuous variable under investigation to develop a thorough two-sample comparison procedure. The measure serves as a useful alternative to the hazard ratio, whose applicability depends on the proportional hazards assumption. For right-censored data, we propose a weighted log-rank statistic with weights adapted to the censoring distributions and develop theoretical properties of the derived estimators. In the absence of censoring, the proposed estimator reduces to the Wilcoxon-Mann-Whitney (WMW) statistic. The proposed methodology is applied to analyze several data examples.
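
The uncensored special case can be illustrated with a short sketch. It counts concordant minus discordant cross-group pairs and compares the result with the Mann-Whitney count $U$. The normalization by $n_1 n_2$ and the function names are assumptions made for illustration; the paper's estimator, in particular its handling of censoring, is not reproduced here.

```python
def mann_whitney_u(x, y):
    """Number of cross-group pairs with x_i < y_j (ties count one half)."""
    return sum(1.0 if xi < yj else 0.5 if xi == yj else 0.0
               for xi in x for yj in y)

def two_sample_tau(x, y):
    """(concordant - discordant) cross-group pairs, divided by n1 * n2."""
    s = sum((yj > xi) - (yj < xi) for xi in x for yj in y)
    return s / (len(x) * len(y))
```

Under this normalization the two quantities are linked by `two_sample_tau(x, y) == 2 * mann_whitney_u(x, y) / (len(x) * len(y)) - 1`, which is why a tau-type measure and the WMW statistic carry the same information when nothing is censored.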

preprint, 2021, arXiv

An empirical Bayes approach to estimating dynamic models of co-regulated gene expression

Time-course gene expression datasets provide insight into the dynamics of complex biological processes, such as immune response and organ development. It is of interest to identify genes with similar temporal expression patterns because such genes are often biologically related. However, this task is challenging due to the high dimensionality of these datasets and the nonlinearity of gene expression time dynamics. We propose an empirical Bayes approach to estimating ordinary differential equation (ODE) models of gene expression, from which we derive a similarity metric between genes called the Bayesian lead-lag $R^2$ (LLR2). Importantly, the calculation of the LLR2 leverages biological databases that document known interactions amongst genes; this information is automatically used to define informative prior distributions on the ODE model's parameters. As a result, the LLR2 is a biologically-informed metric that can be used to identify clusters or networks of functionally-related genes with co-moving or time-delayed expression patterns. We then derive data-driven shrinkage parameters from Stein's unbiased risk estimate that optimally balance the ODE model's fit to both data and external biological information. Using real gene expression data, we demonstrate that our methodology allows us to recover interpretable gene clusters and sparse networks. These results reveal new insights about the dynamics of biological systems.

preprint, 2021, arXiv

HALO: Learning to Prune Neural Networks with Shrinkage

Deep neural networks achieve state-of-the-art performance in a variety of tasks by extracting a rich set of features from unstructured data; however, this performance is closely tied to model size. Modern techniques for inducing sparsity and reducing model size are (1) network pruning, (2) training with a sparsity inducing penalty, and (3) training a binary mask jointly with the weights of the network. We study different sparsity inducing penalties from the perspective of Bayesian hierarchical models and present a novel penalty called Hierarchical Adaptive Lasso (HALO) which learns to adaptively sparsify weights of a given network via trainable parameters. When used to train over-parametrized networks, our penalty yields small subnetworks with high accuracy without fine-tuning. Empirically, on image recognition tasks, we find that HALO is able to learn highly sparse networks (only 5% of the parameters) with significant gains in performance over state-of-the-art magnitude pruning methods at the same level of sparsity. Code is available at https://github.com/skyler120/sparsity-halo.

preprint, 2020, arXiv

Robust Matrix Completion with Mixed Data Types

We consider the matrix completion problem of recovering a structured low rank matrix with partially observed entries with mixed data types. The vast majority of existing solutions have proposed computationally feasible estimators with strong statistical guarantees for the case where the underlying distribution of data in the matrix is continuous. A few recent approaches have extended these estimators, using similar ideas, to the case where the underlying distribution belongs to the exponential family. Most of these approaches assume that there is only one underlying distribution and that the low rank constraint is regularized by the matrix Schatten norm. We propose a computationally feasible statistical approach with strong recovery guarantees, along with an algorithmic framework suited for parallelization, to recover a low rank matrix with partially observed entries for mixed data types in one step. We also provide extensive simulation evidence that corroborates our theoretical results.

preprint, 2020, arXiv

Tree Space Prototypes: Another Look at Making Tree Ensembles Interpretable

Ensembles of decision trees perform well on many problems, but are not interpretable. In contrast to existing approaches in interpretability that focus on explaining relationships between features and predictions, we propose an alternative approach to interpret tree ensemble classifiers by surfacing representative points for each class -- prototypes. We introduce a new distance for Gradient Boosted Tree models, and propose new, adaptive prototype selection methods with theoretical guarantees, with the flexibility to choose a different number of prototypes in each class. We demonstrate our methods on random forests and gradient boosted trees, showing that the prototypes can perform as well as or even better than the original tree ensemble when used as a nearest-prototype classifier. In a user study, humans were better at predicting the output of a tree ensemble classifier when using prototypes than when using Shapley values, a popular feature attribution method. Hence, prototypes present a viable alternative to feature-based explanations for tree ensembles.

preprint, 2015, arXiv

A Scalable Empirical Bayes Approach to Variable Selection

We develop a model-based empirical Bayes approach to variable selection problems in which the number of predictors is very large, possibly much larger than the number of responses (the so-called 'large p, small n' problem). We consider the multiple linear regression setting, where the response is assumed to be a continuous variable and it is a linear function of the predictors plus error. The explanatory variables in the linear model can have a positive effect on the response, a negative effect, or no effect. We model the effects of the linear predictors as a three-component mixture in which a key assumption is that only a small (unknown) fraction of the candidate predictors have a non-zero effect on the response variable. By treating the coefficients as random effects we develop an approach that is computationally efficient because the number of parameters that have to be estimated is small, and remains constant regardless of the number of explanatory variables. The model parameters are estimated using the EM algorithm which is scalable and leads to significantly faster convergence, compared with simulation-based methods.

preprint, 2015, arXiv

Improved Second Order Estimation in the Singular Multivariate Normal Model

We consider the problem of estimating covariance and precision matrices, and their associated discriminant coefficients, from normal data when the rank of the covariance matrix is strictly smaller than its dimension and the available sample size. Using unbiased risk estimation, we construct novel estimators by minimizing upper bounds on the difference in risk over several classes. Our proposed estimators are empirically demonstrated to offer substantial improvement over classical approaches.

preprint, 2015, arXiv

On the Domain of Attraction of a Tracy-Widom Law with Applications to Testing Multiple Largest Roots

The greatest root statistic arises as the test statistic in several multivariate analysis settings. Suppose there is a global null hypothesis that consists of different independent sub-null hypotheses, and suppose the greatest root statistic is used as the test statistic for each sub-null hypothesis. Such problems may arise when conducting a batch MANOVA or several batches of pairwise testing for equality of covariance matrices. Using the union-intersection testing approach and by letting the problem dimension tend to infinity faster than the number of batches, we show that the global null can be tested using a Gumbel distribution to approximate the critical values. Although the theoretical results are asymptotic, simulation studies indicate that the approximations are very good even for small to moderate dimensions. The results are general and can be applied in any setting where the greatest root statistic is used, not just for the two methods we use for illustrative purposes.

preprint, 2014, arXiv

AIC, Cp and estimators of loss for elliptically symmetric distributions

In this article, we develop a modern perspective on Akaike's Information Criterion and Mallows' Cp for model selection. Despite the differences in their respective motivation, they are equivalent in the special case of Gaussian linear regression. In this case they are also equivalent to a third criterion, an unbiased estimator of the quadratic prediction loss, derived from loss estimation theory. Our first contribution is to provide an explicit link between loss estimation and model selection through a new oracle inequality. We then show that the form of the unbiased estimator of the quadratic prediction loss under a Gaussian assumption still holds under a more general distributional assumption, the family of spherically symmetric distributions. One of the features of our results is that our criterion does not rely on the specificity of the distribution, but only on its spherical symmetry. Also this family of laws offers some dependence property between the observations, a case not often studied.
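
The Gaussian equivalence can be written out in the textbook case of a linear model with $n$ observations, a candidate submodel with $k$ coefficients, residual sum of squares $\mathrm{RSS}_k$, and known noise variance $\sigma^2$. The notation below is chosen for illustration and is not taken from the paper, which may treat $\sigma^2$ differently.

```latex
\begin{align*}
\hat L_k &= \mathrm{RSS}_k + (2k - n)\,\sigma^2
  && \text{(unbiased for } \|X\hat\beta_k - X\beta\|^2\text{)} \\
C_p(k) &= \frac{\mathrm{RSS}_k}{\sigma^2} - n + 2k
  = \frac{\hat L_k}{\sigma^2} \\
\mathrm{AIC}(k) &= \frac{\mathrm{RSS}_k}{\sigma^2} + 2k + n\log(2\pi\sigma^2)
  = C_p(k) + n + n\log(2\pi\sigma^2)
\end{align*}
```

Because the extra terms and the scale factor do not depend on $k$, the three criteria rank candidate models identically in this setting.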

preprint, 2014, arXiv

Noise Estimation in the Spiked Covariance Model

The problem of estimating a spiked covariance matrix in high dimensions under Frobenius loss and the parallel problem of estimating the noise in spiked PCA are investigated. We propose an estimator of the noise parameter by minimizing an unbiased estimator of the invariant Frobenius risk using calculus of variations. The resulting estimator is shown, using random matrix theory, to be strongly consistent and essentially asymptotically normal and minimax for the noise estimation problem. We use this construction to build a robust spiked covariance matrix estimator with consistent eigenvalues.

preprint, 2014, arXiv

Supervised Classification Using Sparse Fisher's LDA

It is well known that in a supervised classification setting when the number of features is smaller than the number of observations, Fisher's linear discriminant rule is asymptotically Bayes. However, there are numerous modern applications where classification is needed in the high-dimensional setting. Naive implementation of Fisher's rule in this case fails to provide good results because the sample covariance matrix is singular. Moreover, by constructing a classifier that relies on all features the interpretation of the results is challenging. Our goal is to provide robust classification that relies only on a small subset of important features and accounts for the underlying correlation structure. We apply a lasso-type penalty to the discriminant vector to ensure sparsity of the solution and use a shrinkage type estimator for the covariance matrix. The resulting optimization problem is solved using an iterative coordinate ascent algorithm. Furthermore, we analyze the effect of nonconvexity on the sparsity level of the solution and highlight the difference between the penalized and the constrained versions of the problem. The simulation results show that the proposed method performs favorably in comparison to alternatives. The method is used to classify leukemia patients based on DNA methylation features.

preprint, 2013, arXiv

Improved multivariate normal mean estimation with unknown covariance when p is greater than n

We consider the problem of estimating the mean vector of a p-variate normal $(θ,Σ)$ distribution under invariant quadratic loss, $(δ-θ)'Σ^{-1}(δ-θ)$, when the covariance is unknown. We propose a new class of estimators that dominate the usual estimator $δ^0(X)=X$. The proposed estimators of $θ$ depend upon X and an independent Wishart matrix S with n degrees of freedom; however, S is singular almost surely when p>n. The proof of domination involves the development of some new unbiased estimators of risk for the p>n setting. We also find some relationships between the amount of domination and the magnitudes of n and p.

preprint, 2012, arXiv

On Improved Loss Estimation for Shrinkage Estimators

Let $X$ be a random vector with distribution $P_θ$ where $θ$ is an unknown parameter. When estimating $θ$ by some estimator $φ(X)$ under a loss function $L(θ,φ)$, classical decision theory advocates that such a decision rule should be used if it has suitable properties with respect to the frequentist risk $R(θ,φ)$. However, after having observed $X=x$, instances arise in practice in which $φ$ is to be accompanied by an assessment of its loss, $L(θ,φ(x))$, which is unobservable since $θ$ is unknown. A common approach to this assessment is to consider estimation of $L(θ,φ(x))$ by an estimator $δ$, called a loss estimator. We present an expository development of loss estimation with substantial emphasis on the setting where the distributional context is normal and its extension to the case where the underlying distribution is spherically symmetric. Our overview covers improved loss estimators for least squares but primarily focuses on shrinkage estimators. Bayes estimation is also considered and comparisons are made with unbiased estimation.

preprint, 2011, arXiv

Laplace Approximated EM Microarray Analysis: An Empirical Bayes Approach for Comparative Microarray Experiments

A two-groups mixed-effects model for the comparison of (normalized) microarray data from two treatment groups is considered. Most competing parametric methods that have appeared in the literature are obtained as special cases or by minor modification of the proposed model. Approximate maximum likelihood fitting is accomplished via a fast and scalable algorithm, which we call LEMMA (Laplace approximated EM Microarray Analysis). The posterior odds of treatment $\times$ gene interactions, derived from the model, involve shrinkage estimates of both the interactions and of the gene specific error variances. Genes are classified as being associated with treatment based on the posterior odds and the local false discovery rate (f.d.r.) with a fixed cutoff. Our model-based approach also allows one to declare the non-null status of a gene by controlling the false discovery rate (FDR). It is shown in a detailed simulation study that the approach outperforms well-known competitors. We also apply the proposed methodology to two previously analyzed microarray examples. Extensions of the proposed method to paired treatments and multiple treatments are also discussed.

preprint, 2011, arXiv

MM Algorithms for Minimizing Nonsmoothly Penalized Objective Functions

In this paper, we propose a general class of algorithms for optimizing an extensive variety of nonsmoothly penalized objective functions that satisfy certain regularity conditions. The proposed framework utilizes the majorization-minimization (MM) algorithm as its core optimization engine. The resulting algorithms rely on iterated soft-thresholding, implemented componentwise, allowing for fast, stable updating that avoids the need for any high-dimensional matrix inversion. We establish a local convergence theory for this class of algorithms under weaker assumptions than previously considered in the statistical literature. We also demonstrate the exceptional effectiveness of new acceleration methods, originally proposed for the EM algorithm, in this class of problems. Simulation results and a microarray data example are provided to demonstrate the algorithm's capabilities and versatility.
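
Componentwise iterated soft-thresholding can be illustrated on the lasso, one standard instance of a nonsmoothly penalized objective. The sketch below majorizes the least-squares term by a separable quadratic whose curvature is bounded by the trace of $X'X$, so each MM step reduces to a soft-threshold update with no matrix inversion. The choice of majorizer, the function names, and the fixed iteration count are assumptions for illustration, not the algorithm class analyzed in the paper.

```python
def soft_threshold(z, t):
    """Minimizer over b of 0.5 * (b - z)^2 + t * |b|."""
    return max(abs(z) - t, 0.0) * (1.0 if z > 0 else -1.0)

def lasso_mm(X, y, lam, iters=500):
    """Minimize 0.5 * ||y - X b||^2 + lam * ||b||_1 by majorization-minimization.

    X is a list of rows, y a list of responses. Assumes X is not all zeros.
    """
    n, p = len(X), len(X[0])
    # Curvature bound: the trace of X'X dominates its largest eigenvalue.
    L = sum(X[i][j] ** 2 for i in range(n) for j in range(p))
    b = [0.0] * p
    for _ in range(iters):
        # Residuals and gradient of the smooth least-squares term at b.
        r = [y[i] - sum(X[i][j] * b[j] for j in range(p)) for i in range(n)]
        grad = [-sum(X[i][j] * r[i] for i in range(n)) for j in range(p)]
        # Minimizing the separable majorizer is a componentwise soft-threshold.
        b = [soft_threshold(b[j] - grad[j] / L, lam / L) for j in range(p)]
    return b
```

With an identity design the iteration converges to the closed-form solution: `lasso_mm([[1.0, 0.0], [0.0, 1.0]], [3.0, 0.5], 1.0)` approaches `[2.0, 0.0]`, the soft-thresholded responses.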

preprint, 2010, arXiv

A Conversation with Shayle R. Searle

Born in New Zealand, Shayle Robert Searle earned a bachelor's degree (1949) and a master's degree (1950) from Victoria University, Wellington, New Zealand. After working for an actuary, Searle went to Cambridge University where he earned a Diploma in mathematical statistics in 1953. Searle won a Fulbright travel award to Cornell University, where he earned a doctorate in animal breeding, with a strong minor in statistics in 1959, studying under Professor Charles Henderson. In 1962, Cornell invited Searle to work in the university's computing center, and he soon joined the faculty as an assistant professor of biological statistics. He was promoted to associate professor in 1965, and became a professor of biological statistics in 1970. Searle has also been a visiting professor at Texas A&M University, Florida State University, Universität Augsburg and the University of Auckland. He has published several statistics textbooks and has authored more than 165 papers. Searle is a Fellow of the American Statistical Association, the Royal Statistical Society, and he is an elected member of the International Statistical Institute. He also has received the prestigious Alexander von Humboldt U.S. Senior Scientist Award, is an Honorary Fellow of the Royal Society of New Zealand and was recently awarded the D.Sc. Honoris Causa by his alma mater, Victoria University of Wellington, New Zealand.

preprint, 2010, arXiv

A Multivariate Variance Components Model for Analysis of Covariance in Designed Experiments

Traditional methods for covariate adjustment of treatment means in designed experiments are inherently conditional on the observed covariate values. In order to develop a coherent general methodology for analysis of covariance, we propose a multivariate variance components model for the joint distribution of the response and covariates. It is shown that, if the design is orthogonal with respect to (random) blocking factors, then appropriate adjustments to treatment means can be made using the univariate variance components model obtained by conditioning on the observed covariate values. However, it is revealed that some widely used models are incorrectly specified, leading to biased estimates and incorrect standard errors. The approach clarifies some issues that have been the source of ongoing confusion in the statistics literature.
