Source author record

Paul D. McNicholas

Paul D. McNicholas appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher (unclaimed source record)

Methodology, Computation, Machine Learning, Applications, math.ST, Statistics Theory

Catalog footprint

What is connected

26 works

6 topics

4 close collaborators


preprint, 2026, arXiv

Classification Fields: Arbitrarily Fine Recursive Hierarchical Clustering From Few Examples

Classical clustering methods usually return either a finite partition of the observed data or a finite dendrogram over it. This finite-sample view is inadequate when the hierarchy of interest is a recursive geometric object with fine-scale refinements that continue beyond the levels directly observed. We introduce classification fields: infinite-depth hierarchical cluster structures on $\mathbb{R}^d$ generated by a local parent-to-child refinement rule. A classification field generator maps each parent centre to an ordered, bounded, and separated tuple of child residuals. Together with a root and a scale factor, this rule recursively generates cluster centres, Voronoi cells, and a metric DAG encoding the hierarchy. Given only a finite prefix of such a hierarchy, we learn a classification field predictor that approximates the generator and can be rolled out to unseen depths. We prove exponential truncation convergence in the completed cell metric and ReLU realizability with width $O(\varepsilon^{-\gamma})$ and depth $\widetilde O(\varepsilon^{-3\gamma/2})$, where $\gamma=\log K/(-\log s)$, up to finite-window aspect-ratio factors. The approximation holds at the level of the induced compact metric structures, measured in the completed cell-metric Hausdorff distance. Experimental validation on matched CFG-generated hierarchies, IFS fractals, and image-induced recursive clustering hierarchies shows that learned predictors preserve ordered child slots, unordered geometry, and hierarchy-level path metrics under recursive rollout. These results support the claim that finite hierarchical observations can reveal local refinement rules capable of generating substantially deeper classification fields.

preprint, 2022, arXiv

Finite mixtures of matrix-variate Poisson-log normal distributions for three-way count data

Three-way data structures, characterized by three entities, the units, the variables and the occasions, are frequent in biological studies. In RNA sequencing, three-way data structures are obtained when high-throughput transcriptome sequencing data are collected for $n$ genes across $p$ conditions at $r$ occasions. Matrix variate distributions offer a natural way to model three-way data and mixtures of matrix variate distributions can be used to cluster three-way data. Clustering of gene expression data is carried out as a means of discovering gene co-expression networks. In this work, a mixture of matrix variate Poisson-log normal distributions is proposed for clustering read counts from RNA sequencing. By considering the matrix variate structure, full information on the conditions and occasions of the RNA sequencing dataset is simultaneously considered, and the number of covariance parameters to be estimated is reduced. We propose three different frameworks for parameter estimation: a Markov chain Monte Carlo based approach, a variational Gaussian approximation based approach, and a hybrid approach. Various information criteria are used for model selection. The models are applied to both real and simulated data, and we demonstrate that the proposed approaches can recover the underlying cluster structure in both cases. In simulation studies where the true model parameters are known, our proposed approach shows good parameter recovery.

preprint, 2020, arXiv

An Evolutionary Algorithm with Crossover and Mutation for Model-Based Clustering

An evolutionary algorithm (EA) is developed as an alternative to the EM algorithm for parameter estimation in model-based clustering. This EA facilitates a different search of the fitness landscape, i.e., the likelihood surface, utilizing both crossover and mutation. Furthermore, this EA represents an efficient approach to "hard" model-based clustering and so it can be viewed as a sort of generalization of the k-means algorithm, which is itself equivalent to a restricted Gaussian mixture model. The EA is illustrated on several datasets, and its performance is compared to other hard clustering approaches and model-based clustering via the EM algorithm.

preprint, 2020, arXiv

Clustering Discrete-Valued Time Series

There is a need for the development of models that are able to account for discreteness in data, along with its time series properties and correlation. Our focus falls on INteger-valued AutoRegressive (INAR) type models. The INAR type models can be used in conjunction with existing model-based clustering techniques to cluster discrete-valued time series data. With the use of a finite mixture model, several existing techniques such as the selection of the number of clusters, estimation using expectation-maximization and model selection are applicable. The proposed model is then demonstrated on real data to illustrate its clustering applications.

preprint, 2020, arXiv

Detecting British Columbia Coastal Rainfall Patterns by Clustering Gaussian Processes

Functional data analysis is a statistical framework where data are assumed to follow some functional form. This method of analysis is commonly applied to time series data, where time, measured continuously or in discrete intervals, serves as the location for a function's value. Gaussian processes are a generalization of the multivariate normal distribution to function space and, in this paper, they are used to shed light on coastal rainfall patterns in British Columbia (BC). Specifically, this work addresses the question of how one should carry out an exploratory cluster analysis for the BC, or any similar, coastal rainfall data. An approach is developed for clustering multiple processes observed on a comparable interval, based on how similar their underlying covariance kernel is. This approach provides interesting insights into the BC data, and these insights can be framed in terms of El Niño and La Niña; however, the result is not simply one cluster representing El Niño years and another for La Niña years. From one perspective, the results show that clustering annual rainfall can potentially be used to identify extreme weather patterns.

preprint, 2020, arXiv

Mixtures of Contaminated Matrix Variate Normal Distributions

Analysis of three-way data is becoming ever more prevalent in the literature, especially in the area of clustering and classification. Real data, including real three-way data, are often contaminated by potential outlying observations. Their detection, as well as the development of robust models insensitive to their presence, is particularly important for this type of data because of the practical issues concerning their effective visualization. Herein, the contaminated matrix variate normal distribution is discussed and then utilized in the mixture model paradigm for clustering. One key advantage of the proposed model is the ability to automatically detect potential outlying matrices by computing their a posteriori probability of being a "good" or "bad" point. Such detection is currently unavailable using existing matrix variate methods. An expectation conditional maximization algorithm is used for parameter estimation, and both simulated and real data are used for illustration.

preprint, 2016, arXiv

Multivariate response and parsimony for Gaussian cluster-weighted models

A family of parsimonious Gaussian cluster-weighted models is presented. This family concerns a multivariate extension to cluster-weighted modelling that can account for correlations between multivariate responses. Parsimony is attained by constraining parts of an eigen-decomposition imposed on the component covariance matrices. A sufficient condition for identifiability is provided and an expectation-maximization algorithm is presented for parameter estimation. Model performance is investigated on both synthetic and classical real data sets and compared with some popular approaches. Finally, accounting for linear dependencies in the presence of a linear regression structure is shown to offer better performance, vis-à-vis clustering, over existing methodologies.

preprint, 2016, arXiv

Parsimonious mixtures of multivariate contaminated normal distributions

A mixture of multivariate contaminated normal distributions is developed for model-based clustering. In addition to the parameters of the classical normal mixture, our contaminated mixture has, for each cluster, a parameter controlling the proportion of mild outliers and one specifying the degree of contamination. Crucially, these parameters do not have to be specified a priori, adding flexibility to our approach. Parsimony is introduced via eigen-decomposition of the component covariance matrices, and sufficient conditions for the identifiability of all the members of the resulting family are provided. An expectation-conditional maximization algorithm is outlined for parameter estimation and various implementation issues are discussed. Using a large scale simulation study, the behaviour of the proposed approach is investigated and comparison with well-established finite mixtures is provided. The performance of this novel family of models is also illustrated on artificial and real data.

preprint, 2015, arXiv

Mixtures of Multivariate Power Exponential Distributions

An expanded family of mixtures of multivariate power exponential distributions is introduced. While fitting heavy-tails and skewness has received much attention in the model-based clustering literature recently, we investigate the use of a distribution that can deal with both varying tail-weight and peakedness of data. A family of parsimonious models is proposed using an eigen-decomposition of the scale matrix. A generalized expectation-maximization algorithm is presented that combines convex optimization via a minorization-maximization approach and optimization based on accelerated line search algorithms on the Stiefel manifold. Lastly, the utility of this family of models is illustrated using both toy and benchmark data.

preprint, 2015, arXiv

On nomenclature for, and the relative merits of, two formulations of skew distributions

We examine some distributions used extensively within the model-based clustering literature in recent years, paying special attention to claims that have been made about their relative efficacy. Theoretical arguments are provided as well as real data examples.

preprint, 2014, arXiv

An Adaptive LASSO-Penalized BIC

Mixture models are becoming a popular tool for the clustering and classification of high-dimensional data. In such high dimensional applications, model selection is problematic. The Bayesian information criterion, which is popular in lower dimensional applications, tends to underestimate the true number of components in high dimensions. We introduce an adaptive LASSO-penalized BIC (ALPBIC) to mitigate this problem. The efficacy of the ALPBIC is illustrated via applications of parsimonious mixtures of factor analyzers. The selection of the best model by ALPBIC is shown to be consistent with increasing numbers of observations based on simulated and real data analyses.

preprint, 2014, arXiv

Hypothesis Testing for Parsimonious Gaussian Mixture Models

Gaussian mixture models with eigen-decomposed covariance structures make up the most popular family of mixture models for clustering and classification, i.e., the Gaussian parsimonious clustering models (GPCM). Although the GPCM family has been used for almost 20 years, selecting the best member of the family in a given situation remains a troublesome problem. Likelihood ratio tests are developed to tackle this problem. These likelihood ratio tests use the heteroscedastic model under the alternative hypothesis but provide much more flexibility and real-world applicability than previous approaches that compare the homoscedastic Gaussian mixture versus the heteroscedastic one. Along the way, a novel maximum likelihood estimation procedure is developed for two members of the GPCM family. Simulations show that the $\chi^2$ reference distribution gives a reasonable approximation for the LR statistics only when the sample size is considerable and when the mixture components are well separated; accordingly, following Lo (2008), a parametric bootstrap is adopted. Furthermore, by generalizing the idea of Greselin and Punzo (2013) to the clustering context, a closed testing procedure, having the defined likelihood ratio tests as local tests, is introduced to assess a unique model in the general family. The advantages of this likelihood ratio testing procedure are illustrated via an application to the well-known Iris data set.

preprint, 2014, arXiv

Mixtures of Variance-Gamma Distributions

A mixture of variance-gamma distributions is introduced and developed for model-based clustering and classification. The latest in a growing line of non-Gaussian mixture approaches to clustering and classification, the proposed mixture of variance-gamma distributions is a special case of the recently developed mixture of generalized hyperbolic distributions, and a restriction is required to ensure identifiability. Our mixture of variance-gamma distributions is perhaps the most useful such special case and, we will contend, may be more useful than the mixture of generalized hyperbolic distributions in some cases. In addition to being an alternative to the mixture of generalized hyperbolic distributions, our mixture of variance-gamma distributions serves as an alternative to the ubiquitous mixture of Gaussian distributions, which is a special case, as well as several non-Gaussian approaches, some of which are special cases. The mathematical development of our mixture of variance-gamma distributions model relies on its relationship with the generalized inverse Gaussian distribution; accordingly, the latter is reviewed before our mixture of variance-gamma distributions is presented. Parameter estimation is carried out within the expectation-maximization framework.

preprint, 2014, arXiv

Robust Clustering in Regression Analysis via the Contaminated Gaussian Cluster-Weighted Model

The Gaussian cluster-weighted model (CWM) is a mixture of regression models with random covariates that allows for flexible clustering of a random vector composed of response variables and covariates. In each mixture component, it adopts a Gaussian distribution for both the covariates and the responses given the covariates. To robustify the approach with respect to possible elliptical heavy tailed departures from normality, due to the presence of atypical observations, the contaminated Gaussian CWM is here introduced. In addition to the parameters of the Gaussian CWM, each mixture component of our contaminated CWM has a parameter controlling the proportion of outliers, one controlling the proportion of leverage points, one specifying the degree of contamination with respect to the response variables, and one specifying the degree of contamination with respect to the covariates. Crucially, these parameters do not have to be specified a priori, adding flexibility to our approach. Furthermore, once the model is estimated and the observations are assigned to the groups, a finer intra-group classification in typical points, outliers, good leverage points, and bad leverage points - concepts of primary importance in robust regression analysis - can be directly obtained. Relations with other mixture-based contaminated models are analyzed, identifiability conditions are provided, an expectation-conditional maximization algorithm is outlined for parameter estimation, and various implementation and operational issues are discussed. Properties of the estimators of the regression coefficients are evaluated through Monte Carlo experiments and compared to the estimators from the Gaussian CWM. A sensitivity study is also conducted based on a real data set.

preprint, 2013, arXiv

A Partial EM Algorithm for Clustering White Breads

The design of new products for consumer markets has undergone a major transformation over the last 50 years. Traditionally, inventors would create a new product that they thought might address a perceived need of consumers. Such products tended to be developed to meet the inventors' own perception and not necessarily that of consumers. The social consequence of a top-down approach to product development has been a large failure rate in new product introduction. By surveying potential customers, a refined target is created that guides developers and reduces the failure rate. Today, however, the proliferation of products and the emergence of consumer choice have resulted in the identification of segments within the market. Understanding your target market typically involves conducting a product category assessment, where 12 to 30 commercial products are tested with consumers to create a preference map. Every consumer gets to test every product in a complete-block design; however, many classes of products do not lend themselves to such approaches because only a few samples can be evaluated before 'fatigue' sets in. We consider an analysis of incomplete balanced-incomplete-block data on 12 different types of white bread. A latent Gaussian mixture model is used for this analysis, with a partial expectation-maximization (PEM) algorithm developed for parameter estimation. This PEM algorithm circumvents the need for a traditional E-step, by performing a partial E-step that reduces the Kullback-Leibler divergence between the conditional distribution of the missing data and the distribution of the missing data given the observed data. The results of the white bread analysis are discussed and some mathematical details are given in an appendix.

preprint, 2013, arXiv

Accelerated Failure Time Models for Competing Risks in a Cluster Weighted Modelling Framework

A novel approach for dealing with censored competing risks regression data is proposed. This is implemented by a mixture of accelerated failure time (AFT) models for a competing risks scenario within a cluster-weighted modelling (CWM) framework. Specifically, we make use of the log-normal AFT model here but any commonly used AFT model can be utilized. The alternating expectation conditional maximization algorithm (AECM) is used for parameter estimation and bootstrapping for standard error estimation. Finally, we present our results on some simulated and real competing risks data.

preprint, 2013, arXiv

Capturing Patterns via Parsimonious t Mixture Models

This paper exploits a simplified version of the mixture of multivariate t-factor analyzers (MtFA) for robust mixture modelling and clustering of high-dimensional data that frequently contain a number of outliers. Two classes of eight parsimonious t mixture models are introduced and computation of maximum likelihood estimates of parameters is achieved using the alternating expectation conditional maximization (AECM) algorithm. The usefulness of the methodology is illustrated through applications of image compression and compact facial representation.

preprint, 2013, arXiv

Families of Parsimonious Finite Mixtures of Regression Models

Finite mixtures of regression models offer a flexible framework for investigating heterogeneity in data with functional dependencies. These models can be conveniently used for unsupervised learning on data with clear regression relationships. We extend such models by imposing an eigen-decomposition on the multivariate error covariance matrix. By constraining parts of this decomposition, we obtain families of parsimonious mixtures of regressions and mixtures of regressions with concomitant variables. These families of models account for correlations between multiple responses. An expectation-maximization algorithm is presented for parameter estimation and performance is illustrated on simulated and real data.

preprint, 2013, arXiv

Mixtures of Common Skew-t Factor Analyzers

A mixture of common skew-t factor analyzers model is introduced for model-based clustering of high-dimensional data. By assuming common component factor loadings, this model allows clustering to be performed in the presence of a large number of mixture components or when the number of dimensions is too large to be well-modelled by the mixtures of factor analyzers model or a variant thereof. Furthermore, assuming that the component densities follow a skew-t distribution allows robust clustering of skewed data. The alternating expectation-conditional maximization algorithm is employed for parameter estimation. We demonstrate excellent clustering performance when our model is applied to real and simulated data. This paper marks the first time that skewed common factors have been used.

preprint, 2013, arXiv

Mixtures of Skew-t Factor Analyzers

In this paper, we introduce a mixture of skew-t factor analyzers as well as a family of mixture models based thereon. The mixture of skew-t distributions model that we use arises as a limiting case of the mixture of generalized hyperbolic distributions. Like their Gaussian and t-distribution analogues, our mixtures of skew-t factor analyzers are very well-suited to the model-based clustering of high-dimensional data. Imposing constraints on components of the decomposed covariance parameter results in the development of eight flexible models. The alternating expectation-conditional maximization algorithm is used for model parameter estimation and the Bayesian information criterion is used for model selection. The models are applied to both real and simulated data, giving superior clustering results compared to a well-established family of Gaussian mixture models.

preprint, 2013, arXiv

Parsimonious Shifted Asymmetric Laplace Mixtures

A family of parsimonious shifted asymmetric Laplace mixture models is introduced. We extend the mixture of factor analyzers model to the shifted asymmetric Laplace distribution. Imposing constraints on the constituent parts of the resulting decomposed component scale matrices leads to a family of parsimonious models. An explicit two-stage parameter estimation procedure is described, and the Bayesian information criterion and the integrated completed likelihood are compared for model selection. This novel family of models is applied to real data, where it is compared to its Gaussian analogue within clustering and classification paradigms.

preprint, 2013, arXiv

Parsimonious Skew Mixture Models for Model-Based Clustering and Classification

In recent work, robust mixture modelling approaches using skewed distributions have been explored to accommodate asymmetric data. We introduce parsimony by developing skew-t and skew-normal analogues of the popular GPCM family that employ an eigenvalue decomposition of a positive-semidefinite matrix. The methods developed in this paper are compared to existing models in both an unsupervised and semi-supervised classification framework. Parameter estimation is carried out using the expectation-maximization algorithm and models are selected using the Bayesian information criterion. The efficacy of these extensions is illustrated on simulated and benchmark clustering data sets.

preprint, 2013, arXiv

Standardizing Interestingness Measures for Association Rules

Interestingness measures provide information that can be used to prune or select association rules. A given value of an interestingness measure is often interpreted relative to the overall range of the values that the interestingness measure can take. However, properties of individual association rules restrict the values an interestingness measure can achieve. An interestingness measure can be standardized to take this into account, but this has only been done for one interestingness measure to date, i.e., the lift. Standardization provides greater insight than the raw value and may even alter researchers' perception of the data. We derive standardized analogues of three interestingness measures and use real and simulated data to compare them to their raw versions, each other, and the standardized lift.

preprint, 2013, arXiv

Variable Selection for Clustering and Classification

As data sets continue to grow in size and complexity, effective and efficient techniques are needed to target important features in the variable space. Many of the variable selection techniques that are commonly used alongside clustering algorithms are based upon determining the best variable subspace according to model fitting in a stepwise manner. These techniques are often computationally intensive and can require extended periods of time to run; in fact, some are prohibitively computationally expensive for high-dimensional data. In this paper, a novel variable selection technique is introduced for use in clustering and classification analyses that is both intuitive and computationally efficient. We focus largely on applications in mixture model-based learning, but the technique could be adapted for use with various other clustering/classification methods. Our approach is illustrated on both simulated and real data, highlighted by contrasting its performance with that of other comparable variable selection techniques on the real data sets.

preprint, 2012, arXiv

A LASSO-Penalized BIC for Mixture Model Selection

The efficacy of family-based approaches to mixture model-based clustering and classification depends on the selection of parsimonious models. Current wisdom suggests the Bayesian information criterion (BIC) for mixture model selection. However, the BIC has well-known limitations, including a tendency to overestimate the number of components as well as a proclivity for, often drastically, underestimating the number of components in higher dimensions. While the former problem might be soluble through merging components, the latter is impossible to mitigate in clustering and classification applications. In this paper, a LASSO-penalized BIC (LPBIC) is introduced to overcome this problem. This approach is illustrated based on applications of extensions of mixtures of factor analyzers, where the LPBIC is used to select both the number of components and the number of latent factors. The LPBIC is shown to match or outperform the BIC in several situations.

preprint, 2012, arXiv

Clustering and Classification via Cluster-Weighted Factor Analyzers

In model-based clustering and classification, the cluster-weighted model constitutes a convenient approach when the random vector of interest comprises a response variable Y and a set of p explanatory variables X. However, its applicability may be limited when p is high. To overcome this problem, this paper assumes a latent factor structure for X in each mixture component. This leads to the cluster-weighted factor analyzers (CWFA) model. By imposing constraints on the variance of Y and the covariance matrix of X, a novel family of sixteen CWFA models is introduced for model-based clustering and classification. The alternating expectation-conditional maximization algorithm, for maximum likelihood estimation of the parameters of all the models in the family, is described; to initialize the algorithm, a 5-step hierarchical procedure is proposed, which uses the nested structures of the models within the family and thus guarantees the natural ranking among the sixteen likelihoods. Artificial and real data show that these models have very good clustering and classification performance and that the algorithm is able to recover the parameters very well.
