Source author record

David S. Matteson

David S. Matteson appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.


Topics: Methodology, Applications, Machine Learning, Computation, Quantitative Methods, eess.IV

Catalog footprint

22 works, 6 topics, 4 close collaborators


Preprint, 2024, arXiv

Drift vs Shift: Decoupling Trends and Changepoint Analysis

We introduce a new approach for decoupling trends (drift) and changepoints (shifts) in time series. Our locally adaptive, model-based approach for robust decoupling combines Bayesian trend filtering and machine-learning-based regularization. An over-parameterized Bayesian dynamic linear model (DLM) is first applied to characterize drift. Then a weighted penalized likelihood estimator is paired with the estimated DLM posterior distribution to identify shifts. We show how Bayesian DLMs specified with so-called shrinkage priors can provide smooth estimates of underlying trends in the presence of complex noise components. However, their inability to shrink exactly to zero inhibits direct changepoint detection. In contrast, penalized likelihood methods are highly effective in locating changepoints. However, they require data with simple patterns in both signal and noise. The proposed decoupling approach combines the strengths of both, i.e., the flexibility of Bayesian DLMs with the hard thresholding property of penalized likelihood estimators, to provide changepoint analysis in complex, modern settings. The proposed framework is robust to outliers and can identify a variety of changes, including in mean and slope. It is also easily extended to the analysis of parameter shifts in time-varying parameter models such as dynamic regressions. We illustrate the flexibility and contrast the performance and robustness of our approach with several alternative methods across a wide range of simulations and application examples.

Preprint, 2023, arXiv

Feature detection and hypothesis testing for extremely noisy nanoparticle images using topological data analysis

We propose a flexible algorithm for feature detection and hypothesis testing in images with ultra low signal-to-noise ratio using cubical persistent homology. Our main application is in the identification of atomic columns and other features in transmission electron microscopy (TEM). Cubical persistent homology is used to identify local minima and their size in subregions in the frames of nanoparticle videos, which are hypothesized to correspond to relevant atomic features. We compare the performance of our algorithm to other employed methods for the detection of columns and their intensity. Additionally, Monte Carlo goodness-of-fit testing using real valued summaries of persistence diagrams derived from smoothed images (generated from pixels residing in the vacuum region of an image) is developed and employed to identify whether or not the proposed atomic features generated by our algorithm are due to noise. Using these summaries derived from the generated persistence diagrams, one can produce univariate time series for the nanoparticle videos, thus providing a means for assessing fluxional behavior. A guarantee on the false discovery rate for multiple Monte Carlo testing of identical hypotheses is also established.

Preprint, 2022, arXiv

Bayesian Spillover Graphs for Dynamic Networks

We present Bayesian Spillover Graphs (BSG), a novel method for learning temporal relationships, identifying critical nodes, and quantifying uncertainty for multi-horizon spillover effects in a dynamic system. BSG leverages both an interpretable framework via forecast error variance decompositions (FEVD) and comprehensive uncertainty quantification via Bayesian time series models to contextualize temporal relationships in terms of systemic risk and prediction variability. Forecast horizon hyperparameter $h$ allows for learning both short-term and equilibrium state network behaviors. Experiments for identifying source and sink nodes under various graph and error specifications show significant performance gains against state-of-the-art Bayesian Networks and deep-learning baselines. Applications to real-world systems also showcase BSG as an exploratory analysis tool for uncovering indirect spillovers and quantifying systemic risk.
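The forecast error variance decomposition (FEVD) mentioned in this abstract can be illustrated with a small sketch. This is not the BSG method: it assumes a point-estimated VAR(1) model with identity error covariance (so no orthogonalization step is needed) and carries no Bayesian uncertainty. The function names `mat_mul` and `fevd_var1` are hypothetical. Entry (i, j) of the result is the share of node i's h-step forecast error variance attributable to shocks at node j.

```python
def mat_mul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def fevd_var1(coef, horizon):
    """FEVD table for y_t = coef @ y_{t-1} + e_t, assuming Cov(e_t) = I.

    With identity error covariance the impulse responses are Phi_l = coef**l,
    and the contribution of shock j to variable i is the sum over
    l = 0..horizon-1 of Phi_l[i][j] ** 2, normalized by the row total.
    """
    n = len(coef)
    phi = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # Phi_0 = I
    contrib = [[0.0] * n for _ in range(n)]
    for _ in range(horizon):
        for i in range(n):
            for j in range(n):
                contrib[i][j] += phi[i][j] ** 2
        phi = mat_mul(coef, phi)  # advance to the next impulse response
    return [[c / sum(row) for c in row] for row in contrib]


# Node 2 feeds into node 1 (coefficient 0.3), but not the other way round.
table = fevd_var1([[0.5, 0.3], [0.0, 0.5]], horizon=2)
```

In this toy system the second row of `table` puts all of node 2's variance on its own shocks, while the first row assigns a nonzero share to node 2, which is the kind of directed spillover the horizon parameter $h$ is meant to expose.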

Preprint, 2022, arXiv

Classifying Contaminated Cell Cultures using Time Series Features

We examine the use of time series data, derived from Electric Cell-substrate Impedance Sensing (ECIS), to differentiate between standard mammalian cell cultures and those infected with a mycoplasma organism. With the goal of interpretable results, we perform low-dimensional feature-based classification, extracting application-relevant features from the ECIS time courses. We can achieve very high classification accuracy using only two features, which depend on the cell line under examination. Initial results also show the existence of experimental variation between plates and suggest types of features that may prove more robust to such variation. Our paper is the first to perform a broad examination of ECIS time course features in the context of detecting contamination; to combine different types of features to achieve classification accuracy while preserving interpretability; and to describe and suggest possibilities for ameliorating plate-to-plate variation.

Preprint, 2022, arXiv

Interpretable Latent Variables in Deep State Space Models

We introduce a new version of deep state-space models (DSSMs) that combines a recurrent neural network with a state-space framework to forecast time series data. The model estimates the observed series as functions of latent variables that evolve non-linearly through time. Due to the complexity and non-linearity inherent in DSSMs, previous works on DSSMs typically produced latent variables that are very difficult to interpret. Our paper focuses on producing interpretable latent parameters with two key modifications. First, we simplify the predictive decoder by restricting the response variables to be a linear transformation of the latent variables plus some noise. Second, we utilize shrinkage priors on the latent variables to reduce redundancy and improve robustness. These changes make the latent variables much easier to understand and allow us to interpret the resulting latent variables as random effects in a linear mixed model. We show on two public benchmark datasets that the resulting model improves forecasting performance.

Preprint, 2022, arXiv

K-ARMA Models for Clustering Time Series Data

We present an approach to clustering time series data using a model-based generalization of the K-Means algorithm which we call K-Models. We prove the convergence of this general algorithm and relate it to the hard-EM algorithm for mixture modeling. We first apply our method to an AR($p$) clustering example and show how the clustering algorithm can be made robust to outliers using a least-absolute deviations criterion. We then build our clustering algorithm up for ARMA($p,q$) models and extend this to ARIMA($p,d,q$) models. We develop a goodness of fit statistic for the models fitted to clusters based on the Ljung-Box statistic. We perform experiments with simulated data to show how the algorithm can be used for outlier detection and for detecting distributional drift, and we discuss the impact of the initialization method on empty clusters. We also perform experiments on real data which show that our method is competitive with other existing methods for similar time series clustering tasks.
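The K-Models idea described in this abstract (alternate between assigning each series to its best-fitting model and refitting each model on its members) can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes zero-mean AR(1) models fitted by least squares rather than general AR($p$), ARMA or ARIMA models, and the names `ar1_sse`, `fit_ar1` and `k_models_ar1` are hypothetical. An empty cluster simply keeps its previous coefficient.

```python
def ar1_sse(series, phi):
    """Sum of squared one-step-ahead errors of x_t = phi * x_{t-1} + noise."""
    return sum((series[t] - phi * series[t - 1]) ** 2 for t in range(1, len(series)))


def fit_ar1(group):
    """Least-squares AR(1) coefficient pooled over every series in the group."""
    num = sum(s[t] * s[t - 1] for s in group for t in range(1, len(s)))
    den = sum(s[t - 1] ** 2 for s in group for t in range(1, len(s)))
    return num / den if den > 0 else 0.0


def k_models_ar1(data, init_phis, max_iter=50):
    """Hard-assignment clustering with one AR(1) model per cluster."""
    phis = list(init_phis)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each series goes to the model that fits it best.
        new_labels = [
            min(range(len(phis)), key=lambda k: ar1_sse(s, phis[k])) for s in data
        ]
        if new_labels == labels:
            break  # assignments are stable, so the models will not change
        labels = new_labels
        # Update step: refit each model on the series currently assigned to it.
        for k in range(len(phis)):
            members = [s for s, lab in zip(data, labels) if lab == k]
            if members:
                phis[k] = fit_ar1(members)
    return labels, phis
```

Replacing the squared error in `ar1_sse` and `fit_ar1` with an absolute-error criterion would be the natural place to add the outlier robustness the abstract mentions.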

Preprint, 2021, arXiv

Graph-Based Continual Learning

Despite significant advances, continual learning models still suffer from catastrophic forgetting when exposed to incrementally available data from non-stationary distributions. Rehearsal approaches alleviate the problem by maintaining and replaying a small episodic memory of previous samples, often implemented as an array of independent memory slots. In this work, we propose to augment such an array with a learnable random graph that captures pairwise similarities between its samples, and use it not only to learn new tasks but also to guard against forgetting. Empirical results on several benchmark datasets show that our model consistently outperforms recently proposed baselines for task-free continual learning.

Preprint, 2021, arXiv

Group Linear non-Gaussian Component Analysis with Applications to Neuroimaging

Independent component analysis (ICA) is an unsupervised learning method popular in functional magnetic resonance imaging (fMRI). Group ICA has been used to search for biomarkers in neurological disorders including autism spectrum disorder and dementia. However, current methods use a principal component analysis (PCA) step that may remove low-variance features. Linear non-Gaussian component analysis (LNGCA) enables simultaneous dimension reduction and feature estimation including low-variance features in single-subject fMRI. We present a group LNGCA model to extract group components shared by more than one subject and subject-specific components. To determine the total number of components in each subject, we propose a parametric resampling test that samples spatially correlated Gaussian noise to match the spatial dependence observed in data. In simulations, our estimated group components achieve higher accuracy compared to group ICA. We apply our method to a resting-state fMRI study on autism spectrum disorder in 342 children (252 typically developing, 90 with autism), where the group signals include resting-state networks. We find examples of group components that appear to exhibit different levels of temporal engagement in autism versus typically developing children, as revealed using group LNGCA. This novel approach to matrix decomposition is a promising direction for feature detection in neuroimaging.

Preprint, 2020, arXiv

Factor Analysis of Mixed Data for Anomaly Detection

Anomaly detection aims to identify observations that deviate from the typical pattern of data. Anomalous observations may correspond to financial fraud, health risks, or incorrectly measured data in practice. We show that detecting anomalies in high-dimensional mixed data is enhanced by first embedding the data and then assessing an anomaly scoring scheme. We focus on unsupervised detection and the continuous and categorical (mixed) variable case. We propose a kurtosis-weighted Factor Analysis of Mixed Data for anomaly detection, FAMDAD, to obtain a continuous embedding for anomaly scoring. We illustrate that anomalies are highly separable in the first and last few ordered dimensions of this space, and test various anomaly scoring experiments within this subspace. Results are illustrated for both simulated and real datasets, and the proposed approach (FAMDAD) is highly accurate for high-dimensional mixed data throughout these diverse scenarios.

Preprint, 2020, arXiv

High Dimensional Forecasting via Interpretable Vector Autoregression

Vector autoregression (VAR) is a fundamental tool for modeling multivariate time series. However, as the number of component series is increased, the VAR model becomes overparameterized. Several authors have addressed this issue by incorporating regularized approaches, such as the lasso in VAR estimation. Traditional approaches address overparameterization by selecting a low lag order, based on the assumption of short range dependence, assuming that a universal lag order applies to all components. Such an approach constrains the relationship between the components and impedes forecast performance. The lasso-based approaches work much better in high-dimensional situations but do not incorporate the notion of lag order selection. We propose a new class of hierarchical lag structures (HLag) that embed the notion of lag selection into a convex regularizer. The key modeling tool is a group lasso with nested groups which guarantees that the sparsity pattern of lag coefficients honors the VAR's ordered structure. The HLag framework offers three structures, which allow for varying levels of flexibility. A simulation study demonstrates improved performance in forecasting and lag order selection over previous approaches, and a macroeconomic application further highlights forecasting improvements as well as HLag's convenient, interpretable output.

Preprint, 2020, arXiv

Modeling a Nonlinear Biophysical Trend Followed by Long-Memory Equilibrium with Unknown Change Point

Measurements of many biological processes are characterized by an initial trend period followed by an equilibrium period. Scientists may wish to quantify features of the two periods, as well as the timing of the change point. Specifically, we are motivated by problems in the study of electrical cell-substrate impedance sensing (ECIS) data. ECIS is a popular new technology which measures cell behavior non-invasively. Previous studies using ECIS data have found that different cell types can be classified by their equilibrium behavior. However, it can be challenging to identify when equilibrium has been reached, and to quantify the relevant features of cells' equilibrium behavior. In this paper, we assume that measurements during the trend period are independent deviations from a smooth nonlinear function of time, and that measurements during the equilibrium period are characterized by a simple long memory model. We propose a method to simultaneously estimate the parameters of the trend and equilibrium processes and locate the change point between the two. We find that this method performs well in simulations and in practice. When applied to ECIS data, it produces estimates of change points and measures of cell equilibrium behavior which offer improved classification of infected and uninfected cells.

Preprint, 2016, arXiv

Mixed Data and Classification of Transit Stops

An analysis of the characteristics and behavior of individual bus stops can reveal clusters of similar stops, which can be of use in making routing and scheduling decisions, as well as determining what facilities to provide at each stop. This paper provides an exploratory analysis, including several possible clustering results, of a dataset provided by the Regional Transit Service of Rochester, NY. The dataset describes ridership on public buses, recording the time, location, and number of entering and exiting passengers each time a bus stops. A description of the overall behavior of bus ridership is followed by a stop-level analysis. We compare multiple measures of stop similarity, based on location, route information, and ridership volume over time.

Preprint, 2015, arXiv

Band Depth Clustering for Nonstationary Time Series and Wind Speed Behavior

We explore the behavior of wind speed over time, using the Eastern Wind Dataset published by the National Renewable Energy Laboratory. This dataset gives wind speeds over three years at hundreds of potential wind farm sites. Wind speed analysis is necessary to the integration of wind energy into the power grid; short-term variability in wind speed affects decisions about usage of other power sources, so that the shape of the wind speed curve becomes as important as the overall level. To assess differences in intra-day time series, we propose a functional distance measure, the band distance, which extends the band depth of Lopez-Pintado and Romo (2009). This measure emphasizes the shape of time series or functional observations relative to other members of a dataset, and allows clustering of observations without reliance on pointwise Euclidean distance. To emphasize short-term variability, we examine the short-time Fourier transform of the nonstationary speed time series; we can also adjust for seasonal effects, and use these standardizations as input for the band distance. We show that these approaches to characterizing the data go beyond mean-dependent standard clustering methods, such as k-means, to provide more shape-influenced cluster representatives useful for power grid decisions.

Preprint, 2015, arXiv

Change Points via Probabilistically Pruned Objectives

The concept of homogeneity plays a critical role in statistics, both in its applications and in its theory. Change point analysis is a statistical tool that aims to attain homogeneity within time series data. This is accomplished through partitioning the time series into a number of contiguous homogeneous segments. The applications of such techniques range from identifying chromosome alterations to solar flare detection. In this manuscript we present a general purpose search algorithm called cp3o that can be used to identify change points in multivariate time series. This new search procedure can be applied with a large class of goodness of fit measures. Additionally, a reduction in the computational time needed to identify change points is accomplished by means of probabilistic pruning. With mild assumptions about the goodness of fit measure this new search algorithm is shown to generate consistent estimates for both the number of change points and their locations, even when the number of change points increases with the time series length. A change point algorithm that incorporates the cp3o search algorithm and E-Statistics, e-cp3o, is also presented. The only distributional assumption that the e-cp3o procedure makes is that the absolute $\alpha$th moment exists, for some $\alpha \in (0,2)$. Due to this mild restriction, the e-cp3o procedure can be applied to a majority of change point problems. Furthermore, even with such a mild restriction, the e-cp3o procedure has the ability to detect any type of distributional change within a time series. Simulation studies are used to compare the e-cp3o procedure to other parametric and nonparametric change point procedures, and we highlight applications of e-cp3o to climate and financial datasets.

Preprint, 2015, arXiv

Predicting Ambulance Demand: a Spatio-Temporal Kernel Approach

Predicting ambulance demand accurately at fine time and location scales is critical for ambulance fleet management and dynamic deployment. Large-scale datasets in this setting typically exhibit complex spatio-temporal dynamics and sparsity at high resolutions. We propose a predictive method using spatio-temporal kernel density estimation (stKDE) to address these challenges, and provide spatial density predictions for ambulance demand in Toronto, Canada as it varies over hourly intervals. Specifically, we weight the spatial kernel of each historical observation by its informativeness to the current predictive task. We construct spatio-temporal weight functions to incorporate various temporal and spatial patterns in ambulance demand, including location-specific seasonalities and short-term serial dependence. This allows us to draw out the most helpful historical data, and exploit spatio-temporal patterns in the data for accurate and fast predictions. We further provide efficient estimation and customizable prediction procedures. stKDE is easy for non-specialized personnel from the emergency medical service industry to use and interpret. It also has significantly higher statistical accuracy than the current industry practice, with a comparable amount of computational expense.
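The central step in this abstract, weighting each historical observation's spatial kernel by how informative it is for the current prediction, can be sketched as a weighted kernel density estimate. This is an illustration only: the weight function below (exponential decay in the time lag, with a boost at whole-week lags) is a hypothetical stand-in for the paper's spatio-temporal weight functions, and `decay`, `period`, `seasonal_boost` and `bandwidth` are assumed parameters rather than values from the source.

```python
import math


def gaussian_kernel_2d(dx, dy, bandwidth):
    """Isotropic 2-D Gaussian kernel evaluated at offset (dx, dy)."""
    h2 = bandwidth ** 2
    return math.exp(-(dx * dx + dy * dy) / (2 * h2)) / (2 * math.pi * h2)


def temporal_weight(lag_hours, decay=0.01, period=168, seasonal_boost=1.0):
    """Hypothetical informativeness weight for an event lag_hours in the past.

    Recent events count more (exponential decay), and events exactly a whole
    number of weeks ago (period = 168 hours) get an extra seasonal boost.
    """
    weight = math.exp(-decay * lag_hours)
    if lag_hours % period == 0:
        weight *= 1.0 + seasonal_boost
    return weight


def weighted_kde(query, events, now, bandwidth=1.0):
    """Predicted spatial density at `query` for hour `now`.

    `events` is a list of (x, y, hour) tuples for past calls; the weights are
    normalized so that the result remains a probability density over space.
    """
    weights = [temporal_weight(now - t) for (_, _, t) in events]
    total = sum(weights)
    return sum(
        w * gaussian_kernel_2d(query[0] - x, query[1] - y, bandwidth)
        for w, (x, y, _) in zip(weights, events)
    ) / total
```

With one recent event and one older event at a different location, the predicted density is higher near the recent one, which is the behavior the weighting is meant to produce.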

Preprint, 2015, arXiv

Predicting Melbourne Ambulance Demand using Kernel Warping

Predicting ambulance demand accurately at fine resolutions in space and time is critical for ambulance fleet management and dynamic deployment. Typical challenges include data sparsity at high resolutions and the need to respect complex urban spatial domains. To provide spatial density predictions for ambulance demand in Melbourne, Australia as it varies over hourly intervals, we propose a predictive spatio-temporal kernel warping method. To predict for each hour, we build a kernel density estimator on a sparse set of the most similar data from relevant past time periods (labeled data), but warp these kernels to a larger set of past data regardless of time periods (point cloud). The point cloud represents the spatial structure and geographical characteristics of Melbourne, including complex boundaries, road networks, and neighborhoods. Borrowing from manifold learning, kernel warping is performed through a graph Laplacian of the point cloud and can be interpreted as a regularization towards, and a prior imposed for, spatial features. Kernel bandwidth and degree of warping are efficiently estimated via cross-validation, and can be made time- and/or location-specific. Our proposed model gives significantly more accurate predictions compared to a current industry practice, an unwarped kernel density estimation, and a time-varying Gaussian mixture model.

Preprint, 2014, arXiv

A Spatio-Temporal Point Process Model for Ambulance Demand

Ambulance demand estimation at fine time and location scales is critical for fleet management and dynamic deployment. We are motivated by the problem of estimating the spatial distribution of ambulance demand in Toronto, Canada, as it changes over discrete 2-hour intervals. This large-scale dataset is sparse at the desired temporal resolutions and exhibits location-specific serial dependence, daily and weekly seasonality. We address these challenges by introducing a novel characterization of time-varying Gaussian mixture models. We fix the mixture component distributions across all time periods to overcome data sparsity and accurately describe Toronto's spatial structure, while representing the complex spatio-temporal dynamics through time-varying mixture weights. We constrain the mixture weights to capture weekly seasonality, and apply a conditionally autoregressive prior on the mixture weights of each component to represent location-specific short-term serial dependence and daily seasonality. While estimation may be performed using a fixed number of mixture components, we also extend to estimate the number of components using birth-and-death Markov chain Monte Carlo. The proposed model is shown to give higher statistical predictive accuracy and to reduce the error in predicting EMS operational performance by as much as two-thirds compared to a typical industry practice.

Preprint, 2014, arXiv

Leveraging Cloud Data to Mitigate User Experience from "Breaking Bad"

Low latency and high availability of an app or a web service are key, amongst other factors, to the overall user experience (which in turn directly impacts the bottom line). Exogenic and/or endogenic factors often give rise to breakouts in cloud data, which make maintaining high availability and delivering high performance very challenging. Although there exists a large body of prior research in breakout detection, existing techniques are not suitable for detecting breakouts in cloud data because they are not robust in the presence of anomalies. To this end, we developed a novel statistical technique to automatically detect breakouts in cloud data. In particular, the technique employs Energy Statistics to detect breakouts in both application and system metrics. Further, the technique uses robust statistical metrics, viz., median, and estimates the statistical significance of a breakout through a permutation test. To the best of our knowledge, this is the first work which addresses breakout detection in the presence of anomalies. We demonstrate the efficacy of the proposed technique using production data and report Precision, Recall and F-measure. The proposed technique is 3.5 times faster than a state-of-the-art technique for breakout detection and is currently used on a daily basis at Twitter.

Preprint, 2013, arXiv

A Nonparametric Approach for Multiple Change Point Analysis of Multivariate Data

Change point analysis has applications in a wide variety of fields. The general problem concerns the inference of a change in distribution for a set of time-ordered observations. Sequential detection is an online version in which new data is continually arriving and is analyzed adaptively. We are concerned with the related, but distinct, offline version, in which retrospective analysis of an entire sequence is performed. For a set of multivariate observations of arbitrary dimension, we consider nonparametric estimation of both the number of change points and the positions at which they occur. We do not make any assumptions regarding the nature of the change in distribution or any distribution assumptions beyond the existence of the alpha-th absolute moment, for some alpha in (0,2). Estimation is based on hierarchical clustering and we propose both divisive and agglomerative algorithms. The divisive method is shown to provide consistent estimates of both the number and location of change points under standard regularity assumptions. We compare the proposed approach with competing methods in a simulation study. Methods from cluster analysis are applied to assess performance and to allow simple comparisons of location estimates, even when the estimated number differs. We conclude with applications in genetics, finance and spatio-temporal analysis.
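One bisection step of a divisive, energy-statistic-based search like the one described in this abstract can be sketched in a few lines. The sketch makes several simplifying assumptions that are not from the source: univariate data, alpha = 1 by default, an exhaustive scan over split points, and no significance test or recursion. The function names are hypothetical and this is not the ecp package's implementation. The statistic compares between-segment distances with within-segment distances and is scaled by n*m/(n+m) so that splits of different sizes are comparable.

```python
from itertools import combinations


def mean_within(xs, alpha=1.0):
    """Average of |a - b|**alpha over all unordered pairs inside one segment."""
    pairs = list(combinations(xs, 2))
    return sum(abs(a - b) ** alpha for a, b in pairs) / len(pairs)


def mean_between(xs, ys, alpha=1.0):
    """Average of |a - b|**alpha over all pairs with one point in each segment."""
    return sum(abs(a - b) ** alpha for a in xs for b in ys) / (len(xs) * len(ys))


def scaled_energy(xs, ys, alpha=1.0):
    """Sample energy divergence between two segments, scaled by n*m/(n+m)."""
    n, m = len(xs), len(ys)
    divergence = (
        2 * mean_between(xs, ys, alpha)
        - mean_within(xs, alpha)
        - mean_within(ys, alpha)
    )
    return n * m / (n + m) * divergence


def best_split(series, min_size=2, alpha=1.0):
    """Return (tau, statistic) for the split series[:tau] | series[tau:]
    that maximizes the scaled energy statistic. min_size must be at least 2
    so that every segment contains at least one within-segment pair."""
    best_tau, best_stat = None, float("-inf")
    for tau in range(min_size, len(series) - min_size + 1):
        stat = scaled_energy(series[:tau], series[tau:], alpha)
        if stat > best_stat:
            best_tau, best_stat = tau, stat
    return best_tau, best_stat
```

A full divisive procedure would repeat this search within each resulting segment and stop when a split is no longer judged significant; those steps are left out here.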

Preprint, 2013, arXiv

ecp: An R Package for Nonparametric Multiple Change Point Analysis of Multivariate Data

There are many different ways in which change point analysis can be performed, from purely parametric methods to those that are distribution free. The ecp package is designed to perform multiple change point analysis while making as few assumptions as possible. While many other change point methods are applicable only for univariate data, this R package is suitable for both univariate and multivariate observations. Estimation can be based upon either a hierarchical divisive or agglomerative algorithm. Divisive estimation sequentially identifies change points via a bisection algorithm. The agglomerative algorithm estimates change point locations by determining an optimal segmentation. Both approaches are able to detect any type of distributional change within the data. This provides an advantage over many existing change point algorithms which are only able to detect changes within the marginal distributions.

Preprint, 2013, arXiv

Independent Component Analysis via Distance Covariance

This paper introduces a novel statistical framework for independent component analysis (ICA) of multivariate data. We propose methodology for estimating and testing the existence of mutually independent components for a given dataset, and a versatile resampling-based procedure for inference. Independent components are estimated by combining a nonparametric probability integral transformation with a generalized nonparametric whitening method that simultaneously minimizes all forms of dependence among the components. U-statistics of certain Euclidean distances between sample elements are combined in succession to construct a statistic for testing the existence of mutually independent components. The proposed measures and tests are based on both necessary and sufficient conditions for mutual independence. When independent components exist, one may apply univariate analysis to study or model each component separately. Univariate models may then be combined to obtain a multivariate model for the original observations. We prove the consistency of our estimator under minimal regularity conditions without assuming the existence of independent components a priori, and all assumptions are placed on the observations directly, not on the latent components. We demonstrate the improvements of the proposed method over competing methods in simulation studies. We apply the proposed ICA approach to two real examples and contrast it with principal component analysis.

Preprint, 2011, arXiv

Forecasting emergency medical service call arrival rates

We introduce a new method for forecasting emergency call arrival rates that combines integer-valued time series models with a dynamic latent factor structure. Covariate information is captured via simple constraints on the factor loadings. We directly model the count-valued arrivals per hour, rather than using an artificial assumption of normality. This is crucial for the emergency medical service context, in which the volume of calls may be very low. Smoothing splines are used in estimating the factor levels and loadings to improve long-term forecasts. We impose time series structure at the hourly level, rather than at the daily level, capturing the fine-scale dependence in addition to the long-term structure. Our analysis considers all emergency priority calls received by Toronto EMS between January 2007 and December 2008 for which an ambulance was dispatched. Empirical results demonstrate significantly reduced error in forecasting call arrival volume. To quantify the impact of reduced forecast errors, we design a queueing model simulation that approximates the dynamics of an ambulance system. The results show better performance as the forecasting method improves. This notion of quantifying the operational impact of improved statistical procedures may be of independent interest.
