Source author record

David Ruppert

David Ruppert appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher, unclaimed source record

Methodology, Applications, astro-ph.HE, astro-ph.IM, Computation, math.ST, Statistics Theory, astro-ph.GA

Catalog footprint

What is connected

11 works

8 topics

4 close collaborators


preprint, 2020, arXiv

Density Estimation on a Network

This paper develops a novel approach to density estimation on a network. We formulate nonparametric density estimation on a network as a nonparametric regression problem by binning. Nonparametric regression using local polynomial kernel-weighted least squares has been studied rigorously, and its asymptotic properties make it superior to kernel estimators such as the Nadaraya-Watson estimator. When applied to a network, the best estimator near a vertex depends on the amount of smoothness at the vertex. Often, there are no compelling reasons to assume that a density will be continuous or discontinuous at a vertex, hence a data-driven approach is proposed. To estimate the density in a neighborhood of a vertex, we propose a two-step procedure. The first step of this pretest estimator fits a separate local polynomial regression on each edge using data only on that edge, and then tests for equality of the estimates at the vertex. If the null hypothesis is not rejected, then the second step re-estimates the regression function in a small neighborhood of the vertex, subject to a joint equality constraint. Since the derivative of the density may be discontinuous at the vertex, we propose a piecewise polynomial local regression estimate to model the change in slope. We study in detail the special case of local piecewise linear regression and derive the leading bias and variance terms using weighted least squares theory. We show that the proposed approach will remove the bias near a vertex that has been noted for existing methods, which typically do not allow for discontinuity at vertices. For a fixed network, the proposed method scales sub-linearly with sample size and it can be extended to regression and varying coefficient models on a network. We demonstrate the workings of the proposed model by simulation studies and apply it to a dendrite network data set.
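The binning idea in this abstract can be illustrated on a single edge, treated here as an interval. The following is a minimal sketch only, not the authors' estimator: it bins the observations, rescales the counts to density heights, and fits a local linear kernel-weighted least squares regression to the bin centers. The function names `bin_counts` and `local_linear`, the Epanechnikov kernel, and the restriction to one edge are assumptions made for illustration; the vertex pretest and the joint equality constraint are not reproduced.

```python
def bin_counts(samples, lo, hi, n_bins):
    """Bin samples on [lo, hi]; return bin centers and density-scaled heights."""
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for x in samples:
        if lo <= x <= hi:
            # Clamp so that x == hi falls in the last bin.
            idx = min(int((x - lo) / width), n_bins - 1)
            counts[idx] += 1
    n = len(samples)
    centers = [lo + (i + 0.5) * width for i in range(n_bins)]
    # Dividing by n * width makes the heights integrate to (about) one.
    heights = [c / (n * width) for c in counts]
    return centers, heights


def local_linear(x0, xs, ys, bandwidth):
    """Local linear kernel-weighted least squares fit evaluated at x0."""
    s0 = s1 = s2 = t0 = t1 = 0.0
    for x, y in zip(xs, ys):
        u = (x - x0) / bandwidth
        if abs(u) >= 1.0:
            continue  # outside the kernel support
        w = 0.75 * (1.0 - u * u)  # Epanechnikov kernel (an assumed choice)
        d = x - x0
        s0 += w
        s1 += w * d
        s2 += w * d * d
        t0 += w * y
        t1 += w * d * y
    det = s0 * s2 - s1 * s1
    if det <= 1e-12:
        raise ValueError("too few bins inside the bandwidth window")
    # Intercept of the weighted least squares line, i.e. the fit at x0.
    return (s2 * t0 - s1 * t1) / det
```

Calling `local_linear(x0, centers, heights, h)` with the output of `bin_counts` gives a density estimate at `x0`. A local linear fit reproduces straight lines exactly, including at the end of the interval, which is one reason local polynomial fits are preferred over Nadaraya-Watson near boundaries.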

preprint, 2020, arXiv

Optimal Sampling for Generalized Linear Models under Measurement Constraints

Under "measurement constraints," responses are expensive to measure and initially unavailable on most records in the dataset, but the covariates are available for the entire dataset. Our goal is to sample a relatively small portion of the dataset where the expensive responses will be measured and the resultant sampling estimator is statistically efficient. Measurement constraints require that the sampling probabilities depend only on a very small set of the responses. A sampling procedure that uses responses at most only on a small pilot sample will be called "response-free." We propose a response-free sampling procedure (OSUMC) for generalized linear models (GLMs). Using the A-optimality criterion, i.e., the trace of the asymptotic variance, the resultant estimator is statistically efficient within a class of sampling estimators. We establish the unconditional asymptotic distribution of a general class of response-free sampling estimators. This result is novel compared with the existing conditional results obtained by conditioning on both covariates and responses. Under our unconditional framework, the subsamples are no longer independent and new martingale techniques are developed for our asymptotic theory. We further derive the A-optimal response-free sampling distribution. Since this distribution depends on population level quantities, we propose the Optimal Sampling Under Measurement Constraints (OSUMC) algorithm to approximate the theoretical optimal sampling. Finally, we conduct an intensive empirical study to demonstrate the advantages of the OSUMC algorithm over existing methods from both statistical and computational perspectives.

preprint, 2016, arXiv

Additive Function-on-Function Regression

We study additive function-on-function regression where the mean response at a particular time point depends on the time point itself as well as the entire covariate trajectory. We develop a computationally efficient estimation methodology based on a novel combination of spline bases with an eigenbasis to represent the trivariate kernel function. We discuss prediction of a new response trajectory, propose an inference procedure that accounts for total variability in the predicted response curves, and construct pointwise prediction intervals. The estimation/inferential procedure accommodates realistic scenarios such as correlated error structure as well as sparse and/or irregular designs. We investigate our methodology in finite sample size through simulations and two real data applications.
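One way to write a model of the kind this abstract describes is sketched below. The notation is assumed here for illustration and is not taken from the paper: $Y(t)$ is the response curve, $X(s)$ the covariate trajectory, $\mu_Y(t)$ an intercept function, and $F$ the unknown trivariate function.

```latex
\[
  E\{Y(t) \mid X\} \;=\; \mu_Y(t) + \int F\{X(s),\, s,\, t\}\, ds .
\]
% Assumed special case: F(x, s, t) = x \beta(s, t) gives the
% linear function-on-function regression model.
```

Under this reading, the mean at time $t$ depends on $t$ itself and on the whole trajectory $X(\cdot)$, with the linear function-on-function model as a special case.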

preprint, 2016, arXiv

Simultaneously modelling far-infrared dust emission and its relation to CO emission in star forming galaxies

We present a method to simultaneously model the dust far-infrared spectral energy distribution (SED) and the total infrared $-$ carbon monoxide (CO) integrated intensity $(S_{\rm IR}-I_{\rm CO})$ relationship. The modelling employs a hierarchical Bayesian (HB) technique to estimate the dust surface density, temperature ($T_{\rm eff}$), and spectral index at each pixel from the observed far-infrared (FIR) maps. Additionally, given the corresponding CO map, the method simultaneously estimates the slope and intercept between the FIR and CO intensities, which are global properties of the observed source. The model accounts for correlated and uncorrelated uncertainties, such as those present in Herschel observations. Using synthetic datasets, we demonstrate the accuracy of the HB method, and contrast the results with common non-hierarchical fitting methods. As an initial application, we model the dust and gas on 100 pc scales in the Magellanic Clouds from Herschel FIR and NANTEN CO observations. The slopes of the $\log S_{\rm IR}-\log I_{\rm CO}$ relationship are similar in both galaxies, falling in the range 1.1$-$1.7. However, in the SMC the intercept is nearly 3 times higher, which can be explained by its lower metallicity than the LMC, resulting in a larger $S_{\rm IR}$ per unit $I_{\rm CO}$. The HB modelling evidences an increase in $T_{\rm eff}$ in regions with the highest $I_{\rm CO}$ in the LMC. This may be due to enhanced dust heating in the densest molecular regions from young stars. Such simultaneous dust and gas modelling may reveal variations in the properties of the ISM and its association with other galactic characteristics, such as star formation rates and/or metallicities.

preprint, 2014, arXiv

Fast Covariance Estimation for High-dimensional Functional Data

For smoothing covariance functions, we propose two fast algorithms that scale linearly with the number of observations per function. Most available methods and software cannot smooth covariance matrices of dimension $J \times J$ with $J>500$; the recently introduced sandwich smoother is an exception, but it is not adapted to smooth covariance matrices of large dimensions such as $J \ge 10,000$. Covariance matrices of order $J=10,000$, and even $J=100,000$, are becoming increasingly common, e.g., in 2- and 3-dimensional medical imaging and high-density wearable sensor data. We introduce two new algorithms that can handle very large covariance matrices: 1) FACE: a fast implementation of the sandwich smoother and 2) SVDS: a two-step procedure that first applies singular value decomposition to the data matrix and then smooths the eigenvectors. Compared to existing techniques, these new algorithms are at least an order of magnitude faster in high dimensions and drastically reduce memory requirements. The new algorithms provide instantaneous (few seconds) smoothing for matrices of dimension $J=10,000$ and very fast ($<$ 10 minutes) smoothing for $J=100,000$. Although SVDS is simpler than FACE, we provide ready-to-use, scalable R software for FACE. When incorporated into the R package refund, FACE improves the speed of penalized functional regression by an order of magnitude, even for data of normal size ($J <500$). We recommend that FACE be used in practice for the analysis of noisy and high-dimensional functional data.

preprint, 2014, arXiv

RAPTT: An Exact Two-Sample Test in High Dimensions Using Random Projections

In high dimensions, the classical Hotelling's $T^2$ test tends to have low power or becomes undefined due to singularity of the sample covariance matrix. In this paper, this problem is overcome by projecting the data matrix onto lower dimensional subspaces through multiplication by random matrices. We propose RAPTT (RAndom Projection T-Test), an exact test for equality of means of two normal populations based on projected lower dimensional data. RAPTT does not require any constraints on the dimension of the data or the sample size. A simulation study indicates that in high dimensions the power of this test is often greater than that of competing tests. The advantage of RAPTT is illustrated on high-dimensional gene expression data involving the discrimination of tumor and normal colon tissues.
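The projection step described in this abstract can be sketched as follows. This is a hedged illustration, not the RAPTT procedure itself: it draws one Gaussian random matrix, projects both samples to `k` dimensions, and returns the F-scaled Hotelling $T^2$ statistic of the projected data. The function names, the Gaussian choice of projection matrix, and the use of a single projection are assumptions; how the paper combines several projections is not reproduced here.

```python
import random


def project(rows, proj):
    """Multiply each p-dimensional row by the p x k projection matrix."""
    k = len(proj[0])
    return [[sum(x * proj[j][c] for j, x in enumerate(row)) for c in range(k)]
            for row in rows]


def solve(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(a[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        z[i] = (m[i][n] - sum(m[i][c] * z[c] for c in range(i + 1, n))) / m[i][i]
    return z


def projected_hotelling_f(x, y, k, rng):
    """F-scaled Hotelling T^2 statistic after one random k-dim projection."""
    p = len(x[0])
    # Random projection matrix, drawn independently of the data.
    proj = [[rng.gauss(0.0, 1.0) for _ in range(k)] for _ in range(p)]
    xs, ys = project(x, proj), project(y, proj)
    n1, n2 = len(xs), len(ys)
    mx = [sum(r[c] for r in xs) / n1 for c in range(k)]
    my = [sum(r[c] for r in ys) / n2 for c in range(k)]
    # Pooled covariance of the projected samples (k x k).
    s = [[0.0] * k for _ in range(k)]
    for rows, mean in ((xs, mx), (ys, my)):
        for r in rows:
            for a in range(k):
                for b in range(k):
                    s[a][b] += (r[a] - mean[a]) * (r[b] - mean[b])
    dof = n1 + n2 - 2
    s = [[v / dof for v in row] for row in s]
    d = [mx[c] - my[c] for c in range(k)]
    z = solve(s, d)
    t2 = (n1 * n2 / (n1 + n2)) * sum(di * zi for di, zi in zip(d, z))
    # Scaled so the classical reference distribution is F(k, n1 + n2 - k - 1).
    return (dof - k + 1) / (k * dof) * t2
```

Because `k` can be chosen smaller than `n1 + n2 - 2`, the projected pooled covariance is invertible even when the original dimension exceeds the sample size. Under normality with a common covariance, the returned value can be compared with an F distribution with `k` and `n1 + n2 - k - 1` degrees of freedom, for example via `scipy.stats.f.sf` if SciPy is available.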

preprint, 2013, arXiv

Multilevel Bayesian framework for modeling the production, propagation and detection of ultra-high energy cosmic rays

Ultra-high energy cosmic rays (UHECRs) are atomic nuclei with energies over ten million times energies accessible to human-made particle accelerators. Evidence suggests that they originate from relatively nearby extragalactic sources, but the nature of the sources is unknown. We develop a multilevel Bayesian framework for assessing association of UHECRs and candidate source populations, and Markov chain Monte Carlo algorithms for estimating model parameters and comparing models by computing, via Chib's method, marginal likelihoods and Bayes factors. We demonstrate the framework by analyzing measurements of 69 UHECRs observed by the Pierre Auger Observatory (PAO) from 2004-2009, using a volume-complete catalog of 17 local active galactic nuclei (AGN) out to 15 megaparsecs as candidate sources. An early portion of the data ("period 1," with 14 events) was used by PAO to set an energy cut maximizing the anisotropy in period 1; the 69 measurements include this "tuned" subset, and subsequent "untuned" events with energies above the same cutoff. Also, measurement errors are approximately summarized. These factors are problematic for independent analyses of PAO data. Within the context of "standard candle" source models (i.e., with a common isotropic emission rate), and considering only the 55 untuned events, there is no significant evidence favoring association of UHECRs with local AGN vs. an isotropic background. The highest-probability associations are with the two nearest, adjacent AGN, Centaurus A and NGC 4945. If the association model is adopted, the fraction of UHECRs that may be associated is likely nonzero but is well below 50%. Our framework enables estimation of the angular scale for deflection of cosmic rays by cosmic magnetic fields; relatively modest scales of $\approx\!3^{\circ}$ to $30^{\circ}$ are favored. 
Models that assign a large fraction of UHECRs to a single nearby source (e.g., Centaurus A) are ruled out unless very large deflection scales are specified a priori, and even then they are disfavored. However, including the period 1 data alters the conclusions significantly, and a simulation study supports the idea that the period 1 data are anomalous, presumably due to the tuning. Accurate and optimal analysis of future data will likely require more complete disclosure of the data.

preprint, 2013, arXiv

Optimal Prediction in an Additive Functional Model

The functional generalized additive model (FGAM) provides a more flexible nonlinear functional regression model than the well-studied functional linear regression model. This paper restricts attention to the FGAM with identity link and additive errors, which we will call the additive functional model, a generalization of the functional linear model. This paper studies the minimax rate of convergence of predictions from the additive functional model in the framework of reproducing kernel Hilbert space. It is shown that the optimal rate is determined by the decay rate of the eigenvalues of a specific kernel function, which in turn is determined by the reproducing kernel and the joint distribution of any two points in the random predictor function. For the special case of the functional linear model, this kernel function is jointly determined by the covariance function of the predictor function and the reproducing kernel. The easily implementable roughness-regularized predictor is shown to achieve the optimal rate of convergence. Numerical studies are carried out to illustrate the merits of the predictor. Our simulations and real data examples demonstrate a competitive performance against the existing approach.

preprint, 2013, arXiv

Restricted Likelihood Ratio Tests for Linearity in Scalar-on-Function Regression

We propose a procedure for testing the linearity of a scalar-on-function regression relationship. To do so, we use the functional generalized additive model (FGAM), a recently developed extension of the functional linear model. For a functional covariate X(t), the FGAM models the mean response as the integral with respect to t of F{X(t),t} where F is an unknown bivariate function. The FGAM can be viewed as the natural functional extension of generalized additive models. We show how the functional linear model can be represented as a simple mixed model nested within the FGAM. Using this representation, we then consider restricted likelihood ratio tests for zero variance components in mixed models to test the null hypothesis that the functional linear model holds. The methods are general and can also be applied to testing for interactions in a multivariate additive model or for testing for no effect in the functional linear model. The performance of the proposed tests is assessed on simulated data and in an application to measuring diesel truck emissions, where strong evidence of nonlinearities in the relationship between the functional predictor and the response is found.
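The model and the null hypothesis described in this abstract can be written out as below. The link function $g$, the intercept $\theta_0$, the domain $\mathcal{T}$, and the coefficient function $\beta(t)$ are notation assumed here for illustration; the abstract itself only states that the mean response is modeled through the integral of $F\{X(t),t\}$.

```latex
\[
  g\bigl[E(Y \mid X)\bigr] \;=\; \theta_0 + \int_{\mathcal{T}} F\{X(t),\, t\}\, dt ,
\]
\[
  H_0:\; F(x, t) = x\,\beta(t)
  \quad\Longrightarrow\quad
  g\bigl[E(Y \mid X)\bigr] \;=\; \theta_0 + \int_{\mathcal{T}} X(t)\,\beta(t)\, dt .
\]
```

Under $H_0$ the FGAM reduces to the functional linear model, which is the sense in which the linear model is nested within it.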

preprint, 2012, arXiv

Fast Bivariate Penalized Splines: the Sandwich Smoother

We propose a fast penalized spline method for bivariate smoothing. Univariate P-spline smoothers (Eilers and Marx, 1996) are applied simultaneously along both coordinates. The new smoother has a sandwich form which suggested the name "sandwich smoother" to a referee. The sandwich smoother has a tensor product structure that simplifies an asymptotic analysis and it can be computed quickly. We derive a local central limit theorem for the sandwich smoother, with simple expressions for the asymptotic bias and variance, by showing that the sandwich smoother is asymptotically equivalent to a bivariate kernel regression estimator with a product kernel. As far as we are aware, this is the first central limit theorem for a bivariate spline estimator of any type. Our simulation study shows that the sandwich smoother is orders of magnitude faster to compute than other bivariate spline smoothers, even when the latter are computed using a fast GLAM (Generalized Linear Array Model) algorithm, and comparable to them in terms of mean squared integrated errors. We extend the sandwich smoother to array data of higher dimensions, where a GLAM algorithm improves the computational speed of the sandwich smoother. One important application of the sandwich smoother is to estimate covariance functions in functional data analysis. In this application, our numerical results show that the sandwich smoother is orders of magnitude faster than local linear regression. The speed of the sandwich formula is important because functional data sets are becoming quite large.
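The "sandwich form" mentioned in this abstract, in which a data matrix is smoothed by one univariate smoother on the left and another on the right, can be sketched as follows. This is an illustration under stated assumptions: the smoother matrices here come from a row-normalized Gaussian kernel on an equally spaced grid, used as a simple stand-in for the P-spline smoother matrices of the paper, and all function names are hypothetical.

```python
import math


def kernel_smoother_matrix(n, bandwidth):
    """Row-normalized Gaussian kernel smoother on an equally spaced grid.

    Stand-in for a univariate P-spline smoother matrix (an assumption).
    """
    rows = []
    for i in range(n):
        w = [math.exp(-0.5 * ((i - j) / bandwidth) ** 2) for j in range(n)]
        total = sum(w)
        rows.append([v / total for v in w])
    return rows


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    """Transpose a list-of-rows matrix."""
    return [list(col) for col in zip(*a)]


def sandwich_smooth(y, bw_rows, bw_cols):
    """Smooth a data matrix y as S1 * y * S2^T."""
    s1 = kernel_smoother_matrix(len(y), bw_rows)      # smooths down columns
    s2 = kernel_smoother_matrix(len(y[0]), bw_cols)   # smooths along rows
    return matmul(matmul(s1, y), transpose(s2))
```

The point of the sandwich structure is that two small univariate smoother matrices are applied to the data matrix, rather than one large smoother acting on the vectorized data, which is consistent with the speed advantage the abstract reports.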

preprint, 2012, arXiv

Guilt by Association: Finding Cosmic Ray Sources Using Hierarchical Bayesian Clustering

The Earth is continuously showered by charged cosmic ray particles, naturally produced atomic nuclei moving with velocity close to the speed of light. Among these are ultra-high-energy cosmic ray particles with energy exceeding $5 \times 10^{19}$ eV, which is ten million times more energetic than the most energetic particles produced at the Large Hadron Collider. Astrophysical questions include: what phenomenon accelerates particles to such high energies, and what sort of nuclei are energized? Also, the magnetic deflection of the trajectories of the cosmic rays makes them potential probes of galactic and intergalactic magnetic fields. We develop a Bayesian hierarchical model that can be used to compare different association models between the cosmic rays and source population, using Bayes factors. A measurement model with directional uncertainties and accounting for non-uniform sky exposure is incorporated into the model. The methodology allows us to learn about astrophysical parameters, such as those governing the source luminosity function and the cosmic magnetic field.
