Researcher profile

Peter D. Hoff

Peter D. Hoff contributes to research discovery and scholarly infrastructure.

Published work

21 published items

preprint, 2022, arXiv

Optimal Conformal Prediction for Small Areas

Existing inferential methods for small area data involve a trade-off between maintaining area-level frequentist coverage rates and improving inferential precision via the incorporation of indirect information. In this article, we propose a method to obtain an area-level prediction region for a future observation which mitigates this trade-off. The proposed method takes a conformal prediction approach in which the conformity measure is the posterior predictive density of a working model that incorporates indirect information. The resulting prediction region has guaranteed frequentist coverage regardless of the working model, and, if the working model assumptions are accurate, the region has minimum expected volume compared to other regions with the same coverage rate. When constructed under a normal working model, we prove such a prediction region is an interval and construct an efficient algorithm to obtain the exact interval. We illustrate the performance of our method through simulation studies and an application to EPA radon survey data.
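The conformal construction described in this abstract can be sketched in a few lines. The following is a minimal illustration only, not the authors' exact algorithm: it assumes a normal working model with known sampling variance `sigma2` and a normal prior with mean `mu` and variance `tau2` (all hypothetical parameters), and it searches a grid of candidate values rather than computing the exact interval. The function names are made up for this sketch.

```python
import numpy as np

def predictive_density(y_i, others, mu, tau2, sigma2):
    """Posterior predictive density of y_i given the other observations,
    under the assumed working model y ~ N(theta, sigma2), theta ~ N(mu, tau2)."""
    n = len(others)
    post_var = 1.0 / (1.0 / tau2 + n / sigma2)
    post_mean = post_var * (mu / tau2 + others.sum() / sigma2)
    pred_var = post_var + sigma2
    return np.exp(-0.5 * (y_i - post_mean) ** 2 / pred_var) / np.sqrt(2 * np.pi * pred_var)

def conformal_region(y, grid, alpha=0.1, mu=0.0, tau2=1.0, sigma2=1.0):
    """Return the grid points kept by full conformal prediction at level 1 - alpha."""
    accepted = []
    for cand in grid:
        aug = np.append(y, cand)  # observed data plus the candidate future value
        # Conformity score of each point: its predictive density given the rest.
        scores = np.array([
            predictive_density(aug[i], np.delete(aug, i), mu, tau2, sigma2)
            for i in range(len(aug))
        ])
        # Conformal p-value: share of points conforming no better than the candidate.
        p_value = np.mean(scores <= scores[-1])
        if p_value > alpha:
            accepted.append(cand)
    return np.array(accepted)
```

Because the score is symmetric in the remaining observations, the coverage guarantee comes from exchangeability and does not rely on the working model being correct; the working model only affects how tight the region is.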

preprint, 2020, arXiv

Smaller $p$-values in genomics studies using distilled historical information

Medical research institutions have generated massive amounts of biological data by genetically profiling hundreds of cancer cell lines. In parallel, academic biology labs have conducted genetic screens on small numbers of cancer cell lines under custom experimental conditions. In order to share information between these two approaches to scientific discovery, this article proposes a "frequentist assisted by Bayes" (FAB) procedure for hypothesis testing that allows historical information from massive genomics datasets to increase the power of hypothesis tests in specialized studies. The exchange of information takes place through a novel probability model for multimodal genomics data, which distills historical information pertaining to cancer cell lines and genes across a wide variety of experimental contexts. If the relevance of the historical information for a given study is high, then the resulting FAB tests can be more powerful than the corresponding classical tests. If the relevance is low, then the FAB tests yield as many discoveries as the classical tests. Simulations and practical investigations demonstrate that the FAB testing procedure can increase the number of effects discovered in genomics studies while still maintaining strict control of type I error and false discovery rates.

preprint, 2016, arXiv

Adaptive multigroup confidence intervals with constant coverage

Confidence intervals for the means of multiple normal populations are often based on a hierarchical normal model. While commonly used interval procedures based on such a model have the nominal coverage rate on average across a population of groups, their actual coverage rate for a given group will be above or below the nominal rate, depending on the value of the group mean. Alternatively, a coverage rate that is constant as a function of a group's mean can be simply achieved by using a standard $t$-interval, based on data only from that group. The standard $t$-interval, however, fails to share information across the groups and is therefore not adaptive to easily obtained information about the distribution of group-specific means. In this article we construct confidence intervals that have a constant frequentist coverage rate and that make use of information about across-group heterogeneity, resulting in constant-coverage intervals that are narrower than standard $t$-intervals on average across groups. Such intervals are constructed by inverting biased tests for the mean of a normal population. Given a prior distribution on the mean, Bayes-optimal biased tests can be inverted to form Bayes-optimal confidence intervals with frequentist coverage that is constant as a function of the mean. In the context of multiple groups, the prior distribution is replaced by a model of across-group heterogeneity. The parameters for this model can be estimated using data from all of the groups, and used to obtain confidence intervals with constant group-specific coverage that adapt to information about the distribution of group means.

preprint, 2015, arXiv

A Pivot-Based Improvement to Sandwich-Based Confidence Intervals

The current standard for confidence interval construction in the context of a possibly misspecified model is to use an interval based on the sandwich estimate of variance. These intervals provide asymptotically correct coverage, but small-sample coverage is known to be poor. By eliminating a plug-in assumption, we derive a pivot-based method for confidence interval construction under possibly misspecified models. When compared against confidence intervals generated by the sandwich estimate of variance, this method provides more accurate coverage of the pseudo-true parameter at small sample sizes. This is shown in the results of several simulation studies. Asymptotic results show that our pivot-based intervals have large sample efficiency equal to that of intervals based on the sandwich estimate of variance.
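For context, the baseline that this abstract compares against can be sketched as follows. This is not the pivot-based method of the paper; it is a plain plug-in sandwich interval for least-squares coefficients (the HC0 form), with a normal quantile `z` chosen here as an assumption. The function name is hypothetical.

```python
import numpy as np

def sandwich_ci(X, y, z=1.96):
    """Least-squares estimates with plug-in sandwich (HC0) confidence intervals."""
    XtX_inv = np.linalg.inv(X.T @ X)           # "bread"
    beta = XtX_inv @ X.T @ y                   # OLS coefficients
    resid = y - X @ beta
    meat = X.T @ (X * resid[:, None] ** 2)     # sum of e_i^2 * x_i x_i^T
    cov = XtX_inv @ meat @ XtX_inv             # sandwich variance estimate
    se = np.sqrt(np.diag(cov))
    return beta, np.column_stack([beta - z * se, beta + z * se])
```

The squared residuals are plugged in as if they were the true error variances, which is the plug-in step the abstract refers to and the usual explanation for poor coverage at small sample sizes.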

preprint, 2015, arXiv

Dyadic data analysis with amen

Dyadic data on pairs of objects, such as relational or social network data, often exhibit strong statistical dependencies. Certain types of second-order dependencies, such as degree heterogeneity and reciprocity, can be well-represented with additive random effects models. Higher-order dependencies, such as transitivity and stochastic equivalence, can often be represented with multiplicative effects. The "amen" package for the R statistical computing environment provides estimation and inference for a class of additive and multiplicative random effects models for ordinal, continuous, binary and other types of dyadic data. The package also provides methods for missing, censored and fixed-rank nomination data, as well as longitudinal dyadic data. This tutorial illustrates the "amen" package via example statistical analyses of several of these different data types.

preprint, 2015, arXiv

Limitations on detecting row covariance in the presence of column covariance

Many inference techniques for multivariate data analysis assume that the rows of the data matrix are realizations of independent and identically distributed random vectors. Such an assumption will be met, for example, if the rows of the data matrix are multivariate measurements on a set of independently sampled units. In the absence of an independent random sample, a relevant question is whether or not a statistical model that assumes such row exchangeability is plausible. One method for assessing this plausibility is a statistical test of row covariation. Maintenance of a constant type I error rate regardless of the column covariance or matrix mean can be accomplished with a test that is invariant under an appropriate group of transformations. In the context of a class of elliptically contoured matrix regression models (such as matrix normal models), I show that there are no non-trivial invariant tests if the number of rows is not sufficiently larger than the number of columns. Furthermore, I show that even if the number of rows is large, there are no non-trivial invariant tests that have power to detect arbitrary row covariance in the presence of arbitrary column covariance. However, we can construct biased tests that have power to detect certain types of row covariance that may be encountered in practice.

preprint, 2015, arXiv

Multilinear tensor regression for longitudinal relational data

A fundamental aspect of relational data, such as from a social network, is the possibility of dependence among the relations. In particular, the relations between members of one pair of nodes may have an effect on the relations between members of another pair. This article develops a type of regression model to estimate such effects in the context of longitudinal and multivariate relational data, or other data that can be represented in the form of a tensor. The model is based on a general multilinear tensor regression model, a special case of which is a tensor autoregression model in which the tensor of relations at one time point is parsimoniously regressed on relations from previous time points. This is done via a separable, or Kronecker-structured, regression parameter along with a separable covariance model. In the context of an analysis of longitudinal multivariate relational data, it is shown how the multilinear tensor regression model can represent patterns that often appear in relational and network data, such as reciprocity and transitivity.

preprint, 2015, arXiv

Relax, Tensors Are Here: Dependencies in International Processes

Previous models of international conflict have suffered two shortfalls. They tended not to embody dynamic changes, focusing rather on static slices of behavior over time. These models have also been empirically evaluated in ways that assumed the independence of each country, when in reality they are searching for the interdependence among all countries. We illustrate a solution to these two hurdles and evaluate this new, dynamic, network-based approach to the dependencies among the ebb and flow of daily international interactions using a newly developed, and openly available, database of events among nations.

preprint, 2014, arXiv

Hierarchical array priors for ANOVA decompositions of cross-classified data

ANOVA decompositions are a standard method for describing and estimating heterogeneity among the means of a response variable across levels of multiple categorical factors. In such a decomposition, the complete set of main effects and interaction terms can be viewed as a collection of vectors, matrices and arrays that share various index sets defined by the factor levels. For many types of categorical factors, it is plausible that an ANOVA decomposition exhibits some consistency across orders of effects, in that the levels of a factor that have similar main-effect coefficients may also have similar coefficients in higher-order interaction terms. In such a case, estimation of the higher-order interactions should be improved by borrowing information from the main effects and lower-order interactions. To take advantage of such patterns, this article introduces a class of hierarchical prior distributions for collections of interaction arrays that can adapt to the presence of such interactions. These prior distributions are based on a type of array-variate normal distribution, for which a covariance matrix for each factor is estimated. This prior is able to adapt to potential similarities among the levels of a factor, and incorporate any such information into the estimation of the effects in which the factor appears. In the presence of such similarities, this prior is able to borrow information from well-estimated main effects and lower-order interactions to assist in the estimation of higher-order terms for which data information is limited.

preprint, 2014, arXiv

Information bounds for Gaussian copulas

Often of primary interest in the analysis of multivariate data are the copula parameters describing the dependence among the variables, rather than the univariate marginal distributions. Since the ranks of a multivariate dataset are invariant to changes in the univariate marginal distributions, rank-based estimators are natural candidates for semiparametric copula estimation. Asymptotic information bounds for such estimators can be obtained from an asymptotic analysis of the rank likelihood, that is, the probability of the multivariate ranks. In this article, we obtain limiting normal distributions of the rank likelihood for Gaussian copula models. Our results cover models with structured correlation matrices, such as exchangeable or circular correlation models, as well as unstructured correlation matrices. For all Gaussian copula models, the limiting distribution of the rank likelihood ratio is shown to be equal to that of a parametric likelihood ratio for an appropriately chosen multivariate normal model. This implies that the semiparametric information bounds for rank-based estimators are the same as the information bounds for estimators based on the full data, and that the multivariate normal distributions are least favorable.
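The invariance property that motivates rank-based estimation here is easy to check numerically. The sketch below draws correlated normal data, applies strictly increasing transformations to each margin, and compares the column-wise ranks. The correlation value, sample size and choice of transformations are arbitrary assumptions for illustration, and `column_ranks` is a hypothetical helper that assumes no ties.

```python
import numpy as np

def column_ranks(data):
    """Ranks 1..n within each column (assumes no tied values)."""
    return data.argsort(axis=0).argsort(axis=0) + 1

rng = np.random.default_rng(1)
z = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.6], [0.6, 1.0]], size=50)

# Change the univariate margins with strictly increasing maps; the copula is unchanged.
transformed = np.column_stack([np.exp(z[:, 0]), z[:, 1] ** 3])

same_ranks = np.array_equal(column_ranks(z), column_ranks(transformed))
```

Since the ranks are identical before and after the marginal transformations, any estimator that depends on the data only through the ranks is unaffected by the margins.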

preprint, 2014, arXiv

Separable factor analysis with applications to mortality data

Human mortality data sets can be expressed as multiway data arrays, the dimensions of which correspond to categories by which mortality rates are reported, such as age, sex, country and year. Regression models for such data typically assume an independent error distribution or an error model that allows for dependence along at most one or two dimensions of the data array. However, failing to account for other dependencies can lead to inefficient estimates of regression parameters, inaccurate standard errors and poor predictions. An alternative to assuming independent errors is to allow for dependence along each dimension of the array using a separable covariance model. However, the number of parameters in this model increases rapidly with the dimensions of the array and, for many arrays, maximum likelihood estimates of the covariance parameters do not exist. In this paper, we propose a submodel of the separable covariance model that estimates the covariance matrix for each dimension as having factor analytic structure. This model can be viewed as an extension of factor analysis to array-valued data, as it uses a factor model to estimate the covariance along each dimension of the array. We discuss properties of this model as they relate to ordinary factor analysis, describe maximum likelihood and Bayesian estimation methods, and provide a likelihood ratio testing procedure for selecting the factor model ranks. We apply this methodology to the analysis of data from the Human Mortality Database, and show in a cross-validation experiment how it outperforms simpler methods. Additionally, we use this model to impute mortality rates for countries that have no mortality data for several years. Unlike other approaches, our methodology is able to estimate similarities between the mortality rates of countries, time periods and sexes, and use this information to assist with the imputations.

preprint, 2013, arXiv

Bayesian analysis of matrix data with rstiefel

We illustrate the use of the R-package "rstiefel" for matrix-variate data analysis in the context of two examples. The first example considers estimation of a reduced-rank mean matrix in the presence of normally distributed noise. The second example considers the modeling of a social network of friendships among teenagers. Bayesian estimation for these models requires the ability to simulate from the matrix-variate von Mises-Fisher distributions and the matrix-variate Bingham distributions on the Stiefel manifold.

preprint, 2013, arXiv

Comment on "Bayesian Nonparametric Inference - Why and How" by Mueller and Mitra

Due to their great flexibility, nonparametric Bayes methods have proven to be a valuable tool for discovering complicated patterns in data. The term "nonparametric Bayes" suggests that these methods inherit model-free operating characteristics of classical nonparametric methods, as well as coherent uncertainty assessments provided by Bayesian procedures. However, as the authors say in the conclusion to their article, nonparametric Bayesian methods may be more aptly described as "massively parametric." Furthermore, I argue that many of the default nonparametric Bayes procedures are only Bayesian in the weakest sense of the term, and cannot be assumed to provide honest assessments of uncertainty merely because they carry the Bayesian label. However useful such procedures may be, we should be cautious about advertising default nonparametric Bayes procedures as either being "assumption free" or providing descriptions of our uncertainty. If we want our nonparametric Bayes procedures to have a Bayesian interpretation, we should modify default NP Bayes methods to accommodate real prior information, or at the very least, carefully evaluate the effects of hyperparameters on posterior quantities of interest.

preprint, 2013, arXiv

Testing and Modeling Dependencies Between a Network and Nodal Attributes

Network analysis is often focused on characterizing the dependencies between network relations and node-level attributes. Potential relationships are typically explored by modeling the network as a function of the nodal attributes or by modeling the attributes as a function of the network. These methods require specification of the exact nature of the association between the network and attributes, reduce the network data to a small number of summary statistics, and are unable to provide predictions simultaneously for missing attribute and network information. Existing methods that model the attributes and network jointly also assume the data are fully observed. In this article we introduce a unified approach to analysis that addresses these shortcomings. We use a latent variable model to obtain a low dimensional representation of the network in terms of node-specific network factors and use a test of dependence between the network factors and attributes as a surrogate for a test of dependence between the network and attributes. We propose a formal testing procedure to determine if dependencies exist between the network factors and attributes. We also introduce a joint model for the network and attributes, for use if the test rejects, that can capture a variety of dependence patterns and be used to make inference and predictions for missing observations.

preprint, 2013, arXiv

Testing for nodal dependence in relational data matrices

Relational data are often represented as a square matrix, the entries of which record the relationships between pairs of objects. Many statistical methods for the analysis of such data assume some degree of similarity or dependence between objects in terms of the way they relate to each other. However, formal tests for such dependence have not been developed. We provide a test for such dependence using the framework of the matrix normal model, a type of multivariate normal distribution parameterized in terms of row- and column-specific covariance matrices. We develop a likelihood ratio test (LRT) for row and column dependence based on the observation of a single relational data matrix. We obtain a reference distribution for the LRT statistic, thereby providing an exact test for the presence of row or column correlations in a square relational data matrix. Additionally, we provide extensions of the test to accommodate common features of such data, such as undefined diagonal entries, a non-zero mean, multiple observations, and deviations from normality.

preprint, 2012, arXiv

Marginally Specified Priors for Nonparametric Bayesian Estimation

Prior specification for nonparametric Bayesian inference involves the difficult task of quantifying prior knowledge about a parameter of high, often infinite, dimension. Realistically, a statistician is unlikely to have informed opinions about all aspects of such a parameter, but may have real information about functionals of the parameter, such as the population mean or variance. This article proposes a new framework for nonparametric Bayes inference in which the prior distribution for a possibly infinite-dimensional parameter is decomposed into two parts: an informative prior on a finite set of functionals, and a nonparametric conditional prior for the parameter given the functionals. Such priors can be easily constructed from standard nonparametric prior distributions in common use, and inherit the large support of the standard priors upon which they are based. Additionally, posterior approximations under these informative priors can generally be made via minor adjustments to existing Markov chain approximation algorithms for standard nonparametric prior distributions. We illustrate the use of such priors in the context of multivariate density estimation using Dirichlet process mixture models, and in the modeling of high-dimensional sparse contingency tables.

preprint, 2011, arXiv

A covariance regression model

Classical regression analysis relates the expectation of a response variable to a linear combination of explanatory variables. In this article, we propose a covariance regression model that parameterizes the covariance matrix of a multivariate response vector as a parsimonious quadratic function of explanatory variables. The approach is analogous to the mean regression model, and is similar to a factor analysis model in which the factor loadings depend on the explanatory variables. Using a random-effects representation, parameter estimation for the model is straightforward using either an EM-algorithm or an MCMC approximation via Gibbs sampling. The proposed methodology provides a simple but flexible representation of heteroscedasticity across the levels of an explanatory variable, improves estimation of the mean function and gives better calibrated prediction regions when compared to a homoscedastic model.
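One way to picture a covariance that is a quadratic function of the explanatory variables is sketched below. The specific form Cov[y | x] = Psi + B x x^T B^T and the random-effects representation y = gamma * B x + eps, with gamma a standard normal scalar, are stated here as assumptions consistent with the abstract rather than as a transcription of the paper. The names `Psi`, `B`, `covariance_at` and `simulate_response` are hypothetical, and the mean is set to zero for simplicity.

```python
import numpy as np

def covariance_at(x, Psi, B):
    """Assumed covariance function: baseline Psi plus a rank-one term quadratic in x."""
    Bx = B @ x
    return Psi + np.outer(Bx, Bx)

def simulate_response(x, Psi, B, rng, size):
    """Draw zero-mean responses via the assumed random-effects form y = gamma * B x + eps."""
    gamma = rng.normal(size=(size, 1))                                   # scalar random effect per draw
    eps = rng.multivariate_normal(np.zeros(Psi.shape[0]), Psi, size=size)
    return gamma * (B @ x) + eps
```

Under this representation the covariance of the simulated responses at a given `x` matches `covariance_at(x, Psi, B)`, which is what makes Gibbs sampling or EM steps resemble those of a factor model with loadings `B @ x`.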

preprint, 2011, arXiv

A mixed effects model for longitudinal relational and network data, with applications to international trade and conflict

The focus of this paper is an approach to the modeling of longitudinal social network or relational data. Such data arise from measurements on pairs of objects or actors made at regular temporal intervals, resulting in a social network for each point in time. In this article we represent the network and temporal dependencies with a random effects model, resulting in a stochastic process defined by a set of stationary covariance matrices. Our approach builds upon the social relations models of Warner, Kenny and Stoto [Journal of Personality and Social Psychology 37 (1979) 1742–1757] and Gill and Swartz [Canad. J. Statist. 29 (2001) 321–331] and allows for an intra- and inter-temporal representation of network structures. We apply the methodology to two longitudinal data sets: international trade (continuous response) and militarized interstate disputes (binary response).

preprint, 2010, arXiv

A Statistical View of Learning in the Centipede Game

In this article we evaluate the statistical evidence that a population of students learn about the sub-game perfect Nash equilibrium of the centipede game via repeated play of the game. This is done by formulating a model in which a player's error in assessing the utility of decisions changes as they gain experience with the game. We first estimate parameters in a statistical model where the probabilities of choices of the players are given by a Quantal Response Equilibrium (QRE) (McKelvey and Palfrey, 1995, 1996, 1998), but are allowed to change with repeated play. This model gives a better fit to the data than similar models previously considered. However, substantial correlation of outcomes of games having a common player suggests that a statistical model that captures within-subject correlation is more appropriate. Thus we then estimate parameters in a model which allows for within-player correlation of decisions and rates of learning. Throughout the paper we also consider and compare the use of randomization tests and posterior predictive tests in the context of exploratory and confirmatory data analyses.

preprint, 2010, arXiv

Convergence of Nonparametric Long-Memory Phase I Designs

We examine nonparametric dose-finding designs that use toxicity estimates based on all available data at each dose allocation decision. We prove that one such design family, called here "interval design", converges almost surely to the maximum tolerated dose (MTD), if the MTD is the only dose level whose toxicity rate falls within the pre-specified interval around the desired target rate. Another nonparametric family, called "point design", has a positive probability of not converging. In a numerical sensitivity study, a diverse sample of dose-toxicity scenarios was randomly generated. On this sample, the "interval design" convergence conditions are met far more often than the conditions for one-parameter design convergence (the Shen-O'Quigley conditions), suggesting that the interval-design conditions are less restrictive. Implications of these theoretical and numerical results for small-sample behavior of the designs, and for future research, are discussed.

preprint, 2010, arXiv

Separable covariance arrays via the Tucker product, with applications to multivariate relational data

Modern datasets are often in the form of matrices or arrays, potentially having correlations along each set of data indices. For example, data involving repeated measurements of several variables over time may exhibit temporal correlation as well as correlation among the variables. A possible model for matrix-valued data is the class of matrix normal distributions, which is parametrized by two covariance matrices, one for each index set of the data. In this article we describe an extension of the matrix normal model to accommodate multidimensional data arrays, or tensors. We generate a class of array normal distributions by applying a group of multilinear transformations to an array of independent standard normal random variables. The covariance structures of the resulting class take the form of outer products of dimension-specific covariance matrices. We derive some properties of these covariance structures and the corresponding array normal distributions, discuss maximum likelihood and Bayesian estimation of covariance parameters and illustrate the model in an analysis of multivariate longitudinal network data.
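The generative construction in this abstract, a multilinear transformation applied to an array of independent standard normals, can be sketched for a three-way array as follows. The function names and the restriction to three modes are assumptions made for brevity; the vectorization uses NumPy's default row-major order, under which the transformation acts on the vectorized array as a Kronecker product.

```python
import numpy as np

def tucker_product(Z, A1, A2, A3):
    """Multiply a three-way array Z by A1, A2, A3 along modes 1, 2 and 3."""
    return np.einsum("ia,jb,kc,abc->ijk", A1, A2, A3, Z)

def separable_covariance(A1, A2, A3):
    """Covariance of the row-major vectorization of tucker_product(Z, A1, A2, A3)
    when Z has independent standard normal entries."""
    S1, S2, S3 = A1 @ A1.T, A2 @ A2.T, A3 @ A3.T   # mode-specific covariance matrices
    return np.kron(np.kron(S1, S2), S3)

def sample_array_normal(A1, A2, A3, rng):
    """Draw one zero-mean array normal variate with the separable covariance above."""
    Z = rng.normal(size=(A1.shape[1], A2.shape[1], A3.shape[1]))
    return tucker_product(Z, A1, A2, A3)
```

The covariance has one small matrix per mode instead of one large unstructured matrix, which is where the parsimony of the separable model comes from.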