Source author record

Terrance D. Savitsky

Terrance D. Savitsky appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Methodology Applications math.ST Statistics Theory Computation

Catalog footprint

What is connected

11works

5topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2022arXiv

Distributed Bayesian Varying Coefficient Modeling Using a Gaussian Process Prior

Varying coefficient models (VCMs) are widely used for estimating nonlinear regression functions for functional data. Their Bayesian variants using Gaussian process priors on the functional coefficients, however, have received limited attention in massive data applications, mainly due to the prohibitively slow posterior computations using Markov chain Monte Carlo (MCMC) algorithms. We address this problem using a divide-and-conquer Bayesian approach. We first create a large number of data subsamples with much smaller sizes. Then, we formulate the VCM as a linear mixed-effects model and develop a data augmentation algorithm for obtaining MCMC draws on all the subsets in parallel. Finally, we aggregate the MCMC-based estimates of subset posteriors into a single Aggregated Monte Carlo (AMC) posterior, which is used as a computationally efficient alternative to the true posterior distribution. Theoretically, we derive minimax optimal posterior convergence rates for the AMC posteriors of both the varying coefficients and the mean regression function. We provide quantification on the orders of subset sample sizes and the number of subsets. The empirical results show that the combination schemes that satisfy our theoretical assumptions, including the AMC posterior, have better estimation performance than their main competitors across diverse simulations and in a real data analysis.

preprint2022arXiv

Private Tabular Survey Data Products through Synthetic Microdata Generation

We propose two synthetic microdata approaches to generate private tabular survey data products for public release. We adapt a pseudo posterior mechanism that downweights by-record likelihood contributions with weights $\in [0,1]$ based on their identification disclosure risks to producing tabular products for survey data. Our method applied to an observed survey database achieves an asymptotic global probabilistic differential privacy guarantee. Our two approaches synthesize the observed sample distribution of the outcome and survey weights, jointly, such that both quantities together possess a privacy guarantee. The privacy-protected outcome and survey weights are used to construct tabular cell estimates (where the cell inclusion indicators are treated as known and public) and associated standard errors to correct for survey sampling bias. Through a real data application to the Survey of Doctorate Recipients public use file and simulation studies motivated by the application, we demonstrate that our two microdata synthesis approaches to construct tabular products provide superior utility preservation as compared to the additive-noise approach of the Laplace Mechanism. Moreover, our approaches allow the release of microdata to the public, enabling additional analyses at no extra privacy cost.

preprint2022arXiv

Re-weighting of Vector-weighted Mechanisms for Utility Maximization under Differential Privacy

We address practical implementation of a risk-weighted pseudo posterior synthesizer for microdata dissemination with a new re-weighting strategy that maximizes utility of released synthetic data under at any level of formal privacy guarantee. Our re-weighting strategy applies to any vector-weighted pseudo posterior mechanism under which a vector of observation-indexed weights are used to downweight likelihood contributions for high disclosure risk records. We demonstrate our method on two different vector-weighted schemes that target high-risk records. Our new method for constructing record-indexed downeighting maximizes the data utility under any privacy budget for the vector-weighted synthesizers by adjusting the by-record weights, such that their individual Lipschitz bounds approach the bound for the entire database. Our method achieves an $(ε= 2 Δ_{\boldsymbolα})-$asymptotic differential privacy (aDP) guarantee, globally, over the space of databases. We illustrate our methods using simulated highly skewed count data and compare the results to a scalar-weighted synthesizer under the Exponential Mechanism (EM). We also apply our methods to a sample of the Survey of Doctorate Recipients and demonstrate the practicality of our methods.

preprint2021arXiv

Bayesian Data Synthesis and Disclosure Risk Quantification: An Application to the Consumer Expenditure Surveys

The release of synthetic data generated from a model estimated on the data helps statistical agencies disseminate respondent-level data with high utility and privacy protection. Motivated by the challenge of disseminating sensitive variables containing geographic information in the Consumer Expenditure Surveys (CE) at the U.S. Bureau of Labor Statistics, we propose two non-parametric Bayesian models as data synthesizers for the county identifier of each data record: a Bayesian latent class model and a Bayesian areal model. Both data synthesizers use Dirichlet Process priors to cluster observations of similar characteristics and allow borrowing information across observations. We develop innovative disclosure risks measures to quantify inherent risks in the confidential CE data and how those data risks are ameliorated by our proposed synthesizers. By creating a lower bound and an upper bound of disclosure risks under a minimum and a maximum disclosure risks scenarios respectively, our proposed inherent risks measures provide a range of acceptable disclosure risks for evaluating risks level in the synthetic datasets.

preprint2021arXiv

Risk-Efficient Bayesian Data Synthesis for Privacy Protection

Statistical agencies utilize models to synthesize respondent-level data for release to the public for privacy protection. In this work, we efficiently induce privacy protection into any Bayesian synthesis model by employing a pseudo likelihood that exponentiates each likelihood contribution by an observation record-indexed weight in [0, 1], defined to be inversely proportional to the identification risk for that record. We start with the marginal probability of identification risk for a record, which is composed as the probability that the identity of the record may be disclosed. Our application to the Consumer Expenditure Surveys (CE) of the U.S. Bureau of Labor Statistics demonstrates that the marginally risk-adjusted synthesizer provides an overall improved privacy protection; however, the identification risks actually increase for some moderate-risk records after risk-adjusted pseudo posterior estimation synthesis due to increased isolation after weighting; a phenomenon we label "whack-a-mole". We proceed to construct a weight for each record from a collection of pairwise identification risk probabilities with other records, where each pairwise probability measures the joint probability of re-identification of the pair of records, which mitigates the whack-a-mole issue and produces a more efficient set of synthetic data with lower risk and higher utility for the CE data.

preprint2020arXiv

Bayesian Pseudo Posterior Synthesis for Data Privacy Protection

Statistical agencies utilize models to synthesize respondent-level data for release to the general public as an alternative to the actual data records. A Bayesian model synthesizer encodes privacy protection by employing a hierarchical prior construction that induces smoothing of the real data distribution. Synthetic respondent-level data records are often preferred to summary data tables due to the many possible uses by researchers and data analysts. Agencies balance a trade-off between utility of the synthetic data versus disclosure risks and hold a specific target threshold for disclosure risk before releasing synthetic datasets. We introduce a pseudo posterior likelihood that exponentiates each contribution by an observation record-indexed weight in (0, 1), defined to be inversely proportional to the disclosure risk for that record in the synthetic data. Our use of a vector of weights allows more precise downweighting of high risk records in a fashion that better preserves utility as compared with using a scalar weight. We illustrate our method with a simulation study and an application to the Consumer Expenditure Survey of the U.S. Bureau of Labor Statistics. We demonstrate how the frequentist consistency and uncertainty quantification are affected by the inverse risk-weighting.

preprint2019arXiv

Bayesian Uncertainty Estimation Under Complex Sampling

Social and economic studies are often implemented as complex survey designs. For example, multistage, unequal probability sampling designs utilized by federal statistical agencies are typically constructed to maximize the efficiency of the target domain level estimator (e.g., indexed by geographic area) within cost constraints for survey administration. Such designs may induce dependence between the sampled units; for example, with employment of a sampling step that selects geographically-indexed clusters of units. A sampling-weighted pseudo-posterior distribution may be used to estimate the population model on the observed sample. The dependence induced between co-clustered units inflates the scale of the resulting pseudo-posterior covariance matrix that has been shown to induce under coverage of the credibility sets. By bridging results across Bayesian model mispecification and survey sampling, we demonstrate that the scale and shape of the asymptotic distributions are different between each of the pseudo-MLE, the pseudo-posterior and the MLE under simple random sampling. Through insights from survey sampling variance estimation and recent advances in computational methods, we devise a correction applied as a simple and fast post-processing step to MCMC draws of the pseudo-posterior distribution. This adjustment projects the pseudo-posterior covariance matrix such that the nominal coverage is approximately achieved. We make an application to the National Survey on Drug Use and Health as a motivating example and we demonstrate the efficacy of our scale and shape projection procedure on synthetic data on several common archetypes of survey designs.

preprint2016arXiv

Bayesian Estimation Under Informative Sampling

Bayesian analysis is increasingly popular for use in social science and other application areas where the data are observations from an informative sample. An informative sampling design leads to inclusion probabilities that are correlated with the response variable of interest. Model inference performed on the observed sample taken from the population will be biased for the population generative model under informative sampling since the balance of information in the sample data is different from that for the population. Typical approaches to account for an informative sampling design under Bayesian estimation are often difficult to implement because they require re-parameterization of the hypothesized generating model, or focus on design, rather than model-based, inference. We propose to construct a pseudo-posterior distribution that utilizes sampling weights based on the marginal inclusion probabilities to exponentiate the likelihood contribution of each sampled unit, which weights the information in the sample back to the population. Our approach provides a nearly automated estimation procedure applicable to any model specified by the data analyst for the population and retains the population model parameterization and posterior sampling geometry. We construct conditions on known marginal and pairwise inclusion probabilities that define a class of sampling designs where $L_{1}$ consistency of the pseudo posterior is guaranteed. We demonstrate our method on an application concerning the Bureau of Labor Statistics Job Openings and Labor Turnover Survey.

preprint2015arXiv

Bayesian Nonparameteric Multiresolution Estimation for the American Community Survey

Bayesian hierarchical methods implemented for small area estimation focus on reducing the noise variation in published government official statistics by borrowing information among dependent response values. Even the most flexible models confine parameters defined at the finest scale to link to each data observation in a one-to-one construction. We propose a Bayesian multiresolution formulation that utilizes an ensemble of observations at a variety of coarse scales in space and time to additively nest parameters we define at a finer scale, which serve as our focus for estimation. Our construction is motivated by and applied to the estimation of $1-$ year period employment levels, indexed by county, from statistics published at coarser areal domains and multi-year intervals in the American Community Survey (ACS). We construct a nonparametric mixture of Gaussian processes as the prior on a set of regression coefficients of county-indexed latent functions over multiple survey years. We evaluate a modified Dirichlet process prior that incorporates county-year predictors as the mixing measure. Each county-year parameter of a latent function is estimated from multiple coarse scale observations in space and time to which it links. The multiresolution formulation is evaluated on synthetic data and applied to the ACS.

preprint2015arXiv

Bayesian Nonparametric Functional Mixture Estimation for Time-Series Data, With Application to Estimation of State Employment Totals

The U.S. Bureau of Labor Statistics use monthly, by-state employment totals from the Current Population Survey (CPS) as a key input to develop employment estimates for counties within the states. The monthly CPS by-state totals, however, express high levels of volatility that compromise the accuracy of resulting estimates composed for the counties. Typically-employed models for small area estimation produce de-noised, state-level employment estimates by borrowing information over the survey months, but assume independence among the collection of by-state time series, which is typically violated due to similarities in their underlying economies. We construct Gaussian process and Gaussian Markov random field alternative functional prior specifications, each in a mixture of multivariate Gaussian distributions with a Dirichlet process (DP) mixing measure over the parameters of their covariance or precision matrices. Our DP mixture of functions models allow the data to simultaneously estimate a dependence among the months and between states. A feature of our models is that those functions assigned to the same cluster are drawn from a distribution with the same covariance parameters, so that they are similar, but don't have to be identical. We compare the performances of our two alternatives on synthetic data and apply them to recover de-noised, by-state CPS employment totals for data from $2000-2013$.

preprint2015arXiv

Inferring constructs of effective teaching from classroom observations: An application of Bayesian exploratory factor analysis without restrictions

Ratings of teachers' instructional practices using standardized classroom observation instruments are increasingly being used for both research and teacher accountability. There are multiple instruments in use, each attempting to evaluate many dimensions of teaching and classroom activities, and little is known about what underlying teaching quality attributes are being measured. We use data from multiple instruments collected from 458 middle school mathematics and English language arts teachers to inform research and practice on teacher performance measurement by modeling latent constructs of high-quality teaching. We make inferences about these constructs using a novel approach to Bayesian exploratory factor analysis (EFA) that, unlike commonly used approaches for identifying factor loadings in Bayesian EFA, is invariant to how the data dimensions are ordered. Applying this approach to ratings of lessons reveals two distinct teaching constructs in both mathematics and English language arts: (1) quality of instructional practices; and (2) quality of teacher management of classrooms. We demonstrate the relationships of these constructs to other indicators of teaching quality, including teacher content knowledge and student performance on standardized tests.

Terrance D. Savitsky

What is connected

Connect this record

See the researcher in context

Building this map preview

11 published item(s)

Distributed Bayesian Varying Coefficient Modeling Using a Gaussian Process Prior

Private Tabular Survey Data Products through Synthetic Microdata Generation

Re-weighting of Vector-weighted Mechanisms for Utility Maximization under Differential Privacy

Bayesian Data Synthesis and Disclosure Risk Quantification: An Application to the Consumer Expenditure Surveys

Risk-Efficient Bayesian Data Synthesis for Privacy Protection

Bayesian Pseudo Posterior Synthesis for Data Privacy Protection

Bayesian Uncertainty Estimation Under Complex Sampling

Bayesian Estimation Under Informative Sampling

Bayesian Nonparameteric Multiresolution Estimation for the American Community Survey

Bayesian Nonparametric Functional Mixture Estimation for Time-Series Data, With Application to Estimation of State Employment Totals

Inferring constructs of effective teaching from classroom observations: An application of Bayesian exploratory factor analysis without restrictions