Source author record

Rebecca C. Steorts

Rebecca C. Steorts appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher. Unclaimed source record.

Methodology, Applications, Databases, Machine Learning, Computation, math.ST, Statistics Theory, stat.OT

Catalog footprint

What is connected

21 works

8 topics

4 close collaborators


preprint, 2023, arXiv

Bayesian Graphical Entity Resolution Using Exchangeable Random Partition Priors

Entity resolution (record linkage or deduplication) is the process of identifying and linking duplicate records in databases. In this paper, we propose a Bayesian graphical approach for entity resolution that links records to latent entities, where the prior representation on the linkage structure is exchangeable. First, we adopt a flexible and tractable set of priors for the linkage structure, which corresponds to a special class of random partition models. Second, we propose a more realistic distortion model for categorical/discrete record attributes, which corrects a logical inconsistency with the standard hit-miss model. Third, we incorporate hyperpriors to improve flexibility. Fourth, we employ a partially collapsed Gibbs sampler for inferential speedups. Using a selection of private and nonprivate data sets, we investigate the impact of our modeling contributions and compare our model with two alternative Bayesian models. In addition, we conduct a simulation study for household survey data, where we vary distortion, duplication rates and data set size. We find that our model performs more consistently than the alternatives across a variety of scenarios and typically achieves the highest entity resolution accuracy (F1 score). Open source software is available for our proposed methodology, and we provide a discussion regarding our work and future directions.

preprint, 2022, arXiv

(Almost) All of Entity Resolution

Whether the goal is to estimate the number of people that live in a congressional district, to estimate the number of individuals that have died in an armed conflict, or to disambiguate individual authors using bibliographic data, all these applications have a common theme: integrating information from multiple sources. Before such questions can be answered, databases must be cleaned and integrated in a systematic and accurate way, a process commonly known as record linkage, de-duplication, or entity resolution. In this article, we review motivational applications and seminal papers that have led to the growth of this area. Specifically, we review the foundational work that began in the 1940s and 1950s and led to modern probabilistic record linkage. We review clustering approaches to entity resolution, semi- and fully supervised methods, and canonicalization, which are being used throughout industry and academia in applications such as human rights, official statistics, medicine, and citation networks, among others. Finally, we discuss current research topics of practical importance.

preprint, 2022, arXiv

A Practical Approach to Proper Inference with Linked Data

Entity resolution (ER), comprising record linkage and de-duplication, is the process of merging noisy databases in the absence of unique identifiers to remove duplicate entities. One major challenge of analysis with linked data is identifying a representative record among determined matches to pass to an inferential or predictive task, referred to as the downstream task. Additionally, incorporating uncertainty from ER in the downstream task is critical to ensure proper inference. To bridge the gap between ER and the downstream task in an analysis pipeline, we propose five methods to choose a representative (or canonical) record from linked data, referred to as canonicalization. Our methods are scalable in the number of records, appropriate in general data scenarios, and provide natural error propagation via a Bayesian canonicalization stage. The proposed methodology is evaluated on three simulated data sets and one application -- determining the relationship between demographic information and party affiliation in voter registration data from the North Carolina State Board of Elections. We first perform Bayesian ER and evaluate our proposed methods for canonicalization before considering the downstream tasks of linear and logistic regression. Bayesian canonicalization methods are empirically shown to improve downstream inference in both settings through prediction and coverage.

preprint, 2021, arXiv

On the Reliability of Multiple Systems Estimation for the Quantification of Modern Slavery

The quantification of modern slavery has received increased attention recently as organizations have come together to produce global estimates, where multiple systems estimation (MSE) is often used to this end. Echoing a long-standing controversy, disagreements have re-surfaced regarding the underlying MSE assumptions, the robustness of MSE methodology, and the accuracy of MSE estimates in this application. Our goal is to help address and move past these controversies. To do so, we review MSE, its assumptions, and commonly used models for modern slavery applications. We introduce all of the publicly available modern slavery datasets in the literature, providing a reproducible analysis and highlighting current issues. Specifically, we utilize an internal consistency approach that constructs subsets of data for which ground truth is available, allowing us to evaluate the accuracy of MSE estimators. Next, we propose a characterization of the large sample bias of estimators as a function of misspecified assumptions. Then, we propose an alternative to traditional (e.g., bootstrap-based) assessments of reliability, which allows us to visualize trajectories of MSE estimates to illustrate the robustness of estimates. Finally, our complementary analyses are used to provide guidance regarding the application and reliability of MSE methodology.

preprint, 2021, arXiv

Transformed Fay-Herriot Model with Measurement Error in Covariates

Statistical agencies are often asked to produce small area estimates (SAEs) for positively skewed variables. When domain sample sizes are too small to support direct estimators, effects of skewness of the response variable can be large. As such, it is important to appropriately account for the distribution of the response variable given available auxiliary information. Motivated by this issue and in order to stabilize the skewness and achieve normality in the response variable, we propose an area-level log-measurement error model on the response variable. Then, under our proposed modeling framework, we derive an empirical Bayes (EB) predictor of positive small area quantities subject to the covariates containing measurement error. We propose a corresponding mean squared prediction error (MSPE) of the EB predictor using both a jackknife and a bootstrap method. We show that the order of the bias is $O(m^{-1})$, where $m$ is the number of small areas. Finally, we investigate the performance of our methodology using both design-based and model-based simulation studies.

preprint, 2020, arXiv

Random Partition Models for Microclustering Tasks

Traditional Bayesian random partition models assume that the size of each cluster grows linearly with the number of data points. While this is appealing for some applications, this assumption is not appropriate for other tasks such as entity resolution, modeling of sparse networks, and DNA sequencing tasks. Such applications require models that yield clusters whose sizes grow sublinearly with the total number of data points -- the microclustering property. Motivated by these issues, we propose a general class of random partition models that satisfy the microclustering property with well-characterized theoretical properties. Our proposed models overcome major limitations in the existing literature on microclustering models, namely a lack of interpretability, identifiability, and full characterization of model asymptotic properties. Crucially, we drop the classical assumption of having an exchangeable sequence of data points, and instead assume an exchangeable sequence of clusters. In addition, our framework provides flexibility in terms of the prior distribution of cluster sizes, computational tractability, and applicability to a large number of microclustering tasks. We establish theoretical properties of the resulting class of priors, where we characterize the asymptotic behavior of the number of clusters and of the proportion of clusters of a given size. Our framework allows a simple and efficient Markov chain Monte Carlo algorithm to perform statistical inference. We illustrate our proposed methodology on the microclustering task of entity resolution, where we provide a simulation study and real experiments on survey panel data.

preprint, 2016, arXiv

Flexible Models for Microclustering with Application to Entity Resolution

Most generative models for clustering implicitly assume that the number of data points in each cluster grows linearly with the total number of data points. Finite mixture models, Dirichlet process mixture models, and Pitman--Yor process mixture models make this assumption, as do all other infinitely exchangeable clustering models. However, for some applications, this assumption is inappropriate. For example, when performing entity resolution, the size of each cluster should be unrelated to the size of the data set, and each cluster should contain a negligible fraction of the total number of data points. These applications require models that yield clusters whose sizes grow sublinearly with the size of the data set. We address this requirement by defining the microclustering property and introducing a new class of models that can exhibit this property. We compare models within this class to two commonly used clustering models using four entity-resolution data sets.

preprint, 2016, arXiv

Regularized brain reading with shrinkage and smoothing

Functional neuroimaging measures how the brain responds to complex stimuli. However, sample sizes are modest, noise is substantial, and stimuli are high dimensional. Hence, direct estimates are inherently imprecise and call for regularization. We compare a suite of approaches which regularize via shrinkage: ridge regression, the elastic net (a generalization of ridge regression and the lasso), and a hierarchical Bayesian model based on small area estimation (SAE). We contrast regularization with spatial smoothing and combinations of smoothing and shrinkage. All methods are tested on functional magnetic resonance imaging (fMRI) data from multiple subjects participating in two different experiments related to reading, for both predicting neural response to stimuli and decoding stimuli from responses. Interestingly, when the regularization parameters are chosen by cross-validation independently for every voxel, low/high regularization is chosen in voxels where the classification accuracy is high/low, indicating that the regularization intensity is a good tool for identification of relevant voxels for the cognitive task. Surprisingly, all the regularization methods work about equally well, suggesting that beating basic smoothing and shrinkage will take not only clever methods, but also careful modeling.

preprint, 2015, arXiv

A Bayesian Approach to Graphical Record Linkage and De-duplication

We propose an unsupervised approach for linking records across arbitrarily many files, while simultaneously detecting duplicate records within files. Our key innovation involves the representation of the pattern of links between records as a bipartite graph, in which records are directly linked to latent true individuals, and only indirectly linked to other records. This flexible representation of the linkage structure naturally allows us to estimate the attributes of the unique observable people in the population, calculate transitive linkage probabilities across records (and represent this visually), and propagate the uncertainty of record linkage into later analyses. Our method makes it particularly easy to integrate record linkage with post-processing procedures such as logistic regression, capture-recapture, etc. Our linkage structure lends itself to an efficient, linear-time, hybrid Markov chain Monte Carlo algorithm, which overcomes many obstacles encountered by previous record linkage approaches, despite the high-dimensional parameter space. We illustrate our method using longitudinal data from the National Long Term Care Survey and with data from the Italian Survey on Household and Wealth, where we assess the accuracy of our method and show it to be better in terms of error rates and empirical scalability than other approaches in the literature.

preprint, 2015, arXiv

Blocking Methods Applied to Casualty Records from the Syrian Conflict

Estimation of death counts and associated standard errors is of great importance in armed conflict such as the ongoing violence in Syria, as well as historical conflicts in Guatemala, Perú, Colombia, Timor Leste, and Kosovo. For example, statistical estimates of death counts were cited as important evidence in the trial of General Efraín Ríos Montt for acts of genocide in Guatemala. Estimation relies on both record linkage and multiple systems estimation. A key first step in this process is identifying ways to partition the records such that they are computationally manageable. This step is referred to as blocking and is a major challenge for the Syrian database since it is sparse in the number of duplicate records and feature poor in its attributes. As a consequence, we propose locality sensitive hashing (LSH) methods to overcome these challenges. We demonstrate the computational superiority and error rates of these methods by comparing our proposed approach with others in the literature. We conclude with a discussion of many challenges of merging LSH with record linkage to achieve an estimate of the number of uniquely documented deaths in the Syrian conflict.

preprint, 2015, arXiv

Entity Resolution with Empirically Motivated Priors

Databases often contain corrupted, degraded, and noisy data with duplicate entries across and within each database. Such problems arise in citations, medical databases, genetics, human rights databases, and a variety of other applied settings. The target of statistical inference can be viewed as an unsupervised problem of determining the edges of a bipartite graph that links the observed records to unobserved latent entities. Bayesian approaches provide attractive benefits, naturally providing uncertainty quantification via posterior probabilities. We propose a novel record linkage approach based on empirical Bayesian principles. Specifically, the empirical Bayesian-type step consists of taking the empirical distribution function of the data as the prior for the latent entities. This approach improves on the earlier hierarchical Bayesian (HB) approach not only by avoiding the prior specification problem but also by allowing both categorical and string-valued variables. Our extension to string-valued variables also involves the proposal of a new probabilistic mechanism by which observed record values for string fields can deviate from the values of their associated latent entities. Categorical fields that deviate from their corresponding true value are simply drawn from the empirical distribution function. We apply our proposed methodology to a simulated data set of German names and an Italian household survey, showing our method performs favorably compared to several standard methods in the literature. We also consider the robustness of our methods to changes in the hyper-parameters.

preprint, 2015, arXiv

Microclustering: When the Cluster Sizes Grow Sublinearly with the Size of the Data Set

Most generative models for clustering implicitly assume that the number of data points in each cluster grows linearly with the total number of data points. Finite mixture models, Dirichlet process mixture models, and Pitman--Yor process mixture models make this assumption, as do all other infinitely exchangeable clustering models. However, for some tasks, this assumption is undesirable. For example, when performing entity resolution, the size of each cluster is often unrelated to the size of the data set. Consequently, each cluster contains a negligible fraction of the total number of data points. Such tasks therefore require models that yield clusters whose sizes grow sublinearly with the size of the data set. We address this requirement by defining the microclustering property and introducing a new model that exhibits this property. We compare this model to several commonly used clustering models by checking model fit using real and simulated data sets.

preprint, 2014, arXiv

A Comparison of Blocking Methods for Record Linkage

Record linkage seeks to merge databases and to remove duplicates when unique identifiers are not available. Most approaches use blocking techniques to reduce the computational complexity associated with record linkage. We review traditional blocking techniques, which typically partition the records according to a set of field attributes, and consider two variants of a method known as locality sensitive hashing, sometimes referred to as "private blocking." We compare these approaches in terms of their recall, reduction ratio, and computational complexity. We evaluate these methods using different synthetic datafiles and conclude with a discussion of privacy-related issues.
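The two evaluation measures named in this abstract, recall and reduction ratio, can be illustrated with a minimal sketch of traditional key-based blocking. This is not the paper's code: the toy records, the `true_entity` labels, and the function names are hypothetical, and the definitions used are the common ones (recall is the share of true matching pairs that survive blocking; reduction ratio is one minus the share of all record pairs that remain as candidates).

```python
from itertools import combinations


def block_by_key(records, key):
    """Partition record ids into blocks that share the same blocking key."""
    blocks = {}
    for rid, rec in records.items():
        blocks.setdefault(key(rec), []).append(rid)
    return blocks


def candidate_pairs(blocks):
    """Only records inside the same block are ever compared."""
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def evaluate_blocking(records, true_entity, key):
    """Return (recall, reduction_ratio) for one blocking key."""
    n = len(records)
    total_pairs = n * (n - 1) // 2
    candidates = candidate_pairs(block_by_key(records, key))
    true_pairs = {
        (a, b)
        for a, b in combinations(sorted(records), 2)
        if true_entity[a] == true_entity[b]
    }
    recall = len(candidates & true_pairs) / len(true_pairs) if true_pairs else 1.0
    reduction_ratio = 1 - len(candidates) / total_pairs if total_pairs else 0.0
    return recall, reduction_ratio


# Hypothetical toy data: records 1-3 refer to the same person.
records = {
    1: {"last": "Smith", "year": 1980},
    2: {"last": "Smith", "year": 1980},
    3: {"last": "Smyth", "year": 1980},
    4: {"last": "Jones", "year": 1975},
}
true_entity = {1: "a", 2: "a", 3: "a", 4: "b"}

# Coarse key: first letter of the last name.
coarse = evaluate_blocking(records, true_entity, key=lambda r: r["last"][0])
# Strict key: exact last name (the misspelling "Smyth" is split off).
strict = evaluate_blocking(records, true_entity, key=lambda r: r["last"])
print(coarse, strict)
```

On this toy data the stricter key removes more comparisons but loses true matches, which is the trade-off a comparison of blocking methods has to weigh.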

preprint, 2014, arXiv

Discussion of "Estimating the Distribution of Dietary Consumption Patterns"

Discussion of "Estimating the Distribution of Dietary Consumption Patterns" by Raymond J. Carroll [arXiv:1405.4667].

preprint, 2014, arXiv

Discussion of "Single and Two-Stage Cross-Sectional and Time Series Benchmarking Procedures for SAE"

We congratulate the authors for a stimulating and valuable manuscript, providing a careful review of the state-of-the-art in cross-sectional and time-series benchmarking procedures for small area estimation. They develop a novel two-stage benchmarking method for hierarchical time series models, where they evaluate their procedure by estimating monthly total unemployment using data from the U.S. Census Bureau. We discuss three topics: linearity and model misspecification, computational complexity and model comparisons, and some aspects of small area estimation in practice. More specifically, we pose the following questions to the authors, which they may wish to answer: How robust is their model to misspecification? Is it time to perhaps move away from linear models of the type considered by Battese et al. (1988) and Fay and Herriot (1979)? What is the asymptotic computational complexity and what comparisons can be made to other models? Should the benchmarking constraints be inherently fixed or should they be random?

preprint, 2014, arXiv

SMERED: A Bayesian Approach to Graphical Record Linkage and De-duplication

We propose a novel unsupervised approach for linking records across arbitrarily many files, while simultaneously detecting duplicate records within files. Our key innovation is to represent the pattern of links between records as a bipartite graph, in which records are directly linked to latent true individuals, and only indirectly linked to other records. This flexible new representation of the linkage structure naturally allows us to estimate the attributes of the unique observable people in the population, calculate $k$-way posterior probabilities of matches across records, and propagate the uncertainty of record linkage into later analyses. Our linkage structure lends itself to an efficient, linear-time, hybrid Markov chain Monte Carlo algorithm, which overcomes many obstacles encountered by previously proposed methods of record linkage, despite the high-dimensional parameter space. We assess our results on real and simulated data.

preprint, 2014, arXiv

Smoothing, Clustering, and Benchmarking for Small Area Estimation

We develop constrained Bayesian estimation methods for small area problems: those requiring smoothness with respect to similarity across areas, such as geographic proximity or clustering by covariates; and benchmarking constraints, requiring (weighted) means of estimates to agree across levels of aggregation. We develop methods for constrained estimation decision-theoretically and discuss their geometric interpretation. Our constrained estimators are the solutions to tractable optimization problems and have closed-form solutions. Mean squared errors of the constrained estimators are calculated via bootstrapping. Our techniques are free of distributional assumptions and apply whether the estimator is linear or non-linear, univariate or multivariate. We illustrate our methods using data from the U.S. Census's Small Area Income and Poverty Estimates program.
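The benchmarking constraint described in this abstract, that a weighted mean of the small area estimates must agree with an estimate at a higher level of aggregation, can be shown with a minimal sketch. The method below is plain ratio adjustment, swapped in for illustration only; it is not the constrained Bayes estimator developed in the paper. The function name, the weights, and the numbers are hypothetical.

```python
def ratio_benchmark(estimates, weights, target):
    """Rescale area estimates so their weighted mean equals `target`.

    Assumes the weights sum to one and the unadjusted weighted mean is nonzero.
    """
    weighted_mean = sum(w * e for w, e in zip(weights, estimates))
    if weighted_mean == 0:
        raise ValueError("ratio benchmarking needs a nonzero weighted mean")
    factor = target / weighted_mean
    return [e * factor for e in estimates]


# Hypothetical example: three area-level estimates and one aggregate estimate.
area_estimates = [10.0, 20.0, 30.0]
area_weights = [0.5, 0.3, 0.2]   # e.g. population shares, summing to one
aggregate_estimate = 18.0        # the value the areas must aggregate to

benchmarked = ratio_benchmark(area_estimates, area_weights, aggregate_estimate)
check = sum(w * e for w, e in zip(area_weights, benchmarked))
print(benchmarked, check)
```

After the adjustment, the weighted mean of the area estimates matches the aggregate figure up to floating-point error, which is what the constraint requires. Ratio adjustment is only one way to satisfy it.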

preprint, 2014, arXiv

Variational Bayes for Merging Noisy Databases

Bayesian entity resolution merges together multiple, noisy databases and returns the minimal collection of unique individuals represented, together with their true, latent record values. Bayesian methods allow flexible generative models that share power across databases as well as principled quantification of uncertainty for queries of the final, resolved database. However, existing Bayesian methods for entity resolution use Markov chain Monte Carlo (MCMC) approximations and are too slow to run on modern databases containing millions or billions of records. Instead, we propose applying variational approximations to allow scalable Bayesian inference in these models. We derive a coordinate-ascent approximation for mean-field variational Bayes, qualitatively compare our algorithm to existing methods, note unique challenges for inference that arise from the expected distribution of cluster sizes in entity resolution, and discuss directions for future work in this domain.

preprint, 2013, arXiv

On estimation of mean squared errors of benchmarked empirical Bayes estimators

We consider benchmarked empirical Bayes (EB) estimators under the basic area-level model of Fay and Herriot while requiring the standard benchmarking constraint. In this paper we determine the excess mean squared error (MSE) from constraining the estimates through benchmarking. We show that the increase due to benchmarking is $O(m^{-1})$, where $m$ is the number of small areas. Furthermore, we find an asymptotically unbiased estimator of this MSE and compare it to the second-order approximation of the MSE of the EB estimator or, equivalently, of the MSE of the empirical best linear unbiased predictor (EBLUP), that was derived by Prasad and Rao (1990). Moreover, using methods similar to those of Butar and Lahiri (2003), we compute a parametric bootstrap estimator of the MSE of the benchmarked EB estimator under the Fay-Herriot model and compare it to the MSE of the benchmarked EB estimator found by a second-order approximation. Finally, we illustrate our methods using SAIPE data from the U.S. Census Bureau, and in a simulation study.

preprint, 2013, arXiv

Trouble With The Curve: Improving MLB Pitch Classification

The PITCHf/x database has allowed the statistical analysis of Major League Baseball (MLB) to flourish since its introduction in late 2006. Using PITCHf/x, pitches have been classified by hand, requiring considerable effort, or using neural network clustering and classification, which is often difficult to interpret. To address these issues, we use model-based clustering with a multivariate Gaussian mixture model and an appropriate adjustment factor as an alternative to current methods. Furthermore, we describe a new pitch classification algorithm based on our clustering approach to address the problems of pitch misclassification. We illustrate our methods for various pitchers from the PITCHf/x database that covers a wide variety of pitch types.

preprint, 2013, arXiv

Two-stage Benchmarking as Applied to Small Area Estimation

There has been recent growth in small area estimation due to the need for more precise estimation of small geographic areas, which has led to groups such as the U.S. Census Bureau, Google, and the RAND Corporation utilizing small area estimation procedures. We develop novel two-stage benchmarking methodology using a single weighted squared error loss function that combines the loss at the unit level and the area level without any specific distributional assumptions. We consider this loss while benchmarking the weighted means at each level or both the weighted means and weighted variability at the unit level. Multivariate extensions are immediate. We analyze the behavior of our methods using a complex study from the National Health Interview Survey (NHIS) from 2000, which estimates the proportion of people that do not have health insurance for many domains of an Asian subpopulation. Finally, the methodology is explored via simulated data under the proposed model. We ultimately conclude that three proposed benchmarked Bayes estimators do not dominate each other, leaving much exploration for future research.
