Source author record

Mauricio Sadinle

Mauricio Sadinle appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Applications Methodology Machine Learning Databases stat.OT

Catalog footprint

What is connected

7works

5topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2022arXiv

The Central Role of the Identifying Assumption in Population Size Estimation

The problem of estimating the size of a population based on a subset of individuals observed across multiple data sources is often referred to as capture-recapture or multiple-systems estimation. This is fundamentally a missing data problem, where the number of unobserved individuals represents the missing data. As with any missing data problem, multiple-systems estimation requires users to make an untestable identifying assumption in order to estimate the population size from the observed data. If an appropriate identifying assumption cannot be found for a data set, no estimate of the population size should be produced based on that data set, as models with different identifying assumptions can produce arbitrarily different population size estimates -- even with identical observed data fits. Approaches to multiple-systems estimation often do not explicitly specify identifying assumptions. This makes it difficult to decouple the specification of the model for the observed data from the identifying assumption and to provide justification for the identifying assumption. We present a re-framing of the multiple-systems estimation problem that leads to an approach which decouples the specification of the observed-data model from the identifying assumption, and discuss how common models fit into this framing. This approach takes advantage of existing software and facilitates various sensitivity analyses. We demonstrate our approach in a case study estimating the number of civilian casualties in the Kosovo war. Code used to produce this manuscript is available at https://github.com/aleshing/central-role-of-identifying-assumptions.

preprint2016arXiv

Bayesian Estimation of Bipartite Matchings for Record Linkage

The bipartite record linkage task consists of merging two disparate datafiles containing information on two overlapping sets of entities. This is non-trivial in the absence of unique identifiers and it is important for a wide variety of applications given that it needs to be solved whenever we have to combine information from different sources. Most statistical techniques currently used for record linkage are derived from a seminal paper by Fellegi and Sunter (1969). These techniques usually assume independence in the matching statuses of record pairs to derive estimation procedures and optimal point estimators. We argue that this independence assumption is unreasonable and instead target a bipartite matching between the two datafiles as our parameter of interest. Bayesian implementations allow us to quantify uncertainty on the matching decisions and derive a variety of point estimators using different loss functions. We propose partial Bayes estimates that allow uncertain parts of the bipartite matching to be left unresolved. We evaluate our approach to record linkage using a variety of challenging scenarios and show that it outperforms the traditional methodology. We illustrate the advantages of our methods merging two datafiles on casualties from the civil war of El Salvador.

preprint2016arXiv

Itemwise conditionally independent nonresponse modeling for incomplete multivariate data

We introduce a nonresponse mechanism for multivariate missing data in which each study variable and its nonresponse indicator are conditionally independent given the remaining variables and their nonresponse indicators. This is a nonignorable missingness mechanism, in that nonresponse for any item can depend on values of other items that are themselves missing. We show that, under this itemwise conditionally independent nonresponse assumption, one can define and identify nonparametric saturated classes of joint multivariate models for the study variables and their missingness indicators. We also show how to perform sensitivity analysis to violations of the conditional independence assumptions encoded by this missingness mechanism. Throughout, we illustrate the use of this modeling approach with data analyses.

preprint2015arXiv

Detecting duplicates in a homicide registry using a Bayesian partitioning approach

Finding duplicates in homicide registries is an important step in keeping an accurate account of lethal violence. This task is not trivial when unique identifiers of the individuals are not available, and it is especially challenging when records are subject to errors and missing values. Traditional approaches to duplicate detection output independent decisions on the coreference status of each pair of records, which often leads to nontransitive decisions that have to be reconciled in some ad-hoc fashion. The task of finding duplicate records in a data file can be alternatively posed as partitioning the data file into groups of coreferent records. We present an approach that targets this partition of the file as the parameter of interest, thereby ensuring transitive decisions. Our Bayesian implementation allows us to incorporate prior information on the reliability of the fields in the data file, which is especially useful when no training data are available, and it also provides a proper account of the uncertainty in the duplicate detection decisions. We present a study to detect killings that were reported multiple times to the United Nations Truth Commission for El Salvador.

preprint2014arXiv

A Comparison of Blocking Methods for Record Linkage

Record linkage seeks to merge databases and to remove duplicates when unique identifiers are not available. Most approaches use blocking techniques to reduce the computational complexity associated with record linkage. We review traditional blocking techniques, which typically partition the records according to a set of field attributes, and consider two variants of a method known as locality sensitive hashing, sometimes referred to as "private blocking." We compare these approaches in terms of their recall, reduction ratio, and computational complexity. We evaluate these methods using different synthetic datafiles and conclude with a discussion of privacy-related issues.

preprint2013arXiv

A Generalized Fellegi-Sunter Framework for Multiple Record Linkage With Application to Homicide Record Systems

We present a probabilistic method for linking multiple datafiles. This task is not trivial in the absence of unique identifiers for the individuals recorded. This is a common scenario when linking census data to coverage measurement surveys for census coverage evaluation, and in general when multiple record-systems need to be integrated for posterior analysis. Our method generalizes the Fellegi-Sunter theory for linking records from two datafiles and its modern implementations. The multiple record linkage goal is to classify the record K-tuples coming from K datafiles according to the different matching patterns. Our method incorporates the transitivity of agreement in the computation of the data used to model matching probabilities. We use a mixture model to fit matching probabilities via maximum likelihood using the EM algorithm. We present a method to decide the record K-tuples membership to the subsets of matching patterns and we prove its optimality. We apply our method to the integration of three Colombian homicide record systems and we perform a simulation study in order to explore the performance of the method under measurement error and different scenarios. The proposed method works well and opens some directions for future research.

preprint2013arXiv

The Strength of Arcs and Edges in Interaction Networks: Elements of a Model-Based Approach

When analyzing interaction networks, it is common to interpret the amount of interaction between two nodes as the strength of their relationship. We argue that this interpretation may not be appropriate, since the interaction between a pair of nodes could potentially be explained only by characteristics of the nodes that compose the pair and, however, not by pair-specific features. In interaction networks, where edges or arcs are count-valued, the above scenario corresponds to a model of independence for the expected interaction in the network, and consequently we propose the notions of arc strength, and edge strength to be understood as departures from this model of independence. We discuss how our notion of arc/edge strength can be used as a guidance to study network structure, and in particular we develop a latent arc strength stochastic blockmodel for directed interaction networks. We illustrate our approach studying the interaction between the Kolkata users of the myGamma mobile network.

Mauricio Sadinle

What is connected

Connect this record

See the researcher in context

Building this map preview

7 published item(s)

The Central Role of the Identifying Assumption in Population Size Estimation

Bayesian Estimation of Bipartite Matchings for Record Linkage

Itemwise conditionally independent nonresponse modeling for incomplete multivariate data

Detecting duplicates in a homicide registry using a Bayesian partitioning approach

A Comparison of Blocking Methods for Record Linkage

A Generalized Fellegi-Sunter Framework for Multiple Record Linkage With Application to Homicide Record Systems

The Strength of Arcs and Edges in Interaction Networks: Elements of a Model-Based Approach