Source author record

Stephen E. Fienberg

Stephen E. Fienberg appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.


Applications Methodology Machine Learning math.ST Statistics Theory Cryptography and Security stat.OT Computation Databases Discrete Mathematics math.CO Social and Information Networks

Catalog footprint

What is connected

25 works

12 topics

4 close collaborators


preprint 2016 arXiv

A Minimax Theory for Adaptive Data Analysis

In adaptive data analysis, the user makes a sequence of queries on the data, where at each step the choice of query may depend on the results in previous steps. The releases are often randomized in order to reduce overfitting for such adaptively chosen queries. In this paper, we propose a minimax framework for adaptive data analysis. Assuming Gaussianity of queries, we establish the first sharp minimax lower bound on the squared error in the order of $O(\frac{\sqrt{k}σ^2}{n})$, where $k$ is the number of queries asked, and $σ^2/n$ is the ordinary signal-to-noise ratio for a single query. Our lower bound is based on the construction of an approximately least favorable adversary who picks a sequence of queries that are most likely to be affected by overfitting. This approximately least favorable adversary uses only one level of adaptivity, suggesting that the minimax risk for 1-step adaptivity with $k-1$ initial releases and that for $k$-step adaptivity are on the same order. The key technical component of the lower bound proof is a reduction to finding the convoluting distribution that optimally obfuscates the sign of a Gaussian signal. Our lower bound construction also reveals a transparent and elementary proof of the matching upper bound as an alternative approach to Russo and Zou (2015), who used information-theoretic tools to provide the same upper bound. We believe that the proposed framework opens up opportunities to obtain theoretical insights for many other settings of adaptive data analysis, which would extend the idea to more practical realms.

preprint 2016 arXiv

Dynamic Question Ordering in Online Surveys

Online surveys have the potential to support adaptive questions, where later questions depend on earlier responses. Past work has taken a rule-based approach, uniformly across all respondents. We envision a richer interpretation of adaptive questions, which we call dynamic question ordering (DQO), where question order is personalized. Such an approach could increase engagement, and therefore response rate, as well as imputation quality. We present a DQO framework to improve survey completion and imputation. In the general survey-taking setting, we want to maximize survey completion, and so we focus on ordering questions to engage the respondent and collect hopefully all information, or at least the information that most characterizes the respondent, for accurate imputations. In another scenario, our goal is to provide a personalized prediction. Since it is possible to give reasonable predictions with only a subset of questions, we are not concerned with motivating users to answer all questions. Instead, we want to order questions to get information that reduces prediction uncertainty, while not being too burdensome. We illustrate this framework with an example of providing energy estimates to prospective tenants. We also discuss DQO for national surveys and consider connections between our statistics-based question-ordering approach and cognitive survey methodology.

preprint 2016 arXiv

Learning with Differential Privacy: Stability, Learnability and the Sufficiency and Necessity of ERM Principle

While machine learning has proven to be a powerful data-driven solution to many real-life problems, its use in sensitive domains has been limited due to privacy concerns. A popular approach known as **differential privacy** offers provable privacy guarantees, but it is often observed in practice that it could substantially hamper learning accuracy. In this paper we study the learnability (whether a problem can be learned by any algorithm) under Vapnik's general learning setting with differential privacy constraint, and reveal some intricate relationships between privacy, stability and learnability. In particular, we show that a problem is privately learnable **if and only if** there is a private algorithm that asymptotically minimizes the empirical risk (AERM). In contrast, for non-private learning AERM alone is not sufficient for learnability. This result suggests that when searching for private learning algorithms, we can restrict the search to algorithms that are AERM. In light of this, we propose a conceptual procedure that always finds a universally consistent algorithm whenever the problem is learnable under privacy constraint. We also propose a generic and practical algorithm and show that under very general conditions it privately learns a wide class of learning problems. Lastly, we extend some of the results to the more practical $(ε,δ)$-differential privacy and establish the existence of a phase-transition on the class of problems that are approximately privately learnable with respect to how small $δ$ needs to be.

preprint 2016 arXiv

On-Average KL-Privacy and its equivalence to Generalization for Max-Entropy Mechanisms

We define On-Average KL-Privacy and present its properties and connections to differential privacy, generalization and information-theoretic quantities including max-information and mutual information. The new definition significantly weakens differential privacy, while preserving its minimalistic design features such as composition over small groups and multiple queries as well as closedness under post-processing. Moreover, we show that On-Average KL-Privacy is **equivalent** to generalization for a large class of commonly-used tools in statistics and machine learning that sample from Gibbs distributions, a class of distributions that arises naturally from the maximum entropy principle. In addition, a byproduct of our analysis yields a lower bound for generalization error in terms of mutual information which reveals an interesting interplay with known upper bounds that use the same quantity.

preprint 2015 arXiv

A Bayesian Approach to Graphical Record Linkage and De-duplication

We propose an unsupervised approach for linking records across arbitrarily many files, while simultaneously detecting duplicate records within files. Our key innovation involves the representation of the pattern of links between records as a bipartite graph, in which records are directly linked to latent true individuals, and only indirectly linked to other records. This flexible representation of the linkage structure naturally allows us to estimate the attributes of the unique observable people in the population, calculate transitive linkage probabilities across records (and represent this visually), and propagate the uncertainty of record linkage into later analyses. Our method makes it particularly easy to integrate record linkage with post-processing procedures such as logistic regression, capture-recapture, etc. Our linkage structure lends itself to an efficient, linear-time, hybrid Markov chain Monte Carlo algorithm, which overcomes many obstacles encountered by previous record linkage approaches, despite the high-dimensional parameter space. We illustrate our method using longitudinal data from the National Long Term Care Survey and with data from the Italian Survey on Household and Wealth, where we assess the accuracy of our method and show it to be better in terms of error rates and empirical scalability than other approaches in the literature.

preprint 2015 arXiv

Privacy for Free: Posterior Sampling and Stochastic Gradient Monte Carlo

We consider the problem of Bayesian learning on sensitive datasets and present two simple but somewhat surprising results that connect Bayesian learning to "differential privacy", a cryptographic approach to protect individual-level privacy while permitting database-level utility. Specifically, we show that under standard assumptions, getting one single sample from a posterior distribution is differentially private "for free". We will see that the estimator is statistically consistent, near optimal and computationally tractable whenever the Bayesian model of interest is consistent, optimal and tractable. Similarly but separately, we show that a recent line of works that use stochastic gradient for Hybrid Monte Carlo (HMC) sampling also preserve differential privacy with minor or no modifications of the algorithmic procedure at all. These observations lead to an "anytime" algorithm for Bayesian learning under privacy constraint. We demonstrate that it performs much better than the state-of-the-art differentially private methods on synthetic and real datasets.

preprint 2014 arXiv

$β$ models for random hypergraphs with a given degree sequence

We introduce the beta model for random hypergraphs in order to represent the occurrence of multi-way interactions among agents in a social network. This model builds upon and generalizes the well-studied beta model for random graphs, which instead only considers pairwise interactions. We provide two algorithms for fitting the model parameters, IPS (iterative proportional scaling) and a fixed point algorithm, prove that both algorithms converge if the maximum likelihood estimator (MLE) exists, and provide algorithmic and geometric ways of dealing with the issue of MLE existence.

preprint 2014 arXiv

A Comparison of Blocking Methods for Record Linkage

Record linkage seeks to merge databases and to remove duplicates when unique identifiers are not available. Most approaches use blocking techniques to reduce the computational complexity associated with record linkage. We review traditional blocking techniques, which typically partition the records according to a set of field attributes, and consider two variants of a method known as locality sensitive hashing, sometimes referred to as "private blocking." We compare these approaches in terms of their recall, reduction ratio, and computational complexity. We evaluate these methods using different synthetic datafiles and conclude with a discussion of privacy-related issues.
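The two evaluation criteria named above, recall and reduction ratio, can be illustrated with a small Python sketch. It assumes a traditional blocking scheme that groups records by an exact key; the toy records, the `true_matches` set and the function names are hypothetical and are not taken from the paper.

```python
from itertools import combinations

def block_pairs(records, key):
    """Return candidate pairs (i, j), i < j, whose records share a blocking key."""
    blocks = {}
    for idx, rec in enumerate(records):
        blocks.setdefault(key(rec), []).append(idx)
    pairs = set()
    for members in blocks.values():
        pairs.update(combinations(members, 2))
    return pairs

def evaluate_blocking(records, true_matches, key):
    """Recall: share of true matches kept. Reduction ratio: share of all pairs discarded."""
    candidates = block_pairs(records, key)
    n = len(records)
    total_pairs = n * (n - 1) // 2
    recall = len(candidates & true_matches) / len(true_matches)
    reduction_ratio = 1 - len(candidates) / total_pairs
    return recall, reduction_ratio

# Hypothetical toy data: records 0/1 and 2/3 refer to the same person.
records = [
    {"first": "ana", "last": "lopez", "zip": "15213"},
    {"first": "anna", "last": "lopez", "zip": "15213"},
    {"first": "bo", "last": "chen", "zip": "15217"},
    {"first": "bo", "last": "chen", "zip": "15213"},
    {"first": "cy", "last": "diaz", "zip": "15217"},
]
true_matches = {(0, 1), (2, 3)}

recall_zip, rr_zip = evaluate_blocking(records, true_matches, key=lambda r: r["zip"])
recall_last, rr_last = evaluate_blocking(records, true_matches, key=lambda r: r["last"])
print(recall_zip, rr_zip, recall_last, rr_last)
```

In this toy case, blocking on the ZIP code separates one true pair because of a changed address, while blocking on the last name keeps both pairs and discards more non-matching comparisons. A blocking key can therefore trade recall against the reduction ratio.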

preprint 2014 arXiv

Differentially-Private Logistic Regression for Detecting Multiple-SNP Association in GWAS Databases

Following the publication of an attack on genome-wide association studies (GWAS) data proposed by Homer et al., considerable attention has been given to developing methods for releasing GWAS data in a privacy-preserving way. Here, we develop an end-to-end differentially private method for solving regression problems with convex penalty functions and selecting the penalty parameters by cross-validation. In particular, we focus on penalized logistic regression with elastic-net regularization, a method widely used in GWAS analyses to identify disease-causing genes. We show how a differentially private procedure for penalized logistic regression with elastic-net regularization can be applied to the analysis of GWAS data and evaluate our method's performance.

preprint 2014 arXiv

Discussion of "Estimating the Distribution of Dietary Consumption Patterns"

Discussion of "Estimating the Distribution of Dietary Consumption Patterns" by Raymond J. Carroll [arXiv:1405.4667].

preprint 2014 arXiv

From Statistical Evidence to Evidence of Causality

While statisticians and quantitative social scientists typically study the "effects of causes" (EoC), Lawyers and the Courts are more concerned with understanding the "causes of effects" (CoE). EoC can be addressed using experimental design and statistical analysis, but it is less clear how to incorporate statistical or epidemiological evidence into CoE reasoning, as might be required for a case at Law. Some form of counterfactual reasoning, such as the "potential outcomes" approach championed by Rubin, appears unavoidable, but this typically yields "answers" that are sensitive to arbitrary and untestable assumptions. We must therefore recognise that a CoE question simply might not have a well-determined answer. It is nevertheless possible to use statistical data to set bounds within which any answer must lie. With less than perfect data these bounds will themselves be uncertain, leading to a compounding of different kinds of uncertainty. Still further care is required in the presence of possible confounding factors. In addition, even identifying the relevant "counterfactual contrast" may be a matter of Policy as much as of Science. Defining the question is as non-trivial a task as finding a route towards an answer. This paper develops some technical elaborations of these philosophical points, and illustrates them with an analysis of a case study in child protection. Keywords: benfluorex, causes of effects, counterfactual, child protection, effects of causes, Fréchet bound, potential outcome, probability of causation

preprint 2014 arXiv

Scalable Privacy-Preserving Data Sharing Methodology for Genome-Wide Association Studies

The protection of privacy of individual-level information in genome-wide association study (GWAS) databases has been a major concern of researchers following the publication of "an attack" on GWAS data by Homer et al. (2008). Traditional statistical methods for confidentiality and privacy protection of statistical databases do not scale well to deal with GWAS data, especially in terms of guarantees regarding protection from linkage to external information. The more recent concept of differential privacy, introduced by the cryptographic community, is an approach that provides a rigorous definition of privacy with meaningful privacy guarantees in the presence of arbitrary external information, although the guarantees may come at a serious price in terms of data utility. Building on such notions, Uhler et al. (2013) proposed new methods to release aggregate GWAS data without compromising an individual's privacy. We extend the methods developed in Uhler et al. (2013) for releasing differentially-private $χ^2$-statistics by allowing for arbitrary number of cases and controls, and for releasing differentially-private allelic test statistics. We also provide a new interpretation by assuming the controls' data are known, which is a realistic assumption because some GWAS use publicly available data as controls. We assess the performance of the proposed methods through a risk-utility analysis on a real data set consisting of DNA samples collected by the Wellcome Trust Case Control Consortium and compare the methods with the differentially-private release mechanism proposed by Johnson and Shmatikov (2013).

preprint 2014 arXiv

SMERED: A Bayesian Approach to Graphical Record Linkage and De-duplication

We propose a novel unsupervised approach for linking records across arbitrarily many files, while simultaneously detecting duplicate records within files. Our key innovation is to represent the pattern of links between records as a *bipartite* graph, in which records are directly linked to latent true individuals, and only indirectly linked to other records. This flexible new representation of the linkage structure naturally allows us to estimate the attributes of the unique observable people in the population, calculate $k$-way posterior probabilities of matches across records, and propagate the uncertainty of record linkage into later analyses. Our linkage structure lends itself to an efficient, linear-time, hybrid Markov chain Monte Carlo algorithm, which overcomes many obstacles encountered by previously proposed methods of record linkage, despite the high dimensional parameter space. We assess our results on real and simulated data.

preprint 2013 arXiv

A Generalized Fellegi-Sunter Framework for Multiple Record Linkage With Application to Homicide Record Systems

We present a probabilistic method for linking multiple datafiles. This task is not trivial in the absence of unique identifiers for the individuals recorded. This is a common scenario when linking census data to coverage measurement surveys for census coverage evaluation, and in general when multiple record-systems need to be integrated for posterior analysis. Our method generalizes the Fellegi-Sunter theory for linking records from two datafiles and its modern implementations. The multiple record linkage goal is to classify the record K-tuples coming from K datafiles according to the different matching patterns. Our method incorporates the transitivity of agreement in the computation of the data used to model matching probabilities. We use a mixture model to fit matching probabilities via maximum likelihood using the EM algorithm. We present a method for deciding the membership of record K-tuples in the subsets of matching patterns, and we prove its optimality. We apply our method to the integration of three Colombian homicide record systems and we perform a simulation study in order to explore the performance of the method under measurement error and different scenarios. The proposed method works well and opens some directions for future research.
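As background for the two-file theory that this abstract generalizes, the classical Fellegi-Sunter match weight for a record pair can be sketched as follows. This is an illustrative sketch of the two-file case only, not the multi-file method of the paper. The field names and the `m` and `u` probabilities (the chance that a field agrees among true matches and among non-matches, respectively) are hypothetical values chosen for the example, and fields are assumed conditionally independent.

```python
import math

def match_weight(agreements, m, u):
    """Sum of per-field log2 likelihood ratios for one record pair.

    agreements: dict mapping field name -> True if the two records agree on it
    m[field]:   P(field agrees | pair is a true match)      (assumed known here)
    u[field]:   P(field agrees | pair is not a match)       (assumed known here)
    """
    total = 0.0
    for field, agrees in agreements.items():
        if agrees:
            total += math.log2(m[field] / u[field])
        else:
            total += math.log2((1 - m[field]) / (1 - u[field]))
    return total

# Hypothetical parameters; in practice they are estimated, for example by EM.
m = {"last_name": 0.9, "birth_year": 0.8}
u = {"last_name": 0.1, "birth_year": 0.2}

w_agree = match_weight({"last_name": True, "birth_year": True}, m, u)
w_mixed = match_weight({"last_name": True, "birth_year": False}, m, u)
w_disagree = match_weight({"last_name": False, "birth_year": False}, m, u)
print(w_agree, w_mixed, w_disagree)
```

Pairs with a high weight are declared links, pairs with a low weight non-links, and the thresholds between them are a design choice that the sketch leaves out.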

preprint 2013 arXiv

Maximum likelihood estimation in the $β$-model

We study maximum likelihood estimation for the statistical model for undirected random graphs, known as the $β$-model, in which the degree sequences are minimal sufficient statistics. We derive necessary and sufficient conditions, based on the polytope of degree sequences, for the existence of the maximum likelihood estimator (MLE) of the model parameters. We characterize in a combinatorial fashion sample points leading to a nonexistent MLE, and nonestimability of the probability parameters under a nonexistent MLE. We formulate conditions that guarantee that the MLE exists with probability tending to one as the number of nodes increases.
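For orientation, the likelihood equations of the $β$-model match each observed degree to its expectation, $d_i = \sum_{j \neq i} e^{β_i+β_j}/(1+e^{β_i+β_j})$. The Python sketch below solves them with a simple fixed-point iteration of the kind used in the $β$-model literature. It is an illustration under stated assumptions, not necessarily the procedure analysed in the paper: the function names are hypothetical, and the iteration is only meaningful when the MLE exists (in particular, no degree may be 0 or $n-1$).

```python
import math

def fit_beta_model(degrees, iterations=500):
    """Fixed-point iteration for the beta-model likelihood equations.

    Update: beta_i <- log d_i - log sum_{j != i} 1 / (exp(-beta_j) + exp(beta_i)).
    Assumes every degree is strictly between 0 and n - 1.
    """
    n = len(degrees)
    beta = [0.0] * n
    for _ in range(iterations):
        new_beta = []
        for i in range(n):
            denom = sum(
                1.0 / (math.exp(-beta[j]) + math.exp(beta[i]))
                for j in range(n) if j != i
            )
            new_beta.append(math.log(degrees[i]) - math.log(denom))
        beta = new_beta
    return beta

def expected_degrees(beta):
    """Expected degree of each node under edge probabilities logistic(beta_i + beta_j)."""
    n = len(beta)
    return [
        sum(1.0 / (1.0 + math.exp(-(beta[i] + beta[j]))) for j in range(n) if j != i)
        for i in range(n)
    ]

# Degree sequence of a 4-cycle: every node has degree 2.
degrees = [2, 2, 2, 2]
beta_hat = fit_beta_model(degrees)
fitted = expected_degrees(beta_hat)
print(beta_hat, fitted)
```

In this symmetric example every edge probability must equal 2/3, so each fitted parameter equals $\tfrac{1}{2}\log 2$, which gives a quick check that the fitted expected degrees reproduce the observed ones.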

preprint 2012 arXiv

Maximum likelihood estimation in log-linear models

We study maximum likelihood estimation in log-linear models under conditional Poisson sampling schemes. We derive necessary and sufficient conditions for existence of the maximum likelihood estimator (MLE) of the model parameters and investigate estimability of the natural and mean-value parameters under a nonexistent MLE. Our conditions focus on the role of sampling zeros in the observed table. We situate our results within the framework of extended exponential families, and we exploit the geometric properties of log-linear models. We propose algorithms for extended maximum likelihood estimation that improve and correct the existing algorithms for log-linear model analysis.

preprint 2012 arXiv

Privacy-Preserving Data Sharing for Genome-Wide Association Studies

Traditional statistical methods for confidentiality protection of statistical databases do not scale well to deal with GWAS (genome-wide association studies) databases especially in terms of guarantees regarding protection from linkage to external information. The more recent concept of differential privacy, introduced by the cryptographic community, is an approach which provides a rigorous definition of privacy with meaningful privacy guarantees in the presence of arbitrary external information, although the guarantees come at a serious price in terms of data utility. Building on such notions, we propose new methods to release aggregate GWAS data without compromising an individual's privacy. We present methods for releasing differentially private minor allele frequencies, chi-square statistics and p-values. We compare these approaches on simulated data and on a GWAS study of canine hair length involving 685 dogs. We also propose a privacy-preserving method for finding genome-wide associations based on a differentially-private approach to penalized logistic regression.
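To make the idea of a differentially private release concrete, here is a minimal Python sketch of the generic Laplace mechanism applied to a single count. It is not the paper's method: the sensitivity analyses for allele frequencies, chi-square statistics and p-values are specific to the paper and are not reproduced here. The sketch assumes a counting query whose value changes by at most 1 when one individual is added or removed (sensitivity 1); the function names and the example count are hypothetical.

```python
import random

def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two iid exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_count(true_count, epsilon, rng, sensitivity=1.0):
    """Release a count with epsilon-differential privacy via the Laplace mechanism.

    The noise scale is sensitivity / epsilon, so a smaller epsilon means more noise.
    """
    return true_count + laplace_noise(sensitivity / epsilon, rng)

rng = random.Random(0)   # fixed seed so the illustration is repeatable
true_count = 42          # hypothetical number of cases carrying a given allele

# Repeated releases are shown only to illustrate the noise distribution;
# each additional release of the same count spends additional privacy budget.
releases = [private_count(true_count, epsilon=1.0, rng=rng) for _ in range(20000)]
mean_release = sum(releases) / len(releases)
variance = sum((r - mean_release) ** 2 for r in releases) / len(releases)
print(mean_release, variance)
```

The released values are unbiased for the true count, and their variance is $2(\Delta/ε)^2$ for sensitivity $\Delta$. This is the utility price that the abstract refers to.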

preprint 2011 arXiv

Bayesian Models and Methods in Public Policy and Government Settings

Starting with the neo-Bayesian revival of the 1950s, many statisticians argued that it was inappropriate to use Bayesian methods, and in particular subjective Bayesian methods in governmental and public policy settings because of their reliance upon prior distributions. But the Bayesian framework often provides the primary way to respond to questions raised in these settings and the numbers and diversity of Bayesian applications have grown dramatically in recent years. Through a series of examples, both historical and recent, we argue that Bayesian approaches with formal and informal assessments of priors AND likelihood functions are well accepted and should become the norm in public settings. Our examples include census-taking and small area estimation, US election night forecasting, studies reported to the US Food and Drug Administration, assessing global climate change, and measuring potential declines in disability among the elderly.

preprint 2011 arXiv

Discussion of "Network routing in a dynamic environment"

Discussion of "Network routing in a dynamic environment" by N.D. Singpurwalla [arXiv:1107.4852]

preprint 2011 arXiv

Rejoinder

Rejoinder of "Bayesian Models and Methods in Public Policy and Government Settings" by S. E. Fienberg [arXiv:1108.2177]

preprint 2010 arXiv

Algebraic statistics for a directed random graph model with reciprocation

The $p_1$ model is a directed random graph model used to describe dyadic interactions in a social network in terms of effects due to differential attraction (popularity) and expansiveness, as well as an additional effect due to reciprocation. In this article we carry out an algebraic statistics analysis of this model. We show that the $p_1$ model is a toric model specified by a multi-homogeneous ideal. We conduct an extensive study of the Markov bases for $p_1$ models that incorporate explicitly the constraint arising from multi-homogeneity. Our results are directly relevant to the estimation and conditional goodness-of-fit testing problems in $p_1$ models.

preprint 2010 arXiv

Exploring the Consequences of IED Deployment with a Generalized Linear Model Implementation of the Canadian Traveller Problem

The deployment of improvised explosive devices (IEDs) along major roadways has been a favoured strategy of insurgents in recent war zones, both for the ability to cause damage to targets along roadways at minimal cost and as a means of controlling the flow of traffic and causing additional expense to opposing forces. Among other related approaches (which we discuss), the adversarial problem has an analogue in the Canadian Traveller Problem, wherein a stretch of road is blocked with some independent probability, and the state of the road is only discovered once the traveller reaches one of the intersections that bound this stretch of road. We discuss the implementation of ideas from social network analysis, namely the notion of "betweenness centrality", and how this can be adapted to the notion of deployment of IEDs with the aid of Generalized Linear Models (GLMs): namely, how we can model the probability of an IED deployment in terms of the increased effort due to Canadian betweenness, how we can include expert judgement on the probability of a deployment, and how we can extend the approach to estimation and updating over several time steps.

preprint 2010 arXiv

Introduction to papers on the modeling and analysis of network data

preprint 2010 arXiv

Introduction to papers on the modeling and analysis of network data---II

preprint 2010 arXiv

User Interest and Interaction Structure in Online Forums

We present a new similarity measure tailored to posts in an online forum. Our measure takes into account all the available information about user interest and interaction --- the content of posts, the threads in the forum, and the author of the posts. We use this post similarity to build a similarity between users, based on principal coordinate analysis. This allows easy visualization of the user activity as well. Similarity between users has numerous applications, such as clustering or classification. We show that including the author of a post in the post similarity has a smoothing effect on principal coordinate projections. We demonstrate our method on real data drawn from an internal corporate forum, and compare our results to those given by a standard document classification method. We conclude our method gives a more detailed picture of both the local and global network structure.
