Source author record

Michael Hay

Michael Hay appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Cryptography and Security Databases Digital Libraries Information Retrieval Machine Learning

Catalog footprint

What is connected

8works

5topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2022arXiv

Benchmarking Differentially Private Synthetic Data Generation Algorithms

This work presents a systematic benchmark of differentially private synthetic data generation algorithms that can generate tabular data. Utility of the synthetic data is evaluated by measuring whether the synthetic data preserve the distribution of individual and pairs of attributes, pairwise correlation as well as on the accuracy of an ML classification model. In a comprehensive empirical evaluation we identify the top performing algorithms and those that consistently fail to beat baseline approaches.

preprint2022arXiv

Precision-based attacks and interval refining: how to break, then fix, differential privacy on finite computers

Despite being raised as a problem over ten years ago, the imprecision of floating point arithmetic continues to cause privacy failures in the implementations of differentially private noise mechanisms. In this paper, we highlight a new class of vulnerabilities, which we call \emph{precision-based attacks}, and which affect several open source libraries. To address this vulnerability and implement differentially private mechanisms on floating-point space in a safe way, we propose a novel technique, called \emph{interval refining}. This technique has minimal error, provable privacy, and broad applicability. We use interval refining to design and implement a variant of the Laplace mechanism that is equivalent to sampling from the Laplace distribution and rounding to a float. We report on the performance of this approach, and discuss how interval refining can be used to implement other mechanisms safely, including the Gaussian mechanism and the exponential mechanism.

preprint2020arXiv

Fair Decision Making using Privacy-Protected Data

Data collected about individuals is regularly used to make decisions that impact those same individuals. We consider settings where sensitive personal data is used to decide who will receive resources or benefits. While it is well known that there is a tradeoff between protecting privacy and the accuracy of decisions, we initiate a first-of-its-kind study into the impact of formally private mechanisms (based on differential privacy) on fair and equitable decision-making. We empirically investigate novel tradeoffs on two real-world decisions made using U.S. Census data (allocation of federal funds and assignment of voting rights benefits) as well as a classic apportionment problem. Our results show that if decisions are made using an $ε$-differentially private version of the data, under strict privacy constraints (smaller $ε$), the noise added to achieve privacy may disproportionately impact some groups over others. We propose novel measures of fairness in the context of randomized differentially private algorithms and identify a range of causes of outcome disparities.

preprint2015arXiv

Principled Evaluation of Differentially Private Algorithms using DPBench

Differential privacy has become the dominant standard in the research community for strong privacy protection. There has been a flood of research into query answering algorithms that meet this standard. Algorithms are becoming increasingly complex, and in particular, the performance of many emerging algorithms is {\em data dependent}, meaning the distribution of the noise added to query answers may change depending on the input data. Theoretical analysis typically only considers the worst case, making empirical study of average case performance increasingly important. In this paper we propose a set of evaluation principles which we argue are essential for sound evaluation. Based on these principles we propose DPBench, a novel evaluation framework for standardized evaluation of privacy algorithms. We then apply our benchmark to evaluate algorithms for answering 1- and 2-dimensional range queries. The result is a thorough empirical study of 15 published algorithms on a total of 27 datasets that offers new insights into algorithm behavior---in particular the influence of dataset scale and shape---and a more complete characterization of the state of the art. Our methodology is able to resolve inconsistencies in prior empirical studies and place algorithm performance in context through comparison to simple baselines. Finally, we pose open research questions which we hope will guide future algorithm design.

preprint2014arXiv

A Data- and Workload-Aware Algorithm for Range Queries Under Differential Privacy

We describe a new algorithm for answering a given set of range queries under $ε$-differential privacy which often achieves substantially lower error than competing methods. Our algorithm satisfies differential privacy by adding noise that is adapted to the input data and to the given query set. We first privately learn a partitioning of the domain into buckets that suit the input data well. Then we privately estimate counts for each bucket, doing so in a manner well-suited for the given query set. Since the performance of the algorithm depends on the input database, we evaluate it on a wide range of real datasets, showing that we can achieve the benefits of data-dependence on both "easy" and "hard" databases.

preprint2012arXiv

An Integrated, Conditional Model of Information Extraction and Coreference with Applications to Citation Matching

Although information extraction and coreference resolution appear together in many applications, most current systems perform them as ndependent steps. This paper describes an approach to integrated inference for extraction and coreference based on conditionally-trained undirected graphical models. We discuss the advantages of conditional probability training, and of a coreference model structure based on graph partitioning. On a data set of research paper citations, we show significant reduction in error by using extraction uncertainty to improve coreference citation matching accuracy, and using coreference to improve the accuracy of the extracted fields.

preprint2010arXiv

Boosting the Accuracy of Differentially-Private Histograms Through Consistency

We show that it is possible to significantly improve the accuracy of a general class of histogram queries while satisfying differential privacy. Our approach carefully chooses a set of queries to evaluate, and then exploits consistency constraints that should hold over the noisy output. In a post-processing phase, we compute the consistent input most likely to have produced the noisy output. The final output is differentially-private and consistent, but in addition, it is often much more accurate. We show, both theoretically and experimentally, that these techniques can be used for estimating the degree sequence of a graph very precisely, and for computing a histogram that can support arbitrary range queries accurately.

preprint2010arXiv

Optimizing Histogram Queries under Differential Privacy

Differential privacy is a robust privacy standard that has been successfully applied to a range of data analysis tasks. Despite much recent work, optimal strategies for answering a collection of correlated queries are not known. We study the problem of devising a set of strategy queries, to be submitted and answered privately, that will support the answers to a given workload of queries. We propose a general framework in which query strategies are formed from linear combinations of counting queries, and we describe an optimal method for deriving new query answers from the answers to the strategy queries. Using this framework we characterize the error of strategies geometrically, and we propose solutions to the problem of finding optimal strategies.

Michael Hay

What is connected

Connect this record

See the researcher in context

Building this map preview

8 published item(s)

Benchmarking Differentially Private Synthetic Data Generation Algorithms

Precision-based attacks and interval refining: how to break, then fix, differential privacy on finite computers

Fair Decision Making using Privacy-Protected Data

Principled Evaluation of Differentially Private Algorithms using DPBench

A Data- and Workload-Aware Algorithm for Range Queries Under Differential Privacy

An Integrated, Conditional Model of Information Extraction and Coreference with Applications to Citation Matching

Boosting the Accuracy of Differentially-Private Histograms Through Consistency

Optimizing Histogram Queries under Differential Privacy