Source author record

Ninghui Li

Ninghui Li appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher, unclaimed source record

Cryptography and Security, Databases, Machine Learning, Data Structures and Algorithms, Artificial Intelligence, Human-Computer Interaction, Information Theory, math.IT

Catalog footprint

What is connected

17 works

8 topics

4 close collaborators


preprint 2026 arXiv

Imitative Membership Inference Attack

A Membership Inference Attack (MIA) assesses how much a target machine learning model reveals about its training data by determining whether specific query instances were part of the training set. State-of-the-art MIAs rely on training hundreds of shadow models that are independent of the target model, leading to significant computational overhead. In this paper, we introduce Imitative Membership Inference Attack (IMIA), which employs a novel imitative training technique to strategically construct a small number of target-informed imitative models that closely replicate the target model's behavior for inference. Extensive experimental results demonstrate that IMIA substantially outperforms existing MIAs in various attack settings while requiring less than 5% of the computational cost of state-of-the-art approaches.

preprint 2022 arXiv

Are Your Sensitive Attributes Private? Novel Model Inversion Attribute Inference Attacks on Classification Models

Increasing use of machine learning (ML) technologies in privacy-sensitive domains such as medical diagnoses, lifestyle predictions, and business decisions highlights the need to better understand if these ML technologies are introducing leakage of sensitive and proprietary training data. In this paper, we focus on model inversion attacks where the adversary knows non-sensitive attributes about records in the training data and aims to infer the value of a sensitive attribute unknown to the adversary, using only black-box access to the target classification model. We first devise a novel confidence score-based model inversion attribute inference attack that significantly outperforms the state-of-the-art. We then introduce a label-only model inversion attack that relies only on the model's predicted labels but still matches our confidence score-based attack in terms of attack effectiveness. We also extend our attacks to the scenario where some of the other (non-sensitive) attributes of a target record are unknown to the adversary. We evaluate our attacks on two types of machine learning models, decision trees and deep neural networks, trained on three real datasets. Moreover, we empirically demonstrate the disparate vulnerability to model inversion attacks, i.e., specific groups in the training dataset (grouped by gender, race, etc.) could be more vulnerable to model inversion attacks.

preprint 2021 arXiv

Membership Inference Attacks and Defenses in Classification Models

We study the membership inference (MI) attack against classifiers, where the attacker's goal is to determine whether a data instance was used for training the classifier. Through systematic cataloging of existing MI attacks and extensive experimental evaluations of them, we find that a model's vulnerability to MI attacks is tightly related to the generalization gap, i.e., the difference between training accuracy and test accuracy. We then propose a defense against MI attacks that aims to close the gap by intentionally reducing the training accuracy. More specifically, the training process attempts to match the training and validation accuracies, by means of a new set regularizer using the Maximum Mean Discrepancy between the softmax output empirical distributions of the training and validation sets. Our experimental results show that combining this approach with another simple defense (mix-up training) significantly improves state-of-the-art defense against MI attacks, with minimal impact on testing accuracy.
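
As a rough illustration of the two quantities this abstract relies on, the sketch below computes a generalization gap and a squared Maximum Mean Discrepancy (MMD) between two batches of softmax outputs. The Gaussian kernel, the `bandwidth` and `weight` parameters, and all function names are assumptions made for this example; the abstract does not specify the kernel or how the regularizer is weighted.

```python
import math

def generalization_gap(train_accuracy, test_accuracy):
    """Difference between training accuracy and test accuracy."""
    return train_accuracy - test_accuracy

def gaussian_kernel(u, v, bandwidth=1.0):
    """Gaussian kernel between two vectors (kernel choice is an assumption)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-squared_distance / (2.0 * bandwidth ** 2))

def mean_kernel(xs, ys, bandwidth=1.0):
    """Average kernel value over all pairs drawn from xs and ys."""
    total = sum(gaussian_kernel(x, y, bandwidth) for x in xs for y in ys)
    return total / (len(xs) * len(ys))

def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of the squared MMD between two samples of vectors."""
    return (mean_kernel(xs, xs, bandwidth)
            + mean_kernel(ys, ys, bandwidth)
            - 2.0 * mean_kernel(xs, ys, bandwidth))

def regularized_loss(task_loss, train_softmax, val_softmax, weight=1.0):
    """Task loss plus an MMD penalty on the two softmax output samples.

    `weight` is a hypothetical hyperparameter, not taken from the paper.
    """
    return task_loss + weight * mmd_squared(train_softmax, val_softmax)
```

The penalty is zero when the two batches of softmax outputs coincide and grows as their empirical distributions drift apart, which is the behavior a regularizer that matches training and validation outputs would need.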

preprint 2020 arXiv

Answering Multi-Dimensional Range Queries under Local Differential Privacy

In this paper, we tackle the problem of answering multi-dimensional range queries under local differential privacy. There are three key technical challenges: capturing the correlations among attributes, avoiding the curse of dimensionality, and dealing with the large domains of attributes. None of the existing approaches satisfactorily deals with all three challenges. Overcoming these three challenges, we first propose an approach called Two-Dimensional Grids (TDG). Its main idea is to carefully use binning to partition the two-dimensional (2-D) domains of all attribute pairs into 2-D grids that can answer all 2-D range queries and then estimate the answer of a higher dimensional range query from the answers of the associated 2-D range queries. However, in order to reduce errors due to noise, coarse granularities are needed for each attribute in 2-D grids, losing fine-grained distribution information for individual attributes. To correct this deficiency, we further propose Hybrid-Dimensional Grids (HDG), which also introduces 1-D grids to capture finer-grained information on the distribution of each individual attribute and combines information from 1-D and 2-D grids to answer range queries. To make HDG consistently effective, we provide a guideline for properly choosing granularities of grids based on an analysis of how different sources of errors are impacted by these choices. Extensive experiments conducted on real and synthetic datasets show that HDG can give a significant improvement over the existing approaches.

preprint 2020 arXiv

Improving Frequency Estimation under Local Differential Privacy

Local Differential Privacy protocols are stochastic protocols used in data aggregation when individual users do not trust the data aggregator with their private data. In such protocols there is a fundamental tradeoff between user privacy and aggregator utility. In the setting of frequency estimation, established bounds on this tradeoff are either nonquantitative, or far from what is known to be attainable. In this paper, we use information-theoretical methods to significantly improve established bounds. We also show that the new bounds are attainable for binary inputs. Furthermore, our methods lead to improved frequency estimators, which we experimentally show to outperform state-of-the-art methods.

preprint 2020 arXiv

Improving Utility and Security of the Shuffler-based Differential Privacy

When collecting information, local differential privacy (LDP) alleviates privacy concerns of users because their private information is randomized before being sent to the central aggregator. LDP imposes a large amount of noise as each user executes the randomization independently. To address this issue, recent work introduced an intermediate server with the assumption that this intermediate server does not collude with the aggregator. Under this assumption, less noise can be added to achieve the same privacy guarantee as LDP, thus improving utility for the data collection task. This paper investigates this multiple-party setting of LDP. We analyze the system model and identify potential adversaries. We then make two improvements: a new algorithm that achieves a better privacy-utility tradeoff; and a novel protocol that provides better protection against various attacks. Finally, we perform experiments to compare different methods and demonstrate the benefits of using our proposed method.

preprint 2020 arXiv

Locally Differentially Private Frequency Estimation with Consistency

Local Differential Privacy (LDP) protects user privacy from the data collector. LDP protocols have been increasingly deployed in industry. A basic building block is the frequency oracle (FO) protocol, which estimates frequencies of values. While several FO protocols have been proposed, the design goal does not lead to optimal results for answering many queries. In this paper, we show that adding post-processing steps to FO protocols by exploiting the knowledge that all individual frequencies should be non-negative and they sum up to one can lead to significantly better accuracy for a wide range of tasks, including frequencies of individual values, frequencies of the most frequent values, and frequencies of subsets of values. We consider 10 different methods that exploit this knowledge differently. We establish theoretical relationships between some of them and conduct extensive experimental evaluations to understand which methods should be used for different query tasks.
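
To make the two constraints concrete, here is a minimal sketch of one way to post-process noisy frequency estimates so that they are non-negative and sum to one: shift all still-positive estimates by a common constant and clamp the rest to zero, repeating until nothing is negative. The function name `shift_and_clamp` is hypothetical, and this is an illustration of the general idea only; the abstract does not say which of its 10 methods, if any, works this way.

```python
def shift_and_clamp(estimates, total=1.0):
    """Post-process noisy frequencies to be non-negative and sum to `total`.

    Repeatedly adds one common shift to the active entries so they sum to
    `total`, then drops (zeroes) any entry the shift would make negative.
    """
    values = list(estimates)
    active = [True] * len(values)
    while True:
        active_count = sum(active)
        active_sum = sum(v for v, a in zip(values, active) if a)
        shift = (total - active_sum) / active_count
        shifted = [v + shift if a else 0.0 for v, a in zip(values, active)]
        if all(s >= 0.0 for s in shifted):
            return shifted
        # Entries that would go negative are fixed at zero from now on.
        active = [a and (v + shift >= 0.0) for v, a in zip(values, active)]

# Hypothetical noisy output of a frequency oracle over four values.
noisy = [0.6, 0.5, -0.1, 0.2]
consistent = shift_and_clamp(noisy)
```

Because the noise is applied before this step and the step never looks at the raw data, it is pure post-processing and does not affect the privacy guarantee.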

preprint 2020 arXiv

PolyScope: Multi-Policy Access Control Analysis to Triage Android Systems

Android filesystem access control provides a foundation for Android system integrity. Android utilizes a combination of mandatory (e.g., SEAndroid) and discretionary (e.g., UNIX permissions) access control, both to protect the Android platform from Android/OEM services and to protect Android/OEM services from third-party apps. However, OEMs often create vulnerabilities when they introduce market-differentiating features because they err when re-configuring this complex combination of Android policies. In this paper, we propose the PolyScope tool to triage the combination of Android filesystem access control policies to vet releases for vulnerabilities. The PolyScope approach leverages two main insights: (1) adversaries may exploit the coarse granularity of mandatory policies and the flexibility of discretionary policies to increase the permissions available to launch attacks, which we call permission expansion, and (2) system configurations may limit the ways adversaries may use their permissions to launch attacks, motivating computation of attack operations. We apply PolyScope to three Google and five OEM Android releases to compute the attack operations accurately to vet these releases for vulnerabilities, finding that permission expansion increases the permissions available to launch attacks, sometimes by more than 10X, but a significant fraction of these permissions (about 15-20%) are not convertible into attack operations. Using PolyScope, we find two previously unknown vulnerabilities, showing how PolyScope helps OEMs triage the complex combination of access control policies down to attack operations worthy of testing.

preprint 2020 arXiv

PrivSyn: Differentially Private Data Synthesis

In differential privacy (DP), a challenging problem is to generate synthetic datasets that efficiently capture the useful information in the private data. The synthetic dataset enables any task to be done without privacy concerns or modifications to existing algorithms. In this paper, we present PrivSyn, the first automatic synthetic data generation method that can handle general tabular datasets (with 100 attributes and domain size $>2^{500}$). PrivSyn is composed of a new method to automatically and privately identify correlations in the data, and a novel method to generate sample data from a dense graphic model. We extensively evaluate different methods on multiple datasets to demonstrate the performance of our method.

preprint 2020 arXiv

Random Spiking and Systematic Evaluation of Defenses Against Adversarial Examples

Image classifiers often suffer from adversarial examples, which are generated by strategically adding a small amount of noise to input images to trick classifiers into misclassification. Over the years, many defense mechanisms have been proposed, and different researchers have made seemingly contradictory claims on their effectiveness. We present an analysis of possible adversarial models, and propose an evaluation framework for comparing different defense mechanisms. As part of the framework, we introduce a more powerful and realistic adversary strategy. Furthermore, we propose a new defense mechanism called Random Spiking (RS), which generalizes dropout and introduces random noise in the training process in a controlled manner. Evaluations under our proposed framework suggest RS delivers better protection against adversarial examples than many existing schemes.

preprint 2020 arXiv

Towards Effective Differential Privacy Communication for Users' Data Sharing Decision and Comprehension

Differential privacy protects an individual's privacy by perturbing data on an aggregated level (DP) or individual level (LDP). We report four online human-subject experiments investigating the effects of using different approaches to communicate differential privacy techniques to laypersons in a health app data collection setting. Experiments 1 and 2 investigated participants' data disclosure decisions for low-sensitive and high-sensitive personal information when given different DP or LDP descriptions. Experiments 3 and 4 uncovered reasons behind participants' data sharing decisions, and examined participants' subjective and objective comprehension of these DP or LDP descriptions. When shown descriptions that explain the implications instead of the definition/processes of the DP or LDP technique, participants demonstrated better comprehension and showed more willingness to share information with LDP than with DP, indicating their understanding of LDP's stronger privacy guarantee compared with DP.

preprint 2016 arXiv

Understanding the Sparse Vector Technique for Differential Privacy

The Sparse Vector Technique (SVT) is a fundamental technique for satisfying differential privacy and has the unique quality that one can output some query answers without apparently paying any privacy cost. SVT has been used in both the interactive setting, where one tries to answer a sequence of queries that are not known ahead of time, and in the non-interactive setting, where all queries are known. Because of the potential savings on privacy budget, many variants of SVT have been proposed and employed in privacy-preserving data mining and publishing. However, most variants of SVT are actually not private. In this paper, we analyze these errors and identify the misunderstandings that likely contribute to them. We also propose a new version of SVT that provides better utility, and introduce an effective technique to improve the performance of SVT. These enhancements can be applied to improve utility in the interactive setting. Through both analytical and experimental comparisons, we show that, in the non-interactive setting (but not the interactive setting), SVT is unnecessary, as it can be replaced by the Exponential Mechanism (EM) with better accuracy.
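
For readers unfamiliar with SVT, the following is a minimal sketch of its commonly presented textbook form: perturb the threshold once, perturb each query answer, report only whether the noisy answer clears the noisy threshold, and stop after a fixed number of positive outcomes. The even split of the budget between threshold and queries, the parameter names, and the helper `laplace` are assumptions of this example. It is not the improved version the abstract proposes.

```python
import random

def laplace(scale, rng):
    """Laplace(0, scale) sample, as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def sparse_vector(query_answers, threshold, epsilon, max_positives,
                  sensitivity=1.0, rng=None):
    """Report, per query, whether its noisy answer clears a noisy threshold.

    Stops after `max_positives` positive outcomes. The budget split
    (half for the threshold, half for the queries) is an assumption.
    """
    if rng is None:
        rng = random.Random()
    eps_threshold = epsilon / 2.0
    eps_queries = epsilon - eps_threshold
    # The threshold is perturbed once and reused for every comparison.
    noisy_threshold = threshold + laplace(sensitivity / eps_threshold, rng)
    outcomes = []
    positives = 0
    for answer in query_answers:
        query_scale = 2.0 * max_positives * sensitivity / eps_queries
        if answer + laplace(query_scale, rng) >= noisy_threshold:
            outcomes.append(True)
            positives += 1
            if positives >= max_positives:
                break  # budget for positive outcomes is exhausted
        else:
            outcomes.append(False)
    return outcomes
```

Negative outcomes do not count against `max_positives`, which is the sense in which some answers appear to come without privacy cost. Variants that go wrong typically alter the noise scales or release more than the single comparison bit per query.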

preprint 2015 arXiv

Differentially Private $k$-Means Clustering

There are two broad approaches for differentially private data analysis. The interactive approach aims at developing customized differentially private algorithms for various data mining tasks. The non-interactive approach aims at developing differentially private algorithms that can output a synopsis of the input dataset, which can then be used to support various data mining tasks. In this paper we study the tradeoff of interactive vs. non-interactive approaches and propose a hybrid approach that combines interactive and non-interactive, using $k$-means clustering as an example. In the hybrid approach to differentially private $k$-means clustering, one first uses a non-interactive mechanism to publish a synopsis of the input dataset, then applies the standard $k$-means clustering algorithm to learn $k$ cluster centroids, and finally uses an interactive approach to further improve these cluster centroids. We analyze the error behavior of both non-interactive and interactive approaches and use such analysis to decide how to allocate privacy budget between the non-interactive step and the interactive step. Results from extensive experiments support our analysis and demonstrate the effectiveness of our approach.

preprint 2015 arXiv

Differentially Private Projected Histograms of Multi-Attribute Data for Classification

In this paper, we tackle the problem of constructing a differentially private synopsis for classification analysis. Several state-of-the-art methods follow the structure of existing classification algorithms and are all iterative, which is suboptimal due to the locally optimal choices and the over-divided privacy budget among many sequentially composed steps. Instead, we propose PrivPfC, a new differentially private method for releasing data for classification. The key idea is to privately select an optimal partition of the underlying dataset using the given privacy budget in one step. Given one dataset and the privacy budget, PrivPfC constructs a pool of candidate grids where the number of cells of each grid is under a data-aware and privacy-budget-aware threshold. After that, PrivPfC selects an optimal grid via the exponential mechanism by using a novel quality function which minimizes the expected number of misclassified records on which a histogram classifier is constructed using the published grid. Finally, PrivPfC injects noise into each cell of the selected grid and releases the noisy grid as the private synopsis of the data. If the size of the candidate grid pool is larger than the processing capability threshold set by the data curator, we add a step at the beginning of PrivPfC to prune the set of attributes privately. We introduce a modified $χ^2$ quality function with low sensitivity and use it to evaluate an attribute's relevance to the classification label variable. Through extensive experiments on real datasets, we demonstrate PrivPfC's superiority over the state-of-the-art methods.
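
The grid selection step relies on the exponential mechanism, which can be sketched generically as below: each candidate is chosen with probability proportional to exp(epsilon * score / (2 * sensitivity)). The scores here are placeholders. PrivPfC's actual quality function and its sensitivity are not given in the abstract, so `scores` and `sensitivity` are hypothetical inputs.

```python
import math
import random

def exponential_mechanism(scores, epsilon, sensitivity=1.0, rng=None):
    """Return the index of a candidate sampled by the exponential mechanism.

    `scores` holds one quality score per candidate (higher is better).
    """
    if rng is None:
        rng = random.Random()
    # Subtracting the maximum keeps exp() from overflowing and does not
    # change the sampling distribution.
    best = max(scores)
    weights = [math.exp(epsilon * (s - best) / (2.0 * sensitivity))
               for s in scores]
    return rng.choices(range(len(scores)), weights=weights, k=1)[0]

# Hypothetical use: a score could be the negated expected number of
# misclassified records for each candidate grid.
candidate_scores = [-12.0, -3.0, -7.5]
chosen = exponential_mechanism(candidate_scores, epsilon=0.5)
```

Spending the selection budget once, on a single draw over the whole candidate pool, is what lets the method avoid dividing the budget across many sequential steps.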

preprint 2012 arXiv

Differentially Private Grids for Geospatial Data

In this paper, we tackle the problem of constructing a differentially private synopsis for two-dimensional datasets such as geospatial datasets. The current state-of-the-art methods work by performing recursive binary partitioning of the data domains, and constructing a hierarchy of partitions. We show that the key challenge in partition-based synopsis methods lies in choosing the right partition granularity to balance the noise error and the non-uniformity error. We study the uniform-grid approach, which applies an equi-width grid of a certain size over the data domain and then issues independent count queries on the grid cells. This method has received no attention in the literature, probably due to the fact that no good method for choosing a grid size was known. Based on an analysis of the two kinds of errors, we propose a method for choosing the grid size. Experimental results validate our method, and show that this approach performs as well as, and oftentimes better than, the state-of-the-art methods. We further introduce a novel adaptive-grid method. The adaptive grid method lays a coarse-grained grid over the dataset, and then further partitions each cell according to its noisy count. Both levels of partitions are then used in answering queries over the dataset. This method exploits the need to have finer granularity partitioning over dense regions and, at the same time, coarse partitioning over sparse regions. Through extensive experiments on real-world datasets, we show that this approach consistently and significantly outperforms the uniform-grid method and other state-of-the-art methods.
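
The uniform-grid idea can be sketched as follows: bucket the points into an equi-width grid, add Laplace noise to each cell count, and answer a rectangular range query by assuming points are spread uniformly inside each cell. The grid size is left as a plain parameter (`cells_per_side`) because the abstract does not state the selection rule. All names are hypothetical, and the points are assumed to lie within `bounds`.

```python
import random

def laplace(scale, rng):
    """Laplace(0, scale) sample, as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def uniform_grid(points, bounds, cells_per_side, epsilon, rng=None):
    """Noisy equi-width grid of counts over a 2-D domain."""
    if rng is None:
        rng = random.Random()
    (x_min, x_max), (y_min, y_max) = bounds
    m = cells_per_side
    counts = [[0.0] * m for _ in range(m)]
    for x, y in points:
        col = min(int((x - x_min) / (x_max - x_min) * m), m - 1)
        row = min(int((y - y_min) / (y_max - y_min) * m), m - 1)
        counts[row][col] += 1.0
    # Each point lands in exactly one cell, so adding or removing a point
    # changes one count by 1; Laplace noise of scale 1/epsilon suffices.
    return [[c + laplace(1.0 / epsilon, rng) for c in row] for row in counts]

def range_count(grid, bounds, query):
    """Estimate the count in a rectangle, assuming uniformity within cells."""
    (x_min, x_max), (y_min, y_max) = bounds
    (qx0, qx1), (qy0, qy1) = query
    m = len(grid)
    width, height = (x_max - x_min) / m, (y_max - y_min) / m
    total = 0.0
    for row in range(m):
        for col in range(m):
            cx0, cy0 = x_min + col * width, y_min + row * height
            overlap_x = max(0.0, min(qx1, cx0 + width) - max(qx0, cx0))
            overlap_y = max(0.0, min(qy1, cy0 + height) - max(qy0, cy0))
            total += grid[row][col] * (overlap_x * overlap_y) / (width * height)
    return total
```

The two error sources named in the abstract are both visible here: more cells means more independent noise terms summed into each answer, while fewer cells makes the within-cell uniformity assumption in `range_count` less accurate.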

preprint 2012 arXiv

PrivBasis: Frequent Itemset Mining with Differential Privacy

The discovery of frequent itemsets can serve valuable economic and research purposes. Releasing discovered frequent itemsets, however, presents privacy challenges. In this paper, we study the problem of how to perform frequent itemset mining on transaction databases while satisfying differential privacy. We propose an approach, called PrivBasis, which leverages a novel notion called basis sets. A theta-basis set has the property that any itemset with frequency higher than theta is a subset of some basis. We introduce algorithms for privately constructing a basis set and then using it to find the most frequent itemsets. Experiments show that our approach greatly outperforms the current state of the art.
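
The defining property of a theta-basis set can be checked directly on a small example. The brute-force sketch below enumerates every itemset, computes its frequency, and verifies that each one above theta is contained in some basis. It is non-private and exponential in the number of items, so it only illustrates the definition and is not the private construction algorithm; the function names are hypothetical.

```python
from itertools import combinations

def frequency(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def is_theta_basis_set(basis_set, transactions, theta):
    """True if every itemset with frequency > theta lies inside some basis."""
    items = sorted(set().union(*transactions))
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            itemset = frozenset(combo)
            if frequency(itemset, transactions) > theta:
                if not any(itemset <= basis for basis in basis_set):
                    return False
    return True

transactions = [{"a", "b"}, {"a", "b", "c"}, {"a", "c"}, {"b"}]
# With theta = 0.4 the frequent itemsets are a, b, c, ab and ac.
covering = [frozenset("ab"), frozenset("ac")]
too_small = [frozenset("ab")]
```

Here `covering` satisfies the property while `too_small` does not, because the frequent item `c` is not contained in any of its bases.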

preprint 2011 arXiv

On Sampling, Anonymization, and Differential Privacy: Or, k-Anonymization Meets Differential Privacy

This paper aims at answering the following two questions in privacy-preserving data analysis and publishing: What formal privacy guarantee (if any) does $k$-anonymization provide? How can one benefit from the adversary's uncertainty about the data? We have found that random sampling provides a connection that helps answer these two questions, as sampling can create uncertainty. The main result of the paper is that $k$-anonymization, when done "safely", and when preceded with a random sampling step, satisfies $(ε,δ)$-differential privacy with reasonable parameters. This result illustrates that "hiding in a crowd of $k$" indeed offers some privacy guarantees. This result also suggests an alternative approach to output perturbation for satisfying differential privacy: namely, adding a random sampling step at the beginning and pruning results that are too sensitive to change of a single tuple. Regarding the second question, we provide both positive and negative results. On the positive side, we show that adding a random-sampling pre-processing step to a differentially-private algorithm can greatly amplify the level of privacy protection. Hence, when given a dataset resulting from sampling, one can utilize a much larger privacy budget. On the negative side, any privacy notion that takes advantage of the adversary's uncertainty likely does not compose. We discuss what these results imply in practice.
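
The amplification effect can be made numeric with the widely used amplification-by-sampling bound: if each record is kept independently with probability beta and an epsilon-differentially-private algorithm is run on the sample, the overall guarantee is ln(1 + beta * (e^epsilon - 1)). The abstract does not state a formula, so treating this bound as the relevant one is an assumption of this sketch, and the function names are hypothetical.

```python
import math

def amplified_epsilon(epsilon, sampling_rate):
    """Overall epsilon after running an epsilon-DP algorithm on a sample."""
    return math.log1p(sampling_rate * math.expm1(epsilon))

def budget_for_target(target_epsilon, sampling_rate):
    """Budget the inner algorithm may spend to meet `target_epsilon` overall."""
    return math.log1p(math.expm1(target_epsilon) / sampling_rate)

# Hypothetical numbers: a 10% sample and an overall target of epsilon = 0.1.
inner_budget = budget_for_target(0.1, 0.1)
overall = amplified_epsilon(inner_budget, 0.1)
```

The second function is the inverse of the first, which corresponds to the abstract's remark that a sampled dataset lets one use a larger budget in the algorithm itself while meeting the same overall target.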
