Source author record

Kamalika Chaudhuri

Kamalika Chaudhuri appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Machine Learning, Cryptography and Security, Artificial Intelligence, Databases, math.ST, Statistics Theory, Information Theory, math.IT, Data Structures and Algorithms, Computation and Language, Computer Vision, Genomics, math.OC, Methodology

Catalog footprint

What is connected

48 works

14 topics

4 close collaborators

preprint · 2026 · arXiv

Agent Security is a Systems Problem

We take the position that agent security must be approached as a systems problem: the AI model powering the agent must be treated as an untrusted component, and security invariants must be enforced at the system level. Through this lens, efforts to increase model robustness (the dominant viewpoint in the community) are insufficient on their own. Instead, we must complement existing efforts with techniques from the systems security domain. Based on our experience as cybersecurity researchers in operating systems, networks, formal methods, and adversarial machine learning, we articulate a set of core principles, grounded in decades of systems security research, that provide a foundation for designing agentic systems with predictable guarantees. As evidence, we analyze eleven representative real-world attacks on agents and discuss how systems principles, if realized, could have prevented these attacks. We also identify the research challenges that stand in the way of implementing these principles in agents.

preprint · 2026 · arXiv

Dataset Watermarking for Closed LLMs with Provable Detection

Large language models (LLMs) are pre-trained and post-trained on vast amounts of loosely curated data, raising the possibility that these models may have been trained on proprietary datasets or the same benchmarks used for evaluation. This motivates the need for dataset watermarking: designing datasets such that training on them leaves detectable signatures in the resulting model. Prior work has explored this problem for open models. We introduce the first dataset watermarking method for closed LLMs with provable detection. In particular, we embed a dataset-level watermark signal by increasing the co-occurrence frequency of randomly selected word pairs through rephrasing, and detect it using a statistical test on co-occurrence patterns in model-generated outputs. We evaluate our method with multiple base models and benchmark datasets and show that it reliably detects the watermark ($p <0.01$) in the fine-tuning stage. Notably, our method remains effective in a data mixture setting where the watermarked dataset constitutes only approximately $1\%$ of the total fine-tuning tokens. Furthermore, we show that our method preserves the utility and semantic integrity of the benchmark.
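The detection step described above can be illustrated with a minimal sketch. It is not the paper's test: it assumes a simple one-sided normal approximation to a binomial count, whitespace tokenization, independence across (text, pair) combinations, and a known baseline co-occurrence rate. The names `cooccurrence_hits`, `one_sided_p_value`, and `baseline_rate` are hypothetical.

```python
import math

def cooccurrence_hits(texts, pairs):
    """Count (text, pair) combinations in which both words of the pair appear."""
    hits = 0
    for text in texts:
        words = set(text.lower().split())
        hits += sum(1 for a, b in pairs if a in words and b in words)
    return hits

def one_sided_p_value(hits, trials, baseline_rate):
    """Normal-approximation p-value for H0: hits ~ Binomial(trials, baseline_rate).

    baseline_rate is an assumed co-occurrence rate for a model that never
    saw the watermarked dataset; a small p-value suggests the watermark.
    """
    mean = trials * baseline_rate
    std = math.sqrt(trials * baseline_rate * (1.0 - baseline_rate))
    z = (hits - mean) / std
    return 0.5 * math.erfc(z / math.sqrt(2.0))

if __name__ == "__main__":
    texts = ["the river glows amber tonight", "amber river", "nothing here"]
    pairs = [("river", "amber")]  # secret word pairs chosen by the dataset owner
    hits = cooccurrence_hits(texts, pairs)
    print(hits, one_sided_p_value(hits, len(texts) * len(pairs), 0.05))
```

With only three texts the normal approximation is crude; the sketch is meant to show the shape of the test, not to reproduce the paper's guarantees.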

preprint · 2024 · arXiv

Communication-Efficient Triangle Counting under Local Differential Privacy

Triangle counting in networks under LDP (Local Differential Privacy) is a fundamental task for analyzing connection patterns or calculating a clustering coefficient while strongly protecting sensitive friendships from a central server. In particular, a recent study proposes an algorithm for this task that uses two rounds of interaction between users and the server to significantly reduce estimation error. However, this algorithm suffers from a prohibitively high communication cost due to the large noisy graph each user needs to download. In this work, we propose triangle counting algorithms under LDP with a small estimation error and communication cost. We first propose two-round algorithms consisting of edge sampling and carefully selecting the edges each user downloads so that the estimation error is small. Then we propose a double clipping technique, which clips the number of edges and then the number of noisy triangles, to significantly reduce the sensitivity of each user's query. Through comprehensive evaluation, we show that our algorithms dramatically reduce the communication cost of the existing algorithm, e.g., from 6 hours to 8 seconds or less at a 20 Mbps download rate, while keeping a small estimation error.
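The double clipping idea can be sketched as follows. This is an illustrative simplification, not the paper's algorithm: the deterministic "keep the first `max_edges` neighbors" rule, the function names, and the use of plain Laplace noise are all assumptions. The only property relied on is that a value clipped to `[0, max_triangles]` can change by at most `max_triangles`, which bounds the noise scale.

```python
import random
from itertools import combinations

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def clip_and_count(neighbors, noisy_edges, max_edges, max_triangles):
    """Clip the neighbor list, count triangles among kept neighbors, clip the count."""
    kept = sorted(neighbors)[:max_edges]  # first clip: number of edges (hypothetical rule)
    triangles = sum(
        1 for u, v in combinations(kept, 2) if frozenset((u, v)) in noisy_edges
    )
    return min(triangles, max_triangles)  # second clip: number of triangles

def private_report(neighbors, noisy_edges, max_edges, max_triangles, epsilon, rng):
    """Add Laplace noise scaled to the clipped range of the triangle count."""
    clipped = clip_and_count(neighbors, noisy_edges, max_edges, max_triangles)
    return clipped + laplace_noise(max_triangles / epsilon, rng)

if __name__ == "__main__":
    edges = {frozenset(e) for e in [(1, 2), (2, 3), (1, 3), (3, 4)]}
    print(private_report({1, 2, 3, 4}, edges, 3, 2, 1.0, random.Random(0)))
```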

preprint · 2023 · arXiv

Consistent Non-Parametric Methods for Maximizing Robustness

Learning classifiers that are robust to adversarial examples has received a great deal of recent attention. A major drawback of the standard robust learning framework is that there is an artificial robustness radius $r$ that applies to all inputs. This ignores the fact that data may be highly heterogeneous, in which case it is plausible that robustness regions should be larger in some regions of data, and smaller in others. In this paper, we address this limitation by proposing a new limit classifier, called the neighborhood optimal classifier, that extends the Bayes optimal classifier outside its support by using the label of the closest in-support point. We then argue that this classifier maximizes the size of its robustness regions subject to the constraint of having accuracy equal to the Bayes optimal. We then present sufficient conditions under which general non-parametric methods that can be represented as weight functions converge towards this limit, and show that both nearest neighbors and kernel classifiers satisfy them under certain conditions.
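The rule "use the label of the closest in-support point" can be sketched on a finite stand-in for the support. The assumptions here are that the support is represented by a finite list of points with known Bayes-optimal labels and that distance is Euclidean; the function name is hypothetical.

```python
import math

def neighborhood_optimal_predict(query, support_points, support_labels):
    """Return the label of the in-support point closest to the query.

    support_points is a finite stand-in for the distribution's support, and
    support_labels holds the (assumed known) Bayes-optimal label of each point.
    """
    nearest = min(
        range(len(support_points)),
        key=lambda i: math.dist(query, support_points[i]),
    )
    return support_labels[nearest]

if __name__ == "__main__":
    points = [(0.0, 0.0), (4.0, 0.0)]
    labels = ["negative", "positive"]
    # A query outside the support inherits the label of its nearest in-support point.
    print(neighborhood_optimal_predict((1.0, 0.5), points, labels))
```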

preprint · 2023 · arXiv

Data Redaction from Pre-trained GANs

Large pre-trained generative models are known to occasionally output undesirable samples, which undermines their trustworthiness. The common way to mitigate this is to re-train them from scratch using different data or different regularization -- which uses a lot of computational resources and does not always fully address the problem. In this work, we take a different, more compute-friendly approach and investigate how to post-edit a model after training so that it "redacts", or refrains from outputting, certain kinds of samples. We show that redaction is a fundamentally different task from data deletion, and data deletion may not always lead to redaction. We then consider Generative Adversarial Networks (GANs), and provide three different algorithms for data redaction that differ in how the samples to be redacted are described. Extensive evaluations on real-world image datasets show that our algorithms outperform data deletion baselines, and are capable of redacting data while retaining high generation quality at a fraction of the cost of full re-training.

preprint · 2023 · arXiv

Sample Complexity of Adversarially Robust Linear Classification on Separated Data

We consider the sample complexity of learning with adversarial robustness. Most prior theoretical results for this problem have considered a setting where different classes in the data are close together or overlapping. Motivated by some real applications, we consider, in contrast, the well-separated case where there exists a classifier with perfect accuracy and robustness, and show that the sample complexity tells an entirely different story. Specifically, for linear classifiers, we show a large class of well-separated distributions where the expected robust loss of any algorithm is at least $\Omega(\frac{d}{n})$, whereas the max margin algorithm has expected standard loss $O(\frac{1}{n})$. This shows a gap in the standard and robust losses that cannot be obtained via prior techniques. Additionally, we present an algorithm that, given an instance where the robustness radius is much smaller than the gap between the classes, gives a solution whose expected robust loss is $O(\frac{1}{n})$. This shows that for very well-separated data, convergence rates of $O(\frac{1}{n})$ are achievable, which is not the case otherwise. Our results apply to robustness measured in any $\ell_p$ norm with $p > 1$ (including $p = \infty$).

preprint · 2022 · arXiv

Bounding Training Data Reconstruction in Private (Deep) Learning

Differential privacy is widely accepted as the de facto method for preventing data leakage in ML, and conventional wisdom suggests that it offers strong protection against privacy attacks. However, existing semantic guarantees for DP focus on membership inference, which may overestimate the adversary's capabilities and is not applicable when membership status itself is non-sensitive. In this paper, we derive the first semantic guarantees for DP mechanisms against training data reconstruction attacks under a formal threat model. We show that two distinct privacy accounting methods -- Rényi differential privacy and Fisher information leakage -- both offer strong semantic protection against data reconstruction attacks.

preprint · 2022 · arXiv

Differentially Private Triangle and 4-Cycle Counting in the Shuffle Model

Subgraph counting is fundamental for analyzing connection patterns or clustering tendencies in graph data. Recent studies have applied LDP (Local Differential Privacy) to subgraph counting to protect user privacy even against a data collector in social networks. However, existing local algorithms suffer from extremely large estimation errors or assume multi-round interaction between users and the data collector, which requires a lot of user effort and synchronization. In this paper, we focus on one round of interaction and propose accurate subgraph counting algorithms by introducing a recently studied shuffle model. We first propose a basic technique called wedge shuffling to send wedge information, the main component of several subgraphs, with small noise. Then we apply our wedge shuffling to counting triangles and 4-cycles -- basic subgraphs for analyzing clustering tendencies -- with several additional techniques. We also show upper bounds on the estimation error for each algorithm. We show through comprehensive experiments that our one-round shuffle algorithms significantly outperform the one-round local algorithms in terms of accuracy and achieve small estimation errors with a reasonable privacy budget, e.g., smaller than 1 in edge DP.

preprint · 2022 · arXiv

Privacy Amplification by Subsampling in Time Domain

Aggregate time-series data like traffic flow and site occupancy repeatedly sample statistics from a population across time. Such data can be profoundly useful for understanding trends within a given population, but also pose a significant privacy risk, potentially revealing e.g., who spends time where. Producing a private version of a time-series satisfying the standard definition of Differential Privacy (DP) is challenging due to the large influence a single participant can have on the sequence: if an individual can contribute to each time step, the amount of additive noise needed to satisfy privacy increases linearly with the number of time steps sampled. As such, if a signal spans a long duration or is oversampled, an excessive amount of noise must be added, drowning out underlying trends. However, in many applications an individual realistically cannot participate at every time step. When this is the case, we observe that the influence of a single participant (sensitivity) can be reduced by subsampling and/or filtering in time, while still meeting privacy requirements. Using a novel analysis, we show this significant reduction in sensitivity and propose a corresponding class of privacy mechanisms. We demonstrate the utility benefits of these techniques empirically with real-world and synthetic time-series data.
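The linear growth of noise with the number of time steps can be made concrete with a small calculation. The sketch below assumes the Laplace mechanism with L1 sensitivity and a per-step contribution of at most 1; it does not implement the paper's subsampling or filtering analysis, and the function names are hypothetical.

```python
def l1_sensitivity(max_steps_per_person, max_contribution_per_step=1.0):
    """Worst-case L1 change in the whole series when one person is added or removed."""
    return max_steps_per_person * max_contribution_per_step

def laplace_scale(sensitivity, epsilon):
    """Per-step Laplace noise scale needed for epsilon-DP at the given sensitivity."""
    return sensitivity / epsilon

if __name__ == "__main__":
    epsilon = 1.0
    # One person may appear at every one of 100 time steps.
    print(laplace_scale(l1_sensitivity(100), epsilon))
    # The same series when a person can realistically appear in at most 10 steps.
    print(laplace_scale(l1_sensitivity(10), epsilon))
```

Bounding participation from 100 steps to 10 cuts the required noise scale by the same factor, which is the kind of sensitivity reduction the abstract targets.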

preprint · 2022 · arXiv

Privacy-Aware Compression for Federated Data Analysis

Federated data analytics is a framework for distributed data analysis where a server compiles noisy responses from a group of distributed low-bandwidth user devices to estimate aggregate statistics. Two major challenges in this framework are privacy, since user data is often sensitive, and compression, since the user devices have low network bandwidth. Prior work has addressed these challenges separately by combining standard compression algorithms with known privacy mechanisms. In this work, we take a holistic look at the problem and design a family of privacy-aware compression mechanisms that work for any given communication budget. We first propose a mechanism for transmitting a single real number that has optimal variance under certain conditions. We then show how to extend it to metric differential privacy for location privacy use-cases, as well as vectors, for application to federated learning. Our experiments illustrate that our mechanism can lead to better utility vs. compression trade-offs for the same privacy loss in a number of settings.

preprint · 2022 · arXiv

Revisiting Model-Agnostic Private Learning: Faster Rates and Active Learning

The Private Aggregation of Teacher Ensembles (PATE) framework is one of the most promising recent approaches in differentially private learning. Existing theoretical analysis shows that PATE consistently learns any VC class in the realizable setting, but falls short in explaining its success in more general cases where the error rate of the optimal classifier is bounded away from zero. We fill in this gap by introducing the Tsybakov Noise Condition (TNC) and establishing stronger and more interpretable learning bounds. These bounds provide new insights into when PATE works and improve over existing results even in the narrower realizable setting. We also investigate the compelling idea of using active learning for saving privacy budget, and empirical studies show the effectiveness of this new idea. The novel components in the proofs include a more refined analysis of the majority voting classifier (which could be of independent interest) and an observation that the synthetic "student" learning problem is nearly realizable by construction under the Tsybakov noise condition.
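The majority voting step at the heart of PATE can be sketched as a noisy argmax over teacher votes. This is a generic illustration that assumes Laplace noise on the vote counts; the function names and the `noise_scale` parameter are hypothetical and the sketch does not reflect the refined analysis in the paper.

```python
import random
from collections import Counter

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_majority_vote(teacher_votes, num_classes, noise_scale, rng):
    """Return the class with the largest noise-perturbed vote count."""
    counts = Counter(teacher_votes)  # missing classes count as zero votes
    noisy = [counts[c] + laplace_noise(noise_scale, rng) for c in range(num_classes)]
    return max(range(num_classes), key=lambda c: noisy[c])

if __name__ == "__main__":
    votes = [0, 1, 1, 2, 1]  # one predicted label per teacher
    print(noisy_majority_vote(votes, 3, 1.0, random.Random(0)))
```

A larger `noise_scale` gives stronger privacy for each answered query but makes the returned label less reliable when the teachers' vote is close.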

preprint · 2022 · arXiv

Sentence-level Privacy for Document Embeddings

User language data can contain highly sensitive personal content. As such, it is imperative to offer users a strong and interpretable privacy guarantee when learning from their data. In this work, we propose SentDP: pure local differential privacy at the sentence level for a single user document. We propose a novel technique, DeepCandidate, that combines concepts from robust statistics and language modeling to produce high-dimensional, general-purpose $\epsilon$-SentDP document embeddings. This guarantees that any single sentence in a document can be substituted with any other sentence while keeping the embedding $\epsilon$-indistinguishable. Our experiments indicate that these private document embeddings are useful for downstream tasks like sentiment analysis and topic classification and even outperform baseline methods with weaker guarantees like word-level Metric DP.

preprint · 2022 · arXiv

Thompson Sampling for Robust Transfer in Multi-Task Bandits

We study the problem of online multi-task learning where the tasks are performed within similar but not necessarily identical multi-armed bandit environments. In particular, we study how a learner can improve its overall performance across multiple related tasks through robust transfer of knowledge. While an upper confidence bound (UCB)-based algorithm has recently been shown to achieve nearly-optimal performance guarantees in a setting where all tasks are solved concurrently, it remains unclear whether Thompson sampling (TS) algorithms, which have superior empirical performance in general, share similar theoretical properties. In this work, we present a TS-type algorithm for a more general online multi-task learning protocol, which extends the concurrent setting. We provide its frequentist analysis and prove that it is also nearly-optimal using a novel concentration inequality for multi-task data aggregation at random stopping times. Finally, we evaluate the algorithm on synthetic data and show that the TS-type algorithm enjoys superior empirical performance in comparison with the UCB-based algorithm and a baseline algorithm that performs TS for each individual task without transfer.
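For orientation, the per-task baseline mentioned above (Thompson sampling on a single task, with no transfer) can be sketched for Bernoulli rewards with Beta(1, 1) priors. The reward model, the prior, and the function name are assumptions made for illustration; the paper's multi-task algorithm is not reproduced here.

```python
import random

def thompson_sampling(arm_means, horizon, rng):
    """Run Beta-Bernoulli Thompson sampling and return the pull count of each arm."""
    num_arms = len(arm_means)
    successes = [0] * num_arms
    failures = [0] * num_arms
    pulls = [0] * num_arms
    for _ in range(horizon):
        # Draw one plausible mean per arm from its Beta posterior.
        samples = [
            rng.betavariate(successes[a] + 1, failures[a] + 1) for a in range(num_arms)
        ]
        arm = max(range(num_arms), key=lambda a: samples[a])
        reward = 1 if rng.random() < arm_means[arm] else 0  # simulated environment
        successes[arm] += reward
        failures[arm] += 1 - reward
        pulls[arm] += 1
    return pulls

if __name__ == "__main__":
    print(thompson_sampling([0.1, 0.9], 2000, random.Random(0)))
```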

preprint · 2022 · arXiv

Understanding Instance-based Interpretability of Variational Auto-Encoders

Instance-based interpretation methods have been widely studied for supervised learning methods as they help explain how black box neural networks predict. However, instance-based interpretations remain ill-understood in the context of unsupervised learning. In this paper, we investigate influence functions [Koh and Liang, 2017], a popular instance-based interpretation method, for a class of deep generative models called variational auto-encoders (VAE). We formally frame the counter-factual question answered by influence functions in this setting, and through theoretical analysis, examine what they reveal about the impact of training samples on classical unsupervised learning methods. We then introduce VAE-TracIn, a computationally efficient and theoretically sound solution based on Pruthi et al. [2020], for VAEs. Finally, we evaluate VAE-TracIn on several real-world datasets with extensive quantitative and qualitative analysis.

preprint · 2021 · arXiv

Approximate Data Deletion from Machine Learning Models

Deleting data from a trained machine learning (ML) model is a critical task in many applications. For example, we may want to remove the influence of training points that might be out of date or outliers. Regulations such as the EU's General Data Protection Regulation also stipulate that individuals can request to have their data deleted. The naive approach to data deletion is to retrain the ML model on the remaining data, but this is too time consuming. In this work, we propose a new approximate deletion method for linear and logistic models whose computational cost is linear in the feature dimension $d$ and independent of the number of training data $n$. This is a significant gain over all existing methods, which all have superlinear time dependence on the dimension. We also develop a new feature-injection test to evaluate the thoroughness of data deletion from ML models.

preprint · 2021 · arXiv

Connecting Interpretability and Robustness in Decision Trees through Separation

Recent research has recognized interpretability and robustness as essential properties of trustworthy classification. Curiously, a connection between robustness and interpretability was empirically observed, but the theoretical reasoning behind it remained elusive. In this paper, we rigorously investigate this connection. Specifically, we focus on interpretation using decision trees and robustness to $l_{\infty}$-perturbation. Previous works defined the notion of $r$-separation as a sufficient condition for robustness. We prove upper and lower bounds on the tree size in case the data is $r$-separated. We then show that a tighter bound on the size is possible when the data is linearly separated. We provide the first algorithm with provable guarantees on robustness, interpretability, and accuracy in the context of decision trees. Experiments confirm that our algorithm yields classifiers that are both interpretable and robust and have high accuracy. The code for the experiments is available at https://github.com/yangarbiter/interpretable-robust-trees
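Whether a dataset is $r$-separated can be checked directly. The sketch below assumes the convention that data is $r$-separated when every pair of differently labeled points is at $l_{\infty}$ distance at least $2r$; that convention and the function names are assumptions, not taken from this abstract.

```python
def linf_distance(x, y):
    """Largest coordinate-wise absolute difference between two points."""
    return max(abs(a - b) for a, b in zip(x, y))

def separation_radius(points, labels):
    """Half the smallest l_inf distance between any two differently labeled points."""
    gaps = [
        linf_distance(points[i], points[j])
        for i in range(len(points))
        for j in range(i + 1, len(points))
        if labels[i] != labels[j]
    ]
    return min(gaps) / 2.0

def is_r_separated(points, labels, r):
    """True when differently labeled points are at least 2r apart (assumed convention)."""
    return separation_radius(points, labels) >= r

if __name__ == "__main__":
    points = [(0.0, 0.0), (1.0, 0.0), (0.0, 3.0)]
    labels = [0, 0, 1]
    print(separation_radius(points, labels), is_r_separated(points, labels, 1.0))
```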

preprint · 2021 · arXiv

Locally Differentially Private Analysis of Graph Statistics

Differentially private analysis of graphs is widely used for releasing statistics from sensitive graphs while still preserving user privacy. Most existing algorithms, however, are in a centralized privacy model, where a trusted data curator holds the entire graph. As this model raises a number of privacy and security issues, such as the trustworthiness of the curator and the possibility of data breaches, it is desirable to consider algorithms in a more decentralized local model where no server holds the entire graph. In this work, we consider a local model, and present algorithms for counting subgraphs -- a fundamental task for analyzing the connection patterns in a graph -- with LDP (Local Differential Privacy). For triangle counts, we present algorithms that use one and two rounds of interaction, and show that an additional round can significantly improve the utility. For $k$-star counts, we present an algorithm that achieves an order optimal estimation error in the non-interactive local model. We provide new lower bounds on the estimation error for general graph statistics including triangle counts and $k$-star counts. Finally, we perform extensive experiments on two real datasets, and show that it is indeed possible to accurately estimate subgraph counts in the local differential privacy model.
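A common building block for local-model graph analysis is randomized response applied to each bit of a user's neighbor list. The sketch below shows that building block and an unbiased estimate of the number of ones; it is a generic illustration rather than the triangle or $k$-star algorithms of the paper, and the function names are hypothetical.

```python
import math
import random

def keep_probability(epsilon):
    """Probability of reporting a bit truthfully under epsilon-LDP randomized response."""
    return math.exp(epsilon) / (1.0 + math.exp(epsilon))

def randomized_response(bits, epsilon, rng):
    """Flip each adjacency bit independently with probability 1 - keep_probability."""
    keep = keep_probability(epsilon)
    return [b if rng.random() < keep else 1 - b for b in bits]

def debiased_count(noisy_bits, epsilon):
    """Unbiased estimate of the number of ones in the original bit vector."""
    keep = keep_probability(epsilon)
    flip = 1.0 - keep
    return (sum(noisy_bits) - len(noisy_bits) * flip) / (keep - flip)

if __name__ == "__main__":
    rng = random.Random(1)
    bits = [1] * 5000 + [0] * 15000  # 5000 true edges among 20000 candidate pairs
    noisy = randomized_response(bits, 1.0, rng)
    print(round(debiased_count(noisy, 1.0)))
```

The estimate is unbiased because the expected noisy sum equals `s * keep + (n - s) * flip` for `s` true ones among `n` bits, which the last function inverts.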

preprint · 2021 · arXiv

Location Trace Privacy Under Conditional Priors

Providing meaningful privacy to users of location based services is particularly challenging when multiple locations are revealed in a short period of time. This is primarily due to the tremendous degree of dependence that can be anticipated between points. We propose a Rényi divergence based privacy framework for bounding expected privacy loss for conditionally dependent data. Additionally, we demonstrate an algorithm for achieving this privacy under Gaussian process conditional priors. This framework both exemplifies why conditionally dependent data is so challenging to protect and offers a strategy for preserving privacy to within a fixed radius for sensitive locations in a user's trace.

preprint · 2020 · arXiv

A Closer Look at Accuracy vs. Robustness

Current methods for training robust networks lead to a drop in test accuracy, which has led prior works to posit that a robustness-accuracy tradeoff may be inevitable in deep learning. We take a closer look at this phenomenon and first show that real image datasets are actually separated. With this property in mind, we then prove that robustness and accuracy should both be achievable for benchmark datasets through locally Lipschitz functions, and hence, there should be no inherent tradeoff between robustness and accuracy. Through extensive experiments with robustness methods, we argue that the gap between theory and practice arises from two limitations of current methods: either they fail to impose local Lipschitzness or they are insufficiently generalized. We explore combining dropout with robust training methods and obtain better generalization. We conclude that achieving robustness and accuracy in practice may require using methods that impose local Lipschitzness and augmenting them with deep learning generalization techniques. Code available at https://github.com/yangarbiter/robust-local-lipschitz

preprint · 2020 · arXiv

A Non-Parametric Test to Detect Data-Copying in Generative Models

Detecting overfitting in generative models is an important challenge in machine learning. In this work, we formalize a form of overfitting that we call data-copying -- where the generative model memorizes and outputs training samples or small variations thereof. We provide a three-sample non-parametric test for detecting data-copying that uses the training set, a separate sample from the target distribution, and a generated sample from the model, and study the performance of our test on several canonical models and datasets. For code & examples, visit https://github.com/casey-meehan/data-copying
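The three-sample idea can be illustrated with a much simplified statistic: compare how close generated samples sit to the training set against how close held-out samples sit to it. The sketch below assumes Euclidean distance and a single global comparison; it omits the paper's actual test statistic and any significance calculation, and the function names are hypothetical.

```python
import math

def nearest_train_distance(x, train):
    """Distance from x to its nearest neighbor in the training set."""
    return min(math.dist(x, t) for t in train)

def copy_score(train, held_out, generated):
    """Fraction of (generated, held-out) pairs where the generated point is closer to train.

    Values near 0.5 are what one would expect without copying; values near 1.0
    indicate generated samples hugging the training set.
    """
    gen_d = [nearest_train_distance(g, train) for g in generated]
    test_d = [nearest_train_distance(h, train) for h in held_out]
    closer = sum(1 for g in gen_d for h in test_d if g < h)
    return closer / (len(gen_d) * len(test_d))

if __name__ == "__main__":
    train = [(0.0,), (1.0,), (2.0,)]
    held_out = [(0.5,), (1.5,)]
    generated = [(0.01,), (1.01,)]  # near-duplicates of training points
    print(copy_score(train, held_out, generated))
```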

preprint · 2020 · arXiv

An Investigation of Data Poisoning Defenses for Online Learning

Data poisoning attacks -- where an adversary can modify a small fraction of training data, with the goal of forcing the trained classifier to high loss -- are an important threat for machine learning in many applications. While a body of prior work has developed attacks and defenses, there is not much general understanding on when various attacks and defenses are effective. In this work, we undertake a rigorous study of defenses against data poisoning for online learning. First, we study four standard defenses in a powerful threat model, and provide conditions under which they can allow or resist rapid poisoning. We then consider a weaker and more realistic threat model, and show that the success of the adversary in the presence of data poisoning defenses there depends on the "ease" of the learning problem.

preprint · 2020 · arXiv

Robustness for Non-Parametric Classification: A Generic Attack and Defense

Adversarially robust machine learning has received much recent attention. However, prior attacks and defenses for non-parametric classifiers have been developed on an ad hoc or classifier-specific basis. In this work, we take a holistic look at adversarial examples for non-parametric classifiers, including nearest neighbors, decision trees, and random forests. We provide a general defense method, adversarial pruning, that works by preprocessing the dataset to become well-separated. To test our defense, we provide a novel attack that applies to a wide range of non-parametric classifiers. Theoretically, we derive an optimally robust classifier, which is analogous to the Bayes optimal classifier. We show that adversarial pruning can be viewed as a finite sample approximation to this optimal classifier. We empirically show that our defense and attack are either better than or competitive with prior work on non-parametric classifiers. Overall, our results provide a strong and broadly applicable baseline for future work on robust non-parametrics. Code available at https://github.com/yangarbiter/adversarial-nonparametrics/
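The preprocessing step "make the dataset well-separated" can be sketched with a greedy heuristic that repeatedly drops the point involved in the most conflicts. This is a stand-in, not the paper's procedure: the greedy rule, the Euclidean metric, the $2r$ conflict threshold, and the function names are all assumptions, and the result is not guaranteed to remove the fewest points.

```python
import math
from collections import Counter

def conflicting_pairs(points, labels, r, kept):
    """Pairs of kept, differently labeled points that are closer than 2r."""
    return [
        (i, j)
        for i in kept
        for j in kept
        if i < j and labels[i] != labels[j] and math.dist(points[i], points[j]) < 2 * r
    ]

def greedy_prune(points, labels, r):
    """Drop points greedily until no differently labeled pair is closer than 2r."""
    kept = list(range(len(points)))
    while True:
        pairs = conflicting_pairs(points, labels, r, kept)
        if not pairs:
            return kept
        degree = Counter(i for pair in pairs for i in pair)
        worst = max(kept, key=lambda i: degree[i])  # point in the most conflicts
        kept.remove(worst)

if __name__ == "__main__":
    points = [(0.0,), (0.1,), (1.0,), (2.0,)]
    labels = [0, 1, 0, 1]
    print(greedy_prune(points, labels, 0.1))
```

A nearest-neighbor or tree classifier would then be trained on the kept indices only.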

preprint · 2020 · arXiv

Successive Refinement of Privacy

This work examines a novel question: how much randomness is needed to achieve local differential privacy (LDP)? A motivating scenario is providing multiple levels of privacy to multiple analysts, either for distribution or for heavy-hitter estimation, using the same (randomized) output. We call this setting successive refinement of privacy, as it provides hierarchical access to the raw data with different privacy levels. For example, the same randomized output could enable one analyst to reconstruct the input, while another can only estimate the distribution subject to LDP requirements. This extends the classical Shannon (wiretap) security setting to local differential privacy. We provide (order-wise) tight characterizations of privacy-utility-randomness trade-offs in several cases for distribution estimation, including the standard LDP setting under a randomness constraint. We also provide a non-trivial privacy mechanism for multi-level privacy. Furthermore, we show that we cannot reuse random keys over time while preserving privacy of each user.

preprint · 2020 · arXiv

The Expressive Power of a Class of Normalizing Flow Models

Normalizing flows have received a great deal of recent attention as they allow flexible generative modeling as well as easy likelihood computation. While a wide variety of flow models have been proposed, there is little formal understanding of the representation power of these models. In this work, we study some basic normalizing flows and rigorously establish bounds on their expressive power. Our results indicate that while these flows are highly expressive in one dimension, in higher dimensions their representation power may be limited, especially when the flows have moderate depth.

preprint · 2020 · arXiv

When are Non-Parametric Methods Robust?

A growing body of research has shown that many classifiers are susceptible to adversarial examples -- small strategic modifications to test inputs that lead to misclassification. In this work, we study general non-parametric methods, with a view towards understanding when they are robust to these modifications. We establish general conditions under which non-parametric methods are r-consistent -- in the sense that they converge to optimally robust and accurate classifiers in the large sample limit. Concretely, our results show that when data is well-separated, nearest neighbors and kernel classifiers are r-consistent, while histograms are not. For general data distributions, we prove that preprocessing by Adversarial Pruning (Yang et al., 2019), which makes data well-separated, followed by nearest neighbors or kernel classifiers also leads to r-consistency.

preprint · 2016 · arXiv

Active Learning from Imperfect Labelers

We study active learning where the labeler can not only return incorrect labels but also abstain from labeling. We consider different noise and abstention conditions of the labeler. We propose an algorithm which utilizes abstention responses, and analyze its statistical consistency and query complexity under fairly natural assumptions on the noise and abstention rate of the labeler. This algorithm is adaptive in the sense that it can automatically request fewer queries with a more informed or less noisy labeler. We couple our algorithm with lower bounds to show that under some technical conditions, it achieves nearly optimal query complexity.

preprint · 2016 · arXiv

Convex Optimization For Non-Convex Problems via Column Generation

We apply column generation to approximating complex structured objects via a set of primitive structured objects under either the cross entropy or L2 loss. We use L1 regularization to encourage the use of few structured primitive objects. We approach the approximation problem using convex optimization over an infinite number of variables, each corresponding to a primitive structured object; these variables are generated on demand by easy inference in the Lagrangian dual. We apply our approach to producing low rank approximations to large 3-way tensors.

preprint · 2016 · arXiv

DP-EM: Differentially Private Expectation Maximization

The iterative nature of the expectation maximization (EM) algorithm presents a challenge for privacy-preserving estimation, as each iteration increases the amount of noise needed. We propose a practical private EM algorithm that overcomes this challenge using two innovations: (1) a novel moment perturbation formulation for differentially private EM (DP-EM), and (2) the use of two recently developed composition methods to bound the privacy "cost" of multiple EM iterations: the moments accountant (MA) and zero-mean concentrated differential privacy (zCDP). Both MA and zCDP bound the moment generating function of the privacy loss random variable and achieve a refined tail bound, which effectively decreases the amount of additive noise. We present empirical results showing the benefits of our approach, as well as similar performance between these two composition methods in the DP-EM setting for Gaussian mixture models. Our approach can be readily extended to many iterative learning algorithms, opening up various exciting future directions.
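How zCDP composition bounds the cost of many iterations can be shown with a short calculation. The sketch uses the standard zCDP facts (a Gaussian mechanism with L2 sensitivity $\Delta$ and standard deviation $\sigma$ satisfies $\rho$-zCDP with $\rho = \Delta^2 / (2\sigma^2)$, $\rho$ adds up under composition, and $\rho$-zCDP implies $(\rho + 2\sqrt{\rho \ln(1/\delta)}, \delta)$-DP). Treating each EM iteration as one Gaussian release is an assumption made for illustration, and the function names are hypothetical.

```python
import math

def zcdp_of_gaussian(sensitivity, sigma):
    """zCDP parameter rho of a Gaussian mechanism with the given L2 sensitivity."""
    return sensitivity ** 2 / (2.0 * sigma ** 2)

def compose(rhos):
    """zCDP parameters add up under sequential composition."""
    return sum(rhos)

def zcdp_to_dp(rho, delta):
    """Convert rho-zCDP to an (epsilon, delta)-DP guarantee and return epsilon."""
    return rho + 2.0 * math.sqrt(rho * math.log(1.0 / delta))

if __name__ == "__main__":
    iterations = 20
    per_iteration = zcdp_of_gaussian(sensitivity=1.0, sigma=10.0)
    total_rho = compose([per_iteration] * iterations)
    print(total_rho, zcdp_to_dp(total_rho, delta=1e-5))
```

Because the conversion grows with the square root of the total $\rho$, the resulting epsilon grows more slowly in the number of iterations than simply adding up per-iteration epsilons would.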

preprint · 2016 · arXiv

On the Theory and Practice of Privacy-Preserving Bayesian Data Analysis

Bayesian inference has great promise for the privacy-preserving analysis of sensitive data, as posterior sampling automatically preserves differential privacy, an algorithmic notion of data privacy, under certain conditions (Dimitrakakis et al., 2014; Wang et al., 2015). While this one posterior sample (OPS) approach elegantly provides privacy "for free," it is data inefficient in the sense of asymptotic relative efficiency (ARE). We show that a simple alternative based on the Laplace mechanism, the workhorse of differential privacy, is as asymptotically efficient as non-private posterior inference, under general assumptions. This technique also has practical advantages including efficient use of the privacy budget for MCMC. We demonstrate the practicality of our approach on a time-series analysis of sensitive military records from the Afghanistan and Iraq wars disclosed by the Wikileaks organization.
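The Laplace mechanism referred to above can be sketched on a single bounded statistic. The example privatizes a mean of values clipped to a known range, under the assumption that neighboring datasets differ by replacing one record; it is a generic illustration, not the paper's treatment of sufficient statistics or MCMC, and the function names are hypothetical.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_mean(values, lower, upper, epsilon, rng):
    """Release the mean of values clipped to [lower, upper] with epsilon-DP Laplace noise."""
    clipped = [min(max(v, lower), upper) for v in values]
    # Replacing one record moves the clipped mean by at most (upper - lower) / n.
    sensitivity = (upper - lower) / len(clipped)
    return sum(clipped) / len(clipped) + laplace_noise(sensitivity / epsilon, rng)

if __name__ == "__main__":
    data = [0.2, 0.4, 0.6, 0.9, 0.1]
    print(private_mean(data, 0.0, 1.0, epsilon=1.0, rng=random.Random(0)))
```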

preprint · 2016 · arXiv

The Extended Littlestone's Dimension for Learning with Mistakes and Abstentions

This paper studies classification with an abstention option in the online setting. In this setting, examples arrive sequentially, the learner is given a hypothesis class $\mathcal H$, and the goal of the learner is to either predict a label on each example or abstain, while ensuring that it does not make more than a pre-specified number of mistakes when it does predict a label. Previous work on this problem has left open two main challenges. First, not much is known about the optimality of algorithms, and in particular, about what an optimal algorithmic strategy is for any individual hypothesis class. Second, while the realizable case has been studied, the more realistic non-realizable scenario is not well-understood. In this paper, we address both challenges. First, we provide a novel measure, called the Extended Littlestone's Dimension, which captures the number of abstentions needed to ensure a certain number of mistakes. Second, we explore the non-realizable case, and provide upper and lower bounds on the number of abstentions required by an algorithm to guarantee a specified number of mistakes.

preprint2015arXiv

Active Learning from Weak and Strong Labelers

An active learner is given a hypothesis class, a large set of unlabeled examples, and the ability to interactively query an oracle for the labels of a subset of these examples; the goal of the learner is to learn a hypothesis in the class that fits the data well by making as few label queries as possible. This work addresses active learning with labels obtained from strong and weak labelers, where in addition to the standard active learning setting, we have an extra weak labeler which may occasionally provide incorrect labels. An example is learning to classify medical images where either expensive labels may be obtained from a physician (oracle or strong labeler), or cheaper but occasionally incorrect labels may be obtained from a medical resident (weak labeler). Our goal is to learn a classifier with low error on data labeled by the oracle, while using the weak labeler to reduce the number of label queries made to the oracle. We provide an active learning algorithm for this setting, establish its statistical consistency, and analyze its label complexity to characterize when it can provide label savings over using the strong labeler alone.

preprint2015arXiv

Convergence Rates of Active Learning for Maximum Likelihood Estimation

An active learner is given a class of models, a large set of unlabeled examples, and the ability to interactively query labels of a subset of these examples; the goal of the learner is to learn a model in the class that fits the data well. Previous theoretical work has rigorously characterized label complexity of active learning, but most of this work has focused on the PAC or the agnostic PAC model. In this paper, we shift our attention to a more general setting: maximum likelihood estimation. Provided certain conditions hold on the model class, we provide a two-stage active learning algorithm for this problem. The conditions we require are fairly general, and cover the widely popular class of Generalized Linear Models, which in turn includes models for binary and multi-class classification, regression, and conditional random fields. We provide an upper bound on the label requirement of our algorithm, and a lower bound that matches it up to lower order terms. Our analysis shows that unlike binary classification in the realizable case, just a single extra round of interaction is sufficient to achieve near-optimal performance in maximum likelihood estimation. On the empirical side, the recent work in \cite{Zhang12} and \cite{Zhang14} (on active linear and logistic regression) shows the promise of this approach.

preprint2015arXiv

Crowdsourcing Feature Discovery via Adaptively Chosen Comparisons

We introduce an unsupervised approach to efficiently discover the underlying features in a data set via crowdsourcing. Our queries ask crowd members to articulate a feature common to two out of three displayed examples. In addition we also ask the crowd to provide binary labels to the remaining examples based on the discovered features. The triples are chosen adaptively based on the labels of the previously discovered features on the data set. In two natural models of features, hierarchical and independent, we show that a simple adaptive algorithm, using "two-out-of-three" similarity queries, recovers all features with less labor than any nonadaptive algorithm. Experimental results validate the theoretical findings.

preprint2015arXiv

Spectral Learning of Large Structured HMMs for Comparative Epigenomics

We develop a latent variable model and an efficient spectral algorithm motivated by the recent emergence of very large data sets of chromatin marks from multiple human cell types. A natural model for chromatin data in one cell type is a Hidden Markov Model (HMM); we model the relationship between multiple cell types by connecting their hidden states by a fixed tree of known structure. The main challenge with learning parameters of such models is that iterative methods such as EM are very slow, while naive spectral methods result in time and space complexity exponential in the number of cell types. We exploit properties of the tree structure of the hidden states to provide spectral algorithms that are more computationally efficient for current biological datasets. We provide sample complexity bounds for our algorithm and evaluate it experimentally on biological data from nine human cell types. Finally, we show that beyond our specific model, some of our algorithmic ideas can be applied to other graphical models.

preprint2014arXiv

Beyond Disagreement-based Agnostic Active Learning

We study agnostic active learning, where the goal is to learn a classifier in a pre-specified hypothesis class interactively with as few label queries as possible, while making no assumptions on the true function generating the labels. The main algorithms for this problem are disagreement-based active learning, which has a high label requirement, and margin-based active learning, which only applies to fairly restricted settings. A major challenge is to find an algorithm which achieves better label complexity, is consistent in an agnostic setting, and applies to general classification problems. In this paper, we provide such an algorithm. Our solution is based on two novel contributions: a reduction from consistent active learning to confidence-rated prediction with guaranteed error, and a novel confidence-rated predictor.

preprint2014arXiv

Consistent procedures for cluster tree estimation and pruning

For a density $f$ on ${\mathbb R}^d$, a high-density cluster is any connected component of $\{x: f(x) \geq λ\}$, for some $λ > 0$. The set of all high-density clusters forms a hierarchy called the cluster tree of $f$. We present two procedures for estimating the cluster tree given samples from $f$. The first is a robust variant of the single linkage algorithm for hierarchical clustering. The second is based on the $k$-nearest neighbor graph of the samples. We give finite-sample convergence rates for these algorithms which also imply consistency, and we derive lower bounds on the sample complexity of cluster tree estimation. Finally, we study a tree pruning procedure that guarantees, under milder conditions than usual, to remove clusters that are spurious while recovering those that are salient.
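
To make the definition of a high-density cluster concrete, here is a minimal sketch on a one-dimensional grid where the density values are assumed to be known exactly. It only illustrates the level-set definition; it is not either of the estimation procedures described in the abstract, and the function name is hypothetical.

```python
def level_set_components(density, lam):
    """Return runs of adjacent grid indices i with density[i] >= lam.

    On a 1-D grid, two indices are connected when they are neighbors,
    so each maximal run is one connected component of the level set.
    """
    components, current = [], []
    for i, f in enumerate(density):
        if f >= lam:
            current.append(i)
        elif current:
            components.append(current)
            current = []
    if current:
        components.append(current)
    return components

density = [0.1, 0.5, 0.9, 0.4, 0.2, 0.6, 0.8, 0.1]
high = level_set_components(density, 0.5)  # two separate clusters
low = level_set_components(density, 0.2)   # they merge into one
```

Lowering the level from 0.5 to 0.2 merges the two components into one, which is the nesting that makes the set of all high-density clusters a tree.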

preprint2014arXiv

Learning from Data with Heterogeneous Noise using SGD

We consider learning from data of variable quality that may be obtained from different heterogeneous sources. Addressing learning from heterogeneous data in its full generality is a challenging problem. In this paper, we adopt instead a model in which data is observed through heterogeneous noise, where the noise level reflects the quality of the data source. We study how to use stochastic gradient algorithms to learn in this model. Our study is motivated by two concrete examples where this problem arises naturally: learning with local differential privacy based on data from multiple sources with different privacy requirements, and learning from data with labels of variable quality. The main contribution of this paper is to identify how heterogeneous noise impacts performance. We show that given two datasets with heterogeneous noise, the order in which to use them in standard SGD depends on the learning rate. We propose a method for changing the learning rate as a function of the heterogeneity, and prove new regret bounds for our method in two cases of interest. Experiments on real data show that our method performs better than using a single learning rate and using only the less noisy of the two datasets when the noise level is low to moderate.

preprint2014arXiv

Rates of Convergence for Nearest Neighbor Classification

Nearest neighbor methods are a popular class of nonparametric estimators with several desirable properties, such as adaptivity to different distance scales in different regions of space. Prior work on convergence rates for nearest neighbor classification has not fully reflected these subtle properties. We analyze the behavior of these estimators in metric spaces and provide finite-sample, distribution-dependent rates of convergence under minimal assumptions. As a by-product, we are able to establish the universal consistency of nearest neighbor in a broader range of data spaces than was previously known. We illustrate our upper and lower bounds by introducing smoothness classes that are customized for nearest neighbor classification.
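
For readers unfamiliar with the estimator being analyzed, the following is a minimal sketch of a k-nearest-neighbor classifier with a pluggable distance function, so that it applies to any metric space. The function name and the toy data are illustrative; the abstract does not prescribe an implementation.

```python
import math
from collections import Counter

def knn_predict(train, query, k, dist=math.dist):
    """Predict the majority label among the k training points nearest to `query`.

    `train` is a list of (point, label) pairs; `dist` is any metric on points
    (Euclidean distance by default).
    """
    neighbors = sorted(train, key=lambda pair: dist(pair[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [((0, 0), "a"), ((0, 1), "a"), ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b")]
near_origin = knn_predict(train, (0.2, 0.1), k=3)
near_corner = knn_predict(train, (5, 5.5), k=3)
```

Because the k neighbors are chosen by rank rather than by a fixed radius, the effective neighborhood shrinks where data are dense and grows where they are sparse, which is the adaptivity to local distance scales mentioned in the abstract.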

preprint2014arXiv

The Large Margin Mechanism for Differentially Private Maximization

A basic problem in the design of privacy-preserving algorithms is the private maximization problem: the goal is to pick an item from a universe that (approximately) maximizes a data-dependent function, all under the constraint of differential privacy. This problem has been used as a sub-routine in many privacy-preserving algorithms for statistics and machine learning. Previous algorithms for this problem are either range-dependent (i.e., their utility diminishes with the size of the universe) or only apply to very restricted function classes. This work provides the first general-purpose, range-independent algorithm for private maximization that guarantees approximate differential privacy. Its applicability is demonstrated on two fundamental tasks in data mining and machine learning.

preprint2013arXiv

Near-Optimal Algorithms for Differentially-Private Principal Components

Principal components analysis (PCA) is a standard tool for identifying good low-dimensional approximations to data in high dimension. Many data sets of interest contain private or sensitive information about individuals. Algorithms which operate on such data should be sensitive to the privacy risks in publishing their outputs. Differential privacy is a framework for developing tradeoffs between privacy and the utility of these outputs. In this paper we investigate the theory and empirical performance of differentially private approximations to PCA and propose a new method which explicitly optimizes the utility of the output. We show that the sample complexity of the proposed method differs from the existing procedure in the scaling with the data dimension, and that our method is nearly optimal in terms of this scaling. We furthermore illustrate our results, showing that on real data there is a large performance gap between the existing method and our method.

preprint2013arXiv

Noisy Bayesian Active Learning

We consider the problem of noisy Bayesian active learning, where we are given a finite set of functions $\mathcal{H}$, a sample space $\mathcal{X}$, and a label set $\mathcal{L}$. One of the functions in $\mathcal{H}$ assigns labels to samples in $\mathcal{X}$. The goal is to identify the function that generates the labels even though the result of a label query on a sample is corrupted by independent noise. More precisely, the objective is to declare one of the functions in $\mathcal{H}$ as the true label generating function with high confidence using as few label queries as possible, by selecting the queries adaptively and in a strategic manner. Previous work in Bayesian active learning considers Generalized Binary Search, and its variants for the noisy case, and analyzes the number of queries required by these sampling strategies. In this paper, we show that these schemes are, in general, suboptimal. Instead we propose and analyze an alternative strategy for sample collection. Our sampling strategy is motivated by a connection between Bayesian active learning and active hypothesis testing, and is based on querying the label of a sample which maximizes the Extrinsic Jensen-Shannon divergence at each step. We provide upper and lower bounds on the performance of this sampling strategy, and show that these bounds are better than previous bounds.

preprint2012arXiv

An Online Learning-based Framework for Tracking

We study the tracking problem, namely, estimating the hidden state of an object over time, from unreliable and noisy measurements. The standard framework for the tracking problem is the generative framework, which is the basis of solutions such as the Bayesian algorithm and its approximation, the particle filters. However, these solutions can be very sensitive to model mismatches. In this paper, motivated by online learning, we introduce a new framework for tracking. We provide an efficient tracking algorithm for this framework. We provide experimental results comparing our algorithm to the Bayesian algorithm on simulated data. Our experiments show that when there are slight model mismatches, our algorithm outperforms the Bayesian algorithm.

preprint2012arXiv

Convergence Rates for Differentially Private Statistical Estimation

Differential privacy is a cryptographically-motivated definition of privacy which has gained significant attention over the past few years. Differentially private solutions enforce privacy by adding random noise to a function computed over the data, and the challenge in designing such algorithms is to control the added noise in order to optimize the privacy-accuracy-sample size tradeoff. This work studies differentially-private statistical estimation, and shows upper and lower bounds on the convergence rates of differentially private approximations to statistical estimators. Our results reveal a formal connection between differential privacy and the notion of Gross Error Sensitivity (GES) in robust statistics, by showing that the convergence rate of any differentially private approximation to an estimator that is accurate over a large class of distributions has to grow with the GES of the estimator. We then provide an upper bound on the convergence rate of a differentially private approximation to an estimator with bounded range and bounded GES. We show that the bounded range condition is necessary if we wish to ensure a strict form of differential privacy.
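
For reference, the standard definition of pure differential privacy that abstracts like this one build on can be written as follows. The symbols $A$, $D$, $D'$, and $S$ are notation introduced here for illustration and do not come from the abstract.

```latex
A randomized algorithm $A$ is $\varepsilon$-differentially private if, for every
pair of datasets $D$ and $D'$ that differ in a single record and every
measurable set of outputs $S$,
\[
  \Pr[A(D) \in S] \;\le\; e^{\varepsilon} \, \Pr[A(D') \in S].
\]
```

Smaller values of $\varepsilon$ force the two output distributions closer together, which generally requires more noise and therefore slows the convergence of the private estimator.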

preprint2011arXiv

Differentially Private Empirical Risk Minimization

Privacy-preserving machine learning algorithms are crucial for the increasingly common setting in which personal data, such as medical or financial records, are analyzed. We provide general techniques to produce privacy-preserving approximations of classifiers learned via (regularized) empirical risk minimization (ERM). These algorithms are private under the $ε$-differential privacy definition due to Dwork et al. (2006). First we apply the output perturbation ideas of Dwork et al. (2006), to ERM classification. Then we propose a new method, objective perturbation, for privacy-preserving machine learning algorithm design. This method entails perturbing the objective function before optimizing over classifiers. If the loss and regularizer satisfy certain convexity and differentiability criteria, we prove theoretical results showing that our algorithms preserve privacy, and provide generalization bounds for linear and nonlinear kernels. We further present a privacy-preserving technique for tuning the parameters in general machine learning algorithms, thereby providing end-to-end privacy guarantees for the training process. We apply these results to produce privacy-preserving analogues of regularized logistic regression and support vector machines. We obtain encouraging results from evaluating their performance on real demographic and benchmark data sets. Our results show that both theoretically and empirically, objective perturbation is superior to the previous state-of-the-art, output perturbation, in managing the inherent tradeoff between privacy and learning performance.
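
Output perturbation, the baseline discussed in this abstract, can be sketched as adding calibrated noise to an already trained weight vector. The sketch below assumes the common calibration in which feature vectors have norm at most 1 and the loss is convex and 1-Lipschitz, so the L2 sensitivity of the regularized minimizer is 2/(n·lam); the function name and signature are hypothetical, and the training step that produces `weights` is left out.

```python
import math
import random

def output_perturbation(weights, n, lam, epsilon, rng=None):
    """Add noise with density proportional to exp(-beta * ||b||) to `weights`.

    Assumption: `weights` is the exact minimizer of a lam-regularized ERM
    objective over n records, with sensitivity 2 / (n * lam).
    """
    rng = rng or random.Random()
    d = len(weights)
    beta = n * lam * epsilon / 2.0
    # The norm of such a noise vector follows Gamma(shape=d, scale=1/beta).
    norm = rng.gammavariate(d, 1.0 / beta)
    # A normalized Gaussian vector gives a uniformly random direction.
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    g_norm = math.sqrt(sum(x * x for x in g))
    return [w + norm * x / g_norm for w, x in zip(weights, g)]
```

Objective perturbation, the method the abstract proposes, instead adds a random linear term to the objective before optimizing, so it cannot be expressed as a post-processing step like this one.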

preprint2011arXiv

Privacy constraints in regularized convex optimization

This paper is withdrawn due to some errors, which are corrected in arXiv:0912.0071v4 [cs.LG].

preprint2011arXiv

Spectral Methods for Learning Multivariate Latent Tree Structure

This work considers the problem of learning the structure of multivariate linear tree models, which include a variety of directed tree graphical models with continuous, discrete, and mixed latent variables such as linear-Gaussian models, hidden Markov models, Gaussian mixture models, and Markov evolutionary trees. The setting is one where we only have samples from certain observed variables in the tree, and our goal is to estimate the tree structure (i.e., the graph of how the underlying hidden variables are connected to each other and to the observed variables). We propose the Spectral Recursive Grouping algorithm, an efficient and simple bottom-up procedure for recovering the tree structure from independent samples of the observed variables. Our finite sample size bounds for exact recovery of the tree structure reveal certain natural dependencies on underlying statistical and structural properties of the underlying joint distribution. Furthermore, our sample complexity guarantees have no explicit dependence on the dimensionality of the observed variables, making the algorithm applicable to many high-dimensional settings. At the heart of our algorithm is a spectral quartet test for determining the relative topology of a quartet of variables from second-order statistics.

preprint2010arXiv

A parameter-free hedging algorithm

We study the problem of decision-theoretic online learning (DTOL). Motivated by practical applications, we focus on DTOL when the number of actions is very large. Previous algorithms for learning in this framework have a tunable learning rate parameter, and a barrier to using online learning in practical applications is that it is not understood how to set this parameter optimally, particularly when the number of actions is large. In this paper, we offer a clean solution by proposing a novel and completely parameter-free algorithm for DTOL. We introduce a new notion of regret, which is more natural for applications with a large number of actions. We show that our algorithm achieves good performance with respect to this new notion of regret; in addition, it also achieves performance close to that of the best bounds achieved by previous algorithms with optimally-tuned parameters, according to previous notions of regret.
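
The tunable learning rate that the abstract describes as a barrier is visible in the classical Hedge (exponential weights) algorithm. The sketch below is that classical baseline, not the parameter-free algorithm proposed in the paper; the function name and the `eta` argument are illustrative.

```python
import math

def hedge(loss_rounds, eta):
    """Run exponential-weights Hedge with learning rate `eta`.

    `loss_rounds` is a list of per-round loss vectors, one loss in [0, 1]
    per action. Returns the learner's cumulative expected loss and the
    final distribution over actions.
    """
    n_actions = len(loss_rounds[0])
    weights = [1.0] * n_actions
    total_loss = 0.0
    for losses in loss_rounds:
        z = sum(weights)
        probs = [w / z for w in weights]
        total_loss += sum(p * l for p, l in zip(probs, losses))
        # Multiplicatively shrink the weight of each action by its loss.
        weights = [w * math.exp(-eta * l) for w, l in zip(weights, losses)]
    z = sum(weights)
    return total_loss, [w / z for w in weights]

total, final_probs = hedge([[0.0, 1.0]] * 50, eta=1.0)
```

The quality of the result depends on `eta`, and the usual theoretical choices depend on the number of actions and rounds, which is the tuning problem the paper sets out to remove.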

preprint2010arXiv

Tracking using explanation-based modeling

We study the tracking problem, namely, estimating the hidden state of an object over time, from unreliable and noisy measurements. The standard framework for the tracking problem is the generative framework, which is the basis of solutions such as the Bayesian algorithm and its approximation, the particle filters. However, the problem with these solutions is that they are very sensitive to model mismatches. In this paper, motivated by online learning, we introduce a new, explanatory framework for tracking. We provide an efficient tracking algorithm for this framework. We provide experimental results comparing our algorithm to the Bayesian algorithm on simulated data. Our experiments show that when there are slight model mismatches, our algorithm vastly outperforms the Bayesian algorithm.

48 published item(s)

Agent Security is a Systems Problem

Dataset Watermarking for Closed LLMs with Provable Detection

Communication-Efficient Triangle Counting under Local Differential Privacy

Consistent Non-Parametric Methods for Maximizing Robustness

Data Redaction from Pre-trained GANs

Sample Complexity of Adversarially Robust Linear Classification on Separated Data

Bounding Training Data Reconstruction in Private (Deep) Learning

Differentially Private Triangle and 4-Cycle Counting in the Shuffle Model

Privacy Amplification by Subsampling in Time Domain

Privacy-Aware Compression for Federated Data Analysis

Revisiting Model-Agnostic Private Learning: Faster Rates and Active Learning

Sentence-level Privacy for Document Embeddings

Thompson Sampling for Robust Transfer in Multi-Task Bandits

Understanding Instance-based Interpretability of Variational Auto-Encoders

Approximate Data Deletion from Machine Learning Models

Connecting Interpretability and Robustness in Decision Trees through Separation

Locally Differentially Private Analysis of Graph Statistics

Location Trace Privacy Under Conditional Priors

A Closer Look at Accuracy vs. Robustness

A Non-Parametric Test to Detect Data-Copying in Generative Models

An Investigation of Data Poisoning Defenses for Online Learning

Robustness for Non-Parametric Classification: A Generic Attack and Defense

Successive Refinement of Privacy

The Expressive Power of a Class of Normalizing Flow Models

When are Non-Parametric Methods Robust?

Active Learning from Imperfect Labelers

Convex Optimization For Non-Convex Problems via Column Generation

DP-EM: Differentially Private Expectation Maximization

On the Theory and Practice of Privacy-Preserving Bayesian Data Analysis

The Extended Littlestone's Dimension for Learning with Mistakes and Abstentions

Active Learning from Weak and Strong Labelers

Convergence Rates of Active Learning for Maximum Likelihood Estimation

Crowdsourcing Feature Discovery via Adaptively Chosen Comparisons

Spectral Learning of Large Structured HMMs for Comparative Epigenomics

Beyond Disagreement-based Agnostic Active Learning

Consistent procedures for cluster tree estimation and pruning

Learning from Data with Heterogeneous Noise using SGD

Rates of Convergence for Nearest Neighbor Classification

The Large Margin Mechanism for Differentially Private Maximization

Near-Optimal Algorithms for Differentially-Private Principal Components

Noisy Bayesian Active Learning

An Online Learning-based Framework for Tracking

Convergence Rates for Differentially Private Statistical Estimation

Differentially Private Empirical Risk Minimization

Privacy constraints in regularized convex optimization

Spectral Methods for Learning Multivariate Latent Tree Structure

A parameter-free hedging algorithm

Tracking using explanation-based modeling