Source author record

Graham Cormode

Graham Cormode appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Data Structures and Algorithms Cryptography and Security Databases Machine Learning Computational Complexity Digital Libraries Social and Information Networks physics.soc-ph Artificial Intelligence Computer Science and Game Theory Computer Vision Distributed, Parallel, and Cluster Computing Information Retrieval

Catalog footprint

What is connected

32works

13topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2023arXiv

Pruning Compact ConvNets for Efficient Inference

Neural network pruning is frequently used to compress over-parameterized networks by large amounts, while incurring only marginal drops in generalization performance. However, the impact of pruning on networks that have been highly optimized for efficient inference has not received the same level of attention. In this paper, we analyze the effect of pruning for computer vision, and study state-of-the-art ConvNets, such as the FBNetV3 family of models. We show that model pruning approaches can be used to further optimize networks trained through NAS (Neural Architecture Search). The resulting family of pruned models can consistently obtain better performance than existing FBNetV3 models at the same level of computation, and thus provide state-of-the-art results when trading off between computational complexity and generalization performance on the ImageNet benchmark. In addition to better generalization performance, we also demonstrate that when limited computation resources are available, pruning FBNetV3 models incur only a fraction of GPU-hours involved in running a full-scale NAS.

preprint2022arXiv

Aggregation and Transformation of Vector-Valued Messages in the Shuffle Model of Differential Privacy

Advances in communications, storage and computational technology allow significant quantities of data to be collected and processed by distributed devices. Combining the information from these endpoints can realize significant societal benefit but presents challenges in protecting the privacy of individuals, especially important in an increasingly regulated world. Differential privacy (DP) is a technique that provides a rigorous and provable privacy guarantee for aggregation and release. The Shuffle Model for DP has been introduced to overcome challenges regarding the accuracy of local-DP algorithms and the privacy risks of central-DP. In this work we introduce a new protocol for vector aggregation in the context of the Shuffle Model. The aim of this paper is twofold; first, we provide a single message protocol for the summation of real vectors in the Shuffle Model, using advanced composition results. Secondly, we provide an improvement on the bound on the error achieved through using this protocol through the implementation of a Discrete Fourier Transform, thereby minimizing the initial error at the expense of the loss in accuracy through the transformation itself. This work will further the exploration of more sophisticated structures such as matrices and higher-dimensional tensors in this context, both of which are reliant on the functionality of the vector case.

preprint2022arXiv

Applying the Shuffle Model of Differential Privacy to Vector Aggregation

In this work we introduce a new protocol for vector aggregation in the context of the Shuffle Model, a recent model within Differential Privacy (DP). It sits between the Centralized Model, which prioritizes the level of accuracy over the secrecy of the data, and the Local Model, for which an improvement in trust is counteracted by a much higher noise requirement. The Shuffle Model was developed to provide a good balance between these two models through the addition of a shuffling step, which unbinds the users from their data whilst maintaining a moderate noise requirement. We provide a single message protocol for the summation of real vectors in the Shuffle Model, using advanced composition results. Our contribution provides a mechanism to enable private aggregation and analysis across more sophisticated structures such as matrices and higher-dimensional tensors, both of which are reliant on the functionality of the vector case.

preprint2022arXiv

On the Importance of Difficulty Calibration in Membership Inference Attacks

The vulnerability of machine learning models to membership inference attacks has received much attention in recent years. However, existing attacks mostly remain impractical due to having high false positive rates, where non-member samples are often erroneously predicted as members. This type of error makes the predicted membership signal unreliable, especially since most samples are non-members in real world applications. In this work, we argue that membership inference attacks can benefit drastically from \emph{difficulty calibration}, where an attack's predicted membership score is adjusted to the difficulty of correctly classifying the target sample. We show that difficulty calibration can significantly reduce the false positive rate of a variety of existing attacks without a loss in accuracy.

preprint2022arXiv

Opacus: User-Friendly Differential Privacy Library in PyTorch

We introduce Opacus, a free, open-source PyTorch library for training deep learning models with differential privacy (hosted at opacus.ai). Opacus is designed for simplicity, flexibility, and speed. It provides a simple and user-friendly API, and enables machine learning practitioners to make a training pipeline private by adding as little as two lines to their code. It supports a wide variety of layers, including multi-head attention, convolution, LSTM, GRU (and generic RNN), and embedding, right out of the box and provides the means for supporting other user-defined layers. Opacus computes batched per-sample gradients, providing higher efficiency compared to the traditional "micro batch" approach. In this paper we present Opacus, detail the principles that drove its implementation and unique features, and benchmark it against other frameworks for training models with differential privacy as well as standard PyTorch.

preprint2022arXiv

Optimal Membership Inference Bounds for Adaptive Composition of Sampled Gaussian Mechanisms

Given a trained model and a data sample, membership-inference (MI) attacks predict whether the sample was in the model's training set. A common countermeasure against MI attacks is to utilize differential privacy (DP) during model training to mask the presence of individual examples. While this use of DP is a principled approach to limit the efficacy of MI attacks, there is a gap between the bounds provided by DP and the empirical performance of MI attacks. In this paper, we derive bounds for the \textit{advantage} of an adversary mounting a MI attack, and demonstrate tightness for the widely-used Gaussian mechanism. We further show bounds on the \textit{confidence} of MI attacks. Our bounds are much stronger than those obtained by DP analysis. For example, analyzing a setting of DP-SGD with $ε=4$ would obtain an upper bound on the advantage of $\approx0.36$ based on our analyses, while getting bound of $\approx 0.97$ using the analysis of previous work that convert $ε$ to membership inference bounds. Finally, using our analysis, we provide MI metrics for models trained on CIFAR10 dataset. To the best of our knowledge, our analysis provides the state-of-the-art membership inference bounds for the privacy.

preprint2022arXiv

Reconciling Security and Communication Efficiency in Federated Learning

Cross-device Federated Learning is an increasingly popular machine learning setting to train a model by leveraging a large population of client devices with high privacy and security guarantees. However, communication efficiency remains a major bottleneck when scaling federated learning to production environments, particularly due to bandwidth constraints during uplink communication. In this paper, we formalize and address the problem of compressing client-to-server model updates under the Secure Aggregation primitive, a core component of Federated Learning pipelines that allows the server to aggregate the client updates without accessing them individually. In particular, we adapt standard scalar quantization and pruning methods to Secure Aggregation and propose Secure Indexing, a variant of Secure Aggregation that supports quantization for extreme compression. We establish state-of-the-art results on LEAF benchmarks in a secure Federated Learning setup with up to 40$\times$ compression in uplink communication with no meaningful loss in utility compared to uncompressed baselines.

preprint2022arXiv

Sample and Threshold Differential Privacy: Histograms and applications

Federated analytics seeks to compute accurate statistics from data distributed across users' devices while providing a suitable privacy guarantee and being practically feasible to implement and scale. In this paper, we show how a strong $(ε, δ)$-Differential Privacy (DP) guarantee can be achieved for the fundamental problem of histogram generation in a federated setting, via a highly practical sampling-based procedure that does not add noise to disclosed data. Given the ubiquity of sampling in practice, we thus obtain a DP guarantee almost for free, avoid over-estimating histogram counts, and allow easy reasoning about how privacy guarantees may obscure minorities and outliers. Using such histograms, related problems such as heavy hitters and quantiles can be answered with provable error and privacy guarantees. Experimental results show that our sample-and-threshold approach is accurate and scalable.

preprint2022arXiv

Weighted Random Sampling over Joins

Joining records with all other records that meet a linkage condition can result in an astronomically large number of combinations due to many-to-many relationships. For such challenging (acyclic) joins, a random sample over the join result is a practical alternative to working with the oversized join result. Whereas prior works are limited to uniform join sampling where each join row is assigned the same probability, the scope is extended in this work to weighted sampling to support emerging applications such as scientific discovery in observational data and privacy-preserving query answering. Notwithstanding some naive methods, this work presents the first approach for weighted random sampling from join results. Due to a lack of baselines, experiments over various join types and real-world data sets are conducted to show substantial memory savings and competitive performance with main-memory index-based approaches in the equal-probability setting. In contrast to existing uniform sampling approaches that require prepared structures that occupy contested resources to squeeze out slightly faster query-times, the proposed approaches exhibit qualities that are urgently needed in practice, namely reduced memory footprint, streaming operation, support for selections, outer joins, semi joins and anti joins and unequal-probability sampling. All pertinent code and data can be found at: https://github.com/shekelyan/weightedjoinsampling

preprint2021arXiv

Advances and Open Problems in Federated Learning

Federated learning (FL) is a machine learning setting where many clients (e.g. mobile devices or whole organizations) collaboratively train a model under the orchestration of a central server (e.g. service provider), while keeping the training data decentralized. FL embodies the principles of focused data collection and minimization, and can mitigate many of the systemic privacy risks and costs resulting from traditional, centralized machine learning and data science approaches. Motivated by the explosive growth in FL research, this paper discusses recent advances and presents an extensive collection of open problems and challenges.

preprint2021arXiv

Subspace exploration: Bounds on Projected Frequency Estimation

Given an $n \times d$ dimensional dataset $A$, a projection query specifies a subset $C \subseteq [d]$ of columns which yields a new $n \times |C|$ array. We study the space complexity of computing data analysis functions over such subspaces, including heavy hitters and norms, when the subspaces are revealed only after observing the data. We show that this important class of problems is typically hard: for many problems, we show $2^{Ω(d)}$ lower bounds. However, we present upper bounds which demonstrate space dependency better than $2^d$. That is, for $c,c' \in (0,1)$ and a parameter $N=2^d$ an $N^c$-approximation can be obtained in space $\min(N^{c'},n)$, showing that it is possible to improve on the naïve approach of keeping information for all $2^d$ subsets of $d$ columns. Our results are based on careful constructions of instances using coding theory and novel combinatorial reductions that exhibit such space-approximation tradeoffs.

preprint2016arXiv

A Second Look at Counting Triangles in Graph Streams (revised)

In this paper we present improved results on the problem of counting triangles in edge streamed graphs. For graphs with $m$ edges and at least $T$ triangles, we show that an extra look over the stream yields a two-pass treaming algorithm that uses $O(\frac{m}{\eps^{2.5}\sqrt{T}}\polylog(m))$ space and outputs a $(1+\eps)$ approximation of the number of triangles in the graph. This improves upon the two-pass streaming tester of Braverman, Ostrovsky and Vilenchik, ICALP 2013, which distinguishes between triangle-free graphs and graphs with at least $T$ triangle using $O(\frac{m}{T^{1/3}})$ space. Also, in terms of dependence on $T$, we show that more passes would not lead to a better space bound. In other words, we prove there is no constant pass streaming algorithm that distinguishes between triangle-free graphs from graphs with at least $T$ triangles using $O(\frac{m}{T^{1/2+ρ}})$ space for any constant $ρ\ge 0$.

preprint2016arXiv

The Sparse Awakens: Streaming Algorithms for Matching Size Estimation in Sparse Graphs

Estimating the size of the maximum matching is a canonical problem in graph algorithms, and one that has attracted extensive study over a range of different computational models. We present improved streaming algorithms for approximating the size of maximum matching with sparse (bounded arboricity) graphs. * Insert-Only Streams: We present a one-pass algorithm that takes O(c log^2 n) space and approximates the size of the maximum matching in graphs with arboricity c within a factor of O(c). This improves significantly on the state-of-the-art O~(cn^{2/3})-space streaming algorithms. * Dynamic Streams: Given a dynamic graph stream (i.e., inserts and deletes) of edges of an underlying c-bounded arboricity graph, we present a one-pass algorithm that uses space O~(c^{10/3}n^{2/3}) and returns an O(c)-estimator for the size of the maximum matching. This algorithm improves the state-of-the-art O~(cn^{4/5})-space algorithms, where the O~(.) notation hides logarithmic in $n$ dependencies. In contrast to the previous works, our results take more advantage of the streaming access to the input and characterize the matching size based on the ordering of the edges in the stream in addition to the degree distributions and structural properties of the sparse graphs.

preprint2015arXiv

Kernelization via Sampling with Applications to Dynamic Graph Streams

In this paper we present a simple but powerful subgraph sampling primitive that is applicable in a variety of computational models including dynamic graph streams (where the input graph is defined by a sequence of edge/hyperedge insertions and deletions) and distributed systems such as MapReduce. In the case of dynamic graph streams, we use this primitive to prove the following results: -- Matching: First, there exists an $\tilde{O}(k^2)$ space algorithm that returns an exact maximum matching on the assumption the cardinality is at most $k$. The best previous algorithm used $\tilde{O}(kn)$ space where $n$ is the number of vertices in the graph and we prove our result is optimal up to logarithmic factors. Our algorithm has $\tilde{O}(1)$ update time. Second, there exists an $\tilde{O}(n^2/α^3)$ space algorithm that returns an $α$-approximation for matchings of arbitrary size. (Assadi et al. (2015) showed that this was optimal and independently and concurrently established the same upper bound.) We generalize both results for weighted matching. Third, there exists an $\tilde{O}(n^{4/5})$ space algorithm that returns a constant approximation in graphs with bounded arboricity. -- Vertex Cover and Hitting Set: There exists an $\tilde{O}(k^d)$ space algorithm that solves the minimum hitting set problem where $d$ is the cardinality of the input sets and $k$ is an upper bound on the size of the minimum hitting set. We prove this is optimal up to logarithmic factors. Our algorithm has $\tilde{O}(1)$ update time. The case $d=2$ corresponds to minimum vertex cover. Finally, we consider a larger family of parameterized problems (including $b$-matching, disjoint paths, vertex coloring among others) for which our subgraph sampling primitive yields fast, small-space dynamic graph stream algorithms. We then show lower bounds for natural problems outside this family.

preprint2014arXiv

Modeling Collaboration in Academia: A Game Theoretic Approach

In this work, we aim to understand the mechanisms driving academic collaboration. We begin by building a model for how researchers split their effort between multiple papers, and how collaboration affects the number of citations a paper receives, supported by observations from a large real-world publication and citation dataset, which we call the h-Reinvestment model. Using tools from the field of Game Theory, we study researchers' collaborative behavior over time under this model, with the premise that each researcher wants to maximize his or her academic success. We find analytically that there is a strong incentive to collaborate rather than work in isolation, and that studying collaborative behavior through a game-theoretic lens is a promising approach to help us better understand the nature and dynamics of academic collaboration.

preprint2014arXiv

Parameterized Streaming Algorithms for Vertex Cover

As graphs continue to grow in size, we seek ways to effectively process such data at scale. The model of streaming graph processing, in which a compact summary is maintained as each edge insertion/deletion is observed, is an attractive one. However, few results are known for optimization problems over such dynamic graph streams. In this paper, we introduce a new approach to handling graph streams, by instead seeking solutions for the parameterized versions of these problems where we are given a parameter $k$ and the objective is to decide whether there is a solution bounded by $k$. By combining kernelization techniques with randomized sketch structures, we obtain the first streaming algorithms for the parameterized versions of the Vertex Cover problem. We consider the following three models for a graph stream on $n$ nodes: 1. The insertion-only model where the edges can only be added. 2. The dynamic model where edges can be both inserted and deleted. 3. The \emph{promised} dynamic model where we are guaranteed that at each timestamp there is a solution of size at most $k$. In each of these three models we are able to design parameterized streaming algorithms for the Vertex Cover problem. We are also able to show matching lower bound for the space complexity of our algorithms. (Due to the arXiv limit of 1920 characters for abstract field, please see the abstract in the paper for detailed description of our results)

preprint2014arXiv

People Like Us: Mining Scholarly Data for Comparable Researchers

We present the problem of finding comparable researchers for any given researcher. This problem has many motivations. Firstly, know thyself. The answers of where we stand among research community and who we are most alike may not be easily found by existing evaluations of ones' research mainly based on citation counts. Secondly, there are many situations where one needs to find comparable researchers e.g., for reviewing peers, constructing programming committees or compiling teams for grants. It is often done through an ad hoc and informal basis. Utilizing the large scale scholarly data accessible on the web, we address the problem of automatically finding comparable researchers. We propose a standard to quantify the quality of research output, via the quality of publishing venues. We represent a researcher as a sequence of her publication records, and develop a framework of comparison of researchers by sequence matching. Several variations of comparisons are considered including matching by quality of publication venue and research topics, and performing prefix matching. We evaluate our methods on a large corpus and demonstrate the effectiveness of our methods through examples. In the end, we identify several promising directions for further work.

preprint2013arXiv

Annotations for Sparse Data Streams

Motivated by cloud computing, a number of recent works have studied annotated data streams and variants thereof. In this setting, a computationally weak verifier (cloud user), lacking the resources to store and manipulate his massive input locally, accesses a powerful but untrusted prover (cloud service). The verifier must work within the restrictive data streaming paradigm. The prover, who can annotate the data stream as it is read, must not just supply the answer but also convince the verifier of its correctness. Ideally, both the amount of annotation and the space used by the verifier should be sublinear in the relevant input size parameters. A rich theory of such algorithms -- which we call schemes -- has emerged. Prior work has shown how to leverage the prover's power to efficiently solve problems that have no non-trivial standard data stream algorithms. However, while optimal schemes are now known for several basic problems, such optimality holds only for streams whose length is commensurate with the size of the data universe. In contrast, many real-world datasets are relatively sparse, including graphs that contain only O(n^2) edges, and IP traffic streams that contain much fewer than the total number of possible IP addresses, 2^128 in IPv6. We design the first schemes that allow both the annotation and the space usage to be sublinear in the total number of stream updates rather than the size of the data universe. We solve significant problems, including variations of INDEX, SET-DISJOINTNESS, and FREQUENCY-MOMENTS, plus several natural problems on graphs. On the other hand, we give a new lower bound that, for the first time, rules out smooth tradeoffs between annotation and space usage for a specific problem. Our technique brings out new nuances in Merlin-Arthur communication complexity models, and provides a separation between online versions of the MA and AMA models.

preprint2013arXiv

First Author Advantage: Citation Labeling in Research

Citations among research papers, and the networks they form, are the primary object of study in scientometrics. The act of making a citation reflects the citer's knowledge of the related literature, and of the work being cited. We aim to gain insight into this process by studying citation keys: user-chosen labels to identify a cited work. Our main observation is that the first listed author is disproportionately represented in such labels, implying a strong mental bias towards the first author.

preprint2013arXiv

On the Tradeoff between Stability and Fit

In computing, as in many aspects of life, changes incur cost. Many optimization problems are formulated as a one-time instance starting from scratch. However, a common case that arises is when we already have a set of prior assignments, and must decide how to respond to a new set of constraints, given that each change from the current assignment comes at a price. That is, we would like to maximize the fitness or efficiency of our system, but we need to balance it with the changeout cost from the previous state. We provide a precise formulation for this tradeoff and analyze the resulting {\em stable extensions} of some fundamental problems in measurement and analytics. Our main technical contribution is a stable extension of PPS (probability proportional to size) weighted random sampling, with applications to monitoring and anomaly detection problems. We also provide a general framework that applies to top-$k$, minimum spanning tree, and assignment. In both cases, we are able to provide exact solutions, and discuss efficient incremental algorithms that can find new solutions as the input changes.

preprint2013arXiv

Scienceography: the study of how science is written

Scientific literature has itself been the subject of much scientific study, for a variety of reasons: understanding how results are communicated, how ideas spread, and assessing the influence of areas or individuals. However, most prior work has focused on extracting and analyzing citation and stylistic patterns. In this work, we introduce the notion of 'scienceography', which focuses on the writing of science. We provide a first large scale study using data derived from the arXiv e-print repository. Crucially, our data includes the "source code" of scientific papers-the LaTEX source-which enables us to study features not present in the "final product", such as the tools used and private comments between authors. Our study identifies broad patterns and trends in two example areas-computer science and mathematics-as well as highlighting key differences in the way that science is written in these fields. Finally, we outline future directions to extend the new topic of scienceography.

preprint2013arXiv

Socializing the h-index

A variety of bibliometric measures have been proposed to quantify the impact of researchers and their work. The h-index is a notable and widely-used example which aims to improve over simple metrics such as raw counts of papers or citations. However, a limitation of this measure is that it considers authors in isolation and does not account for contributions through a collaborative team. To address this, we propose a natural variant that we dub the Social h-index. The idea is to redistribute the h-index score to reflect an individual's impact on the research community. In addition to describing this new measure, we provide examples, discuss its properties, and contrast with other measures.

preprint2012arXiv

Accurate and Efficient Private Release of Datacubes and Contingency Tables

A central problem in releasing aggregate information about sensitive data is to do so accurately while providing a privacy guarantee on the output. Recent work focuses on the class of linear queries, which include basic counting queries, data cubes, and contingency tables. The goal is to maximize the utility of their output, while giving a rigorous privacy guarantee. Most results follow a common template: pick a "strategy" set of linear queries to apply to the data, then use the noisy answers to these queries to reconstruct the queries of interest. This entails either picking a strategy set that is hoped to be good for the queries, or performing a costly search over the space of all possible strategies. In this paper, we propose a new approach that balances accuracy and efficiency: we show how to improve the accuracy of a given query set by answering some strategy queries more accurately than others. This leads to an efficient optimal noise allocation for many popular strategies, including wavelets, hierarchies, Fourier coefficients and more. For the important case of marginal queries we show that this strictly improves on previous methods, both analytically and empirically. Our results also extend to ensuring that the returned query answers are consistent with an (unknown) data set at minimal extra cost in terms of time and noise.

preprint2012arXiv

Differentially Private Spatial Decompositions

Differential privacy has recently emerged as the de facto standard for private data release. This makes it possible to provide strong theoretical guarantees on the privacy and utility of released data. While it is well-known how to release data based on counts and simple functions under this guarantee, it remains to provide general purpose techniques to release different kinds of data. In this paper, we focus on spatial data such as locations and more generally any data that can be indexed by a tree structure. Directly applying existing differential privacy methods to this type of data simply generates noise. Instead, we introduce a new class of "private spatial decompositions": these adapt standard spatial indexing methods such as quadtrees and kd-trees to provide a private description of the data distribution. Equipping such structures with differential privacy requires several steps to ensure that they provide meaningful privacy guarantees. Various primitives, such as choosing splitting points and describing the distribution of points within a region, must be done privately, and the guarantees of the different building blocks composed to provide an overall guarantee. Consequently, we expose the design space for private spatial decompositions, and analyze some key examples. Our experimental study demonstrates that it is possible to build such decompositions efficiently, and use them to answer a variety of queries privately with high accuracy.

preprint2012arXiv

Practical Verified Computation with Streaming Interactive Proofs

When delegating computation to a service provider, as in cloud computing, we seek some reassurance that the output is correct and complete. Yet recomputing the output as a check is inefficient and expensive, and it may not even be feasible to store all the data locally. We are therefore interested in proof systems which allow a service provider to prove the correctness of its output to a streaming (sublinear space) user, who cannot store the full input or perform the full computation herself. Our approach is two-fold. First, we describe a carefully chosen instantiation of one of the most efficient general-purpose constructions for arbitrary computations (streaming or otherwise), due to Goldwasser, Kalai, and Rothblum. This requires several new insights to make the methodology more practical. Our main contribution is in achieving a prover who runs in time O(S(n) log S(n)), where S(n) is the size of an arithmetic circuit computing the function of interest. Our experimental results demonstrate that a practical general-purpose protocol for verifiable computation may be significantly closer to reality than previously realized. Second, we describe techniques that achieve genuine scalability for protocols fine-tuned for specific important problems in streaming and database processing. Focusing in particular on non-interactive protocols for problems ranging from matrix-vector multiplication to bipartite perfect matching, we build on prior work to achieve a prover who runs in nearly linear-time, while obtaining optimal tradeoffs between communication cost and the user's working memory. Existing techniques required (substantially) superlinear time for the prover. We argue that even if general-purpose methods improve, fine-tuned protocols will remain valuable in real-world settings for key problems, and hence special attention to specific problems is warranted.

preprint2011arXiv

Differentially Private Publication of Sparse Data

The problem of privately releasing data is to provide a version of a dataset without revealing sensitive information about the individuals who contribute to the data. The model of differential privacy allows such private release while providing strong guarantees on the output. A basic mechanism achieves differential privacy by adding noise to the frequency counts in the contingency tables (or, a subset of the count data cube) derived from the dataset. However, when the dataset is sparse in its underlying space, as is the case for most multi-attribute relations, then the effect of adding noise is to vastly increase the size of the published data: it implicitly creates a huge number of dummy data points to mask the true data, making it almost impossible to work with. We present techniques to overcome this roadblock and allow efficient private release of sparse data, while maintaining the guarantees of differential privacy. Our approach is to release a compact summary of the noisy data. Generating the noisy data and then summarizing it would still be very costly, so we show how to shortcut this step, and instead directly generate the summary from the input data, without materializing the vast intermediate noisy data. We instantiate this outline for a variety of sampling and filtering methods, and show how to use the resulting summary for approximate, private, query answering. Our experimental study shows that this is an effective, practical solution, with comparable and occasionally improved utility over the costly materialization approach.

preprint2011arXiv

Node Classification in Social Networks

When dealing with large graphs, such as those that arise in the context of online social networks, a subset of nodes may be labeled. These labels can indicate demographic values, interest, beliefs or other characteristics of the nodes (users). A core problem is to use this information to extend the labeling so that all nodes are assigned a label (or labels). In this chapter, we survey classification techniques that have been proposed for this problem. We consider two broad categories: methods based on iterative application of traditional classifiers using graph information as features, and methods which propagate the existing labels via random walks. We adopt a common perspective on these methods to highlight the similarities between different approaches within and across the two categories. We also describe some extensions and related directions to the central problem of node classification.

preprint2011arXiv

Structure-Aware Sampling: Flexible and Accurate Summarization

In processing large quantities of data, a fundamental problem is to obtain a summary which supports approximate query answering. Random sampling yields flexible summaries which naturally support subset-sum queries with unbiased estimators and well-understood confidence bounds. Classic sample-based summaries, however, are designed for arbitrary subset queries and are oblivious to the structure in the set of keys. The particular structure, such as hierarchy, order, or product space (multi-dimensional), makes range queries much more relevant for most analysis of the data. Dedicated summarization algorithms for range-sum queries have also been extensively studied. They can outperform existing sampling schemes in terms of accuracy on range queries per summary size. Their accuracy, however, rapidly degrades when, as is often the case, the query spans multiple ranges. They are also less flexible - being targeted for range sum queries alone - and are often quite costly to build and use. In this paper we propose and evaluate variance optimal sampling schemes that are structure-aware. These summaries improve over the accuracy of existing structure-oblivious sampling schemes on range queries while retaining the benefits of sample-based summaries: flexible summaries, with high accuracy on both range queries and arbitrary subset queries.

preprint2011arXiv

Verifying Computations with Streaming Interactive Proofs

When computation is outsourced, the data owner would like to be assured that the desired computation has been performed correctly by the service provider. In theory, proof systems can give the necessary assurance, but prior work is not sufficiently scalable or practical. In this paper, we develop new proof protocols for verifying computations which are streaming in nature: the verifier (data owner) needs only logarithmic space and a single pass over the input, and after observing the input follows a simple protocol with a prover (service provider) that takes logarithmic communication spread over a logarithmic number of rounds. These ensure that the computation is performed correctly: that the service provider has not made any errors or missed out some data. The guarantee is very strong: even if the service provider deliberately tries to cheat, there is only vanishingly small probability of doing so undetected, while a correct computation is always accepted. We first observe that some theoretical results can be modified to work with streaming verifiers, showing that there are efficient protocols for problems in the complexity classes NP and NC. Our main results then seek to bridge the gap between theory and practice by developing usable protocols for a variety of problems of central importance in streaming and database processing. All these problems require linear space in the traditional streaming model, and therefore our protocols demonstrate that adding a prover can exponentially reduce the effort needed by the verifier. Our experimental results show that our protocols are practical and scalable.

preprint2010arXiv

Individual Privacy vs Population Privacy: Learning to Attack Anonymization

Over the last decade there have been great strides made in developing techniques to compute functions privately. In particular, Differential Privacy gives strong promises about conclusions that can be drawn about an individual. In contrast, various syntactic methods for providing privacy (criteria such as kanonymity and l-diversity) have been criticized for still allowing private information of an individual to be inferred. In this report, we consider the ability of an attacker to use data meeting privacy definitions to build an accurate classifier. We demonstrate that even under Differential Privacy, such classifiers can be used to accurately infer "private" attributes in realistic data. We compare this to similar approaches for inferencebased attacks on other forms of anonymized data. We place these attacks on the same scale, and observe that the accuracy of inference of private attributes for Differentially Private data and l-diverse data can be quite similar.

preprint2010arXiv

Information Cost Tradeoffs for Augmented Index and Streaming Language Recognition

This paper makes three main contributions to the theory of communication complexity and stream computation. First, we present new bounds on the information complexity of AUGMENTED-INDEX. In contrast to analogous results for INDEX by Jain, Radhakrishnan and Sen [J. ACM, 2009], we have to overcome the significant technical challenge that protocols for AUGMENTED-INDEX may violate the "rectangle property" due to the inherent input sharing. Second, we use these bounds to resolve an open problem of Magniez, Mathieu and Nayak [STOC, 2010] that asked about the multi-pass complexity of recognizing Dyck languages. This results in a natural separation between the standard multi-pass model and the multi-pass model that permits reverse passes. Third, we present the first passive memory checkers that verify the interaction transcripts of priority queues, stacks, and double-ended queues. We obtain tight upper and lower bounds for these problems, thereby addressing an important sub-class of the memory checking framework of Blum et al. [Algorithmica, 1994].

preprint2010arXiv

Streaming Graph Computations with a Helpful Advisor

Motivated by the trend to outsource work to commercial cloud computing services, we consider a variation of the streaming paradigm where a streaming algorithm can be assisted by a powerful helper that can provide annotations to the data stream. We extend previous work on such {\em annotation models} by considering a number of graph streaming problems. Without annotations, streaming algorithms for graph problems generally require significant memory; we show that for many standard problems, including all graph problems that can be expressed with totally unimodular integer programming formulations, only a constant number of hash values are needed for single-pass algorithms given linear-sized annotations. We also obtain a protocol achieving \textit{optimal} tradeoffs between annotation length and memory usage for matrix-vector multiplication; this result contributes to a trend of recent research on numerical linear algebra in streaming models.

Graham Cormode

What is connected

Connect this record

See the researcher in context

Building this map preview

32 published item(s)

Pruning Compact ConvNets for Efficient Inference

Aggregation and Transformation of Vector-Valued Messages in the Shuffle Model of Differential Privacy

Applying the Shuffle Model of Differential Privacy to Vector Aggregation

On the Importance of Difficulty Calibration in Membership Inference Attacks

Opacus: User-Friendly Differential Privacy Library in PyTorch

Optimal Membership Inference Bounds for Adaptive Composition of Sampled Gaussian Mechanisms

Reconciling Security and Communication Efficiency in Federated Learning

Sample and Threshold Differential Privacy: Histograms and applications

Weighted Random Sampling over Joins

Advances and Open Problems in Federated Learning

Subspace exploration: Bounds on Projected Frequency Estimation

A Second Look at Counting Triangles in Graph Streams (revised)

The Sparse Awakens: Streaming Algorithms for Matching Size Estimation in Sparse Graphs

Kernelization via Sampling with Applications to Dynamic Graph Streams

Modeling Collaboration in Academia: A Game Theoretic Approach

Parameterized Streaming Algorithms for Vertex Cover

People Like Us: Mining Scholarly Data for Comparable Researchers

Annotations for Sparse Data Streams

First Author Advantage: Citation Labeling in Research

On the Tradeoff between Stability and Fit

Scienceography: the study of how science is written

Socializing the h-index

Accurate and Efficient Private Release of Datacubes and Contingency Tables

Differentially Private Spatial Decompositions

Practical Verified Computation with Streaming Interactive Proofs

Differentially Private Publication of Sparse Data

Node Classification in Social Networks

Structure-Aware Sampling: Flexible and Accurate Summarization

Verifying Computations with Streaming Interactive Proofs

Individual Privacy vs Population Privacy: Learning to Attack Anonymization

Information Cost Tradeoffs for Augmented Index and Streaming Language Recognition

Streaming Graph Computations with a Helpful Advisor